You are viewing the RapidMiner Studio documentation for version 9.4.
Important Terms
The following list covers the first terms you need to know when using RapidMiner Studio. The terms are followed by descriptions of the RapidMiner data types and the operator ports.
Attribute
The information elements describing a scenario. Attributes are the table columns of a data set.
The example set included in this Getting Started guide has the attributes gender, age, payment method, last interaction, and churn.
Classification
The process of predicting which category (or class) an example belongs to, based on existing data for which category membership is known. Each possible value of a label defines a category. (Similarly, regression is the process of predicting numerical results.) That is, with classification you construct a model that, when trained, uses the learned rules to predict the category of new data.
Each example in the data set falls into the category of either churning or not churning. For examples whose label is missing, the category is predicted from the rules learned during training.
Data set
The training set is the data used to discover predictive relationships and train models. The test set is the data used to test the accuracy and meaningfulness of a model's representation of the predictive relationship (typically discovered using the training set). The new data set is the data with missing labels; the rules derived from the training set are applied to predict the outcome for the new data set.
In this tutorial, you train and test your model using the customer-churn-data data set. Originally an Excel file, customer-churn-data became an available data set when you imported it into RapidMiner.
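The split between training, test, and new data can be sketched in a few lines of Python. This is only an illustration of the idea, not how RapidMiner implements it: the rows, the attribute names, and the two-thirds split ratio are hypothetical.

```python
# Hypothetical rows; a churn value of None marks a missing label.
rows = [
    {"gender": "male", "age": 61, "churn": "yes"},
    {"gender": "female", "age": 34, "churn": "no"},
    {"gender": "male", "age": 45, "churn": "no"},
    {"gender": "female", "age": 58, "churn": None},
]

# Rows with a known label can be used for training and testing.
labeled = [r for r in rows if r["churn"] is not None]
# Rows with a missing label form the new data set to be predicted.
new_data = [r for r in rows if r["churn"] is None]

# Hold back the last third of the labeled rows as a test set (assumed ratio).
cut = len(labeled) * 2 // 3
training_set, test_set = labeled[:cut], labeled[cut:]

print(len(training_set), len(test_set), len(new_data))
```

Keeping the test rows out of training is what makes the later accuracy figure meaningful.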
Example
Characterized by its attributes, an example has concrete values that can be compared with other examples. Examples are the table rows of a data set.
The example set customer-churn-data includes 993 examples (also known as rows). They are identified by a row number that RapidMiner prepends.
Example Set
The table created from the attributes (columns) and examples (rows). Also known as data or data set.
The example set used here is customer-churn-data, which originated from the file customer-churn-data.xlsx.
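As a rough mental model, an example set behaves like a table whose columns are attributes and whose rows are examples. The Python sketch below uses the attribute names from this guide; the two rows and the `column` helper are made up for illustration and are not part of RapidMiner.

```python
# Attribute names come from the text; the values are made up for illustration.
attributes = ["gender", "age", "payment method", "last interaction", "churn"]
examples = [
    ["male", 61, "credit card", "2019-03-01", "yes"],
    ["female", 34, "cash", "2019-05-17", "no"],
]

# An example set is the table formed by attributes (columns) and examples (rows).
example_set = [dict(zip(attributes, row)) for row in examples]

def column(name):
    """Return every value of one attribute, i.e. one table column."""
    return [example[name] for example in example_set]

print(len(example_set), "examples,", len(attributes), "attributes")
print(column("age"))
```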
Label
The identifying attribute in relation to the current question. The goal is to know or learn this attribute's (the label's) value, or learn rules for deriving it from the regular attributes, for each row in the example set. Sometimes referred to as the target attribute or variable, it is the thing to predict for new examples that are not yet characterized. There can be only one label per data set.
Churn is the attribute of interest in this tutorial's data set. Setting the role of the Churn attribute to label allows you to predict, for each example, whether the customer will cancel.
Model
The data mining method or prediction instruction. A model discovers rules and/or predicts unknown situations for current and future examples.
In this tutorial you created a model that predicts whether a customer will cancel. Your evaluation (validation) of the model returns accuracy percentages.
Operator
The building blocks, grouped by function, used to create RapidMiner processes. An operator has input and output ports; the action performed on the input ultimately leads to what is supplied to the output. Operator parameters control those actions. There are more than 1500 operators available in RapidMiner. Operators, in the Operators panel of the Design view, are both browsable and searchable.
In this tutorial you connect the Retrieve operator (which "retrieves" the data set) to the Filter Examples operator. The resulting labeled data set is connected to the Decision Tree operator to determine the set of rules RapidMiner will use to generate its predictions.
Panel
Each view has its own set of panels, or tools, related to the view. They can be moved, sized, and hidden to suit. You can access additional panels from the View > Show Panel pull-down menu.
See the graphic with callouts to identify panels. The following lists the default panels for each view:
- Design: Operators, Repository, Process, Parameters, Help
- Results: Repository, Result History
- Hadoop Data (if the extension is installed): Hadoop Data, Hadoop Metadata, Hadoop Data Log
Parameter
The setting(s) whose value(s) determine the characteristics or behavior of an operator. RapidMiner presents parameters in the Parameters panel of the Design view. There are regular parameters and expert parameters. The expert parameters are indicated by italic names and are displayed or hidden by clicking the Show/Hide advanced parameters link at the bottom of the panel.
As part of the Wisdom of Crowds capabilities, RapidMiner Studio provides parameter recommendations based on the knowledge and best practices of other RapidMiner users. The recommender helps configure operators by providing recommendations on which parameters to change and by suggesting appropriate parameter values.
This tutorial uses the filtering parameters of the Filter Examples operator to create a training data set.
Port
The point through which data moves, represented by a labeled semicircle icon on the sides of operators and of the Design view. See the list of port abbreviations below.
To see your filtered example set, connect the Output (out) port of the Retrieve operator to the ExampleSet (exa) port of Filter Examples. Then, connect the ExampleSet (exa) port on Filter Examples to the Results (res) port at the right of the Process view and click Run.
Prediction
The most probable value for a target attribute; predictions are derived from data mining. If you have rules and data, you can predict an outcome.
The process in this tutorial may predict, for example: If the customer is male, over 54 years of age, and paid by credit card, then the probability of this customer canceling is high.
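The example rule above can be written out as a small Python function. This is a hand-written stand-in for one learned rule, not output from RapidMiner; the function name `churn_risk`, the dictionary keys, and the "unknown" fallback are assumptions made for the sketch.

```python
def churn_risk(example):
    """Hand-written stand-in for one learned rule; returns 'high' or 'unknown'."""
    if (
        example["gender"] == "male"
        and example["age"] > 54
        and example["payment method"] == "credit card"
    ):
        return "high"
    # The other rules of a trained model are not covered by this sketch.
    return "unknown"

print(churn_risk({"gender": "male", "age": 60, "payment method": "credit card"}))
```

A trained model would hold many such rules and would usually report a confidence for each prediction rather than a fixed string.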
Process
A set of interconnected operators represented by a flow design, where each operator manipulates your data. A process might, for example, load a data set, transform the data, compute a model, and apply the model to another data set.
This tutorial creates a process that retrieves a data set from the repository, filters the data to create a training set, applies a decision tree operator to derive rules for predictions, applies the model to unlabeled data, and runs validation to evaluate the model.
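Conceptually, a process is a chain in which each operator takes the output of the previous one. The following Python sketch mimics that flow with plain functions. All function names and rows are hypothetical, the learner is a simple majority-class placeholder rather than a decision tree, and the validation step is left out.

```python
from collections import Counter

def retrieve():
    # Stand-in for loading a data set from the repository (hypothetical rows).
    return [
        {"age": 61, "churn": "yes"},
        {"age": 58, "churn": "yes"},
        {"age": 34, "churn": "no"},
        {"age": 47, "churn": None},  # label missing
    ]

def filter_examples(rows, keep_labeled):
    # Keep rows with a label (training set) or without one (new data).
    return [r for r in rows if (r["churn"] is not None) == keep_labeled]

def train(rows):
    # Placeholder learner: remembers the most frequent label.
    # A real process would use a decision tree here.
    majority = Counter(r["churn"] for r in rows).most_common(1)[0][0]
    return lambda row: majority

def apply_model(model, rows):
    # Add a prediction to every row without changing the input rows.
    return [{**r, "prediction": model(r)} for r in rows]

data = retrieve()
model = train(filter_examples(data, keep_labeled=True))
scored = apply_model(model, filter_examples(data, keep_labeled=False))
print(scored)
```

Each function plays the part of one operator, and passing a return value to the next call corresponds to drawing a connection between two ports.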
Process view
The working area for building processes. This is the canvas in the Design view where you drag operators or where, when you double-click a process, the operators of that process appear.
When building your process, you first dragged your data set, customer-churn-data, onto the Process panel. Next you added a Filter Examples operator and connected them.
Repository
The storage mechanism for data and RapidMiner processes. Best practice recommends you use the repository for data storage instead of reading directly from a file or database. If you use a Read operator, metadata will not be available to RapidMiner, limiting the available functions.
By default, RapidMiner Studio comes configured with a variety of sample data sets and processes in the Samples directory of your repository. When this tutorial is complete, your Local Repository will include a new data set and new processes. From the Repository panel you can also access the Cloud Repository.
Role
The identifying tag for or function of an attribute. Roles tell RapidMiner of special meaning or treatment for an attribute. RapidMiner has several pre-defined roles and supports the ability to create your own roles. The label role is of utmost importance in defining the target for a prediction. Any attribute without a role assigned is known as a regular attribute.
Apply the label role to the churn attribute. If the data set included row numbers, assign that attribute the id role. All other attributes are not assigned a role and are therefore regular attributes.
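Role assignment amounts to a mapping from attribute names to roles, with "regular" as the default. The sketch below illustrates this in Python; the attribute name "row number" and the helper `role_of` are hypothetical, and the code does not reflect RapidMiner's internal representation.

```python
# Role assignment for the tutorial's attributes; attributes that are not
# listed here have no special role.
roles = {"churn": "label", "row number": "id"}

attributes = ["row number", "gender", "age", "payment method", "churn"]

def role_of(attribute):
    """Return the assigned role, or 'regular' when no role is set."""
    return roles.get(attribute, "regular")

regular = [a for a in attributes if role_of(a) == "regular"]
label = [a for a in attributes if role_of(a) == "label"]

# The glossary states that a data set has only one label.
assert len(label) == 1
print(regular)
```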
Training
The process of finding predictive relationships. The outcome of this learning process is the model.
Assigning the label role to the Churn attribute created a decision tree that considers the age, gender, payment method, and last purchase to create rules for the new data.
View
A "work area" in which you access a specific functionality. There are two pre-defined views. Some extensions can add their own views (for example, the Radoop Extension). You can also create your own view by clicking New in the View menu.
See the graphic with callouts to locate each view:
- Design: Canvas and tools for building and managing processes.
- Results: Visualization, in many varied formats, of design process results.
- Hadoop Data: Access to Radoop-related work.
RapidMiner data types
The following terms describe the data types RapidMiner assigns to attributes. Defining a data type specifies the kind of values allowed for an attribute. RapidMiner supports the natural division of numbers, texts, and dates. Numeric is the label for numbers, nominal for texts or strings, and date_time for dates.
Data Type | Description |
attribute_value | Parent of all possible types ("any type") |
binominal | Exactly two values (for example true/false or yes/no) |
date | Date without time (for example 23.12.2014) |
date_time | Both date and time (for example 23.12.2014 17:59) |
file_path | Nominal data type (rarely used) that allows for more granular distinction; can be used to mark a column as "only containing file paths" |
integer | A whole number (for example 23, -5, or 11,024,768) |
nominal | All kinds of text values; includes polynominal and binominal |
numeric | All kinds of number values; includes date, time, integer, and real numbers |
polynominal | Many different string values (for example red, green, blue, yellow) |
real | A fractional number (for example 11.23 or -0.0001) |
text | Nominal data type that allows for more granular distinction (to differentiate from polynominal) |
time | Time without date (for example 17:59) |
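To make some of these distinctions concrete, the following Python sketch picks a type name for a column of values using only the descriptions above (whole number, fractional number, exactly two values, many string values). It is an illustration under stated assumptions, not RapidMiner's type-guessing logic: the function `guess_type` is hypothetical, the returned names follow common RapidMiner spelling, and date and time types are not handled.

```python
def guess_type(values):
    """Pick a type name for a non-empty column of Python values."""
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "integer"  # whole numbers only
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "real"  # numbers, at least one of them a float
    if all(isinstance(v, str) for v in values):
        # Exactly two distinct values -> binominal, otherwise polynominal.
        return "binominal" if len(set(values)) == 2 else "polynominal"
    return "attribute_value"  # fallback: the parent of all types

print(guess_type([23, -5]))
print(guess_type([11.23, -0.0001]))
print(guess_type(["yes", "no", "yes"]))
print(guess_type(["red", "green", "blue", "yellow"]))
```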
Operator port information
The following table lists each port abbreviation and provides a brief description.
Port Abbreviation | Meaning | Description |
ano | Anova | ANOVA matrix of the ANOVA significance test |
ann | Annotation | Annotations extracted from the input object |
arc | Archive | Archive file generated during execution of the operator |
ass | Association | Association rules that have been discovered in a frequent item set |
att | Attribute | Attribute weights (in and out) |
ave | Average | Performance measures; estimate of performance using the model built on the complete delivered data set |
clu | Cluster model | Cluster model created when clustering an example set |
clu | Clustered set | Example set given to the clustering operator; may contain an attribute with a cluster role (describes the cluster of each example) |
col | Collection | Collection of objects |
con | Condition | Any object can be supplied; the condition specified in parameters is tested on this object |
cov | Covariance | Covariance matrix |
dic | Dictionary | Example set used for replacing 'from' values with 'to' values in a given example set |
dis | Distance measure | Similarity measure object |
doc | Document | Document or document set |
err | Error | Standard error output |
est | Estimated performance | Performance vector of the SVM model which gives an estimation of statistical performance of this model |
exa | Example set | Example set |
fil | File | File object |
fla | Flat | Flat collection or flat clustering model |
for | Formula | Formula result |
fre | Frequent | Frequent item or item sets for association rule learning |
gro | Grouped | Grouped models, attributes, items |
hie | Hierarchical | Hierarchical clustering model |
inp | Input | Input source, can take various objects |
ite | Item sets | Frequent item sets (groups of items that often appear together in the data) |
joi | Join | Join of the left and right example sets |
lab | Labeled data | Model that was given in input is applied on the example set and the updated example set is delivered from this port |
lef | Left | Left input port expecting an example set, which is used as the left example set for a join |
lif | Lift chart | Lift Pareto chart for the given model and example set |
mat | Matrix | Correlations matrix of all attributes of the input example set |
mer | Merged | Merged example set |
mod | Model | Default model from this output port |
obj | Object | IO object |
ori | Original | Input example set is passed without changing to this port |
out | Output | Output port |
par | Parameter set | Set of parameters that can be applied on an operator |
pat | Patterns | GSP algorithm is applied on the given example set; resultant sequential patterns set is delivered through this port |
per | Performance | Performance Vector for selected attributes |
pre | Preprocessing | Preprocessing model with information regarding the operator's parameters in the current process |
ran | Random forest | Model of a random forest |
ref | Reference | Provided reference data or reference set |
req | Request set | Provided example set |
res | Result set | Distance or similarity between examples of the request set and reference set |
rig | Right | Right input port expecting an example set, which is used as the right example set for a join |
roc | ROC curve | Calculated ROC curves for included models |
rul | Rules | Association rules that have been discovered in a frequent item set |
sec | Second | Input takes an example set derived from the output of the Generate ID operator in an attached example process |
seg | Segment | Segment of an image |
sel | Selected | Object specified by the index parameter is returned through this port |
ses | Session | Session example set |
sig | Significance | Significance test results of the performance vector comparison are delivered through this port |
sim | Similarity | Calculated similarity between each example of the given example set and every other example of the same set |
sin | Single | Single object of the given collection, which is processed in the inner part of the operator |
sta | Stacking | Stacking examples or model |
sto | Stored | Through this port, the input object is passed without changing to the output |
sub | Subtrahend | Expects an example set; example set must have ID attribute |
sup | Superset | Superset of input example sets |
thr | Through | Objects are passed through without changing |
thr | Threshold | Threshold output of the Select Recall operator |
tra | Training | Training data to train a model (example set) |
uni | Union | Union of the input example sets |
unl | Unlabeled | Examples that are not labelled and therefore not used when training a model |
unm | Unmatched | Examples that did not match a specified pattern in the original example set |
unr | Unrelated | Examples that were unrelated to a specified pattern in the original example set |
vis | Visualization | Self-organizing map (SOM) visualization |
wei | Weights | Attribute weights |
wor | Word | Expects or outputs a word list |
xsl | XSLT | Extensible Stylesheet Language Transformations (XSLT) document |