You are viewing the RapidMiner Studio documentation for version 9.4.

Important Terms

The following lists the first terms you need to know when using RapidMiner Studio. Following the terms are descriptions of the RapidMiner data types and the operator ports.

Attribute

The information elements describing a scenario. Attributes are the table columns of a data set.

The sample set included in this Getting Started guide has the attributes gender, age, payment method, last interaction, and churn.

Classification

The process of predicting which category (or class) an example belongs to, based on existing data for which category membership is known. Each category is one of the possible values of the label. (Similarly, regression is the process of predicting numerical results.) That is, with classification you construct a model that, when trained, uses the learned rules to predict the category of new data.

Each example in the data set falls into the category of either churning or not churning. The prediction of which category each example falls into, for those examples missing the label data, is derived from the rules learned during training.

Data set

The training set is the data used to discover predictive relationships and train models. The test set is the data used to test the accuracy and meaningfulness of a model's representation of the predictive relationship (typically discovered using the training set). The new data set is the data with missing labels; the rules derived from the training set are applied to predict the outcome for the new data set.

In this tutorial, you train and test your model using the customer-churn-data data set. Originally an Excel file, customer-churn-data became an available data set when you imported it into RapidMiner.
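
The three kinds of data set can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how RapidMiner splits data: the function name `split_examples`, the `test_fraction` parameter, and the sample values are all hypothetical, and the split is a plain slice with no shuffling.

```python
def split_examples(examples, label="churn", test_fraction=0.25):
    """Split examples into training, test, and new (unlabeled) data.

    Hypothetical helper: examples whose label is None are treated as new data.
    """
    labeled = [e for e in examples if e.get(label) is not None]
    new_data = [e for e in examples if e.get(label) is None]
    # Hold back the last part of the labeled examples for testing.
    cut = len(labeled) - round(len(labeled) * test_fraction)
    return labeled[:cut], labeled[cut:], new_data


# Made-up examples: eight with a known label, two with a missing label.
examples = (
    [{"age": 30 + i, "churn": "yes" if i % 2 else "no"} for i in range(8)]
    + [{"age": 50, "churn": None}, {"age": 62, "churn": None}]
)

training_set, test_set, new_data = split_examples(examples)
print(len(training_set), len(test_set), len(new_data))
```

The model is trained on `training_set`, checked against `test_set`, and then applied to `new_data`, where the label is what you want to predict.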

Example

Characterized by its attributes, an example has concrete values that can be compared with other examples. Examples are the table rows of a data set.

The sample set customer-churn-data includes 993 examples (also known as rows). They are identified by a row number that RapidMiner prepends.

Example Set

The table created from the attributes (columns) and examples (rows). Also known as data or data set.

The sample set used here is customer-churn-data, which originated from the file customer-churn-data.xlsx.
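
As a rough mental model, an example set behaves like a table in which each row is an example and each column is an attribute. The Python sketch below uses the attribute names from this guide, but the values are invented for illustration and are not taken from customer-churn-data.

```python
# Each dictionary is one example (row); each key is one attribute (column).
# All values are hypothetical.
example_set = [
    {"gender": "male", "age": 61, "payment method": "credit card",
     "last interaction": 12, "churn": "yes"},
    {"gender": "female", "age": 35, "payment method": "cash",
     "last interaction": 3, "churn": "no"},
    {"gender": "male", "age": 47, "payment method": "cheque",
     "last interaction": 30, "churn": "no"},
]

attributes = list(example_set[0].keys())     # the table columns
num_examples = len(example_set)              # the table rows
ages = [row["age"] for row in example_set]   # one attribute across all examples

print(attributes)
print(num_examples, ages)
```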

Label

The identifying attribute in relation to the current question. The goal is to know or learn this attribute's (the label's) value, or learn rules for deriving it from the regular attributes, for each row in the example set. Sometimes referred to as the target attribute or variable, it is the thing to predict for new examples that are not yet characterized. There can be only one label per data set.

Churn is the attribute of interest in this tutorial’s data set. Setting the role of the Churn attribute to label allows you to predict, for each example, whether the customer will cancel.

Model

The data mining method or prediction instruction. A model discovers rules and/or predicts unknown situations for current and future examples.

In this tutorial you created a model that predicts whether a customer will cancel. Your evaluation (validation) of the model returns accuracy percentages.
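
An accuracy percentage is the share of examples for which the predicted label matches the known label. The sketch below shows that arithmetic only; the function name `accuracy` and the label values are hypothetical, and it does not reproduce RapidMiner's validation operators.

```python
def accuracy(actual, predicted):
    """Return the percentage of predictions that match the known labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one example is required")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)


# Made-up labels for four test examples: three of four predictions match.
known_labels = ["yes", "no", "no", "yes"]
predicted_labels = ["yes", "no", "yes", "yes"]
print(accuracy(known_labels, predicted_labels))  # 75.0
```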

Operator

The building blocks, grouped by function, used to create RapidMiner processes. An operator has input and output ports; the action performed on the input ultimately leads to what is supplied to the output. Operator parameters control those actions. There are more than 1500 operators available in RapidMiner. Operators, in the Operators panel of the Design view, are both browsable and searchable.

In this tutorial you connect the Retrieve operator (which “retrieves” the data set) to the Filter Examples operator. The resulting labeled data set is connected to the Decision Tree operator to determine the set of rules RapidMiner will use to generate its predictions.

Panel

Each view has its own set of panels, or tools, related to the view. They can be moved, sized, and hidden to suit. You can access additional panels from the View > Show Panel pull-down menu.

See the graphic with callouts to identify panels. The following lists the default panels for each view:

  • Design: Operators, Repository, Process, Parameters, Help
  • Results: Repository, Result History
  • Hadoop Data (if the extension is installed): Hadoop Data, Hadoop Metadata, Hadoop Data Log

Parameter

The setting(s) whose value(s) determine the characteristics or behavior of an operator. RapidMiner presents parameters in the Parameters panel of the Design view. There are regular parameters and expert parameters. The expert parameters are indicated by italic names and are displayed or hidden by clicking the Show/Hide advanced parameters link at the bottom of the panel.

As part of the Wisdom of Crowds capabilities, RapidMiner Studio provides parameter recommendations based on the knowledge and best practices of other RapidMiner users. The recommender helps configure operators by providing recommendations on which parameters to change and by suggesting appropriate parameter values.

This tutorial uses the filtering parameters of the Filter Examples operator to create a training data set.

Ports

The point through which data moves, represented by a labeled semicircle icon on the sides of operators and the Design view. See the list of port abbreviations below.

To see your filtered example set, connect the Output (out) port of the Retrieve operator to the ExampleSet (exa) port of Filter Examples. Then, connect the ExampleSet (exa) port on Filter Examples to the Results (res) port at the right of the Process view and click Run.

Prediction

The most probable value for a target attribute; predictions are derived from data mining. If you have rules and data, you can predict an outcome.

The process in this tutorial may predict, for example: If the customer is male, over 54 years of age, and paid by credit card, then the probability of this customer canceling is high.
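
The rule quoted above can be written as a short function. This is a hand-written sketch of that single rule, assuming hypothetical attribute keys and values; a decision tree learned by RapidMiner would contain many such branches with thresholds derived from the training data.

```python
def predict_churn_risk(example):
    """Apply the single example rule from the text.

    Returns "high" when the rule matches, or None when this rule says
    nothing about the example.
    """
    if (
        example["gender"] == "male"
        and example["age"] > 54
        and example["payment method"] == "credit card"
    ):
        return "high"
    return None


# Made-up customers for illustration.
older_customer = {"gender": "male", "age": 60, "payment method": "credit card"}
younger_customer = {"gender": "male", "age": 40, "payment method": "credit card"}

print(predict_churn_risk(older_customer))    # high
print(predict_churn_risk(younger_customer))  # None
```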

Process

A set of interconnected operators represented by a flow design, where each operator manipulates your data. A process might, for example, load a data set, transform the data, compute a model, and apply the model to another data set.

This tutorial creates a process that retrieves a data set from the repository, filters the data to create a training set, applies a decision tree operator to derive rules for predictions, applies the model to unlabeled data, and runs validation to evaluate the model.
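
One way to picture a process is as a chain of functions, where the output of each operator becomes the input of the next. The sketch below is an analogy only: `run_process` and the three stand-in operators are hypothetical names, the data is invented, and real RapidMiner operators are connected through ports rather than function calls.

```python
from functools import reduce


def retrieve(_unused=None):
    """Stand-in for a retrieve step: returns a made-up example set."""
    return [
        {"age": 61, "churn": "yes"},
        {"age": 35, "churn": "no"},
        {"age": 47, "churn": None},  # label is missing
    ]


def filter_labeled(examples):
    """Stand-in for a filter step: keep examples whose label is present."""
    return [e for e in examples if e["churn"] is not None]


def select_ages(examples):
    """Stand-in for a transformation step: keep only the age values."""
    return [e["age"] for e in examples]


def run_process(operators, data=None):
    """Feed the output of each operator into the next one, in order."""
    return reduce(lambda current, operator: operator(current), operators, data)


result = run_process([retrieve, filter_labeled, select_ages])
print(result)  # [61, 35]
```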

Process view

The working area for building processes. This is the canvas in the Design view where you drag operators or where, when you double-click a process, the operators of that process appear.

When building your process, you first dragged your data set, customer-churn-data, onto the Process panel. Next you added a Filter Examples operator and connected them.

Repository

The storage mechanism for data and RapidMiner processes. Best practice recommends you use the repository for data storage instead of reading directly from a file or database. If you use a Read operator, metadata will not be available to RapidMiner, limiting the available functions.

By default, RapidMiner Studio comes configured with a variety of sample data sets and processes in the Samples directory of your repository. When this tutorial is complete, your Local Repository will include a new data set and new processes. From the Repository panel you can also access the Cloud Repository.

Role

The identifying tag for or function of an attribute. Roles tell RapidMiner of special meaning or treatment for an attribute. RapidMiner has several pre-defined roles and supports the ability to create your own roles. The label role is of utmost importance in defining the target for a prediction. Any attribute without a role assigned is known as a regular attribute.

Apply the label role to the churn attribute. If the data set included row numbers, assign that attribute the id role. All other attributes are not assigned a role and are therefore regular attributes.
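
The relationship between roles and regular attributes can be sketched as a simple mapping. The attribute name `row number` and the helper functions are hypothetical; the sketch only illustrates that attributes without a role are regular and that a data set has exactly one label.

```python
# All attributes of a made-up example set, including a row-number column.
attributes = ["row number", "gender", "age", "payment method",
              "last interaction", "churn"]

# Only attributes with a special meaning get a role.
roles = {"churn": "label", "row number": "id"}


def regular_attributes(attributes, roles):
    """Attributes without an assigned role are regular attributes."""
    return [name for name in attributes if name not in roles]


def label_attribute(roles):
    """Return the single attribute with the label role."""
    labels = [name for name, role in roles.items() if role == "label"]
    if len(labels) != 1:
        raise ValueError("a data set must have exactly one label")
    return labels[0]


print(label_attribute(roles))
print(regular_attributes(attributes, roles))
```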

Training

The process of finding predictive relationships. The outcome of this learning process is the model.

Assigning the label role to the Churn attribute created a decision tree that considers the age, gender, payment method, and last purchase to create rules for the new data.

View

A "work area" in which you access a specific functionality. There are two pre-defined views. Some extensions can add their own views (for example, the Radoop Extension). You can also create your own view by clicking New view... in the View menu.

See the graphic with callouts to locate each view:

  • Design: Canvas and tools for building and managing processes.
  • Results: Visualization, in many varied formats, of design process results.
  • Hadoop Data: Access to Radoop-related work.

RapidMiner data types

The following terms describe the data types RapidMiner assigns to attributes. Defining a data type specifies the kind of values allowed for an attribute. RapidMiner supports the natural division of numbers, texts, and dates. Numeric is the label for numbers, nominal for texts or strings, and date_time for dates.

attribute

Parent of all possible types ("any type").

binominal

Exactly two values (for example true/false or yes/no).

date

Date without time (for example 23.12.2014).

date_time

Both date and time (for example 23.12.2014 17:59).

file_path

Nominal data type (rarely used) that allows for more granular distinction. Can be used to mark a column as "only containing file paths."

integer

A whole number (for example, 23, -5, or 11,024,768).

nominal

All kinds of text values; includes polynominal and binominal.

numeric

All kinds of number values; includes date, time, integer, and real numbers.

polynominal

Many different string values (for example red, green, blue, yellow).

real

A fractional number (for example 11.23 or -0.0001).

text

Nominal data type that allows for more granular distinction (to differentiate from polynominal).

time

Time without date (for example 17:59).
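
The distinctions between binominal and polynominal, and between integer and real, can be illustrated with two small functions. This is a simplified sketch based only on the definitions above; the function names are hypothetical, and RapidMiner's own type guessing may follow different rules.

```python
def nominal_subtype(values):
    """Classify text values by how many distinct values occur.

    Exactly two distinct values -> binominal; more than two -> polynominal;
    otherwise fall back to the general nominal type.
    """
    distinct = {v for v in values if v is not None}
    if len(distinct) == 2:
        return "binominal"
    if len(distinct) > 2:
        return "polynominal"
    return "nominal"


def numeric_subtype(values):
    """Classify numbers as integer (all whole) or real (any fractional)."""
    if all(float(v).is_integer() for v in values):
        return "integer"
    return "real"


print(nominal_subtype(["yes", "no", "yes"]))                # binominal
print(nominal_subtype(["red", "green", "blue", "yellow"]))  # polynominal
print(numeric_subtype([23, -5, 11024768]))                  # integer
print(numeric_subtype([11.23, -0.0001]))                    # real
```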

Operator port information

The following table lists each port abbreviation and provides a brief description.

Port Abbreviation | Meaning | Description
ano | Anova | Anova matrix; Anova significance test
ann | Annotation | Annotations extracted from the input object
arc | Archive | Archive file generated during execution of the operator
ass | Association | Association rules that have been discovered in a frequent item set
att | Attribute | Attribute weights (in and out)
ave | Average | Performance measures; estimate of performance using the model built on the complete delivered data set
clu | Cluster model | Cluster model created when clustering an example set
clu | Clustered set | Example set given to the clustering operator; may contain an attribute with a cluster role (describes the cluster of each example)
col | Collection | Collection of objects
con | Condition | Any object can be supplied; the condition specified in parameters is tested on this object
cov | Covariance | Covariance matrix
dic | Dictionary | Example set used for replacing 'from' values with 'to' values in a given example set
dis | Distance measure | Similarity measure object
doc | Document | Document or document set
err | Error | Standard error output
est | Estimated performance | Performance vector of the SVM model, which gives an estimation of the statistical performance of this model
exa | Example set | Example set
fil | File | File object
fla | Flat | Flat collection or flat clustering model
for | Formula | Formula result
fre | Frequent | Frequent item or item sets for association rule learning
gro | Grouped | Grouped models, attributes, items
hie | Hierarchical | Hierarchical clustering model
inp | Input | Input source, can take various objects
ite | Item sets | Frequent item sets (groups of items that often appear together in the data)
joi | Join | Join of the left and right example sets
lab | Labeled data | The model given as input is applied to the example set, and the updated example set is delivered from this port
lef | Left | Left input port expecting an example set, which is used as the left example set for a join
lif | Lift chart | Lift Pareto chart for the given model and example set
mat | Matrix | Correlations matrix of all attributes of the input example set
mer | Merged | Merged example set
mod | Model | Default model from this output port
obj | Object | IO object
ori | Original | Input example set is passed without changing to this port
out | Output | Output port
par | Parameter set | Set of parameters that can be applied on an operator
pat | Patterns | GSP algorithm is applied on the given example set; resultant sequential patterns set is delivered through this port
per | Performance | Performance Vector for selected attributes
pre | Preprocessing | Preprocessing model with information regarding the operator's parameters in the current process
ran | Random forest | Model of a random forest
ref | Reference | Provided reference data or reference set
req | Request set | Provided example set
res | Result set | Distance or similarity between examples of the request set and reference set
rig | Right | Right input port expecting an example set, which is used as the right example set for a join
roc | ROC curve | Calculated ROC curves for included models
rul | Rules | Association rules that have been discovered in a frequent item set
sec | Second | Input takes an example set derived from the output of the Generate ID operator in an attached example process
seg | Segment | Segment of an image
sel | Selected | Object specified by the index parameter is returned through this port
ses | Session | Session example set
sig | Significance | Significance test results of the performance vector comparison are delivered through this port
sim | Similarity | Calculated similarity between each example of the given example set and every other example of the same set
sin | Single | Single object of the given collection, which is processed in the inner part of the operator
sta | Stacking | Stacking examples or model
sto | Stored | Through this port, the input object is passed without changing to the output
sub | Subtrahend | Expects an example set; example set must have ID attribute
sup | Superset | Superset of input example sets
thr | Through | Objects are passed through without changing
thr | Threshold | Threshold output of the Select Recall operator
tra | Training | Training data to train a model (example set)
uni | Union | Union of the input example sets
unl | Unlabeled | Examples that are not labelled and therefore not used when training a model
unm | Unmatched | Examples that did not match a specified pattern in the original example set
unr | Unrelated | Examples that were unrelated to a specified pattern in the original example set
vis | Visualization | Self-organizing map (SOM) visualization
wei | Weights | Attribute weights
wor | Word | Expects or outputs a word list
xsl | XSLT | Extensible Stylesheet Language Transformations (XSLT) document