You are viewing the RapidMiner Radoop documentation for version 9.9.
Advanced Radoop Processes
To understand advanced Radoop process design, you should be familiar with the RapidMiner data flow, Radoop basics, and how to import data. The following section introduces the predictive analytics capabilities of RapidMiner Radoop and illustrates how smoothly it integrates into the RapidMiner client interface by using the same I/O objects, metadata, and process design concepts. However, because Radoop implements operations in Hadoop, you can deploy your process (or a selected part of it) on the cluster to manage the enormous data volumes using scalable distributed algorithms.
Model objects
Data mining methods, primarily supervised learning algorithms, create prediction models. (Other algorithms generate other types of models, but this section focuses mainly on prediction and clustering models.) These models describe the information extracted from a training data set. A RapidMiner model is an I/O object (a Prediction Model, Cluster Model, or Preprocessing Model), just like an ExampleSet object. Visualization of the model displays the extracted information to the user.
Both Radoop and core RapidMiner operators train and apply the same types of model objects. More precisely, Radoop implements the scoring and learning of some popular RapidMiner models over the distributed infrastructure; all core prediction models can now be applied on the cluster. This is true not only for prediction models, but also for the preprocessing models that a preprocessing operator generates (for example, the Noise Model generated by Add Noise).
Another type of I/O object related to modeling is the Performance Vector object. Predictions, in the form of models, are evaluated by comparing the expected target attribute values to the result of the scoring. With that result, performance criteria create measures to describe the validity of the prediction (model) from the viewpoint of the data mining process goals. In other words, Radoop performs scoring and model application, as well as performance criteria calculation, on the cluster.
Data mining scenarios
This section reviews some possible memory/cluster scenarios for your predictive modeling processes.
Unsupervised learning in Hadoop. In this scenario, scalable distributed algorithms create clustering of data in Hadoop. Some clustering algorithms build clustering models that you can apply on other similar data sets, either on Hadoop or in memory. The model is delivered on the cluster model (clu) output port of clustering operators.
Supervised learning in memory, scoring on the cluster. Core RapidMiner modeling (classification and regression) operators build predictive models in memory that you can then apply to your data in the cluster. Radoop supports applying all core prediction models. By using the SparkRM or Single Process Pushdown operators, the learning can be performed in the memory of one or multiple cluster nodes in parallel using any core RapidMiner modeling operator.
Supervised learning and scoring with native algorithms on the cluster. In this scenario, learning and scoring are both implemented on the distributed platform. Radoop supports the following algorithms: Naive Bayes, Linear Regression, Logistic Regression, Support Vector Machine, Decision Tree, and Random Forest. It also supports iterative Naive Bayes model building, which means that a model (built on the cluster or in memory) can be updated by new data (on the cluster or in memory).
Clustering models
Radoop clustering operators are built on the distributed machine learning algorithms of the Apache Mahout project. The input of a clustering operator is a HadoopExampleSet object. The algorithm creates the specified number of segments in the data. It adds a new attribute to the data set, named clusterid, that has the cluster role. The attribute contains the cluster identifier: cluster_1, cluster_2, etc.
Clustering operators may also indicate that a record is so distant from the rest of the records that it should be considered an outlier. In this case, the cluster attribute contains the value outlier.
K-Means and Fuzzy K-Means algorithms also build a centroid clustering model. You can use the centroid model to cluster similarly structured data sets that reside on your cluster or in memory.
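To make the idea of applying a centroid model concrete, the following is a minimal Python sketch, not Radoop's implementation. It assumes Euclidean distance and cluster numbering that starts at cluster_1; the function name and the `outlier_distance` threshold are hypothetical and only illustrate how a distant record could be marked as an outlier.

```python
import math

def apply_centroid_model(centroids, rows, outlier_distance=None):
    """Label each row with the identifier of its nearest centroid.

    outlier_distance is a hypothetical threshold: rows farther than this
    from every centroid are labelled "outlier".
    """
    labelled = []
    for row in rows:
        distances = [math.dist(row, centroid) for centroid in centroids]
        nearest = distances.index(min(distances))
        if outlier_distance is not None and distances[nearest] > outlier_distance:
            labelled.append((row, "outlier"))
        else:
            # Assumption: identifiers are numbered from 1 (cluster_1, cluster_2, ...).
            labelled.append((row, f"cluster_{nearest + 1}"))
    return labelled

centroids = [(0.0, 0.0), (10.0, 10.0)]
new_rows = [(1.0, 2.0), (9.0, 8.5), (50.0, 50.0)]
result = apply_centroid_model(centroids, new_rows, outlier_distance=20.0)
```

Because the model consists only of the centroids, the same assignment step can be run on any data set with the same attributes, regardless of where the model was trained.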
Classification and regression models
You can connect prediction and cluster models trained by core RapidMiner operators in the client machine's memory or in a cluster node's memory (using the SparkRM or Single Process Pushdown operators) to the input port of a Radoop Nest and use them inside its subprocess. Radoop's Apply Model operator takes just such a model and a test data set as inputs. The operator applies the prediction model on the HadoopExampleSet input and delivers a HadoopExampleSet with new prediction and confidence columns. For a cluster model, Radoop generates a cluster column instead.
Radoop supports scoring on the cluster for all RapidMiner prediction and cluster models. These models produce the same prediction inside and outside the nest on the same data sets, but scoring inside the nest is not limited to the operative memory size; it is scalable over the distributed platform.
You can also train a prediction model on the distributed platform. Radoop supports Naive Bayes, Linear Regression, Logistic Regression, Support Vector Machine, Decision Tree and the Random Forest learning algorithms. These algorithms have the following characteristics:
Algorithm | Attribute type | Class type | Notes |
---|---|---|---|
Naive Bayes | numerical and nominal | polynominal | Contains a unique, linearly scalable implementation for Naive Bayes. Ideal for iterative learning. |
Linear Regression | numerical | numerical and binominal | Integrates the Spark MLlib implementations. |
Logistic Regression | numerical | binominal | Integrates the Spark MLlib implementations. |
Support Vector Machine | numerical | binominal | Integrates the Spark MLlib implementations. |
Decision Tree | numerical and nominal | binominal and polynominal | The Decision Tree (MLlib binominal) operator integrates the Spark MLlib implementation and can handle a binominal label; the Decision Tree operator handles polynominal labels and integrates spark.ml's algorithm. |
Random Forest | numerical and nominal | polynominal | Integrates spark.ml's Random Forest algorithm. |
You can apply models trained with the above algorithms in the same way, using the core Apply Model operator (in-memory) and the Radoop Apply Model operator (in-Hadoop).
Iterative learning
Iterative learning is a special type of learning for classification or regression tasks. With each iteration, Radoop updates the model with a new training data set. Updating an existing model with only the new records is much faster than retraining, and it results in the same model as training on the combined data set, so you do not have to rebuild the model from scratch when new data arrives. This is a very common scenario for big data analytics (for example, new log records may update a prediction model periodically).
Radoop implements iterative learning with the Update Model operator. This operator takes a model as input (a Distribution Model that was built with Naive Bayes) and updates it by training on the HadoopExampleSet on its other input port. The input model may have been trained either by a Radoop learner operator (an operator that creates a model based on an ExampleSet, such as Naive Bayes) or by a core RapidMiner learner. The new data set must have exactly the same schema as the original training data set. That is, the attribute names and types must match, but ordering is irrelevant. The learning algorithm of the Update Model operator may (or may not) have parameters, for example optimization parameters. Therefore, the Update Model operator has a special "generic" parameter that allows you to easily specify model type specific parameters. (For example, if Update Model has a Distribution Model on the input, you can set the nominal group size training parameter.)
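The reason an update can produce the same model as a full retrain is that Naive Bayes depends only on additive counts. The Python sketch below illustrates this under simplifying assumptions: it is not the Update Model operator, it handles nominal attributes only, and the function names, the sample data, and the smoothing constants (which assume two-valued attributes) are hypothetical.

```python
from collections import Counter

def update(model, rows):
    """Return a new model whose counts also include the given rows."""
    classes = Counter(model["classes"])
    values = Counter(model["values"])
    for attributes, label in rows:
        classes[label] += 1
        for name, value in attributes.items():
            values[(label, name, value)] += 1
    return {"classes": classes, "values": values}

def train(rows):
    """Train from scratch by updating an empty model."""
    return update({"classes": Counter(), "values": Counter()}, rows)

def predict(model, attributes):
    """Return the class with the highest smoothed Naive Bayes score."""
    total = sum(model["classes"].values())
    scores = {}
    for label, count in model["classes"].items():
        score = count / total
        for name, value in attributes.items():
            # Laplace smoothing; the "+ 2" assumes two possible values per attribute.
            score *= (model["values"][(label, name, value)] + 1) / (count + 2)
        scores[label] = score
    return max(scores, key=scores.get)

old_rows = [({"outlook": "sunny", "windy": "no"}, "play"),
            ({"outlook": "rainy", "windy": "yes"}, "stay")]
new_rows = [({"outlook": "sunny", "windy": "yes"}, "play"),
            ({"outlook": "rainy", "windy": "no"}, "stay")]

incremental = update(train(old_rows), new_rows)  # touches only the new records
full = train(old_rows + new_rows)                # rereads everything
```

In this sketch `incremental` and `full` hold identical counts, so they score new records identically, while the update step never rereads the old data.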
The following illustrates model update and apply in a single subprocess:
Ensemble modeling
Radoop contains a meta operator called Combine Models. This operator can be used both inside and outside of the Radoop Nest. It simply creates a Bagging Model (sometimes called a voting model) from the trained models on its input. (Note that all input models have to be trained on data with the same schema.) The input can be a collection of models or an arbitrary number of models on the input ports. The ensemble model on the output port will be a voting model. If this model is used for scoring, it applies all inner models, then takes the average of confidence values (if it's a classification model) to decide the predicted class. In the case of a regression model (having a numerical label), it calculates the average of the inner models' predictions.
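The scoring rule of a voting model can be sketched in a few lines of Python. This is an illustration of the averaging described above, not the Bagging Model's actual code; the function names and the confidence values are made up for the example.

```python
def vote_classification(confidences_per_model):
    """Average each class's confidence over all inner models and pick the top class."""
    labels = confidences_per_model[0].keys()
    n = len(confidences_per_model)
    averaged = {
        label: sum(conf[label] for conf in confidences_per_model) / n
        for label in labels
    }
    return max(averaged, key=averaged.get), averaged

def vote_regression(predictions):
    """Average the numerical predictions of the inner models."""
    return sum(predictions) / len(predictions)

# Three hypothetical inner models scoring the same record.
inner_confidences = [
    {"yes": 0.90, "no": 0.10},
    {"yes": 0.40, "no": 0.60},
    {"yes": 0.45, "no": 0.55},
]
predicted_class, averaged = vote_classification(inner_confidences)
predicted_value = vote_regression([10.0, 12.0, 14.0])
```

Note that averaging confidences differs from counting votes: here two of the three inner models prefer "no", yet the averaged confidence favors "yes" because the first model is very confident.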
You may want to use the Combine Models operator together with a training loop if you have too much training data on the cluster and, for some reason, the distributed algorithms provided by Radoop do not fit your use case. You can fetch random samples from your data to memory and train different models on these samples. After combining these models into a single Bagging Model, you can then apply the model directly to your data on the cluster. In this way, you not only use RapidMiner's wide range of learning algorithms, but you also build a very precise and robust ensemble model in a short time using only samples of data.
You can also use only a subset of the regular attributes in each training iteration to build a more precise voting model. The Select random attributes operator is ideal for this task. For the training algorithm, RapidMiner Radoop's decision tree learning algorithm may be a good choice. This way, you can implement, for example, a Random Forest learning algorithm, controlling the sample size, the number of iterations, and the randomization method based on the resources available (memory, bandwidth, amount of data). You can use any unstable learning algorithm instead of Decision Tree to benefit from the Combine Models operator, as Radoop supports building the Random Forest model in a distributed way with a single operator.
The following illustrates a process that implements a Random Forest training algorithm on data samples (a loop with 100 iterations). It uses the Tree panel of the process to show all operators. You can open the Tree panel for a process by selecting the View > Show Panel > Tree menu item.
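The training loop described above (random sample, random attribute subset, one model per iteration, then a combined vote) can be sketched as follows. This is a toy Python illustration, not a Radoop process: `train_stump` is a hypothetical one-threshold learner standing in for any unstable core learner, each iteration uses a single randomly chosen attribute, and the final vote is a plain majority, which matches averaging confidences when every inner model reports a confidence of 1.0.

```python
import random
from collections import Counter

def train_stump(rows, feature):
    """Fit a one-threshold classifier on one attribute (stand-in learner)."""
    best = None
    for threshold in sorted({x[feature] for x, _ in rows}):
        left = Counter(y for x, y in rows if x[feature] <= threshold)
        right = Counter(y for x, y in rows if x[feature] > threshold)
        left_label = left.most_common(1)[0][0]
        right_label = right.most_common(1)[0][0] if right else left_label
        correct = left[left_label] + right[right_label]
        if best is None or correct > best[0]:
            best = (correct, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x[feature] <= threshold else right_label

def train_ensemble(rows, iterations, sample_size, seed=0):
    """Train one model per iteration on a random sample and a random attribute."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    models = []
    for _ in range(iterations):
        sample = rng.sample(rows, sample_size)   # fetch a sample to memory
        feature = rng.randrange(n_features)      # random attribute selection
        models.append(train_stump(sample, feature))
    return models

def predict(models, x):
    """Majority vote over the inner models."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

# Toy data: two correlated attributes, label depends on the first one.
rows = [((i, 2 * i), "low" if i < 50 else "high") for i in range(100)]
models = train_ensemble(rows, iterations=25, sample_size=20)
```

Sample size and iteration count are the two knobs that trade memory and bandwidth against model quality; each inner model only ever sees `sample_size` rows.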
In-memory subprocesses
RapidMiner Radoop offers multiple ways to take advantage of RapidMiner's in-memory operators. The In-Memory Subprocess (Sample) and In-Memory Subprocess (Full) operators implement the concept described in the previous section in a more general way. They both operate on subsets of rows. These operators are very useful for training ensemble models, but In-Memory Subprocess (Full) is also capable of doing other preprocessing tasks. Unlike the In-Memory Subprocess operators, the SparkRM and Single Process Pushdown operators are capable of executing the subprocess on a cluster node using the node's available memory. This way, you can use the Hadoop cluster's resources to perform any in-memory computation task.
Using the cluster's memory
The SparkRM and Single Process Pushdown operators can be used to execute in-memory subprocesses on one or more cluster nodes. They are described in detail on the Process Pushdown subpage.
Using the client machine's memory
Since large datasets reside on the cluster, and cluster nodes are expected to have more resources than a client machine (that runs Studio or Server in this case), SparkRM and Single Process Pushdown should be preferred over the two (in a sense, legacy) operators described below.
In-Memory Subprocess (Sample) is a meta operator inside the Radoop Nest that runs its subprocess on a data sample in operative memory. The Radoop process fetches a sample from the HadoopExampleSet on the In-Memory Subprocess (Sample) input port to memory, and the subprocess, consisting of any core RapidMiner operators, is applied on this data. The data is then written back to the HDFS for further processing by the Radoop process. You can connect other I/O objects, for example a model trained on this sample data, to the output port for later use on the cluster. Please note that the SparkRM or Single Process Pushdown operators are preferred over the In-Memory Subprocess (Sample) as they perform the same operation but use the cluster's resources instead of the client machine's.
In-Memory Subprocess (Full) differs from the sample version in the following way: it processes all the data on its input port in small partitions that fit into operative memory. You must either define the number of these partitions that the input data will be split into or the size of the partition. The meta operator performs a loop on these partitions. In each iteration it reads the data in the current partition to memory, runs the subprocess on it, and then writes the data back to the HDFS (appends it to the output table).
There are basically two different scenarios for using the In-Memory Subprocess (Full) meta operator:
when you want to do special data preprocessing that you cannot implement with Radoop operators, but can do with core RapidMiner operators (or operators from another extension). In this case, with the meta operator you can stream your data through the client's operative memory while preprocessing it. As the operative memory and/or network bandwidth may limit this, larger data sets are best served using RapidMiner Server. Your server may have enough memory and a good connection to process larger input data (larger partitions mean fewer iterations).
when you build object(s) using splits and you do not need to write the data back to the cluster. For example, similar to the Random Forest example in the previous section, you can train as many Decision Tree models to as many partitions as you choose (that is, "number of decision trees on the output" = "number of partitions"). You can then combine these models to a voting model with the Combine Models operator. To do this, connect the model built inside the In-Memory Subprocess (Full) meta operator to the output, then connect the <Collection of Decision Tree> output of the meta operator to the Combine Models input. This way, you train a model on data that is much larger than the operative memory size, but can still expect the voting model to have quite good precision.
Just like In-Memory Subprocess (Sample), In-Memory Subprocess (Full) can be replaced by the Single Process Pushdown. If your data is bigger than the memory of the largest node in your cluster, you can use the Generate Attributes operator to create a random attribute and Filter Examples in a Loop to create the partitions for the Pushdown operator.
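The partitioning trick in the last paragraph (a random attribute plus a filter inside a loop) can be sketched in Python as follows. This only mirrors the logic; it does not use any Radoop operator, and the function name, the seed, and the choice of a uniformly random integer as the generated attribute are assumptions made for the illustration.

```python
import random

def partition_by_random_attribute(rows, n_partitions, seed=0):
    """Yield n_partitions disjoint subsets that together cover all rows."""
    rng = random.Random(seed)
    # Analogue of generating a random attribute: tag every row with a partition id.
    tagged = [(rng.randrange(n_partitions), row) for row in rows]
    # Analogue of the loop with a filter: one pass per partition id.
    for partition_id in range(n_partitions):
        yield [row for pid, row in tagged if pid == partition_id]

rows = list(range(1000))
partitions = list(partition_by_random_attribute(rows, n_partitions=4))
```

Each yielded subset is what a single pushdown iteration would receive. Partition sizes are only approximately equal, so the partition count should leave some headroom relative to the available memory.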
Evaluation
The purpose of performance evaluator operators is to provide simple measures for evaluating the current prediction and the prediction model itself. Analysts can define the target measures that they want to optimize for and the models can be compared using these performance criteria.
Radoop implements measures for evaluating binominal and polynominal classification and regression. These form a subset of the set of criteria that core RapidMiner evaluation operators implement. The following list of performance criteria describes these measures.
Type | Criterion name | Criterion description |
---|---|---|
binominal and polynominal | accuracy | Relative number of correctly classified examples |
binominal and polynominal | classification_error | Relative number of misclassified examples |
binominal and polynominal | kappa | Kappa statistics for the classification |
binominal | precision | Relative number of correctly as positive classified examples among all examples classified as positive |
binominal | recall | Relative number of correctly as positive classified examples among all positive examples |
binominal | lift | The lift of the positive class |
binominal | fallout | Relative number of incorrectly as positive classified examples among all negative examples |
binominal | f_measure | Combination of precision and recall: f=2pr/(p+r) |
binominal | false_positive | Absolute number of incorrectly as positive classified examples |
binominal | false_negative | Absolute number of incorrectly as negative classified examples |
binominal | true_positive | Absolute number of correctly as positive classified examples |
binominal | true_negative | Absolute number of correctly as negative classified examples |
binominal | sensitivity | Relative number of correctly as positive classified examples among all positive examples (same as recall) |
binominal | specificity | Relative number of correctly as negative classified examples among all negative examples |
binominal | youden | The sum of sensitivity and specificity minus 1 |
binominal | positive_predictive_value | Relative number of correctly as positive classified examples among all examples classified as positive (same as precision) |
binominal | negative_predictive_value | Relative number of correctly as negative classified examples among all examples classified as negative |
binominal | psep | The sum of the positive predictive value and the negative predictive value minus 1 |
polynominal and regression | absolute_error | Average absolute deviation of the prediction from the actual value |
polynominal and regression | relative_error | Average relative error (average of absolute deviation of the prediction from the actual value divided by actual value) |
polynominal and regression | relative_error_lenient | Average lenient relative error (average of absolute deviation of the prediction from the actual value divided by maximum of the actual value and the prediction) |
polynominal and regression | relative_error_strict | Average strict relative error (average of absolute deviation of the prediction from the actual value divided by minimum of the actual value and the prediction) |
polynominal and regression | root_mean_squared_error | Averaged root-mean-squared error |
polynominal and regression | squared_error | Averaged squared error |
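Several of the criteria in the table follow directly from the four confusion matrix counts or from pairs of actual and predicted values. The Python sketch below works through the definitions as stated in the table; it is not Radoop's evaluation code. The function names and input numbers are made up, the relative error variants take absolute values in the denominators (an assumption for handling negative values), and zero denominators are not handled.

```python
def binominal_criteria(tp, fp, tn, fn):
    """Derive binominal criteria from absolute confusion matrix counts."""
    precision = tp / (tp + fp)        # same as positive_predictive_value
    recall = tp / (tp + fn)           # same as sensitivity
    specificity = tn / (tn + fp)
    npv = tn / (tn + fn)              # negative_predictive_value
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "classification_error": (fp + fn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "fallout": fp / (fp + tn),
        "f_measure": 2 * precision * recall / (precision + recall),
        "specificity": specificity,
        "youden": recall + specificity - 1,
        "negative_predictive_value": npv,
        "psep": precision + npv - 1,
    }

def error_criteria(actual, predicted):
    """Average absolute and relative deviations between predictions and actual values."""
    pairs = list(zip(actual, predicted))
    n = len(pairs)
    return {
        "absolute_error": sum(abs(p - a) for a, p in pairs) / n,
        "relative_error": sum(abs(p - a) / abs(a) for a, p in pairs) / n,
        "relative_error_lenient": sum(abs(p - a) / max(abs(a), abs(p)) for a, p in pairs) / n,
        "relative_error_strict": sum(abs(p - a) / min(abs(a), abs(p)) for a, p in pairs) / n,
        "squared_error": sum((p - a) ** 2 for a, p in pairs) / n,
    }

classification = binominal_criteria(tp=40, fp=10, tn=45, fn=5)
regression = error_criteria(actual=[10.0, 20.0], predicted=[8.0, 25.0])
```

The lenient variant divides by the larger of the two values and the strict variant by the smaller, so for the same data the lenient error is never larger than the strict one.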