You are viewing the RapidMiner Radoop documentation for version 9.1.
Installing RapidMiner Radoop on RapidMiner Studio
RapidMiner Radoop is client software with an easy-to-use graphical interface for processing and analyzing big data on a Hadoop cluster. It can be installed on RapidMiner Studio and/or RapidMiner Server, and provides a platform for editing and running ETL, data analytics, and machine learning processes in a Hadoop environment. RapidMiner Radoop runs on any platform that supports Java.
Integrating RapidMiner Radoop into the RapidMiner advanced analytics suite is as easy as downloading the extension and making some configuration changes. The following instructions describe the process for installing the RapidMiner Radoop extension.
Prerequisites
The installation instructions assume that you have completed the following tasks. If any of these prerequisites have not yet been met, be sure to finish them before proceeding with the installation.
Component | Notes |
---|---|
RapidMiner | You need RapidMiner Studio and, optionally, RapidMiner Server installed. If necessary, see the instructions for RapidMiner Studio installation or RapidMiner Server installation. |
RapidMiner Radoop license | The free Radoop license is downloaded automatically once you log in. (Note that Radoop Basic is not enough to use Radoop.) If you are interested in enabling advanced capabilities and support, contact us to purchase a RapidMiner Radoop license. |
Hadoop cluster | RapidMiner Radoop requires a connection to a properly configured Hadoop cluster. See Hadoop cluster requirements and supported Hadoop distributions. |
A distributed data warehouse system | RapidMiner Radoop supports Apache Hive or Impala. The system must be installed on a Hadoop cluster. See the supported data warehouse systems. |
Networking Setup | Make sure that RapidMiner Radoop can connect to your Hadoop cluster. After installing RapidMiner Radoop and creating connections, refer to networking setup for more information. |
Verifying port availability for RapidMiner Radoop
RapidMiner Radoop requires access to a variety of ports on the cluster. Make note of your port assignments for later use when configuring cluster connections and security settings. The table in the networking setup section lists the default port assignments for various components.
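Before configuring connections, it can help to confirm from the client machine that the noted ports actually accept connections. The following is a minimal sketch, not part of RapidMiner Radoop: the function names `port_status` and `check_ports` are made up for illustration, and the host name and port numbers in the usage line are placeholders. Take the real values from the table in the networking setup section.

```shell
#!/usr/bin/env bash
# Sketch: report whether TCP ports on a cluster host accept connections.
# Function names are hypothetical helpers, not Radoop tooling.

# port_status HOST PORT -> prints "HOST:PORT open" or "HOST:PORT closed"
port_status() {
  # bash's /dev/tcp pseudo-device attempts a TCP connection;
  # timeout caps the wait at 3 seconds for filtered ports.
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 open"
  else
    echo "$1:$2 closed"
  fi
}

# check_ports HOST PORT... -> one status line per port,
# returns non-zero if at least one port is closed.
check_ports() {
  local host="$1" rc=0 port line
  shift
  for port in "$@"; do
    line=$(port_status "$host" "$port")
    echo "$line"
    case "$line" in
      *closed) rc=1 ;;
    esac
  done
  return "$rc"
}
```

A call such as `check_ports master.example.com 8020 10000` (placeholder host and ports) prints one line per port. A port reported as closed may be blocked by a firewall or may simply have no service listening on it.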
Hadoop cluster requirements
RapidMiner Radoop requires a connection to a properly configured Hadoop cluster where it will execute all of its main data processing operations and store the data related to these processes. The cluster contains the following components:
- a supported Hadoop distribution, which consists of HDFS and YARN
- a distributed data warehouse system (Hive or Impala)
- Java 8 on the cluster nodes (necessary for applying most RapidMiner models in-Hadoop and using Process Pushdown operators)
- optionally, Apache Spark. The Spark requirements on the cluster are described in detail below.
RapidMiner Radoop supports Spark 1.2.0 and all later versions. The table below shows which Spark versions each Spark operator requires.
Spark features | Spark version 1.2.x/1.3.x/1.4.x | Spark version 1.5.x/1.6.x | Spark version 2.0.x/2.1.x/2.2.x/2.3.x |
---|---|---|---|
Linear Regression | |||
Logistic Regression | |||
Decision Tree (MLlib binominal) | |||
Support Vector Machine | |||
Decision Tree | |||
Random Forest | |||
Single Process Pushdown | |||
SparkRM | |||
Spark Script | |||
K-Means | |||
Isolation Forest | |||
Using all Spark operators
If you want to use every Spark operator and your Hadoop cluster does not have Spark 1.5 or above, Spark must be installed on the cluster manually. You can download it from the Apache Spark download page. Make sure that the package type matches your cluster setup.
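To decide whether a manual installation is needed, you can compare the Spark version on the cluster against the 1.5 threshold. The following is a minimal sketch with hypothetical helper names. It assumes that `spark-submit --version` prints a banner containing the text `version X.Y.Z`, and that `sort -V` (version sort) is available on the machine.

```shell
#!/usr/bin/env bash
# Sketch: check whether a Spark version is 1.5 or above.
# All function names here are illustrative, not part of Spark or Radoop.

# parse_spark_version: reads banner text on stdin and prints the
# first dotted number that follows the word "version".
parse_spark_version() {
  grep -oE 'version [0-9]+(\.[0-9]+)*' | head -n 1 | cut -d ' ' -f 2
}

# version_at_least ACTUAL MINIMUM -> exit 0 if ACTUAL >= MINIMUM.
# sort -V orders version strings; if MINIMUM sorts first (or the two
# are equal), ACTUAL is new enough.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# spark_supports_all_operators VERSION -> exit 0 if VERSION >= 1.5
spark_supports_all_operators() {
  version_at_least "$1" 1.5
}
```

On a cluster node, `spark-submit --version 2>&1 | parse_spark_version` would yield the version string to pass to `spark_supports_all_operators`, under the banner-format assumption above.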
Installing Spark 1.5.2 for Hadoop 2.6 or later (you need to change the download link and the path for older Hadoop or newer Spark versions):
    hadoop fs -mkdir -p /tmp/spark
    wget -O /tmp/spark-1.5.2-bin-hadoop2.6.tgz http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.6.tgz
    tar xzvf /tmp/spark-1.5.2-bin-hadoop2.6.tgz -C /tmp/
    hadoop fs -put /tmp/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar /tmp/spark/
To use the Spark Script operator, you need Python 2.6+ or Python 3.4+ (for PySpark scripts) and R 3.1+ (for SparkR scripts) installed on the cluster nodes. To use MLlib functions in Python, also install the numpy package. Because of PARQUET-136, Hive version 1.2.0 or later is recommended.
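A quick way to see whether a cluster node has these tools is to probe for them on the node itself. The following is a minimal sketch with hypothetical function names. It assumes the interpreters are on the `PATH` under the names `python` and `R`, which varies between systems (some only provide `python3`, for example). It reports presence only, so the version numbers still need to be compared against the requirements above.

```shell
#!/usr/bin/env bash
# Sketch: report whether the Spark Script prerequisites are present on a node.
# The tool names "python" and "R" are assumptions; adjust them to your nodes.

# have_cmd NAME -> exit 0 if NAME is found on the PATH
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# report_node_prereqs -> prints one "name: found" or "name: missing"
# line each for python, R, and the numpy package.
report_node_prereqs() {
  local tool
  for tool in python R; do
    if have_cmd "$tool"; then
      echo "$tool: found"
    else
      echo "$tool: missing"
    fi
  done
  # numpy is only importable if a Python interpreter is present.
  if have_cmd python && python -c 'import numpy' 2>/dev/null; then
    echo "numpy: found"
  else
    echo "numpy: missing"
  fi
}
```

Running `report_node_prereqs` on each node gives a three-line summary. `python --version` and `R --version` can then be used to check the versions on nodes where the tools are found.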
Consider the following differences between using Hive and Impala as the query engine for RapidMiner Radoop.
The following list contains the features unsupported by the Impala 1.2.3 release.
- Sort operator: Impala does not support the ORDER BY clause without a LIMIT specified (or, since Impala version 1.4.0, only with certain restrictions that Radoop does not comply with). You may use the Hive Script operator to perform a sort by using an explicit LIMIT clause as well.
- Add Noise operator: Add Noise is not supported on Impala.
- Nominal to Numerical operator: The Unique integers method of Nominal to Numerical is not supported on Impala.
- Pivot Table operator: Pivot Table is not supported on Impala.
- Apply Model operator: Model application with Impala is not supported.
- Update Model and Naive Bayes operators: On Impala, RapidMiner Radoop does not support Naive Bayes learning or model updating by operator.
- Correlation Matrix, Covariance Matrix, and Principal Component Analysis operators: The CORR() function is not supported by Impala.
- Performance operators: The Performance (Regression) operator is not supported on Impala. For the Performance (Classification) operator, only the following criteria are supported on Impala: Accuracy, Classification Error, and Kappa.
- Aggregation functions: Some aggregation functions are not supported by Impala. This may affect the Generate Attributes, Normalize, and Aggregate operators. For these limitations, RapidMiner Radoop provides design-time errors, even though Impala allows you to run them.
- No advanced Hive settings: You cannot set advanced Hive parameters for an Impala connection.
Hadoop cluster considerations
Although RapidMiner Radoop easily connects to all supported platforms, you may require special settings if you encounter a problem when trying to use it with one of the listed distributions. Details can be found in the Distribution Specific Notes section. This section lists a few considerations that you should be aware of when choosing an HDFS or data warehousing platform:
Cloudera Impala is an open-source query engine over Apache Hadoop. It provides a low-latency interface to data stored in the HDFS for SQL queries, making RapidMiner Radoop usage closer to the experience of using it in a single host environment. While Cloudera Impala can provide much faster response time than Hive, it does not support all the features of HiveQL.
Evaluate the Impala limitations to determine whether it is an acceptable alternative for your organization. For example, if you need advanced features (like model scoring), you must use Hive. If you use both Hive and Impala, consult the Impala Documentation for information on sharing metadata between the two frameworks. If using both, metadata used in Impala must be reloaded to reflect any metadata changes (such as creating new tables) made in Hive. (This can be done by enabling the reload impala metadata parameter of the Radoop Nest.)
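Outside of RapidMiner Radoop, the same kind of metadata reload can be triggered by hand with Impala's `INVALIDATE METADATA` statement. The following is a minimal sketch, not a description of what the Radoop Nest parameter does internally. The function names and the `DRY_RUN` switch are made up for illustration, the host and table names are placeholders, and it assumes that `impala-shell` is installed on the machine where it runs.

```shell
#!/usr/bin/env bash
# Sketch: make Impala pick up tables that were created or changed in Hive.
# Function names and DRY_RUN are illustrative, not part of Impala or Radoop.

# invalidate_stmt [TABLE] -> prints the statement, for one table if a
# name is given, otherwise for all tables.
invalidate_stmt() {
  if [ -n "$1" ]; then
    echo "INVALIDATE METADATA $1;"
  else
    echo "INVALIDATE METADATA;"
  fi
}

# reload_impala_metadata HOST [TABLE]
# Runs the statement through impala-shell (-i: impalad host, -q: query).
# With DRY_RUN=1 the command is printed instead of executed.
reload_impala_metadata() {
  local host="$1" stmt
  stmt=$(invalidate_stmt "$2")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "impala-shell -i $host -q \"$stmt\""
  else
    impala-shell -i "$host" -q "$stmt"
  fi
}
```

For example, `DRY_RUN=1 reload_impala_metadata impalad.example.com sales.orders` prints the command that would be run for a single (placeholder) table. Limiting the statement to one table is usually cheaper than invalidating all metadata.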
Installing RapidMiner Radoop on RapidMiner Studio
The RapidMiner Radoop client installation is straightforward, assuming the prerequisites are met and the appropriate ports are available. The extension can be easily installed from the Marketplace.
If you want to install the extension manually, follow the steps below.
In Step 3, you will move the files to one of the following locations. There are two options for the installation; choose one.
- To enable the plugin for all users on a machine (global install), move the files into the install folder at lib/plugins.
- For RapidMiner Studio versions 6.4 and later, to enable the plugin only for a single user, move the files to .RapidMiner/extensions/ in the user home folder. If the extensions folder does not exist, create it.
For Mac users running RapidMiner Studio versions 6.4 and later, move the files into .RapidMiner/extensions/. If the extensions folder does not exist, create it. Note that RapidMiner Studio creates .RapidMiner as a hidden folder, so you must set your Mac to display hidden files and folders if you cannot see it. For Mac users running RapidMiner Studio versions prior to 6.4, move the files into the install folder at lib/plugins.
The process is as follows:
1. If necessary, quit RapidMiner Studio.
2. Download the RapidMiner Radoop plugin, a JAR file, from the location specified in your confirmation email.
3. Move the downloaded RapidMiner Radoop JAR file (rapidminer-Radoop-onsite-….jar) to the RapidMiner Studio directory on the host system.
4. With the JAR files moved, start RapidMiner.
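For the single-user option, the move step can be sketched in shell as follows. This is an illustration only: the function name `install_radoop_jar` is made up, the JAR path is whatever file you downloaded, and the optional second argument (a home folder other than `$HOME`) exists only to make the sketch easy to try out in a scratch directory.

```shell
#!/usr/bin/env bash
# Sketch: single-user install of a downloaded extension JAR.
# install_radoop_jar is a hypothetical helper, not a RapidMiner command.

# install_radoop_jar JAR [HOME_DIR]
# Moves JAR into HOME_DIR/.RapidMiner/extensions (HOME_DIR defaults to
# $HOME) and prints the new location of the file.
install_radoop_jar() {
  local jar="$1" home_dir="${2:-$HOME}"
  local target="$home_dir/.RapidMiner/extensions"
  if [ ! -f "$jar" ]; then
    echo "JAR not found: $jar" >&2
    return 1
  fi
  mkdir -p "$target"   # create the extensions folder if it does not exist
  mv "$jar" "$target/" # move the downloaded JAR into place
  echo "$target/$(basename "$jar")"
}
```

Quit RapidMiner Studio before running it and start Studio again afterwards, as in the manual steps. For a global install, the target would instead be the lib/plugins folder of the install directory.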
If the extension has been successfully installed, Hadoop Data appears in the middle, as a new view, in the RapidMiner Studio startup window.
That's it. Now that RapidMiner Radoop is installed, see the section on configuring connections to complete the installation.
Considering security
Consider the following security measures to secure your HDFS and data warehouse infrastructure:
- Apply the firewall settings for your data warehouse system (optional but recommended).
- Use Kerberos or Apache Sentry for securing your cluster. See the Hadoop security section for security configuration suggestions.