Categories

Versions

You are viewing the RapidMiner Radoop documentation for version 9.9 -Check here for latest version

RapidMiner Radoop Basics

After installing the RapidMiner Radoop extension, you can see theHadoop Dataview:

TheHadoop Dataview is a powerful tool for managing your data on the Hadoop cluster. Later in these pages, it isdiscussed in detail.

This section describes the main concepts required to design processes that run on your Hadoop cluster. Since Apache Hive is part of the solution, these explanations often refer to basic Hive concepts like tables or columns. If you are unfamiliar with Hive, think of it as a (distributed) relational database or data warehouse infrastructure over Hadoop. It understands SQL and translates input to distributed jobs.

Understanding the Radoop Nest operator

Radoop Nestis the most important building block in RapidMiner Radoop. Every process must contain at least oneRadoop Nest(meta) operator; it specifies the connection to the Hadoop cluster.

Any subprocess of Radoop operators putinsidetheRadoop Nestdescribes the process that runs on that Hadoop cluster; all contentoutsidethe nest (processed by RapidMiner Studio operators) is processed in memory. When you design a process, you will not notice any difference between these two sets of operators in their basic structure — inputs, outputs, and parameters. Most Radoop operators have a RapidMiner Studio counterpart. The main difference between the two is that the Radoop operators have a minimal memory footprint and always keep and process data on the cluster.

Radoop Nest parameters

You define cluster connection details in theconnectionparameter of theRadoop Nestoperator. If you change this parameter, the client immediately tests the connection (indicated by a progress bar in the lower right corner) and raises a warning if the connection test fails.

Additional parameters of theRadoop Nestoperator:

Parameter Description
table prefix During a process run, Radoop creates temporary objects in Hive. The object names start with the prefix defined with this parameter. You can allow user-specified prefixes during process design or set a default value for this parameter with thetable.prefixglobal property.
change sample size Overrides the output sample size for this subprocess. By deafult this parameter is set to false which means that thesample_size.overallpropertydetermines the sample size.
sample size 样本大小for Hadoop data sets on the Nest output; zero uses the full sample. This option is only available ifchange sample sizeis checked.
hive file format / impala file format Defines the storage format for Hive tables to use inside the nest. This setting applies to both temporary and permanent Hive tables, although you can override the storage format of permanent tables with theStore in HiveandRead CSVoperators. By default, the storage format is not specified and Radoop uses the Hive server or Impala default settings (usually TEXTFILE format). You can explicitly define TEXTFILE as the format, which has the advantage of providing human-readable data to the distributed file system. Change this parameter if you want to use more advanced file formats for optimized performance (smaller size and faster processing).
Impala connections only: reload impala metadata (advanced parameter) Calls invalidate metadata statement on the selected tables or the whole database if tables are not specified. This reloads the metadata in Impala from the Hive metastore so you can use all Hive tables and views in your process.
Impala connections only: tables to reload (advanced parameter) Call invalidate metadata on certain tables or the whole database if tables are not specified. Consider setting this parameter if your database contains a large number of tables.
cleaning (advanced parameter) Defines whether Radoop should delete temporary objects after a process finishes. The default behavior (true/checked) deletes the objects, which is highly recommended since your cluster could soon fill with temporary objects. You may want to uncheck this for a short period, for example to debug a process. You can轻松地删除这些对象.
auto convert (advanced parameter) If true/checked (the default), data sets consumed atRadoop Nestinput ports (and stored in operative memory) are immediately pushed to the cluster. Data is written to a temporary Hive table and is ready to serve as operator input inside the nest. If set to false/unchecked, only operators inside the nest write data to the cluster (usually when they consume it on their input ports). In this case you can also operate on in-memory data sets inside the nest, but you rarely need to do this.

Radoop Nest input

Radoop allows you to combine memory-based and cluster-based operators in the same process. On its input port,Radoop Nestimports data from the client's operative memory to the cluster. The operators inside the nest consume and produceHadoopExampleSetobjects (the cluster-based variant of the standard ExampleSet object). The HadoopExampleSet stores the data in Hive, in a temporary or permanent table or view, and has a minimal footprint on the client's operative memory. You can, of course, also process data that already resides on the cluster or import data directly to the cluster so that you do not have to use any of the Radoop Nest input ports. Instead, just access or import the data in a subprocess. The following is an example of a subprocess insideRadoop Nest.

Radoop Nest output

Radoop Nestcan have any number of output ports to deliver the memory-based ExampleSet objects directly to a process output port or to the input port of the next RapidMiner operator outside of the nest. You can connect an operator insideRadoop Nestto the nest's output port. Radoop fetches the data or a data sample from the HadoopExampleSet output to the client's operative memory and it then propagates further as a memory-based ExampleSet on the process flow. Because the data sample must fit into operative memory, you may want to work on aggregated data (after the aggregation took place on the cluster). You can limit the number of rows for the fetched data sample using thesample_size.overallpropertyor thesample sizeparameter of the nest.

Running a process

The process can be started using theRun arrowRunbutton in the main toolbar. Status icons in the bottom left corner on the operators and progress indicators can provide assistance regarding the process execution. Note, however, that there can be more operators with active progress indicator simultaneously. This is because Radoop operators usually create only Hive views, and postpone calculations. Because of the intensive computation, distributed jobs (e.g., MapReduce jobs) only happen when the HadoopExampleSet is materialized (a Hive table is generated for it and the (sometimes temporary) data is written to the HDFS). This is only done when necessary or when the optimizer algorithm decides to do so.

RapidMiner-to-Hive data type conversion

The termsattributeandcolumnare interchangeable in this document, since an attribute in RapidMiner can be a column of a Hive table or view on the cluster, and vice versa. The following two tables match RapidMiner and Hive data types.

The first table shows the conversion that takes place during an ExampleSet import; the second table shows the conversion that takes place when the data is fetched from the cluster to operative memory. Note that table and attribute names may change slightly inside aRadoop Nest——标识符会自动转换为低case, special characters are replaced by underscores, and, to avoid collision with certain reserved words in Hive, an underscore suffix is appended to some terms. For example, an attribute with the name "Column" in RapidMiner becomes "column_" inside theRadoop Nest(because "COLUMN" is a keyword in the Hive Query Language). You can easily track these changes at design time by checking the metadata propagation.

Conversion to Hive data type

RapidMiner data type Hive data type
integer bigint
real double
numerical double
binominal string or boolean
polynominal string
nominal string
date string
other string

Conversion to RapidMiner data type

Hive data type RapidMiner data type
tinyint integer
smallint integer
int integer
bigint integer
decimal real
float real
double real
boolean binominal
string nominal
other nominal

RapidMiner maintains anominal mappingfor nominal attributes. This internal data structure maps nominal (string) values to double values for efficient memory usage. Since the process inside the nest runs on the cluster and must have a minimal operative memory footprint, Radoop does not maintain this structure (although it does forbinominal attributes). However, whenRadoop Nestdelivers a data set on its output port fetched from the cluster, nominal mapping can be rebuilt by subsequent operators of the process. To do this, pay attention to notes about nominal attributes usage in the core operator help text. The nominal mapping for binominal attributes also indicates which string value is considered as the positive value out of the two possible string values.

RapidMiner Radoop operators

In addition toRadoop Nest, which is the container for the subprocess on the Hadoop cluster, there are a many RapidMiner Radoop operators that can run inside theRadoop Nest. These operators are categorized into the following groups:

  • Data Access
  • Blending
  • 清洗
  • Modeling
  • Scoring
  • Validation
  • Utility

These groups are described in more detail in thesection describing operators.

Breakpoints

During process design, you may want to examine intermediate results. To do so, you can define breakpoints before or after any operator and examine the objects on the input or output ports. For operators inside theRadoop Nest, since they store data on the cluster, you can fetch a data sample to operative memory for each HadoopExampleSet object and examine it in the same way as for any other ExampleSet object. Metadata and data shown in theChartsandAdvanced Chartspanels provide highly configurable tools to visualize the data. You can control the maximum size of these data samples by setting the样本大小breakpointproperty.

Note that using a breakpoint results in data materializing on the cluster at the point where the breakpoint pauses the process. This means that the computation may take place earlier than without the breakpoint and may use more space on the HDFS. Similarly, total run time with a breakpoint may be longer than the time required for the process to complete without breakpoints. Consider this as standard debugging overhead.