You are viewing the RapidMiner Radoop documentation for version 9.1.

Configuring RapidMiner Radoop Connections in RapidMiner Studio

You can configure connections between RapidMiner Radoop in RapidMiner Studio and one or more Hadoop clusters from the Manage Radoop Connections and Connection Settings dialogs. You can access these dialogs from the Connections menu, the Hadoop Data view, or the Design view. After configuring and saving the connection entries, you can test them before deployment. The test validates the connection to the cluster and verifies that the connection settings comply with the RapidMiner Radoop requirements described in the section on prerequisites.

There are three methods to create a Radoop connection. We strongly recommend the first.

  1. If you have access to cluster manager software (Apache Ambari or Cloudera Manager), use the Import from Cluster Manager option. This is the simplest method.

  2. If you do not use or do not have access to a cluster manager, but can ask for the client configuration files, use the Import Hadoop Configuration Files option.

  3. Otherwise, you always have the option to Add Connection Manually. This last option also allows you to import a Radoop connection that someone shared with you by clicking the Edit XML... button once the Connection Settings dialog appears.

Note: When configuring RapidMiner Radoop, you must provide the internal domain name or IP address of the master node (that is, the domain name or IP address that the master node knows about itself). See the networking setup overview for details on how to ensure that your data is safe from unauthorized access.

Basic RapidMiner Radoop connection configuration

Once RapidMiner Radoop is installed, you can create a connection.

  1. Restart RapidMiner Studio so that it recognizes the RapidMiner Radoop extension. Once it restarts, you will see a new Manage Radoop Connections option in the Connections menu.

  2. Select the Manage Radoop Connections menu item to open the Manage Radoop Connections window.

    For more details on this dialog, see the Manage Radoop Connections section below.

  3. Click the New Connection button and choose Add Connection Manually.

    If you wish to create the connection by importing client configuration files or by using a cluster management service, read the Importing Hadoop configuration section.

  4. You can edit the connection properties in the Connection Settings dialog.

    You can provide the name of the connection at the top of the dialog. Additional settings can be configured by selecting the appropriate tab on the left side. Complete the required connection fields listed below. Note that DNS and reverse DNS should work for all specified addresses, so the client machine must have access to the network name resolution system of the cluster or be able to resolve the addresses locally.

    Tab | Field | Description
    Global | Hadoop Version | The distribution that defines the Hadoop version type for this connection.
    Hadoop | NameNode Address | Address (usually hostname) of the node running the NameNode service.
    Hadoop | Resource Manager Address | Address (usually hostname) of the node running the Resource Manager service.
    Hive | Hive Server Address | Address (usually hostname) of the node running the Hive Server or the Impala Server.
    Spark | Spark Version | Spark version available on the cluster.
    Spark | Spark Assembly Jar Location / Spark Archive (or libs) path | The HDFS location or local path (on all cluster nodes) of the Spark Assembly Jar file / Spark Jar files.

    For further details, see the Advanced connection settings section below.

    Click OK to create the connection entry.

  5. Click Save to add the entry to the available connections.

  6. Test the connection between RapidMiner Radoop and the Hadoop cluster. If necessary, and with the assistance of your Hadoop administrator, set the advanced settings based on the distribution specific notes.

Your connection settings are saved in a file called radoop_connections.xml in your .RapidMiner directory.
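
Step 4 above requires that DNS and reverse DNS work for every address you enter. The following is a minimal sketch of how you might pre-check this from the client machine with Python's standard socket module. It is not part of RapidMiner Radoop, and the function name check_name_resolution is only illustrative.

```python
import socket


def check_name_resolution(address):
    """Return (forward_ok, reverse_ok) for a hostname or IP address."""
    try:
        ip = socket.gethostbyname(address)  # forward lookup: name -> IP
    except OSError:
        return (False, False)
    try:
        socket.gethostbyaddr(ip)  # reverse lookup: IP -> name
    except OSError:
        return (True, False)
    return (True, True)


# Replace "localhost" with the NameNode, Resource Manager and Hive Server
# addresses entered in the Connection Settings dialog.
print(check_name_resolution("localhost"))
```

If either value is False for a cluster address, the client machine most likely cannot use the cluster's name resolution system, and the address needs to be resolvable locally (for example through the hosts file).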

Importing a connection

Configuring a connection manually can be cumbersome for a more complicated cluster. In this case, using one of the connection import features is recommended. There are two options: you can create a connection using the cluster's client configuration files, or by providing the URL and credentials for the cluster's management service (Cloudera Manager or Ambari).

You can create a Radoop connection by setting up its parameters from client configuration files. To do so, choose the Import Hadoop Configuration Files option when adding a new connection, then set the location of the files in the dialog that opens.

You can select one or more folder(s) or compressed file(s) (such as zip or tar.gz) containing the configuration XML documents, or you can simply import single xml files. You can easily get hold of these files by using the distributor's Hadoop management tool. Click Import Configuration and wait until a popup window shows the result of the import process:

  • Success: You can go on with the next step.
  • Warning: Some fields will be missing; these can be provided in the next step. The Show details button informs you about the problem(s).
  • Failure: You need to go Back and choose the appropriate file(s).

Clicking Next will lead you to the Connection Settings dialog, where you will find all the properties that could be imported automatically. Some required fields might still be missing. The editor highlights them with a red border and an error message. If a tab contains fields with missing values, it is marked with an error sign.

You can also create a connection by providing the URL and the credentials for the cluster's management service. In this case, select the Import from Cluster Manager option when adding a new connection to open the import dialog.

The following fields need to be filled in:

  • Cluster Manager URL: The URL of the cluster’s management service. For HDP-like connections (HDP, HDInsight, IOP, IBM, etc.) this is usually Apache Ambari, which usually runs by default on port 8080 (except for HDInsight, where usually no port has to be provided). For CDH connections, this is Cloudera Manager, running by default on port 7180. Please take care of the protocol prefix (usually http, https). If the protocol is missing, “http://” will be used automatically.
  • Username: The username for the cluster manager. Please note that the user needs to have privileges for the client configuration. Read-only permissions are sufficient for retrieving most connection properties. Using an admin user is not required, but makes it possible to retrieve further settings that may have to be provided manually otherwise.
  • Password: The password for the provided cluster manager user.
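
The protocol rule described for the Cluster Manager URL can be sketched as follows. This is only an illustration: normalize_manager_url is a hypothetical helper, and filling in the default port (8080 for Ambari, 7180 for Cloudera Manager, as listed above) is a convenience assumed here, not something the import dialog is documented to do.

```python
from urllib.parse import urlsplit

# Default ports as stated in the Cluster Manager URL description.
DEFAULT_PORTS = {"ambari": 8080, "cloudera": 7180}


def normalize_manager_url(url, manager=None):
    """Prepend http:// when no protocol is given; optionally add a default port."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url  # same fallback as described for the dialog
    parts = urlsplit(url)
    # Assumption: only add a port when the caller names the manager type.
    if manager in DEFAULT_PORTS and parts.port is None:
        url = f"{parts.scheme}://{parts.hostname}:{DEFAULT_PORTS[manager]}{parts.path}"
    return url


print(normalize_manager_url("cm.example.com:7180"))
print(normalize_manager_url("ambari.example.com", manager="ambari"))
```

For HDInsight, where usually no port has to be provided, leave the manager argument unset so that the URL is passed through with only the protocol added.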

After filling in the fields, click Import Configuration to start the import process. If the cluster manager manages more than one cluster, an input dialog will pop up. Select the name of the cluster you want to connect to.

The connection import can have two outcomes:

  • Success: You can go on with the next step.
  • Failure: You need to go Back and fix the URL or the credentials. The detailed error can be seen if you click the Show Details button.
    • If the Failure is due to an untrusted certificate, you are notified, shown the certificate details, and given the option to trust the certificate and continue importing from the Cluster Manager.

When the connection is successfully imported, the Connection Settings dialog will pop up. Here you can change the name of the connection and complete the connection configuration manually.

  • A required field with missing values is highlighted with a red border and an error message. A tab containing fields with missing values is marked with an error sign.
  • A field whose default value may need to be changed is highlighted with an orange border, and the tab is marked with a warning sign. Please note that the Hadoop version is automatically set to HDP if you used Apache Ambari as the cluster manager. In case of IBM and ODP distributions, for example, the Hadoop version needs to be changed manually.

During Radoop connection creation from an import, duplicate properties may be detected.
If a duplicate property is detected, the conflict with the previous property value is resolved as follows.
Resolution is based on the property origin, in the following descending order of priority:
- yarn-site origin
- core-site origin
- other origins, in the order in which they are processed

Within the same origin, priority follows the sequential read order.

If a property value is replaced, an INFO level log entry states the key, the previous value and origin, and the value and origin now being applied.
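
The priority rules above can be sketched in a few lines of Python. This is an interpretation, not Radoop's actual implementation: it assumes that among the "other" origins the one processed first has the higher priority, and that within the same origin a later read replaces an earlier one. The function name merge_properties and the log format are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("deconflict")

FIXED_RANK = {"yarn-site": 0, "core-site": 1}  # lower rank = higher priority


def merge_properties(entries):
    """Merge (origin, key, value) triples read in order into {key: (value, origin)}."""
    other_rank = {}  # rank of the remaining origins, by first appearance
    merged = {}

    def rank(origin):
        if origin in FIXED_RANK:
            return FIXED_RANK[origin]
        return other_rank.setdefault(origin, len(FIXED_RANK) + len(other_rank))

    for origin, key, value in entries:
        new_rank = rank(origin)
        if key in merged:
            old_value, old_origin = merged[key]
            if new_rank > rank(old_origin):
                continue  # existing value comes from a higher-priority origin
            log.info("%s: %s/%s replaced by %s/%s",
                     key, old_value, old_origin, value, origin)
        merged[key] = (value, origin)
    return merged
```

For example, if the same key is read from hdfs-site, then core-site, then yarn-site, the yarn-site value is the one that remains, and two INFO lines record the replacements.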

Manage Radoop Connections window

The Manage Radoop Connections window shows your already configured connections and allows you to edit them, or create and test new connections.

This window consists of 3 panels. The upper left panel lists all known connection entries. For each entry, one or more icons may be present showing some additional information, namely:

  • Spark icon: Spark is configured for this connection
  • Impala icon: The connection uses Impala as query engine
  • Lock icon: Connection to a secure cluster

The basic properties of the currently selected connection are gathered on the right-hand side panel. There are also buttons executing several actions available on the selected connection:

  • Configure...: Opens the Connection Settings dialog where you can configure the connection properties. Check the Advanced connection settings section for more details.
  • Save: Saves the currently displayed connection.
  • Save As...: Saves a copy of the currently displayed connection. Useful for saving a slightly modified connection while keeping the original entry.
  • Quick Test: Runs a Quick Test on the currently displayed connection.
  • Full Test...: Runs a Full Integration Test on this connection. More information on the connection tests can be found in the Testing RapidMiner Radoop connections section.
  • Rename: Renames the current connection. Please note that all connection names should be unique.

The lower panel shows logs of the running tests. Several actions can be performed on this panel too:

  • Extract logs...: This action creates a bundled zip file containing all relevant logs of your recent Radoop-related activities. See the related section for more details.
  • Clear logs: Clears the connection log field.
  • Stop Test: Halts the currently running test execution (see the Testing RapidMiner Radoop connections section).

Testing RapidMiner Radoop cluster connections

RapidMiner Radoop's built-in test functions help with troubleshooting before trouble begins.

Basic connection test

Click the Quick Test button in the Manage Radoop Connections window to test the connection to the cluster. Through a series of simple tests to different components (APIs) on the cluster, the test verifies that the cluster is running and that the RapidMiner Radoop client can access it. You can stop the test anytime by clicking the Stop Test button.

Full connection tests

Once your test succeeds, run a complete test (which may take several minutes) by clicking the Full Test... button. You can customize a full connection test by clicking the Customize... button. In this panel you can enable or disable tests, change the timeout, and enable or disable the cleanup after the tests. These values are reset to the defaults after closing the Manage Radoop Connections window. Click Run to start the test.

The full test starts several jobs and applications on the cluster and then checks the results. By successfully and extensively exercising RapidMiner Radoop interactions with your cluster, you can feel confident in your RapidMiner Radoop process design and execution.

In addition to testing connections when you first create a RapidMiner Radoop configuration, you can use the Full Test if you have an error in process execution or a change in the cluster. The output of the full test results can help identify the root cause of the problem for easier troubleshooting. You can stop the Full Test anytime by clicking the Stop Test button. Stopping the current test process may take some time.

Note: The initial cluster connection test also starts automatically in the background when you open a process containing a RapidMiner Radoop Nest operator (indicated by the status bar in the bottom right corner of the RapidMiner Studio screen).

Advanced connection settings

You can use the Connection Settings dialog to edit the connection parameters. For example, you can change port numbers or define arbitrary parameters for Hadoop and Hive using key-value pairs. Do not modify the connection settings without first consulting with your organization's IT administrator. To open the Connection Settings dialog, click the Configure... button in the Manage Radoop Connections window.

Note: The fields displayed depend on the selections (for example, the selected Hadoop version). Also, some fields prepopulate based on the Hadoop version selection from the basic settings. If a field is bold in the window, it is required.

The Connection Settings dialog has multiple tabs. The following tables describe the fields in each tab. For advanced configuration details related to your environment, see the distribution specific notes.

Global

Field | Description
Hadoop Version | The distribution that defines the Hadoop version type for this connection.
Additional Libraries Directory | Any additional libraries (JAR files) on the client needed to connect to the cluster (optional, for expert users only).
Enable Kerberos | Check this box to connect to a Hadoop cluster secured by Kerberos.
Client Principal | Only with Kerberos security enabled and Server impersonation disabled. Principal of the user accessing Hadoop. The format is primary[/instance]@REALM, where primary is usually the user name, instance is optional, and REALM is the Kerberos realm. Example: user/client.www.turtlecreekpls.com@RAPIDMINER.COM.
Use password instead of keytab file | Only with Kerberos security enabled. Check this box to authenticate with a password instead of a keytab file.
KeyTab File | Path of the user keytab file on the client machine. Enter or browse to the file location.
Password | Only with Kerberos security enabled and the "Use password instead of keytab file" option checked. The Kerberos password that can be used to connect to the secure cluster. RapidMiner Radoop uses the cipher.key file to encrypt the password in radoop_connections.xml.
KDC Address | Only with Kerberos security enabled. Address of the Kerberos Key Distribution Center. Example: kdc.www.turtlecreekpls.com.
REALM | Only with Kerberos security enabled. The Kerberos realm. It is usually the domain name in upper-case letters. Example: RAPIDMINER.COM.
Kerberos Config File | Only with Kerberos security enabled. To avoid configuration differences between the machine running RapidMiner and the Hadoop cluster, it is good practice to provide the Kerberos configuration file (usually krb5.conf or krb5.ini). Obtain this file from your security administrator. Enter or browse to the file location.
Enable MapR security | Only with some MapR Hadoop version selected. Check this box to connect to a Hadoop cluster secured by MapR Security.
MapR cluster | Only with some MapR Hadoop version selected. MapR cluster to connect to. All MapR connections must be configured in the MapR client that MapR Home points to.
Hadoop Username | The name of the Hadoop user. In most cases, the user must have appropriate permissions on the cluster. For a new connection, the default is the OS user.
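
The principal fields in this dialog all follow the same shape: a primary, an optional instance, and a realm. As a rough illustration, the following sketch splits such a string into its parts so you can spot a missing realm or a stray slash before saving the connection. The regular expression and the name parse_principal are assumptions made for this example and are not taken from RapidMiner Radoop.

```python
import re

# primary, optional "/instance", then "@REALM"
PRINCIPAL_RE = re.compile(
    r"^(?P<primary>[^/@]+)(?:/(?P<instance>[^/@]+))?@(?P<realm>[^/@]+)$"
)


def parse_principal(principal):
    """Split a Kerberos principal into primary, instance (may be None) and realm."""
    match = PRINCIPAL_RE.match(principal)
    if match is None:
        raise ValueError(f"not a valid principal: {principal!r}")
    return match.groupdict()


print(parse_principal("nn/_HOST@RAPIDMINER.COM"))
print(parse_principal("hive@RAPIDMINER.COM"))
```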

Hadoop

Field | Description
NameNode Address | Address (usually hostname) of the node running the NameNode service. (Requires a working network name resolution system.)
NameNode Port | Port of the NameNode service.
Resource Manager Address | Address (usually hostname) of the node running the Resource Manager service.
Resource Manager Port | Port of the Resource Manager service.
JobHistory Server Address | Address (usually hostname) of the node running the Job History Server service.
JobHistory Server Port | Port of the Job History Server service.
Retrieve Service Principals from Hive | Only with Kerberos security enabled. If checked, RapidMiner Radoop automatically retrieves all other service principals from Hive for easier configuration. Disable this setting only if there is a problem accessing other services.
NameNode Principal | Only with Kerberos security enabled and Hive principal retrieval disabled. Principal of the NameNode service. You can use the _HOST keyword as the instance. Example: nn/_HOST@RAPIDMINER.COM
Resource Manager Principal | Only with Kerberos security enabled and Hive principal retrieval disabled. Principal of the ResourceManager service. You can use the _HOST keyword as the instance. Example: rm/_HOST@RAPIDMINER.COM
JobHistory Server Principal | Only with Kerberos security enabled and Hive principal retrieval disabled. Principal of the JobHistoryServer service. You can use the _HOST keyword as the instance. Example: jhs/_HOST@RAPIDMINER.COM
Advanced Hadoop Parameters | Key-value properties to customize the Hadoop connection and Radoop's Yarn/MapReduce jobs. Some connections require certain advanced parameters. For detailed information, see the distribution specific notes.

Spark

Field | Description
Spark Version | Spark version available on the cluster. For more information on using Spark operators, see the Configuring Spark section.
Assembly Jar Location / Spark Archive (or libs) path | The HDFS location or local path (on all cluster nodes) of the Spark Assembly Jar file / Spark Jar files.
Spark Resource Allocation Policy | The resource allocation policy for Spark jobs. The default (Dynamic Resource Allocation starting from 8.1.1, Static, Heuristic Configuration in 8.1.0) is typically applicable. See more Spark policy information.
Resource Allocation % | Percentage of cluster resources allocated for a Spark job. This field is only enabled when Static, Heuristic Configuration is the Spark resource allocation policy.
Use custom PySpark archive | Check this box if you want to provide your own PySpark archives.
Custom PySpark archive paths | Only when the Use custom PySpark archive option is enabled. Set of archives used as PySpark libraries for PySpark job submissions. See the instructions for configuring custom PySpark/SparkR archives.
Use custom SparkR archive | Check this box if you want to provide your own SparkR archive.
Custom SparkR archive path | Only when the Use custom SparkR archive option is enabled. Archive used as the SparkR library for SparkR job submissions. See the instructions for configuring custom PySpark/SparkR archives.
Advanced Spark Parameters | Key-value properties that customize RapidMiner Radoop's Spark jobs. See the instructions for configuring Spark.

Hive

Field | Description
Hive Version | Select the appropriate Data Warehouse System: HiveServer2 (Hive 0.13 or newer) or Impala. Alternatively, you can select Custom HiveServer2 and provide your own Hive jars.
Custom Hive Lib Directory | Only with Custom HiveServer2 selected. Select a directory that contains the libraries (JAR files) needed to connect to the cluster.
Hive High Availability | Check this box if Hive High Availability is activated for this cluster (provided that HiveServer access is coordinated by ZooKeeper).
Hive Server Address/Impala Server Address | Address (usually hostname) of the node running the Hive Server or the Impala Server.
Hive Port/Impala Port | Port of the Hive Server or Impala Server.
Database name | Name of the database to connect to.
JDBC URL Postfix | Optional postfix for the JDBC URL. The default is "auth=noSasl" for Impala connections.
Username | Username for connecting to the specified database. The default is "hive" for all HiveServer2 version connections. This user should have access to the HDFS directory that Radoop uses for storing files temporarily. If this directory is located in an encryption zone, the user should also have permissions to access the encryption zone key.
Password | Password for connecting to the specified database. RapidMiner Radoop uses the cipher.key file to encrypt the password in radoop_connections.xml.
UDFs are installed manually | Check this box if the Radoop UDFs are installed on the cluster manually. More information on the manual UDF installation can be found on the Operation and Maintenance page.
Use custom database for UDFs | Check this box if a custom database should be used for storing and accessing Radoop UDFs. This is useful when multiple users (having different project databases and granted privileges) wish to use Radoop. This common database should be accessible by all of them. The UDFs can still be automatically or manually created.
Custom database for UDFs | Only when "Use custom database for UDFs" is checked. Define the database dedicated for storing Radoop UDFs (see above). The database must exist.
Hive on Spark / Tez container reuse | Check this box if you would like to benefit from Hive on Spark / Hive on Tez container reuse.
Hive Principal | Only with Kerberos security enabled. Principal of the Hive service. The format is primary[/instance]@REALM, where primary is usually the user name, instance is optional, and REALM is the Kerberos realm. Do not use the _HOST keyword as the instance. If Hive is not configured for Kerberos but uses another authentication mechanism (e.g., LDAP), leave this field empty. Example: hive/node02.www.turtlecreekpls.com@RAPIDMINER.COM.
SASL QoP Level | Level of SASL Quality of Protection. This setting must be the same as the cluster setting. (To find the cluster setting, look up the value of hive.server2.thrift.sasl.qop in hive-site.xml; the default is "auth".)
Advanced Hive Parameters | Key-value properties to customize the behavior of Hive.
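
To make the role of the JDBC URL Postfix field more concrete, the sketch below assembles a HiveServer2-style JDBC URL from the address, port, database name and postfix. The jdbc:hive2:// URL shape is the standard HiveServer2 form; how RapidMiner Radoop itself joins the postfix is not stated here, so the ";" separator, the host name, the port and the function name build_hive_jdbc_url are assumptions for illustration only.

```python
def build_hive_jdbc_url(host, port, database="default", postfix=""):
    """Assemble a HiveServer2-style JDBC URL with an optional ';'-separated postfix."""
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if postfix:
        # Assumption: the postfix is appended as a session variable list.
        url += ";" + postfix.lstrip(";")
    return url


# Hypothetical Impala host and port, with the default Impala postfix.
print(build_hive_jdbc_url("impala.example.com", 21050, "default", "auth=noSasl"))
```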

RapidMiner Server

This tab contains some multi-user configuration settings that affect the execution on RapidMiner Server. For more information and best practice solutions, see the relevant sections of the Installing Radoop on Server page.

Field | Description
Enable impersonation on Server | Check this box if you want to use an impersonated (proxy) Hadoop user on RapidMiner Server.
Server Principal | Only with Kerberos security and Server impersonation enabled. Principal used by RapidMiner Server to access the cluster. The format is primary[/instance]@REALM, where primary is usually the user name, instance is optional, and REALM is the Kerberos realm. Example: user/server.www.turtlecreekpls.com@RAPIDMINER.COM. Please note that this setting only affects the execution on Server.
Server Keytab File | Only with Kerberos security and Server impersonation enabled. Path of the server keytab file on the server machine.
Impersonated user for local testing | Only when impersonation on Server is enabled. Server user to impersonate for testing Server connections locally from Studio.
Access Whitelist | Regex for Server users who have access to this connection. Leave it empty or use '*' to enable access to all users.
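
The Access Whitelist semantics can be pictured with a short sketch. It assumes that the regex must match the whole user name and uses Python's regex syntax, which may differ in details from what RapidMiner Server applies; user_has_access is a hypothetical name.

```python
import re


def user_has_access(username, whitelist):
    """Return True if the user matches the whitelist; empty or '*' allows everyone."""
    if whitelist is None or whitelist.strip() in ("", "*"):
        return True  # documented special cases: access for all users
    # Assumption: the pattern has to match the complete user name.
    return re.fullmatch(whitelist, username) is not None


print(user_has_access("analyst_bob", "analyst_.*"))
print(user_has_access("guest", "analyst_.*"))
```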

Radoop Proxy

Field | Description
Use Radoop Proxy | Check this box if you want to access the Hadoop Cluster through a Radoop Proxy.
Radoop Proxy Connection | Only when Radoop Proxy is enabled. This field consists of two dropdown selectors, which together define the Radoop Proxy used for accessing the cluster. The first one defines the location of the Radoop Proxy, which can be local or one of the configured RapidMiner Server repositories. The second one is the identifier of the Radoop Proxy.

XML connection editor

The Radoop connection XML can be edited manually by clicking the Edit XML... button on the Connection Settings dialog. Use this feature carefully, as it is easy to make mistakes in a connection entry through the XML editor. The main purpose of the editor is to make it much easier to share a connection and to copy and paste parts of it (e.g. Advanced Hadoop Parameters). When you close the window with the OK button, your changes appear in the fields of the Connection Settings dialog.

Note: Adding a separate key attribute to the tag in the XML editor will have no effect. It can only be added manually in radoop_connections.xml.

Configuring non-default properties

If your Hadoop cluster uses non-default properties, additional key-value pairs may be required. Cluster management tools like Cloudera Manager and Ambari allow you to download the client configuration files. You may have to add cluster connection-related properties from these files to the Advanced Hadoop Parameters section of the Hadoop tab. See below for individual properties that frequently need to be (re)set, and more complex examples describing the properties required to connect to a cluster with High Availability (HA) enabled. The following tables list the keys of the potentially required client-side settings. The values should be set to the appropriate property values from the client configuration files. Note that not all keys related to these features may be required; the required set of key-value pairs depends on your cluster settings.

Key | Description
dfs.client.use.datanode.hostname | Indicates whether clients should use datanode hostnames when connecting to datanodes. Setting it to true may allow the use of the public network interface of the datanodes instead of the private one. By default, the property value retrieved from the cluster is used. If not properly set, the DataNode networking test (part of the full connection test) will show a warning. Example: see CDH 5.5 Quickstart VM
mapreduce.job.queuename | Queue to which a job is submitted. The system must be configured with this predefined queue, and access must be granted for submitting jobs to it. When using a queue other than the default, it must be defined here explicitly. Example: low_priority
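
If you have the client configuration files at hand, the values for keys like these can be looked up programmatically. The sketch below reads the standard Hadoop *-site.xml layout (a configuration element containing property elements with name and value children) and returns only the keys you ask for. The function name read_site_properties and the sample document are illustrative.

```python
import xml.etree.ElementTree as ET


def read_site_properties(xml_text, wanted_keys):
    """Return {name: value} for the wanted keys found in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name in wanted_keys:
            found[name] = prop.findtext("value", default="")
    return found


# Made-up sample content standing in for a downloaded hdfs-site.xml.
SAMPLE = """<configuration>
  <property><name>dfs.client.use.datanode.hostname</name><value>true</value></property>
  <property><name>dfs.replication</name><value>3</value></property>
</configuration>"""

wanted = {"dfs.client.use.datanode.hostname", "mapreduce.job.queuename"}
print(read_site_properties(SAMPLE, wanted))
# prints {'dfs.client.use.datanode.hostname': 'true'}
```

Keys that are absent from the file are simply not returned, which matches the note above that not every key is required on every cluster.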

Configuring the connection to an HA HDFS-enabled cluster only requires that you specify the proper Hadoop settings in the Advanced Hadoop Parameters section of the Hadoop tab.

The HA feature eliminates any single point of failure for a cluster by providing a standby (in addition to active) NameNode. HA implements manual switchover and automatic failover to provide continuous availability. The following table lists the settings required for the RapidMiner Radoop client to connect to the cluster. These properties must be configured in each cluster node configuration file. For further details, see your Hadoop documentation.

Key | Description
fs.defaultFS (or fs.default.name) | The default path for Hadoop FS typically contains the NameService ID of the HA-enabled cluster. Example: hdfs://nameservice1
dfs.nameservices | The logical name for the service. Example: nameservice1
dfs.ha.namenodes. | Comma-separated list of unique NameNode identifiers. Example: namenode152,namenode92
dfs.namenode.rpc-address.. | RPC address for each NameNode to listen on. Example: node01.example.com:8020
dfs.client.failover.proxy.provider. | Class HDFS clients use to contact the active NameNode. Currently there is only one option shipped with Hadoop. Example: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
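
As an illustration of how these keys fit together, the following sketch builds the key-value pairs for one nameservice. It assumes the usual Hadoop HDFS HA naming, in which the nameservice ID is appended to dfs.ha.namenodes and dfs.client.failover.proxy.provider, and both the nameservice ID and the NameNode ID are appended to dfs.namenode.rpc-address. The function name hdfs_ha_parameters and the second host name are hypothetical; always take the real values from your client configuration files.

```python
def hdfs_ha_parameters(nameservice, namenodes):
    """Build HDFS HA key-value pairs from a nameservice ID and {NameNode ID: 'host:port'}."""
    params = {
        "fs.defaultFS": f"hdfs://{nameservice}",
        "dfs.nameservices": nameservice,
        f"dfs.ha.namenodes.{nameservice}": ",".join(namenodes),
        f"dfs.client.failover.proxy.provider.{nameservice}":
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    }
    # One RPC address entry per NameNode ID.
    for nn_id, rpc_address in namenodes.items():
        params[f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"] = rpc_address
    return params


example = hdfs_ha_parameters(
    "nameservice1",
    {"namenode152": "node01.example.com:8020", "namenode92": "node02.example.com:8020"},
)
for key, value in example.items():
    print(key, "=", value)
```

Each printed pair corresponds to one row you would add in the Advanced Hadoop Parameters section.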

Setting up the connection to an HA Resource Manager-enabled cluster only requires that you specify the proper Hadoop settings in the Advanced Hadoop Parameters section of the Hadoop tab.

The Resource Manager (RM) HA feature removes a single point of failure (adds redundancy) using an Active/Standby RM pair. The following table lists the settings required for the RapidMiner Radoop client to connect to the cluster. These properties must be configured in each cluster node configuration file. For further details, see your Hadoop documentation.

Key | Description
yarn.resourcemanager.ha.enabled | Enables Resource Manager High Availability.
yarn.resourcemanager.ha.automatic-failover.enabled | Enables automatic failover. By default, only enabled when HA is enabled.
yarn.resourcemanager.ha.automatic-failover.embedded | When automatic failover is enabled, uses embedded leader-elector to pick the active RM. By default, only enabled when HA is enabled.
yarn.resourcemanager.zk-address | Address of the ZK-quorum. Used both for the state-store and embedded leader-election.
yarn.resourcemanager.cluster-id | Identifies the cluster. Used by the elector to ensure an RM does not take over as active for another cluster. Example: yarnRM
yarn.resourcemanager.ha.id | Identifies the RM in the ensemble. Optional, but if set, ensure that all the RMs have a unique ID.
yarn.resourcemanager.ha.rm-ids | Comma-separated list of logical IDs for the RMs. Example: rm274,rm297
yarn.resourcemanager.address. | Service address for each RM ID.
yarn.resourcemanager.scheduler.address. | Scheduler address for each RM ID.
yarn.resourcemanager.resource-tracker.address. | Resource tracker address for each RM ID.
yarn.resourcemanager.admin.address. | RM admin address for each RM ID.
yarn.resourcemanager.store.class | The class to use as the persistent store for RM recovery.
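
Because several of these keys exist once per RM ID, it is easy to add an address for one Resource Manager and forget the other. The sketch below lists the per-RM keys that are absent for the IDs named in yarn.resourcemanager.ha.rm-ids, assuming the RM ID is appended to each per-RM key. The result is informational only, since not every key is required on every cluster; missing_rm_ha_keys and the sample address are hypothetical.

```python
PER_RM_KEYS = [
    "yarn.resourcemanager.address",
    "yarn.resourcemanager.scheduler.address",
    "yarn.resourcemanager.resource-tracker.address",
    "yarn.resourcemanager.admin.address",
]


def missing_rm_ha_keys(params):
    """List per-RM keys that are absent for the IDs in yarn.resourcemanager.ha.rm-ids."""
    raw_ids = params.get("yarn.resourcemanager.ha.rm-ids", "")
    rm_ids = [rm_id.strip() for rm_id in raw_ids.split(",") if rm_id.strip()]
    return [
        f"{key}.{rm_id}"
        for rm_id in rm_ids
        for key in PER_RM_KEYS
        if f"{key}.{rm_id}" not in params
    ]


params = {
    "yarn.resourcemanager.ha.enabled": "true",
    "yarn.resourcemanager.ha.rm-ids": "rm274,rm297",
    "yarn.resourcemanager.address.rm274": "node01.example.com:8032",
}
for key in missing_rm_ha_keys(params):
    print("not set:", key)
```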

Configuring Spark for a RapidMiner Radoop connection

By configuring Spark for a RapidMiner Radoop connection, you enable the Spark operators. See the exact Spark version requirements for each operator on the Installing Radoop on Studio page.

To enable Spark, select a valid Spark version from the dropdown list in the Connection Settings dialog.

You must provide the following mandatory inputs on the Spark tab of the Connection Settings dialog:

Field | Description
Spark Version | Dropdown list to select the version of Spark your cluster supports.
  • If Spark is not on the cluster or is not needed, select None.
  • For Spark 1.2.x, 1.3.x, or 1.4.x, select Spark 1.4 or below.
  • For Spark 1.5.x, 1.6.x, 2.0.x, 2.1.x, or 2.2.x, select the corresponding value.
  • For Spark 2.3.x, select based on the Spark patch version number: for Spark 2.3.0 select Spark 2.3.0, for all other 2.3.x versions select Spark 2.3.1+.
Assembly Jar location / Spark Archive (or libs) path | The HDFS or local path of the distribution-specific Spark assembly JAR file / Spark JAR files. If you provide a local path, it must be the same on every node in your cluster. Specifying the local path is recommended if Spark is automatically installed (e.g. with Cloudera Manager or Ambari) on the cluster. For some Hadoop versions, the pre-built Spark assembly JAR can be downloaded from the Apache Spark download page. Some vendors (like Cloudera) provide a distribution-specific Spark assembly JAR. For the HDFS path of the JAR, contact your Hadoop administrator. For example, to install Spark 1.5 manually, refer to the Spark requirements section. If you followed the instructions there, your assembly jar is at the following location on the HDFS: hdfs:///tmp/spark/spark-assembly-1.5.2-hadoop2.6.0.jar
Spark resource allocation policy | Spark needs a specification of the cluster resources it is allowed to use. See the Spark resource allocation policy descriptions.
Advanced Spark Parameters | Key-value pairs that can be applied to a Spark-on-YARN job. If the change has no effect on your Spark job, most likely it is ignored by YARN itself. To check the properties in the application log, set spark.logConf to true.

Configuring custom PySpark/SparkR archives for Spark

Radoop is shipped with PySpark and SparkR archives for each minor (x.y) Spark version to support the Spark Scripting operator. Using these archives for all sub-versions (x.y.z) is sufficient in most cases. However, certain Spark minor versions (e.g. 2.2 and 2.3) shipped by Hadoop distributions have multiple incompatible patched versions that behave differently with respect to the communication between the Python/R process and the JVM. These minor versions cannot be covered by shipping a single set of archives, so custom PySpark and SparkR archive options were introduced in the connection editor. When these options are enabled, Radoop uses the user-provided archives to execute the Spark Script operator, instead of the ones bundled with Radoop. These archives are usually shipped with the Hadoop distribution together with Spark, so they are typically located close to the Spark installation folder. This functionality is handled by the following extra settings:

Field Description
Use custom PySpark archive Check this box if you want to provide your own PySpark archives.
Custom PySpark archive paths Only when the "Use custom PySpark archive" option is enabled. Set of archives used as PySpark libraries for PySpark job submissions. You will typically need to provide two archives here, pyspark.zip and py4j-*.zip. The exact name and access path of these files depend on the Hadoop and Spark version of your cluster. Since you will need to provide at least two items, this parameter accepts multiple values. Each entry can be provided either as an HDFS location (hdfs:// protocol), as a file which is available on all cluster nodes at the same location (local:// protocol), or as a file on the client machine (file:// protocol). In a sample HDP 3 environment the necessary entries using local paths are local:///usr/hdp/current/spark2-client/python/lib/pyspark.zip and local:///usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip.
Use custom SparkR archive Check this box if you want to provide your own SparkR archive.
Custom SparkR archive path Only when the "Use custom SparkR archive" option is enabled. Archive used as the SparkR library for SparkR job submissions. This path can be provided either as an HDFS location (hdfs:// protocol) or as a file on the client machine (file:// protocol). WARNING! Specifying archives available on the cluster nodes (local:// protocol) is not supported for this parameter. Therefore, if your archive is accessible on a cluster node, you will first need to upload it to HDFS and use the HDFS location for this parameter. In a sample HDP 3 environment this file is located at /usr/hdp/current/spark2-client/R/lib/sparkr.zip. In the example shown below, this file was uploaded to the HDFS location hdfs:///tmp/sparkr.zip, which is then referenced by this parameter.
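The protocol rules in the table above can be checked before entering the paths in the connection editor. The following Python sketch is only an illustration and is not part of Radoop: the helper names (check_archive_path, missing_pyspark_archives) and the constants are hypothetical, and the sample paths are the HDP 3 examples quoted above.

```python
import fnmatch
import posixpath
from urllib.parse import urlparse

# Protocols listed in the table above (constant names are hypothetical).
PYSPARK_SCHEMES = {"hdfs", "local", "file"}
SPARKR_SCHEMES = {"hdfs", "file"}  # local:// is not supported for SparkR

# Archive name patterns the text says are typically needed for PySpark.
TYPICAL_PYSPARK_ARCHIVES = ("pyspark.zip", "py4j-*.zip")


def check_archive_path(path, allowed_schemes):
    """Return the URI scheme of path, or raise ValueError if it is not allowed."""
    scheme = urlparse(path).scheme
    if scheme not in allowed_schemes:
        raise ValueError(
            f"{path!r}: protocol {scheme!r} is not one of {sorted(allowed_schemes)}"
        )
    return scheme


def missing_pyspark_archives(paths):
    """Return the typical archive patterns that no entry in paths matches."""
    names = [posixpath.basename(urlparse(p).path) for p in paths]
    return [
        pattern
        for pattern in TYPICAL_PYSPARK_ARCHIVES
        if not any(fnmatch.fnmatchcase(name, pattern) for name in names)
    ]


# Sample HDP 3 entries taken from the table above.
pyspark_paths = [
    "local:///usr/hdp/current/spark2-client/python/lib/pyspark.zip",
    "local:///usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip",
]
sparkr_path = "hdfs:///tmp/sparkr.zip"

for p in pyspark_paths:
    check_archive_path(p, PYSPARK_SCHEMES)
check_archive_path(sparkr_path, SPARKR_SCHEMES)
print(missing_pyspark_archives(pyspark_paths))
```

A local:// entry passes the PySpark check but is rejected for SparkR, which mirrors the warning in the table. The check only looks at the protocol and file name; it does not verify that the files exist on HDFS or on the cluster nodes.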

Spark resource allocation policies

RapidMiner Radoop supports the following resource allocation policies:

Dynamic Resource Allocation

Static, Heuristic Configuration

  • This is the default policy in 8.1.0 and previous versions. If you use this option, you do not need to set any advanced resource allocation settings. The Resource Allocation % field sets the percentage of cluster resources (cluster memory, number of cores) to be used for a Spark job. Note that if you set this value too high, other jobs on the cluster might suffer. The default value is 70%.
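To get a feel for what a given Resource Allocation % means, the following Python sketch computes the share of memory and cores for a hypothetical cluster. It only illustrates the percentage arithmetic; the exact heuristic Radoop applies internally is not described here, and the function name and cluster sizes are assumptions.

```python
def resource_budget(total_memory_mb, total_cores, allocation_percent=70):
    """Return (memory_mb, cores) available to a Spark job at the given percentage.

    Illustrative arithmetic only; not Radoop's actual heuristic.
    """
    if not 0 < allocation_percent <= 100:
        raise ValueError("allocation_percent must be in (0, 100]")
    memory_mb = total_memory_mb * allocation_percent // 100
    cores = total_cores * allocation_percent // 100
    return memory_mb, cores


# Hypothetical cluster: 100 GB of memory and 40 cores, default 70%.
print(resource_budget(102400, 40))
```

With these assumed numbers, the remaining 30% of memory and cores stays available for other jobs on the cluster.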

Static, Default Configuration

  • A policy that uses Spark's default settings for resource allocation. These defaults are very low and may not be sufficient for a real cluster, but they may be a viable option for VMs/sandboxes.

Static, Manual Configuration

  • This policy requires that you set the following properties under Advanced Spark Parameters on the Spark tab of the Connection Settings dialog. The Spark documentation describes each property. (The corresponding Spark on YARN command line arguments are shown in parentheses.)
    • spark.executor.cores (--executor-cores)
    • spark.executor.instances (--num-executors)
    • spark.executor.memory (--executor-memory)
    • (optional) spark.driver.memory (--driver-memory)
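The correspondence between the properties and the command line arguments listed above can be written out as a small Python sketch. The helper name to_cli_args and the example values (2 cores, 4 executors, 4g of memory) are assumptions for illustration; choose values that fit your cluster.

```python
# Property -> Spark on YARN command line argument, as listed above.
SPARK_ON_YARN_FLAGS = {
    "spark.executor.cores": "--executor-cores",
    "spark.executor.instances": "--num-executors",
    "spark.executor.memory": "--executor-memory",
    "spark.driver.memory": "--driver-memory",  # optional
}
REQUIRED_KEYS = (
    "spark.executor.cores",
    "spark.executor.instances",
    "spark.executor.memory",
)


def to_cli_args(params):
    """Translate Advanced Spark Parameters into the equivalent CLI arguments."""
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise ValueError(f"missing required properties: {missing}")
    args = []
    for key, flag in SPARK_ON_YARN_FLAGS.items():
        if key in params:
            args += [flag, str(params[key])]
    return args


# Illustrative values only.
manual_params = {
    "spark.executor.cores": "2",
    "spark.executor.instances": "4",
    "spark.executor.memory": "4g",
}
print(to_cli_args(manual_params))
# prints ['--executor-cores', '2', '--num-executors', '4', '--executor-memory', '4g']
```

In Radoop itself you enter the keys and values of manual_params directly as Advanced Spark Parameters; the command line form is shown only to make the correspondence explicit.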

Note: Because of SPARK-6962, RapidMiner Radoop changes the default value of spark.shuffle.blockTransferService to nio instead of netty. To override this setting, set the key spark.shuffle.blockTransferService to the value netty in the Advanced Spark Parameters field. Starting with Spark 1.6.0, Spark ignores this setting and the BlockTransferService is always netty.

Hive on Spark & Hive on Tez container reuse

Reusing the containers of the Hive execution engine can dramatically speed up Radoop processes, especially if there are lots of Hive-only tasks. This is achieved by keeping a number of Spark / Tez containers (applications) running for executing Hive queries. Keep in mind that these containers use cluster resources even if there are no running processes. By default, Radoop tries to estimate the optimal number of containers, but you can also set a fixed number in the settings (see below). Idle containers are automatically closed after a timeout.

To use this feature, your cluster must support Hive on Spark or Hive on Tez. In your connection, set hive.execution.engine to spark or tez in Advanced Hive Parameters and check the Hive on Spark / Tez container reuse checkbox (it is checked by default).

A number of global Radoop settings can be used to control the container reuse behaviour. You may want to test different settings to use your cluster optimally; see Radoop Settings for details.

As Hive on Spark / Hive on Tez containers are kept running and reserving cluster resources, you may easily run out of memory/cores on small clusters (e.g. quickstart VMs) if you run other MapReduce, Spark or Tez jobs. To prevent this situation, Radoop automatically stops these containers before starting a MapReduce or Spark job. (Idle containers would have been closed anyway, but this enables closing them before the idle timeout, right when the resources are needed.)

Impala connections

If you are configuring an Impala connection, some Advanced Hadoop Parameters need to be added manually. If you forget to add any of them, a warning message lists the missing ones. The Add Required Entries button adds the keys of these properties to the list, but their values must be set manually according to the cluster configuration.

When upgrading RapidMiner Studio or Server, further settings may become mandatory, which could mean that the Impala connections may have to be updated with the new required advanced settings.