You are viewing the RapidMiner Radoop documentation for version 9.8.

Installing RapidMiner Radoop on RapidMiner Server

Prerequisites

The following requirements must be met before installing the RapidMiner Radoop extension on RapidMiner Server:

  • RapidMiner Radoop Extension installed and tested on RapidMiner Studio. If necessary, see Configuring RapidMiner Radoop Connections to ensure that you have a valid connection to a Hadoop cluster in RapidMiner Studio.

Installing RapidMiner Radoop on RapidMiner Server and the connected Job Agent(s)

Installing the RapidMiner Radoop extension on RapidMiner Server requires that you copy files from your RapidMiner Studio configuration into your RapidMiner Server installation. The central resource management functionality will automatically synchronize the Radoop extension, Radoop licenses and connection definitions to all connected Job Agents.

Prepare the following artifacts to complete the installation:

  1. RapidMiner Radoop Extension (a JAR file). You can download the RapidMiner Radoop extension from the Marketplace, or you can get it on your desktop computer from your local .RapidMiner/ configuration directory (created by RapidMiner Studio).

  2. Radoop license (a license string and/or a .lic file). The RapidMiner Radoop license needs manual installation on RapidMiner Server (note that a Radoop Basic license is not enough to use Radoop). You can get it from https://my.www.turtlecreekpls.com, or you can locate the license file on your desktop computer in your local .RapidMiner/ configuration directory (created by RapidMiner Studio).

  3. Radoop Connection definitions (an XML file). Locate the radoop_connections.xml file in your local .RapidMiner/ configuration directory (created by RapidMiner Studio).

Installing RapidMiner Radoop on RapidMiner Server

  1. Stop the server.

  2. Copy the Radoop extension JAR file to the resources/extensions/ subfolder of your RapidMiner Server Home Directory.

  3. Copy the radoop_connections.xml file into the .RapidMiner/ subfolder of your RapidMiner Server Home Directory.

  4. Start the server.

  5. On the Server Web UI, navigate to Administration > Manage Licenses and check your Radoop license under Active licenses. If it is a Radoop Basic license, click Install License in the Actions menu (located on the right side by default) and paste your Radoop license into the text field.
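
The two copy steps above can be scripted. The following is a minimal sketch, assuming the Server is already stopped and that the target subfolders are exactly `resources/extensions/` and `.RapidMiner/` under the Server Home Directory as described above. The function name `install_radoop_files` and all file names in the demonstration are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def install_radoop_files(server_home: Path, extension_jar: Path, connections_xml: Path) -> list[Path]:
    """Copy the Radoop JAR and the connection file into a (stopped) Server home.

    Hypothetical helper: the two target subfolders come from the steps above,
    everything else is an illustration.
    """
    extensions_dir = server_home / "resources" / "extensions"
    config_dir = server_home / ".RapidMiner"
    extensions_dir.mkdir(parents=True, exist_ok=True)
    config_dir.mkdir(parents=True, exist_ok=True)

    return [
        Path(shutil.copy2(extension_jar, extensions_dir / extension_jar.name)),
        Path(shutil.copy2(connections_xml, config_dir / "radoop_connections.xml")),
    ]


# Demonstration against throwaway directories (all names below are made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    jar = root / "radoop-extension.jar"
    jar.write_bytes(b"placeholder")
    xml = root / "radoop_connections.xml"
    xml.write_text("<radoop-entries/>")
    home = root / "server-home"
    installed = [p.relative_to(home).as_posix() for p in install_radoop_files(home, jar, xml)]
    print(installed)
```

Stopping and starting the Server and installing the license remain manual steps in this sketch.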

Installing RapidMiner Radoop on RapidMiner Server Job Agents

The central resource management functionality of RapidMiner Server will automatically synchronize the Radoop extension, installed licenses, and connections described in your radoop_connections.xml to all connected Job Agents. Please make sure that central resource management is configured to sync the locations where you uploaded these artifacts (the default locations are already covered out of the box).

If you need instructions on how to set up Radoop on all Job Agents manually, you will find them in the previous version of this document.

Updating Radoop connections on RapidMiner Server

Radoop connections are stored in radoop_connections.xml on the server side (in the .RapidMiner/ subfolder of the RapidMiner Server Home Directory), but there is no GUI on the server to edit the connections. The recommended procedure is to edit connections on the client side using RapidMiner Studio and then upload them to the server as an XML file.

Follow these steps to apply your new connection definitions on your Server deployment:

  1. Copy (overwrite) radoop_connections.xml in the .RapidMiner/ subfolder of the RapidMiner Server Home Directory.

  2. To avoid a Server restart but still propagate the changes, you need to manually trigger an update on all connected Job Agents by calling a Server REST API. To achieve this, invoke the /executions/sync/update REST endpoint of the Server with the "type":"EXECUTION_CONTEXT" parameter set and authentication in place. A successful trigger is indicated by a 2xx status code in the HTTP response. Here's an example using the command line:

    curl "https://<server-host>/executions/sync/update" \
      -X POST \
      -d '{"type":"EXECUTION_CONTEXT"}' \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer <token>" \
      -w "\nResponse HTTP status code: %{http_code}\n"

  3. Alternatively, restart RapidMiner Server to apply the changes to Server and all connected Job Agents.
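
The same REST call can be issued from a script. Below is a minimal sketch using only the Python standard library. The endpoint path, the JSON body and the header names are taken from the command-line example above. The base URL `https://server.example.com` and the token value are placeholders, and the function names are hypothetical.

```python
import json
import urllib.request


def build_sync_request(server_base_url: str, bearer_token: str) -> urllib.request.Request:
    """Build the POST request that asks the Server to sync the execution context."""
    body = json.dumps({"type": "EXECUTION_CONTEXT"}).encode("utf-8")
    return urllib.request.Request(
        url=server_base_url.rstrip("/") + "/executions/sync/update",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {bearer_token}",
        },
        method="POST",
    )


def trigger_sync(request: urllib.request.Request) -> bool:
    """Send the request and report whether the Server answered with a 2xx status.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses, so callers
    should be prepared to catch that exception.
    """
    with urllib.request.urlopen(request) as response:
        return 200 <= response.status < 300


# Only the request is built here; nothing is sent over the network.
request = build_sync_request("https://server.example.com", "example-token")
print(request.get_method(), request.full_url)
```

Calling `trigger_sync(request)` against a real Server requires a valid token and network access. How the token is obtained is not covered by this sketch.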

Changes to radoop_connections.xml are applied immediately to all process executions started after the update. Already running processes remain unaffected.

Managing multiple Radoop connections on RapidMiner Server

The radoop_connections.xml file can list an arbitrary number of connections and should list all connections that may be used by any process submitted by any user to this Server. These connections may point to the same Hadoop cluster or to different clusters. The RapidMiner Server administrator may define connections for the same user or for different users (see Managing multiple Hadoop users below).

To control the access rights to these connections on the RapidMiner Server, e.g. to restrict which user can use which connection when submitting processes to the RapidMiner Server, each connection should set the so-called Access Whitelist field. See Access control on Radoop connections for details.

The connection names must be the same on the RapidMiner Server and in the RapidMiner Studio instance that submits the process to ensure correct process execution across the platform.
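
A quick way to check this before uploading is to compare the connection names in the Studio file with those in the Server file. The following is a minimal sketch. The `radoop-connection-entry` tag is taken from this document, while the `name` child element and the sample connection names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def connection_names(xml_text: str) -> set[str]:
    """Collect connection names; assumes each entry has a <name> child (assumption)."""
    root = ET.fromstring(xml_text)
    return {
        entry.findtext("name", default="").strip()
        for entry in root.iter("radoop-connection-entry")
    }


def missing_on_server(studio_xml: str, server_xml: str) -> set[str]:
    """Names defined in the Studio file that the Server file does not contain."""
    return connection_names(studio_xml) - connection_names(server_xml)


# Made-up sample content for both files.
studio_xml = """
<radoop-entries>
  <radoop-connection-entry><name>cluster1_john</name></radoop-connection-entry>
</radoop-entries>
"""
server_xml = """
<radoop-entries>
  <radoop-connection-entry><name>cluster1_scott</name></radoop-connection-entry>
</radoop-entries>
"""
missing = missing_on_server(studio_xml, server_xml)
print(missing)
```

An empty result means every Studio connection name also exists on the Server. It does not verify that the settings behind those names are equivalent.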

Once you have created a radoop_connections.xml file containing all desired connections, follow the procedure about Updating Radoop connections to apply the changes on the Server.

Managing multiple Hadoop users on RapidMiner Server

In a multi-user Hadoop environment, the RapidMiner Server administrator needs to manually edit the radoop_connections.xml file on the Server to make sure that all connections are included and that users of the RapidMiner platform are restricted to using solely their own identity on the Hadoop cluster (i.e. executing Spark jobs and Hive queries using their own Hadoop access rights). After the changes have been made to radoop_connections.xml, follow the procedure about Updating Radoop connections to apply the changes on the Server.

Two different configuration strategies are available:

  1. Dedicated Radoop connections, one for each Hadoop user.
  2. One connection with the credentials of a privileged Hadoop user, that is, a user allowed to impersonate other users (see Apache Hadoop user impersonation).

Option #1: Creating dedicated Radoop connections

This approach requires a dedicated connection definition for each Hadoop user. Administrators must take care of Radoop connection name conflicts and set up individual Hadoop credentials for each Radoop connection. RapidMiner Studio users only need to have their own connection(s), belonging to their Hadoop identity, in the local connection file on their client machine. On the RapidMiner Server side, there will be multiple connections defined in the connection file. An example for naming the connections: clustername_username, where clustername is an identifier for the Hadoop cluster and username is an identifier for the user (e.g. it may be the same as the value of the Hadoop username field). The Edit XML... option on the Connection Settings dialog can be used to copy each user's connection entry into the merged radoop_connections.xml on the Server.

Although this strategy is the simplest to introduce, since it doesn’t require any Hadoop cluster side setup, it has a drawback: administrators eventually have to keep several Radoop connections in sync that may differ only in their Hadoop credentials.

Option #2: Using Hadoop user impersonation in the Radoop connection

Hadoop user impersonation is available for Radoop connections. This approach enables the administrators to maintain a single Radoop connection with the credentials of a privileged Hadoop user, who is able to impersonate other Hadoop users.

This approach results in less maintenance and simpler access right management, while the credentials of the individual users (their encrypted passwords or keytabs) are not stored on the RapidMiner Server.

Prerequisite Hadoop cluster side configuration for impersonation

On the Hadoop side, there should be a dedicated user (the username can be e.g. privilegeduser) who has the rights to impersonate others. This configuration can be done based on the Hadoop documentation. In a simple case, the following snippet should be added to the core-site.xml in the Hadoop configuration:

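    <property>
        <name>hadoop.proxyuser.privilegeduser.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.privilegeduser.groups</name>
        <value>*</value>
    </property>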

If HDFS Encryption (and the KMS service) is enabled, similar settings should also be ensured in kms-site.xml. For detailed information, please visit the KMS Proxyuser Configuration section on the KMS documentation page or follow the instructions of your Hadoop vendor.

Creating and testing an impersonated connection for RapidMiner Server

As a recommended approach, a connection should be constructed using RapidMiner Studio. You can find RapidMiner Server related settings on the RapidMiner Server tab of the Connection Settings dialog.

As shown in the screenshot above, the Enable impersonation on Server checkbox should be enabled and the credentials of the superuser should be entered into the Server Principal and Server Keytab File fields, similar to the case with client users (presented in the section Hadoop security configuration).

If LDAP authentication is configured for Hive, the Hive Principal should be empty and the credentials of the privilegeduser should be entered into the Hive Username and Password fields (these two fields are only enabled if Hive Principal is empty).

The connection can be tested from RapidMiner Studio if the networking setup allows connecting to the Hadoop cluster from the client hosts. If the Impersonated user for local testing field is set (e.g. scott is entered as username), then all the operations are submitted using the privilegeduser credentials, but impersonating the scott user and using its access rights. This field does not have an effect when running on RapidMiner Server: in that case, the effective user will always be the user who submitted the RapidMiner process.

Securing Radoop connections on RapidMiner Server

RapidMiner Server supports connections to Hadoop clusters with the same security settings as RapidMiner Studio, but you may need to manually edit the connection XML file (e.g. because of different file path settings on the server side). In general, connections should be constructed using RapidMiner Studio (using it as a "connection editor"), and the following additional steps should be considered.

Decrypting connection passwords

RapidMiner Radoop uses the local cipher.key file to encrypt and the key attribute of the radoop-entries tag in the XML file to decrypt the passwords in the radoop_connections.xml file by default. If the radoop_connections.xml contains entries from multiple users, there are two possible solutions:

  1. Create every user's connection entry on the same computer (with the same cipher.key file), or
  2. Add a key attribute to each radoop-connection-entry manually. Radoop will use the per-entry key instead of the key attribute of the file.

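For example, users John and Scott have the following radoop_connections.xml files:

    <radoop-entries key="...">
        <radoop-connection-entry>
            <name>connection-john</name>
            ...
        </radoop-connection-entry>
    </radoop-entries>

    <radoop-entries key="...">
        <radoop-connection-entry>
            <name>connection-scott</name>
            ...
        </radoop-connection-entry>
    </radoop-entries>

The merged radoop_connections.xml looks like the following:

    <radoop-entries>
        <radoop-connection-entry key="...">
            <name>connection-john</name>
            ...
        </radoop-connection-entry>
        <radoop-connection-entry key="...">
            <name>connection-scott</name>
            ...
        </radoop-connection-entry>
    </radoop-entries>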
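
The second solution can be automated when many user files have to be merged. The following is a minimal sketch, assuming each input file has a radoop-entries root with a key attribute, as described above. It moves the file-level key onto every radoop-connection-entry that does not already carry one. The `name` element and the key values in the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET


def merge_connection_files(xml_texts: list[str]) -> ET.Element:
    """Merge several radoop_connections.xml documents into one radoop-entries root."""
    merged = ET.Element("radoop-entries")
    for text in xml_texts:
        source_root = ET.fromstring(text)
        file_key = source_root.get("key")
        for entry in source_root.iter("radoop-connection-entry"):
            # Keep an existing per-entry key; otherwise inherit the file-level key.
            if entry.get("key") is None and file_key is not None:
                entry.set("key", file_key)
            merged.append(entry)
    return merged


# Made-up sample files; real key values come from each user's environment.
john_xml = """
<radoop-entries key="key-of-john">
  <radoop-connection-entry><name>connection-john</name></radoop-connection-entry>
</radoop-entries>
"""
scott_xml = """
<radoop-entries key="key-of-scott">
  <radoop-connection-entry><name>connection-scott</name></radoop-connection-entry>
</radoop-entries>
"""
merged = merge_connection_files([john_xml, scott_xml])
merged_keys = [entry.get("key") for entry in merged]
print(ET.tostring(merged, encoding="unicode"))
```

The merged root deliberately has no key attribute in this sketch, so every password relies on its per-entry key.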

Connection to Hadoop clusters with Kerberos authentication

For configuring a connection to a cluster with Kerberos authentication, see Hadoop security. Please note the following when using these connections through RapidMiner Server.

Connecting with Kerberos password

It is possible to use a password to connect to a Kerberized cluster. To make sure that the encrypted passwords in the connection XML can be decrypted on the Server, please refer to the Decrypting connection passwords section. Please note that on the Server side, using a keytab is recommended, as ticket renewal is not supported in case of using a password.

Connecting with keytab file

Connections to a Kerberized cluster should specify the path to the user's keytab file instead of the password. This means that the keytab file must be accessible on the local file system of the Server. The path usually differs from the path on the local file system of the user using RapidMiner Studio. The RapidMiner Server administrator has to ensure that the keytabFile field of the radoop_connections.xml file on the Server points to the appropriate path on the Server. The keytab file itself should only be accessible to the user running RapidMiner Server.

Note: A RapidMiner Server instance can only talk to a single Kerberized Hadoop cluster, more precisely, to a single Kerberos Realm. This limitation comes from the architecture of the Java Kerberos implementation. However, multiple users can use this Kerberized Hadoop cluster concurrently through this RapidMiner Server instance.
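
The file permission requirement can be verified with a short script. The following is a minimal sketch for POSIX file systems, assuming that "only accessible" means no permission bits are set for group or others. The function name `keytab_is_private` is hypothetical, and the sketch does not check which user owns the file.

```python
import os
import stat
import tempfile


def keytab_is_private(path: str) -> bool:
    """True if path is a regular file with no group/other permission bits (POSIX)."""
    mode = os.stat(path).st_mode
    return stat.S_ISREG(mode) and (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Demonstration on a temporary file standing in for a keytab.
with tempfile.NamedTemporaryFile(suffix=".keytab") as handle:
    os.chmod(handle.name, 0o600)  # owner read/write only
    private_result = keytab_is_private(handle.name)
    os.chmod(handle.name, 0o644)  # readable by group and others
    shared_result = keytab_is_private(handle.name)

print(private_result, shared_result)
```

On the Server, the path passed to the check would be the value of the keytabFile field of the corresponding connection.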

Connecting to Hive with LDAP authentication

If LDAP is used for authentication to HiveServer2, passwords should be entered in the same way as Kerberos passwords; please refer to the Decrypting connection passwords section. In case of impersonation, the provided Hive LDAP user should also have Hadoop proxyuser privileges.

Access control on Radoop connections

The availability of a Hadoop connection on RapidMiner Server can be limited to a user or a group of users. This means that a RapidMiner Server user that is not on the optionally specified whitelist of a connection cannot use it when submitting Radoop processes. This way, the Server administrator can make sure that users cannot use connections that they are not permitted to use, and that they cannot evade this restriction by manipulating their connection identifiers in submitted processes.

To define a group (or user) whitelist for a connection, add the accesswhitelist XML tag to the corresponding radoop-connection-entry in radoop_connections.xml. The value of this property is an arbitrary regular expression (.* or * can be used for allowing all users). Only RapidMiner Server users whose group matches this expression are allowed to use the connection in a submitted process. If this optional accesswhitelist is not specified for a connection, then any user can use it in a process.

    <radoop-connection-entry>
        ....
        <accesswhitelist>ds_group|dba_group|john|scott</accesswhitelist>
    </radoop-connection-entry>
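
To preview how a whitelist expression behaves before deploying it, the matching can be approximated in a few lines. The following is only a sketch of the rule described above, not Radoop's actual implementation. It assumes that the whole user or group name must match the expression (a full match) and that a bare `*` is treated as a special "allow all" value. The function name `is_allowed` is hypothetical.

```python
import re
from typing import Iterable, Optional


def is_allowed(whitelist: Optional[str], principals: Iterable[str]) -> bool:
    """Approximate the whitelist check for a user's name and group names."""
    if whitelist is None:
        # No accesswhitelist tag: any user can use the connection.
        return True
    if whitelist == "*":
        # A bare * is not a valid regular expression; treated as "allow all" (assumption).
        return True
    pattern = re.compile(whitelist)
    # Assumption: the entire name has to match, not just a substring.
    return any(pattern.fullmatch(name) is not None for name in principals)


whitelist = "ds_group|dba_group|john|scott"
print(is_allowed(whitelist, ["john"]))               # user listed by name
print(is_allowed(whitelist, ["alice", "hr_group"]))  # neither user nor group listed
```

If the Server uses substring matching instead, a name such as `johnny` would behave differently than in this sketch, so it is worth testing edge cases on a non-production Server.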

Change Radoop Proxy enabled connections

Radoop Proxy is automatically disabled when a process is executed on RapidMiner Server. In a typical setup, RapidMiner Server runs inside the secure zone, so there is no need to route the traffic through the Proxy.

If you have a custom, manually installed Radoop Proxy on an edge node, and RapidMiner Server (besides Studio) can only reach the Hadoop cluster via this edge node (so it runs outside the secure zone), you need to enable the Force Radoop Proxy on Server setting on the RapidMiner Server tab. This setting has no effect when running in Studio.

Alternatively, you can manually edit the radoop_connections.xml file on the Server. In this case, add the forceproxyonserver tag with the value T.

    <radoop-connection-entry>
        ...
        <forceproxyonserver>T</forceproxyonserver>
        ...
    </radoop-connection-entry>
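
The manual edit can also be applied programmatically. The following is a minimal sketch that sets the forceproxyonserver tag to T on a single radoop-connection-entry element, adding the tag if it is missing. The helper name `force_proxy_on_server` and the `name` element in the sample entry are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def force_proxy_on_server(entry: ET.Element) -> None:
    """Ensure the entry contains exactly one forceproxyonserver tag with value T."""
    tag = entry.find("forceproxyonserver")
    if tag is None:
        tag = ET.SubElement(entry, "forceproxyonserver")
    tag.text = "T"


# Made-up sample entry; a real entry contains many more settings.
entry = ET.fromstring(
    "<radoop-connection-entry><name>cluster1_john</name></radoop-connection-entry>"
)
force_proxy_on_server(entry)
force_proxy_on_server(entry)  # calling it twice does not duplicate the tag
print(ET.tostring(entry, encoding="unicode"))
```

Writing the modified tree back to radoop_connections.xml is left out of this sketch.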

To apply the updated connection, follow the procedure about Updating Radoop connections.

The location of the Radoop Proxy connection specified in Studio for this connection needs to be the Remote Repository corresponding to this RapidMiner Server instance. Otherwise the process won’t be able to find the proxy connection when running on the Server and will fail because of that.