You are viewing the RapidMiner Radoop documentation for version 9.4.
Installing RapidMiner Radoop on RapidMiner Server
Prerequisites
The following requirements must be met before installing the RapidMiner Radoop extension on RapidMiner Server:
- RapidMiner Radoop Extension installed and tested on RapidMiner Studio. If necessary, see Configuring RapidMiner Radoop Connections to ensure that you have a valid connection to a Hadoop cluster in RapidMiner Studio.
Installing RapidMiner Radoop on RapidMiner Server and the connected Job Agent(s)
Installing the RapidMiner Radoop client on RapidMiner Server requires copying files from your RapidMiner Studio configuration into your RapidMiner Server or Job Agent installations. Prepare the following artifacts before you start:
- RapidMiner Radoop Extension (a Jar file). You can download the RapidMiner Radoop extension from the Marketplace, or you can find it on your desktop computer in your local .RapidMiner configuration directory (created by RapidMiner Studio).
- Radoop license (a license string and/or a .lic file). The RapidMiner Radoop license needs manual installation on RapidMiner Server (note that a Radoop Basic license is not enough to use Radoop). You can get it from https://my.www.turtlecreekpls.com or you can locate the license file on your desktop computer in your local .RapidMiner configuration directory (created by RapidMiner Studio).
- Radoop Connection definitions (an XML file). Locate radoop_connections.xml in your local .RapidMiner configuration directory (created by RapidMiner Studio).
Installing RapidMiner Radoop on RapidMiner Server
- Stop the server.
- Add the extension Jar file to the extension or plugin directory of RapidMiner Server:
  - To determine the location of your RapidMiner Server plugins directory, open Administration and then System Settings from the RapidMiner Server home page. The value of the com.rapidanalytics.plugindir system setting indicates the location of the directory.
  - Starting from RapidMiner Server version 9.0, a RapidMiner Server Home Directory is introduced. You can also copy the extension Jar file into its home/resources/extensions/ subfolder.
- On the Server Web UI, navigate to Administration > Manage Licenses and check your Radoop license under Active licenses. If it is a Radoop Basic license, click Install License in the Actions menu (located on the right side by default) and paste your Radoop license into the text field.
- Restart the server.
Installing RapidMiner Radoop on RapidMiner Server Job Agents
Perform the following steps on each Job Agent connected to your RapidMiner Server.
- Stop the Job Agent.
- Add the extension Jar to the extensions directory of each Job Agent. For details see the instructions for Job Agents configuration.
- In a multi-user Server environment, please see the Configuring and securing multiple connections section. The final radoop_connections.xml must be placed in the container properties folder of the Job Agents. Copy or link the file into the home/config/rapidminer/.RapidMiner/ folder.
- For Job Agents, you also need to copy the installed license files to the Job Agents' home/resources/licenses/radoop/ folder.
- Start the Job Agent.
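The copy steps for one Job Agent can be scripted. The following is a minimal sketch, not an official installer: the function name `stage_radoop_files`, the file names, and the extensions directory used in the demo are assumptions for illustration. Only the two target folders `config/rapidminer/.RapidMiner` and `resources/licenses/radoop` under the Job Agent's `home` directory come from the steps above. The Job Agent is assumed to be stopped while the files are copied.

```python
import shutil
import tempfile
from pathlib import Path


def stage_radoop_files(agent_home, extensions_dir, jar, connections_xml, license_files):
    """Copy the Radoop artifacts into one (stopped) Job Agent installation.

    agent_home     -- the Job Agent's `home` directory
    extensions_dir -- the Job Agent's extensions directory (site specific)
    """
    agent_home = Path(agent_home)
    targets = [
        (Path(extensions_dir), [Path(jar)]),
        (agent_home / "config" / "rapidminer" / ".RapidMiner", [Path(connections_xml)]),
        (agent_home / "resources" / "licenses" / "radoop", [Path(p) for p in license_files]),
    ]
    copied = []
    for directory, files in targets:
        directory.mkdir(parents=True, exist_ok=True)
        for source in files:
            copied.append(Path(shutil.copy2(source, directory / source.name)))
    return copied


# Demo in a throwaway directory; all file names below are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    studio = tmp / "studio"
    studio.mkdir()
    for name in ("radoop.jar", "radoop_connections.xml", "radoop.lic"):
        (studio / name).write_text("placeholder")

    agent_home = tmp / "job-agent" / "home"
    copied = stage_radoop_files(
        agent_home,
        agent_home / "resources" / "extensions",  # assumed location, adjust to your setup
        studio / "radoop.jar",
        studio / "radoop_connections.xml",
        [studio / "radoop.lic"],
    )
    copied_relative = sorted(p.relative_to(agent_home).as_posix() for p in copied)

print(copied_relative)
```

Run the function once per Job Agent, then start the agents again.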
Managing Radoop connections on RapidMiner Server
Radoop connections are stored in radoop_connections.xml on the server side, but there is no GUI on the server to edit the connections. Connections should be edited on the client side using RapidMiner Studio and added to the server as an XML file.
In a multi-user environment, the RapidMiner Server administrator needs to manually edit the radoop_connections.xml file on the Server and the Job Agents to make sure that all connections are included. The radoop_connections.xml file can list an arbitrary number of connections. These connections may point to the same Hadoop cluster or to different clusters. They may define connections for the same user or for different users (e.g., with different Hadoop username fields).
The connection file on RapidMiner Server should list all connections that may be used by any process submitted to this Server. The connection names must be the same on the Server and in the RapidMiner Studio instance that submits the process.
RapidMiner Server does not need to be restarted if radoop_connections.xml is modified. The XML file is re-read from disk, so every process execution started after the modification uses the modified connections; processes that are already running are unaffected.
In a multi-user RapidMiner Server environment, two different configuration solutions are available for creating Radoop connections:
- a dedicated Radoop connection for each client user on the server side, or
- one connection with the credentials of a privileged Hadoop user who is allowed to impersonate other users (see Apache Hadoop user impersonation).
Option #1: Creating dedicated Hadoop connections for the client users
This approach requires a dedicated connection definition for each user, and administrators must take care of connection name conflicts. RapidMiner Studio users only need to have their own connection(s) in their local connection file on their client machine. On the server side, there will be multiple connections defined in the connection file. An example for naming the connections: clustername_username, where clustername is an identifier for the Hadoop cluster and username is an identifier for the user (e.g., it may be the same as the value of the Hadoop username field). The Edit XML... option on the Connection Settings dialog can be used to copy each user's connection entry into the merged radoop_connections.xml on the Server.
To control the access rights to these connections, e.g. so that users can only use their own connection when submitting processes to the Server, each connection should set the so-called Access Whitelist field to the corresponding username. See Access control on Radoop connections for details.
Option #2: Using Hadoop user impersonation in the Radoop connection
Hadoop user impersonation is available for Radoop connections. This approach enables administrators to add a single connection to RapidMiner Server with the credentials of a privileged Hadoop user who is able to impersonate other Hadoop users. This approach results in less maintenance and simpler access right management, and the credentials of the users (encrypted passwords or keytabs) are not stored on the server. Please note that using a keytab for the privileged superuser is strongly recommended, as ticket renewal is not fully supported when a password is used.
Hadoop-side configuration for impersonation
On the Hadoop side, there should be a dedicated user (the username can be e.g. privilegeduser) who has the rights to impersonate others. This configuration can be done based on the Hadoop documentation. In a simple case, the following snippet should be added to the core-site.xml in the Hadoop Configuration:
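<property>
    <name>hadoop.proxyuser.privilegeduser.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.privilegeduser.groups</name>
    <value>*</value>
</property>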
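Before testing impersonation, it can help to confirm that both proxyuser properties are actually present in the cluster's core-site.xml. The sketch below is a hypothetical helper (`proxyuser_settings` is not part of Radoop or Hadoop); it only relies on the standard Hadoop configuration layout of `property` elements with `name` and `value` children, and the sample document is inlined for illustration.

```python
import xml.etree.ElementTree as ET


def proxyuser_settings(core_site_xml, user):
    """Return the hosts/groups proxyuser values configured for `user`."""
    root = ET.fromstring(core_site_xml)
    wanted = {f"hadoop.proxyuser.{user}.{kind}": kind for kind in ("hosts", "groups")}
    found = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name in wanted:
            found[wanted[name]] = prop.findtext("value", default="").strip()
    return found


# Inlined sample; in practice read the text of your cluster's core-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>hadoop.proxyuser.privilegeduser.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.privilegeduser.groups</name>
    <value>*</value>
  </property>
</configuration>"""

settings = proxyuser_settings(SAMPLE, "privilegeduser")
missing = {"hosts", "groups"} - settings.keys()
print(settings, "missing:", sorted(missing))
```

An empty result for a user means that no proxyuser entries exist for that name in the inspected file.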
If HDFS Encryption (and the KMS service) is enabled, similar settings must also be present in kms-site.xml. For detailed information, please visit the KMS Proxyuser Configuration section on the KMS documentation page or follow the instructions of your Hadoop vendor.
Creating and testing the connection for RapidMiner Server
Similar to the other approach, a connection should be constructed using RapidMiner Studio. You can find RapidMiner Server related settings on the RapidMiner Server tab of the Connection Settings dialog.
As shown in the screenshot above, the Enable impersonation on Server checkbox should be enabled, and the credentials of the superuser should be entered in the Server Principal and Server Keytab File or Server Password fields, similar to the case with client users (presented in the section Hadoop security configuration). If LDAP authentication is configured for Hive, the Hive Principal should be empty and the credentials of the privilegeduser should be entered in the Hive Username and Password fields (these two fields are only enabled if Hive Principal is empty).
The connection can be tested from RapidMiner Studio if the networking setup allows connecting to the Hadoop cluster from the client hosts. If the Impersonated user for local testing field is set (e.g. scott is entered as username), then all the operations are submitted using the privilegeduser credentials, but impersonating the scott user and using its access rights. This field does not have an effect when running on RapidMiner Server: in that case, the Server user will always be the impersonated user.
Securing Radoop connections on RapidMiner Server
RapidMiner Server supports connecting to Hadoop clusters with the same security settings as RapidMiner Studio, but you may need to manually edit the connection XML file (e.g. because of different file path settings on the server side). In general, connections should be constructed using RapidMiner Studio (using it as a "connection editor"), and the following additional steps should be considered.
Decrypting connection passwords
By default, RapidMiner Radoop uses the local cipher.key file to encrypt the passwords in the radoop_connections.xml file, and the key attribute of the radoop-entries tag in the XML file to decrypt them. If radoop_connections.xml contains entries from multiple users, there are two possible solutions:
- create every user's connection entry on the same computer (with the same cipher.key file), or
- add a key attribute to each radoop-connection-entry manually. Radoop will use the per-entry key instead of the file-level key attribute.
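For example, users John and Scott have the following radoop_connections.xml files:
<radoop-entries key="XkzjmytZW2ffc7+MnU11BdhzomF8355R">
    <radoop-connection-entry>
        <name>connection-john</name>
        ...
    </radoop-connection-entry>
</radoop-entries>
<radoop-entries key="KLS4GvvZta0NhtXfwkXQeSqD11ngXeWP">
    <radoop-connection-entry>
        <name>connection-scott</name>
        ...
    </radoop-connection-entry>
</radoop-entries>
The merged radoop_connections.xml looks like the following:
<radoop-entries key="dontcare">
    <radoop-connection-entry key="XkzjmytZW2ffc7+MnU11BdhzomF8355R">
        <name>connection-john</name>
        ...
    </radoop-connection-entry>
    <radoop-connection-entry key="KLS4GvvZta0NhtXfwkXQeSqD11ngXeWP">
        <name>connection-scott</name>
        ...
    </radoop-connection-entry>
</radoop-entries>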
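The manual merge described in this section can be sketched in a few lines. This is an illustration under assumptions, not a supported tool: `merge_connection_files` is a hypothetical helper, and it assumes that each connection sits in a `radoop-connection-entry` element carrying its name in a `name` child element. Each user's file-level key is copied onto that user's entries as a per-entry `key` attribute, and duplicate connection names are rejected because names must be unique on the Server.

```python
import xml.etree.ElementTree as ET


def merge_connection_files(user_files, file_key="dontcare"):
    """Merge several users' connection XML documents into one element tree."""
    merged = ET.Element("radoop-entries", {"key": file_key})
    seen = set()
    for xml_text in user_files:
        root = ET.fromstring(xml_text)
        user_key = root.get("key")
        for entry in root.findall("radoop-connection-entry"):
            name = entry.findtext("name", default="").strip()  # assumed child tag
            if name in seen:
                raise ValueError(f"duplicate connection name: {name!r}")
            seen.add(name)
            # Keep an existing per-entry key, otherwise inherit the file-level key.
            if entry.get("key") is None and user_key is not None:
                entry.set("key", user_key)
            merged.append(entry)
    return merged


JOHN = """<radoop-entries key="XkzjmytZW2ffc7+MnU11BdhzomF8355R">
  <radoop-connection-entry><name>connection-john</name></radoop-connection-entry>
</radoop-entries>"""
SCOTT = """<radoop-entries key="KLS4GvvZta0NhtXfwkXQeSqD11ngXeWP">
  <radoop-connection-entry><name>connection-scott</name></radoop-connection-entry>
</radoop-entries>"""

merged = merge_connection_files([JOHN, SCOTT])
print(ET.tostring(merged, encoding="unicode"))
```

Review the merged output before placing it on the Server and the Job Agents; the sketch does not validate or decrypt any passwords.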
Connection to Hadoop clusters with Kerberos authentication
For configuring a connection to a cluster with Kerberos authentication, see Hadoop security. Note the following when using these connections through RapidMiner Server.
Connecting with Kerberos password
It is possible to use a password to connect to a Kerberized cluster. To make sure that the encrypted passwords in the connection XML can be decrypted on the Server, please refer to the Decrypting connection passwords section. Please note that on the Server side, using a keytab is recommended, as ticket renewal is not supported when a password is used.
Connecting with keytab file
Connections to a Kerberized cluster should specify the path to the user's keytab file instead of the password. This means that the keytab file must be accessible on the local file system of the Server. The path usually differs from the path on the local file system of the user using RapidMiner Studio. The RapidMiner Server administrator has to ensure that the keytabFile field of the radoop_connections.xml file on the Server points to the appropriate path on the Server. The keytab file itself should only be accessible on the file system to the user running RapidMiner Server.
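The file permission requirement can be checked with a short script. The sketch below assumes a POSIX file system and uses a hypothetical helper name, `keytab_is_private`. It only verifies that group and other users have no permissions on the file; it does not verify that the owner is the account running RapidMiner Server.

```python
import os
import stat
import tempfile


def keytab_is_private(path):
    """True if `path` is a regular file with no group/other permission bits."""
    mode = os.stat(path).st_mode
    return stat.S_ISREG(mode) and (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Demo on a placeholder file; a real check would use the Server-side keytab path.
with tempfile.TemporaryDirectory() as tmp:
    keytab = os.path.join(tmp, "example.keytab")
    with open(keytab, "wb") as handle:
        handle.write(b"placeholder")

    os.chmod(keytab, 0o644)  # readable by group and others
    readable_by_others = keytab_is_private(keytab)

    os.chmod(keytab, 0o600)  # owner only
    owner_only = keytab_is_private(keytab)

print(readable_by_others, owner_only)
```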
Note: A RapidMiner Server instance can only talk to a single Kerberized Hadoop cluster, or more precisely, to a single Kerberos realm. This limitation comes from the architecture of the Java Kerberos implementation. However, multiple users can use this Kerberized Hadoop cluster concurrently through this RapidMiner Server instance.
Connecting to Hive with LDAP authentication
If LDAP is used for authentication to HiveServer2, then passwords should be entered similarly to the Kerberos passwords; please refer to the Decrypting connection passwords section. In case of impersonation, the provided Hive LDAP user should also have Hadoop proxyuser privileges.
Access control on Radoop connections
The availability of a Hadoop connection on RapidMiner Server can be limited to a user or a group of users. This means that a RapidMiner Server user that is not on the optionally specified whitelist of a connection cannot use it when submitting Radoop processes. This way, the Server administrator can make sure that users cannot use connections that they are not permitted to use, and that they cannot evade this restriction by manipulating their connection identifiers in submitted processes.
To define a group (or user) whitelist for a connection, add the accesswhitelist tag to the corresponding radoop-connection-entry in radoop_connections.xml. The value of this property is an arbitrary regular expression (.* or * can be used for allowing all users). Only RapidMiner Server users whose group (or user name) matches this expression are allowed to use the connection in a submitted process. If this optional accesswhitelist is not specified for a connection, then any user can use it in a process.
<radoop-connection-entry>
    ....
    <accesswhitelist>ds_group|dba_group|john|scott</accesswhitelist>
</radoop-connection-entry>
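To see how a whitelist expression behaves before deploying it, the matching can be simulated locally. This is only a sketch of plausible semantics, not Radoop's actual implementation: it assumes that the expression must match a whole user or group name, that a missing whitelist allows everyone, and that a bare `*` is treated as a special "allow all" value (a lone `*` is not a valid regular expression in Python). The function name `is_allowed` is hypothetical.

```python
import re


def is_allowed(whitelist, user, groups):
    """Simulate the access check for one Server user (assumed semantics)."""
    if whitelist is None or whitelist.strip() == "*":
        return True  # no whitelist, or the special allow-all value
    pattern = re.compile(whitelist)
    # Assumption: the expression has to match a complete user or group name.
    return any(pattern.fullmatch(candidate) for candidate in [user, *groups])


WHITELIST = "ds_group|dba_group|john|scott"

checks = {
    "john, no groups": is_allowed(WHITELIST, "john", []),
    "alice in ds_group": is_allowed(WHITELIST, "alice", ["ds_group"]),
    "alice in marketing": is_allowed(WHITELIST, "alice", ["marketing"]),
    "johnny, no groups": is_allowed(WHITELIST, "johnny", []),
    "anyone, no whitelist": is_allowed(None, "alice", []),
}
for label, allowed in checks.items():
    print(f"{label}: {allowed}")
```

If the Server matches differently (for example, on substrings), patterns such as `john` would also admit `johnny`, so anchoring expressions explicitly is a cautious choice.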
Change Radoop Proxy enabled connections
Radoop Proxy is automatically disabled when a process is executed on RapidMiner Server. In a typical setup, RapidMiner Server runs inside the secure zone, so there is no need to route the traffic through the Proxy.
If you have a custom, manually installed Radoop Proxy on an edge node, and RapidMiner Server (besides Studio) can only reach the Hadoop cluster via this edge node (so it runs outside the secure zone), you need to enable the Force Radoop Proxy on Server setting on the RapidMiner Server tab. This setting has no effect when running in Studio.
Alternatively, you can manually edit the radoop_connections.xml file on the Server. In this case, add the forceproxyonserver tag with the value T.
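<radoop-entries key="XkzjmytZW2ffc7+MnU11BdhzomF8355R">
    <radoop-connection-entry>
        ...
        <forceproxyonserver>T</forceproxyonserver>
        ...
    </radoop-connection-entry>
</radoop-entries>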
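The manual edit can also be scripted. The sketch below is hypothetical (`force_proxy_on_server` is not a Radoop utility) and assumes that each `radoop-connection-entry` identifies its connection through a `name` child element. It sets `forceproxyonserver` to `T` for one named connection and leaves all other entries untouched.

```python
import xml.etree.ElementTree as ET


def force_proxy_on_server(xml_text, connection_name):
    """Return the XML with forceproxyonserver=T set on the named connection."""
    root = ET.fromstring(xml_text)
    matched = False
    for entry in root.iter("radoop-connection-entry"):
        if entry.findtext("name", default="").strip() != connection_name:
            continue  # assumed: the connection name lives in a <name> child
        tag = entry.find("forceproxyonserver")
        if tag is None:
            tag = ET.SubElement(entry, "forceproxyonserver")
        tag.text = "T"
        matched = True
    if not matched:
        raise KeyError(f"no connection named {connection_name!r}")
    return ET.tostring(root, encoding="unicode")


SAMPLE = """<radoop-entries key="XkzjmytZW2ffc7+MnU11BdhzomF8355R">
  <radoop-connection-entry><name>connection-john</name></radoop-connection-entry>
  <radoop-connection-entry><name>connection-scott</name></radoop-connection-entry>
</radoop-entries>"""

updated = force_proxy_on_server(SAMPLE, "connection-john")
print(updated)
```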
Please note that the location of the Radoop Proxy connection specified in Studio for this connection needs to be the Remote Repository corresponding to this RapidMiner Server instance. Otherwise, the process will not be able to find the proxy connection when running on the Server and will fail.