You are viewing the RapidMiner Radoop documentation for version 9.9.

Amazon Elastic MapReduce (EMR) 5.x

There are multiple options to connect to an EMR cluster:

  • If RapidMiner Radoop does not run inside the EMR cluster's connected VPC:
    • Radoop Proxy (recommended)
    • SOCKS proxy
    • VPN
  • Direct access (e.g. using Amazon WorkSpaces)

The steps below follow the recommended Radoop Proxy method. Step-by-step guides covering the additional steps for the other two remote methods (SOCKS proxy and VPN) appear further down. For direct access, follow the Radoop Proxy guide but skip the parts that describe setting up the Radoop Proxy itself.

Connecting to a firewalled EMR cluster using Radoop Proxy

The following steps will guide you through starting and configuring an EMR cluster, and accessing it from RapidMiner Radoop via a RapidMiner Radoop Proxy that is running on the EMR cluster's Master Node.

  1. If you do not already have a running EMR cluster, use the advanced options on the AWS Console to create one. Select a 5.x version for the EMR release. Make sure that Hadoop, Hive and Spark are selected for installation in the Software Configuration step. Complete the rest of the configuration steps on the AWS Console, then start the cluster.

  2. SSH onto the Master node once its status is either RUNNING or WAITING on the EMR page of the AWS Console. SSH instructions can be found on the Summary tab of your EMR cluster. Make note of the Master node's public DNS name (e.g. ec2-35-85-2-17.compute-1.amazonaws.example.com), as it will be needed later for the Radoop Proxy configuration in RapidMiner Studio.

  3. Obtain the internal IP address of the Master node (e.g. 10.1.2.3) via the hostname -i command and make note of it, as it will be needed for the Radoop Connection. (Private IP and DNS information can also be obtained from the AWS Console: on the EMR Cluster details page, check "EC2 Instances" in the Hardware section.)

  4. Run the following commands to set up Spark on the cluster for Radoop. For Spark 2.x versions, the best practice is to upload the compressed Spark jar files to HDFS from their preinstalled location on the master node. (This is crucial because EMR usually installs the relevant libraries onto the file system of the master node only, whereas the worker nodes also depend on them.) On recent versions of EMR 5.x, all of this can be done by issuing the following commands on the EMR master node:

    # Set up Spark 2.* libraries from the default install location
    cd /usr/lib/spark
    zip /tmp/spark-jars.zip --junk-paths --recurse-paths ./jars
    hdfs dfs -mkdir -p /user/spark
    hdfs dfs -put /tmp/spark-jars.zip /user/spark

    # Copy PySpark libraries onto HDFS
    hdfs dfs -put ./python/lib/py4j-src.zip /user/spark
    hdfs dfs -put ./python/lib/pyspark.zip /user/spark

    # Copy SparkR libraries onto HDFS
    hdfs dfs -put ./R/lib/sparkr.zip /user/spark

    # List all the files that have been put onto HDFS in the /user/spark directory
    hdfs dfs -ls /user/spark

    If everything went well the output should be very similar to this:

    [hadoop@ip-172-31-18-147 spark]$ hdfs dfs -ls /user/spark
    Found 4 items
    -rw-r--r--   1 hadoop spark     74096 2019-07-25 17:47 /user/spark/py4j-src.zip
    -rw-r--r--   1 hadoop spark    482687 2019-07-25 17:48 /user/spark/pyspark.zip
    -rw-r--r--   1 hadoop spark 180421304 2019-07-25 17:47 /user/spark/spark-jars.zip
    -rw-r--r--   1 hadoop spark    698696 2019-07-25 17:48 /user/spark/sparkr.zip

  5. Follow the instructions in the Standalone Radoop Proxy section. Start the Radoop Proxy after the configuration has been completed.

  6. Start RapidMiner Studio and create a new Radoop Proxy connection. Use the Master node's public DNS name (from step 2) as the Radoop Proxy Server host. Make sure to test the Proxy connection via the Test Connection button.

  7. In RapidMiner Studio, create a new Radoop Connection with the following values (you can supply additional configuration parameters as needed). Advanced Radoop users can alternatively import the Example Radoop Connection xml template for EMR below, which includes all required settings listed in this table.

    Property                        Value
    Hadoop Version                  Amazon Elastic MapReduce (EMR) 5.x
    Hadoop username                 hadoop
    NameNode Address                Master node internal IP address (e.g. 10.1.2.3)
    NameNode Port                   8020
    Resource Manager Address        Master node internal IP address (e.g. 10.1.2.3)
    Resource Manager Port           8032
    JobHistory Server Address       Master node internal IP address (e.g. 10.1.2.3)
    Hadoop Advanced Parameters      Add key dfs.client.use.datanode.hostname with a value of false
    Spark Version                   Corresponding Spark version (e.g. Spark 2.3.1+)
    Use custom PySpark archive      Checked
    Custom PySpark archive paths    Add two entries: hdfs:///user/spark/py4j-src.zip and hdfs:///user/spark/pyspark.zip
    Use custom SparkR archive       Checked
    Custom SparkR archive path      hdfs:///user/spark/sparkr.zip
    Hive Server Address             Master node internal IP address (e.g. 10.1.2.3)
    Hive Username                   hive
    Use Radoop Proxy                Checked
    Radoop Proxy Connection         The Radoop Proxy connection created in step 6

    Note: Please consider fine tuning Spark memory settings as discussed here.

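    Example Radoop Connection xml template for EMR: Amazon EMR Connection Example (9.4.0)

    Setting                                             Value
    Hadoop Version                                      hadoop-emr-5.x
    Hadoop username                                     hadoop
    NameNode Address                                    <Master node internal IP address eg 10.1.2.3>
    NameNode Port                                       8020
    Resource Manager Address                            <Master node internal IP address eg 10.1.2.3>
    Resource Manager Port                               8032
    JobHistory Server Address                           <Master node internal IP address eg 10.1.2.3>
    JobHistory Server Port                              10020
    Hive Server Address                                 <Master node internal IP address eg 10.1.2.3>
    Hive Port                                           10000
    Hive Username                                       hive
    Spark Version                                       SPARK_23_1
    Spark archive path                                  hdfs:///user/spark/spark-jars.zip
    Custom PySpark archive paths                        hdfs:///user/spark/pyspark.zip,hdfs:///user/spark/py4j-src.zip
    Custom SparkR archive path                          hdfs:///user/spark/sparkr.zip
    Advanced parameter dfs.client.use.datanode.hostname false
    Advanced parameter spark.driver.extraJavaOptions    -XX:+PrintGC -XX:+PrintGCDateStamps
    Advanced parameter spark.driver.memory              2000
    Advanced parameter spark.executor.extraJavaOptions  -XX:+PrintGC -XX:+PrintGCDateStamps
    Advanced parameter spark.executor.memory            2000Mb
    Advanced parameter spark.logConf                    true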

Save the Radoop Connection and perform Quick/Full Tests accordingly.
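
Before running the tests, it can help to confirm that the four archives uploaded in step 4 are present on HDFS. The following is a minimal sketch, not part of Radoop: missing_spark_archives is a hypothetical helper name, and it assumes the archive names and the /user/spark location used in this guide. It reads an hdfs dfs -ls /user/spark listing from standard input and prints every expected archive that is absent.

```shell
# Sketch: list the step 4 archives that are missing from an
# `hdfs dfs -ls /user/spark` listing supplied on stdin.
# missing_spark_archives is a hypothetical helper, not a Radoop or Hadoop tool.
missing_spark_archives() (
    listing=$(cat)
    for archive in py4j-src.zip pyspark.zip spark-jars.zip sparkr.zip; do
        # A listing line ends with the full HDFS path of the file.
        if ! printf '%s\n' "$listing" | grep -q "/user/spark/${archive}\$"; then
            echo "$archive"
        fi
    done
)

# On the EMR master node one might run:
#   hdfs dfs -ls /user/spark | missing_spark_archives
# No output means all four archives were found.
```

If an archive is reported, repeating the corresponding hdfs dfs -put command from step 4 is the likely remedy.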

A different Hadoop username can be used, but please check that the username is created and has proper permissions and ownership rights on the /user/ directory on HDFS via hdfs dfs -ls /user.
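
The ownership check can be scripted. The sketch below is an illustration under stated assumptions: hdfs_user_dir_owner is a hypothetical helper, analyst is a made-up username, and the listing is expected in the usual hdfs dfs -ls column order (permissions, replication, owner, group, size, date, time, path).

```shell
# Sketch: print the owner of /user/<name> from `hdfs dfs -ls /user` output
# supplied on stdin. Exits non-zero if the directory is not listed.
# hdfs_user_dir_owner is a hypothetical helper name.
hdfs_user_dir_owner() {
    # Column 3 is the owner; the last column is the full path.
    awk -v path="/user/$1" '$NF == path { print $3; found = 1 } END { exit !found }'
}

# Example for a hypothetical user "analyst", run on the master node:
#   owner=$(hdfs dfs -ls /user | hdfs_user_dir_owner analyst)
#   [ "$owner" = "analyst" ] || echo "/user/analyst is missing or has another owner"
# If the directory is missing, an HDFS superuser could create it with:
#   hdfs dfs -mkdir -p /user/analyst
#   hdfs dfs -chown analyst /user/analyst
```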

Connecting to an EMR cluster using SOCKS Proxy

A SOCKS proxy is another option to connect to your EMR cluster. See the Networking Setup section for information on starting a SOCKS proxy and an SSH tunnel. Open the SSH tunnel and the SOCKS proxy before continuing.

Set up the connection in RapidMiner Studio

  1. Select Amazon Elastic MapReduce (EMR) 5.x as the Hadoop version.

  2. Set the following addresses:

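    • NameNode Address: Master node internal IP address (e.g. 10.1.2.3)
    • Resource Manager Address: Master node internal IP address (e.g. 10.1.2.3)
    • JobHistory Server Address: Master node internal IP address (e.g. 10.1.2.3)
    • Hive Server Address: localhost

  3. Set the ports if necessary.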

  4. Add the following Advanced Hadoop Parameters key-value pairs (as described in Networking Setup):

    Key                                      Value
    dfs.client.use.legacy.blockreader        true
    hadoop.rpc.socket.factory.class.default  org.apache.hadoop.net.SocksSocketFactory
    hadoop.socks.server                      localhost:1234

  5. Save the Radoop Connection and perform Quick/Full Test accordingly.
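
The settings above expect a SOCKS proxy on localhost:1234 and a Hive Server reachable on localhost. One way to provide both is a single ssh invocation, sketched below. This is an assumption-laden illustration, and the Networking Setup section remains the authoritative description. emr_tunnel_command is a hypothetical helper and emr-key.pem is a placeholder key file. The Hive port 10000 is the HiveServer2 default and may differ on your cluster. The helper only prints the command so it can be reviewed before it is run.

```shell
# Sketch: print an ssh command that opens a SOCKS proxy and forwards the Hive
# port to the master node. All argument values are placeholders.
emr_tunnel_command() (
    key_file=$1              # private key of the EC2 key pair (placeholder)
    master_public_dns=$2     # public DNS name of the master node
    master_private_ip=$3     # internal IP address of the master node
    socks_port=${4:-1234}    # must match hadoop.socks.server
    hive_port=${5:-10000}    # assumed HiveServer2 default port
    # -N: run no remote command, -D: dynamic (SOCKS) forwarding,
    # -L: forward a local port to the given remote host and port
    printf 'ssh -i %s -N -D %s -L %s:%s:%s hadoop@%s\n' \
        "$key_file" "$socks_port" "$hive_port" "$master_private_ip" \
        "$hive_port" "$master_public_dns"
)

emr_tunnel_command emr-key.pem ec2-35-85-2-17.compute-1.amazonaws.example.com 10.1.2.3
```

Keeping the SOCKS port as a parameter makes it harder for the tunnel and the hadoop.socks.server value to drift apart.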

Connecting to an EMR cluster over VPN

A VPN is another option to connect to your EMR cluster. This requires setting up a dedicated EC2 instance with the VPN software.

Setting up the VPN

If you already have a VPN established for the EMR cluster, this section can be skipped. You still need to take note of the VPN’s IP address and DNS name, and make sure the VPN is attached to the EMR cluster’s VPC and subnet.

  1. When the cluster is in a RUNNING or WAITING state, note that the private IPs and private domain names of the EC2 instances should be available.

  2. Start a VPN server using an EC2 instance in the same VPC as the EMR cluster.

  3. Connect to the VPN from your desktop

    • Check if the correct route is set up (e.g. 172.30.0.0/16)

  4. Enable the network traffic from the VPN to the EMR cluster

    • On the EMR Cluster details page, open the Master (and later the Slave) security group settings page
    • In the inbound rules, add a new rule and enable “All Traffic” from the VPC network (e.g. 172.30.0.0/16)
    • Apply this setting to both the Master and Slave security groups of the EMR cluster

  5. Optional: Set up a local hosts file (if you would like to use host/DNS names instead of IP addresses)

    • On the EMR Cluster details page, check “EC2 Instances” in the Hardware section and get the private IP and DNS.
    • Add the hostnames (DNS) and IP addresses of the nodes to your local hosts file (e.g. 172.30.1.209 ip-172-30-1-209.ec2.local)
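
The route check and the hosts file entries from the steps above can be sketched in shell. The helper names (ip_to_int, ip_in_cidr, hosts_entry) are hypothetical. The ec2.local domain and the ip-a-b-c-d naming pattern are taken from the example above and should be replaced with the private DNS names shown for your instances.

```shell
# Convert a dotted IPv4 address to a 32-bit integer.
ip_to_int() (
    IFS=.
    set -- $1    # split the address into its four octets
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeed if the IP ($1) lies inside the CIDR block ($2), e.g. 172.30.0.0/16.
ip_in_cidr() (
    net=${2%/*}
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$1") & $mask )) -eq $(( $(ip_to_int "$net") & $mask )) ]
)

# Print a hosts-file line of the form "<ip> ip-<a>-<b>-<c>-<d>.<domain>".
hosts_entry() {
    printf '%s ip-%s.%s\n' "$1" "$(printf '%s' "$1" | tr . -)" "$2"
}

# Example: confirm a node falls inside the VPN route, then print its entry.
if ip_in_cidr 172.30.1.209 172.30.0.0/16; then
    hosts_entry 172.30.1.209 ec2.local
fi
```

The printed lines can be reviewed and then appended to the local hosts file by hand.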

Setting up the Radoop Connection in RapidMiner Studio to use the VPN

When the VPN server has been established, either by the Setting up the VPN instructions above or by some external entity, the Radoop Connection can be created. Use the steps described in the Connecting to a firewalled EMR cluster using Radoop Proxy section, skipping the parts that set up the Radoop Proxy itself.
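
With the VPN connected, a quick reachability probe of the master node can rule out security group problems before testing the Radoop Connection. The sketch below is illustrative only. radoop_master_endpoints and endpoint_reachable are hypothetical helpers. Ports 8020 and 8032 come from the connection table in this guide, while 10020 (JobHistory Server) and 10000 (Hive Server) are assumed defaults. Ports on the worker nodes are not covered. The probe assumes an nc (netcat) build that supports the -z and -w options.

```shell
# Sketch: print the master-node endpoints that the Radoop Connection uses.
radoop_master_endpoints() (
    # $1: private IP or host name of the master node
    for port in 8020 8032 10020 10000; do
        echo "$1:$port"
    done
)

# Probe one host:port pair without sending data (-z), 3 second timeout (-w).
endpoint_reachable() {
    nc -z -w 3 "${1%:*}" "${1#*:}"
}

# Example, to be run with the VPN connected:
#   for ep in $(radoop_master_endpoints 10.1.2.3); do
#       endpoint_reachable "$ep" && echo "$ep open" || echo "$ep unreachable"
#   done
```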