Connecting to a 3.0.1+ Hortonworks Sandbox

At the time of writing, the latest available version of Hortonworks Data Platform (HDP) on the Hortonworks Sandbox VM is 3.0.1. This guide was created for that version.

Start and configure the Sandbox VM

  1. Download the Hortonworks Sandbox VM for VirtualBox from the Download website.

  2. Import the OVA-packaged VM into your virtualization environment (VirtualBox is covered in this guide).

  3. Start the VM. After powering it on, you have to select the first option from the boot menu, then wait for the boot to complete.

  4. Log in to the VM. You can do this by switching to the login console (Alt+F5), or even better via SSH on localhost port 2122. It is important to note that there are 2 exposed SSH ports on the VM: one belongs to the VM itself (2122), while the other (2222) belongs to a Docker container running inside the VM. The username is root and the password is hadoop for both.
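
    For reference, both SSH endpoints can be reached from the host shell like this (ports, username, and password as described above):

      # SSH into the VM itself (alternative to the Alt+F5 login console)
      ssh root@localhost -p 2122

      # SSH into the Docker container that runs the HDP services
      ssh root@localhost -p 2222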

  5. Edit /sandbox/proxy/generate-proxy-deploy-script.sh to include the following ports in the tcpPortsHDP array: 8025, 8030, 8050, 10020, 50010.

    1. vi /sandbox/proxy/generate-proxy-deploy-script.sh
    2. Find the tcpPortsHDP variable and, leaving the other values in place, add the following to the hashtable assignment:

      [8025]=8025 [8030]=8030 [8050]=8050 [10020]=10020 [50010]=50010
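
    For illustration, after the edit the assignment might look roughly like this (the pre-existing entry shown here is only an assumption and will differ by Sandbox version; what matters is that the five new entries are added):

      # tcpPortsHDP in /sandbox/proxy/generate-proxy-deploy-script.sh -- a sketch
      declare -A tcpPortsHDP=(
        [8080]=8080     # example of an entry that is already present
        [8025]=8025     # added: ResourceManager resource tracker
        [8030]=8030     # added: ResourceManager scheduler
        [8050]=8050     # added: ResourceManager
        [10020]=10020   # added: JobHistory Server
        [50010]=50010   # added: DataNode
      )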
  6. Run the edited generate-proxy-deploy-script.sh via /sandbox/proxy/generate-proxy-deploy-script.sh

    • This will re-create the /sandbox/proxy/proxy-deploy.sh script along with config files in /sandbox/proxy/conf.d and /sandbox/proxy/conf.stream.d, thus exposing the additional ports added to the tcpPortsHDP hashtable in the previous step.
  7. Run the /sandbox/proxy/proxy-deploy.sh script via /sandbox/proxy/proxy-deploy.sh

    • Running the docker ps command will show an instance named sandbox-proxy and the ports it has exposed. The values inserted into the tcpPortsHDP hashtable should appear in the output, looking like 0.0.0.0:10020->10020/tcp.
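
    For example, a quick way to list just the proxy container and its port mappings (the format string is optional; a plain docker ps works too):

      # Show the sandbox-proxy container and the ports it exposes
      docker ps --filter "name=sandbox-proxy" --format "{{.Names}}: {{.Ports}}"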
  8. These changes only made sure that the referenced ports of the Docker container are accessible on the respective ports of the VM. Since the network adapter of the VM is attached to NAT, these ports are not accessible from your local machine. To make them available you have to add the port forwarding rules listed below to the VM. In VirtualBox you can find these settings under Machine / Settings / Network / Adapter 1 / Advanced / Port Forwarding.

    Name               Protocol   Host IP     Host Port   Guest IP   Guest Port
    resourcetracker    TCP        127.0.0.1   8025                   8025
    resourcescheduler  TCP        127.0.0.1   8030                   8030
    resourcemanager    TCP        127.0.0.1   8050                   8050
    jobhistory         TCP        127.0.0.1   10020                  10020
    datanode           TCP        127.0.0.1   50010                  50010
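
    If you prefer the command line, the same rules can be added with VBoxManage while the VM is powered off (the VM name below is an assumption; check yours with VBoxManage list vms):

      # Add the NAT port-forwarding rules from the host shell
      VM="Hortonworks Sandbox HDP 3.0.1"   # assumption: replace with your actual VM name
      VBoxManage modifyvm "$VM" --natpf1 "resourcetracker,tcp,127.0.0.1,8025,,8025"
      VBoxManage modifyvm "$VM" --natpf1 "resourcescheduler,tcp,127.0.0.1,8030,,8030"
      VBoxManage modifyvm "$VM" --natpf1 "resourcemanager,tcp,127.0.0.1,8050,,8050"
      VBoxManage modifyvm "$VM" --natpf1 "jobhistory,tcp,127.0.0.1,10020,,10020"
      VBoxManage modifyvm "$VM" --natpf1 "datanode,tcp,127.0.0.1,50010,,50010"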
  9. Edit your local hosts file (on the host operating system, not in the VM) and add sandbox.hortonworks.com and sandbox-hdp.hortonworks.com to your localhost entry. At the end it should look something like this:

    127.0.0.1 localhost sandbox.hortonworks.com sandbox-hdp.hortonworks.com
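
    You can confirm that the names now resolve to the loopback address, for example (on OS X or Linux):

      # Should report replies from 127.0.0.1 for the sandbox hostname
      ping -c 1 sandbox-hdp.hortonworks.com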

  10. Reset Ambari access. Use an SSH client to log in to localhost as root, this time using port 2222! (For example, on OS X or Linux, use the command ssh root@localhost -p 2222, password: hadoop.)

    • (At first login you have to set a new root password; do so and remember it.)
    • Run ambari-admin-password-reset as the root user.
    • Provide a new admin password for Ambari.
    • Run ambari-agent restart.
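
    Put together, the whole reset step looks like this in a terminal (interactive prompts omitted):

      # Connect to the Docker container (note the port), then reset the Ambari admin password
      ssh root@localhost -p 2222
      # ...once inside, as root:
      ambari-admin-password-reset   # prompts for the new admin password
      ambari-agent restart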
  11. Open the Ambari website: http://sandbox.hortonworks.com:8080

    • Log in with admin and the password you chose in the previous step.
    • Navigate to the YARN / Configs / Memory configuration page.
    • Edit the Memory Node setting to at least 7 GB and click Override.
      • You will be prompted to create a new "YARN Configuration Group"; enter a new name.
      • On the "Save Configuration Group" dialog, click the Manage Hosts button.
      • On the "Manage YARN Configuration Groups" page, move the node from the "Default" group into the group created in the previous step.
      • A "Warning" dialog will open asking for notes; click the Save button.
      • A "Dependent Configurations" dialog will open, with Ambari offering to modify some related properties automatically. If so, untick tez.runtime.io.sort.mb to keep its original value, then click the Ok button.
        • Ambari may open a "Configurations" page with further suggestions. Review them as needed, but since that is out of the scope of this document, just click Proceed Anyway.
    • Navigate to the Hive / Configs / Advanced configuration page.
    • In the Custom hiveserver2-site section, the hive.security.authorization.sqlstd.confwhitelist.append property needs to be added via Add Property... and set to the following (it must not contain whitespace):

      radoop\.operation\.id|mapred\.job\.name|hive\.warehouse\.subdir\.inherit\.perms|hive\.exec\.max\.dynamic\.partitions|hive\.exec\.max\.dynamic\.partitions\.pernode|spark\.app\.name|hive\.remove\.orderby\.in\.subquery
    • Save the configuration and restart all affected services. More details on hive.security.authorization.sqlstd.confwhitelist.append can be found in the Hadoop Security / Configuring Apache Hive SQL Standard-based authorization section.
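
    One optional way to verify that the whitelist value is in effect after the restart is to query it from beeline inside the sandbox container. This is only a sketch, not part of the official steps; the JDBC URL below assumes the ZooKeeper-based discovery settings used later in this guide:

      # Run inside the sandbox container (ssh -p 2222); prints the effective whitelist value
      beeline -n hive \
        -u "jdbc:hive2://sandbox-hdp.hortonworks.com:2181/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" \
        -e "SET hive.security.authorization.sqlstd.confwhitelist.append;"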

Set up the connection in RapidMiner Studio

  1. Click the New Connection button and choose the Import from Cluster Manager option to create the connection directly from the configuration retrieved from Ambari.

  2. On the Import Connection from Cluster Manager dialog enter:

    • Cluster Manager URL: http://sandbox-hdp.hortonworks.com:8080
    • Username: admin
    • Password: the password set in the Reset Ambari access step.
  3. Click Import Configuration.

  4. The Hadoop Configuration Import dialog will open.

    • If successful, click the Next button and the Connection Settings dialog will open.
    • If it failed, click the Back button and review the steps above and the logs to resolve the issue(s).
  5. The Connection Settings dialog opens when the Next button is clicked in the step above. Configure it as follows.

  6. Connection Name can keep its default value or be changed as you like.

  7. Global tab

    • Hadoop Version should be Hortonworks HDP 3.x
    • Set Hadoop username to hadoop.
  8. Hadoop tab

    • NameNode Address should be sandbox-hdp.hortonworks.com
    • NameNode Port should be 8020
    • Resource Manager Address should be sandbox-hdp.hortonworks.com
    • Resource Manager Port should be 8050
    • JobHistory Server Address should be sandbox-hdp.hortonworks.com
    • JobHistory Server Port should be 10020
    • Under Advanced Hadoop Parameters add the following parameters:

      Key Value
      dfs.client.use.datanode.hostname true

      (This parameter is not required when using the Import Hadoop Configuration Files option):

      Key Value
      mapreduce.map.java.opts -Xmx256m
  9. Spark tab

    • Spark Version: select Spark 2.3 (HDP)
    • Check Use default Spark path
  10. Hive tab

    • Hive Version should be HiveServer3 (Hive 3 or newer)
    • Hive High Availability should be checked
    • ZooKeeper Quorum should be sandbox-hdp.hortonworks.com:2181
    • ZooKeeper Namespace should be hiveserver2
    • Database Name should be default
    • JDBC URL Postfix should be empty
    • Username should be hive
    • Password should be empty
    • UDFs are installed manually and Use custom database for UDFs should both be unchecked
    • Hive on Spark/Tez container reuse should be checked
  11. Click the OK button; the Connection Settings dialog will close.

  12. You can test the connection created above on the Manage Radoop Connections page by selecting the connection and clicking the Quick Test and Full Test... buttons.

If errors occur during testing, confirm that the necessary components have started correctly at http://localhost:8080/#/main/hosts/sandbox-hdp.hortonworks.com/summary.
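
If the services look healthy in Ambari but the Radoop tests still fail, a quick additional check is whether the ports used in this guide are actually reachable from your machine. The following is only a troubleshooting sketch and assumes nc is available on the host:

    # Each forwarded port should be reported as open
    for port in 8020 8025 8030 8050 10020 50010 2181 8080; do
      nc -z -w 2 localhost "$port" && echo "port $port open" || echo "port $port CLOSED"
    done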