Connecting to Cloudera Data Platform

For reference, Cloudera Data Platform Private Cloud Base version 7.1.7 was used while creating this document. The following setup guide suits a Kerberized CDP cluster with TLS authentication, supporting High Availability for the Hive, HDFS and YARN services, which is the most common production use case.

Configuring the Hadoop cluster

The cluster-side configuration steps listed below can be done by a user with admin privileges in the Cloudera Manager instance used to administer your CDP cluster.

Spark dependencies

Setup Cloudera Spark3

Radoop 10.2 supports Cloudera's Spark3 distribution out of the box. Please install the Spark3 parcel following the Cloudera documentation.

Add Java 11 to worker nodes

Cloudera supports running the cluster on Java 11, and all worker nodes must be equipped with that version. This can be achieved either by running the whole cluster on Java 11 (which can be configured in Cloudera Manager) or by installing Java 11 on all the worker nodes into a local file system directory of your choice. If you choose the latter, please share that location with Radoop users, since they will need it during their connection setup.
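Whether a node already has Java 11 can be checked by parsing the version string printed by `java -version`. The following is a minimal sketch of that check. The function name `java_major_version` is illustrative, and the version strings mentioned in the comments are assumed typical outputs rather than values taken from a CDP node.

```python
import re


def java_major_version(version_output: str) -> int:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report "1.x" (for example "1.8.0_292");
    # later releases report the major version first (for example "11.0.12").
    if first == "1" and second is not None:
        return int(second)
    return int(first)
```

Note that `java -version` writes to standard error, so when calling it through `subprocess.run(..., capture_output=True, text=True)` the text to parse is in the `stderr` attribute of the result.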

Radoop Proxy 2.0

Radoop 10.2.0+ requires Radoop Proxy 2.0 (or later) to submit Spark applications. It can be installed and managed via Cloudera Manager.

Hive setup

Allow changes of advanced HiveQL properties

Radoop relies on its ability to set certain advanced HiveQL properties during query execution. These must be explicitly enabled (whitelisted) on the cluster.

  • Navigate to Hive on Tez/Configuration in Cloudera Manager
  • Search for Hive Client Advanced Configuration Snippet (Safety Valve) for hive-site.xml and add the following for both the Service and Client configurations (the value must contain no whitespace):

    Name: hive.security.authorization.sqlstd.confwhitelist.append
    Value: radoop\.operation\.id|mapred\.job\.name|hive\.warehouse\.subdir\.inherit\.perms|hive\.exec\.max\.dynamic\.partitions|hive\.exec\.max\.dynamic\.partitions\.pernode|spark\.app\.name|hive\.remove\.orderby\.in\.subquery|radoop\.testing\.process\.name
  • If everything went well, the safety valve shows the property with the name and value given above.
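Because the whitelist value is a regular expression with escaped dots and no whitespace, it is easy to mistype. As a hedged illustration, the value can be assembled from the plain property names listed in this section. The helper name `build_whitelist` is hypothetical and not part of Radoop or Hive.

```python
# Plain property names, taken from the whitelist value in this section.
PROPERTIES = [
    "radoop.operation.id",
    "mapred.job.name",
    "hive.warehouse.subdir.inherit.perms",
    "hive.exec.max.dynamic.partitions",
    "hive.exec.max.dynamic.partitions.pernode",
    "spark.app.name",
    "hive.remove.orderby.in.subquery",
    "radoop.testing.process.name",
]


def build_whitelist(properties):
    """Escape the dots in each name and join the names with '|'."""
    for name in properties:
        if any(ch.isspace() for ch in name):
            raise ValueError(f"whitespace in property name: {name!r}")
    return "|".join(name.replace(".", r"\.") for name in properties)


WHITELIST = build_whitelist(PROPERTIES)
```

Printing `WHITELIST` gives a single line that can be pasted as the value of the property.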

Enabling Radoop UDFs in Hive

Complex functionality of Radoop is partly achieved by defining custom functions (UDF, UDAF and UDTF) for HiveServer2, extending its capabilities.

  • Hive UDF functions for Radoop 10.2+ require Hive to run on Java 11
  • Install RapidMiner Parcel
  • Navigate to Hive on Tez/Configuration in Cloudera Manager
  • Search for hive_aux_jars_path_dir and add the following value: /opt/cloudera/parcels/RAPIDMINER_LIBS/lib/radoop/
  • Restart the Hive service (and possibly other stale services) to pick up the changes.
  • Register Hive UDF functions for Radoop

YARN configuration

Set YARN's logging configuration so that the logs of YARN applications submitted by Radoop can be read. Collecting and reading the CDP default IFiles is currently not supported, so using TFiles needs to be configured explicitly. Reading YARN logs also requires read permission on the log folder in HDFS.

  • Navigate to YARN/Configuration
  • Search for yarn_log_aggregation_file_formats and set its value to TFile
  • Search for yarn_log_aggregation_TFile_remote_app_log_dir and set its value to /tmp/logs
  • Search for yarn_log_aggregation_TFile_remote_app_log_dir_suffix and set its value to /logs
  • To finish the YARN setup, restart stale services.
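The three YARN values above can be kept as a small checklist and compared against what is actually configured. The sketch below assumes the current settings have already been collected into a plain dictionary by hand or by some other means. How they are obtained is out of scope here, and the function name `find_mismatches` is hypothetical.

```python
# Expected values, taken from the steps in this section.
EXPECTED_YARN_SETTINGS = {
    "yarn_log_aggregation_file_formats": "TFile",
    "yarn_log_aggregation_TFile_remote_app_log_dir": "/tmp/logs",
    "yarn_log_aggregation_TFile_remote_app_log_dir_suffix": "/logs",
}


def find_mismatches(actual):
    """Return {setting: (expected, found)} for every setting that differs.

    A setting missing from `actual` is reported with found=None.
    """
    return {
        key: (expected, actual.get(key))
        for key, expected in EXPECTED_YARN_SETTINGS.items()
        if actual.get(key) != expected
    }
```

An empty result means all three settings match the values this guide asks for.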

Networking

Please follow the general description of networking setup for accessing a Hadoop cluster.

Security configuration

  • To ease Radoop Connection setup for Radoop users, it is recommended to create and share a technical account in Cloudera Manager with which the Radoop Connection Import Wizard can perform its job. Such an account can be created in Cloudera Manager under Administration/Users & Roles: select Add Local User and set its Roles to Read-Only.
  • Radoop users require a handful of permissions (e.g. access HDFS, execute HiveQL, submit YARN jobs), which should already be in place on a working cluster. For the exact set, please refer to Configuring Apache Ranger authorization at Hadoop Security.
  • On a Kerberized cluster, Radoop users need their keytab file and KDC details in order to authenticate to the cluster.
  • Radoop users will need the CA certificate and other trusted certificates in PEM format to establish secure communication with Hadoop services via TLS.
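Before handing a certificate bundle to Radoop users, it can help to confirm that the file really contains PEM certificate blocks. The following sketch splits a bundle into its individual blocks and decodes their Base64 payload. It only checks the PEM framing and encoding, not whether the content is a valid X.509 certificate, and the function name `split_pem_bundle` is illustrative.

```python
import base64
import re
from typing import List

# One PEM certificate block: header line, Base64 payload, footer line.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----\s*(.+?)\s*-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem_bundle(text: str) -> List[bytes]:
    """Return the decoded (DER) bytes of every certificate block in `text`."""
    certificates = []
    for body in PEM_BLOCK.findall(text):
        # Remove line breaks inside the payload before decoding;
        # validate=True rejects characters outside the Base64 alphabet.
        payload = "".join(body.split())
        certificates.append(base64.b64decode(payload, validate=True))
    return certificates
```

An empty list indicates that the file is not in PEM format (for example, a binary DER file or a Java keystore).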

Setting up the Radoop connection in RapidMiner Studio

A CDP cluster can be operated in multiple environments with different network setups. During the setup process it is crucial to consider whether the cluster is running on a separate, isolated network. If it is, the Hadoop cluster is not aware of its nodes' external addresses, hence using Radoop Proxy is required in order to operate properly.

The configurations in the following section need to be set on both secured and non-secured clusters. We strongly recommend using the Import from Cluster Manager option to create the connection, since some advanced properties required for correct operation are seamlessly gathered from the cluster during the import process.

  1. Auto-TLS Encryption ships with CDP clusters, and Radoop Proxy also supports SSL. If either of them is equipped with an untrusted (e.g. self-signed) certificate for SSL, you need to add the certificate(s) to the cacerts folder in RapidMiner Studio home in order to establish a secure communication channel.

  2. Use Import from Cluster Manager to create the connection directly from the configuration retrieved from Cloudera Manager. The import process doesn't use Radoop Proxy, so Cloudera Manager has to be accessible over the network for this task. If SSL is enabled, pick the hostname which corresponds to the certificate installed in the previous step.

  3. When using Kerberos, set Client Principal with the corresponding Keytab File, KDC Address and the REALM on the Global tab.

  4. On the Hive tab, enter the Database Name to connect to. Choose a database where privileges for all operations are granted for the given user. In case your Hadoop administrator installed RapidMiner Parcel, tick UDFs are installed manually; otherwise Radoop will register UDFs at runtime.

  5. When using Radoop Proxy, there should be a proxy connection ready for it. As a final step for a Radoop Connection, tick Use Radoop Proxy on the Radoop Proxy tab and select a Radoop Proxy Connection which has been created for this cluster.
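For step 1, if the administrator has not shared the certificate file, one possible way to obtain the certificate a server presents is Python's standard `ssl.get_server_certificate`. This is a sketch under assumptions: the host name, the port and the target directory are placeholders to be replaced with your own values, the helper names are hypothetical, and only the server's own certificate is retrieved, not the full chain. Verify the fetched certificate with your administrator before trusting it.

```python
import ssl
from pathlib import Path


def pem_filename(host: str, port: int) -> str:
    """Build a file name for the certificate of host:port."""
    return f"{host}_{port}.pem"


def save_server_certificate(host: str, port: int, target_dir: str) -> Path:
    """Fetch the certificate presented by host:port and store it as PEM.

    No trust validation is performed while fetching, which is the point
    when the certificate is self-signed, so check it before relying on it.
    """
    pem = ssl.get_server_certificate((host, port))
    target = Path(target_dir) / pem_filename(host, port)
    target.write_text(pem)
    return target


# Example call (placeholder values, requires network access):
# save_server_certificate("cm.example.com", 7183, "/path/to/cacerts")
```

The port 7183 in the example is an assumption about the Cloudera Manager TLS port; use the port your Cloudera Manager instance actually listens on.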