You are viewing the RapidMiner Studio documentation for version 9.2.

Using the Azure Data Lake Storage Connector

The Azure Data Lake Storage Connector allows you to access your Azure Data Lake Storage Gen1 account directly from RapidMiner Studio. Both read and write operations are supported. You can also read from a set of files in an Azure Data Lake Storage directory, using the Loop Azure Data Lake Storage operator. This document walks you through how to connect to your Azure Data Lake Storage Gen1 account and how to read from Azure Data Lake Storage.

Connect to your Azure Data Lake Storage Gen1 account

Before you can use the Azure Data Lake Storage connector, you have to configure your Azure environment to support remote connections and set up a new Azure Data Lake Storage Gen1 connection in RapidMiner.

For this purpose, you need to go through the following main steps (see details below).

  • Create a web application registration on Azure portal.
  • Acquire information for the remote connection.
  • Set up and test the new Azure Data Lake Storage Gen1 connection in RapidMiner.

Step 1: Create a web application registration on Azure portal

Create and configure an Azure AD web application to allow service-to-service authentication with Azure Data Lake Storage Gen1 using Azure Active Directory. Go through Step 1 to Step 3 of the Service-to-service authentication guide. The first step registers a web application that will provide access for RapidMiner to Azure Data Lake Storage. Note that you can use arbitrary values for the Name and Sign-on URL fields. The second step describes how to get your tenant ID, the application ID for the registered application, and a key that needs to be provided in RapidMiner so that it is able to use this application. The third step configures this Active Directory application to have access to your Data Lake Storage.

After performing those steps in your Azure tenant, you should have a web application registration that is configured to access some or all folders of your target Azure Data Lake Storage Gen1 resource. Note that for the file browser of the RapidMiner operators (see below) to work, you need to give Read and Execute access on the root directory and on all directories where you want to allow navigation. In addition, you need Write permission to be able to write to the cloud storage from RapidMiner. If you can work without the file browser, you can limit the permissions to the target folders and files that your operators use directly.
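To make the permission note more concrete, the following minimal Python sketch lists the directories that lie between the root and a given file, which are the directories that need to be traversable for the file browser to reach it. The function name and the example path are hypothetical and only illustrate the idea; the actual permissions are assigned in Azure, not in code like this.

```python
import posixpath


def directories_to_traverse(path: str) -> list[str]:
    """Return the root and every ancestor directory of `path`, root first.

    Illustrative helper only (not part of RapidMiner or Azure): it shows
    which directories sit on the way from the root folder to a target.
    """
    # Normalize to an absolute POSIX-style path with a single leading slash.
    normalized = posixpath.normpath("/" + path.strip("/"))
    ancestors = []
    current = posixpath.dirname(normalized)
    while True:
        ancestors.append(current)
        if current == "/":
            break
        current = posixpath.dirname(current)
    return list(reversed(ancestors))


# Hypothetical example path inside the storage account.
print(directories_to_traverse("/sales/2019/events.csv"))
```

If you skip the file browser and type paths manually, the list above is not needed for navigation, and permissions can be limited to the targets themselves as described.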

Step 2: Acquire information for the remote connection

To create the connection in RapidMiner, you need the following information. The previous step and the linked guide describe how to obtain it; the relevant sections are listed again here for convenience.

  1. Tenant ID that identifies your company's account. See "Get tenant ID".
  2. Fully Qualified Domain Name (FQDN) of your account. Example: if your Azure Data Lake Storage Gen1 account is named contoso, then the FQDN is contoso.azuredatalakestore.net by default.
  3. Application ID and application key for the web application you created. See "Get application ID and authentication key".
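As a rough illustration of how these pieces fit together, the sketch below collects the values in one place and derives the default FQDN from an account name. The class, its field names (chosen to resemble the connection parameters in RapidMiner), and the helper functions are assumptions made for this example; they are not part of any RapidMiner or Azure API.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AdlsGen1Settings:
    """Hypothetical container for the values gathered in this step."""
    tenant_id: str      # identifies your company's account
    account_fqdn: str   # e.g. contoso.azuredatalakestore.net
    client_id: str      # application ID of the web application
    client_key: str     # application key of the web application


def default_fqdn(account_name: str) -> str:
    """Build the default FQDN for a storage account name."""
    return f"{account_name}.azuredatalakestore.net"


def missing_fields(settings: AdlsGen1Settings) -> list[str]:
    """Return the names of fields that are still empty."""
    return [name for name, value in asdict(settings).items() if not value.strip()]


# Placeholder values only; replace them with your own details.
settings = AdlsGen1Settings(
    tenant_id="<tenant-id>",
    account_fqdn=default_fqdn("contoso"),
    client_id="<application-id>",
    client_key="",
)
print(settings.account_fqdn)
print(missing_fields(settings))
```

A check like `missing_fields` can be handy before opening the connection dialog, since all four values are needed there.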

Step 3: Set up and test the new Azure Data Lake Storage Gen1 connection in RapidMiner

After you have all information, it is straightforward to set up your connection in RapidMiner.

  1. Open the Manage Connections dialog in RapidMiner Studio by going to Connections > Manage Connections.

  2. Click Add Connection in the lower left.

  3. Enter a name for the new connection and select Azure Data Lake Storage Gen1 Connection as the Connection Type.

  4. Fill in the connection details of your Azure Data Lake Storage Gen1 account. Specify the tenant id, account fqdn (fully qualified domain name), client id (web application ID), and client key (password to access the web application).

  5. While not required, we recommend testing your new Azure Data Lake Storage Gen1 connection by clicking the Test button.

  6. Click Save all changes to save your connection and close the Manage Connections window. You can now start using the Azure Data Lake Storage operators.

Read from Azure Data Lake Storage

The Read Azure Data Lake Storage operator reads data from your Azure Data Lake Storage Gen1 account. The operator can be used to load arbitrary file formats, since it only downloads the files and does not process them. To process the files, you need to use additional operators such as Read CSV, Read Excel, or Read XML.

Let us start with reading a simple CSV file from Azure Data Lake Storage.

  1. Open a new process in RapidMiner Studio and choose Blank Project from the list. Drag the Read Azure Data Lake Storage operator into the Process view, and connect its output port to the result port of the process.

  2. Select your Azure Data Lake Storage Gen1 connection from the connection drop-down menu in the Parameters view.

  3. Click the file chooser button to view the files in your Azure Data Lake Storage Gen1 account. Select the file that you want to load and click Open. Note that you need Read and Execute access to the root directory if you want to use the file browser starting from the root folder. If you do not have that permission, you can type a path into the parameter field. If you have access to the parent folder of that path (file or directory) and Execute access up to the root folder, you can then open the file browser. Alternatively, you can always use a manually typed path with the operator; in that case, permissions are only checked at runtime.

    As mentioned above, the Read Azure Data Lake Storage operator does not process the contents of the specified file. In our example, we have chosen a CSV file (a comma-separated values file). This file type can be processed with the Read CSV operator.

  4. Add a Read CSV operator between the Read Azure Data Lake Storage operator and the result port. You may set the parameters of the Read CSV operator, such as the column separator, depending on the format of your CSV file.

  5. Run the process! In the Results perspective, you should see a table containing the rows and columns of your chosen CSV file.
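The split between downloading a file and parsing it can be pictured with a small Python analogy. This is not how the RapidMiner operators are implemented; it is a sketch in which the raw bytes stand in for the output of the read operator, and a hypothetical `parse_csv` function plays the role of Read CSV with a configurable column separator.

```python
import csv
import io


def parse_csv(raw: bytes, column_separator: str = ",",
              encoding: str = "utf-8") -> list[dict[str, str]]:
    """Turn raw file bytes into rows keyed by the header line."""
    text = io.StringIO(raw.decode(encoding), newline="")
    return list(csv.DictReader(text, delimiter=column_separator))


# Made-up file contents, standing in for a file fetched from cloud storage.
downloaded = b"event;count\nlogin;3\nlogout;2\n"

for row in parse_csv(downloaded, column_separator=";"):
    print(row["event"], row["count"])
```

As with the Read CSV operator, the separator has to match the file: parsing the same bytes with the default comma would yield a single column instead of two.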

You could now use further operators to work with this data, e.g., to determine how common certain events are. To write results back to Azure Data Lake Storage, you can use the Write Azure Data Lake Storage operator. It uses the same connection type as the Read Azure Data Lake Storage operator and has a similar interface. You can also read from a set of files in an Azure Data Lake Storage directory, using the Loop Azure Data Lake Storage operator. For this, you need to specify the connection and the folder that you want to process, as well as the steps of the processing loop with nested operators. For more details, read the help of the Loop Azure Data Lake Storage operator.
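The idea of looping over a folder and applying the same processing steps to every file can be sketched in Python as follows. This is only an analogy under stated assumptions: the in-memory `files` mapping stands in for a storage folder, the file names and contents are made up, and `count_events` is a hypothetical helper rather than anything provided by RapidMiner.

```python
import csv
import io
from collections import Counter


def count_events(files: dict[str, bytes], column: str,
                 column_separator: str = ",") -> Counter:
    """Count the values of `column` across all CSV files in `files`."""
    counts: Counter = Counter()
    for name in sorted(files):
        if not name.endswith(".csv"):
            continue  # skip files the loop body cannot process
        text = io.StringIO(files[name].decode("utf-8"), newline="")
        reader = csv.DictReader(text, delimiter=column_separator)
        counts.update(row[column] for row in reader)
    return counts


# Made-up folder contents: file name mapped to raw bytes.
files = {
    "a.csv": b"event\nlogin\nlogout\n",
    "b.csv": b"event\nlogin\n",
    "notes.txt": b"not a csv file",
}
print(count_events(files, "event").most_common())
```

In RapidMiner, the body of the loop corresponds to the nested operators inside the loop operator, and the result could then be written back with the write operator.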