Reading Excel files directly from your companies OneDrive

pschlunderpschlunder Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, MemberPosts:96RM Research
edited December 2018 inKnowledge Base

Nowadays many corporate Excel filestend to be online. This seems quite natural, when you consider the wide range of benefits: ease of sharing and collaboration, access from any pc with Internet connection, user right management and constant backups, just to name a few. Wouldn't it be great to include data from those files directly into your Data Mining process, even without downloading the whole file? Now you can! Introducing the newRead Excel OnlineOperator. It is part of the latest update of theSpreadsheet Table Extractionextension (download link).excel_sheet.PNGExample Excel Online Sheet Above you can see an Excel Sheet stored in a OneDrive for Business instance. Its filename is 'Customer Data.xlsx', it contains one sheet called 'Sheet1' and note, that the 4th column is not given a name, yet. Now let's access this sheet. Open up RapidMiner and search the Marketplace (found in theExtensionsmenu item) forSpreadsheet Table Extraction. After the installation is complete, you will find a new Operator, calledRead Excel Onlinein your Operator view. Drag it into your process and fill in the parameters. For the sheet shown above it will look like this:

process.pngExample process extracting data from OneDrive

TheFile Pathneeded is the relative path in yourOneDrive for Business. Unfortunately, Microsoft's API does not support private OneDrive accounts,yet. As you can see, my Excel Sheet is located directly in the upper most folder of my OneDrive for Business. If I would move it into a folder named 'Data', I would need to provide 'Data/Customer Data.xlsx' as theFile Path. Note, that no leading slash is used.

TheSheet Nameis set per default to Microsoft's current default sheet name. Make sure, that you provide the name of the sheet you intend to access. If you don't provide aCell Rangethe whole data containing range will be used. Leading rows and columns that are empty, will be skipped. Providing aCell Rangeis done using the A1 notation used in Excel. Selecting all cells starting from the 2nd column (B) and the 3rd row up to the 8th column (H) and the 10th row would require to provide 'B3:H10' as theCell Range.

Since your OneDrive for Business is an internal resource, you also need to verify, that you have access to the Excel Sheet. Therefore you need a so called authentication token. You can get one by visiting theMicrosoft Graph Explorer并与OneDrive登录凭证(often equivalent to your Microsoft Account, e.g. Office 365). After having logged in, copy the URL from the address bar into theAuth Tokenfield and the Operator will extract the token information automatically.

Executing the process results in the ExampleSet shown below:result.pngResulting ExampleSet from the extraction process

As you can see, all data was extracted successfully. Empty cells were filled up with missing values and the 4th column, which had no name, was given a generic name (here:Attribute 4). And the best part about it, the file was never completely downloaded to your machine. Only data present in the cells you wanted to access was send to your machine. So now you are able to extract only the data of interest while for example facing a large excel document with many sheets, without creating lots of temporary versions of the file on your machine.

In the rare case, that your Excel Sheet only contains data without any column names, check theNo column names givenoption to generate generic attribute names instead of using the first data row as column names.

Advanced Integration

If you have documents stored in MicrosoftSharePointsites, you can use theSharePoint Connectorextension (download link) to obtain a list of existing files and load them directly into your process. Therefore have a look atthis articlesexplaining the usage of the extension, and extract the IDs of the desired files and the site itself as described there. Using these information you can provide them via Macros to theRead Excel OnlineOperator to directly access the data from your SharePoint even without downloading the files. Checking theRead from SharePointparameter enables you to enter theSharePoint Site IDas well as theSharePoint File ID.

sharepoint_access.PNGRead from SharePoint option

An example process using both extensions for direct access to SharePoint data in Excel Sheets is given below:


























<操作符= " true " class = " spreadsheet_tabl激活e_extraction:read_excel_online" compatibility="0.2.001" expanded="true" height="68" name="Read Excel Online" width="90" x="447" y="85">









Obtaining the SharePoint site and file IDs from previously set macros<br><br><br><br><br><br><br>Make sure to check the Sheet Name and to provide a valid token.









#1 Login tohttps://developer.microsoft.com/en-us/graph/graph-explorerand copy the URL from your browsers address bar into the &quot;Auth Token&quot; field.<br>#2 Go to your SharePoint and choose a site you want to access.<br>#3 Fill in the parameters 'SharePoint URL' with the URL you used to access your SharePoint and fill in the parameter 'SharePoint Site' with the name of the site you want to access.
Select only entries containing Excle Sheets (ending with .xlsx).
Extract the SharePoint site ID from the annotated ExampleSet and provide it as a macro called 'siteId'
Loop over all remaining entries (only the ones being Excel Sheets) to access the SharePoint file IDs from the 'sharepointId' column and providing it via a macro called 'fileId'


Happy Mining,

Philipp for the RapidMiner Research Team

Acknowledgments

The extensions are developed as part of “Data Search for Data Mining (DS4DM)” project (website:http://ds4dm.com), which is sponsored by the German ministry of education and research (BMBF).

Tagged:
BalazsBarany IngoRM JEdward ey Edin_Klapic Pavithra_Rao

Comments

  • MontseMontse MemberPosts:19Maven
    Hi@pschlunder,

    I have some problems trying to read an Excel file from my Business OneDrive.
    Could you specify with more detail how to obtain the Auth Token?

    I have logged in into Microsoft Graph Explorer but the URL I have copied is not correct. When I run the process, RapidMiner tells me "The URL you provided is currently not supported for access token extraction. Please use the URL of the Graph Explorer as explained in the operators documentation".

    The steps I have done:
    1. I Log in into Microsoft Graph Explorer
    2. I have proved the Get API "all the items in my drive" and I can view in JSON format all the items
    3. Wich URL I have to copy and put into Auth Token property from Read Excel Online operator? These are the options I have tried
    a) the URL that appears in Google Chrome bar:https://developer.microsoft.com/en-us/graph/graph-explorer#
    b) the URL that appears into JSON format (when I call all the items in my drive) with the title:@odata.context
    c) the URL that appears into JSON format (when I call all the items in my drive) in one of my folders with the title: webUrl
    4. With all of them, I can't access the file.

    Thanks in advance.
    Best regards,
    Montse

  • MontseMontse MemberPosts:19Maven
    Thank you Edwin. I will prove the steps you have told me
Sign InorRegisterto comment.