Connect to any document within your SharePoint

pschlunder Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 96 RM Research
edited December 2018 in Knowledge Base

Companies and organizations often store and share information via Microsoft SharePoint sites. They are a great way of collecting and sharing information around a given topic, so many sites contain lots of Office documents and files in other formats. Integrating this information into a data mining process often involves manually searching through sites and folders and downloading files by hand. This is neither fast nor simple. We therefore created the SharePoint Connector extension to speed things up. You can download it through the RapidMiner Marketplace. It consists of the List SharePoint Files operator, which creates a list of all available files and folders, and the Download from SharePoint operator, which downloads the files of interest.

Below you can see the document section of a SharePoint site created for a small demonstration. The site groups together a project folder and a few documents in varying file formats.

Figure: Demo SharePoint Site (sharepoint_site.PNG)

The first step in integrating your SharePoint data into your data mining process is to find out the SharePoint URL of your company or organization. Just look at your browser's address bar and extract it along with your site's name. Both are underlined in the picture above. Now enter this information into the List SharePoint Files operator, which comes with the SharePoint Connector extension, as shown in the picture below.

Figure: List SharePoint Files operator configuration (list_process.png)
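
If you prefer not to pick the two values out of the address bar by eye, the following Python sketch shows one way to split a browser address into the SharePoint URL and the site name. It assumes the common `https://<tenant>.sharepoint.com/sites/<site name>/...` layout; the address used here (`contoso`, `DemoSite`) is a made-up example, and the helper `split_sharepoint_address` is hypothetical and not part of the extension.

```python
from urllib.parse import unquote, urlparse


def split_sharepoint_address(address: str) -> tuple[str, str]:
    """Return (sharepoint_url, site_name) for a browser address.

    Assumes the path starts with /sites/<site name>/; other layouts
    are rejected instead of guessed.
    """
    parsed = urlparse(address)
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) < 2 or parts[0].lower() != "sites":
        raise ValueError("address does not look like /sites/<site name>/...")
    sharepoint_url = f"{parsed.scheme}://{parsed.netloc}"
    site_name = unquote(parts[1])  # decode e.g. %20 in the site name
    return sharepoint_url, site_name


if __name__ == "__main__":
    # Made-up example address, not a real site.
    example = "https://contoso.sharepoint.com/sites/DemoSite/Shared%20Documents/Forms/AllItems.aspx"
    url, site = split_sharepoint_address(example)
    print("SharePoint URL:", url)
    print("Site name:", site)
```

The two returned values correspond to what you would type into the operator's URL and site fields.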

Since your SharePoint site is an internal resource, you also need to prove that you have access to the information. For this you need a so-called authentication token. You can get one by visiting the Microsoft Graph Explorer and logging in with your SharePoint credentials (often the same as your Microsoft account, e.g. Office 365). After logging in, copy the URL from the address bar into the Auth Token field, and the operator will extract the token information automatically.
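
To give an idea of what "extracting the token from the URL" can look like, here is a minimal Python sketch. It is not the operator's actual implementation. It assumes the token is carried in an `access_token` parameter in the URL fragment (or query string), as is usual for OAuth redirects; the URL and token value below are placeholders, and `extract_token` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlparse


def extract_token(address: str) -> str:
    """Return the access_token value found in a URL.

    Looks in the fragment (after '#') first, then in the query string.
    The parameter name 'access_token' is an assumption.
    """
    parsed = urlparse(address)
    for part in (parsed.fragment, parsed.query):
        values = parse_qs(part).get("access_token")
        if values:
            return values[0]
    raise ValueError("no access_token parameter found in the URL")


if __name__ == "__main__":
    # Placeholder URL and token, for illustration only.
    copied = "https://example.com/graph-explorer#access_token=abc123&token_type=bearer"
    print(extract_token(copied))
```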

If you now run the process, an ExampleSet is created that contains information about the files stored in the site you accessed. Below you can see the result of scanning the demonstration SharePoint site shown at the beginning of this post. The author and lastModifiedBy columns are redacted for this post.

Figure: Result view containing all files and folders found in the site (result.PNG)

For each entry you get the filename, its location within the site (path), a url for downloading it manually, the author's name, the creation date and time (creationDateTime), the person who modified it last (lastModifiedBy), the date and time of the last change (lastModificationDateTime), a unique sharepointId, and whether the entry is a folder or not. The operator always scans files at the given folder level. If you need to dig deeper, you can use the information derived above together with the Scan specific folder parameter to search for files and folders in a subfolder.

With this information you can, for example, filter out all entries created by a given author or select only the file formats you want in order to download them. To do so, add the Filter Examples operator or any other operator to create a more specific list of files you want to download. Providing this list to the Download from SharePoint operator lets you download all files to the destination defined in the Download Path parameter, or continue working on them by using the collection of files provided at its output port. An example process using this filtering is shown below and is provided as a tutorial process that comes with the Download from SharePoint operator.

Figure: File download and integration (read_process.png)

To continue using the files directly in your process you can, for example, use the Loop Collection operator to handle each file and use one of RapidMiner's many reading operators to extract the data into your process. Don't worry, you don't need to provide the Auth Token to the Download from SharePoint operator again. It is stored alongside the ExampleSet (as an annotation), so you don't need to handle it again. However, if you store the ExampleSet in your repository and want to download files later, your token might have expired. For that case the operator offers an option to set a new token. Again, you can just provide the URL obtained after logging into Microsoft Graph Explorer.
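
Outside of RapidMiner, the filtering step described above (by author or by file format) can be sketched in a few lines of Python. This only illustrates the logic, it does not replace the Filter Examples operator. The column names (filename, author, folder) follow the result table described earlier, while the sample rows and the helper `select_files` are invented for this example.

```python
from pathlib import PurePosixPath


def select_files(rows, author=None, extensions=None):
    """Keep non-folder rows matching an author and/or a set of extensions.

    rows: list of dicts with at least 'filename', 'author' and 'folder'.
    extensions: lowercase suffixes including the dot, e.g. {".xlsx"}.
    """
    selected = []
    for row in rows:
        if row["folder"]:
            continue  # folders cannot be downloaded as files
        if author is not None and row["author"] != author:
            continue
        suffix = PurePosixPath(row["filename"]).suffix.lower()
        if extensions is not None and suffix not in extensions:
            continue
        selected.append(row)
    return selected


if __name__ == "__main__":
    # Invented sample rows, for illustration only.
    listing = [
        {"filename": "Project", "author": "Alice", "folder": True},
        {"filename": "sales.xlsx", "author": "Alice", "folder": False},
        {"filename": "notes.docx", "author": "Bob", "folder": False},
        {"filename": "budget.XLSX", "author": "Bob", "folder": False},
    ]
    for row in select_files(listing, extensions={".xlsx"}):
        print(row["filename"])
```

Comparing the suffix in lowercase means that files such as `budget.XLSX` are matched as well.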

Happy Mining,

Philipp for the RapidMiner Research Team

Acknowledgments

The extensions are developed as part of the “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com), which is sponsored by the German Federal Ministry of Education and Research (BMBF).


Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    This is a really handy extension considering many people use SharePoint.

  • robin Member Posts: 100 Guru

    I did not expect to find this, this is really handy.
