Caching Data within a Process

pschlunder (Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member), RM Research
edited December 2018 in Knowledge Base


Problem

When creating processes, you sometimes want to store temporary ExampleSets in the repository so that you don't need to re-run long-running processes over and over again. This comes up especially when some processes depend on the results of others.

Idea

Create a library process that executes a process only if its output isn't stored in the repository yet. Otherwise, just read the output from the repository.

Solution

Before we can start creating the process, we need to set up Studio to show the "Context" view. To do so, head over to "View -> Show Panel" and select "Context".

Overview

We're going to try to retrieve an ExampleSet from the repository; if this is not possible, we execute a desired process instead.


The process we are going to create will be a library process. A library process is a standard process designed to be useful in a general use case. Hence it can be placed in a separate folder (e.g. named 'lib') and reused without changes whenever the problem it solves occurs again.

To create more general processes, Macros and the concept of Context are often used. A Macro is a variable that can be used as a placeholder, e.g. to fill in parameter values, while the Context sets up an environment for a process. In the Context (set up through the Context view), you can define input and output objects, which are accessed through the process input and output ports. Furthermore, any Macros that should be accessible from outside the process need to be defined in the Context.
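To make the placeholder idea concrete, here is a minimal Python sketch of how a %{name} placeholder in a parameter value could be filled in from a set of Macros. This is only an illustration of the concept, not how RapidMiner implements Macros; the function name expand_macros is made up for this example, and the Macro values are the test values from this post.

```python
import re

def expand_macros(value: str, macros: dict) -> str:
    """Replace every %{name} placeholder in `value` with its entry in `macros`."""
    def lookup(match):
        name = match.group(1)
        if name not in macros:
            # Illustrative choice: fail loudly on an undefined Macro.
            raise KeyError(f"undefined macro: {name}")
        return macros[name]
    return re.sub(r"%\{([^}]+)\}", lookup, value)

# Macros as they might be defined in the Context of the caching process.
context_macros = {
    "repo_path": "../../data/temp/cleaned_data",
    "path_to_process": "../clean_data",
}

# A parameter value that only consists of a placeholder.
print(expand_macros("%{repo_path}", context_macros))
```

A parameter set to %{repo_path} therefore takes whatever value the calling process assigns to that Macro, which is what lets the same library process be reused with different locations.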

Step-by-Step Walkthrough


Create a new process, copy in the XML provided at the end of this post, and save it, e.g. in a 'lib' folder under a name such as 'caching'.

1. Let's check the Context.

context.PNG: Context of the caching process

In the Context view two Macros are defined: 'repo_path', which holds the location where the temporary ExampleSet is stored, and 'path_to_process', which holds the path to the process that should be executed when no ExampleSet has been created yet. We're not using 'process_path' as a Macro name, because it's a predefined one. You don't need to fill in values for this to work; I just set some up for testing without including this process inside another. Also, make sure to use relative locations, i.e. ones that are relative to the location of this library process.
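As a rough illustration of what "relative to the location of this library process" means, the following Python sketch resolves relative locations against a base folder. It assumes that relative locations are resolved against the folder containing the library process; the repository name demo_repo and both relative values are hypothetical, and the helper resolve is not part of RapidMiner.

```python
import posixpath

def resolve(base_folder: str, relative_location: str) -> str:
    """Join a relative location onto a base folder and collapse '..' segments."""
    return posixpath.normpath(posixpath.join(base_folder, relative_location))

# Hypothetical layout: the library process is saved as /demo_repo/lib/caching,
# so relative locations are assumed to start from /demo_repo/lib.
lib_folder = "/demo_repo/lib"

print(resolve(lib_folder, "../data/temp/cleaned_data"))
print(resolve(lib_folder, "../process/01 Clean Data"))
```

Under that assumption, a value starting with '../' points at a sibling folder of 'lib', so the Macro values have to be written from the library process's point of view and not from that of the process calling it.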


2. Trying to retrieve the ExampleSet via Handle Exception.

caching_process.PNG: Handling the failure of not being able to retrieve data

Handle Exception is a nested Operator. It executes the process defined inside the 'Try' section and, on failure, executes the 'Catch' section. When we try to retrieve our ExampleSet from the repository, loading the location defined by the 'repo_path' Macro raises an exception if that entry does not exist.

caching_process_inner.PNG: Inside Handle Exception

The Retrieve Operator has %{repo_path} set as the value for the repository entry parameter. The following Print to Console Operator only prints out a message stating that the ExampleSet was retrieved from the repository.


On the 'Catch' side, the Execute Process Operator is used to call the desired process in case no ExampleSet is available yet. Therefore the process location parameter is set to %{path_to_process}. Afterwards, Print to Console logs a message stating that an ExampleSet was created, and the Store Operator saves it. The Store Operator again uses %{repo_path} as the value for the repository entry parameter.
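For readers who think in code, the Try/Catch logic described above can be sketched in Python roughly as follows. This is an analogy under stated assumptions, not RapidMiner code: pickle files stand in for repository entries, and the functions retrieve, store and cached are hypothetical stand-ins for the Retrieve, Store and Handle Exception Operators.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable

def retrieve(repo_path: Path) -> Any:
    """Stand-in for Retrieve: raises FileNotFoundError if the entry is missing."""
    with repo_path.open("rb") as handle:
        return pickle.load(handle)

def store(data: Any, repo_path: Path) -> None:
    """Stand-in for Store: writes the data to the cache location."""
    repo_path.parent.mkdir(parents=True, exist_ok=True)
    with repo_path.open("wb") as handle:
        pickle.dump(data, handle)

def cached(repo_path: Path, run_process: Callable[[], Any]) -> Any:
    """Stand-in for Handle Exception wrapping Retrieve and Execute Process."""
    try:
        # 'Try' section: load the cached result and log that it was found.
        data = retrieve(repo_path)
        print(f"Retrieved cached data from {repo_path}")
    except FileNotFoundError:
        # 'Catch' section: run the expensive process, log it, store the result.
        data = run_process()
        print(f"Created data, storing it at {repo_path}")
        store(data, repo_path)
    return data

def clean_data() -> list:
    """Placeholder for a long-running preparation process."""
    return [1, 2, 3]

with tempfile.TemporaryDirectory() as repo:
    cache_location = Path(repo) / "temp" / "cleaned_data"
    first = cached(cache_location, clean_data)   # runs clean_data and stores it
    second = cached(cache_location, clean_data)  # reads the stored result
```

The first call takes the 'Catch' branch because nothing is stored yet; the second call succeeds in the 'Try' branch and never runs the expensive process.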

Usage

To illustrate the usage let's have a look at a sample repository:

repository_before.PNG: Example Repository

In this example we have two data sets stored in a 'data' folder, a 'process' folder with a preparation process named '01 Clean Data' and a follow-up process named '02 Build Model' that builds on the (cached) results, a 'lib' folder with our newly created 'caching' process, and a 'results' folder.

We'll use the caching library process at the start of the '02 Build Model' process. It will create an ExampleSet called 'cleaned_data' inside a folder named 'temp' located inside 'data'.

process_with_caching.PNG: Process using caching

To use the caching library process, we drag & drop it into the '02 Build Model' process and set up the Macros to define the location of the ExampleSet to create and the process to execute in order to create the ExampleSet.

caching_configuration.PNG: Caching configuration

To set up the Macros, first click on the drag & dropped Execute caching Operator and then on 'Edit List' for the macros parameter. A new window pops up, where you can set up the Macros defined in the Context of the caching process.

On the first execution of this process, the ExampleSet stored at 'repo_path' can't be retrieved, hence the process located at 'path_to_process' is executed and its result is stored. This leads to our example repo looking like this:

repository_after.PNG: Example Repository after creation of the cached ExampleSet

If we execute the '02 Build Model' process another time, it won't need to create the ExampleSet anymore but will fall back on reading the cached ExampleSet and thus run faster.

Find the XML code of the caching process here:







repo_path = ../../data/temp/cleaned_data
path_to_process = ../clean_data



<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">





















process to be executed if repo entry is not available






save output of process to cache location










try to load data





#1 path and filename of the temp. data set are set via a macro in the context (View -> Show Panel -> Context)

#2 If no data can be found under the path of location #1 a process is executed. The path to the process is also defined in the context.


