You are viewing the RapidMiner Studio documentation for version 9.9.
Deployments
Introduction
See also the video introductions to Model Operations: Introduction / Deployment / Management
To realize the full value of your models, you have to put them into production. In the guided approach provided by RapidMiner Studio, that means:
- Preparing your data with Turbo Prep
- Building your models with Auto Model
- Deploying your models in the Deployments View ("Model Ops")
From within Auto Model, you can deploy a model with a single click!
A deployment is a collection of models describing the same input data. In its simplest form, it lives in a repository and scores data (e.g., makes predictions), but it can do much more!
- A deployment organizes your models and keeps essential data together in one place (e.g., for compliance with regulations such as GDPR).
- A deployment tracks the performance of your models over time, alerting you to drift and bias.
- A deployment can be shared by a group collaborating on a common project.
- A deployment provides web services, so that you can integrate it with your other software.
A deployment location is a container for one or more deployments.
- Multiple deployments can be stored in a common deployment location.
- Teams working separately can have different deployment locations.
The basic philosophy is: the more models you can get into production, the better. Therefore, deployment should be as easy as possible. Why let your models go to waste?
Plan your deployment
A deployment location starts out as an empty folder, either local or remote. The best policy, once you have created this folder, is not to touch it, to avoid breaking your deployments.
Deployment location | Folder location |
---|---|
Local | RapidMiner Studio repository |
Remote | RapidMiner AI Hub repository |
A deployment location on RapidMiner AI Hub can be shared, and you can control access by appropriately configuring users and groups.
If you are not careful in planning your deployment, you may discover afterwards that it is lacking crucial components. Note that:
- monitoring helps you to collect long-term scoring statistics. To activate monitoring, you have to create a database connection.
- remote deployment allows you to share the deployment and to integrate the deployment with other software.
Make sure to create an appropriate deployment location.
Components of a deployment
Within a deployment location, you create a deployment. That deployment will provide a subset of the following components, depending on the context of the deployment location.
If you create a remote deployment with monitoring, the full set of components will be available.
Component | Description |
---|---|
Dashboard | Displays scoring statistics, measured over time |
Models | A list of competing models, built on the same or at least similar data sets (same columns and same data types), where one model is active and the remainder are challengers. For compliance with regulations such as GDPR, full details of each model are included, including the original input data set and the process that was used to build the model. |
Performance | Displays detailed scoring statistics, per model, measured over time |
Drifts | The difference between the input data distribution and the distribution of the scoring data. The difference may occur either because of bias -- because the input data was not representative -- or because the scoring data has drifted. If the measured drift is significant, you may want to rebuild your models. |
Simulator | The model simulator, as known from Auto Model |
Scoring | An interface to upload the data you want to score and to review the results |
Alerts | An alert warns you when there is unusual or undesired behavior in your deployment |
Integrations | Web services make it possible to share the deployment and integrate it with your other software |
By default, every deployment location includes Models, Simulator, and Scoring. The table below describes the additional components that become available when:
- you activate monitoring,
- you create a remote deployment location
 | RapidMiner Studio | + Database Connection |
---|---|---|
RapidMiner Studio | Models, Simulator, Scoring | Dashboard, Performance, Drifts |
+ RapidMiner AI Hub | Integrations | Alerts |
Create a deployment location
To create a remote deployment location, you need access to a RapidMiner AI Hub repository.
The following screenshots illustrate the creation of a remote deployment location, with monitoring. The steps for creating a local deployment location are mostly the same. Where the situation is different, we make a comment.
From within the Deployments Overview, click New to add a deployment location, and enter a name in the resulting dialog. In our example, we write "Monitored_Remote_Deployments", and click Next.
We choose to create a Remote deployment location. The deployments in this location will include alerts and integrations. In contrast, a Local deployment location does not include these components.
If you have not yet created an empty folder inside the repository, you can click Create New Folder. In this case, we create and select a folder called "Monitored_Remote_Deployments" inside the "Remote Repository", under /home/admin. You can create the folder anywhere you like, so long as it is empty.
Note that this repository includes a "PostgreSQL" connection, which we will need when we activate monitoring.
To activate monitoring, you need a database connection.
To activate monitoring, click the checkbox and select a database connection. Click Add New Connection if you need to create a connection. In our example, we select a pre-existing "PostgreSQL" connection.
Thanks to monitoring, all the deployments in this location will include a dashboard and summaries of performance and drifts. Without monitoring, these components are not included.
To create alerts, you need a remote deployment location and you need to activate monitoring.
To activate email alerts, click the checkbox and select a Send Mail connection. Click Add New Connection if you need to create a connection. In our example, we create and select an Email connection called "Email_Alert".
Click Create Location, and you are ready to add deployments! The deployment location we just created appears in the upper right corner of the overview, in a drop-down list together with any other deployment locations.
From within this Deployments Overview, you can Add new deployment locations, and Manage your existing deployment locations.
Create a deployment
At this stage you can click Add Deployment and create a deployment, but that deployment will have no content until you add some models. We create a deployment called "Churn", and identify our problem as a classification problem, before proceeding to build the models using Auto Model.
Example: Churn
To provide the content for our deployment, we use a data set provided in the Community Samples repository, under Community Samples > Community Real World Use Cases > TelCo Customer Churn, using part of the data set for modeling, and the rest to simulate scoring.
The issue here is not merely to predict which customers of the phone company will drop their subscriptions, but to calculate the gains achievable by the model if it can correctly identify the churners, assuming that the phone company can retain them via a rebate.
Gains
As an option, Auto Model includes a performance metric that allows you to assign a cost to every element of the confusion matrix, so that the results are no longer identified as merely true or false, but in terms of profit and loss.
During the creation of the models with Auto Model (Prepare Target), we press the button Define Costs / Benefits, and a dialog opens. Costs are identified as negative numbers, and benefits as positive.
- The default situation, with zero cost or benefit, is that the customer keeps his subscription.
- A customer who is identified by the model as a churner will be offered a rebate worth $200 to convince him to stay.
- A false negative (the model predicts the customer will stay, but he churns) is a worst case scenario. We assume that with the rebate, he would have stayed. Therefore a cost of $500, the revenue associated with a lost customer, is assigned to that element of the matrix.
Using these values, we can calculate the cost of applying the model, and compare it to the baseline cost, according to which every customer who churns (with no rebates offered) represents a loss of $500. The gain provided by the model is the difference between these two calculations.
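As a rough illustration of this calculation, here is a minimal Python sketch. It is not how Auto Model computes gains internally; it simply applies the cost values described above to a list of (predicted, actual) pairs. The function name `model_gain`, the class labels, and the four sample rows are made up for the example.

```python
# Cost (negative) or benefit (positive) per confusion-matrix cell,
# keyed by (predicted, actual). Values follow the text above.
COSTS = {
    ("stay", "stay"): 0,       # default situation: customer keeps subscription
    ("churn", "churn"): -200,  # rebate offered to a real churner
    ("churn", "stay"): -200,   # rebate offered to a customer who would have stayed
    ("stay", "churn"): -500,   # false negative: customer lost
}
BASELINE_LOSS_PER_CHURNER = -500  # no model, no rebates


def model_gain(rows):
    """Return the gain of the model over the baseline for (predicted, actual) pairs."""
    model_total = sum(COSTS[(predicted, actual)] for predicted, actual in rows)
    baseline_total = sum(
        BASELINE_LOSS_PER_CHURNER for _, actual in rows if actual == "churn"
    )
    return model_total - baseline_total


# Hypothetical sample: one of each confusion-matrix cell.
sample = [
    ("churn", "churn"),
    ("churn", "stay"),
    ("stay", "churn"),
    ("stay", "stay"),
]
print(model_gain(sample))  # model cost -900, baseline cost -1000
```

In this sample the model costs $900 where the baseline costs $1000, so the gain is $100.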
In the sections that follow, including Scoring, Dashboard, and the Performance summary, the costs / gains associated with the model will be included when presenting the results.
Auto Model
The procedure for building a set of models was discussed previously in Auto Model. Our current example is no different, except that we have defined a cost matrix as discussed above. After accepting all the defaults and creating the models, we arrive at the Auto Model overview.
Any or even all of these models can be included in our "Churn" deployment by clicking on the model in the side panel, choosing one of the submenu items, and then clicking Deploy.
Deploy the model
Deploying the model consists of three steps.
- Name the model (e.g., "Gradient Boosted Trees")
- Choose the deployment location (e.g., "Monitored_Remote_Deployments")
- Choose the deployment folder (e.g., "Churn")
If you have not yet created a deployment location or a deployment, you can do so during this process.
Click Add Model, and the model appears in the Deployments View. To include additional models in the deployment, you have to return to the Auto Model View.
Models and Scoring
This section and the following sections discuss the components of a deployment in greater detail.
The essential components of any deployment are models, the model simulator, and scoring. If you want not just to score data, but to calculate statistics for your scored data over time, you should in addition activate monitoring.
Models
Usually, a deployment will contain multiple models. Ideally, you want to use the model with the best performance, but you can't exclude the possibility that the situation will change over time, as the world changes and your input data drifts.
One model should be marked as Active, while the remaining models are marked as Challengers. The choice of active model is at all times up to you; you might choose the model with the best performance or the model with the fastest scoring times, for example. Every time you use the deployment to score new data, both the active model and the challengers calculate the results. If, at some later stage, you decide that one of the challenger models better suits your problem, you can replace the active model.
Note that in the example below, the Generalized Linear Model had the best performance when the models were built, before deployment, but it has since been superseded both by Deep Learning, marked in green because it is doing better than expected, and Gradient Boosted Trees, marked in yellow because it is doing slightly less well than expected. Both Naive Bayes and Decision Tree are doing poorly. If these models continue to perform poorly, you may want to mark them as Inactive, so that you don't waste scoring time.
Since Gradient Boosted Trees has better performance than Deep Learning, we right-click the model and select Change to Active.
If you right-click any model and select Show Details, you will see that the model contains, among other things:
- The complete Input Data set that was used to create it
- An XML representation of the Process used to build the model
For compliance with regulations such as GDPR, this information is essential.
An annotated version of the Process can be displayed in the Design View.
Simulator
The model simulator is described in more detail in Auto Model. The simulator attached to each deployment is the simulator belonging to the active model.
Scoring
For a programmatic approach to scoring data, see integrations.
Arguably, the main purpose of a deployment is to score data. Your models take new data as input, and return a result. If you have activated monitoring (highly recommended!), the results are collected so that you can keep long-term statistics.
Click Score Data, and choose a data set from the repository. Note the following points.
- The scoring data set should contain columns similar to the data set that was used to build the models. The scoring process is robust enough to accept data types that are slightly different, so long as they both have the same supertype (e.g. real or integer, since both are numeric). The process can also detect changes in the column name, so long as the data type and distribution resemble the input data.
- Data columns that were not included when you built the model will be ignored.
- Data columns that are required by the model, but missing, will be supplied, using mean values or the mode.
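The last two points can be pictured with a small Python sketch. This is only an illustration of the idea, not the actual scoring process: it assumes the fill values are the mean (numeric columns) or mode (other columns) of the data used to build the model, and it does not attempt the renamed-column detection mentioned above. The function names and the sample columns are hypothetical.

```python
from statistics import mean, mode


def fill_values(training_rows):
    """Compute a default per column: mean for numeric columns, mode otherwise."""
    fills = {}
    for column in training_rows[0]:
        values = [row[column] for row in training_rows]
        numeric = all(isinstance(v, (int, float)) for v in values)
        fills[column] = mean(values) if numeric else mode(values)
    return fills


def align_for_scoring(scoring_rows, fills):
    """Keep only the model's columns; supply defaults for missing ones."""
    return [
        {column: row.get(column, default) for column, default in fills.items()}
        for row in scoring_rows
    ]


# Hypothetical training data with two columns.
training = [
    {"tenure": 10, "contract": "monthly"},
    {"tenure": 20, "contract": "monthly"},
    {"tenure": 30, "contract": "yearly"},
]
fills = fill_values(training)

# The scoring row lacks "contract" and carries an unknown column "extra".
aligned = align_for_scoring([{"tenure": 5, "extra": 1}], fills)
print(aligned)
```

The extra column is dropped, and the missing column is filled with the most frequent training value.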
Two columns of data have a special status:
- Target column - If the target values of the scoring data are known, they can be compared with the predictions to generate statistics for the models' error rate. Often you won't have this information at the time of scoring (that's why you need a prediction!), but you can add it later if you can identify the data via an ID column.
- ID column - If you lack a Target column in your scoring data, but you have an ID column, you can resubmit the data with the same ID later, when the target values are known, to generate error rates and other statistics. Click Define Actuals to resubmit the data.
When you submit your scoring data, an attempt will be made to identify the ID and target columns. Even if these columns are identified incorrectly, you can assign the correct values by right-clicking the affected columns and choosing Use as ID or Use as target. Only columns that are not used by the model as input can be identified as ID or target columns.
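To see why the ID column matters, consider the following Python sketch, which matches predictions with actuals submitted later. It is a simplified illustration, not the product's implementation; the function name `error_rate` and the sample IDs and labels are invented for the example.

```python
def error_rate(predictions, actuals):
    """Return the fraction of wrong predictions among IDs with known actuals.

    Both arguments map an ID to a class label. IDs without an actual value
    are skipped; None is returned when nothing can be compared yet.
    """
    matched = [id_ for id_ in predictions if id_ in actuals]
    if not matched:
        return None
    errors = sum(1 for id_ in matched if predictions[id_] != actuals[id_])
    return errors / len(matched)


# Predictions made at scoring time, keyed by ID.
predictions = {1: "churn", 2: "stay", 3: "stay"}

# Actuals that became known later; customer 3 is still undecided.
actuals = {1: "churn", 2: "churn"}

print(error_rate(predictions, actuals))
```

Only the two rows with known actuals are compared, and one of them was predicted incorrectly.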
The results are returned as a table with some additional data columns:
- prediction
- confidence values
- costs
Since in our example the scoring data included target values, the error rate is calculated. Note that you can Export the results to a file or a repository.
If the checkbox Explain Predictions was checked when you ran Auto Model, the scoring data is color-coded to indicate its importance for the prediction: dark green values strongly support the prediction for that row of data, dark red values strongly support a different prediction for that row of data, and lighter colors are less important.
Monitoring
To activate monitoring, you need to create a database connection. Currently, only MySQL and PostgreSQL are supported.
Note that the Model Operations database is independent of the RapidMiner AI Hub database; if you have installed a RapidMiner AI Hub database with MySQL or PostgreSQL, you can use the same database, but there is no need to.
Every time you score data, you get a result. You could manually collect all the results and analyze them, but there is a better way.
If you activate monitoring, the scoring statistics are collected for you, so that you can check that everything is working as expected. Monitoring helps you to answer the following questions:
- Are you regularly scoring data? How much data?
- What is the error rate of your models?
- How much have you gained by applying the models to your data?
- Is the distribution of your scoring data consistent with your models (Drifts)?
- Have unusual events triggered any alerts?
- What is the average response time when you score data?
The answers to these questions are displayed cumulatively in a Dashboard and in greater detail (per model) in the Performance summary.
Dashboard
The Dashboard provides the following statistics, displayed over time. You choose the time interval: daily, weekly, monthly, or quarterly. For more detail, see the Performance summary.
Performance
The Performance summary provides even more detailed statistics than the Dashboard, displayed per model, over time. You choose the time interval: daily, weekly, monthly, or quarterly. The Performance summary is the heart of your Model Operations.
- Scores
- Errors
- Scoring Times
- Predicted Classes versus Actual Classes
- Gains from Model
Note that by clicking on Define Cost Matrix, we can redefine the cost matrix that we created earlier, leading to a revised chart in Gains from Model.
Drifts
For every column of input data that was used to build the deployment models, there is a unique distribution of values. While there is no reason to expect that the scoring data will be identical to the input data, the success of your models to some extent depends on the stability of these data distributions. When the scoring data has a different distribution than the input data, it is called drift.
Drift is not unusual; the world changes, and so do your data distributions. But if the changes are significant, you should rebuild your models. How do you know if the changes are significant? The Drifts component helps to provide an answer. For every column of input data, Drifts compares its distribution with the distribution of the scoring data. There are two charts available:
- Drift for Factors - a bar chart, with column names ordered by the amount of drift. Click on a bar to see the distributions.
- Drift vs. Importance - a scatter plot, measuring drift versus importance for each column. Click on a point to see the distributions.
The problems occur when a column that plays a significant role in scoring also has significant drift. In the worst case, a point would then appear in the upper right quadrant of the scatter plot. In the example below, there are no such problems. "Contract" is an important column, but its drift lies close to zero -- you can see that the two distributions on the right are very similar.
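The documentation does not state which measure the Drifts component uses, so the following Python sketch should be read as one possible way to quantify drift for a nominal column, not as the product's method. It uses the total variation distance between the two value distributions (0 means identical, 1 means no overlap). The function names and the threshold parameters of `needs_attention` are assumptions made for illustration.

```python
from collections import Counter


def distribution(values):
    """Return the relative frequency of each distinct value."""
    if not values:
        raise ValueError("cannot build a distribution from no values")
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def drift(input_values, scoring_values):
    """Total variation distance between two nominal distributions (assumed measure)."""
    p = distribution(input_values)
    q = distribution(scoring_values)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q))


def needs_attention(drift_value, importance, drift_limit, importance_limit):
    """Flag a column in the 'upper right quadrant': high drift and high importance."""
    return drift_value > drift_limit and importance > importance_limit


# Identical distributions give zero drift; disjoint ones give the maximum.
print(drift(["a", "a", "b", "b"], ["a", "b", "a", "b"]))
print(drift(["a", "a"], ["b", "b"]))
```

A column like "Contract" in the example above would have high importance but a drift close to zero, so it would not be flagged.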
Remote deployment
To get alerts and integrations, you need to have RapidMiner AI Hub installed, and you need to create a remote deployment location.
Alerts
To create alerts, you need a remote deployment location and you need to activate monitoring.
If in addition you want to send email alerts, the deployment location needs to include a Send Mail connection.
An alert warns you when there is unusual or undesired behavior in your deployment. The triggers for an alert include:
- an average error that exceeds a user-defined threshold
- average scoring times greater than a user-defined threshold
- fewer scores than expected within a given time period
- an active model that has larger errors than a challenger
- drift greater than a user-defined threshold.
To create an alert, click Create New Alert. When an alert is triggered, it will appear in the Dashboard, but you can also configure the alert to send you an email.
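The triggers listed above amount to simple threshold checks on the monitoring statistics. The Python sketch below illustrates that logic under assumed names; the classes `DeploymentStats` and `AlertThresholds`, their fields, and the message texts are hypothetical and do not correspond to any RapidMiner API.

```python
from dataclasses import dataclass


@dataclass
class DeploymentStats:
    average_error: float            # error rate of the active model
    best_challenger_error: float    # lowest error rate among the challengers
    average_scoring_time_ms: float
    score_count: int                # scores received in the monitored period
    max_drift: float                # largest drift over all input columns


@dataclass
class AlertThresholds:
    max_error: float
    max_scoring_time_ms: float
    min_score_count: int
    max_drift: float


def triggered_alerts(stats, limits):
    """Return the messages of all triggers that fire, in the order listed above."""
    checks = [
        (stats.average_error > limits.max_error,
         "average error above threshold"),
        (stats.average_scoring_time_ms > limits.max_scoring_time_ms,
         "average scoring time above threshold"),
        (stats.score_count < limits.min_score_count,
         "fewer scores than expected"),
        (stats.average_error > stats.best_challenger_error,
         "active model has larger error than a challenger"),
        (stats.max_drift > limits.max_drift,
         "drift above threshold"),
    ]
    return [message for fired, message in checks if fired]


# Hypothetical weekly statistics and user-defined thresholds.
stats = DeploymentStats(0.30, 0.20, 50.0, 10, 0.05)
limits = AlertThresholds(0.25, 100.0, 100, 0.10)
print(triggered_alerts(stats, limits))
```

With these sample values, three of the five triggers fire: the error threshold, the low score count, and the challenger comparison.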
Integrations
An integration is a web service that provides a Scoring URL, where you can post data for scoring. In this context, it also provides an Actuals URL, corresponding to the button Define Actuals in the scoring interface. If you're planning to automate the scoring process, a web service is precisely what you need; RapidMiner AI Hub provides a REST API to help you integrate the deployment with your other software.
The Integrations component displays the Scoring URL and allows you to test it with arbitrary values of the scoring data. On the right side of the screenshot below, you can see the Test URL and the Test Response. In this example, the response from the server is "deployment is not active", because we have forgotten to flip the Active? switch on the top right of the screen, but normally the response would include a prediction with confidence values, in JSON format.
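For automated scoring, a client would post data to the Scoring URL and parse the JSON response. The following Python sketch only prepares such a request with the standard library. The URL, the payload layout (`{"data": [...]}`), and the sample column value are placeholders, not the documented request format; copy the real Scoring URL from the Integrations component and check the Test URL there for the expected structure. Authentication, if your server requires it, is not shown.

```python
import json
import urllib.request


def build_scoring_request(scoring_url, rows):
    """Prepare a JSON POST request for a scoring web service.

    The payload layout is an assumption made for illustration.
    """
    body = json.dumps({"data": rows}).encode("utf-8")
    return urllib.request.Request(
        scoring_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL and row; replace both with values from your deployment.
request = build_scoring_request(
    "https://aihub.example.com/placeholder-scoring-url",
    [{"Contract": "Month-to-month"}],
)
print(request.get_method(), request.full_url)

# To send the request and read the JSON response, you could use:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Building the request separately from sending it makes it easy to inspect what will be posted before the deployment is switched to active.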