
Build models

Auto ML is designed to help you build predictive models from your data – fast and simple. All you need is a data set and something you want to predict. It's that simple!

As discussed in the introduction, we will guide you through the following steps:

  1. Start Auto ML -- Assuming you have a data file in rmhdf5table format linked to a project, select Start Auto ML
  2. Choose Column -- choose the column whose values you want to predict
  3. Select Inputs -- decide what's relevant and eliminate what's irrelevant
  4. Select Models -- select and build one or more models

By the end of step (4), you will have created one or more models. After that, you can inspect the models and decide which one best suits your purpose.

Step 1: Start Auto ML

The use of Auto ML presupposes that:

  1. you have a data file in rmhdf5table format

    (if you want to use your own data set, but it's not in rmhdf5table format, find out how to convert it)

  2. and that file is linked to a project

From within the Data tab or the Content tab of the project, select Start Auto ML.

Step 2: Choose Column

In what follows, we'll discuss the consequences of choosing the sample data set Churn Prediction Data. The data concerns the customers of a phone company, who may or may not give up on their subscription.

Choose column

One of the data columns -- we'll call it the target column -- has values that you want to predict. In our current example, the target column is Churn, since we want to predict who will churn. From the dropdown menu, choose Churn before clicking Next.

Choose column Churn

In general, the values of the target column can be numerical (like CustServ Calls) or categorical (like Churn). Depending on your target column, the problem will fall into one of the three following categories:

  1. Binary classification - Categorical data, two possible values (like Churn)
  2. Multiclass classification - Categorical data, three or more possible values
  3. Regression - Numerical data (like CustServ Calls)

Choose a column, and Auto ML will automatically detect what type of problem it has to solve. Additional details for each type of problem are given below.

  1. Binary Classification (predicting one of exactly two possible values)

    Some questions have a yes-or-no answer. For example, if you take a medical test, the results are often described as positive or negative:

    • Positive: the test found what you were looking for (e.g., an infection)
    • Negative: the test did not find what you were looking for (e.g., no infection)

    If the result is positive, a more thorough investigation may be necessary; if the result is negative, no more work is needed. Arguably, the positive result is more important and deserves a higher degree of attention, because the focus of medical work is to treat the infection.

    Our current problem, where Churn takes the values "yes" or "no", is an example of a binary classification problem, with the focus on "yes", since we want to predict which customers will churn.

  2. Multiclass Classification (predicting one of three or more possible values)

    If your target column has three or more non-numerical values, your problem is called a multiclass classification problem.

  3. Regression (predicting numerical values)

    If your target column is numerical, and you want to predict the numbers in that column, your problem is called a regression problem. For example, in our Churn Prediction Data, there is a column called CustServ Calls whose value is the number of times a customer has called customer service; predicting that number would be a regression problem.
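The three categories above can be sketched as a simple rule on the target column. This is only an illustration of the idea, not Auto ML's actual detection logic; the function name `detect_problem_type` and the returned labels are hypothetical.

```python
from numbers import Number


def detect_problem_type(target_values):
    """Classify a prediction problem from the values of the target column.

    Hypothetical helper for illustration; Auto ML's own rules may differ.
    """
    # Ignore missing entries when inspecting the column.
    present = [v for v in target_values if v is not None]
    if not present:
        raise ValueError("target column has no values")

    # bool is a subclass of int in Python, so exclude it from "numerical".
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in present):
        return "regression"

    distinct = set(present)
    if len(distinct) == 2:
        return "binary classification"
    if len(distinct) >= 3:
        return "multiclass classification"
    raise ValueError("target column needs at least two distinct values")


print(detect_problem_type(["yes", "no", "no", "yes"]))  # binary classification
print(detect_problem_type(["low", "mid", "high"]))      # multiclass classification
print(detect_problem_type([1, 0, 3, 2, 5]))             # regression
```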

Step 3: Select Inputs

Not all of your data columns will help you make a prediction. By discarding some of the columns, you may speed up your model-building and / or improve the model's performance. But how do you make that decision? A key point is that you're looking for patterns. Without some variation in the data and some discernible patterns, the data is not likely to be useful.

The four criteria that Auto ML uses to determine if a particular column is useful are:

  • Correlation - how closely do the values resemble the target column?
  • ID-ness - how different are the values from one another?
  • Stability - how similar are the values to one another?
  • Missing - how many values are missing from the column relative to the total?
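To make the four criteria concrete, here is a minimal sketch of how such measures could be computed for a single column. The definitions are assumptions chosen for illustration (ID-ness as the share of distinct values, stability as the share of the most frequent value, Pearson correlation against a 0/1-encoded target); Auto ML's exact formulas and thresholds are not given here, and the sample rows are made up.

```python
from collections import Counter
from math import sqrt


def missing_fraction(values):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def id_ness(values):
    """Share of distinct values among the present entries (1.0 = all unique)."""
    present = [v for v in values if v is not None]
    return len(set(present)) / len(present) if present else 0.0


def stability(values):
    """Share of present entries taken by the single most frequent value."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return Counter(present).most_common(1)[0][1] / len(present)


def correlation(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


# Made-up sample rows, loosely modeled on the column names used in this guide.
phone = ["555-0101", "555-0102", "555-0103", "555-0104"]
intl_charge = [None, None, None, 2.7]
state = ["NY", "NY", "NY", "CA"]
custserv_calls = [1, 5, 0, 4]
churn = [0, 1, 0, 1]  # "yes" encoded as 1, "no" as 0

print("ID-ness of phone:", id_ness(phone))
print("Missing in intl_charge:", missing_fraction(intl_charge))
print("Stability of state:", stability(state))
print("Correlation calls/churn:", correlation(custserv_calls, churn))
```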

Each column is marked with a quality tag: green, yellow, or red.

Tag      Quality             Possible issues
Green    Good quality        No problem!
Yellow   Needs examination   Low correlation, High correlation
Red      Poor quality        High missing values, High ID-ness, High stability

By default, Auto ML will deselect the columns marked with a red or yellow quality tag, but you are of course free to select or deselect any columns you like! Usually the defaults will work well, but you should pay careful attention if a column is marked with a yellow tag and has high correlation.

To understand the issue with high correlation, consider an extreme example: perfect correlation. If you have two columns called X and Y, and X = Y, then the correlation is 100% and X is just another name for Y. If you are predicting X, you would discard the column called Y, because it's redundant. It may be redundant even if the correlation is less than 100%. Ask yourself the following question: will I have access to the data in the highly correlated column prior to making a prediction? If not, the data is not useful.

In some cases, however, the column is useful for prediction, precisely because it is highly correlated with the target column; if you exclude it, you risk damaging your model. Only you can tell for certain. In case of doubt, you can create two models: one with the highly correlated column and one without, to help you decide which is best.

Churn Prediction Data

Select inputs

Auto ML identifies the following issues with our Churn Prediction Data:

  • High ID-ness: the Phone number is an ID, unique to each customer. It has no value in predicting churn.
  • Many missing values: only 3% of the customers have international charges (Intl Charge), so this data column won't tell us much.
  • Low correlation: there is zero correlation between Account Length and Churn. It seems that there is little or no relation between the time a customer has been with the phone company and the probability that they will churn, so Account Length is unlikely to be useful.

By default, Phone and Intl Charge are deselected. Account Length is selected, but it appears to have little value. Our attention should be focused on CustServ Calls:

  • High correlation: CustServ Calls has a 57% correlation with Churn

Apparently, the number of customer service calls is a good indicator of churn. The phone company would be well-advised to take proactive steps to keep the customer if the customer has called customer service repeatedly. But do you want to include CustServ Calls when building your model? Let's return to the question we asked a moment ago: will I have access to the data in the highly correlated column prior to making a prediction? In this case, the answer is yes. We choose therefore to include CustServ Calls in our model, with the understanding that the predictions of the model will be heavily weighted towards the value in that column.

Jump ahead to see the results with and without the customer service call data

Step 4: Select Models

Auto ML provides some of the more popular machine learning algorithms. Depending on the type of data in your target column, only a subset of these algorithms may be available.

Model                      Binary classification   Multiclass classification   Regression
Naïve Bayes                true                    true                        false
Logistic Regression        true                    true                        false
Deep Learning              true                    true                        true
Decision Tree              true                    true                        true
Generalized Linear Model   true                    true                        true
Random Forest              true                    true                        true
Gradient Boosted Trees     true                    true                        true
Support Vector Machine     true                    true                        true
Fast Large Margin          true                    true                        false
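The table above can also be read as a lookup from problem type to available models. The sketch below simply encodes the table as a Python dictionary; the names `MODEL_SUPPORT` and `available_models` and the short keys "binary", "multiclass", and "regression" are hypothetical and not part of Auto ML.

```python
# Each model maps to the set of problem types it supports, as listed in the table.
ALL_TYPES = {"binary", "multiclass", "regression"}
CLASSIFICATION_ONLY = {"binary", "multiclass"}

MODEL_SUPPORT = {
    "Naïve Bayes": CLASSIFICATION_ONLY,
    "Logistic Regression": CLASSIFICATION_ONLY,
    "Deep Learning": ALL_TYPES,
    "Decision Tree": ALL_TYPES,
    "Generalized Linear Model": ALL_TYPES,
    "Random Forest": ALL_TYPES,
    "Gradient Boosted Trees": ALL_TYPES,
    "Support Vector Machine": ALL_TYPES,
    "Fast Large Margin": CLASSIFICATION_ONLY,
}


def available_models(problem_type):
    """Return the models from the table that support the given problem type."""
    if problem_type not in ALL_TYPES:
        raise ValueError(f"unknown problem type: {problem_type!r}")
    return [name for name, kinds in MODEL_SUPPORT.items() if problem_type in kinds]


print(available_models("regression"))
```

For a regression target such as CustServ Calls, the lookup leaves out Naïve Bayes, Logistic Regression, and Fast Large Margin.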

Depending on your use case, you may want to explore all of the models or only a subset. With a subset, you can reduce the training time or focus on models that are easy to understand. We have organized the models into 3 categories based on training time and interpretability. Within each group, you can enable or disable Column Analysis and Data Preparation features.


Easily interpretable

Easily interpretable models are relatively easy to understand. The Column Analysis options, if enabled, will help you to better understand how the predictions are made, and the importance of each column.

Quick prototyping

If you are short on time or you only want to do a quick test, choose Quick Prototyping. The models in this group have a very small footprint and are usually the fastest when it comes to training time. Column Analysis options are turned off to ensure the shortest possible training time.

Higher accuracy

With Higher Accuracy, you can explore the whole set of available models. The trade-off is the potential waiting time; depending on the characteristics of the training data, it can be quite long. By default, we disable the Column Analysis options to reduce training time, but you should feel free to re-enable them if you need more insight into model behavior.

Show advanced settings

Under Show advanced settings, you can review and modify the list of selected models. In addition, you can change the sampling method (the way we split your data into training and test sets).


Training Data Selection

Warning

Changing this setting may cause training errors, especially when training on imbalanced data sets! An example of an imbalanced data set is fraud detection data, where most of the data is non-fraudulent, but a tiny fraction is fraudulent.

Training set Test set
60% 40%

Automatic Sampling randomly builds subsets to avoid potential bias coming from row order. When the target column is categorical, it builds training and test sets through random selection while keeping the same class distribution as in the original data set.

Linear Sampling will keep the order of the original data set when building training and test sets. This option is useful when the rows are not independent (e.g., time series data).
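The difference between the two sampling methods can be sketched in a few lines. This is an illustration under assumptions, not Auto ML's implementation: the function names are hypothetical, the 60/40 ratio comes from the table above, and the class-preserving random split is written as a plain stratified shuffle.

```python
import random
from collections import defaultdict


def linear_split(rows, train_fraction=0.6):
    """Keep the original row order: the first part trains, the rest tests."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def stratified_split(rows, label_of, train_fraction=0.6, seed=0):
    """Random split that keeps each class's share equal in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)  # remove any bias coming from row order
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Made-up imbalanced data: 10 "yes" rows and 40 "no" rows.
rows = [(i, "yes" if i % 5 == 0 else "no") for i in range(50)]

train, test = stratified_split(rows, label_of=lambda row: row[1])
print(len(train), sum(label == "yes" for _, label in train))  # 30 6
print(len(test), sum(label == "yes" for _, label in test))    # 20 4

ordered_train, ordered_test = linear_split(rows)
print(ordered_train[0], ordered_test[0])  # (0, 'yes') (30, 'yes')
```

With the stratified split, 20% of the rows are "yes" in both sets, matching the original data. The linear split makes no such guarantee, which is one way imbalanced data can end up with a rare class poorly represented in one of the sets.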

Select the group or choose individual models, and press Run Analysis.

Next: Inspect models

Further Reading

The links below provide more information about the predictive model algorithms used in Auto ML: