Build models
Auto ML is designed to help you build predictive models from your data – fast and simple. All you need is a data set and something you want to predict. It's that simple!
As discussed in the introduction, we will guide you through the following steps:
- Start Auto ML -- assuming you have a data file in rmhdf5table format linked to a project, select Start Auto ML
- Choose Column -- choose the column whose values you want to predict
- Select Inputs -- decide what's relevant and eliminate what's irrelevant
- Select Models -- select and build one or more models
By the end of step 4, you will have created one or more models. After that, you can inspect the models and decide which one best suits your purpose.
Step 1: Start Auto ML
tip
To follow the documentation step by step:
The use of Auto ML presupposes that:
you have a data file in rmhdf5table format
(if you want to use your own data set, but it's not in rmhdf5table format, find out how to convert it)
and that file is linked to a project.
From within the Data tab or the Content tab of the project, select Start Auto ML.
Step 2: Choose Column
In what follows, we'll discuss the consequences of choosing the sample data set Churn Prediction Data. The data concerns the customers of a phone company, who may or may not give up their subscription.
One of the data columns -- we'll call it the target column -- has values that you want to predict. In our current example, the target column is Churn, since we want to predict who will churn. From the dropdown menu, choose Churn before clicking Next.
In general, the values of the target column can be numerical (like CustServ Calls) or categorical (like Churn). Depending on your target column, the problem will fall into one of the three following categories:
- Binary classification - Categorical data, two possible values (like Churn)
- Multiclass classification - Categorical data, three or more possible values
- Regression - Numerical data (like CustServ Calls)
Choose a column, and Auto ML will automatically detect what type of problem it has to solve. Additional details for each type of problem are given below.
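The three categories above can be sketched in a few lines of Python. This only illustrates the classification rule, not how Auto ML is actually implemented; the function name `detect_problem_type` and the choice to treat any non-numeric column as categorical are assumptions made for the example.

```python
from numbers import Number


def detect_problem_type(target_values):
    """Classify a prediction problem from the values of its target column.

    Assumption: a column is numerical only if every non-missing value is a
    number; anything else is treated as categorical. Missing values are None.
    """
    values = [v for v in target_values if v is not None]
    if not values:
        raise ValueError("target column has no values")

    # bool is a subclass of int in Python, so exclude it explicitly.
    is_numerical = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    if is_numerical:
        return "regression"

    distinct = set(values)
    if len(distinct) == 2:
        return "binary classification"
    if len(distinct) >= 3:
        return "multiclass classification"
    raise ValueError("a categorical target needs at least two distinct values")


print(detect_problem_type(["yes", "no", "no", "yes"]))  # binary classification
print(detect_problem_type([1, 0, 3, 2, 5]))             # regression
```

A column such as Churn, with the values "yes" and "no", falls into the first branch for categorical data, while a count such as CustServ Calls is treated as a regression target.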
Binary Classification (predicting one of exactly two possible values)
Some questions have a yes-or-no answer. For example, if you take a medical test, the results are often described as positive or negative:
- Positive: the test found what you were looking for (e.g., an infection)
- Negative: the test did not find what you were looking for (e.g., no infection)
If the result is positive, a more thorough investigation may be necessary; if the result is negative, no more work is needed. Arguably, the positive result is more important and deserves a higher degree of attention, because the focus of medical work is to treat the infection.
Our current problem, in which Churn takes the values "yes" or "no", is an example of a binary classification problem, with the focus on "yes", since we want to predict which customers will churn.
Multiclass Classification (predicting one of three or more possible values)
If your target column has three or more non-numerical values, your problem is called a multiclass classification problem.
Regression (predicting numerical values)
If your target column is numerical, and you want to predict the numbers in that column, your problem is called a regression problem. For example, in our Churn Prediction Data, there is a column called CustServ Calls whose value is the number of times a customer has called customer service.
Step 3: Select Inputs
Not all of your data columns will help you make a prediction. By discarding some of the columns, you may speed up your model-building and / or improve the model's performance. But how do you make that decision? A key point is that you're looking for patterns. Without some variation in the data and some discernible patterns, the data is not likely to be useful.
The four criteria that Auto ML uses to determine if a particular column is useful are:
- Correlation - how closely do the values resemble the target column?
- ID-ness - how different are the values from one another?
- Stability - how similar are the values to one another?
- Missing - how many values are missing from the column relative to the total?
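As a rough illustration of three of these criteria, the sketch below computes ID-ness, stability, and the missing ratio for a single column. The formulas are assumed definitions chosen for the example (the source does not state how Auto ML computes them), and `column_quality` and the toy columns are hypothetical.

```python
from collections import Counter


def column_quality(values):
    """Return ID-ness, stability and missing ratio for one column.

    Assumed definitions (not taken from Auto ML itself):
      id_ness   = distinct values / non-missing values
      stability = share of the most frequent value among non-missing values
      missing   = missing values / all values
    Missing values are represented as None.
    """
    total = len(values)
    if total == 0:
        raise ValueError("column is empty")

    present = [v for v in values if v is not None]
    missing = (total - len(present)) / total
    if not present:
        return {"id_ness": 0.0, "stability": 0.0, "missing": missing}

    counts = Counter(present)
    most_common_count = counts.most_common(1)[0][1]
    return {
        "id_ness": len(counts) / len(present),
        "stability": most_common_count / len(present),
        "missing": missing,
    }


# Hypothetical toy columns, for illustration only.
phone = ["555-0001", "555-0002", "555-0003", "555-0004"]
state = ["OH", "OH", "OH", None]

print(column_quality(phone))  # every value is distinct, so id_ness is 1.0
print(column_quality(state))  # a single repeated value, so stability is 1.0
```

Under these definitions, a column of unique identifiers scores the maximum ID-ness, and a column that holds the same value everywhere scores the maximum stability; neither carries a pattern a model could learn from.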
Each column is marked with a quality tag: green, yellow, or red.
Green: Good quality | Yellow: Needs examination | Red: Poor quality |
---|---|---|
No problem! | | |
By default, Auto ML will deselect the columns marked with a red or yellow quality tag, but you are of course free to select or deselect any columns you like! Usually the defaults will work well, but you should pay careful attention if a column is marked with a yellow tag and has high correlation.
To understand the issue with high correlation, consider an extreme example: perfect correlation. If you have two columns called X and Y, and X = Y, then the correlation is 100% and X is just another name for Y. If you are predicting X, you would discard the column called Y, because it's redundant. It may be redundant even if the correlation is less than 100%. Ask yourself the following question: will I have access to the data in the highly-correlated column prior to making a prediction? If not, the data is not useful.
In some cases, however, the column is useful for prediction, precisely because it is highly correlated with the target column; if you exclude it, you risk damaging your model. Only you can tell for certain. In case of doubt, you can create two models: one with the highly-correlated column and one without, to help you decide which is best.
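To make the perfect-correlation example concrete, here is a minimal sketch that computes the Pearson correlation coefficient between two numerical columns. Whether Auto ML uses Pearson correlation is an assumption; the helper `pearson` and the sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two columns of equal length with at least 2 rows")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Toy data: Y is an exact copy of X, so the correlation is 100%.
x = [1.0, 2.0, 3.0, 4.0]
y = list(x)
print(pearson(x, y))  # 1.0
```

A value close to 1.0 (or to -1.0) between an input column and the target is the situation in which the question above, whether the column is available before the prediction is made, matters most.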
Churn Prediction Data
Auto ML identifies the following issues with our Churn Prediction Data:
- High ID-ness: the Phone number is an ID, unique to each customer. It has no value in predicting churn.
- Many missing values: only 3% of the customers have international charges (Intl Charge), so this data column won't tell us much.
- Low correlation: there is zero correlation between Account Length and Churn. It seems that there is little or no relation between the time a customer has been with the phone company and the probability that the customer will churn, so Account Length is unlikely to be useful.
By default, Phone and Intl Charge are deselected. Account Length is selected, but it appears to have little value. Our attention should be focused on CustServ Calls:
- High correlation: CustServ Calls has a 57% correlation with Churn.
Apparently, the number of customer service calls is a good indicator of churn. The phone company would be well-advised to take proactive steps to keep the customer if the customer has called customer service repeatedly. But do you want to include CustServ Calls when building your model? Let's return to the question we asked a moment ago: will I have access to the data in the highly-correlated column prior to making a prediction? In this case, the answer is yes. We therefore choose to include CustServ Calls in our model, with the understanding that the predictions of the model will be heavily weighted towards the value in that column.
Jump ahead to see the results with and without the customer service call data
Step 4: Select Models
Auto ML provides some of the more popular machine learning algorithms. Depending on the type of data in your target column, only a subset of these algorithms may be available.
Model | Binary classification | Multiclass classification | Regression |
---|---|---|---|
Naïve Bayes | |||
Logistic Regression | |||
Deep Learning | |||
Decision Tree | |||
Generalized Linear Model | |||
Random Forest | |||
Gradient Boosted Trees | |||
Support Vector Machine | |||
Fast Large Margin |
Depending on your use case, you may want to explore all of the models or only a subset. With a subset, you can reduce the training time or focus on models that are easy to understand. We have organized the models into three categories based on training time and interpretability. Within each group, you can enable or disable the Column Analysis and Data Preparation features.
Easily interpretable
Easily interpretable models are relatively easy to understand. The Column Analysis options, if enabled, will help you to better understand how the predictions are made, and the importance of each column.
Quick prototyping
If you are short on time or you only want to do a quick test, choose Quick Prototyping. The models in this group have a very small footprint and are usually the fastest when it comes to training time. Column Analysis options are turned off to ensure the shortest possible training time.
Higher accuracy
With Higher Accuracy, you can explore the whole set of available models. The trade-off is the potential waiting time; depending on the characteristics of the training data, it can be quite long. By default, we disable the Column Analysis options to reduce training time, but you should feel free to re-enable them if you need more insight into model behavior.
Show advanced settings
Under Show advanced settings, you can review and modify the list of selected models. In addition, you can change the sampling method (the way we split your data into training and test sets).
Training Data Selection
warning
Changing this setting may cause training errors, especially when training on imbalanced data sets! An example of an imbalanced data set is fraud detection, where most of the data is non-fraudulent, but a tiny fraction is fraudulent.
Training set | Test set |
---|---|
60% | 40% |
Automatic Sampling randomly builds subsets to avoid potential bias coming from row order. When the target column is categorical, it builds training and test sets through random selection while keeping the same class distribution as in the original data set.
Linear Sampling will keep the order of the original data set when building training and test sets. This option is useful when the rows are not independent (e.g., time series data).
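The difference between the two sampling methods can be sketched as follows. This is an illustration under stated assumptions, not Auto ML's implementation: the function names, the fixed random seed, and the choice to put the first 60% of the rows into the training set for the order-preserving split are all hypothetical.

```python
import random
from collections import defaultdict


def linear_split(rows, train_fraction=0.6):
    """Keep the original row order: leading rows for training, the rest for testing."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def stratified_split(rows, label_of, train_fraction=0.6, seed=0):
    """Randomly split rows while keeping the class distribution in both sets."""
    rng = random.Random(seed)

    # Group the rows by class so that each class is split separately.
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    train, test = [], []
    for members in by_class.values():
        members = members[:]          # copy, so the caller's data is untouched
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])

    # Shuffle once more so that rows are not grouped by class.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Hypothetical imbalanced data: 90 "no" rows and 10 "yes" rows.
rows = [(i, "yes" if i % 10 == 0 else "no") for i in range(100)]

train, test = stratified_split(rows, label_of=lambda r: r[1])
print(len(train), len(test))                    # 60 40
print(sum(1 for r in train if r[1] == "yes"))   # 6
```

With a stratified random split, the rare class keeps its 10% share in both sets. An order-preserving split gives no such guarantee, which is one way an imbalanced data set can end up with too few examples of the rare class in training.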
Select the group or choose individual models, and press Run Analysis.
Next: Inspect models
Further Reading
The links below provide more information about the predictive model algorithms used in Auto ML: