Generalized Linear Model
Synopsis
Executes GLM algorithm using H2O 3.30.0.1.
Description
Please note that the result of this algorithm may depend on the number of threads used. Different settings may lead to slightly different outputs.
Generalized linear models (GLMs) are an extension of traditional linear models. This algorithm fits generalized linear models to the data by maximizing the log-likelihood. The elastic net penalty can be used for parameter regularization. The model fitting computation is parallel, extremely fast, and scales extremely well for models with a limited number of predictors with non-zero coefficients.
The operator starts a 1-node local H2O cluster and runs the algorithm on it. Although it uses one node, the execution is parallel. You can set the level of parallelism by changing the Settings/Preferences/General/Number of threads setting. By default it uses the recommended number of threads for the system. Only one instance of the cluster is started and it remains running until you close RapidMiner Studio.
Please note that below version 7.6, a threshold value optimized for maximal F-measure is used for prediction by default.
Input
training set
The input port expects a labeled ExampleSet.
Output
model
The Generalized Linear classification or regression model is delivered from this output port. This classification or regression model can be applied on unseen data sets for prediction of the label attribute.
example set
The ExampleSet that was given as input is passed without changing to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
weights
This port delivers the weights of the attributes with respect to the label attribute.
threshold
This port is used only for binominal classification tasks. It provides a threshold value optimized for maximal F-measure. If you wish to use this threshold value calculated by H2O, connect this output to an Apply Threshold operator, along with the scored ExampleSet. (By default, RapidMiner uses a threshold value of 0.5 when applying models.)
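To illustrate what a threshold optimized for maximal F-measure means, here is a minimal sketch in plain Python. It is not H2O's implementation: the function names, the toy data, and the choice to try every distinct score as a candidate cut-off are assumptions made for this example.

```python
def f1_score(labels, scores, threshold):
    """F1 of the positive class when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(labels, scores):
    """Try every distinct score as a cut-off and keep the one with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_score(labels, scores, t))


# Toy data (hypothetical): 1 = positive class, scores = predicted confidences.
labels = [0, 0, 1, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.90]
threshold = best_f1_threshold(labels, scores)
```

On this toy data the selected cut-off differs from 0.5, which is why the optimized threshold and the default threshold can lead to different predictions.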
Parameters
Family
The distribution family of the label. Use binomial for classification with logistic regression (multinomial for labels with more than two classes); the other families are for regression problems.
- AUTO: Automatic selection. Uses multinomial for polynominal, binomial for binominal and gaussian for numeric labels.
- gaussian: The data must be numeric (real or integer).
- binomial: The data must be binominal or polynominal with 2 levels/classes.
- multinomial: The data must be polynominal with more than two levels/classes.
- poisson: The data must be numeric and non-negative (integer).
- gamma: The data must be numeric and continuous and positive (real or integer).
- tweedie: The data must be numeric and continuous (real) and non-negative.
Solver
Select the solver to use. IRLSM is fast on problems with a small number of predictors and for lambda-search with L1 penalty, while L_BFGS scales better for datasets with many columns. COORDINATE_DESCENT is IRLSM with the covariance updates version of cyclical coordinate descent in the innermost loop. COORDINATE_DESCENT_NAIVE is IRLSM with the naive updates version of cyclical coordinate descent in the innermost loop. COORDINATE_DESCENT_NAIVE and COORDINATE_DESCENT are currently experimental. Values:
- AUTO
- IRLSM
- L_BFGS
- COORDINATE_DESCENT (experimental)
- COORDINATE_DESCENT_NAIVE (experimental)
Link
The link function relates the linear predictor to the distribution function. The default is the canonical link for the specified family. Only available for gaussian, poisson and gamma families, because only one link type is possible for the others:
- Family: binomial; Link: logit
- Family: multinomial; Link: multinomial
- Family: tweedie; Link: tweedie
- family_default: Uses identity for the gaussian, log for the poisson and inverse for the gamma family.
- identity: Possible family options: Gaussian, Poisson, Gamma
- log: Possible family options: Gaussian, Poisson, Gamma
- inverse: Possible family options: Gaussian, Gamma
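The link functions listed above can be sketched in plain Python as pairs of a link and its inverse. This is an illustration only, not code from H2O or RapidMiner; the names LINKS, FAMILY_DEFAULT_LINK and predict_mean are hypothetical.

```python
import math

# Each entry maps a link name to (link, inverse_link).
# link(mu) gives the linear predictor eta; inverse_link(eta) gives the mean mu.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}

# family_default as listed above, plus the fixed logit link of the binomial family.
FAMILY_DEFAULT_LINK = {
    "gaussian": "identity",
    "poisson": "log",
    "gamma": "inverse",
    "binomial": "logit",
}


def predict_mean(family, eta):
    """Map a linear predictor value to the predicted mean using the default link."""
    _, inverse_link = LINKS[FAMILY_DEFAULT_LINK[family]]
    return inverse_link(eta)
```

For example, a linear predictor of 0.0 corresponds to a predicted mean of 1.0 under the log link and to a probability of 0.5 under the logit link.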
Reproducible
Makes model building reproducible. If set, the maximum number of threads parameter controls the parallelism level of model building. If not set, the parallelism level is defined by the number of threads in General Preferences.
Maximum number of threads
Controls parallelism level of model building.
Specify beta constraints
If enabled, beta constraints for the regular attributes can be provided.
Use regularization
Check this box if regularization should be used. For regularization, you can specify the lambda, alpha and the lambda search related parameters. If alpha or lambda is undefined (default), H2O will calculate default values for them based on the training data and the other parameters. If this parameter is set to false, lambda is set to 0.0 (means no regularization).
Lambda
The lambda parameter controls the amount of regularization applied. If lambda is 0.0, no regularization is applied and the alpha parameter is ignored (you can set this by disabling the use regularization parameter). The default value for lambda is calculated by H2O using a heuristic based on the training data. Providing multiple lambda values via the advanced parameters triggers a search.
Lambda search
A logical value indicating whether to conduct a search over the space of lambda values, starting from the max lambda. If a lambda is given, it is interpreted as the min lambda. Default is false.
Number of lambdas
The number of lambda values when lambda search = true. 0 means no preference.
Lambda min ratio
Smallest value for lambda as a fraction of lambda.max, the entry value, which is the smallest value for which all coefficients in the model are zero. If the number of observations is greater than the number of variables then default lambda_min_ratio = 0.0001; if the number of observations is less than the number of variables then default lambda_min_ratio = 0.01. Default is 0.0, which means no preference.
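A lambda search path between lambda.max and lambda.max * lambda_min_ratio can be sketched as follows. The log-scale spacing and both function names are assumptions made for this illustration; the text above does not state how H2O spaces the values, and it does not say which default applies when the number of observations equals the number of variables.

```python
def lambda_path(lambda_max, lambda_min_ratio, nlambdas):
    """Return nlambdas values from lambda_max down to lambda_max * lambda_min_ratio.

    Values are evenly spaced on a log scale (an assumption of this sketch).
    """
    if nlambdas < 2:
        return [lambda_max]
    step = lambda_min_ratio ** (1.0 / (nlambdas - 1))
    return [lambda_max * step ** i for i in range(nlambdas)]


def default_lambda_min_ratio(n_observations, n_variables):
    """Defaults quoted in the parameter description.

    The tie case (equal counts) is not specified there; this sketch
    arbitrarily returns 0.01 for it.
    """
    return 0.0001 if n_observations > n_variables else 0.01


# Five values from 1.0 down to 0.0001, each a tenth of the previous one.
path = lambda_path(1.0, 0.0001, 5)
```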
Early stopping
Check this box if early stopping should be performed on the lambda search based on the stopping rounds and stopping tolerance parameters. The stopping metric is always deviance.
Stopping rounds
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events.
Stopping tolerance
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much).
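The interplay of stopping rounds and stopping tolerance can be sketched as below. This is a simplified reading of the rule described above, not H2O's code: the function names are hypothetical, and the choice to compare the best of the last k moving averages with the moving average just before them is an assumption.

```python
def moving_averages(values, k):
    """Simple moving averages of length k over a list of metric values."""
    return [sum(values[i - k:i]) / k for i in range(k, len(values) + 1)]


def should_stop(deviances, stopping_rounds, stopping_tolerance):
    """Return True when the moving average of the deviance stopped improving.

    Lower deviance is better. The last k moving averages are compared with
    the moving average that precedes them.
    """
    k = stopping_rounds
    if k <= 0 or len(deviances) < 2 * k:
        return False  # stopping disabled or not enough scoring events yet
    averages = moving_averages(deviances, k)
    reference = averages[-(k + 1)]
    best_recent = min(averages[-k:])
    if reference <= 0:
        return True  # nothing left to improve
    # Stop if the relative improvement is smaller than the tolerance.
    return best_recent / reference >= 1.0 - stopping_tolerance
```

With stopping_rounds = 2, a deviance sequence that keeps dropping does not trigger a stop, while a sequence that has flattened out for the last few scoring events does.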
Alpha
The alpha parameter controls the distribution between the L1 (Lasso) and L2 (Ridge regression) penalties. A value of 1.0 for alpha represents Lasso, and an alpha value of 0.0 produces Ridge regression. Providing multiple alpha values via the advanced parameters triggers a search. Default is 0.0 for the L-BFGS solver, else 0.5.
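How lambda and alpha combine can be shown with a short sketch of the elastic net penalty. It assumes the common form lambda * (alpha * L1 + (1 - alpha) / 2 * squared L2) applied to the coefficients without the intercept; the function name is hypothetical and this is not taken from the operator's code.

```python
def elastic_net_penalty(coefficients, lambda_, alpha):
    """lambda * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm).

    The intercept is assumed to be excluded from `coefficients`.
    """
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lambda_ * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


beta = [1.0, -2.0]
lasso = elastic_net_penalty(beta, 0.1, 1.0)   # pure L1 penalty
ridge = elastic_net_penalty(beta, 0.1, 0.0)   # pure L2 penalty
mixed = elastic_net_penalty(beta, 0.1, 0.5)   # equal mix of both
```

Setting lambda to 0.0 makes the penalty vanish regardless of alpha, which matches the statement that alpha is ignored when no regularization is applied.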
Standardize
Standardize numeric columns to have zero mean and unit variance.
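As a minimal illustration of standardization, the sketch below centers a numeric column and divides it by its standard deviation. The function name and the use of the sample variance (dividing by n - 1) are assumptions; the text above does not say which variance estimate H2O uses.

```python
import math


def standardize(column):
    """Shift a numeric column to zero mean and scale it to unit variance."""
    n = len(column)
    mean = sum(column) / n
    if n < 2:
        return [0.0 for _ in column]  # a single value can only be centered
    variance = sum((x - mean) ** 2 for x in column) / (n - 1)  # sample variance
    std = math.sqrt(variance)
    if std == 0:
        return [0.0 for _ in column]  # constant column: only centering is possible
    return [(x - mean) / std for x in column]
```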
Non-negative coefficients
Restrict coefficients (not intercept) to be non-negative.
Compute p-values
Request p-value computation. P-values can be computed only with the IRLSM solver and no regularization. An intercept must also be added to the model. Moreover, the non-negative coefficients and specify beta constraints parameters have to be set to false to compute p-values.
Remove collinear columns
If some columns are linearly dependent, remove some of the dependent columns. Works only if an intercept is added to the model.
Add intercept
Include constant term in the model.
Missing values handling
Handling of missing values. Either Skip or MeanImputation.
- Skip: Missing values are skipped.
- MeanImputation: Missing values are replaced with the mean value.
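The two strategies can be sketched in plain Python, with None standing in for a missing value. Both function names are hypothetical, and reading Skip as "drop every row that contains a missing value" is an assumption of this sketch.

```python
def mean_impute(column):
    """Replace None entries with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]


def skip_missing(rows):
    """Drop every row that contains at least one None."""
    return [row for row in rows if all(v is not None for v in row)]
```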
Max iterations
Maximum number of iterations. 0 means no limit.
Beta constraints
Constraints for beta values. A row consists of the following values:
- Attribute name: The name of the attribute.
- Category: A value from the attribute's domain. Please take care to provide the exact value. Use more rows to specify constraints for multiple categories.
- Lower bound: Lower bound of the beta.
- Upper bound: Upper bound of the beta.
- Beta given: Specifies the given solution in proximal operator interface. The proximal operator interface allows you to run the GLM with a proximal penalty on a distance from a specified given solution.
- Beta start: Starting value of the beta.
Max runtime seconds
Maximum allowed runtime in seconds for model training. Use 0 to disable.
Expert parameters
These parameters are for fine tuning the algorithm. Usually the default values provide a decent model, but in some cases it may be useful to change them. Please use true/false values for boolean parameters and the exact attribute name for columns. Arrays can be provided by splitting the values with the comma (,) character. More information on the parameters can be found in the H2O documentation.
- score_each_iteration: Whether to score during each iteration of model training. Type: boolean, Default: false
- fold_assignment: Cross-validation fold assignment scheme, if fold_column is not specified. Options: AUTO, Random, Modulo, Stratified. Type: enumeration, Default: AUTO
- fold_column: Column name with cross-validation fold index assignment per observation. Type: column, Default: no fold column
- offset_column: Offset column name. Type: Column, Default: no offset column
- max_confusion_matrix_size: Maximum size (# classes) for confusion matrices to be printed in the Logs. Type: integer, Default: 20
- keep_cross_validation_predictions: Keep cross-validation model predictions. Type: boolean, Default: false
- keep_cross_validation_fold_assignment: Keep cross-validation fold assignment. Type: boolean, Default: false
- tweedie_variance_power: A numeric specifying the power for the variance function when family = "tweedie". Type: real, Default: 0
- tweedie_link_power: A numeric specifying the power for the link function when family = "tweedie". Type: real, Default: 1
- prior: A numeric specifying the prior probability of class 1 in the response when family = "binomial". Must be from (0,1) exclusive range or -1 (no prior). The default value is the observation frequency of class 1. Type: real, Default: -1 (no prior)
- beta_epsilon: A non-negative number specifying the magnitude of the maximum difference between the coefficient estimates from successive iterations. Defines the convergence criterion. Type: real, Default: 0.0001
- objective_epsilon: Specify a threshold for convergence. If the objective value is less than this threshold, the model is converged. Type: real, Default: -1 (no threshold)
- gradient_epsilon: (For L-BFGS only) Specify a threshold for convergence. If the objective value (using the L-infinity norm) is less than this threshold, the model is converged. Type: real, Default: 0.0001
- max_active_predictors: Specify the maximum number of active predictors during computation. This value is used as a stopping criterion to prevent expensive model building with many predictors. Type: integer, Default: -1 (no limit)
- obj_reg: Likelihood divider in objective value computation. Type: real, Default: 1/nobs
- additional_alphas: Providing additional alphas triggers a search. Ignored if alpha is undefined.
- additional_lambdas: Providing additional lambdas triggers a search. Ignored if lambda is undefined.
- nfolds: Number of folds for cross-validation. Use 0 to turn off cross-validation. Type: integer, Default: 0
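The convergence criterion described for beta_epsilon in the list above (the maximum difference between coefficient estimates from successive iterations) can be sketched as a simple check. The function name is hypothetical and the strict "less than" comparison is an assumption; the default of 0.0001 is the one listed above.

```python
def coefficients_converged(previous, current, beta_epsilon=0.0001):
    """True when no coefficient moved by more than beta_epsilon between iterations."""
    largest_change = max(abs(c - p) for p, c in zip(previous, current))
    return largest_change < beta_epsilon
```

A smaller beta_epsilon demands that the coefficients settle more tightly before the iterations are considered converged, at the cost of more iterations.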