Generalized Linear Model

Synopsis

Executes GLM algorithm using H2O 3.30.0.1.

Description

Please note that the result of this algorithm may depend on the number of threads used. Different settings may lead to slightly different outputs.

Generalized linear models (GLMs) are an extension of traditional linear models. This algorithm fits generalized linear models to the data by maximizing the log-likelihood. The elastic net penalty can be used for parameter regularization. The model fitting computation is parallel, extremely fast, and scales extremely well for models with a limited number of predictors with non-zero coefficients.

The operator starts a 1-node local H2O cluster and runs the algorithm on it. Although it uses one node, the execution is parallel. You can set the level of parallelism by changing the Settings/Preferences/General/Number of threads setting. By default, it uses the recommended number of threads for the system. Only one instance of the cluster is started, and it remains running until you close RapidMiner Studio.

Please note that in versions below 7.6, a threshold value optimized for maximal F-measure is used for prediction by default.

Input

training set

The input port expects a labeled ExampleSet.

Output

model

The Generalized Linear classification or regression model is delivered from this output port. This classification or regression model can be applied on unseen data sets for prediction of the label attribute.

example set

The ExampleSet that was given as input is passed unchanged to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

weights

This port delivers the weights of the attributes with respect to the label attribute.

threshold

This port is used only for binominal classification tasks. It provides a threshold value optimized for maximal F-measure. If you wish to use this threshold value calculated by H2O, connect this output to an Apply Threshold operator, along with the scored ExampleSet. (By default, RapidMiner uses a threshold value of 0.5 when applying models.)
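
To illustrate what a threshold "optimized for maximal F-measure" means, the following minimal sketch scans candidate thresholds over confidence scores and keeps the one with the highest F1. It is plain Python and not the H2O or RapidMiner implementation; the function names and the sample data are hypothetical.

```python
def f1_score(labels, scores, threshold):
    """F1 of the positive class when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(labels, scores):
    """Try every distinct score as a threshold and return (threshold, f1)."""
    candidates = sorted(set(scores))
    return max(((t, f1_score(labels, scores, t)) for t in candidates),
               key=lambda pair: pair[1])


# Hypothetical scored examples: 1 = positive class, 0 = negative class.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.35, 0.40, 0.45, 0.80, 0.90]
threshold, f1 = best_f1_threshold(labels, scores)
```

On this toy data the selected threshold is lower than 0.5, which shows why applying the model with the default threshold can give different predictions than applying it with the F-measure-optimized one.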

Parameters

Family

The distribution family of the label. Use binomial for classification with logistic regression (multinomial if the label has more than two classes); the other families are for regression problems.

  • AUTO: Automatic selection. Uses multinomial for polynominal, binomial for binominal and gaussian for numeric labels.
  • gaussian: The data must be numeric (real or integer).
  • binomial: The data must be binominal or polynominal with 2 levels/classes.
  • multinomial: The data must be polynominal with more than two levels/classes.
  • poisson: The data must be numeric and non-negative (integer).
  • gamma: The data must be numeric and continuous and positive (real or integer).
  • tweedie: The data must be numeric and continuous (real) and non-negative.

Solver

Select the solver to use. IRLSM is fast on problems with a small number of predictors and for lambda-search with L1 penalty, while L_BFGS scales better for datasets with many columns. COORDINATE_DESCENT is IRLSM with the covariance updates version of cyclical coordinate descent in the innermost loop. COORDINATE_DESCENT_NAIVE is IRLSM with the naive updates version of cyclical coordinate descent in the innermost loop. COORDINATE_DESCENT_NAIVE and COORDINATE_DESCENT are currently experimental. Values:

  • AUTO
  • IRLSM
  • L_BFGS
  • COORDINATE_DESCENT (experimental)
  • COORDINATE_DESCENT_NAIVE (experimental)

Link

The link function relates the linear predictor to the distribution function. The default is the canonical link for the specified family. This parameter is only available for the gaussian, poisson and gamma families, because only one link type is possible for the others:

  • Family: binomial; Link: logit
  • Family: multinomial; Link: multinomial
  • Family: tweedie; Link: tweedie
  • family_default: Uses identity for gaussian, log for poisson and inverse for gamma family.
  • identity: Possible family options: Gaussian, Poisson, Gamma
  • log: Possible family options: Gaussian, Poisson, Gamma
  • inverse: Possible family options: Gaussian, Gamma
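
As a rough illustration of how a link function connects the linear predictor to the predicted mean, the sketch below implements the identity, log, inverse and logit links together with their inverses, plus the family defaults listed above. The names `LINKS`, `FAMILY_DEFAULT_LINK` and `predict_mean` are hypothetical and are not part of the operator or of H2O.

```python
import math

# Each entry maps a link name to (link, inverse_link).
# link:         mean mu -> linear predictor eta
# inverse_link: linear predictor eta -> mean mu
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}

# Defaults as described in the list above (family_default and binomial).
FAMILY_DEFAULT_LINK = {
    "gaussian": "identity",
    "poisson": "log",
    "gamma": "inverse",
    "binomial": "logit",
}


def predict_mean(family, eta, link=None):
    """Map a linear predictor value eta to the predicted mean of the label."""
    name = link or FAMILY_DEFAULT_LINK[family]
    _, inverse_link = LINKS[name]
    return inverse_link(eta)
```

For example, `predict_mean("poisson", 0.0)` applies the inverse of the log link and returns 1.0, while `predict_mean("gaussian", 2.5, link="log")` overrides the family default.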

Reproducible

Makes model building reproducible. If set, the maximum_number_of_threads parameter controls the parallelism level of model building. If not set, the parallelism level is defined by the number of threads in General Preferences.

Maximum number of threads

Controls parallelism level of model building.

Specify beta constraints

If enabled, beta constraints for the regular attributes can be provided.

Use regularization

Check this box if regularization should be used. For regularization, you can specify the lambda, alpha and the lambda search related parameters. If alpha or lambda is undefined (default), H2O will calculate default values for them based on the training data and the other parameters. If this parameter is set to false, lambda is set to 0.0 (which means no regularization).

Lambda

The lambda parameter controls the amount of regularization applied. If lambda is 0.0, no regularization is applied and the alpha parameter is ignored (you can set this by disabling the use regularization parameter). The default value for lambda is calculated by H2O using a heuristic based on the training data. Providing multiple lambda values via the advanced parameters triggers a search.

Lambda search

A logical value indicating whether to conduct a search over the space of lambda values, starting from the max lambda. The given lambda is then interpreted as the min lambda. Default is false.

Number of lambdas

The number of lambda values when lambda search = true. 0 means no preference.

Lambda min ratio

The smallest value for lambda, given as a fraction of lambda.max (the entry value, which is the smallest lambda for which all coefficients in the model are zero). The parameter's default is 0.0, which means no preference. In that case, lambda_min_ratio = 0.0001 is used if the number of observations is greater than the number of variables, and lambda_min_ratio = 0.01 if the number of observations is less than the number of variables.
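
The relationship between lambda.max, the lambda min ratio and the number of lambdas can be sketched as follows. This is a minimal illustration that assumes a log-spaced grid (a common convention for lambda paths); the source does not state how H2O spaces the values, and the function names are hypothetical. The case where observations and variables are equal in number is not covered by the description above and is treated here like the "less than" case, which is also an assumption.

```python
def default_lambda_min_ratio(n_obs, n_vars):
    """Default ratio as described above (equal counts: assumed 0.01)."""
    return 0.0001 if n_obs > n_vars else 0.01


def lambda_path(lambda_max, lambda_min_ratio, n_lambdas):
    """Log-spaced lambdas from lambda_max down to lambda_max * lambda_min_ratio."""
    if not 0.0 < lambda_min_ratio < 1.0:
        raise ValueError("lambda_min_ratio must be in (0, 1)")
    if n_lambdas < 2:
        return [lambda_max]
    # Constant multiplicative step between consecutive lambdas.
    step = lambda_min_ratio ** (1.0 / (n_lambdas - 1))
    return [lambda_max * step ** i for i in range(n_lambdas)]


path = lambda_path(lambda_max=1.0, lambda_min_ratio=0.01, n_lambdas=3)
```

The search starts at the largest value, where all coefficients are zero, and moves toward weaker regularization at each step.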

Early stopping

Check this box if early stopping should be performed on the lambda search based on the stopping rounds and stopping tolerance parameters. The stopping metric used is always deviance.

Stopping rounds

Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events.

Stopping tolerance

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much).
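
One plausible reading of how stopping rounds and stopping tolerance interact is sketched below. It compares simple moving averages of the deviance (lower is better) and stops when the recent averages have not improved on an earlier reference average by at least the relative tolerance. The exact comparison H2O performs is not given in the source, so treat the function `should_stop` and its details as assumptions.

```python
def moving_average(values, k, end):
    """Simple moving average of the k values ending just before index `end`."""
    window = values[end - k:end]
    return sum(window) / k


def should_stop(deviances, stopping_rounds, stopping_tolerance):
    """Return True if deviance has stopped improving (assumes positive deviances)."""
    k = stopping_rounds
    n = len(deviances)
    if k <= 0 or n < 2 * k:
        return False  # disabled, or not enough scoring events yet
    # Reference: moving average that ended k scoring events ago.
    reference = moving_average(deviances, k, n - k)
    # Moving averages ending at each of the last k scoring events.
    recent = [moving_average(deviances, k, end) for end in range(n - k + 1, n + 1)]
    best_recent = min(recent)
    # Stop unless the best recent average beats the reference by the tolerance.
    return best_recent > reference * (1.0 - stopping_tolerance)
```

With `stopping_rounds=2` and `stopping_tolerance=0.01`, a history such as `[100, 90, 85, 84.9, 84.9, 84.9]` triggers a stop, while a steadily falling history such as `[100, 80, 60, 40]` does not.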

Alpha

The alpha parameter controls the distribution between the L1 (Lasso) and L2 (Ridge regression) penalties. A value of 1.0 for alpha represents Lasso, and an alpha value of 0.0 produces Ridge regression. Providing multiple alpha values via the advanced parameters triggers a search. Default is 0.0 for the L-BFGS solver, else 0.5.
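
How alpha and lambda combine can be sketched with a common elastic net parameterization, lambda * (alpha * L1 + (1 - alpha) / 2 * L2). The formula is an assumption, since this page does not state the exact form, and the function name is hypothetical; consult the H2O documentation for the definition H2O actually uses.

```python
def elastic_net_penalty(coefficients, lambda_, alpha):
    """Elastic net penalty for a list of coefficients (intercept excluded)."""
    l1 = sum(abs(b) for b in coefficients)   # Lasso part
    l2 = sum(b * b for b in coefficients)    # Ridge part
    return lambda_ * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)


beta = [1.0, -2.0]
lasso_only = elastic_net_penalty(beta, lambda_=0.1, alpha=1.0)  # pure L1
ridge_only = elastic_net_penalty(beta, lambda_=0.1, alpha=0.0)  # pure L2
mixed = elastic_net_penalty(beta, lambda_=0.1, alpha=0.5)       # half and half
```

With lambda set to 0.0 the penalty vanishes regardless of alpha, which matches the statement that alpha is ignored when no regularization is applied.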

Standardize

Standardize numeric columns to have zero mean and unit variance.
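
A minimal sketch of standardizing one numeric column is shown below. It uses the population standard deviation and maps constant columns to zeros; both choices are assumptions, as the source does not say which variance estimator H2O uses or how it treats constant columns.

```python
import statistics


def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0.0:
        # Constant column: no spread to rescale, return zeros (assumption).
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


z = standardize([1.0, 2.0, 3.0])
```

Standardization puts all numeric predictors on a comparable scale, which matters when a regularization penalty is applied to the coefficients.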

Non-negative coefficients

Restrict coefficients (not intercept) to be non-negative.

Compute p-values

Request p-values computation. P-values can only be computed with the IRLSM solver and no regularization. An intercept must also be added to the model. Moreover, the non-negative coefficients and specify beta constraints parameters have to be set to false to compute p-values.

Remove collinear columns

In case of linearly dependent columns, remove some of the dependent columns. Works only if an intercept is added to the model.

Add intercept

Include constant term in the model.

Missing values handling

Handling of missing values. Either Skip or MeanImputation.

  • Skip: Missing values are skipped.
  • MeanImputation: Missing values are replaced with the mean value.
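
The two strategies can be illustrated on a small table of numeric rows in which `None` marks a missing value. This sketch assumes that Skip drops every row containing a missing value and that MeanImputation uses the per-column mean of the observed values; the helper name `handle_missing` is hypothetical.

```python
def handle_missing(rows, strategy):
    """Apply 'Skip' or 'MeanImputation' to a list of equally long numeric rows."""
    if strategy == "Skip":
        # Assumption: a row with any missing value is left out entirely.
        return [row for row in rows if None not in row]
    if strategy == "MeanImputation":
        n_cols = len(rows[0])
        means = []
        for j in range(n_cols):
            present = [row[j] for row in rows if row[j] is not None]
            means.append(sum(present) / len(present))
        # Replace each missing value with the mean of its column.
        return [[means[j] if value is None else value
                 for j, value in enumerate(row)]
                for row in rows]
    raise ValueError("unknown strategy: " + strategy)


rows = [[1.0, 2.0], [None, 4.0], [3.0, None]]
skipped = handle_missing(rows, "Skip")
imputed = handle_missing(rows, "MeanImputation")
```

Skip keeps only complete rows and can shrink the training set considerably, whereas MeanImputation keeps every row at the cost of introducing estimated values.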

Max iterations

Maximum number of iterations. 0 means no limit.

Beta constraints

Constraints for beta values. A row consists of the following values:

  • Attribute name: The name of the attribute.
  • Category: A value from the attribute's domain. Please take care to provide the exact value. Use more rows to specify constraints for multiple categories.
  • Lower bound: Lower bound of the beta.
  • Upper bound: Upper bound of the beta.
  • Beta given: Specifies the given solution in proximal operator interface. The proximal operator interface allows you to run the GLM with a proximal penalty on a distance from a specified given solution.
  • Beta start: Starting value of the beta.

Max runtime seconds

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Expert parameters

These parameters are for fine tuning the algorithm. Usually the default values provide a decent model, but in some cases it may be useful to change them. Please use true/false values for boolean parameters and the exact attribute name for columns. Arrays can be provided by splitting the values with the comma (,) character. More information on the parameters can be found in the H2O documentation.

  • score_each_iteration: Whether to score during each iteration of model training. Type: boolean, Default: false
  • fold_assignment: Cross-validation fold assignment scheme, if fold_column is not specified. Options: AUTO, Random, Modulo, Stratified. Type: enumeration, Default: AUTO
  • fold_column: Column name with cross-validation fold index assignment per observation. Type: column, Default: no fold column
  • offset_column: Offset column name. Type: Column, Default: no offset column
  • max_confusion_matrix_size: Maximum size (# classes) for confusion matrices to be printed in the Logs. Type: integer, Default: 20
  • keep_cross_validation_predictions: Keep cross-validation model predictions. Type: boolean, Default: false
  • keep_cross_validation_fold_assignment: Keep cross-validation fold assignment. Type: boolean, Default: false
  • tweedie_variance_power: A numeric specifying the power for the variance function when family = "tweedie". Type: real, Default: 0
  • tweedie_link_power: A numeric specifying the power for the link function when family = "tweedie". Type: real, Default: 1
  • prior: A numeric specifying the prior probability of class 1 in the response when family = "binomial". Must be from (0,1) exclusive range or -1 (no prior). The default value is the observation frequency of class 1. Type: real, Default: -1 (no prior)
  • beta_epsilon: A non-negative number specifying the magnitude of the maximum difference between the coefficient estimates from successive iterations. Defines the convergence criterion. Type: real, Default: 0.0001
  • objective_epsilon: Specify a threshold for convergence. If the objective value is less than this threshold, the model is converged. Type: real, Default: -1 (no threshold)
  • gradient_epsilon: (For L-BFGS only) Specify a threshold for convergence. If the objective value (using the L-infinity norm) is less than this threshold, the model is converged. Type: real, Default: 0.0001
  • max_active_predictors: Specify the maximum number of active predictors during computation. This value is used as a stopping criterion to prevent expensive model building with many predictors. Type: integer, Default: -1 (no limit)
  • obj_reg: Likelihood divider in objective value computation. Type: real, Default: 1/nobs
  • additional_alphas: Providing additional alphas triggers a search. Ignored if alpha is undefined.
  • additional_lambdas: Providing additional lambdas triggers a search. Ignored if lambda is undefined.
  • nfolds: Number of folds for cross-validation. Use 0 to turn off cross-validation. Type: integer, Default: 0
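
The convergence criterion described for beta_epsilon can be sketched as a check on the largest change of any coefficient between two successive iterations. This is an illustrative reading of the description, not H2O code; the function name `has_converged` is hypothetical, and whether H2O compares with "less than" or "less than or equal" is an assumption.

```python
def has_converged(previous_beta, current_beta, beta_epsilon=0.0001):
    """True if no coefficient moved by beta_epsilon or more since the last iteration."""
    max_change = max(abs(current - previous)
                     for previous, current in zip(previous_beta, current_beta))
    return max_change < beta_epsilon


# Coefficients barely moved: treated as converged.
small_step = has_converged([0.5, 1.0], [0.50001, 1.00002])
# One coefficient moved by 0.1: keep iterating.
large_step = has_converged([0.5, 1.0], [0.6, 1.0])
```

A smaller beta_epsilon demands more stable coefficients before stopping and therefore usually means more iterations.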