AutoEncoder

Synopsis

This Operator can be used to configure neural networks as an autoencoder. Autoencoders are characterized by consisting of an encoder and a decoder part of the architecture, trained simultaneously on unlabeled data. Three models are provided as the output of this operator: a full model generating data in the dimensionality of the input data; an encoder model, consisting of the architecture part defined in the encoder subprocess, which outputs data with the dimensionality given by the number of neurons of the last encoder operator; and a decoder model, which expects data with the dimensionality of the encoder model's output and provides predictions in the dimensionality of the original input data set.

Description

This operator allows configuring an autoencoder network architecture. The Operator has two subprocesses, one for the encoder part of the architecture and one for the decoder side. Often the decoder side is a mirrored version of the encoder side, intended to reconstruct an embedding created with the encoder side. For this purpose both parts are trained simultaneously and often used separately afterwards, e.g. for creating labels with an encoder. Another property of autoencoder architectures is a typical bottleneck in the middle. The bottleneck is created by reducing the dimensionality of the data (e.g. through reducing the number of neurons, or pooling). An autoencoder is an unsupervised modelling technique: this operator uses the data provided as input as its label. E.g. training an autoencoder with a bottleneck can result in a model that represents a given data set with fewer dimensions than initially available.
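
The operator itself is configured graphically through its subprocesses; the following PyTorch sketch is not its implementation, but it illustrates the encoder/decoder split, the bottleneck, and the three resulting models described above (all layer sizes are illustrative assumptions).

```python
# Minimal autoencoder sketch: encoder and decoder trained together, with a
# bottleneck that reduces the dimensionality of the data.
import torch
import torch.nn as nn

input_dim = 64        # dimensionality of the original data
bottleneck_dim = 8    # reduced dimensionality at the bottleneck

encoder = nn.Sequential(
    nn.Linear(input_dim, 32), nn.ReLU(),
    nn.Linear(32, bottleneck_dim), nn.ReLU(),   # embedding size = neurons of last encoder layer
)
decoder = nn.Sequential(                        # mirrored version of the encoder
    nn.Linear(bottleneck_dim, 32), nn.ReLU(),
    nn.Linear(32, input_dim),                   # reconstructs the original dimensionality
)
autoencoder = nn.Sequential(encoder, decoder)   # corresponds to the "full model"

x = torch.randn(16, input_dim)                  # unlabeled data
reconstruction = autoencoder(x)                 # same dimensionality as the input
embedding = encoder(x)                          # "encoder model" output (bottleneck_dim)
decoded = decoder(embedding)                    # "decoder model" expects embedding-sized input
loss = nn.MSELoss()(reconstruction, x)          # the input data doubles as the label
```

Training minimizes the reconstruction loss, after which the encoder can be used on its own to produce the lower-dimensional embedding.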

Input

training set

ExampleSet or TensorIOObject holding the training data.

test set

ExampleSet or TensorIOObject holding the evaluation/test data.

Output

model

A full model generating data in the dimensionality of the input data.

encoder model

An encoder model, consisting of the architecture part defined in the encoder subprocess. This model outputs data with the dimensionality given by the number of neurons of the last encoder operator.

decoder model

A decoder model, which expects data with the dimensionality of the encoder model's output and provides predictions in the dimensionality of the original input data set.

throughput

Input sample set passed through.

history

An ExampleSet containing example-based loss values and the respective epoch counts, representing the training and test behavior. The training loss is derived from the training model, thus including dropped-out neurons and other regularization methods. The test loss is derived from the test model without these mechanisms. Loss values are given per example. Plot these loss values, for example as a scatter plot, to check whether the learning rate needs to be changed, another weight initialization needs to be chosen, or regularization should be applied. A decreasing curve is expected; as long as it keeps decreasing, more epochs could be used.
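
As a hedged illustration of how the history data might be inspected, the sketch below plots per-example training and test loss against the epoch count with matplotlib; the column names and loss values are made-up assumptions, not the actual schema of the history ExampleSet.

```python
# Sketch: scatter plot of training and test loss per epoch.
import matplotlib.pyplot as plt
import pandas as pd

history = pd.DataFrame({              # stand-in for the exported history data
    "epoch":      [1, 2, 3, 4, 5],
    "train_loss": [0.90, 0.55, 0.40, 0.33, 0.30],
    "test_loss":  [0.95, 0.60, 0.47, 0.42, 0.41],
})

plt.scatter(history["epoch"], history["train_loss"], label="training loss")
plt.scatter(history["epoch"], history["test_loss"], label="test loss")
plt.xlabel("epoch")
plt.ylabel("loss per example")
plt.legend()
plt.show()   # a curve that is still decreasing suggests more epochs could help
```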

Parameters

Epochs

Number of times the whole data set is passed through the network. Use the advanced parameter use early stopping to select a strategy that enables early stopping. These strategies often result in a shorter training time, since the training process is stopped as soon as the desired criterion is reached.

Use minibatch

Pass data in batches through the network and update weights and biases after each of those batches.

Batch size

Number of examples to be used in one batch for a single weight update. Values are often chosen as powers of 2. When switching from CPU to GPU backend execution, increase the batch size by a lot (a factor of 100-1,000) to take advantage of the GPU's extra memory.
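
The sketch below illustrates the mini-batch idea in PyTorch terms (not the operator's backend): weights and biases are updated once per batch, and the batch size is a power of two. The model, data and batch size are toy assumptions.

```python
# Sketch: one weight update per mini-batch.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Sequential(            # toy stand-in for the configured network
    torch.nn.Linear(64, 8), torch.nn.ReLU(), torch.nn.Linear(8, 64)
)
data = torch.randn(1024, 64)
loader = DataLoader(TensorDataset(data), batch_size=32, shuffle=True)  # 32 = 2^5

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for (batch,) in loader:                 # each iteration sees `batch_size` examples
    optimizer.zero_grad()
    loss = loss_fn(model(batch), batch) # autoencoder style: input doubles as label
    loss.backward()
    optimizer.step()                    # weights and biases updated after the batch
```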

Log each epoch

Disable this option to choose the number of epochs after which loss values should be logged. This affects the output of the history port as well as the actual process log.

Epochs per log

This parameter is available if log each epoch is disabled. It defines the number of epochs after which loss values are logged.

Use early stopping

When training neural networks, numerous decisions need to be made regarding the settings (hyperparameters) used in order to obtain good performance. One such hyperparameter is the number of training epochs, i.e. the number of full passes over the data set. Early stopping attempts to remove the need to manually set this value. It can also be considered a type of regularization method (like L1/L2 weight decay and dropout) in that it can stop the network from overfitting. The number of epochs set using the epochs parameter is always used as an upper limit. Available conditions can be selected through the condition strategy selector.

Available epoch conditions, which are tested every epoch:

  • score_improvement: Use the parameters patience and minimal score improvement to check for a score that is no longer improving, which leads to a stop. The patience defines the number of epochs the score needs to be considered constant. Whether a score is considered constant in comparison to the previous one depends on the amount it has changed; the minimum change that still counts as an improvement is set using minimal score improvement (a sketch follows after the condition lists).
  • best epoch score: Use this strategy to define a targeted score with the best epoch score parameter. If this score is reached training is stopped.
  • score improvement: Allowed number of epochs without any improvement in score compared to the best score so far.

Available iteration conditions, which are tested on each mini-batch:

  • max iteration score: Score threshold for every iteration. If an iteration exceeds this score threshold the training will be stopped. This can occur for example with a poorly tuned (too high) learning rate.
  • maximum time: Maximum amount of time (in seconds) an iteration can last during training.
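
As a rough illustration of the score improvement epoch condition (patience plus minimal score improvement), here is a hedged Python sketch of such a stopping loop; `train_one_epoch` and `evaluate` are placeholder callables, and the default thresholds are illustrative assumptions, not the operator's defaults.

```python
# Sketch: stop when the score has not improved by at least `min_improvement`
# for `patience` consecutive epochs; `max_epochs` stays the upper limit.
def train_with_early_stopping(train_one_epoch, evaluate,
                              max_epochs=200, patience=10, min_improvement=1e-4):
    best_score = float("inf")           # here: lower loss = better score
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        score = evaluate()
        if best_score - score > min_improvement:
            best_score = score          # real improvement: reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                       # score considered constant: stop early
    return best_score
```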

Configure optimization

Whether to override default training configurations regarding network optimization.

Optimization method

An optimization method defines the strategy for when and how to update parameters (weights and biases). The provided methods allow changing between batched and non-batched methods. Batched methods like Conjugate Gradient Line Search and L-BFGS perform updates after the full data set has been passed through the network once. Hence these methods are more memory demanding. Non-batched methods like Stochastic Gradient Descent and Line Gradient Descent perform updates for each example. Applying use miniBatch can alter this behaviour by first collecting a predefined number of examples (set with the batch size parameter) before performing an update.

Most of the time a combination of Stochastic Gradient Descent with miniBatch is very performant, while providing good results.

For ExampleSets with fewer than 10,000 examples, or big ExampleSets without much redundancy, it is recommended to use a batch optimization method in combination with adaptive update mechanisms like Adam.

  • Line Gradient Descent: Stochastic gradient descent with line search. This method performs weight and bias updates for each example.
  • Conjugate Gradient Line Search: This method performs weight and bias updates after calculating losses for the full data set. Might be memory intensive.
  • L-BFGS: This method performs weight and bias updates after calculating losses for the full data set. Might be memory intensive.
  • Stochastic Gradient Descent: Stochastic gradient descent. This method performs weight and bias updates for each example.
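
To make the batched vs. non-batched distinction concrete, the following PyTorch sketch contrasts per-batch stochastic gradient descent updates with L-BFGS, which evaluates the loss over the full data set for each update and is therefore more memory demanding. The model and data are toy assumptions, not the operator's internals.

```python
# Sketch: stochastic (per-batch) updates vs. a full-batch method.
import torch

model = torch.nn.Linear(10, 10)
data = torch.randn(256, 10)
loss_fn = torch.nn.MSELoss()

# Non-batched style: Stochastic Gradient Descent, one update per (mini-)batch.
sgd = torch.optim.SGD(model.parameters(), lr=0.01)
for batch in data.split(32):
    sgd.zero_grad()
    loss_fn(model(batch), batch).backward()
    sgd.step()

# Batched style: L-BFGS re-evaluates the loss on the full data set per step.
lbfgs = torch.optim.LBFGS(model.parameters())
def closure():
    lbfgs.zero_grad()
    loss = loss_fn(model(data), data)   # loss over the complete data set
    loss.backward()
    return loss
lbfgs.step(closure)
```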

Backpropagation

Select the type of error propagation through the network. For most scenarios the standard setting is sufficient, but for recurrent networks, e.g. when using an LSTM layer, the truncated option might be used.

  • Standard: Standard backpropagation method propagating errors (defined by the chosen loss function) back through the network for updating parameter weights.
  • Truncated: You can choose this option when using a recurrent network architecture (one that uses recurrent layers like the LSTM layer). It enables another option called backpropagation length, which defines the number of steps to use for one backpropagation step (a sketch follows below). Using the full length often increases the training time by a lot, due to the complexity added by the hidden states of recurrent layers.

Backpropagation length

This option is available when selecting truncated as the backpropagation method. Define a value for the number of backpropagation steps to use for one error propagation.
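
The following PyTorch sketch shows the idea behind truncated backpropagation through a recurrent (LSTM) network: the sequence is processed in chunks of backpropagation length steps, and gradients are cut between chunks. It is only an illustration under assumed toy shapes, not the operator's implementation.

```python
# Sketch: truncated backpropagation through time with an LSTM.
import torch

lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
head = torch.nn.Linear(16, 8)
optimizer = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()))
loss_fn = torch.nn.MSELoss()

sequence = torch.randn(4, 100, 8)   # (batch, time steps, features)
bptt_length = 20                    # "backpropagation length"
state = None

for start in range(0, sequence.size(1), bptt_length):
    chunk = sequence[:, start:start + bptt_length]
    optimizer.zero_grad()
    output, state = lstm(chunk, state)
    loss = loss_fn(head(output), chunk)          # reconstruct the chunk itself
    loss.backward()                              # gradients reach at most bptt_length steps back
    optimizer.step()
    state = tuple(s.detach() for s in state)     # truncation: cut the graph between chunks
```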

Configure updater

Whether to override the default network update configurations.

Updater

Method used to calculate new weight and bias values in order to minimize the chosen loss.

  • SGD: Stochastic gradient descent. Uses a learning rate to adjust the extent to which weights and biases are updated.
  • Adam: Adaptive momentum change. http://arxiv.org/abs/1412.6980
  • AdaMax: Similar to Adam but using the infinity norm. Recommended parameter settings: learning rate = 0.002, beta 1 = 0.9, beta 2 = 0.999. http://arxiv.org/abs/1412.6980
  • AdaDelta: Similar to AdaGrad but adjusts the learning rate based on moving window averages instead of all collected gradients. Recommended parameter settings: learning rate = 1.0, rho = 0.95.
  • Nesterovs: Use SGD with Nesterov momentum.
  • NAdam: Similar to Adam, but using the Nesterov mechanism for momentum change. Recommended parameter settings: learning rate = 0.002, beta 1 = 0.9, beta 2 = 0.999.
  • AdaGrad: Uses set learning rate as a baseline and decreases it during training. The rate is adjusted for each weight and reduced when more updates are performed. Recommended parameter settings: learning rate = 0.01.
  • RMSProp: Use a moving average of squared gradients to divide the gradient by. This is well suited for recurrent networks. Recommended parameter settings: learning rate = 0.001, beta 1 = 0.9, beta 2 = 0.999.
  • None: Don't update weights and parameters.
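
The PyTorch optimizers below roughly correspond to the updaters listed above and use the recommended settings quoted there; the mapping is an approximation for illustration, not the extension's actual implementation. The momentum value for Nesterovs and the alpha used for RMSprop are assumptions.

```python
# Sketch: constructing updaters with the recommended settings from the list above.
import torch

model = torch.nn.Linear(4, 1)                    # toy model providing parameters

sgd       = torch.optim.SGD(model.parameters(), lr=0.005)
nesterovs = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, nesterov=True)
adam      = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
adamax    = torch.optim.Adamax(model.parameters(), lr=0.002, betas=(0.9, 0.999))
nadam     = torch.optim.NAdam(model.parameters(), lr=0.002, betas=(0.9, 0.999))
adadelta  = torch.optim.Adadelta(model.parameters(), lr=1.0, rho=0.95)
adagrad   = torch.optim.Adagrad(model.parameters(), lr=0.01)
rmsprop   = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.9)
```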

Learning rate

Speed of navigating through the landscape of potential weight and bias values. 0.005 and 0.01 are often good starting points. Higher learning rates can reduce the number of epochs needed to reach a 'good' result, while also increasing the chance of missing the optimal point.

Momentum

Acceleration of a chosen learning rate. Reduces fluctuating values and can help avoid local minima / results getting stuck too early.

Rho

Exponential decay rate of the learning rate. Slows down learning while decreasing the possibility of missing weight and bias values that result in lower losses.

Epsilon

Jitter value used to ensure numerical stability of updates. Should be very small.

Beta1

Fine tuning parameter for some updaters. In many cases this should be close to one.

Beta2

Fine tuning parameter for some updaters. In many cases this should be close to one.

Rmsdecay

Decay rate for the RMSProp update mechanism.

Configure layers

Whether to override default layer level configurations.

Weight initialization

A Deep Learning model is defined by so-called weights. Weights are set within most layers and define the model. Finding the best weight values during training is an iterative process and requires start values. Weight values are multiplied with the respective input data. At the first layer the input data is the data provided at the training port of the Deep Learning operator. For successive layers, weights are multiplied with the output of the previous layer. Select one of the provided pre-defined methods to initialize all weights by the given strategy. Change this parameter if during training the loss is not decreasing, or if it takes a long time before the loss value goes down.

  • Identity: Use identity matrices.
  • Normal: Use a Gaussian distribution with a mean of zero and a standard deviation of 1 / sqrt(number of layer inputs).
  • Ones: Use ones.
  • ReLU: Use a Gaussian distribution with a mean of zero and a variance of 2 / (number of layer inputs).
  • ReLU Uniform: Use a Uniform distribution from -a to a, where a = sqrt(6/(number of layer inputs)).
  • Sigmoid Uniform: Use a Uniform distribution from -a to a, where a = sqrt(6 / (number of layer inputs + number of layer outputs)).
  • Uniform: Use a Uniform distribution from -a to a, where a = 1 / sqrt(number of layer inputs).
  • Xavier: Use a Gaussian distribution with a mean of zero and a variance of 2 / (number of layer inputs + number of layer outputs).
  • Xavier Uniform: Use a Uniform distribution from -a to a, where a = sqrt(6/(number of layer inputs + number of layer outputs)).
  • Zero: Initialize all weights with zero. This is rarely a good idea.
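
As a hedged sketch of two of these schemes, the code below initializes a linear layer's weights with the Xavier and ReLU distributions defined above (Gaussian, mean zero, variance 2 / (number of inputs + number of outputs) and 2 / (number of inputs), respectively); the layer sizes are arbitrary assumptions.

```python
# Sketch: Xavier and ReLU (He) weight initialization for a fully-connected layer.
import math
import torch

layer = torch.nn.Linear(in_features=128, out_features=64)
fan_in, fan_out = layer.in_features, layer.out_features

with torch.no_grad():
    # Xavier: mean 0, variance 2 / (number of inputs + number of outputs)
    layer.weight.normal_(mean=0.0, std=math.sqrt(2.0 / (fan_in + fan_out)))

    # ReLU: mean 0, variance 2 / (number of inputs)
    layer.weight.normal_(mean=0.0, std=math.sqrt(2.0 / fan_in))

# PyTorch also ships equivalent helpers:
torch.nn.init.xavier_normal_(layer.weight)
torch.nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
```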

Bias initialization

As described for the weight initialization method parameter, a Deep Learning model needs starting values for the training process. While the weights are multiplied with the input data, the bias values are added on top of this product. When training a regression model on a data set with a mean target value of 10, starting with a bias initialization value of 10 could enable the network to find a fitting bias value more quickly.
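
A minimal sketch of the regression example above, assuming a final linear layer and a mean target value of 10:

```python
# Sketch: initialize the output bias near the mean target value.
import torch

output_layer = torch.nn.Linear(32, 1)     # toy final layer of a regression network
with torch.no_grad():
    output_layer.bias.fill_(10.0)         # start close to the mean target of 10
```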

Use regularization

Define whether to use regularization for weight calculation or not. Regularization can help reduce overfitting. Use a scatter plot of the data provided at the history port to check the development of the training and test loss across epochs. Oscillating / jumping loss values often indicate a need for regularization. Set one of the values (L1 or L2) to 0.0 to use only the other. A small L2 regularization (~0.1) is often a good starting point.

L1 strength

Define strength of L1 (proportional to the sum of absolute weight values) for regularization.

L2 strength

Define strength of L2 (proportional to the weight itself) for regularization.
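
The sketch below adds L1 and L2 penalties to a loss in the spirit of the two strengths described above (L1 scaling the absolute weights, L2 scaling the squared weights); the model, data and strength values are illustrative assumptions.

```python
# Sketch: loss with L1 and L2 regularization terms.
import torch

model = torch.nn.Linear(16, 16)
loss_fn = torch.nn.MSELoss()
l1_strength, l2_strength = 0.0, 0.1   # set one to 0.0 to use only the other

x = torch.randn(8, 16)
loss = loss_fn(model(x), x)
for param in model.parameters():
    loss = loss + l1_strength * param.abs().sum()    # L1: sum of absolute weights
    loss = loss + l2_strength * (param ** 2).sum()   # L2: squared weights
loss.backward()
```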

Cudnn algo mode

This parameter only influences the runtime environment of the network if it is executed on a GPU. Nvidia (manufacturer of the supported GPU architecture) provides a library called CuDNN, which contains efficient implementations for various layers. CuDNN can accelerate training, but at the cost of a potentially larger memory footprint. In certain edge cases the higher memory consumption can cause errors. When that happens, it is advised to trade some performance for memory by setting this parameter to "No workspace".

  • Prefer fastest: Default setting. Better performance, but at the cost of higher memory consumption. If memory constraints are a problem, it is advised to switch over to "No workspace" and give that a try.
  • No workspace: Has a lower memory footprint than "Prefer fastest", but at the cost of lower performance.

Infer input shape

Guess and log to console the input shape of the given training data.

Network type

Choose network type to configure data shape for.

  • Simple Neural Network: Simple two dimensional neural network only consisting of fully-connected, dropout, activation and batch normalization layers.
  • Convolutional: A four dimensional neural network using convolutional layers, pooling and others.
  • Convolutional Flattened: A two dimensional neural network using convolutional layers, pooling and others, but converting the input from four dimensions to two. Conversion to: [miniBatchSize, height * width * channels]
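
As an illustration of the flattened conversion, the sketch below reshapes a four dimensional batch into the two dimensional form [miniBatchSize, height * width * channels]; the sizes are arbitrary assumptions.

```python
# Sketch: flattening a 4D convolutional input into 2D.
import torch

mini_batch_size, channels, height, width = 32, 3, 28, 28
batch = torch.randn(mini_batch_size, channels, height, width)   # 4D input

flattened = batch.reshape(mini_batch_size, height * width * channels)
print(flattened.shape)   # torch.Size([32, 2352])
```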

Input dimension

Dimension of the data as expected by the network. For simple neural networks this is often the number of regular attributes.

Height

Height dimensionality of the data.

Width

Width dimensionality of the data.

Depth

Depth dimensionality of the data.

Use local random seed

This parameter indicates if a local random seed should be used.

Local random seed

If the use local random seed parameter is checked, this parameter determines the local random seed.
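
For illustration, fixing a seed in a script looks like the hedged sketch below (the seed value is arbitrary); the operator handles this through the parameter above.

```python
# Sketch: fixing random seeds for reproducible initialization and shuffling.
import random
import numpy as np
import torch

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
```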