Day 4 - Regression and Classification using MLPs
MLPs for Regression
- Number of neurons in the input layer depends on data.
- Use one output neuron per output dimension (e.g. one neuron to predict the price of a house, two neurons to predict x and y coordinates).
Choice of activation function
- Allow output of any range of values: no activation function.
- Allow only positive values: ReLU or softplus function.
- Softplus is a smooth variant of ReLU: softplus(z) = ln(1 + e^z). It is close to 0 when z is negative and close to z when z is positive.
- Allow values within a given range:
- Logistic function, then scale, for values between 0 and 1.
- Hyperbolic tangent function, then scale, for values between -1 and 1.
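The output activations above can be sketched in plain Python. This is a minimal illustration; the helper names (`scaled_logistic`, `scaled_tanh`) and the ranges passed to them are my own, not from the notes.

```python
import math

def softplus(z):
    # Smooth variant of ReLU: ln(1 + e^z).
    # Close to 0 for negative z, close to z for positive z.
    return math.log1p(math.exp(z))

def logistic(z):
    # Output in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def scaled_logistic(z, low, high):
    # Rescale the (0, 1) logistic output to any bounded range (low, high).
    return low + (high - low) * logistic(z)

def scaled_tanh(z, low, high):
    # tanh is in (-1, 1); rescale to (low, high).
    return low + (high - low) * (math.tanh(z) + 1.0) / 2.0

print(round(softplus(-10.0), 4))                   # close to 0
print(round(softplus(10.0), 4))                    # close to 10
print(round(scaled_logistic(0.0, 0.0, 100.0), 1))  # midpoint of range: 50.0
```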
Choice of loss function
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE) - suitable when there are outliers in the data
- Huber loss - combination of both of the above:
- Quadratic when the error is smaller than a threshold (δ, usually 1): allows the model to converge faster and be more precise than MAE.
- Linear when the error is larger than the threshold (δ): makes it less sensitive to outliers.
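A minimal sketch of the Huber loss for a single error term, using the standard piecewise definition (quadratic inside the threshold, linear outside):

```python
def huber(error, delta=1.0):
    # Quadratic near zero (like MSE), linear beyond delta (like MAE).
    if abs(error) <= delta:
        return 0.5 * error ** 2
    return delta * (abs(error) - 0.5 * delta)

# Small error: quadratic regime.
print(huber(0.5))   # 0.125
# Large error (e.g. an outlier): grows linearly, not quadratically.
print(huber(10.0))  # 9.5
```

Note how the loss for the outlier-sized error (9.5) is far smaller than the squared error (50) would be, which is what makes Huber less sensitive to outliers.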
Typical MLP architecture for Regression
Hyperparameter | Typical value |
---|---|
Number of neurons in input layer | Equal to number of input features |
Number of hidden layers | Depends on the problem (typically 1 to 5) |
Number of neurons per hidden layer | Depends on the problem (typically 10 to 100) |
Number of neurons in output layer | Equal to number of prediction dimensions |
Hidden layer activation | ReLU or SELU |
Output activation | None; ReLU/softplus for positive outputs; logistic/tanh with scaling for bounded outputs |
Loss function | MSE; MAE/Huber if there are outliers |
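The architecture in the table can be illustrated with a tiny forward pass written from scratch. This is a hand-rolled sketch with made-up weights, just to show the shape of a regression MLP (features in, ReLU hidden layer, linear output); a real network would learn these weights via gradient descent.

```python
def relu(z):
    return max(0.0, z)

def dense(inputs, weights, biases, activation=None):
    # One fully connected layer: each neuron computes a weighted sum + bias.
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs

# Toy network: 2 input features -> 3 ReLU hidden neurons -> 1 linear output.
x = [1.0, 2.0]
hidden = dense(x,
               weights=[[0.1, 0.2], [0.3, -0.4], [-0.5, 0.6]],
               biases=[0.0, 0.1, -0.1],
               activation=relu)
# No output activation: the prediction can take any value.
y_pred = dense(hidden, weights=[[1.0, -1.0, 0.5]], biases=[0.2])
print(y_pred)  # a single unbounded regression prediction
```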
MLPs for Classification
Binary classification
- Use a single output neuron with logistic activation.
- The output, between 0 and 1, can be treated as the estimated probability of the positive class.
- Probability of the negative class = 1 - output of the network.
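The two bullets above amount to the following (the pre-activation value `z` is a made-up example):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 2.0  # hypothetical pre-activation of the single output neuron
p_positive = logistic(z)       # estimated probability of the positive class
p_negative = 1.0 - p_positive  # probability of the negative class
print(round(p_positive, 3), round(p_negative, 3))
```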
Multilabel binary classification
- Use multiple output neurons, each with a logistic activation function.
- Use one output neuron per label.
- The output classes are not exclusive (an instance can belong to multiple classes at once) so the output probabilities don't add up to 1.
Multiclass classification
- The output classes are exclusive (each instance can belong to only one of multiple classes).
- Use one output neuron per class with softmax activation in the output layer.
- All outputs will be between 0 and 1, and add up to 1.
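A small softmax sketch showing both properties above (each output in (0, 1), and all outputs summing to 1). The logits are made-up values; subtracting the maximum is a common numerical-stability trick that does not change the result:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability; result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # one logit per class
print([round(p, 3) for p in probs])  # each probability is in (0, 1)
print(round(sum(probs), 6))          # they add up to 1
```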
Choice of loss function
Cross-entropy loss (a.k.a. log loss) is a good choice for the loss function:
$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log\left(\hat{p}_k^{(i)}\right)$
- $y_k^{(i)}$ is 1 if $k$ is the right class for instance $i$, and 0 otherwise.
- $\hat{p}_k^{(i)}$ is the probability the network predicts for class $k$ for instance $i$.
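For a single instance with a one-hot target, cross-entropy reduces to the negative log of the probability predicted for the true class. A minimal sketch (the example probabilities are made up):

```python
import math

def cross_entropy(y_true, p_pred):
    # y_true: one-hot target (1 for the right class, 0 otherwise)
    # p_pred: predicted probabilities, one per class
    return -sum(y * math.log(p) for y, p in zip(y_true, p_pred) if y > 0)

# Confident and correct prediction -> low loss.
print(round(cross_entropy([0, 1, 0], [0.05, 0.9, 0.05]), 4))
# Confident but wrong prediction -> high loss.
print(round(cross_entropy([1, 0, 0], [0.05, 0.9, 0.05]), 4))
```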
Typical MLP architecture for Classification
Hyperparameter | Binary classification | Multilabel binary classification | Multiclass classification |
---|---|---|---|
Number of input and hidden layers | Same as regression | Same as regression | Same as regression |
Number of neurons in output layer | 1 | 1 per label | 1 per class |
Output layer activation function | Logistic | Logistic | Softmax |
Loss function | Cross entropy | Cross entropy | Cross entropy |