18. Hyperparameter Tuning

As you have learned so far, there are a number of optimization algorithms for training deep learning models. Training a complex deep learning model can take hours, days, or even weeks, so randomly guessing a model's hyperparameters is rarely practical and some deeper knowledge is required. These hyperparameters include, but are not limited to, the learning rate, the number of hidden layers, the dropout rate, the batch size, and the number of epochs to train for. Notably, not all of these hyperparameters contribute equally to the model's performance, which makes finding the best configuration in such a high-dimensional space a nontrivial challenge (searching is expensive!). In this chapter, we look at different strategies to tackle this search problem.

Hyperparameters can be categorized into two groups: those used for training and those related to model structure and design.

18.1. Training Hyperparameters

18.1.1. Learning rate

Gradient descent algorithms multiply the gradient by a scalar known as the learning rate to determine the next point in weight space. The learning rate is a hyperparameter that controls the step size taken in the direction of lower loss, with the goal of minimizing the loss function. In most cases, the learning rate is manually adjusted during model training. A large learning rate (\(\alpha\)) makes the model learn faster, but may cause the optimizer to miss the minimum and only reach its surroundings. If the learning rate is too large, the optimizer overshoots the minimum and the loss updates lead to divergent behavior. On the other hand, choosing a smaller \(\alpha\) gives a better chance of finding a local minimum, with the trade-off of needing a larger number of epochs and more time.
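
As a minimal sketch of how \(\alpha\) controls this step, consider plain gradient descent on a one-dimensional quadratic loss (the loss and the learning rates below are illustrative choices, not values from this chapter's model):

import numpy as np

def gradient_descent(lr, w0=5.0, steps=20):
    """Minimize L(w) = w^2 starting from w0; the gradient is 2w."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w  # next point = current point - lr * gradient
    return w

for lr in [0.01, 0.1, 1.1]:
    print(f"lr={lr}: w after 20 steps = {gradient_descent(lr):.3f}")
# a small lr converges slowly, a moderate lr converges quickly,
# and lr > 1 overshoots and diverges for this loss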

../_images/loss-lr.gif

Fig. 18.1 Effect of learning rate on loss.

Note that we can almost never plot the loss as a function of weight space (as shown in Fig. 18.1), which makes finding a reasonable \(\alpha\) tricky. With a well-chosen constant \(\alpha\), the model can be trained to a passable yet still unsatisfactory accuracy, because a constant \(\alpha\) can be too large, especially in the last few epochs. Alternatively, \(\alpha\) can be adjusted adaptively in response to the performance of the model. This is known as a learning rate decay schedule. Commonly applied decay schedules include linear (step), exponential, polynomial, and cyclic. By starting at a larger learning rate, we gain the rarely discussed benefit of allowing the model to escape sharp local minima that overfit, and to settle into a broader minimum as the learning rate decreases over the epochs.

../_images/decay.gif

Fig. 18.2 Decay schedules on the learning rate can help escape local minima.
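
In Keras, for example, a decay schedule can be attached directly to the optimizer. A minimal sketch using an exponential schedule (the initial rate and decay constants are illustrative, untuned values):

import tensorflow as tf

# start at 1e-2 and multiply the learning rate by 0.9 every 1000 steps
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-2, decay_steps=1000, decay_rate=0.9
)
opt = tf.optimizers.Adam(schedule)
# the schedule can be inspected at any step count
print(schedule(0).numpy(), schedule(5000).numpy())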

18.1.2. Batch Size

The batch size is always a trade-off between computational efficiency and accuracy. By reducing the batch size, you reduce the number of samples from which the loss is calculated at each training iteration. Since smaller batches can vary more from one another, they add more noise to the convergence. Thus, by keeping the batch size small, we are more likely to find a broader local minimum.

As mentioned in the previous section, decaying the learning rate is common practice when training deep learning models. It has actually been shown that we can obtain the same benefits and learning curve by instead scaling up the batch size during training [SKYL17].
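
A sketch of that recipe: wherever a step schedule would have halved the learning rate, double the batch size instead (the epochs and factors below are illustrative):

# instead of halving the learning rate at epochs 10 and 20,
# double the batch size at the same epochs [SKYL17]
lr, batch_size = 0.1, 64
for epoch in range(30):
    if epoch in (10, 20):
        batch_size *= 2  # equivalent, in effect, to lr *= 0.5
    if epoch % 10 == 0:
        print(f"epoch {epoch}: lr={lr}, batch_size={batch_size}")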

18.3. Hyperparameter Optimization

Mathematically, hyperparameter optimization (HPO) is the process of finding the set of hyperparameters that achieves minimum loss or maximum accuracy for the model of interest. The general philosophy is the same for all HPO algorithms: determine which hyperparameters to tune and their corresponding search space, then adjust them from coarse to fine to obtain the optimal combination.

State-of-the-art HPO algorithms fall into two categories: search algorithms and trial schedulers. In general, search algorithms are applied for sampling, while trial schedulers deal with early stopping methods for model evaluation.

18.3.1. Search Algorithms
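
Random search, one of the simplest search algorithms, just samples each hyperparameter independently from a user-defined space and trains a model per sample. A minimal sketch (the search space below is an illustrative choice):

import numpy as np

rng = np.random.default_rng(0)
# each hyperparameter gets its own sampling distribution
search_space = {
    "lr": lambda: 10 ** rng.uniform(-4, -1),        # log-uniform learning rate
    "batch_size": lambda: int(rng.choice([16, 32, 64, 128])),
    "dropout": lambda: float(rng.uniform(0.0, 0.5)),
}
for _ in range(5):
    trial = {name: draw() for name, draw in search_space.items()}
    print(trial)  # in practice: train a model with `trial` and record its score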

18.3.1.3. Bayesian Optimization

Bayesian optimization (BO) is a sequential model-based optimization method that aims at becoming less wrong with more data: it searches for the global optimum by balancing exploration and exploitation while minimizing the number of trials. BO outperforms random and grid search in several respects:

1. There is no need for preliminary knowledge of the distribution of the hyperparameters.

2. Unlike random and grid search, the posterior probability is built on a relevant search space, meaning that the algorithm discards hyperparameter ranges that, according to previous trials, will most likely not deliver promising solutions.

3. Another remarkable advantage is that BO is applicable in many settings: where the derivative of the objective function is unknown or expensive to calculate, and whether the objective is stochastic or deterministic, convex or non-convex.

BO consists of two key ingredients:

1. A Bayesian probabilistic surrogate model of the expensive objective function. Just as a surrogate mother is a woman who agrees to bear a child for another person, a surrogate function is a less expensive stand-in for the objective function. A popular choice of surrogate model for BO is the Gaussian process (GP).

2. An acquisition function that acts as a metric to determine the next optimal sampling point. This is where BO balances the trade-off between exploitation and exploration. Exploitation means sampling where the surrogate model predicts a high objective, given the currently available solutions; exploration means sampling at locations where the prediction uncertainty is high. Both correspond to high acquisition function values, and the next sampling point is chosen by maximizing the acquisition function.
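
To make these two ingredients concrete, here is a minimal BO sketch on a one-dimensional toy objective (standing in for, say, validation loss as a function of a single hyperparameter). The GP surrogate comes from scikit-learn and the acquisition function is expected improvement; the toy objective, kernel choice, and all constants are illustrative assumptions:

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# toy "expensive" objective we want to minimize
def objective(x):
    return np.sin(3 * x) + 0.5 * x**2

def expected_improvement(X, gp, y_best, xi=0.01):
    mu, sigma = gp.predict(X, return_std=True)
    imp = y_best - mu - xi  # improvement over the best observed (minimizing)
    z = imp / np.maximum(sigma, 1e-9)
    return imp * norm.cdf(z) + sigma * norm.pdf(z)

X_grid = np.linspace(-2, 2, 400).reshape(-1, 1)
X = np.array([[-1.5], [0.0], [1.5]])  # a few initial evaluations
y = objective(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(10):
    gp.fit(X, y)
    # low predicted mean (exploitation) and high sigma (exploration)
    # both raise EI; sample next where EI is maximal
    x_next = X_grid[np.argmax(expected_improvement(X_grid, gp, y.min()))]
    X = np.vstack([X, x_next.reshape(1, -1)])
    y = np.append(y, objective(x_next))
print(f"best x = {X[np.argmin(y)][0]:.3f}, best objective = {y.min():.3f}")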

18.3.2. Trial Schedulers

HPO is a time-consuming process, and in realistic scenarios it is necessary to obtain the best hyperparameters with limited resources. When tuning hyperparameters by hand, we use experience to narrow the search space, evaluate the model during training, and decide whether to stop training or keep going. An early stopping strategy tries to mimic this behavior, directing the computational budget toward promising hyperparameter sets.

18.3.2.1. Median Stopping

Median stopping is a straightforward early termination policy that makes the stopping decision based on the average of primary metrics, such as accuracy or loss, reported by previous runs. A trial \(X\) is halted at step \(S\) if its best objective value by step \(S\) is strictly worse than the median of the running averages of all completed trials' objective values at step \(S\) [GSM+17].
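
A minimal sketch of this rule, assuming the objective is a loss (lower is better) and each history is a list of per-step values (all names and numbers are illustrative):

import numpy as np

def median_stopping_rule(trial_history, completed_histories, step):
    """Stop a trial if its best value so far is worse (higher loss) than
    the median of the running averages of completed trials at this step."""
    best_so_far = min(trial_history[: step + 1])
    running_avgs = [np.mean(h[: step + 1]) for h in completed_histories]
    return best_so_far > np.median(running_avgs)

# three completed loss curves and one in-flight trial
completed = [[0.9, 0.6, 0.4], [0.8, 0.5, 0.35], [0.85, 0.55, 0.38]]
print(median_stopping_rule([0.95, 0.9, 0.88], completed, step=2))  # True: halt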

18.3.2.2. Curve Fitting

Curve fitting is another early stopping rule that predicts the final accuracy or loss using a performance curve regressed from a set of completed or partially completed trials [DSH15]. A trial \(X\) is stopped at step \(S\) if the extrapolation of its learning curve is worse than the tolerance value of the best result in the trial history. Unlike median stopping, which has no hyperparameters of its own, curve fitting is a model with parameters and requires a training process.
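
A minimal sketch of the mechanism, assuming a simple power-law curve model (the observed curve, tolerance, and trial history values are all illustrative):

import numpy as np
from scipy.optimize import curve_fit

# simple power-law learning-curve model: loss(t) = a * t^(-b) + c
def power_law(t, a, b, c):
    return a * t ** (-b) + c

steps = np.arange(1, 11)               # 10 observed steps of a running trial
losses = 1.0 * steps ** (-0.5) + 0.3   # illustrative partial loss curve
params, _ = curve_fit(power_law, steps, losses, p0=[1.0, 0.5, 0.1])
predicted_final = power_law(100, *params)  # extrapolate to step 100
best_so_far = 0.30                     # best final loss in the trial history
if predicted_final > best_so_far + 0.05:   # 0.05 tolerance (illustrative)
    print("stop this trial early")
else:
    print(f"keep training: predicted final loss = {predicted_final:.3f}")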

18.3.2.3. Successive Halving

Successive Halving (SHA) converts the hyperparameter optimization problem into a non-stochastic best-arm identification problem and tries to allocate more resources to the more promising hyperparameter sets [JT16]. In SHA, the user defines a fixed budget (\(B\)) and a fixed number of trials (\(n\)). The hyperparameter sets are each given an equal slice of the initial budget, and the corresponding model performances are evaluated for all trials. The less promising half is dropped while the budget for the remaining half is doubled, and this repeats until one trial remains. One drawback of SHA is how resources are allocated: there is a trade-off between the total budget (\(B\)) and the number of trials (\(n\)). If \(n\) is too large, each trial may be terminated prematurely, whereas too small an \(n\) does not provide enough candidate choices.
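
A minimal sketch of SHA, assuming evaluate(config, resource) trains a configuration for the given resource (e.g., epochs) and returns a validation loss; the toy evaluate function and configurations are illustrative:

import numpy as np

def successive_halving(configs, min_resource, evaluate):
    """Evaluate all configs with a small resource, drop the worse half,
    double the resource, and repeat until one config remains."""
    resource = min_resource
    while len(configs) > 1:
        scores = [evaluate(cfg, resource) for cfg in configs]
        keep = np.argsort(scores)[: len(configs) // 2]  # lower loss is better
        configs = [configs[i] for i in keep]
        resource *= 2
    return configs[0]

# toy evaluate: pretend loss improves with resource, offset per config
def toy_evaluate(cfg, resource):
    return cfg["offset"] / np.sqrt(resource)

configs = [{"lr": lr, "offset": o} for lr, o in
           [(1e-1, 3.0), (1e-2, 1.0), (1e-3, 2.0), (1e-4, 4.0)]]
print(successive_halving(configs, min_resource=1, evaluate=toy_evaluate))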

Compared with Bayesian optimization, SHA is easier to understand, and it is more computationally efficient because it evaluates intermediate model results and uses them to decide whether to terminate a trial.

18.3.2.4. HyperBand

HyperBand is an extension of SHA that addresses the resource allocation problem by considering several possible values of \(n\) for a fixed \(B\) [LJD+18].
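
As a sketch of the idea, we can reuse the successive_halving and toy_evaluate functions from the SHA sketch above and run several brackets, each with a different trade-off between \(n\) and the initial resource (the bracket sizes here are illustrative):

import numpy as np

# each bracket trades off n (many cheap trials) against initial resource
# (few well-trained trials); this hedges against a bad choice of n in SHA
for n, min_resource in [(8, 1), (4, 2), (2, 4)]:
    configs = [
        {"lr": 10 ** np.random.uniform(-4, -1), "offset": np.random.rand()}
        for _ in range(n)
    ]
    best = successive_halving(configs, min_resource, toy_evaluate)
    print(f"bracket n={n}: best lr = {best['lr']:.1e}")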

18.4. Running This Notebook

Click the launch icon above to launch this page as an interactive Google Colab notebook. See details below on installing packages, either in your own environment or on Google Colab.

18.5. Hyperparameter Tuning

Now that we know more about different HPO techniques, we will develop a deep learning model to predict hemolysis in antimicrobial peptides and tune its hyperparameters. The model for hemolytic prediction is trained on data from the Database of Antimicrobial Activity and Structure of Peptides (DBAASP v3). The activity is defined by extrapolating a measurement, assuming a dose-response curve, to the point at which 50% of red blood cells (RBC) are lysed. If the activity is below \(100\frac{\mu g}{ml}\), the peptide is considered hemolytic. Each measurement is treated independently, so sequences can appear multiple times. The training data contain 9,316 positive and negative sequences of only L- and canonical amino acids.

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
import warnings
import urllib

warnings.filterwarnings("ignore")
sns.set_context("notebook")
sns.set_style(
    "dark",
    {
        "xtick.bottom": True,
        "ytick.left": True,
        "xtick.color": "#666666",
        "ytick.color": "#666666",
        "axes.edgecolor": "#666666",
        "axes.linewidth": 0.8,
        "figure.dpi": 300,
    },
)
color_cycle = ["#1BBC9B", "#F06060", "#5C4B51", "#F3B562", "#6e5687"]
mpl.rcParams["axes.prop_cycle"] = mpl.cycler(color=color_cycle)
np.random.seed(0)
urllib.request.urlretrieve(
    "https://github.com/ur-whitelab/peptide-dashboard/raw/master/ml/data/hemo-positive.npz",
    "positive.npz",
)
urllib.request.urlretrieve(
    "https://github.com/ur-whitelab/peptide-dashboard/raw/master/ml/data/hemo-negative.npz",
    "negative.npz",
)
with np.load("positive.npz") as r:
    pos_data = r[list(r.keys())[0]]
with np.load("negative.npz") as r:
    neg_data = r[list(r.keys())[0]]

# create labels and stitch it all into one
# tensor
labels = np.concatenate(
    (
        np.ones((pos_data.shape[0], 1), dtype=pos_data.dtype),
        np.zeros((neg_data.shape[0], 1), dtype=pos_data.dtype),
    ),
    axis=0,
)
features = np.concatenate((pos_data, neg_data), axis=0)
# we now need to shuffle before creating TF dataset
# so that our train/test/val splits are random
i = np.arange(len(labels))
np.random.shuffle(i)
labels = labels[i]
features = features[i]
full_data = tf.data.Dataset.from_tensor_slices((features, labels))
def build_model(reg=0.1, add_dropout=False):
    model = tf.keras.Sequential()
    # make embedding and indicate that 0 should be treated specially
    model.add(
        tf.keras.layers.Embedding(
            input_dim=21, output_dim=16, mask_zero=True, input_length=pos_data.shape[-1]
        )
    )

    # now we move to convolutions and pooling
    model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=5, activation="relu"))
    model.add(tf.keras.layers.MaxPooling1D(pool_size=4))

    model.add(
        tf.keras.layers.Conv1D(
            filters=16,
            kernel_size=3,
            activation="relu",
        )
    )
    model.add(tf.keras.layers.MaxPooling1D(pool_size=2))

    model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=3, activation="relu"))
    model.add(tf.keras.layers.MaxPooling1D(pool_size=2))

    # now we flatten to move to hidden dense layers.
    # Flattening just removes all axes except 1 (and implicit batch is still in there as always!)

    model.add(tf.keras.layers.Flatten())
    if add_dropout:
        model.add(tf.keras.layers.Dropout(0.3))
    model.add(
        tf.keras.layers.Dense(
            256, activation="relu", kernel_regularizer=tf.keras.regularizers.l2(reg)
        )
    )
    if add_dropout:
        model.add(tf.keras.layers.Dropout(0.3))
    model.add(
        tf.keras.layers.Dense(
            64, activation="tanh", kernel_regularizer=tf.keras.regularizers.l2(reg)
        )
    )
    if add_dropout:
        model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    return model
model = build_model(reg=0, add_dropout=False)
# now split into val, test, train
N = pos_data.shape[0] + neg_data.shape[0]
print(N, "examples")
split = int(0.1 * N)
test_data = full_data.take(split).batch(64)
nontest = full_data.skip(split)
val_data, train_data = nontest.take(split).batch(64), nontest.skip(split).shuffle(
    1000
).batch(64)
9316 examples

def train_model(model, lr=1e-3, Reduced_LR=False, Early_stop=False, epochs=20):
    # note: the tf.data datasets above are already batched, so we do not
    # pass a batch_size to fit

    tf.keras.backend.clear_session()
    callbacks = []

    if Early_stop:
        early_stopping = tf.keras.callbacks.EarlyStopping(
            monitor="val_auc",
            mode="max",
            patience=5,
            min_delta=1e-2,
            restore_best_weights=True,
        )
        callbacks.append(early_stopping)
    opt = tf.optimizers.Adam(lr)
    if Reduced_LR:
        # decay learning rate on plateau
        reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
            monitor="val_loss", factor=0.9, patience=5, min_lr=1e-5
        )
        # add a callback to print lr at the beginning of every epoch;
        # use extend so callbacks stays a flat list of callback objects
        callbacks.extend(
            [
                tf.keras.callbacks.LambdaCallback(
                    on_epoch_begin=lambda epoch, logs: print("lr =", opt.lr.numpy())
                ),
                reduce_lr,
            ]
        )
    model.compile(
        opt,
        loss="binary_crossentropy",
        metrics=[
            tf.keras.metrics.AUC(from_logits=False),
            tf.keras.metrics.BinaryAccuracy(threshold=0.5),
        ],
    )
    history = model.fit(
        train_data,
        validation_data=val_data,
        epochs=epochs,
        verbose=0,
        callbacks=callbacks,
    )
    print(
        f"Train Loss: {history.history['loss'][-1]:.3f}, Test Loss: {history.history['val_loss'][-1]:.3f}"
    )
    return history
def plot_losses(history, test_data):
    plt.plot(history.history["loss"], label="training")
    plt.plot(history.history["val_loss"], label="validation")
    plt.legend()
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.show()
    # note: this evaluates the global `model`, which is rebuilt before each call
    result = model.evaluate(test_data)
    print(f" Test AUC: {result[1]:.3f}  Test Accuracy: {result[2]:.3f}")

18.5.1. Baseline Model

history = train_model(model, Reduced_LR=False)
plot_losses(history, test_data)
Train Loss: 0.289, Test Loss: 0.453
../_images/Hyperparameter_tuning_18_1.png
15/15 [==============================] - 0s 4ms/step - loss: 0.4460 - auc: 0.7980 - binary_accuracy: 0.8249
 Test AUC: 0.798  Test Accuracy: 0.825

So with the default hyperparameters, the baseline model above is clearly overfitting to the training data. We first try adding L2 regularization:

model = build_model(reg=0.01, add_dropout=False)
history = train_model(model, Reduced_LR=False, Early_stop=False)
plot_losses(history, test_data)
Train Loss: 0.352, Test Loss: 0.433
../_images/Hyperparameter_tuning_20_1.png
15/15 [==============================] - 0s 4ms/step - loss: 0.4207 - auc: 0.7969 - binary_accuracy: 0.8324
 Test AUC: 0.797  Test Accuracy: 0.832

Great! Now we have a lower test loss and better accuracy.

18.5.2. Early Stopping

We can use early stopping regularization and restore the best weights based on the maximum AUC obtained on the validation data. This is done by adding the early stopping callback when fitting the model:

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_auc", mode="max", patience=5, min_delta=1e-2, restore_best_weights=True
)

Let’s see how the model performs with early stopping:

model = build_model(reg=0.01, add_dropout=False)
history = train_model(model, Reduced_LR=False, Early_stop=True)
plot_losses(history, test_data)
Train Loss: 0.357, Test Loss: 0.451
../_images/Hyperparameter_tuning_25_1.png
15/15 [==============================] - 0s 4ms/step - loss: 0.4242 - auc: 0.7804 - binary_accuracy: 0.8228
 Test AUC: 0.780  Test Accuracy: 0.823

We get about the same performance in fewer epochs. Note that for learning purposes, we have limited the number of epochs to \(20\) in this example. Early stopping regularization becomes more relevant when training for a large number of epochs, as it halts the training and saves computational budget whenever further training brings no gain.

18.5.3. Reduced Learning Rate on Plateau and Dropout

Now let’s try reducing the learning rate on plateau and adding dropout. Since training is already limited to \(20\) epochs, we don’t use early stopping here:

model = build_model(reg=0.01, add_dropout=True)
history = train_model(model, Reduced_LR=True, Early_stop=False, epochs=20)
plot_losses(history, test_data)
lr = 0.001
(printed at the start of each of the 20 epochs; the plateau callback never reduced the learning rate in this run)
Train Loss: 0.372, Test Loss: 0.446
../_images/Hyperparameter_tuning_28_21.png
15/15 [==============================] - 0s 4ms/step - loss: 0.4137 - auc: 0.8015 - binary_accuracy: 0.8195
 Test AUC: 0.802  Test Accuracy: 0.820

18.6. Discussion

In this chapter, we explored ways of reducing overfitting and improving model performance by tuning the hyperparameters of deep learning models. The techniques suggested can give you a good head start on tuning your own model’s hyperparameters. What is important is that you need to experiment to find a good set of hyperparameters. This can be time-consuming, so use your own judgment and intuition to devise a smart search strategy. Before you start hypertuning, make sure you train a baseline model, then slowly add more pieces to the puzzle, guided by training and validation loss, AUC, or other metrics.

There are also toolkits for hyperparameter optimization, such as KerasTuner, Optuna, and Ray Tune, that might be handy.

18.7. Cited References

LJD+18

Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: a novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2018. URL: http://jmlr.org/papers/v18/16-558.html.

SKYL17

Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.

GSM+17

Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and David Sculley. Google vizier: a service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 1487–1495. 2017.

DSH15

Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-fourth international joint conference on artificial intelligence. 2015.

JT16

Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Artificial intelligence and statistics, 240–248. PMLR, 2016.