{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "ipub": {
     "titlepage": {
      "author": "Mehrad Ansari",
      "email": "mehrad.ans@gmail.com",
      "institution": [
       "Institution1",
       "Institution2"
      ],
      "logo": "path/to/logo_example.png",
      "subtitle": "Sub-Title",
      "supervisors": [
       "First Supervisor",
       "Second Supervisor"
      ],
      "tagline": "A tagline for the report.",
      "title": "Main-Title"
     }
    }
   },
   "source": [
    "# Hyperparameter Tuning \n",
    "\n",
    "```{admonition} Authors:\n",
    "[Mehrad Ansari](https://github.com/mehradans92)\n",
    "```\n",
    "\n",
    "As you have learned so far, there are number of optimization algorithms to train deep learning models. Training a complex deep learning model can take hours, days or even weeks. Therefore, random guesses for model's hyperparameters might not be very practical and some deeper knowledge is required. These hyperparameters include but are not limited to learning rate, number of hidden layers, dropout rate, batch size and number of epochs to train for. Notably, not all these hyperparameters contribute in the same way to the model's performance, which makes finding the best configurations of these variables in such high dimensional space a nontrivial challenge (searching is expensive!). In this chapter, we look at different strategies to tackle this searching problem.\n",
    "\n",
    "```{admonition} Audience & Objectives\n",
    "This chapter builds on {doc}`layers` and {doc}`../ml/classification`. After completing this chapter, you should be able to \n",
    "\n",
    "  * Distinguish between training and model design-related hyperparameters \n",
    "  * Understand the importance of validation data in hyperparameter tuning  \n",
    "  * Understand how each hyperparameter can affect a model's performance\n",
    "```\n",
    "\n",
    "Hyperparameters can be categorized into two groups: those used for training and those related to model structure and design."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training Hyperparameters\n",
    "\n",
    "### Learning rate\n",
    "\n",
    "Gradient descent algorithms multiply the gradient by a scalar known as learning rate to determine the next point in the weights' space. Learning rate is a hyperparameter that controls the step size to move in the direction of lower loss function, with the goal of minimizing it. In most cases, learning rate is manually adjusted during model training. Large learning rates ($\\alpha$) make the model learn faster but at the same time it may cause us to miss the minimum loss function and only reach the surrounding of it. In cases where the learning rate is too large, the optimizer overshoots the minimum and the loss updates will lead to divergent behaviours. On the other hand, choosing lower $\\alpha$ values gives a better chance of finding the local minima with the trade-off of needing larger number of epochs and more time. \n",
    "\n",
    "\n",
    "```{figure} ./loss-lr.gif\n",
    "----\n",
    "name: loss_lr\n",
    "width: 1000px\n",
    "alt: Effect of learning rate on loss.\n",
    "----\n",
    "Effect of learning rate on loss. \n",
    "```\n",
    "    \n",
    "Note that we can almost never plot the loss as a function of weight's space (as shown in {numref}`loss_lr`), and this makes finding the reasonable $\\alpha$ tricky.  With a proper constant $\\alpha$, the model can be trained to a passable yet still unsatisfactory accuracy, because the constant $\\alpha$ can be overlarge, especially in the last few epochs. Alternatively, $\\alpha$ can be adaptively adjusted in response to the performance of the model. This is also known as learning rate decay schedule. Some commonly applied decay schedules include linear (step), exponential, polynomial and cyclic. By starting at a larger learning rate, we achieve the rarely discussed benefit of allowing our model to escape the local minima that overfits, and find a broader minimum as learning rate decreases over the number of epochs. If you are using [`Keras`](https://keras.io/), you can either define your own scheduler or use any from [{obj}`tf.keras.optimizers.schedules`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules) and add it as a callback when you train your model. Some other adaptive learning rate schedulers that might help with reducing tuning efforts can be found in {cite}`khodamoradi2021aslr,yang2020pacl,vidyabharathi2021achieving`.\n",
    "\n",
    "\n",
    "```{figure} ./decay.gif\n",
    "----\n",
    "name: decay_lr\n",
    "width: 750px\n",
    "alt: Decay schedules effect on training\n",
    "----\n",
    "Decay schedules on learning rate can possibly help escaping the local minima. \n",
    "```\n",
    "\n",
    "### Momentum\n",
    "\n",
    "Another tweak that can help escaping the local minima is the addition of the history to the weight update. \"The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction\" {cite}`goodfellow2017deep`. Simply speaking, this means that rather than using only the gradient of the current step to guide the search in the weights' space, momentum also accumulates the gradient of the past steps to determine which direction to go to. Momentum has the effect of smoothing the optimization process, allowing for slower updates to continue in the previous directions instead of oscillating or getting stuck. Momentum has a value greater than zero and less than one, but common values used in practice range between 0.9 and 0.99. Note that momentum is rarely changed in deep learning as a hyperparameter, as it does not really help with configuring the proper learning rate. Instead, it can increase the speed of the optimization process by increasing the likelihood of obtaining a better set of weights in fewer training epochs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Batch Size\n",
    "\n",
    "The batch size is always a trade-off between computational efficiency and accuracy. By reducing the batch size, you are essentially reducing the number of samples based on which loss is calculated at each training iteration. Considering model evaluation metrics, smaller batch sizes generalize well on the unobserved data in validation and test set. But why do large batch sizes result a poorer generalization? [This thread](https://stats.stackexchange.com/questions/164876/what-is-the-trade-off-between-batch-size-and-number-of-iterations-to-train-a-neu) on Stack Exchange has some great hypotheses:\n",
    "\n",
    "- Gradient descent-based optimization makes linear approximation of the loss function, and this approximation will not be the best for highly nonlinear loss functions, thus, having a smaller batch size helps.\n",
    "- Large-batch training methods are shown to converge to sharp minimizers of the training and testing functions, and that sharp minima results poorer generalization {cite}`keskar2016large,hoffer2017train,lin2020extrapolation`.\n",
    "- Since smaller samples can have more variations from one another, they can add more noise to convergence. Thus, by keeping the batch size small, we are more likely to escape the local minima and find a more broader one.\n",
    "\n",
    "\n",
    "\n",
    "As mentioned in the previous section, decaying the learning rate is a common practice in training ML frameworks. But, it has actually been shown that we can obtain the same benefits and learning curve by scaling up the batch size during training instead {cite}`smith2017don`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Design-Related Hyperparameters\n",
    "\n",
    "### Number of hidden layers\n",
    "\n",
    "The overall structure of neural networks are determined based on the number of hidden layers ($d$). Before describing this, let us talk about the traditional disagreement on how the total number of layers are counted. This disagreement centers around whether or not the input layer is counted. There is an argument that suggests it should not be counted since inputs are not active. We go by this convention, which is also recommended in {cite}`reed1999neural`. A single-layer can only be used to represent linearly separable functions, in very simple problems. On the other hand, deep learning models with more layers are more likely to capture more complex features are obtain a relative higher accuracy. As a common sense, you can keep adding layers until the test error does not improve anymore.\n",
    "\n",
    "### Number of nodes in each hidden layer\n",
    "\n",
    "The number of nodes ($w$) in each layer should be carefully considered to avoid overfitting and underfitting. Small ($w$) may result underfitting as the model lacks complexity. On the other hand, too many nodes (large $w$) can cause overfitting and increase training time. Here are some [rules-of-thumb](https://www.heatonresearch.com/2017/06/01/hidden-layers.html) that can be a good start for tuning the number of nodes:\n",
    "\n",
    "1. $w_{input}$ $<$ $w$ $<$ $w_{output}$\n",
    "\n",
    "2. $w = \\frac{2}{3}w_{input}$ $+$ $w_{output}$\n",
    "\n",
    "3. $w < 2 w_{output}$\n",
    "\n",
    "### Regularization\n",
    "\n",
    "Regularization is applied to counteract the additional model complexity that comes as a result of adding more nodes in deep neural networks. The most common regularization methods ($L_1$ and $L_2$) are explained in {doc}`../ml/regression`. Hyperparameter $\\lambda$ determines the magnitude of the regularization term in the loss function. Too large $\\lambda$ pushes the weights closer to zero and oversimplifies the structure of the deep learning model, where as undersized $\\lambda$ is not strong enough to reduce the weights. Thanks to its computational efficiency $L_2$ regularization is more widely used, however, $L_1$ regularization prevails over $L_2$ on sparse properties. A brief comparison between $L_1$ and  $L_2$ are presented in the table below.\n",
    "\n",
    "\n",
    "|                         $L_1$                        |                  $L_2$                 |\n",
    "|:--------------------------------------------------:|:-----------------------------------:|\n",
    "| Penalizes the sum of the absolute value of weights | Penalizes the sum of square weights |\n",
    "|          Unable to learn complex patterns          | Able to learn complex data patterns |\n",
    "|                 Robust to outliers                 |        Not robust to outliers       |\n",
    "|             Built-in feature selection             |         No feature selection        |\n",
    "|                  Multiple solution                 |             One solution            |\n",
    "|                   Sparse solution                  |          Nonsparse solution         |\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hyperparameter Optimization\n",
    "\n",
    "Mathematically, hyperparameter optimization (HPO) is the process of finding the set of hyperparamters to either achieve minimum loss or maximum accuracy of the objective model. The general philosophy is the same for all HPO algorithms: determine which hyperparameters to tune and their corresponding search space, adjust them from coarse to fine and obtain the optimal combination.\n",
    "\n",
    "The state-of-the-art HPO algorithms are classified into two categories: search algorithms and trial schedulers. In general, search algorithms are applied for sampling and trial schedulers deal with the early stopping methods for model evaluation.\n",
    "\n",
    "### Search Algorithms \n",
    "\n",
    "#### Grid Search\n",
    "As long as you have sufficient computational resources, grid search is a straightforward method for HPO. It performs an exhaustive search on the hyperparameters sets defined by the user. Grid search is applicable for cases where we have limited number of hyperparameters with limited search space.\n",
    "\n",
    "#### Random Search\n",
    "\n",
    "Random search is a more effective version of grid search but still computationally exhaustive. It performs a randomized search over hyperparameters from certain distributions over possible parameter values. The searching process continues until the desired accuracy is reached or unitl the predetermined computational budget is exhausted. Random search works better than grid search considering two benefits: First, a budget can be assigned independently based on the distribution of the search space, whereas in grid search the budget for each hyperparameter set is a fixed value. This makes random search perform better than grid search, especially in search patterns where some hyperparameters are not uniformly distributed. Secondly, a larger time consumption of random search quite certainly will lead to a larger probability finding the best hyperparameter set (this is known as Monte Carlo techniques)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Bayesian Optimization\n",
    "\n",
    "Bayesian optimization (BO) is a sequential model-based optimization that aims at becoming less wrong with more data by finding the global optimum by balancing exploration and exploitation that minimizes the number of trials. BO outperforms random and grid search in two aspects: 1. There is no need to have some preliminary knowledge of the distribution of hyperparameters. 2. Unlike random and grid search, the posterior probability is obtained based of a relevant search space, meaning that the algorithm discards the hyperparameter ranges that will most likely not deliver promising solutions according to the previous trials. 3. Another remarkable advantage is that BO is applicable to different settings, where the derivative of the objective function is unknown or expensive to calculate, whether it is stochastic or discrete, or convex or non-convex.\n",
    "\n",
    "BO consists of two key ingredients: 1. A Bayesian probability surrogate model to model the expensive objective function. A **surrogate** mother is a women who agrees to bear a child for another person, so in context, a surrogate function is a less expensive approximation of the objective function. A popular surrogate model for BO are Gaussian processes (GPs).\n",
    "2. An acquisition function that acts as a metric function to determine the next optimal sampling point. This is where BO provides a balanced trade-off between exploitation and exploration. Exploitation means sampling where the surrogate model predicts a high objective,  given the current available solutions. Exploration means sampling at locations where the prediction uncertainty is high. These both correspond to high acquisition function values, and the goal is to determine the next sampling point by maximizing the acquisition function.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Trial Schedulers\n",
    "\n",
    "HPO is a time-consuming process and in realistic scenarios, it is necessary to obtain the best hyperparameter with limited available resources. When it comes to training the hyperparameters by hand, by experience we can narrow the search space down, evaluate the model during training and decide whether to stop the training or keep going. An early stopping strategy tries to mimic this behavior and maximize the computational resource budget for promising hyperparameter sets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Median Stopping\n",
    "\n",
    "Median stopping is a straightforward early termination policy that makes the stopping decision based on the average primary metrics, such as accuracy or loss reported by previous runs. A trial $X$ is halted at step $S$ if the best objective value by step $S$ is strictly worse than the median value of the running average of all completed trials objective values reported at step $S$ {cite}`golovin2017google`.\n",
    "\n",
    "#### Curve Fitting\n",
    "\n",
    "Curve Fitting is another early stopping algorithm rule that predicts the final accuracy or loss using a performance curve regressed from a set of completed or partially completed trials {cite}`domhan2015speeding`. A trial $X$ will be stopped at step $S$ if the extrapolation of the learning curves is worse than the tolerant value of the optimal in the trial history. Unlike median stopping, where we don't have any hyperparameters, curve fitting is a model with parameters and it also requires a training process.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Successive Halving\n",
    "\n",
    "Successive Halving (SHA) converts the hyperparameter optimization problem into a non-stochastic best-arm identification and tries to allocate more resources only to those hyperparameter sets that are more promising {cite}`jamieson2016non`. In SHA, user defines and fixed budget ($B$) and a fixed number of trials ($n$). The hyperparameter sets are uniformly queried for a portion of the intial budget and the corresponding model performances are evaluated for all trials. The worst promising half is dropped while the budget is doubled for the other half and this is done successively until one trial remains. One drawback of SHA is how resources are allocated. There is a trade-off between the total budget ($B$) and number of trials ($n$). If $n$ is too large, each trial may result in premature termination, whereas too small $n$ would not provide enough optional choices.  \n",
    "\n",
    "Compared with Bayesian optimization, SHA is easier to understand and it is more computationally efficient as it evaluate the intermediate model results and determines whether to terminate it or not. \n",
    "\n",
    "#### HyperBand\n",
    "\n",
    "HyperBand is an extension of SHA that tries to solve the resource allocation problem by considering several possible $n$ values and fixing $B$ {cite}`hyperband`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running This Notebook\n",
    "\n",
    "\n",
    "Click the &nbsp;<i aria-label=\"Launch interactive content\" class=\"fas fa-rocket\"></i>&nbsp; above to launch this page as an interactive Google Colab. See details below on installing packages, either on your own environment or on Google Colab\n",
    "\n",
    "````{tip} My title\n",
    ":class: dropdown\n",
    "To install packages, execute this code in a new cell\n",
    "\n",
    "```\n",
    "!pip install dmol-book\n",
    "```\n",
    "\n",
    "````"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hyperparameter Tuning "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we know more about different HPO techniques, we develop a deep learning model to predict hemolysis in peptides and tune its hyperparameters. Hemolysis is defined as the disruption of erythrocyte membranes that decrease the life span of red blood cells and causes the release of Hemoglobin. Identifying hemolytic antimicrobial is critical to their applications as non-toxic and safe measurements against bacterial infections. However, distinguishing between hemolytic and non-hemolytic peptides is complicated, as they primarily exert their activity at the charged surface of the bacterial plasma membrane. Timmons and Hewage {cite}`timmons2020happenn` differentiate between the two whether they are active at the zwitterionic eukaryotic membrane, as well as the anionic prokaryotic membrane. The model for hemolytic prediction is trained using data from the Database of Antimicrobial Activity and Structure of Peptides (DBAASP v3 {cite}`pirtskhalava2021dbaasp`). The activity is defined by extrapolating a measurement assuming a dose response curves to the point at which 50\\% of red blood cells (RBC) are lysed. If the activity is below $100\\frac{\\mu g}{ml}$, it is considered hemolytic. Each measurement is treated independently, so sequences can appear multiple times. The training data contains 9,316 positive and negative sequences of only L- and canonical amino acids."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import matplotlib as mpl\n",
    "import warnings\n",
    "import urllib\n",
    "import dmol"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "urllib.request.urlretrieve(\n",
    "    \"https://github.com/ur-whitelab/peptide-dashboard/raw/master/ml/data/hemo-positive.npz\",\n",
    "    \"positive.npz\",\n",
    ")\n",
    "urllib.request.urlretrieve(\n",
    "    \"https://github.com/ur-whitelab/peptide-dashboard/raw/master/ml/data/hemo-negative.npz\",\n",
    "    \"negative.npz\",\n",
    ")\n",
    "with np.load(\"positive.npz\") as r:\n",
    "    pos_data = r[list(r.keys())[0]]\n",
    "with np.load(\"negative.npz\") as r:\n",
    "    neg_data = r[list(r.keys())[0]]\n",
    "\n",
    "# create labels and stich it all into one\n",
    "# tensor\n",
    "labels = np.concatenate(\n",
    "    (\n",
    "        np.ones((pos_data.shape[0], 1), dtype=pos_data.dtype),\n",
    "        np.zeros((neg_data.shape[0], 1), dtype=pos_data.dtype),\n",
    "    ),\n",
    "    axis=0,\n",
    ")\n",
    "features = np.concatenate((pos_data, neg_data), axis=0)\n",
    "# we now need to shuffle before creating TF dataset\n",
    "# so that our train/test/val splits are random\n",
    "i = np.arange(len(labels))\n",
    "np.random.shuffle(i)\n",
    "labels = labels[i]\n",
    "features = features[i]\n",
    "full_data = tf.data.Dataset.from_tensor_slices((features, labels))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_model(reg=0.1, add_dropout=False):\n",
    "    model = tf.keras.Sequential()\n",
    "    # make embedding and indicate that 0 should be treated specially\n",
    "    model.add(\n",
    "        tf.keras.layers.Embedding(\n",
    "            input_dim=21, output_dim=16, mask_zero=True, input_length=pos_data.shape[-1]\n",
    "        )\n",
    "    )\n",
    "\n",
    "    # now we move to convolutions and pooling\n",
    "    model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=5, activation=\"relu\"))\n",
    "    model.add(tf.keras.layers.MaxPooling1D(pool_size=4))\n",
    "\n",
    "    model.add(\n",
    "        tf.keras.layers.Conv1D(\n",
    "            filters=16,\n",
    "            kernel_size=3,\n",
    "            activation=\"relu\",\n",
    "        )\n",
    "    )\n",
    "    model.add(tf.keras.layers.MaxPooling1D(pool_size=2))\n",
    "\n",
    "    model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=3, activation=\"relu\"))\n",
    "    model.add(tf.keras.layers.MaxPooling1D(pool_size=2))\n",
    "\n",
    "    # now we flatten to move to hidden dense layers.\n",
    "    # Flattening just removes all axes except 1 (and implicit batch is still in there as always!)\n",
    "\n",
    "    model.add(tf.keras.layers.Flatten())\n",
    "    if add_dropout:\n",
    "        model.add(tf.keras.layers.Dropout(0.3))\n",
    "    model.add(\n",
    "        tf.keras.layers.Dense(\n",
    "            256, activation=\"relu\", kernel_regularizer=tf.keras.regularizers.l2(reg)\n",
    "        )\n",
    "    )\n",
    "    if add_dropout:\n",
    "        model.add(tf.keras.layers.Dropout(0.3))\n",
    "    model.add(\n",
    "        tf.keras.layers.Dense(\n",
    "            64, activation=\"tanh\", kernel_regularizer=tf.keras.regularizers.l2(reg)\n",
    "        )\n",
    "    )\n",
    "    if add_dropout:\n",
    "        model.add(tf.keras.layers.Dropout(0.3))\n",
    "    model.add(tf.keras.layers.Dense(1, activation=\"sigmoid\"))\n",
    "    return model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = build_model(reg=0, add_dropout=False)\n",
    "# now split into val, test, train\n",
    "N = pos_data.shape[0] + neg_data.shape[0]\n",
    "print(N, \"examples\")\n",
    "split = int(0.1 * N)\n",
    "test_data = full_data.take(split).batch(64)\n",
    "nontest = full_data.skip(split)\n",
    "val_data, train_data = nontest.take(split).batch(64), nontest.skip(split).shuffle(\n",
    "    1000\n",
    ").batch(64)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train_model(\n",
    "    model, lr=1e-3, Reduced_LR=False, Early_stop=False, batch_size=32, epochs=20\n",
    "):\n",
    "    tf.keras.backend.clear_session()\n",
    "    callbacks = []\n",
    "\n",
    "    if Early_stop:\n",
    "        early_stopping = tf.keras.callbacks.EarlyStopping(\n",
    "            monitor=\"val_auc\",\n",
    "            mode=\"max\",\n",
    "            patience=5,\n",
    "            min_delta=1e-2,\n",
    "            restore_best_weights=True,\n",
    "        )\n",
    "        callbacks.append(early_stopping)\n",
    "    opt = tf.optimizers.Adam(lr)\n",
    "    if Reduced_LR:\n",
    "        # decay learning rate on plateau\n",
    "        reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(\n",
    "            monitor=\"val_loss\", factor=0.9, patience=5, min_lr=1e-5\n",
    "        )\n",
    "        # add a callback to print lr at the begining of every epoch\n",
    "        callbacks.append(reduce_lr)\n",
    "    model.compile(\n",
    "        opt,\n",
    "        loss=\"binary_crossentropy\",\n",
    "        metrics=[\n",
    "            tf.keras.metrics.AUC(from_logits=False),\n",
    "            tf.keras.metrics.BinaryAccuracy(threshold=0.5),\n",
    "        ],\n",
    "    )\n",
    "    history = model.fit(\n",
    "        train_data,\n",
    "        validation_data=val_data,\n",
    "        epochs=epochs,\n",
    "        batch_size=batch_size,\n",
    "        verbose=0,\n",
    "        callbacks=callbacks,\n",
    "    )\n",
    "    #     print(\n",
    "    #         f\"Train Loss: {history.history['loss'][-1]:.3f}, Test Loss: {history.history['val_loss'][-1]:.3f}\"\n",
    "    #     )\n",
    "    return history\n",
    "\n",
    "\n",
    "def plot_losses(history, test_data):\n",
    "    plt.figure(dpi=100)\n",
    "    plt.plot(history.history[\"loss\"], label=\"training\")\n",
    "    plt.plot(history.history[\"val_loss\"], label=\"validation\")\n",
    "    plt.legend()\n",
    "    plt.xlabel(\"Epoch\")\n",
    "    plt.ylabel(\"Loss\")\n",
    "    plt.show()\n",
    "    result = model.evaluate(test_data, verbose=0)\n",
    "    print(f\" Test AUC: {result[1]:.3f}  Test Accuracy: {result[2]:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So if you take a look at the `train_model` function defined above, we are using the validation data **exclusively** for the hyperparameter search. The test data is only used in `plot_losses` and gives us an estimation on model's generalization error."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Baseline Model\n",
    "\n",
    "We first start off with a baseline model with similar structure to the model used in {doc}`layers`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "history = train_model(model, Reduced_LR=False)\n",
    "plot_losses(history, test_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So with the default hyperparameters, the **baseline** model above is clearly ovefitting to the training data. We first try adding l2 regularization:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "model = build_model(reg=0.01, add_dropout=False)\n",
    "history = train_model(model, Reduced_LR=False, Early_stop=False)\n",
    "plot_losses(history, test_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great! Now We have a lower test loss and better AUC."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ### Early Stopping\n",
    "\n",
    "We can use early stopping regularization and return best weights based on maximum obtained AUC value for the **validation data**. This is done by adding the early stopping callback, when compiling the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "early_stopping = tf.keras.callbacks.EarlyStopping(\n",
    "    monitor=\"val_auc\", mode=\"max\", patience=5, min_delta=1e-2, restore_best_weights=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see how the model performs with early stopping:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = build_model(reg=0.01, add_dropout=False)\n",
    "history = train_model(model, Reduced_LR=False, Early_stop=True)\n",
    "plot_losses(history, test_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have almost the same performance but in **fewer** number of epochs. Note that for learning purposes, we have limited the number of epochs to $20$ in this example. Early stopping regularization becomes more relevant when we typically have a large number of epochs, as it halts the training and saves computational budget, unless there is gain in more training."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ### Reduced Learning Rate on Plateau and Dropout\n",
    " \n",
    " Now let's try reducing the learning rate and dropout. Since the training epochs is already limited to $20$, we don't use early stopping here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = build_model(reg=0.01, add_dropout=True)\n",
    "history = train_model(model, Reduced_LR=True, Early_stop=False, epochs=20)\n",
    "plot_losses(history, test_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Discussion\n",
    "\n",
    "In this chapter, we explored means of reducing overfitting and enhancing feature selection in deep learning models. Techniques suggested can give you a good head start on tuning your model's hyperparameters. What is important is that, you need to experiment to find a good set of hyperparameters. This can be time-consuming, so use your own judgment and intuition to come up with a smart search strategy. Before you start hypertuning, make sure you obtain a baseline model and slowly add more pieces to the puzzle, based on training and validation loss, AUC or other metrics.\n",
    "\n",
    "There are also some toolkits for hyperparameter optimization that might be handy:\n",
    "\n",
    "- [Ray Tune for PyTorch](https://www.heatonresearch.com/2017/06/01/hidden-layers.html) {cite}`liaw2018tune`\n",
    "- [Keras-Tuner for Keras](https://keras.io/keras_tuner/) {cite}`omalley2019kerastuner`\n",
    "- [Optuna](https://optuna.org/) {cite}`optuna_2019`\n",
    "- [Hyperopt](https://github.com/hyperopt/hyperopt) {cite}`bergstra2013making`\n",
    "- [Scikit-Optimize](https://scikit-optimize.github.io/stable/auto_examples/hyperparameter-optimization.html) {cite}`louppe2017bayesian`\n",
    "- [Microsoft's Neural Network Intelligence](https://github.com/Microsoft/nni) \n",
    "- [Google's Vizer](https://cloud.google.com/ai-platform/optimizer/docs/overview) {cite}`golovin2017google`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cited References\n",
    "\n",
    "```{bibliography}\n",
    ":style: unsrtalpha\n",
    ":filter: docname in docnames\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3.10.7 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.7"
  },
  "vscode": {
   "interpreter": {
    "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}