{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to Machine Learning\n",
"\n",
"There are probably a thousand articles called *introduction to machine learning*. Rather than rewrite this, I will instead introduce the main ideas focused on a chemistry example. Here are some introductory sources, and please do recommend new ones to me:\n",
"\n",
"1. The book I first read in grad school about machine learning by Ethem Alpaydin{cite}`alpaydin2020introduction`\n",
"1. Nils Nillson's online book [Introductory Machine Learning](https://ai.stanford.edu/~nilsson/mlbook.html)\n",
"2. Two reviews of machine learning in materials{cite}`fung2021benchmarking,balachandran2019machine`\n",
"3. A review of machine learning in computational chemistry{cite}`gomez2020machine`\n",
"4. A review of machine learning in metals{cite}`nandy2018strategies`\n",
"\n",
"I hope you learn from these sources about how machine learning is a method of modeling data, typically with predictive functions. Machine learning includes many techniques, but here we will focus on only those necessary to transition into deep learning. For example, random forests, support vector machines, and nearest neighbor are widely-used machine learning techniques that are effective but not covered here.\n",
"\n",
"```{admonition} Audience & Objectives\n",
"This chapter is intended for novices of machine learning with familiarity of chemistry and python. It is recommended that you look over one of the above recommended introductory articles. This specific article assumes a very small amount of knowledge of the `pandas` library (loading and selecting a column), awareness of `rdkit` (how we draw molecules), and that we store/retrieve molecules as [SMILES](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system) {cite}`weininger1988smiles`. After reading this, you should be able to:\n",
"\n",
" * Define features, labels\n",
" * Distinguish between supervised and unsupervised learning\n",
" * Understand what a loss function is and how it can be minimized with gradient descent\n",
" * Understand what model is and its connection to features and labels\n",
" * Be able to cluster data and describe what it tells us about data\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Ingredients \n",
"\n",
"Machine learning is about constructing models by fitting them to data. Firstly, definitions:\n",
"\n",
"**Features** \n",
"\n",
" set of $N$ vectors $\\{\\vec{x}_i\\}$ of dimension $D$. Can be reals, integers, etc.\n",
"\n",
"**Labels** \n",
"\n",
" set of $N$ integers or reals $\\{y_i\\}$. $y_i$ is usually a scalar\n",
" \n",
"**Labeled Data** \n",
"\n",
" set of $N$ tuples $\\{\\left(\\vec{x}_i, y_i\\right)\\}$ \n",
"\n",
"**Unlabeled Data** \n",
"\n",
" set of $N$ features $\\{\\vec{x}_i\\}$ that may have unknown $y$ labels\n",
"\n",
"**Model**\n",
"\n",
" A function $f(\\vec{x})$ that takes a given feature vector in and returns a predicted $\\hat{y}$\n",
"\n",
"**Predictions**\n",
"\n",
" $\\hat{y}$, our predicted output for a given input $\\vec{x}$.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Supervised Learning\n",
"\n",
"Our first task will be **supervised learning**. Supervised learning means predicting $y$ from $\\vec{x}$ with a model trained on data. It is *supervised* because we tell the algorithm what the labels are in our dataset. Another method we'll explore is **unsupervised learning** where we do not tell the algorithm the labels. We'll see this supervised/unsupervised distinction can be more subtle later on, but this is a great definition for now. \n",
"\n",
"To see an example, we will use a dataset called AqSolDB{cite}`Sorkun2019` that is about 10,000 unique compounds with measured solubility in water (label). The dataset also includes molecular properties (features) that we can use for machine learning. The solubility measurement is solubility of the compound in water in units of log molarity."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running This Notebook\n",
"\n",
"\n",
"Click the above to launch this page as an interactive Google Colab. See details below on installing packages.\n",
"\n",
"````{tip} My title\n",
":class: dropdown\n",
"To install packages, execute this code in a new cell. \n",
"\n",
"```\n",
"!pip install dmol-book\n",
"```\n",
"\n",
"If you find install problems, you can get the latest working versions of packages used in [this book here](https://github.com/whitead/dmol-book/blob/master/package/requirements.txt)\n",
"\n",
"````"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load Data\n",
"\n",
"Download the data and load it into a [Pandas](https://pandas.pydata.org/) data frame. The hidden cells below sets-up our imports and/or install necessary packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"hide-cell"
]
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import numpy as np\n",
"import jax.numpy as jnp\n",
"import jax\n",
"from jax.example_libraries import optimizers\n",
"import sklearn.manifold, sklearn.cluster\n",
"import rdkit, rdkit.Chem, rdkit.Chem.Draw\n",
"import dmol"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# soldata = pd.read_csv('https://dataverse.harvard.edu/api/access/datafile/3407241?format=original&gbrecs=true')\n",
"# had to rehost because dataverse isn't reliable\n",
"soldata = pd.read_csv(\n",
" \"https://github.com/whitead/dmol-book/raw/master/data/curated-solubility-dataset.csv\"\n",
")\n",
"soldata.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data Exploration\n",
"\n",
"```{margin} EDA\n",
"If doing EDA as a way to choose features, you should do the train/test/(valid) split prior to EDA to avoid\n",
"contaminating model selection with test data.\n",
"```\n",
"\n",
"We can see that there are a number of features like molecular weight, rotatable bonds, valence electrons, etc. And of course, there is the label **solubility**. One of the first things we should always do is get familiar with our data in a process that is sometimes called **exploratory data analysis** (EDA). Let's start by examining a few specific examples to get a sense of the range of labels/data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plot one molecule\n",
"mol = rdkit.Chem.MolFromInchi(soldata.InChI[0])\n",
"mol"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is first molecule in the dataset rendered using [rdkit](https://rdkit.org/).\n",
"\n",
"Let's now look at the extreme values to get a sense of the **range** of solubility data and the molecules that make it. First, we'll histogram (using {obj}`seaborn.distplot`) the solubility which tells us about the shape of its probability distribution and the extreme values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.distplot(soldata.Solubility)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above we can see the histogram of the solubility with kernel density estimate overlaid. The histogram shows that the solubility varies from about -13 to 2.5 and is not normally distributed. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get 3 lowest and 3 highest solubilities\n",
"soldata_sorted = soldata.sort_values(\"Solubility\")\n",
"extremes = pd.concat([soldata_sorted[:3], soldata_sorted[-3:]])\n",
"\n",
"# We need to have a list of strings for legends\n",
"legend_text = [\n",
" f\"{x.ID}: solubility = {x.Solubility:.2f}\" for x in extremes.itertuples()\n",
"]\n",
"\n",
"# now plot them on a grid\n",
"extreme_mols = [rdkit.Chem.MolFromInchi(inchi) for inchi in extremes.InChI]\n",
"rdkit.Chem.Draw.MolsToGridImage(\n",
" extreme_mols, molsPerRow=3, subImgSize=(250, 250), legends=legend_text\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The figure of extreme molecules shows highly-chlorinated compounds have the lowest solubility and ionic compounds have higher solubility. Is A-2918 an **outlier**, a mistake? Also, is NH$_3$ really comparable to these organic compounds? These are the kind of questions that you should consider *before* doing any modeling.\n",
"\n",
"```{margin} Outliers\n",
"\n",
"Outliers are extreme values that fall outside of your normal data distribution. They can be mistakes or be from a different distribution (e.g., metals instead of organic molecules). Outliers can have a strong effect on model training.\n",
"\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feature Correlation\n",
"Now let's examine the features and see how correlated they are with solubility. Note that there are a few columns unrelated to features or solubility: `SD` (standard deviation), `Ocurrences` (how often the molecule occurred in the constituent databases), and `Group` (where the data came from)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features_start_at = list(soldata.columns).index(\"MolWt\")\n",
"feature_names = soldata.columns[features_start_at:]\n",
"\n",
"fig, axs = plt.subplots(nrows=5, ncols=4, sharey=True, figsize=(12, 8), dpi=300)\n",
"axs = axs.flatten() # so we don't have to slice by row and column\n",
"for i, n in enumerate(feature_names):\n",
" ax = axs[i]\n",
" ax.scatter(\n",
" soldata[n], soldata.Solubility, s=6, alpha=0.4, color=f\"C{i}\"\n",
" ) # add some color\n",
" if i % 4 == 0:\n",
" ax.set_ylabel(\"Solubility\")\n",
" ax.set_xlabel(n)\n",
"# hide empty subplots\n",
"for i in range(len(feature_names), len(axs)):\n",
" fig.delaxes(axs[i])\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's interesting that molecular weight or hydrogen bond numbers seem to have little correlation, at least from this plot. MolLogP, which is a calculated descriptor related to solubility, does correlate well. You can also see that some of these features have low **variance**, meaning the value of the feature changes little or not at all for many data points (e.g., \"NumHDonors\")."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Linear Model\n",
"\n",
"Let's begin with one of the simplest approaches — a linear model. This is our first type of supervised learning and is rarely used due to something we'll see — the difficult choice of features. \n",
"\n",
"\n",
"```{margin} Autodiff\n",
"[Autodiff](https://en.wikipedia.org/wiki/Automatic_differentiation) is a computer program tool\n",
"that can compute analytical gradients with respect to two variables in a program. \n",
"```\n",
"\n",
"Our model will be defined by this equation:\n",
"\n",
"\\begin{equation}\n",
" y = \\vec{w} \\cdot \\vec{x} + b\n",
"\\end{equation}\n",
"\n",
"which is defined for a single data point. The shape of a single feature vector, $\\vec{x}$, is 17 in our case (for the 17 features). $\\vec{w}$ is a vector of adjustable parameters of length 17 and $b$ is an adjustable scalar (called **bias**).\n",
"\n",
"We'll implement this model using a library called [``jax``](https://jax.readthedocs.io/en/latest/notebooks/quickstart.html) that is very similar to numpy except it can compute analytical gradients easily via autodiff.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def linear_model(x, w, b):\n",
" return jnp.dot(x, w) + b\n",
"\n",
"\n",
"# test it out\n",
"x = np.array([1, 0, 2.5])\n",
"w = np.array([0.2, -0.5, 0.4])\n",
"b = 4.3\n",
"\n",
"linear_model(x, w, b)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{margin} Loss\n",
"A loss is a function which takes in a model prediction $\\hat{y}$,\n",
"labels $y$, and computes a scalar representing how poor the fit is.\n",
"Our goal is to minimize loss.\n",
"```\n",
"\n",
"Now comes the critical question: *How do we find the adjustable parameters $\\vec{w}$ and $b$*? The classic solution for linear regression is computing the adjustable parameters directly with a pseudo-inverse, $\\vec{w} = (X^TX)^{-1}X^{T}\\vec{y}$. You can read more about [this here](https://nbviewer.jupyter.org/github/whitead/numerical_stats/blob/master/unit_12/lectures/lecture_1.ipynb#Extending-Least-Squares-to-Multiple-Dimensions-in-Domain---OLS-ND). We'll use an **iterative** approach that mirrors what we'll do in deep learning. This is not the correct approach for linear regression, but it'll be useful for us to get used to the iterative approach since we'll see it so often in deep learning. \n",
"\n",
"To iteratively find our adjustable parameters, we will pick a **loss** function and minimize with **gradients**. Let's define these quantities and compute our loss with some initial random w and b."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# convert data into features, labels\n",
"features = soldata.loc[:, feature_names].values\n",
"labels = soldata.Solubility.values\n",
"\n",
"feature_dim = features.shape[1]\n",
"\n",
"# initialize our paramaters\n",
"w = np.random.normal(size=feature_dim)\n",
"b = 0.0\n",
"\n",
"# define loss\n",
"def loss(y, labels):\n",
" return jnp.mean((y - labels) ** 2)\n",
"\n",
"\n",
"# test it out\n",
"y = linear_model(features, w, b)\n",
"loss(y, labels)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wow! Our loss is terrible, especially considering that solubilities are between -13 and 2. But, that's right since we just guessed our initial parameters. \n",
"\n",
"\n",
"\n",
"### Gradient Descent\n",
"\n",
"We will now try to reduce loss by using information about how it changes with respect to the adjustable parameters. If we write our loss as:\n",
"\n",
"\\begin{equation}\n",
" L = \\frac{1}{N}\\sum_i^N \\left[y_i - f(\\vec{x}_i, \\vec{w}, b)\\right]^2\n",
"\\end{equation}\n",
"\n",
"This loss is called **mean squared error**, often abbreviated MSE. We can compute our loss gradients with respect to the adjustable parameters:\n",
"\n",
"```{margin} jax.grad\n",
"[jax.grad](https://jax.readthedocs.io/en/latest/jax.html#jax.grad) computes an analytical derivative of a Python function. \n",
"It takes two arguments: the function and which args to \n",
"take the derivative of. For example, consider `f(x, y, z)`, then `jax.grad(f,(1,2))`\n",
"gives $\\frac{\\partial f}{\\partial x}, \\frac{\\partial f}{\\partial y}$. Note too that\n",
"$x$ may be a tensor. \n",
"```\n",
"\n",
"\\begin{equation}\n",
" \\frac{\\partial L}{\\partial w_i}, \\frac{\\partial L}{\\partial b}\n",
"\\end{equation}\n",
"\n",
"where $w_i$ is the $i$th element of the weight vector $\\vec{w}$. We can reduce the loss by taking a step in the direction of its negative gradient:\n",
"\\begin{equation}\n",
" (w_i, b') = \\left(w_i - \\eta \\frac{\\partial L}{\\partial w_i}, b - \\eta\\frac{\\partial L}{\\partial b}\\right)\n",
"\\end{equation}\n",
"\n",
"where $\\eta$ is **learning rate**, which an adjustable but not trained parameter (an example of a **hyperparameter**) which we just guess to be $1\\times10^{-6}$ in this example. Typically, it's chosen to be some power of 10 that is at most 0.1. Values higher than that cause stability problems. Let's try this procedure, which is called **gradient descent**.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# compute gradients\n",
"def loss_wrapper(w, b, data):\n",
" features = data[0]\n",
" labels = data[1]\n",
" y = linear_model(features, w, b)\n",
" return loss(y, labels)\n",
"\n",
"\n",
"loss_grad = jax.grad(loss_wrapper, (0, 1))\n",
"\n",
"# test it out\n",
"loss_grad(w, b, (features, labels))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've computed the gradient. Now we'll minimize it over a few steps."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"loss_progress = []\n",
"eta = 1e-6\n",
"data = (features, labels)\n",
"for i in range(10):\n",
" grad = loss_grad(w, b, data)\n",
" w -= eta * grad[0]\n",
" b -= eta * grad[1]\n",
" loss_progress.append(loss_wrapper(w, b, data))\n",
"plt.plot(loss_progress)\n",
"\n",
"plt.xlabel(\"Step\")\n",
"plt.yscale(\"log\")\n",
"plt.ylabel(\"Loss\")\n",
"plt.title(\"Full Dataset Training Curve\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training Curve\n",
"\n",
"The figure above is called a **training curve**. We'll see these frequently in this book and they show us if the loss is decreasing, indicating the model is learning. Training curves are also called **learning curves**. The x-axis may be example number, total iterations through dataset (called epochs), or some other measure of amount of data used for training the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Batching\n",
"\n",
"```{margin} batch\n",
"A batch is a subset of your data of size *batch size*. Batch size is usually as a power of 2 (e.g., 16, 128).\n",
"Having random batches of data is how gradient descent becomes stochastic gradient descent.\n",
"```\n",
"\n",
"This is making good progress. But let's try to speed things up with a small change. We'll use **batching**, which is how training is actually done in machine learning. The small change is that rather than using all data at once, we only take a small **batch** of data. Batching provides two benefits: it reduces the amount of time to compute an update to our parameters, and it makes the training process random. The randomness makes it possible to escape local minima that might stop training progress. This addition of batching makes our algorithm **stochastic** and thus we call this procedure **stochastic gradient descent** (SGD). SGD, and variations of it, are the most common methods of training in deep learning.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# initialize our paramaters\n",
"# to be fair to previous method\n",
"w = np.random.normal(size=feature_dim)\n",
"b = 0.0\n",
"\n",
"loss_progress = []\n",
"eta = 1e-6\n",
"batch_size = 32\n",
"N = len(labels) # number of data points\n",
"data = (features, labels)\n",
"# compute how much data fits nicely into a batch\n",
"# and drop extra data\n",
"new_N = len(labels) // batch_size * batch_size\n",
"\n",
"# the -1 means that numpy will compute\n",
"# what that dimension should be\n",
"batched_features = features[:new_N].reshape((-1, batch_size, feature_dim))\n",
"batched_labels = labels[:new_N].reshape((-1, batch_size))\n",
"# to make it random, we'll iterate over the batches randomly\n",
"indices = np.arange(new_N // batch_size)\n",
"np.random.shuffle(indices)\n",
"for i in indices:\n",
" # choose a random set of\n",
" # indices to slice our data\n",
" grad = loss_grad(w, b, (batched_features[i], batched_labels[i]))\n",
" w -= eta * grad[0]\n",
" b -= eta * grad[1]\n",
" # we still compute loss on whole dataset, but not every step\n",
" if i % 10 == 0:\n",
" loss_progress.append(loss_wrapper(w, b, data))\n",
"\n",
"plt.plot(np.arange(len(loss_progress)) * 10, loss_progress)\n",
"plt.xlabel(\"Step\")\n",
"plt.yscale(\"log\")\n",
"plt.ylabel(\"Loss\")\n",
"plt.title(\"Batched Loss Curve\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are three changes to note:\n",
"\n",
"1. The loss is lower than without batching\n",
"2. There are more steps, even though we iterated over our dataset once instead of 10 times\n",
"3. The loss doesn't always go down\n",
"\n",
"The reason the loss is lower is because we're able to take more steps even though we only see each data point once. That's because we update at each batch, giving more updates per iteration over the dataset. Specifically if $B$ is batch size, there are $N / B$ updates for every 1 update in the original gradient descent. The reason the loss doesn't always go down is that each time we evaluate it, it's on a different set of data. Some molecules are harder to predict than others. Also, each step we take in minimizing loss may not be correct because we only updated our parameters based on one batch. Assuming our batches are mixed though, we will always improve in expectation (on average). "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Standardize features\n",
"\n",
"It seems we cannot get past a certain loss. If you examine the gradients you'll see some of them are very large and some are very small. Each of the features have different magnitudes. For example, molecular weights are large numbers. The number of rings in a molecule is a small number. Each of these must use the same learning rate, $\\eta$, and that is ok for some but too small for others. If we increase $\\eta$, our training procedure will explode because of these larger feature gradients. A standard trick we can do is make all the features have the same magnitude, using the equation for standardization you might see in your statistics textbook:\n",
"\n",
"\\begin{equation}\n",
" x_{ij} = \\frac{x_{ij} - \\bar{x_j}}{\\sigma_{x_j}}\n",
"\\end{equation}\n",
"\n",
"where $\\bar{x_j}$ is column mean and $\\sigma_{x_j}$ is column standard deviation. To be careful about contaminating training data with test data -- leaking information between train and test data -- we should only use training data in computing the mean and standard deviation. We want our test data to approximate how we'll use our model on unseen data, so we cannot know what these unseen features means/standard deviations might be and thus cannot use them at training time for standardization. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fstd = np.std(features, axis=0)\n",
"fmean = np.mean(features, axis=0)\n",
"std_features = (features - fmean) / fstd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# initialize our paramaters\n",
"# since we're changing the features\n",
"w = np.random.normal(scale=0.1, size=feature_dim)\n",
"b = 0.0\n",
"\n",
"\n",
"loss_progress = []\n",
"eta = 1e-2\n",
"batch_size = 32\n",
"N = len(labels) # number of data points\n",
"data = (std_features, labels)\n",
"# compute how much data fits nicely into a batch\n",
"# and drop extra data\n",
"new_N = len(labels) // batch_size * batch_size\n",
"num_epochs = 3\n",
"\n",
"# the -1 means that numpy will compute\n",
"# what that dimension should be\n",
"batched_features = std_features[:new_N].reshape((-1, batch_size, feature_dim))\n",
"batched_labels = labels[:new_N].reshape((-1, batch_size))\n",
"indices = np.arange(new_N // batch_size)\n",
"\n",
"# iterate through the dataset 3 times\n",
"for epoch in range(num_epochs):\n",
" # to make it random, we'll iterate over the batches randomly\n",
" np.random.shuffle(indices)\n",
" for i in indices:\n",
" # choose a random set of\n",
" # indices to slice our data\n",
" grad = loss_grad(w, b, (batched_features[i], batched_labels[i]))\n",
" w -= eta * grad[0]\n",
" b -= eta * grad[1]\n",
" # we still compute loss on whole dataset, but not every step\n",
" if i % 50 == 0:\n",
" loss_progress.append(loss_wrapper(w, b, data))\n",
"\n",
"plt.plot(np.arange(len(loss_progress)) * 50, loss_progress)\n",
"plt.xlabel(\"Step\")\n",
"plt.yscale(\"log\")\n",
"plt.ylabel(\"Loss\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice we safely increased our learning rate to 0.01, which is possible because all the features are of similar magnitude. We also could keep training, since we're gaining improvements. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Analyzing Model Performance\n",
"\n",
"This is a large topic that we'll explore more, but the first thing we typically examine in supervised learning is a **parity plot**, which shows our predictions vs. our label prediction. What's nice about this plot is that it works no matter what the dimensions of the features are. A perfect fit would fall onto the line at $y = \\hat{y}$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predicted_labels = linear_model(std_features, w, b)\n",
"\n",
"plt.plot([-100, 100], [-100, 100])\n",
"plt.scatter(labels, predicted_labels, s=4, alpha=0.7)\n",
"plt.xlabel(\"Measured Solubility $y$\")\n",
"plt.ylabel(\"Predicted Solubility $\\hat{y}$\")\n",
"plt.xlim(-13.5, 2)\n",
"plt.ylim(-13.5, 2)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Final model assessment can be done with loss, but typically other metrics are also used. In regression, a **correlation coefficient** is typically reported in addition to loss. In our example, this is computed as"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# slice correlation between predict/labels\n",
"# from correlation matrix\n",
"np.corrcoef(labels, predicted_labels)[0, 1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"# THIS CELL IS USED TO GENERATE A FIGURE\n",
"# AND NOT RELATED TO CHAPTER\n",
"# YOU CAN SKIP IT\n",
"from myst_nb import glue\n",
"\n",
"glue(\"corr\", np.round(np.corrcoef(labels, predicted_labels)[0, 1], 2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A correlation coefficient of {glue:}`corr` is OK, but not great."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Unsupervised Learning\n",
"\n",
"In unsupervised learning, the goal is to predict $\\hat{y}$ *without* labels. This seems like an impossible task. How do we judge success? Typically, unsupervised learning can be broken into three categories:\n",
"\n",
"**Clustering**\n",
"\n",
" Here we assume $\\{y_i\\}$ is a class variable and try to partition our features into these classes. In clustering we are simultaneously learning the definition of the classes (called clusters) and which cluster each feature should be assigned to.\n",
"\n",
"```{margin} Class\n",
"In machine learning, a class is a type of label like ``dog`` or ``cat``. Formally,\n",
"we have a set of possible labels (e.g., all animals) and each feature vector has one (hard) or a \n",
"probability distribution of classes (soft).\n",
"```\n",
"\n",
"**Finding Signal**\n",
"\n",
" $x$ is assumed to be made of two components: noise and signal ($y$). We try to separate the signal $y$ from $x$ and discard noise. Highly-related with **representation learning**, which we'll see later.\n",
"\n",
"\n",
"**Generative**\n",
"\n",
" Generative methods are methods where we try to learn $P(\\vec{x})$ so that we can sample new values of $\\vec{x}$. It is analogous to $y$ being probability and we're trying to estimate it. We'll see these more later.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Clustering\n",
"\n",
"Clustering is historically one of the most well-known and still popular machine learning methods. It's always popular because it can provide new insight from data. Clustering gives class labels where none existed and thus can help find patterns in data. This is also a reason that it has become less popular in chemistry (and most fields): there is no right or wrong answer. Two people doing clustering independently will often arrive at different answers. Nevertheless, it should be a tool you know and can be a good exploration strategy.\n",
"\n",
"```{margin} cluster labels\n",
"Clustering comes in many variants and some blur what exactly $y_i$ is. For example, in some clustering methods $y_i$ can include no assignment or $y_i$ is not a single class, but a tree of classes.\n",
"```\n",
"\n",
"We'll look at the classic clustering method: k-means. Wikipedia has a [great article](https://en.wikipedia.org/wiki/K-means_clustering) on this classic algorithm, so I won't try to repeat that. To make our clustering actually visible, we'll start by projecting our features into 2 dimensions. This will be covered in representation learning, so don't worry about these steps."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get down to 2 dimensions for easy visuals\n",
"embedding = sklearn.manifold.Isomap(n_components=2)\n",
"# only fit to every 25th point to make it fast\n",
"embedding.fit(std_features[::25, :])\n",
"reduced_features = embedding.transform(std_features)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're going to zoom into the middle 99th percentile of the data since some of the points are extremely far away (though that is interesting!). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"xlow, xhi = np.quantile(reduced_features, [0.005, 0.995], axis=0)\n",
"\n",
"plt.figure(dpi=300)\n",
"plt.scatter(\n",
" reduced_features[:, 0],\n",
" reduced_features[:, 1],\n",
" s=4,\n",
" alpha=0.7,\n",
" c=labels,\n",
" edgecolors=\"none\",\n",
")\n",
"plt.xlim(xlow[0], xhi[0])\n",
"plt.ylim(xlow[1], xhi[1])\n",
"cb = plt.colorbar()\n",
"cb.set_label(\"Solubility\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{margin} Dimensionality Reduction\n",
"Reducing $\\vec{x}$, your feature vectors to a low\n",
"dimensional space. The classic example is PCA, which is a \n",
"linear operator. However, most prefer nonlinear methods \n",
"like [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html).\n",
"```\n",
"\n",
"\n",
"\n",
"The dimensionality reduction has made our features only 2 dimensions. We can see some structure, especially with the solubility as the coloring. Note in these kind of plots, where we have reduced dimensions in someway, we do not label the axes because they are arbitrary.\n",
"\n",
"Now we cluster. The main challenge in clustering is deciding how many clusters there should be. There are a number of methods out there, but they basically come down to intuition. You, as the chemist, should use some knowledge outside of the data to intuit what is the cluster number. Sounds unscientific? Yeah, that's why clustering is hard."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"remove-output"
]
},
"outputs": [],
"source": [
"# cluster - using whole features\n",
"kmeans = sklearn.cluster.KMeans(n_clusters=4, random_state=0)\n",
"kmeans.fit(std_features)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Very simple procedure! Now we'll visualize by coloring our data by the class assigned. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(dpi=300)\n",
"point_colors = [f\"C{i}\" for i in kmeans.labels_]\n",
"plt.scatter(\n",
" reduced_features[:, 0],\n",
" reduced_features[:, 1],\n",
" s=4,\n",
" alpha=0.7,\n",
" c=point_colors,\n",
" edgecolors=\"none\",\n",
")\n",
"# make legend\n",
"legend_elements = [\n",
" plt.matplotlib.patches.Patch(\n",
" facecolor=f\"C{i}\", edgecolor=\"none\", label=f\"Class {i}\"\n",
" )\n",
" for i in range(4)\n",
"]\n",
"plt.legend(handles=legend_elements)\n",
"plt.xlim(xlow[0], xhi[0])\n",
"plt.ylim(xlow[1], xhi[1])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Choosing Cluster Number\n",
"\n",
"How do we know we had the correct number? Intuition. There is one tool we can use to help us, called an **elbow plot**. The k-means clusters can be used to compute the mean squared distance from cluster center, basically a version of loss function. However, if we treat cluster number as a trainable parameter we'd find the best fit at the cluster number being equal to number of data points. Not helpful! However, we can see when the slope of this loss becomes approximately constant and assume that those extra clusters are adding no new insight. Let's plot the loss and see what happens. Note we'll be using a subsample of the dataset to save time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# make an elbow plot\n",
"loss = []\n",
"cn = range(2, 15)\n",
"for i in cn:\n",
" kmeans = sklearn.cluster.KMeans(n_clusters=i, random_state=0)\n",
" # use every 50th point\n",
" kmeans.fit(std_features[::50])\n",
" # we get score -> opposite of loss\n",
" # so take -\n",
" loss.append(-kmeans.score(std_features[::50]))\n",
"\n",
"plt.plot(cn, loss, \"o-\")\n",
"plt.xlabel(\"Cluster Number\")\n",
"plt.ylabel(\"Loss\")\n",
"plt.title(\"Elbow Plot\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Where is the transition? If I squint, maybe at 6? 3? 4? 7? Let's choose 4 because it sounds nice and is plausible based on the data. The last task is to get some insight into what the clusters actually are. We can extract the most centered data points (closest to cluster center) and consider them to be representative of the cluster. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"alt": "Grid of rendered molecular structures that are representative cluster centers"
},
"outputs": [],
"source": [
"# cluster - using whole features\n",
"kmeans = sklearn.cluster.KMeans(n_clusters=4, random_state=0)\n",
"kmeans.fit(std_features)\n",
"\n",
"cluster_center_idx = []\n",
"for c in kmeans.cluster_centers_:\n",
" # find point closest\n",
" i = np.argmin(np.sum((std_features - c) ** 2, axis=1))\n",
" cluster_center_idx.append(i)\n",
"cluster_centers = soldata.iloc[cluster_center_idx, :]\n",
"\n",
"legend_text = [f\"Class {i}\" for i in range(4)]\n",
"\n",
"# now plot them on a grid\n",
"cluster_mols = [rdkit.Chem.MolFromInchi(inchi) for inchi in cluster_centers.InChI]\n",
"rdkit.Chem.Draw.MolsToGridImage(\n",
" cluster_mols, molsPerRow=2, subImgSize=(400, 400), legends=legend_text\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So what exactly are these classes? Unclear. We intentionally did not reveal solubility (unsupervised learning) so there is not necessarily any connection with solubility. These classes are more a result of which features were chosen for the dataset. You could make a hypothesis, like class 1 is all negatively charged or class 0 is aliphatic, and investigate. Ultimately though there is no *best* clustering and often unsupervised learning is more about finding insight or patterns and not about producing a highly-accurate model.\n",
"\n",
"The elbow plot method is one of many approaches to selecting cluster number {cite}`pham2005selection`. I prefer it because it's quite clear that you are using intuition. More sophisticated methods sort-of conceal the fact that there is no right or wrong answer in clustering. \n",
"\n",
"\n",
"\n",
"```{note}\n",
"This process does not result in a function that predicts solubility. We might try to gain insight about predicting solubility with our predicted classes, but that is not the goal of clustering.\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Chapter Summary\n",
"\n",
"* Supervised machine learning is building models that can predict labels $y$ from input features $\\vec{x}$.\n",
"* Data can be labeled or unlabeled. \n",
"* Models are trained by minimizing loss with stochastic gradient descent.\n",
"* Unsupervised learning is building models that can find patterns in data.\n",
"* Clustering is unsupervised learning where the model groups the data points into clusters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercises\n",
"\n",
"### Data\n",
"\n",
"1. Using `numpy` reductions `np.amin`, `np.std`, etc. (not `pandas`!), compute the mean, min, max, and standard deviation for each feature across all data points. \n",
"\n",
"2. Use rdkit to draw the 2 highest molecular weight molecules. Note they look strange. \n",
"\n",
"### Linear Models\n",
"\n",
"1. Prove that a nonlinear model like $y = \\vec{w_1} \\cdot \\sin\\left(\\vec{x}\\right) + \\vec{w_2} \\cdot \\vec{x} + b$ could be represented as a linear model.\n",
"\n",
"2. Write out the linear model equation in Einstein notation in batched form. **Batched form** means we explicitly have an index indicating batch. For example, the labels will be $y_{bi}$ where $b$ indicates the batch of data and $i$ indicates the individual data point.\n",
"\n",
"\n",
"### Minimizing Loss\n",
"\n",
"1. We standardized the features, but not the labels. Would standardizing the labels affect our choice of learning rate? Prove your answer.\n",
"\n",
"2. Implement a loss that is mean absolute error, instead of mean squared error. Compute its gradient using `jax`.\n",
"\n",
"2. Using the standardized features, show what effect batch size has on training. Use batch sizes of 1, 8, 32, 256, 1024. Make sure you re-initialize your weights in between each run. Plot the log-loss for each batch size on the same plot. Describe your results.\n",
"\n",
"### Clustering\n",
"\n",
"1. We say that clustering is a type of unsupervised learning and that it predicts the labels. What exactly are the predicted labels in clustering? Write down what the predicted labels might look like for a few data points.\n",
"\n",
"2. In clustering, we predict labels from features. You can still cluster if you have labels, by just pretending they are features. Give two reasons why it would not be a good idea to do clustering in this manner, where we treat the labels as features and try to predict new labels that represent class.\n",
"\n",
"3. On the isomap plot (reduced dimension plot), color the points by which group they fall in (G1, G2, etc.). Is there any relationship between this and the clustering?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cited References\n",
"\n",
"```{bibliography}\n",
":style: unsrtalpha\n",
":filter: docname in docnames\n",
"```"
]
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}