This is the second part of my series of machine learning blog posts (see part one here), looking at some of the foundational techniques used to build products like Cerella, Auto-Modeller, or Inspyra. In this post, I’m going to look at how we can build a simple neural network from scratch.  

What are neural networks?

Neural networks (NNs) in various forms are very common nowadays, and specific architectures are used for text generation, image recognition, text to speech, and many other applications. We’re going to look at one of the simplest and earliest types of neural network, but one which is still very relevant and in common usage today: the multi-layer perceptron (MLP). 

Figure 1: MLP architecture. Inputs come from the left, pass through the input layer, the hidden layer and finally the output layer. The output layer returns an estimate of the property we’re modelling. This is what we’ll build in this blog, except with way more inputs. 

MLPs use stacked layers of neurons, where the output of each neuron in any layer is used as input to all neurons in the next layer. Each neuron has a set of weights and a bias, and an activation function. The neuron multiplies all the inputs by their respective weights, adds the bias, and then passes the result through the activation function to generate an output. The process of training a neural network is essentially just gradually adjusting those weights and biases until the final output matches the correct answers. 
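To make that concrete, here’s a toy sketch (purely illustrative, not part of the model we’ll build below) of what a single neuron computes:

```python
import numpy as np

def neuron(inputs, weights, bias, activation):
    # Multiply each input by its weight, sum them, add the bias,
    # then pass the result through the activation function.
    return activation(np.dot(inputs, weights) + bias)

# Example with three inputs and a ReLU activation (an activation function we'll meet shortly)
relu = lambda x: np.maximum(0.0, x)
output = neuron(np.array([0.5, -1.2, 2.0]),
                np.array([0.1, 0.4, -0.3]),
                bias=0.2,
                activation=relu)
```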

There are various libraries and other tools you can use for building NNs (you can even do it in Excel, if you hate yourself), but the most popular at the moment is PyTorch (https://pytorch.org/), which I’ll use for this demo. As with the previous blog post, I’m going to skip over some boring stuff (setting up Python, installing packages, etc.; there are plenty of other resources to learn these from), and I’m going to re-use the same dataset from last time (the Caco-2 dataset from Wang et al. 2016).

Step 1: Data setup

The first step, setting up the data, calculating descriptors, and splitting it into train and test sets, is the same as last time.

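Something like the following will do the job; the file name, column names, and exact descriptor set here are placeholders rather than the exact code from part one:

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.model_selection import train_test_split

# Load the Caco-2 dataset (Wang et al. 2016); file and column names are placeholders
data = pd.read_csv("caco2.csv")

# Calculate RDKit descriptors for each compound
def calc_descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return {name: fn(mol) for name, fn in Descriptors.descList}

X = pd.DataFrame([calc_descriptors(s) for s in data["SMILES"]])
y = data["logPapp"]

# Hold out a test set for evaluating the final model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
```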

Step 2: Input data scaling

We have to do one extra step here: scaling the input data. Last time we trained a random forest, and one of the advantages of random forests and other tree-based methods is that they don’t care about the scale or distribution of the data. Neural networks, on the other hand, are much more sensitive to this, and we’ll get much better performance if our input data is scaled to a mean of zero and a standard deviation of one. We can do this with scikit-learn’s scalers.

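A sketch of how that looks with scikit-learn’s StandardScaler; the scaler is fitted on the training set only and then applied to both sets. Converting the scaled arrays to PyTorch tensors here is my addition, so they’re ready for the network later:

```python
import torch
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert to float32 tensors ready for PyTorch
X_train_t = torch.tensor(X_train_scaled, dtype=torch.float32)
X_test_t = torch.tensor(X_test_scaled, dtype=torch.float32)
y_train_t = torch.tensor(y_train.values, dtype=torch.float32).reshape(-1, 1)
y_test_t = torch.tensor(y_test.values, dtype=torch.float32).reshape(-1, 1)
```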

Step 3: Build your neural network

Now we’re on to the good stuff! It’s time to actually build our neural network. Because we’re making a simple feed-forward network with no fancy architecture, we can use PyTorch’s Sequential class to define our model, letting us add the layers and activations in order.

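A sketch of the model definition, reusing the tensors from the previous snippet; the layer names match the description below, and the ten neurons in the hidden layer are an assumption:

```python
from collections import OrderedDict
import torch.nn as nn

n_descriptors = X_train_t.shape[1]  # one input per descriptor

model = nn.Sequential(OrderedDict([
    ("lin1", nn.Linear(n_descriptors, 10)),  # input layer: ten neurons
    ("relu1", nn.ReLU()),
    ("lin2", nn.Linear(10, 10)),             # hidden layer (size is an assumption)
    ("relu2", nn.ReLU()),
    ("lin3", nn.Linear(10, 1)),              # output layer: one neuron, no activation
]))
```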

The first layer (lin1) is our input layer. This has ten neurons, each of which gets all our descriptors as input. We then add an activation function, using the rectified linear unit (ReLU). The activation function is crucial, because it introduces a non-linearity to the network, allowing it to fit non-linear functions. Without this, our MLP would just be a convoluted linear regressor. There are various options for activation functions, of which ReLU is one of the simplest and most common. Applied to the output x of each neuron in the preceding layer, it returns zero if x < 0 and x otherwise, i.e. ReLU(x) = max(0, x). Here’s a picture of it. It’s quite boring.

Figure 2: Rectified linear activation unit. Quite boring. 

We repeat the same steps for our hidden layer, lin2. This takes the output from the input layer as input, and has the same activation function. The number of hidden layers is arbitrary, and is a hyperparameter that we can optimise for our specific problem. Generally the more complex the problem and the larger the amount of available data, the more neurons per layer and the more layers you’ll need. In this case, because a) this is a simple problem, with limited data, and b) I’m running this on a laptop and don’t want to wait all week, I’m using a small and simple network.  

Finally, we have our output layer. It takes the hidden layer’s output as input and, in this case, has only a single neuron. This is because we only want a single value (ideally the Caco-2 value of the corresponding compound), and we get one output per neuron. For more complex models predicting multiple properties, you will need more neurons in the output layer. The other difference in our output layer is that I’m not applying an activation function: this is because I don’t want to enforce limits on what numbers the model can predict. If I used ReLU, for example, the model would be unable to predict negative numbers. That said, an output activation isn’t always a bad idea: if I wanted the model to return a probability, for example, I could use a sigmoid activation to ensure that it returns a value between 0 and 1.

Step 4: Training your neural network

The next step is to train our neural network. The weights and biases are initialised randomly, so our fancy machine learning model is just a random number generator at this point. There are a couple more things we have to define before we can start training: a loss function, which gives the model output a score so that it knows how well it’s doing and how to adjust its predictions, and an optimiser, which chooses how much to adjust each of the network parameters during training.

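A sketch of that setup, matching the choices described below; the learning rate is an assumption:

```python
import torch
import torch.nn as nn

loss_fn = nn.MSELoss()                                     # mean-squared-error loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # learning rate is an assumption
```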

I’m using a mean-squared-error (MSE) loss and the Adam optimizer (https://arxiv.org/pdf/1412.6980). These are common choices, and are likely to give good results with little tuning.

To train the network, we can use a simple Python for-loop. At each step (or epoch), we predict the y values, calculate the loss by comparing them to the true values, and propagate that loss backwards through the model to work out which parameters to adjust, and in which direction. We then call the optimizer to adjust the parameters, and finally clear the accumulated gradients ready for the next epoch.

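A sketch of that loop, reusing the tensors, model, loss function and optimizer from the snippets above; the number of epochs and print frequency are assumptions:

```python
n_epochs = 1000  # assumption: tune for your own data

for epoch in range(n_epochs):
    y_pred = model(X_train_t)           # predict the y values
    loss = loss_fn(y_pred, y_train_t)   # compare to the true values
    loss.backward()                     # propagate the loss backwards through the model
    optimizer.step()                    # adjust the parameters
    optimizer.zero_grad()               # clear the gradients ready for the next epoch
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: train loss = {loss.item():.4f}")
```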

I’ve printed the training loss here, and you can see how it goes down as training progresses (this is good). As with our random forest model, we need to evaluate the model on the test set to see how it’s likely to perform in the real world:

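A sketch of that evaluation, using scikit-learn’s r2_score so the number is directly comparable with the random forest from part one:

```python
from sklearn.metrics import r2_score

model.eval()
with torch.no_grad():                   # no gradients needed for evaluation
    y_test_pred = model(X_test_t)

print("Test R-squared:", r2_score(y_test_t.numpy(), y_test_pred.numpy()))
```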

Interestingly, this is almost exactly the same as the R-squared of 0.7 that we got for the random forest model. Not exactly surprising, since they’re fitting the same data, but it highlights how competitive the models are. Despite being mechanically very different, they’ve arrived at very similar solutions. In reality, the boring task of obtaining, cleaning and curating the best possible training data is significantly more important to our final model performance than the much more interesting job of building and optimising fancy machine learning models. 

This also demonstrates another important point: neural networks aren’t magic. A lot of the time, they’re no better than any other ML algorithm, and sometimes they’re worse. The key advantage of neural networks is scaling: we can add more neurons and more layers, and train them on ever larger datasets, and the performance keeps improving. This means that they can handle more complex problems than other models, such as image recognition or text generation, provided you can get hold of enough data.

How Optibrium uses neural networks

About the author

Michael Parker, PhD

Michael is a Principal AI Scientist at Optibrium, applying advanced AI techniques to accelerate drug discovery and improve decision-making. With a Ph.D. in Astronomy and Astrophysics from the University of Cambridge, he brings a data-driven approach to solving complex scientific challenges. Michael is also a thought leader, contributing to discussions on the impact of AI in pharmaceutical research.

LinkedIn

Dr Michael Parker, Principal AI Scientist, Optibrium
