The optimization of machine learning algorithms is one of the most important research topics in the field. Here we present a brief introduction to Bayesian Optimization (BO), a search algorithm designed to minimize black-box functions.

Introduction

The goal of an optimization problem is to find the best solution among all potential solutions. In ML, the optimization problem is associated with the search for the values of the parameters of a model that best describe the problem, e.g. by minimizing a loss function. Synthetic chemistry also has optimization problems, for example varying the reaction conditions to increase the percent yield.

The optimization of a function can also be framed as a supervised learning problem, but instead of finding the best global representation of a function $f(x)$, the goal is to find the point $x^\star$ where $f(x)$ is minimum,

$$x^\star = \arg\min_{x} f(x).$$

The most common optimization algorithm for continuous functions is gradient descent (GD). The GD algorithm minimizes a function iteratively by displacing the current point in the direction of the negative gradient of the function,

$$x_{t+1} = x_t - \eta\,\nabla f(x_t),$$

where the parameter $\eta$ is known as the learning rate. $\eta$ is also tied to the trade-off between exploitation and exploration and plays a key role in the convergence of the algorithm.
For example, when $\eta$ is small GD exploits the neighborhood of the current point, whereas a large $\eta$ is related to exploration. GD is also one of the first optimization algorithms used to train NNs, through the backpropagation algorithm.
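As a concrete illustration, below is a minimal sketch of the GD update rule in Python; the quadratic objective and the step count are hypothetical choices for this example, not part of the original post.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, n_steps=50):
    """Minimize a function by following the negative gradient.

    grad_f  : callable returning the gradient of f at a point
    x0      : starting point
    eta     : learning rate (controls exploitation vs. exploration)
    n_steps : number of iterations
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eta * grad_f(x)  # x_{t+1} = x_t - eta * grad f(x_t)
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3.0), x0=[0.0])
print(x_min)  # approaches 3.0
```

A small $\eta$ takes conservative (exploitative) steps near the current point, while a larger $\eta$ moves more aggressively across the space.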

GD has been a widely successful optimization algorithm. However, not every function can be optimized using GD. For example, there is no analytic function that describes the relation between the percent yield of a chemical reaction and its experimental conditions, so one cannot use GD to increase the yield. Many other problems are described by non-analytic or black-box functions, which can only be evaluated point-wise. BO is designed to tackle the optimization of black-box functions where gradients are not available.
For obvious reasons, trying to find the minimum of $f(x)$ by random sampling is not the smartest strategy, since it may take a large number of evaluations of $f(x)$ before finding the minimum. BO instead infers the location of the minimum of a black-box function through a smarter iterative sampling scheme. In GD, the gradient tells us where to sample the next point in order to get closer to the minimum; since black-box functions do not provide a gradient, we need a metric that quantifies the information gained by sampling each point of the space.

The core of BO relies on two components,

  1. $\hat{f}(x)$ –> a surrogate model that mimics the black-box function.
  2. $\alpha(x)$ –> an acquisition function that quantifies the information gain for a given point.

To mimic the unknown function $f(x)$ we can use any supervised learning algorithm, such as NNs. However, if $\hat{f}(x)$ is not capable of learning at every iteration, we may waste some of the evaluations because of the lack of robustness of the model. GP models are a great candidate for $\hat{f}(x)$ due to their accuracy and robustness when interpolating any continuous function. Additionally, the ability of GP models to quantify the prediction's uncertainty $\sigma(x)$ without the need for extra data is what makes them the strongest candidate for BO. In a previous post we introduced Gaussian Processes, a probabilistic regression model capable of learning complex functions.
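As a minimal sketch of the surrogate step (assuming scikit-learn as the GP library, which the post does not prescribe, and a hypothetical black-box function for illustration), the GP provides both the mean $\mu(x)$ and the uncertainty $\sigma(x)$ used later by the acquisition function:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical black-box function, used here only for illustration.
def black_box(x):
    return np.sin(3 * x) + 0.5 * x**2

# A few initial point-wise evaluations.
X_train = np.array([[-2.0], [0.0], [1.5]])
y_train = black_box(X_train).ravel()

# GP surrogate: mu(x) approximates f(x), sigma(x) quantifies the uncertainty.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
gp.fit(X_train, y_train)

X_query = np.linspace(-3, 3, 200).reshape(-1, 1)
mu, sigma = gp.predict(X_query, return_std=True)
```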

The following figure illustrates how BO works to find the minimum of $f(x)$ without using gradients. The maximum of the acquisition function is the query point where the black-box function is evaluated next, $x_{n+1} = \arg\max_x \alpha(x)$, and at each iteration we add the new point $(x_{n+1}, y_{n+1})$ to the training data and retrain the GP model.

[Figure: BO iterations showing the GP surrogate and the acquisition function.]

Pseudocode of BO,

[Figure: pseudocode of the BO loop.]
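In lieu of the pseudocode figure, here is a hedged Python sketch of the BO loop under the assumptions above: a scikit-learn GP surrogate and a generic `acquisition(mu, sigma, y_best)` callable standing in for any of the choices discussed below.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def bayesian_optimization(black_box, X_init, y_init, X_candidates,
                          acquisition, n_iter=20):
    """Generic BO loop: fit the surrogate, maximize the acquisition,
    evaluate the black box, and repeat."""
    X, y = X_init.copy(), y_init.copy()
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True)
        gp.fit(X, y)                                  # retrain the GP surrogate
        mu, sigma = gp.predict(X_candidates, return_std=True)
        scores = acquisition(mu, sigma, y.min())      # information gain per candidate
        x_next = X_candidates[np.argmax(scores)]      # query point x_{n+1}
        y_next = black_box(x_next)                    # point-wise evaluation
        X = np.vstack([X, x_next])                    # add (x_{n+1}, y_{n+1})
        y = np.append(y, y_next)
    best = np.argmin(y)
    return X[best], y[best]
```

Here the candidate set is searched by enumeration for simplicity; in practice the acquisition function is maximized with its own inner optimizer.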

Acquisition function

In this section we explain different acquisition functions that are used in BO. BO is an optimization algorithm designed for problems where gradients are not available. As mentioned above, the acquisition function is designed to represent which point in the space has the most information. By iteratively evaluating the black-box function where the acquisition function is maximum, we learn a more certain representation of $f(x)$ and of where its minimum could be. There are many different acquisition functions; here we cover the three most used,

  1. Probability of improvement (PI)
  2. Expected Improvement (EI)
  3. Upper confidence bound (UCB)

Probability of improvement (PI)

In 1964 H. Kushner proposed maximizing the probability of improvement as an acquisition function, i.e., the probability that $f(x) \le f(x^\star)$. Kushner showed that if $f(x)$ is Gaussian distributed, $\mathrm{PI}(x)$ can be written as,

$$\mathrm{PI}(x) = P\big(f(x) \le f(x^\star)\big) = \Phi\!\left(\frac{f(x^\star) - \mu(x)}{\sigma(x)}\right),$$

where $\Phi$ is the normal cumulative distribution function, $\mu(x)$ and $\sigma(x)$ are the predicted mean and standard deviation of a GP model trained with the data set $\mathcal{D}_n$, and $f(x^\star)$ is the target value. Since the goal of BO is to find $x^\star$, we can approximate $f(x^\star)$ with the best known value in the set $\mathcal{D}_n$, for example $y_{\mathrm{best}} = \min_i y_i$. If a new evaluation improves on the current value of $y_{\mathrm{best}}$, we update $y_{\mathrm{best}}$. PI is known as a greedy acquisition function; however, if we relax the value of $y_{\mathrm{best}}$ by adding a constant $\xi$, we can make exploratory moves. In the following figure we illustrate how the maximum of $\mathrm{PI}(x)$ changes for different values of $\xi$.

[Figure: the maximum of PI(x) for different values of the exploration constant.]
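A minimal sketch of PI for a minimization problem, following the common convention in which the exploration constant `xi` relaxes the improvement threshold (the GP mean and standard deviation come from the surrogate above):

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, y_best, xi=0.0):
    """PI(x) = Phi((y_best - mu(x) - xi) / sigma(x)) for minimization."""
    sigma = np.maximum(sigma, 1e-12)   # guard against division by zero
    z = (y_best - mu - xi) / sigma
    return norm.cdf(z)
```

This function can be plugged in directly as the `acquisition` argument of the BO loop sketched earlier.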

Expected Improvement (EI)

The expected improvement is one of the best known acquisition functions. The improvement is defined as the difference between the predicted point and the best known point ($y_{\mathrm{best}}$),

$$I(x) = \max\big(0,\; y_{\mathrm{best}} - f(x)\big),$$

and the expected improvement is defined as,

$$\mathrm{EI}(x) = \mathbb{E}\big[I(x)\big] = \int I \, P(I)\,\mathrm{d}I,$$

where $P(I)$ is the probability distribution over the improvement. If we consider that $f(x)$ is Gaussian distributed with mean $\mu(x)$, the expectation integral has a closed-form expression,

$$\mathrm{EI}(x) = \big(y_{\mathrm{best}} - \mu(x)\big)\,\Phi(z) + \sigma(x)\,\phi(z),$$

where $z = \dfrac{y_{\mathrm{best}} - \mu(x)}{\sigma(x)}$. $\mu(x)$ and $\sigma(x)$ are the mean and standard deviation of a GP, $\Phi$ is the normal cumulative distribution, and $\phi$ is the normal probability density.

[Figure: the EI acquisition function computed from the GP surrogate.]
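Following the closed-form expression above, a minimal sketch of EI for minimization (again assuming the GP mean and standard deviation from the surrogate, and usable as the `acquisition` argument of the earlier BO loop):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """EI(x) = (y_best - mu) * Phi(z) + sigma * phi(z), z = (y_best - mu) / sigma."""
    sigma = np.maximum(sigma, 1e-12)   # guard against division by zero
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```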

Upper confidence bound (UCB)