Pricing and hedging derivative securities with neural networks: Bayesian regularization, early stopping, and bagging


Hutchinson et al. provide evidence that NNs may be more accurate and computationally more efficient than parametric approaches when the assumptions of the Black-Scholes option-pricing formula are violated. However, like other nonparametric methods, NNs are subject to overfitting if the data contain irrelevant information or a substantial amount of noise. Recently, Garcia and Gencay demonstrated that an NN with a homogeneity hint can produce a smaller out-of-sample pricing error and a more robust average delta-hedging error than networks without the hint. Here, we provide evidence that Bayesian regularization, early stopping, and bagging are alternative methods that work effectively to prevent overfitting and to improve prediction accuracy.

Neural Networks (NNs)

Let $f$ be the unknown underlying function (linear or nonlinear) through which the vector of explanatory variables $x$ relates to the dependent variable $y$, i.e., $y = f(x) + \varepsilon$. In our model, $y$ is $C/K$ (call price divided by strike price), $x_1$ is $S/K$ (S&P 500 Index divided by strike price), and $x_2$ is the time to maturity. Then $f$ can be approximated by a three-layer NN model

$$y = \beta_0 + \sum_{j=1}^{n} \beta_j \, \psi(\gamma_j' x) + \varepsilon,$$

a typical three-layer feedforward NN where $n$ is the number of units in the hidden layer, which varies from 1 to 10 in our study, $\psi$ is a logistic transfer function defined as $\psi(u) = 1/(1+e^{-u})$, $\beta = (\beta_1, \ldots, \beta_n)'$ represents a vector of parameters from the hidden-layer units to the output-layer unit, $\gamma = (\gamma_1, \ldots, \gamma_n)$ denotes a matrix of parameters from the input-layer units to the hidden-layer units, and $\varepsilon$ is the error term. The error term can be made arbitrarily small if sufficiently many explanatory variables are included and if $n$ is chosen to be large enough. However, if $n$ is too large, the NN may overfit, in which case the in-sample errors can be made very small but the out-of-sample errors may be large. The choice of $n$ depends on the number of explanatory variables and the nature of the underlying relationship. In the present study, we use a cross-validation procedure to select $n$. In particular, we estimate 10 NN models with the number of hidden-layer units varying from 1 to 10 using the training data, and the one that performs best on the validation data is then used to generate out-of-sample prediction results on the testing data.
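The hidden-unit selection procedure described above can be sketched as follows. This is a minimal illustration using scikit-learn's `MLPRegressor` as a stand-in for the paper's Levenberg-Marquardt-trained network, with synthetic data in place of the S&P 500 options sample; the payoff proxy and all parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-ins for (S/K, time to maturity) -> C/K.
rng = np.random.default_rng(0)
X = rng.uniform([0.8, 0.05], [1.2, 1.0], size=(300, 2))
y = np.maximum(X[:, 0] - np.exp(-0.05 * X[:, 1]), 0.0)  # crude call-price proxy

# Split into training and validation subsets (the test set is held out).
X_tr, y_tr = X[:200], y[:200]
X_va, y_va = X[200:], y[200:]

# Fit NNs with 1..10 logistic hidden units; keep the one with the
# smallest validation mean-square error.
best_n, best_err = None, np.inf
for n in range(1, 11):
    net = MLPRegressor(hidden_layer_sizes=(n,), activation="logistic",
                       solver="lbfgs", max_iter=2000, random_state=0)
    net.fit(X_tr, y_tr)
    err = np.mean((net.predict(X_va) - y_va) ** 2)  # validation MSE
    if err < best_err:
        best_n, best_err = n, err
```

The selected `best_n` network would then be refit (or reused) to produce the out-of-sample predictions on the test set.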

The parameters are estimated by minimizing the sum of squared in-sample errors. In general there is no analytical solution to this minimization problem, and the parameters have to be estimated numerically. Because the Levenberg-Marquardt algorithm is by far the fastest algorithm for moderate-sized feedforward NNs (up to several hundred free parameters), we use it to estimate the parameters. The initial values of the parameters are generated with a method in which they are assigned such that the active regions of the hidden-layer units are roughly evenly distributed over the range of the explanatory-variable space. The benefit is that fewer units are wasted and that the network converges faster than with purely random initial parameter values.
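The initialization idea can be sketched roughly as follows. This is a simplified scheme in the spirit of Nguyen-Widrow initialization; the function name and the 0.7 scaling heuristic are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def spread_init(n_hidden, n_in, lo=-1.0, hi=1.0, rng=None):
    """Initialize hidden-layer weights so that the logistic units' active
    (steep) regions are spread roughly evenly over [lo, hi] in the input
    space, rather than clustering at random."""
    rng = np.random.default_rng() if rng is None else rng
    # Random directions, normalized to unit length per hidden unit.
    w = rng.uniform(-1.0, 1.0, size=(n_hidden, n_in))
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    # Magnitude heuristic: grow with the number of hidden units.
    scale = 0.7 * n_hidden ** (1.0 / n_in)
    w *= scale
    # Biases place each unit's transition region at a different point.
    b = np.linspace(lo, hi, n_hidden) * scale
    return w, b
```

Compared with purely random weights, the transition regions of the units cover the input range instead of overlapping, which is the "fewer units are wasted" benefit noted above.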

Characteristics and comparison of learning algorithms

Learning algorithm   Initial weights     Learning method   Neuron characteristics
Hebbian              0                   unsupervised      any
Perceptron           any                 supervised        binary, bipolar
Delta                any                 supervised        continuous
Widrow-Hoff          any                 supervised        any
Correlation          0                   supervised        any
Winner-take-all      any (normalized)    unsupervised      continuous
Grossberg            0                   supervised        continuous

The multilayer perceptron (MLP) is generally trained with the error back-propagation (EBP) learning algorithm; the combination of MLP and EBP is called a back-propagation network (BPN). A BPN usually involves a large number of weight parameters that need adjusting, and during training the goal function frequently falls into a local minimum. To improve the training results of a BPN, the following methods can be used to improve the search efficiency of EBP:

  • Add a momentum term.
  • Use batch learning.
  • Use an adaptive learning rate.
  • Try different transfer functions.
  • Try different goal functions.
  • Weight training examples by importance.
  • Add a small random perturbation to jump out of local minima.
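The first of these, the momentum term, can be illustrated with a minimal update-rule sketch; the function and parameter values are illustrative, not taken from the paper. The previous update direction is carried forward, which damps oscillation and helps the search move through shallow local minima.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One EBP weight update with a momentum term: the new step blends the
    previous step (scaled by `momentum`) with the current gradient step."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Usage sketch: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, v = 0.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, 2.0 * (w - 3.0), v)
```

After enough iterations `w` settles near the minimizer at 3.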

Bayesian Regularization (BR)

An ideal NN model is one that has small errors not only in sample but also out of sample. To produce a network that generalizes well, MacKay proposes a method to constrain the size of the network parameters through so-called regularization. The idea is that the true underlying function is assumed to have a degree of smoothness; when the parameters in a network are kept small, the network response will be smooth.

The optimal regularization parameter can be determined by Bayesian techniques. In the Bayesian framework the weights of the network are considered random variables.

The Bayesian optimization of the regularization parameters requires the computation of the Hessian matrix of the objective function at its minimum point. Foresee and Hagan propose using the Gauss-Newton approximation to the Hessian matrix, which is readily available if the Levenberg-Marquardt optimization algorithm is used to locate the minimum. The additional computation required for the regularization is thus minimal. Just as when training without BR, we use the same cross-validation procedure to select the optimal number of hidden-layer units when training the NN with BR.
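As a sketch of the standard formulation in MacKay's evidence framework (the notation below is ours, assumed for illustration rather than reproduced from this paper), the regularized objective and the Bayesian re-estimation of its parameters are:

```latex
% Regularized objective: weighted sum of the data error and a weight penalty
F(w) = \beta E_D + \alpha E_W,
\qquad
E_D = \sum_{i=1}^{N} \bigl(y_i - \hat{y}_i\bigr)^2,
\qquad
E_W = \sum_{j=1}^{N_w} w_j^2 .

% Re-estimation at the minimum point w^{MP}, with
% H = \nabla^2 F(w^{MP}) (Gauss-Newton approximated under LM training):
\gamma = N_w - 2\alpha \,\mathrm{tr}\!\left(H^{-1}\right),
\qquad
\alpha = \frac{\gamma}{2\,E_W\!\left(w^{MP}\right)},
\qquad
\beta = \frac{N - \gamma}{2\,E_D\!\left(w^{MP}\right)} .
```

Here $\gamma$ plays the role of the effective number of parameters, and the $\alpha/\beta$ ratio controls how strongly the weights are shrunk toward zero.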

Early Stopping

It is well known that a multilayer feedforward NN trained with a gradient-based algorithm learns in stages, moving from the realization of fairly simple to more complex mapping functions as the training progresses. This is reflected in the observation that the mean-square error on the training set decreases with an increasing number of iterations. With a goal of good generalization, it is difficult to decide when it is best to stop training by looking at the learning curve for training by itself: it is possible to overfit the training data if the training session is not stopped at the right point.

The onset of overfitting can be detected through cross validation in which the available data are divided into training, validation, and testing subsets. The training subset is used for computing the gradient and updating the network weights. The error on the validation set is monitored during the training session. The validation error will normally decrease during the initial phase of training (see Fig. 1), as does the error on the training set. However, when the network begins to overfit the data, the error on the validation set will typically begin to rise. When the validation error increases for a specified number of iterations, the training is stopped, and the weights at the minimum of the validation error are returned.
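The stopping rule just described can be sketched generically; `step` and `val_error` are hypothetical callables standing in for one training update and the validation-error computation, and the `patience` parameter is the "specified number of iterations" mentioned above.

```python
import numpy as np

def train_with_early_stopping(step, val_error, max_iters=1000, patience=20):
    """Run training updates while monitoring the validation error.  Stop
    once the validation error has failed to improve for `patience`
    consecutive iterations; report the iteration with the best error."""
    best_err, best_iter, since_best = np.inf, 0, 0
    for it in range(max_iters):
        step()                      # one gradient/weight update
        err = val_error()           # error on the validation subset
        if err < best_err:
            best_err, best_iter, since_best = err, it, 0
        else:
            since_best += 1
            if since_best >= patience:
                break               # validation error has begun to rise
    return best_iter, best_err
```

In practice one would also snapshot the weights at each improvement, so that the weights at the validation-error minimum can be restored after stopping.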

Bagging

In bagging (or bootstrap aggregating), multiple versions of a predictor are generated and used to form an aggregated predictor. The multiple versions are formed by making bootstrap replicates of the training set and using these as new training sets. When predicting a numerical outcome, the aggregation takes the average over the multiple versions generated from bootstrapping. According to Breiman, both theoretical and empirical evidence suggests that bagging can greatly improve the forecasting performance of a good but unstable model, where a small change in the training data can result in large changes in the model, but can slightly degrade the performance of stable models. NNs, classification and regression trees, and subset selection in linear regression are unstable, while k-nearest-neighbor methods are stable. In the present study we use NNs for option pricing and hedging, so bagging is relevant.

We slightly modify the bagging procedure to accommodate the cross validation performed on ten NN models with the number of hidden-layer units varying from 1 to 10.

First, the available data are divided into the training, validation, and testing subsets as in cross validation and early stopping.

Second, a bootstrap sample is selected from the training set. The bootstrap sample is then used to train the NNs with 1 to 10 hidden-layer units. The validation set is used to select the best NN, i.e., the one with the optimal number of hidden-layer units, and the best model is used to generate one set of predictions on the testing set. This is repeated 25 times, giving 25 sets of predictions.

Third, the bagging prediction is the average across the 25 sets of predictions, and the prediction error is computed as the difference between the actual values and the bagging prediction.
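The second and third steps can be sketched as follows. For brevity a least-squares fit stands in for the cross-validated NN of the paper; the function names and synthetic data are illustrative assumptions.

```python
import numpy as np

def bagging_predict(X_tr, y_tr, X_te, fit_predict, B=25, rng=None):
    """Fit `fit_predict` on B bootstrap resamples of the training set and
    average the resulting predictions on the test set."""
    rng = np.random.default_rng() if rng is None else rng
    preds = []
    for _ in range(B):
        # Bootstrap resample: draw n indices with replacement.
        idx = rng.integers(0, len(X_tr), size=len(X_tr))
        preds.append(fit_predict(X_tr[idx], y_tr[idx], X_te))
    return np.mean(preds, axis=0)   # aggregate by averaging

def ls_fit_predict(X, y, X_new):
    """Hypothetical stand-in base learner: least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

# Usage sketch on noise-free linear data y = 2x + 1.
X_tr = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
y_tr = 2.0 * X_tr[:, 0] + 1.0
X_te = np.array([[0.25], [0.75]])
pred = bagging_predict(X_tr, y_tr, X_te, ls_fit_predict,
                       B=25, rng=np.random.default_rng(1))
```

The prediction error would then be computed as the difference between the actual test values and `pred`.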
