 
Hutchinson et al. provide evidence that NNs may be more
accurate and computationally more efficient when the assumptions of the Black-Scholes
option pricing formula are violated. However, like any other nonparametric
method, NNs are subject to overfitting if the data contain irrelevant
information or a substantial amount of noise. Recently, Garcia and Gençay have
demonstrated that an NN with a homogeneity hint can produce a smaller out-of-sample
pricing error and a more robust average delta-hedging error than
networks without the hint. Here, we provide evidence that
Bayesian regularization,
early stopping, and bagging
are alternative methods that work effectively to prevent overfitting and to
improve prediction accuracy.
Neural Networks (NNs)
Let f be the unknown underlying function (linear or nonlinear)
through which the vector of explanatory variables x relates to the dependent
variable y, i.e., y = f(x). In our model, y is the call price divided by the
strike price, and the explanatory variables are the S&P 500 Index divided by
the strike price (moneyness) and the time to maturity. Then f can
be approximated by a three-layer NN model. Our model is a typical
three-layer feedforward NN, where n is the number of units in the hidden layer
(varying from 1 to 10 in our study), g is a logistic transfer function
defined as g(u) = 1/(1 + e^{-u}), β represents a vector of parameters from the
hidden-layer units to the output-layer unit, γ denotes a matrix of parameters
from the input-layer units to the hidden-layer units, and ε is the error term.
The error term can be made arbitrarily small if sufficiently many explanatory
variables are included and if n is chosen to be large enough. However, if n is
too large, the NN may overfit, in which case the in-sample errors can be made
very small but the out-of-sample errors may be large. The choice of n depends
on the number of explanatory variables and the nature of the underlying
relationship. In the present study, we use a cross-validation procedure to
select n. In particular, we estimate 10 NN models with the number of
hidden-layer units varying from 1 to 10 using the training data, and the one
that performs best on the validation data is then used to generate
out-of-sample prediction results based on the testing data.
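The cross-validation step above can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: it uses a tiny one-hidden-layer logistic network trained by plain gradient descent (rather than Levenberg-Marquardt) on synthetic data standing in for the (moneyness, time-to-maturity) inputs, and selects the number of hidden units by validation error.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(u):
    # logistic transfer function g(u) = 1 / (1 + e^{-u})
    return 1.0 / (1.0 + np.exp(-u))

def train_nn(X, y, n_hidden, epochs=500, lr=0.1):
    """Fit a three-layer feedforward NN by plain gradient descent
    (a simple stand-in for the Levenberg-Marquardt fit used in the text)."""
    d = X.shape[1]
    W1 = rng.normal(scale=0.5, size=(d, n_hidden))   # input -> hidden weights
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=n_hidden)        # hidden -> output weights
    b2 = 0.0
    for _ in range(epochs):
        H = logistic(X @ W1 + b1)                    # hidden activations
        err = H @ W2 + b2 - y                        # prediction residual
        # backpropagate the squared-error gradient
        gW2 = H.T @ err / len(y)
        gb2 = err.mean()
        dH = np.outer(err, W2) * H * (1 - H)
        gW1 = X.T @ dH / len(y)
        gb1 = dH.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2

def predict(params, X):
    W1, b1, W2, b2 = params
    return logistic(X @ W1 + b1) @ W2 + b2

# synthetic stand-in for the (moneyness, time-to-maturity) -> price data
X = rng.uniform(size=(300, 2))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.02, size=300)
X_tr, y_tr = X[:200], y[:200]
X_va, y_va = X[200:], y[200:]

# estimate 10 models with 1..10 hidden units; keep the best on validation
best_n, best_mse, best_params = None, np.inf, None
for n in range(1, 11):
    params = train_nn(X_tr, y_tr, n)
    mse = np.mean((predict(params, X_va) - y_va) ** 2)
    if mse < best_mse:
        best_n, best_mse, best_params = n, mse, params
```

The selected `best_params` would then be applied, unchanged, to the held-out testing subset.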
In general there is no analytical solution to this
minimization problem, and the parameters have to be estimated numerically.
Because the Levenberg-Marquardt algorithm is by far the fastest algorithm
for moderate-sized (up to several hundred free parameters) feedforward NNs,
we use it to estimate the parameters. The initial values of the parameters
are generated with a method in which the initial values are assigned such that
the active regions of the layer's units are roughly evenly distributed over
the range of the explanatory-variable space. The benefit is that fewer units
are wasted and that the network converges faster compared with purely random
initial parameter values.
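A minimal sketch of this kind of initialization, under illustrative assumptions (the paper's exact scheme may differ): give each hidden unit unit-length input weights, then choose its bias so that the center of its logistic active region, where the net input is zero, falls on an even grid over the input range.

```python
import numpy as np

rng = np.random.default_rng(1)

def spread_init(n_in, n_hidden, lo=0.0, hi=1.0):
    """Illustrative initializer: spread the hidden units' active regions
    evenly over [lo, hi] instead of letting them clump at random."""
    W = rng.normal(size=(n_in, n_hidden))
    W /= np.linalg.norm(W, axis=0)            # unit-length weight columns
    centers = np.linspace(lo, hi, n_hidden)   # evenly spaced activation centers
    b = -centers * W.sum(axis=0)              # net input is 0 at x = center * 1
    return W, b

W, b = spread_init(n_in=2, n_hidden=5)
# each unit's net input crosses zero at its own grid point, so the
# logistic units' sensitive regions tile the input range
```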
Characteristics and comparison of learning algorithms:

Learning rule      Initial weights     Learning method   Neuron characteristics
Hebbian            0                   unsupervised      any
Perceptron         any                 supervised        binary/bipolar
Delta              any                 supervised        continuous
Widrow-Hoff        any                 supervised        any
Correlation        0                   supervised        any
Winner-take-all    any (normalized)    unsupervised      continuous
Grossberg          0                   supervised        continuous
Multilayer Perceptron (MLP)
The multilayer perceptron (MLP) is generally trained with the error back-propagation
(EBP) learning algorithm.
The combination of MLP and EBP is commonly called a back-propagation network
(BPN). Training a BPN usually involves adjusting a large number of weight
parameters, and during training the goal function frequently falls into a
local minimum. To improve the training results of a BPN, the following
methods can generally be used to improve the search efficiency of EBP:
1. Add a momentum term.
2. Use batch learning.
3. Adopt a variable learning rate.
4. Use different transfer functions.
5. Use different goal functions.
6. Weight important examples more heavily.
7. Add a small random perturbation in order to jump out of local minima.
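The momentum term, the first of these devices, can be sketched as follows; the quadratic objective is purely illustrative, chosen to have very different curvatures along its two axes, which is exactly the situation momentum helps with.

```python
import numpy as np

def grad_descent_momentum(grad, w0, lr=0.1, mu=0.9, steps=200):
    """Gradient descent with a momentum term: each update mixes the
    current gradient with the previous step, which speeds the EBP search
    through flat regions and damps oscillation in narrow valleys."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v - lr * grad(w)   # velocity accumulates past gradients
        w = w + v
    return w

# illustrative quadratic bowl with curvatures 0.1 and 4.0; minimum at (0, 0)
grad = lambda w: np.array([0.1 * w[0], 4.0 * w[1]])
w = grad_descent_momentum(grad, [5.0, 5.0])
```

With `mu = 0`, the same learning rate would crawl along the flat axis; the momentum term lets past gradients accumulate there while oscillations along the steep axis largely cancel.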
Bayesian Regularization (BR)
An ideal NN model is one that has small errors not only in
sample, but also out of sample. To produce a network that generalizes well,
MacKay proposes a method to constrain the size of the network
parameters through so-called regularization. The idea is that the true
underlying function is assumed to have a degree of smoothness. When the
parameters in a network are kept small, the network response will be smooth.
The optimal regularization parameter can be determined by
Bayesian techniques. In the Bayesian framework, the weights of the
network are considered random variables.
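Concretely, the regularized objective adds a weight-size penalty to the usual sum of squared errors. A standard form consistent with this framework (here α and β denote the regularization hyperparameters set by the Bayesian evidence procedure) is

```latex
F(\mathbf{w}) \;=\; \beta E_D + \alpha E_W,
\qquad
E_D = \sum_{i}\bigl(y_i - \hat{y}_i\bigr)^{2},
\qquad
E_W = \sum_{j} w_j^{2},
```

where a larger ratio α/β favors smaller weights and hence a smoother network response.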
The Bayesian optimization of the regularization parameters
requires the computation of the Hessian matrix of the objective function at
the minimum point. Foresee and Hagan propose using the Gauss-Newton
approximation to the Hessian matrix, which is readily available if the
Levenberg-Marquardt optimization algorithm is used to locate the minimum
point. The additional computation required for the regularization is thus
minimal. Just as for training without BR, we use the same cross-validation
procedure to select the optimal number of hidden-layer units when training
the NN with BR.
Early Stopping
It is well known that a multilayer feedforward NN trained
with a given algorithm learns in stages, moving from the realization of
fairly simple to more complex mapping functions as the training progresses.
This is reflected in the observation that the mean-square error decreases
with an increasing number of iterations during training. With a goal of good
generalization, it is difficult to decide when it is best to stop training
by looking at the learning curve for training by itself. It is possible
to overfit the training data if the training session is not stopped at the
right point.
The onset of overfitting can be detected through cross-validation,
in which the available data are divided into training,
validation, and testing subsets. The training subset is used for computing
the gradient and updating the network weights. The error on the validation
set is monitored during the training session. The validation error will
normally decrease during the initial phase of training (see Fig. 1), as does
the error on the training set. However, when the network begins to overfit
the data, the error on the validation set will typically begin to rise. When
the validation error increases for a specified number of iterations, the
training is stopped, and the weights at the minimum of the validation error
are returned.
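The stopping rule just described can be sketched as a generic loop; `step` and `val_error` are hypothetical callbacks standing in for one Levenberg-Marquardt update and the validation-set evaluation, and `patience` is the "specified number of iterations" from the text.

```python
import numpy as np

def train_with_early_stopping(step, val_error, max_iter=1000, patience=20):
    """Early-stopping loop: `step()` performs one training update and
    returns the current weights; `val_error(w)` evaluates the validation
    set. Training stops once the validation error has failed to improve
    for `patience` consecutive iterations, and the weights at the
    validation minimum are returned."""
    best_err, best_w, bad = np.inf, None, 0
    for _ in range(max_iter):
        w = step()
        err = val_error(w)
        if err < best_err:
            best_err, best_w, bad = err, w, 0   # new validation minimum
        else:
            bad += 1                            # validation error rising
            if bad >= patience:
                break
    return best_w, best_err

# toy demo: the "weights" are an iteration counter, and the simulated
# validation error falls until iteration 50 and then rises (overfitting)
it = iter(range(1000))
w, err = train_with_early_stopping(lambda: next(it), lambda t: (t - 50) ** 2)
```

In the demo the loop runs 20 iterations past the minimum before giving up, and the returned `w` is the state at the validation minimum, not the last state visited.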
Bagging
In bagging (or bootstrap aggregating), multiple versions of
a predictor are generated and used to get an aggregated predictor.
The multiple versions are formed by making bootstrap replicates of the
training set and using these as new training sets. When predicting a
numerical outcome, the aggregation takes the average over the multiple
versions generated from bootstrapping. Both theoretical and empirical
evidence suggests that bagging can greatly improve the forecasting
performance of a good but unstable model, where a small change in the
training data can result in large changes in the model, but can slightly
degrade the performance of stable models. NNs, classification and regression
trees, and subset selection in linear regression are unstable, while
nearest-neighbor methods are stable. In the present study we use NNs for
option pricing and hedging, and thus bagging becomes relevant.
We slightly modify the bagging procedure to accommodate the
cross-validation performed on ten NN models with the number of hidden-layer
units varying from 1 to 10.
First, the available data are divided into the training,
validation, and testing subsets, as in cross-validation and early stopping.
Second, a bootstrap sample is selected from the training
set. The bootstrap sample is then used to train the NNs with 1 to 10 hidden-layer
units. The validation set is used to select the best NN, i.e., the one with the
optimal number of hidden-layer units, and the best model is used to generate
one set of predictions on the testing set. This is repeated 25 times, giving
25 sets of predictions.
Third, the bagging prediction is the average across the 25
sets of predictions, and the prediction error is computed as the difference
between the actual values and the bagging prediction values.
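The three steps above can be sketched as follows. This is an illustrative reduction: the `fit` callback stands in for the full train-then-select-on-validation step from the second step of the text, and the demo base learner is ordinary least squares (an unstable learner in the sense discussed above) rather than the paper's NNs.

```python
import numpy as np

rng = np.random.default_rng(2)

def bagging_predict(fit, X_tr, y_tr, X_te, n_boot=25):
    """Bagging: draw `n_boot` bootstrap replicates of the training set,
    fit a model to each, and average the resulting test-set predictions."""
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_tr), size=len(y_tr))  # sample with replacement
        model = fit(X_tr[idx], y_tr[idx])
        preds.append(model(X_te))
    return np.mean(preds, axis=0)   # aggregate by averaging the 25 predictions

# demo base learner: ordinary least squares on synthetic linear data
def fit_ols(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda Xnew: Xnew @ w

X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
yhat = bagging_predict(fit_ols, X[:80], y[:80], X[80:])   # bagged test predictions
```

The prediction error of the third step is then simply `y[80:] - yhat`.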
