Interconnected neural network and the back propagation algorithm
27 September 2007, 02:43 AM
So you heard of artificial neural network, did some general readings here and there and they just wow'ed. Naturally you decided to learn this little thing called neural network and started net-roaming for related information. But unfortunately for you, like any other new structure and algorithm, neural network and its learning algorithm is just as hard to understand, all thanks to the nearly unreadable papers and articles written by mathematician for mathematician. To a programmer, one of the deadliest obstacle, setting public speaking aside, is when asked to apply some flashy new algorithm into existing system directly from a mathematician - or based on guidelines written by one. Most of the time, the underlying idea is really not THAT hard, and could sometime be just matter of facts. What is so difficult is the distortion of knowledge by programmers when interpreting a mathematician words, papers, or articles.

Now my intention is for you to understand the structure of fully-interconnected neural network (see diagram below) and how learning is achieved through the back propagation algorithm. Process flow is the most important aspect of programming, which is always lacking in mathematical articles. Because of this every process step will be explained so the reader (the programmer) will always know of his or her footing in the whole process, and hopefully by the end has a mental image of the network, the connection, and can mentally "see" the flow of input information up and down the layers. Some math equations will be presented, together with how to use them so no worrying on guessing what values to plug into the equation.

Without further ado, lets create a neural network version of the popular logic gate XOR.

Suppose a XOR function is to be estimated using a network such as below. Two truth tables are presented in the diagram, a XOR truth table, and a NOT-XOR truth table. The XOR logic gate has always been the choice when experimenting with neural network, so this time I decided on using its inverse, the NOT-XOR logic gate. However, I'd like to stress that it doesn't matter which table we choose, the network structure and learning algorithm remains the same. So if you feel the need to estimate a XOR function instead of a NOT-XOR go ahead, the article is fine with that. It is a good time to keep a mental image of above diagram. The crucial part is to remember how many nodes there is and what node connects to what node. Getting the labels is a plus but not really necesary. To make this easier remember that each node from a lower layer must connect to all nodes in the next higher layer with the exception of bias nodes (the black one). A bias node will always be connected with nodes in a higher layer node, but NO node regardless of type connects to any bias node.

Neural network learning is all about the modification of weights. A weight, in a programmer's perspective, is a floating value between two nodes, representing the the strength of connection between the two. Notice the labels w1, w2, ..., wbias4, are the weights for this network. The whole idea is to find values for all weights so when input A and B is presented output O1 will be close to the expected output Y. What are the black nodes? They are called the bias nodes, and are there to account for some special cases that cause the algorithm to break. With no biasing, there is a chance for all weights to 'turn off', basically causing most if not all weights to approach zero. The method we use to modify all the weights will be the well known back propagation algorithm - a simple and straightforward error feedback algorithm.

From the given NOT-XOR truth table (or XOR table if you like), there are four sets of training sample, obvious enough. The simplest of doing back propagation learning is to present all four sets one by one, and repeating the same process until a minimum global error is reached. In other word, until it closely estimate the NOT-XOR function. Here is the back propagation method,

For each pair of training sample [X = { A, B }, Y]
- Propagate X from bottom layer to top layer to get output O
- Calculate error D between actual output O and the expected output Y
- Propagate error D backward from top layer to bottom layer

Once all four sample sets has been presented, one epoch or iteration is completed. Back propagation often requires *many* epochs to reach a minimal global error.

Ok, that was simple enough. But how does it relates with our network diagram? What does it mean by propagate?

Propagation is the process of stimulating a network with an input pattern, and in turn the network generates a new output pattern. This new output pattern, in the highest layer is the output O1 itself, however it should be noted that an intermediate output pattern is also generated in the second layer which in turn propagated further forward to generate the output O1. In our network, there are three layers of nodes, the left most layer, the center layer, and the right most layer. We will call them the input, hidden, and output layer respectively. With three layers of nodes, there will be two forward propagation processes, and two backward propagation processes, totaling of four steps of processing. The steps are, forward propagation from input to hidden layer, forward propagation from hidden to output layer, backward propagation from output to hidden layer, and backward propagation from hidden to input layer. These are the actual process flow and must be done in order.

The first step - forward propagation from input to hidden layer Referring to the above diagram, the objective is to propagate input A and B to generate a new intermediate output pattern, called the output vector Z. There are three nodes (bias not included) in the hidden layer thus making our output vector Z to have three elements, Z = { Z1, Z2, Z3 }. The hidden layer's bias node's output is always constant at 1. Now to actually calculate values for Z, there is no other way but to introduce some math. The equation is called sigmoid. This will be used because of its simplicity and easily and linearly differentiable. Furthermore this one is the most known activation function because of its popularity in teaching materials. Expanding the equation gives, Where, Refering back to the network diagram, the three equations for Z1, Z2, and Z3 must relate with the connections between the input and hidden layer. Help yourself by confirming this. Keep the values of vector Z for later use.

The second step - forward propagation from hidden to output layer

In this step, the method is the same with the first one, in fact you may be able to work this out for yourself. However just for completeness sake, it will be explained anyway. From the above diagram, the objective is to propagate the intermediate output activation Z to generate a new output activation called O1. The output vector O is a vector with one element because there is only one output node, Making vector O = { O1 }. To find O1, use the same sigmoid function from before. Once the output vector O = { O1 }, is calculated. The two step forward propagation is complete, and the output vector O is now called the actual output. At this point, vector O is actually the prediction itself, the catch is we can be sure of it being far off from the expected output. To know how big the difference is, the error vector D is to be calculated with, where vector Y is the expected output. For example by training using sample A=0, B=0, and Y=1, the expected output vector will be Y = { Y1 } = { 1 }. To calculate for D1, This error vector D = { D1 } will be propagated backward starting from output layer down to hidden layer, modifying the weights between them accordingly.

The third step - backward propagation from output to hidden layer

Now the error D is known, it is time to propagate this error backward and make some weights adjustment. What to do in this step is to adjust all weights (w7, w8, w9, wbias4) between hidden and output layer by a certain rate. The rate of change for each weight is given by, where n is the learning rate, Dj the error, and Xi is the output activation from previous layer. In this case, vector D is the error vector D calculated earlier, and vector X is the output activation of the hidden layer, or the vector Z = { Z1, Z2, Z3}. Thus the rate of change equation is,  Applying the equations for weights modification gives, With this, backward propagation from output down to hidden layer is complete.

An important note: do NOT modify the weights yet (i.e w7, w8 and w9) unless all error deltas has been calculated. Keep reading to know what a delta is.

Next is to continue propagating error backward from hidden down to input layer. It gets a bit tricky, unlike the output layer, finding error vector D for the hidden layer is not as straightforward as finding it in the output layer. To avoid confusion, assume the error vector for hidden layer as the symbol delta. Refer diagram below for the deltas, To find the delta error vector, Expanding the equation to get the deltas, An important note, to calculate the delta error, the weights used in the equation (i.e w7, w8, and w9) must be the original value *before* weight adjustment. In practice, find all error vector before modifying any weights.

Now that we have the error vector for hidden layer, propagating the error backward from hidden down to input layer is now possible.

The fourth step - backward propagation from hidden down to input layer The method is still the same, except use the delta error and actual input sample, Expanding the equation to find rate of change for all other weights, That's it, the four steps are complete for one sample set. Repeat all four steps for all other remaining sample sets to complete an epoch, and finallly do as many epochs until you are satisfied with the estimation. An estimation is the output O1 once the second step is completed. In performance mode or testing mode, only the first two steps are carried out to get an estimation or a prediction.

A few things not mentioned beforehand are the initial values for all weights and the value for learning rate n. All weights must be initialized with random polar values in the range (-1.0, 1.0) or (-0.1, 0.1) or could be a smaller range. The learning rate n is recommended at 0.35. Why three hidden nodes was chosen instead of two is due to practical reason - this simple implementation of back propagation algorithm has a very high chance of not converging with just two.

Epochs needed to complete the learning is around (5000-6000) epochs - so drop that assumption of only a few hundreds.