
The Back Propagation Algorithm

Artificial Neural Networks

Artificial Neural Networks (ANNs) are loosely modeled after the brain. ANNs are composed of units (also called nodes) that are modeled after neurons, with weighted links interconnecting the units.

What is an ANN?

Overall, an ANN consists of an input layer, one or more hidden layers, and an output layer.

Each hidden layer has an input, an output, and a number of ANN units. Each ANN unit is composed of two main parts: the first part sums the weighted inputs, and the second part passes that sum through a threshold function.

If the activation is greater than 0, the unit activates and outputs a “1”; otherwise it outputs a 0 (or –1). The extra input $X_0$ can be set to any value, so instead of tuning the threshold function to activate at some fixed point $Y$, $X_0$ can simply be set to $-Y$.
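To make the unit concrete, here is a minimal sketch of a single unit with a step threshold, with the bias folded in as the extra input $X_0 = -Y$. This is only an illustration, not the implementation used in the XOR example later.

import numpy as np

def unit_output(inputs, weights, threshold=0.0):
    # One ANN unit: weighted summation followed by a step threshold.
    activation = np.dot(inputs, weights)
    return 1 if activation > threshold else 0

# Folding the threshold Y into the inputs: add x0 = -Y with weight 1,
# then the unit can always be compared against 0.
Y = 0.5
x = [0.7, 0.2]
w = [1.0, 1.0]
print(unit_output(x + [-Y], w + [1.0]))   # fires, because 0.7 + 0.2 > 0.5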

Why use an ANN?

  • Can approximate any function (useful when the target function is not known)

  • Generalizes logistic regression (the output can be a vector of continuous or discrete values)

Forward Propagation

Let’s take one ANN unit out of the ANN. By definition, an ANN unit is a composition of a summation function and a threshold function (a.k.a. activation function). Before describing forward propagation, let’s define some terms and notation.

Threshold Function

For each ANN unit, the summation result is passed to a threshold function to produce an output. Let’s say $a_i^k$ is the $i^{th}$ input in the $k^{th}$ layer and the corresponding output is $o_i^k$. Below is an example of input and output in layer #2.

A vector can be used to describe all scalar input/output in the same layer.

a_k = \left[ \begin{matrix} a_1^k \\ a_2^k \\ a_3^k \end{matrix} \right], \quad o_k = \left[ \begin{matrix} o_1^k \\ o_2^k \\ o_3^k \end{matrix} \right]

Applying the same threshold function ff to all inputs in the same layer will give us the relationship between input and output.

o_k = \left[ \begin{matrix} o_1^k \\ o_2^k \\ o_3^k \end{matrix} \right] = \left[ \begin{matrix} f(a_1^k) \\ f(a_2^k) \\ f(a_3^k) \end{matrix} \right] = f(\bm{a_k})

Then we have our first formula,

o_k = f(\bm{a_k}) \tag{1}
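As a quick numerical illustration of formula (1), assuming a logistic threshold function (one common choice; the XOR example later uses tanh), $f$ is applied elementwise to a layer’s summation results:

import numpy as np

def f(a):
    # logistic threshold function
    return 1.0 / (1.0 + np.exp(-a))

a_k = np.array([0.5, -1.0, 2.0])   # summation results of one layer
o_k = f(a_k)                       # formula (1): o_k = f(a_k), elementwise
print(o_k)                         # [0.622..., 0.268..., 0.880...]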

Weights and Summation

Weights form the connections between each layer and its previous layer. Below is an example of the weights in a 3-layer ANN.

To calculate the input of node #5, we need to sum the four input nodes, each multiplied by its weight. Let’s say $o_i$ is the output of the $i^{th}$ input node and $w_{i5}$ is the weight between node #$i$ and node #5.

a_5 = w_{15} * o_1 + w_{25} * o_2 + w_{35} * o_3 + w_{45} * o_4 = \left[ \begin{matrix} w_{15}, w_{25}, w_{35}, w_{45} \end{matrix} \right] * \left[ \begin{matrix} o_1 \\ o_2 \\ o_3 \\ o_4 \end{matrix} \right]

Putting all the nodes in layer #1 together gives us a fuller matrix expression. Take $\bm{o_0}$ as the input and $\bm{a_1}$ as the output of the summation (the input to the threshold functions in layer #1).

\bm{a_1} = \left[ \begin{matrix} w_{15} * o_1^0 + w_{25} * o_2^0 + w_{35} * o_3^0 + w_{45} * o_4^0 \\ w_{16} * o_1^0 + w_{26} * o_2^0 + w_{36} * o_3^0 + w_{46} * o_4^0 \\ w_{17} * o_1^0 + w_{27} * o_2^0 + w_{37} * o_3^0 + w_{47} * o_4^0 \\ \end{matrix} \right] = \left[ \begin{matrix} w_{15}, w_{25}, w_{35}, w_{45} \\ w_{16}, w_{26}, w_{36}, w_{46} \\ w_{17}, w_{27}, w_{37}, w_{47} \\ \end{matrix} \right] * \left[ \begin{matrix} o_1^0 \\ o_2^0 \\ o_3^0 \\ o_4^0 \end{matrix} \right] = \bm{W_1} * \bm{o_0}

Every layer follows the same pattern, which gives us our second formula.

Define $\bm{W_k}$ to be the weight matrix of the $k^{th}$ layer, $\bm{o_{k-1}}$ the output of the previous layer, and $\bm{a_k}$ the output of the summation (the input to the threshold functions of layer #k).

\bm{W_k} * \bm{o_{k-1}} = \bm{a_k} \tag{2}
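For instance, for the 4-to-3 layer above, formula (2) is just a matrix-vector product; the numbers below are made up purely for illustration:

import numpy as np

# W_1 has one row per unit in layer #1 (nodes 5-7) and one column per
# node in layer #0 (nodes 1-4); the values are arbitrary.
W_1 = np.array([[0.1, 0.2, 0.3, 0.4],
                [0.5, 0.6, 0.7, 0.8],
                [0.9, 1.0, 1.1, 1.2]])
o_0 = np.array([1.0, 0.0, 1.0, 0.0])   # output of layer #0 (the input)

a_1 = W_1 @ o_0                         # formula (2): a_k = W_k o_{k-1}
print(a_1)                              # [0.4, 1.2, 2.0]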

Forward Propagation Algorithm

Let’s have a quick recap of the two important formulas from the previous sections:

o_k = f(\bm{a_k}) \tag{1}

\bm{W_k} * \bm{o_{k-1}} = \bm{a_k} \tag{2}

Formula #1 turns $\bm{a}$ into $\bm{o}$, while formula #2 pushes the result on to the next layer.

With the input $\bm{x}$ and the weight matrices $\bm{W}$, we can calculate the output easily by applying formula #2 and formula #1 layer by layer. For a 2-layer network:

\bm{o} = f(\bm{W_2}f(\bm{W_1}\bm{x}))
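Putting the two formulas together, a forward pass over any number of layers is just an alternation of (2) and (1). Below is a minimal sketch, written independently of the class implementation later in this post and assuming tanh as the threshold function:

import numpy as np

def forward(weight_matrices, x, f=np.tanh):
    # Apply o_k = f(W_k o_{k-1}) layer by layer and return the final output.
    o = x
    for W in weight_matrices:
        o = f(W @ o)          # formula (2), then formula (1)
    return o

# Two-layer example with arbitrary weights: o = f(W_2 f(W_1 x))
W_1 = np.array([[0.5, -0.5], [0.3, 0.8]])
W_2 = np.array([[1.0, -1.0]])
print(forward([W_1, W_2], np.array([1.0, 0.0])))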

Back Propagation

Back propagation is an algorithm for updating the weights in order to minimize the difference between the expected output and the actual output of forward propagation on the training set.

Goal

Let’s define the loss function first. Usually we use the distance between the expected output and the actual output. Say that $\bm{o}$ is the output calculated by forward propagation while $\bm{y}$ is the expected output from the training set.

C = \frac{1}{2}(\bm{o} - \bm{y})^2

Our goal is to minimize $C$ by updating $\bm{W}$ in each layer. The usual tool is gradient descent:

\bm{W_k} = \bm{W_k} - \eta \frac{\partial C}{\partial \bm{W_k}} \tag{3}
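As a tiny sketch of the loss and of formula (3) in code (the gradient values here are made-up stand-ins; computing the real gradient is exactly what back propagation is for):

import numpy as np

def quadratic_loss(o, y):
    # C = 1/2 (o - y)^2, summed over the output components
    return 0.5 * np.sum((o - y) ** 2)

def gradient_step(W_k, grad_C_wrt_W_k, eta=0.1):
    # formula (3): move W_k against the gradient of the loss
    return W_k - eta * grad_C_wrt_W_k

print(quadratic_loss(np.array([0.8]), np.array([1.0])))   # 0.02

W = np.array([[0.5, -0.2]])
grad = np.array([[0.1, 0.3]])    # stand-in for dC/dW from back propagation
print(gradient_step(W, grad))    # [[0.49, -0.23]]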

The question is how to find the connection between $\bm{W_1}$ and $C$ (and between $\bm{W_2}$ and $C$).

Chain Rule

Since $C$ is a function of $\bm{o}$ ($C = \frac{1}{2}(\bm{o} - \bm{y})^2$), $\bm{o}$ is a function of $\bm{a}$ ($o_k = f(\bm{a_k})$), and $\bm{a}$ is a function of $\bm{W}$ ($\bm{W_k} * \bm{o_{k-1}} = \bm{a_k}$), we can rewrite the partial derivative of $C$ with respect to $\bm{W}$ in terms of $\bm{a}$ by the chain rule.

\frac{\partial C}{\partial \bm{W_k}} = \frac{\partial C}{\partial \bm{a_k}} \frac{\partial \bm{a_k}}{\partial \bm{W_k}}

From formula #2, $\bm{W_k} * \bm{o_{k-1}} = \bm{a_k}$, so

\frac{\partial \bm{a_k}}{\partial \bm{W_k}} = \frac{\partial (\bm{W_k} * \bm{o_{k-1}})}{\partial \bm{W_k}} = \bm{o_{k-1}}^T

The next step is to find out $\frac{\partial C}{\partial \bm{a_k}}$.
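A quick way to sanity-check $\partial \bm{a_k} / \partial \bm{W_k} = \bm{o_{k-1}}^T$ and the chain rule above is a finite-difference comparison on a toy single-layer network with a logistic output. This is only an illustrative check, not part of the algorithm:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))          # one layer: 3 inputs, 2 outputs
o_prev = np.array([0.2, -0.4, 0.9])  # output of the previous layer
y = np.array([1.0, 0.0])             # expected output

def loss(W):
    o = sigmoid(W @ o_prev)
    return 0.5 * np.sum((o - y) ** 2)

# analytic gradient: dC/dW = eps * o_prev^T, with eps = dC/da
o = sigmoid(W @ o_prev)
eps = (o - y) * o * (1 - o)
analytic = np.outer(eps, o_prev)

# finite-difference gradient for comparison
numeric = np.zeros_like(W)
h = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy()
        Wp[i, j] += h
        numeric[i, j] = (loss(Wp) - loss(W)) / h

print(np.allclose(analytic, numeric, atol=1e-5))   # True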

Error Term

Let’s define the error term as

\epsilon_k = \frac{\partial C}{\partial \bm{a_k}}

It doesn’t seem like we can calculate $\epsilon_k$ directly, but what if we already know $\epsilon_{k+1}$?

From formulas #1 and #2,

o_k = f(\bm{a_k}) \tag{1}

\bm{W_{k+1}} * \bm{o_k} = \bm{a_{k+1}} \tag{2}

we can express the error term of layer $k$ in terms of the error term of layer $k+1$ by the chain rule,

\epsilon_k = \frac{\partial C}{\partial \bm{a_k}} = \frac{\partial C}{\partial \bm{a_{k+1}}} \frac{\partial \bm{a_{k+1}}}{\partial \bm{o_k}} \frac{\partial \bm{o_k}}{\partial \bm{a_k}} = \epsilon_{k+1}\bm{W_{k+1}}^T f'(\bm{a_k})

Back Propagation Algorithm

The BP algorithm runs in the reverse direction compared to forward propagation. It is described as follows (a short numpy sketch of both steps appears after the list).

For each layer $k$, from right to left:

  1. Calculate the error term from the error term of layer $k+1$:

\epsilon_k = \epsilon_{k+1}\bm{W_{k+1}}^T f'(\bm{a_k})

  • For the rightmost layer (the output layer), the error term is computed directly from the loss; here $y'$ is the actual output and $y$ is the expected output:

\epsilon = (y' - y)f'(\bm{a_k})

  2. Update the weights using the error term of layer $k$ and the output of layer $k-1$:

\bm{W_k} = \bm{W_k} - \eta\epsilon_k\bm{o_{k-1}}^T
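Here is that sketch: one backward pass and weight update for a network with a single hidden layer, treating vectors as columns and using tanh as the threshold function (so $f'(a) = 1 - f(a)^2$). It is written independently of the class implementation in the next section:

import numpy as np

def backprop_step(W1, W2, x, y, eta=0.1):
    # forward pass: formulas (1) and (2)
    a1 = W1 @ x
    o1 = np.tanh(a1)
    a2 = W2 @ o1
    o2 = np.tanh(a2)

    # error term of the output layer: (y' - y) f'(a)
    eps2 = (o2 - y) * (1 - o2 ** 2)
    # error term of the hidden layer: propagate eps2 back through W2
    eps1 = (W2.T @ eps2) * (1 - o1 ** 2)

    # weight updates: W_k <- W_k - eta * eps_k o_{k-1}^T
    W2 = W2 - eta * np.outer(eps2, o1)
    W1 = W1 - eta * np.outer(eps1, x)
    return W1, W2

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(2, 2)), rng.normal(size=(1, 2))
W1, W2 = backprop_step(W1, W2, x=np.array([1.0, 0.0]), y=np.array([1.0]))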

XOR Problem

The XOR (“exclusive or”) problem is a classic problem in ANN research. It is the problem of using a neural network to predict the output of an XOR logic gate given two binary inputs. An XOR function should return a true value if the two inputs are not equal and a false value if they are equal.

Training Set

All possible inputs and their expected outputs are shown below.

| Input 1 | Input 2 | Output |
|---------|---------|--------|
| 0       | 0       | 0      |
| 0       | 1       | 1      |
| 1       | 0       | 1      |
| 1       | 1       | 0      |

Training the ANN

The network has 2-dimensional inputs, a 1-dimensional output (binary classification), and 1 hidden layer with 2 ANN units.

Implementation

import numpy as np
import matplotlib.pyplot as plt


def tanh(x):
    return np.tanh(x)


def tanh_derivative(x):
    return 1 - np.tanh(x) * np.tanh(x)


def logistic(x):
    return 1 / (1 + np.exp(-x))


def logistic_derivative(x):
    return logistic(x) * (1 - logistic(x))


class Regularization:
    # Minimal stand-in (assumption): the original Regularization helper is not
    # shown in this post. With la=0, the default used in the XOR example, it
    # contributes nothing to the gradient.
    def __init__(self, la=0):
        self.la = la

    def regularization_term_gradient(self, m, weights):
        return (self.la / m) * weights


class NeuralNetwork:
    def __init__(self, layers, activation='logistic', la=0):
        if activation == 'logistic':
            self.activation = logistic
            self.activation_deriv = logistic_derivative
        elif activation == 'tanh':
            self.activation = tanh
            self.activation_deriv = tanh_derivative

        self.weights = []
        self.regularization = Regularization(la)
        for i in range(1, len(layers) - 1):
            # Random floats in the half-open interval [-0.25, 0.25)
            self.weights.append((2 * np.random.random((layers[i - 1] + 1, layers[i] + 1)) - 1) * 0.25)
            self.weights.append((2 * np.random.random((layers[i] + 1, layers[i + 1])) - 1) * 0.25)

    def fit(self, x_train_mat, y_train_mat, learning_rate=0.02, epochs=20000):
        """Train the network with stochastic gradient descent.

        Parameters
        ----------
        x_train_mat: matrix, one training sample per row
        y_train_mat: matrix, expected output per sample
        learning_rate: float, the step size eta in formula (3)
        epochs: int, number of gradient-descent iterations
        """
        x_train_mat = NeuralNetwork.append_ones_at_last(x_train_mat)
        self._gradient_descent(x_train_mat, y_train_mat, learning_rate, epochs)

    def predict(self, x_predict_mat):
        x_predict_mat = NeuralNetwork.append_ones_at_last(x_predict_mat)
        a = NeuralNetwork.feed_forward(self.weights, x_predict_mat, self.activation)

        return a[-1]

    @staticmethod
    def cost_function(weights, x_train_mat, y_train_mat, activation):
        a = NeuralNetwork.feed_forward(weights, x_train_mat, activation)

        return np.dot((y_train_mat - a[-1]).T, (y_train_mat - a[-1]))[0, 0]

    @staticmethod
    def feed_forward(weights, x_train_mat, activation):
        # forward propagation: apply summation then activation layer by layer
        a = [x_train_mat]

        for l in range(len(weights)):
            a.append(activation(np.dot(a[l], weights[l])))

        return a

    @staticmethod
    def deltas(weights, a, m, y_train_mat, activation_deriv, regularization):
        # back propagation: error term of the output layer first, then each
        # hidden layer in reverse order
        error = y_train_mat - a[-1]
        deltas = [error * activation_deriv(a[-1])]
        for l in range(len(a) - 2, 0, -1):
            gradient = np.multiply(deltas[-1].dot(weights[l].T), activation_deriv(a[l]))
            regular_term = regularization.regularization_term_gradient(m, weights[l])
            deltas.append(gradient + regular_term.T)

        deltas.reverse()

        return deltas

    def _gradient_descent(self, x_train_mat, y_train_mat, learning_rate=0.2, epochs=20000):
        m = x_train_mat.shape[0]
        loss_list = []
        epochs_list = []
        for k in range(epochs):
            # pick one random sample per iteration (stochastic gradient descent)
            i = np.random.randint(x_train_mat.shape[0])
            a = NeuralNetwork.feed_forward(self.weights, x_train_mat[i], self.activation)
            cost = NeuralNetwork.cost_function(self.weights, x_train_mat, y_train_mat, self.activation)
            loss_list.append(cost)
            epochs_list.append(k)
            deltas = NeuralNetwork.deltas(self.weights, a, m, y_train_mat[i],
                                          self.activation_deriv, self.regularization)

            for i in range(len(self.weights)):
                layer = np.atleast_2d(a[i])
                delta = np.atleast_2d(deltas[i])
                self.weights[i] += learning_rate * layer.T.dot(delta)

        plt.plot(epochs_list, loss_list)
        plt.xlabel('epochs')
        plt.ylabel('loss')
        plt.show()

    @staticmethod
    def append_ones_at_last(x_train_mat):
        # append a constant 1 to every sample so the bias is learned as a weight
        temp = np.ones([x_train_mat.shape[0], x_train_mat.shape[1] + 1])
        temp[:, 0:-1] = x_train_mat
        return temp

0,0,0
0,1,1
1,0,1
1,1,0

def test_xor(data):
    x_train_mat = data[0]
    y_train_mat = data[1]
    nn = NeuralNetwork([2, 2, 1], 'tanh')
    nn.fit(x_train_mat, y_train_mat)
    test_cases = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
    for test_case in test_cases:
        np.testing.assert_almost_equal(test_case[1], nn.predict(np.asmatrix(test_case[0])), decimal=1)
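For completeness, here is a sketch of how the training data shown above might be loaded and passed to test_xor. The file name xor.csv and the loading step are assumptions; the original post does not show them.

import numpy as np

# Assumption: the four comma-separated rows above are stored in "xor.csv"
# with the columns input1, input2, output.
raw = np.loadtxt('xor.csv', delimiter=',')
x_train_mat = np.asmatrix(raw[:, 0:2])   # the two binary inputs
y_train_mat = np.asmatrix(raw[:, 2]).T   # the expected XOR output, as a column

test_xor((x_train_mat, y_train_mat))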

Experiment Result