Gökçe Aydos

- apply object-oriented programming principles to implement a neural network
- understand the components of a neural network

income ($) | house-age (years) | … | house-value ($) |
---|---|---|---|
83252 | 41 | … | 452600 |
83014 | 21 | … | 358500 |
… | … | … | … |

. . .

Goal: learn weights \(w_1, w_2, \ldots\) such that:

\[ w_1 \cdot 83252 + w_2 \cdot 41 + ... \approx 452600 \] \[ w_1 \cdot 83014 + w_2 \cdot 21 + ... \approx 358500 \]

This is an excerpt from a real-life dataset from the 1990 US census
that contains ~20000 rows. Each row represents aggregated data about an
area in California. Each column excluding `house-value`
represents a feature. The `house-value` column contains the labels
(collectively called the *label vector*; in case of many labels per row, a
*label matrix*).

❓ Why are \(w_i\)s the same across all rows?

❗ Our goal is to predict the house value(s) from the features. We will use the \(w_i\)s to make predictions from new feature data in the future.

The data can be conveniently fetched using Scikit-learn which also includes the data description. Note that the actual data measures the income in tens of thousands and the house value in hundreds of thousands.
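The fetch could look like this (a sketch; downloads the data on first call and caches it locally):

```python
# Fetch the 1990 California housing data via scikit-learn.
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
print(housing.feature_names)  # 'MedInc', 'HouseAge', ...
print(housing.data.shape)     # ~20000 rows, 8 feature columns
print(housing.target[:2])     # house values in hundreds of thousands of $
```

`housing.DESCR` contains the dataset description mentioned above.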

User:Wiso, Public domain, via Wikimedia Commons, edited

Neural networks make predictions

Goal: Implementation using object-oriented programming (OOP)

Class ideas?

. . .

- `Operation`
- `ParameterOperation`
- `Layer`
- `NeuralNetwork`
- `Optimizer`
- `Trainer`
- …
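A possible sketch of the base of such a hierarchy (the `forward`/`backward` method names and the `WeightMultiply` example are assumptions, not fixed by the slides):

```python
import numpy as np

class Operation:
    """A building block of the network: a forward and a backward pass."""
    def forward(self, input_: np.ndarray) -> np.ndarray:
        self.input_ = input_
        self.output = self._output()
        return self.output

    def backward(self, output_grad: np.ndarray) -> np.ndarray:
        # Receives the gradient of the loss w.r.t. this operation's output
        # and returns the gradient w.r.t. its input.
        return self._input_grad(output_grad)

    def _output(self) -> np.ndarray:
        raise NotImplementedError

    def _input_grad(self, output_grad: np.ndarray) -> np.ndarray:
        raise NotImplementedError


class ParameterOperation(Operation):
    """An Operation that additionally holds a trainable parameter."""
    def __init__(self, param: np.ndarray):
        self.param = param


class WeightMultiply(ParameterOperation):
    """Example subclass: multiply the input with a weight matrix."""
    def _output(self) -> np.ndarray:
        return self.input_ @ self.param

    def _input_grad(self, output_grad: np.ndarray) -> np.ndarray:
        return output_grad @ self.param.T
```

`Layer` would then hold a list of `Operation`s, and `NeuralNetwork` a list of `Layer`s plus a loss.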

- Does it make sense to inherit `Loss` from `Operation`?
- What happens if we use no activation function (a linear activation)? Try with a single layer and with many layers.
- What happens if we take the sum of the errors as a loss function?
- What happens if we don’t standardize the features before we use them for training?
- What happens if we choose a learning rate of 1?

- We could do this; however, we don’t feed the `backward()` method of `Loss` with any value, which differentiates `Loss` from `Operation`. Python does not allow creating two versions of a method (no overloading).
- A neural network with an output layer (no activation, which corresponds to a linear activation) and no hidden layer corresponds to a linear regression. Additional hidden layers with linear activations won’t increase its capabilities over a linear regression, because a composition of linear functions is again linear.
- (1) A negative and a positive error could cancel each other out. (2) The error can also be negative, so the optimizer must aim for 0 instead of *only* decreasing the error (by going in the opposite direction of the gradient). In this case the *absolute value function* helps, which has its minimum at zero.

- A feature with a large value can dominate during optimization.
- The absolute value of the gradients, and thus the loss, may increase after each step.

- `breakpoint()` is useful for debugging while interacting with the program in `ipython`.
- Many bugs arise through NumPy broadcasting, e.g., multiplying a `(3, 1)` array with a `(3,)` array. Assertions help.
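The broadcasting pitfall mentioned above, sketched with a shape assertion (the `multiply_checked` helper is hypothetical):

```python
import numpy as np

a = np.ones((3, 1))  # column "vector", shape (3, 1)
b = np.ones((3,))    # 1-D array, shape (3,)

c = a * b            # silently broadcasts to shape (3, 3)!
print(c.shape)       # (3, 3) -- usually not what was intended

# An early shape assertion turns the silent bug into a loud error:
def multiply_checked(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    assert x.shape == y.shape, f"shape mismatch: {x.shape} vs {y.shape}"
    return x * y
```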

- Code
- Weidman, *Deep Learning from Scratch*, 2019
  - there are occasional mistakes in the book; when in doubt, refer to the errata
  - German version

\(x\) | \(y\) |
---|---|
1 | 2 |
2 | 4 |
3 | 6 |
4 | ? |
-1 | ? |

How did we do that? We noticed: \(y_i = x_i \cdot 2\) and applied this observation to 4 and -1.

Example: You are a child sitting near a saleswoman who sells figs. If she sells one fig, the customer gives 2 coins; if she sells two, the customer gives 4 coins, and so on. You would learn that the number of figs should be multiplied by 2.

\(x\) | \(y\) |
---|---|
1 | 1.99 |
2 | 4.02 |
3 | 5.98 |
4 | ? |
-1 | ? |

This time the numbers are a bit off; however, we can still roughly say \(y_i = x_i \cdot 2\) and accept some error for a simple model.

SHOW: Draw the values on a graph and draw a line.
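A minimal sketch of "fit a line through the values", using least squares via `np.polyfit` (one possible method; the slides do not prescribe it):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.99, 4.02, 5.98])  # the slightly noisy values from the table

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line fit
print(slope, intercept)  # slope close to 2, intercept close to 0

# Predict the unknown rows of the table:
print(slope * 4 + intercept, slope * -1 + intercept)
```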

Using the same principle, machines can *learn* from
observations (by finding patterns) and make *predictions*. This
is called machine learning.

\(x_1\) | \(x_2\) | \(y\) |
---|---|---|
1 | 1 | 1.5 |
2 | 1 | 2.5 |
2 | 2 | 3 |
3 | 1 | ? |
-1 | 2 | ? |

The same idea also works with more input variables. \(y_i = x_{1i} + 0.5 x_{2i}\).

In general: \(y_i = w_1 x_{1i} + w_2 x_{2i}\). In our case \(y_i\) is
a *linear combination* of the \(x_i\)s. We can have many input variables as
well as many predictions.
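With the three known rows of the table, the weights can be recovered with a least-squares solve (a sketch; one possible method):

```python
import numpy as np

# The three known rows of the table: columns x1, x2 and targets y.
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [2.0, 2.0]])
y = np.array([1.5, 2.5, 3.0])

# Least-squares solve for the weights w1, w2:
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # [1.0, 0.5], i.e. y_i = 1*x1_i + 0.5*x2_i

# Predictions for the two unknown rows:
print(np.array([[3.0, 1.0], [-1.0, 2.0]]) @ w)  # [3.5, 0.0]
```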

❓ Assume the machine learned some \(w_i\)s. Are the predictions correct? How do we test?

. . .

❗ We can compare each prediction \(p_i\) (using \(w_i\)s) with the actual house value (\(y_i\)).

❓ How do we test the prediction quality using math?

. . .

❗ For example by using *mean squared error* (MSE).

\[ \frac{(y_1 - p_1)^2 + (y_2 - p_2)^2 + ... + (y_n - p_n)^2}{n} \]

MSE

- is a *loss function*
- measures how *erroneous* the prediction is

Another example is *mean absolute error* (MAE). Compared to
MAE, MSE emphasizes large errors by squaring them, while small errors, i.e.,
\(<1\), get even smaller.
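Both losses in NumPy, using the fig-table values as predictions (a sketch):

```python
import numpy as np

y = np.array([2.0, 4.0, 6.0])     # actual values
p = np.array([1.99, 4.02, 5.98])  # predictions

mse = np.mean((y - p) ** 2)       # mean squared error
mae = np.mean(np.abs(y - p))      # mean absolute error
print(mse, mae)  # all errors here are < 1, so squaring makes MSE < MAE
```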

Goal: minimum loss by training:

- Pick random parameters (weights)
- Make predictions for a batch of inputs
- Compute loss
- Find the parameters (weights) that minimize the loss

❓ How can we find these parameters?

. . .

❗ Taking the partial derivative with respect to each parameter
(the *gradient*) and solving for where it is zero. However, we typically cannot find the exact minimum this way,
because the loss function can get very complex.

❓ What do we do now?

Alternative perspective follows:

❓ You are stuck on a mountain. How would you get down?

❗ Look around, move in the direction of the descending path, repeat.

Approach: Find out how much we should change each parameter so that the loss decreases.

❓ How can we find out how much the loss \(L\) changes if we slightly increase the parameter \(w_1\) at its current value, e.g., \(w_1 = 1\)?

. . .

❗ By computing \(\frac{\partial L}{\partial w_1}(w_1=1)\). If the result is positive, then we
decrease \(w_1\), and vice versa. This
is called *back-propagation*.

Procedure: Pick a random point, move in the direction of the descending path. ❓ Where would these three points end up?

Note that the rightmost point will end up in a local minimum.

Goal: minimum loss by training:

- Pick random parameters (weights)
- Repeat:
- Make predictions for a batch of inputs
- Compute loss
- Back-propagate
- Modify the parameters so that the loss decreases a bit
- Stop if the loss does not decrease significantly or after a timeout
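The loop above as a minimal gradient-descent sketch for a linear model with MSE (learning rate and iteration count are arbitrary choices; data reuses the earlier table):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data from the earlier table; the true weights are w = [1.0, 0.5].
X = np.array([[1.0, 1.0], [2.0, 1.0], [2.0, 2.0]])
y = np.array([1.5, 2.5, 3.0])

w = rng.normal(size=2)                  # pick random parameters
lr = 0.1                                # learning rate (arbitrary choice)

for step in range(500):                 # repeat:
    p = X @ w                           #   predictions for the batch
    loss = np.mean((y - p) ** 2)        #   compute loss (MSE)
    grad = -2 * X.T @ (y - p) / len(y)  #   back-propagate: dL/dw
    w -= lr * grad                      #   move against the gradient
    if loss < 1e-12:                    #   stop when the loss is tiny
        break

print(w)  # approaches [1.0, 0.5]
```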

- encapsulation of related functionality in a single component, which makes it easier for humans to understand the parts of a program
- reusability and extensibility of components