From operations to neural network

Principles of object-oriented programming & neural networks

Gökçe Aydos

Learning goals

apply object-oriented programming principles to implement a neural network
understand the components of a neural network

Fundamentals

income ($)	house-age (years)	…	house-value ($)
83252	41	…	452600
83014	21	…	358500
…	…	…	…

. . .

Goal: learn weights $w_1, w_2, \ldots$ such that:

\[ w_1 \cdot 83252 + w_2 \cdot 41 + \ldots \approx 452600 \]

\[ w_1 \cdot 83014 + w_2 \cdot 21 + \ldots \approx 358500 \]

This is an excerpt from a real-life dataset from the 1990 US census that contains ~20000 rows. Each row represents aggregated data about an area in California. Each column except house-value represents a feature. The house-value column holds the labels (called a label vector; with several labels per row, a label matrix).

❓ Why are $w_i$s the same across all rows?

❗ Our goal is to predict the house value(s) from the features. We will use the learned $w_i$s to make predictions for new feature data in the future, which only works if the same weights apply to every row.
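The weighted sums for all rows can be written as one matrix-vector product. The following minimal sketch uses only the income and house-age columns of the excerpt; the weight values are made up for illustration and are not learned.

```python
import numpy as np

# Two rows of the excerpt: income ($) and house-age (years).
features = np.array([[83252.0, 41.0],
                     [83014.0, 21.0]])

# Hypothetical weights (an assumption for illustration, not learned values).
weights = np.array([5.0, 100.0])

# The same weights are applied to every row: one weighted sum per row.
predictions = features @ weights
print(predictions)
```

Because `weights` does not depend on the row, the same vector can later be applied to rows that were not part of the training data.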

The data can be conveniently fetched using scikit-learn, which also provides the data description. Note that the actual data measures the income in tens of thousands of dollars and the house value in hundreds of thousands of dollars.
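One possible way to fetch the data is sketched below. `load_housing` is a hypothetical helper name; `fetch_california_housing` is the scikit-learn loader, which downloads the data on first use and therefore needs a network connection.

```python
def load_housing():
    """Return features, labels, feature names and the data description.

    Hypothetical helper; the import is local so that defining the
    function does not require scikit-learn to be installed.
    """
    from sklearn.datasets import fetch_california_housing

    housing = fetch_california_housing()
    # housing.data:   feature matrix, one row per area
    # housing.target: label vector (house value in units of $100,000)
    return housing.data, housing.target, housing.feature_names, housing.DESCR
```

Calling `features, labels, names, description = load_housing()` and printing `description` shows the units of each column.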

a neural network with a single hidden layer

User:Wiso, Public domain, via Wikimedia Commons, edited

Neural networks make predictions

Implementation

Goal: Implementation using object-oriented programming (OOP)

Class ideas?

. . .

Operation
ParameterOperation
Layer
NeuralNetwork
Optimizer
Trainer
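A minimal sketch of how the first two classes could look, assuming each operation offers a `forward()` and a `backward()` method. The attribute names and the example operation `WeightMultiply` are assumptions for illustration, not necessarily the names used in the course code.

```python
import numpy as np


class Operation:
    """Base class: one step in the network with a forward and a backward pass."""

    def forward(self, input_: np.ndarray) -> np.ndarray:
        self.input_ = input_
        self.output = self._output()
        return self.output

    def backward(self, output_grad: np.ndarray) -> np.ndarray:
        # A gradient must have the same shape as the array it refers to.
        assert output_grad.shape == self.output.shape
        self.input_grad = self._input_grad(output_grad)
        assert self.input_grad.shape == self.input_.shape
        return self.input_grad

    def _output(self) -> np.ndarray:
        raise NotImplementedError

    def _input_grad(self, output_grad: np.ndarray) -> np.ndarray:
        raise NotImplementedError


class ParameterOperation(Operation):
    """An operation that additionally owns a learnable parameter."""

    def __init__(self, param: np.ndarray):
        self.param = param

    def backward(self, output_grad: np.ndarray) -> np.ndarray:
        input_grad = super().backward(output_grad)
        self.param_grad = self._param_grad(output_grad)
        assert self.param_grad.shape == self.param.shape
        return input_grad

    def _param_grad(self, output_grad: np.ndarray) -> np.ndarray:
        raise NotImplementedError


class WeightMultiply(ParameterOperation):
    """Hypothetical example operation: output = input @ weights."""

    def _output(self) -> np.ndarray:
        return self.input_ @ self.param

    def _input_grad(self, output_grad: np.ndarray) -> np.ndarray:
        return output_grad @ self.param.T

    def _param_grad(self, output_grad: np.ndarray) -> np.ndarray:
        return self.input_.T @ output_grad


op = WeightMultiply(np.array([[1.0], [0.5]]))
out = op.forward(np.array([[1.0, 1.0], [2.0, 1.0]]))
input_grad = op.backward(np.ones_like(out))
```

The base class handles the bookkeeping and shape checks once, so a concrete operation only has to implement the underscore methods. This is one way inheritance can pay off here.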

Code

…

Questions to ponder 🤔

Does it make sense to inherit Loss from Operation?
What happens if we use no activation function (a linear activation)? Try with single and many layers.
What happens if we take the sum of the errors as a loss function?
What happens if we don’t standardize the features before we use them for training?
What happens if we choose a learning rate of 1?

We could do this; however, unlike the backward() method of Operation, the backward() method of Loss is not fed any value, which differentiates Loss from Operation. Python does not support method overloading, so a class cannot offer two versions of the same method.
A neural network with an output layer (no activation, which corresponds to a linear activation) and no hidden layer corresponds to a linear regression. Additional hidden layers with linear activation do not increase its capabilities over a linear regression, because a composition of linear functions is again a linear function (TODO try it).
(1) A negative and a positive error could cancel each other out. (2) The sum of errors can also be negative, so the optimizer must aim for 0 instead of only decreasing the loss (by going in the opposite direction of the gradient). In this case the absolute value function helps, because it has its minimum at zero.
A feature with large values can dominate the optimization.
The absolute value of the gradients, and thus the loss, may increase after each step, because a step can overshoot the minimum.
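The answer about linear activations can be checked numerically. The sketch below assumes layers without bias terms and uses random weights: two stacked linear layers give the same predictions as one layer whose weight matrix is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))    # 5 samples, 3 features
w1 = rng.normal(size=(3, 4))   # hidden layer weights, linear activation
w2 = rng.normal(size=(4, 1))   # output layer weights

two_layers = (x @ w1) @ w2     # hidden layer followed by output layer
combined = w1 @ w2             # one equivalent weight matrix of shape (3, 1)
one_layer = x @ combined

print(np.allclose(two_layers, one_layer))
```

With a non-linear activation between the two multiplications, the weights can no longer be merged this way.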

Takeaways

breakpoint() is useful for debugging while interacting with the program in ipython
many bugs arise through NumPy broadcasting, e.g., element-wise multiplying a (3, 1) array with a (3,) array silently yields a (3, 3) array. Assertions on shapes help.
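A small sketch of the broadcasting pitfall and of an assertion that catches it. `checked_multiply` is a hypothetical helper name.

```python
import numpy as np

predictions = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1)
targets = np.array([1.0, 2.0, 3.0])            # shape (3,)

# Broadcasting silently produces a (3, 3) array instead of 3 products.
product = predictions * targets
print(product.shape)


def checked_multiply(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # The assertion turns the silent shape bug into an immediate failure.
    assert a.shape == b.shape, (a.shape, b.shape)
    return a * b


# Reshaping the targets to a column makes the shapes agree.
fixed = checked_multiply(predictions, targets.reshape(-1, 1))
print(fixed.shape)
```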

References

Code
2019, Weidman, Deep Learning from Scratch
- there are occasional mistakes in the book; when in doubt, refer to the errata
- German version

Appendix

$x$	$y$
1	2
2	4
3	6
4	?
-1	?

How did we do that? We noticed: $y_i = x_i \cdot 2$ and applied this observation to 4 and -1.

Example: You are a child and sit near a saleswoman who sells figs. If she sells one fig, the customer gives 2 coins; if she sells 2, the customer gives 4 coins, etc. You would learn that the number of figs should be multiplied by 2.

$x$	$y$
1	1.99
2	4.02
3	5.98
4	?
-1	?

This time the numbers are a bit off, however we can still roughly say $y_i = x_i \cdot 2$ and accept some error for a simple model.
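The factor can also be computed instead of guessed. The sketch below assumes the simple model $y_i = w \cdot x_i$ without an intercept and picks the $w$ that minimizes the sum of squared errors.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.99, 4.02, 5.98])

# Least-squares slope for the model y = w * x (no intercept):
# minimizing sum((y - w * x)**2) gives w = sum(x * y) / sum(x * x).
w = np.sum(x * y) / np.sum(x * x)

# Apply the learned factor to the two unknown rows of the table.
new_x = np.array([4.0, -1.0])
print(w, w * new_x)
```

The resulting factor is close to, but not exactly, 2, which reflects the small errors in the observations.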

SHOW: Draw the values on a graph and draw a line.

Using the same principle, machines can learn from observations (by finding patterns) and make predictions. This is called machine learning.

$x_1$	$x_2$	$y$
1	1	1.5
2	1	2.5
2	2	3
3	1	?
-1	2	?

The same idea also works with more input variables. $y_i = x_{1i} + 0.5 x_{2i}$.

In general: $y_i = w_1 x_{1i} + w_2 x_{2i}$. In our case $y_i$ is a linear combination of the inputs $x_{1i}$ and $x_{2i}$. We can have many input variables as well as many predictions.
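For the small table with two input variables, the weights can be recovered with a least-squares solver. This is a sketch using NumPy's `np.linalg.lstsq`, not the training procedure of a neural network.

```python
import numpy as np

# Known rows of the table: columns x1 and x2, and the observed y.
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [2.0, 2.0]])
y = np.array([1.5, 2.5, 3.0])

# Find w such that X @ w is as close as possible to y.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict the two rows marked with "?".
X_new = np.array([[3.0, 1.0],
                  [-1.0, 2.0]])
predictions = X_new @ w
print(w, predictions)
```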

❓ Assume the machine learned some $w_i$s. Are the predictions correct? How do we test?

. . .

❗ We can compare each prediction $p_i$ (using $w_i$s) with the actual house value ($y_i$).

❓ How do we test the prediction quality using math?

. . .

❗ For example by using mean squared error (MSE).

\[ \frac{(y_1 - p_1)^2 + (y_2 - p_2)^2 + ... + (y_n - p_n)^2}{n} \]

MSE:
- is a loss function
- measures how erroneous the prediction is

Another example is mean absolute error (MAE). Compared to MAE, MSE emphasizes large errors by squaring them, while small errors, i.e., $<1$, get even smaller.
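A small numeric comparison of the two loss functions. The values are made up, with one small and one large error.

```python
import numpy as np


def mse(y: np.ndarray, p: np.ndarray) -> float:
    """Mean squared error between labels y and predictions p."""
    return float(np.mean((y - p) ** 2))


def mae(y: np.ndarray, p: np.ndarray) -> float:
    """Mean absolute error between labels y and predictions p."""
    return float(np.mean(np.abs(y - p)))


y = np.array([2.0, 4.0, 6.0])
p = np.array([2.5, 4.0, 3.0])   # errors: -0.5, 0 and 3

# Squaring shrinks the error 0.5 to 0.25 and grows the error 3 to 9.
print(mse(y, p), mae(y, p))
```

The single large error contributes 9 of the 9.25 total in the MSE sum, but only 3 of the 3.5 total in the MAE sum.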

Linear vs non-linear activation

Goal: minimum loss by training:

Pick random parameters (weights)
Make predictions for a batch of inputs
Compute loss
Find the parameters (weights) that minimize the loss

❓ How can we find these parameters?

. . .

❗ By taking the partial derivative of the loss with respect to each parameter (the gradient) and looking for the point where it is zero. However, we typically cannot solve for the exact minimum, because the loss function can get very complex.

❓ What do we do now?

Alternative perspective follows:

❓ You are stuck on a mountain. How would you get down?

❗ Look around, move in the direction of the descending path, repeat.

Approach: Find out how much we should change each parameter so that the loss decreases.

❓ How can we find out how much the loss $L$ changes if we e.g., increase the parameter $w_1$ by 1?

. . .

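❗ By computing $\frac{\partial L}{\partial w_1}$ at the current value of $w_1$ (e.g., $w_1 = 1$). The result tells us approximately how much $L$ changes per unit increase of $w_1$. If the result is positive, then we decrease $w_1$, and vice versa. Computing these partial derivatives backwards through the network is called back-propagation.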
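The partial derivative can be checked numerically by nudging $w_1$ a little and observing the loss. The sketch below assumes a linear model with MSE loss on the small two-variable table; the starting weights are arbitrary.

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 1.0], [2.0, 2.0]])
y = np.array([1.5, 2.5, 3.0])


def loss(w: np.ndarray) -> float:
    """MSE of the linear model X @ w."""
    return float(np.mean((y - X @ w) ** 2))


def gradient(w: np.ndarray) -> np.ndarray:
    """Analytic gradient of the MSE: (2 / n) * X^T (X w - y)."""
    return 2.0 / len(y) * X.T @ (X @ w - y)


w = np.array([1.0, 1.0])       # arbitrary current weights (assumption)
eps = 1e-6
step = np.array([eps, 0.0])    # nudge only w_1

# Central difference: change of the loss per unit change of w_1.
numeric = (loss(w + step) - loss(w - step)) / (2 * eps)
analytic = gradient(w)[0]
print(numeric, analytic)
```

Both values are positive here, so decreasing $w_1$ lowers the loss at this point.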

Gradient descent II

Procedure: Pick a random point and move in the direction of the descending path.

❓ Where would these three points end up?

Note that the rightmost point will end up in a local minimum.
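The effect can be reproduced with a one-dimensional example. The curve below is a hypothetical stand-in for the one in the figure, chosen so that it has a global minimum on the left and a local minimum on the right; the three starting points are assumptions as well.

```python
def f(x: float) -> float:
    # Hypothetical loss curve with a global and a local minimum.
    return x**4 - 3 * x**2 + x


def df(x: float) -> float:
    # Derivative of f.
    return 4 * x**3 - 6 * x + 1


def descend(x: float, learning_rate: float = 0.01, steps: int = 1000) -> float:
    # Repeatedly take a small step against the slope.
    for _ in range(steps):
        x -= learning_rate * df(x)
    return x


starts = [-2.0, -0.5, 2.0]
ends = [descend(x) for x in starts]
for start, end in zip(starts, ends):
    print(f"start {start:+.1f} -> x = {end:+.3f}, f(x) = {f(end):+.3f}")
```

The two left starting points reach the same minimum, while the rightmost one stops in a valley with a higher loss.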

Revised algorithm

Goal: minimum loss by training:

Pick random parameters (weights)
Repeat:
- Make predictions for a batch of inputs
- Compute loss
- Back-propagate
- Modify the parameters so that the loss decreases a bit
- Stop if the loss does not decrease significantly or after a timeout
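The revised algorithm can be sketched for the simplest case, a linear model with MSE loss. The assumptions here are that the whole dataset is used as a single batch, that the gradient is computed in closed form instead of being back-propagated through layers, and that the function name, learning rate, tolerance and step limit are made-up values.

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 1.0], [2.0, 2.0]])
y = np.array([1.5, 2.5, 3.0])


def train(X, y, learning_rate=0.1, max_steps=5000, tolerance=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])           # pick random parameters
    previous_loss = np.inf
    for _ in range(max_steps):                # step limit acts as the timeout
        p = X @ w                             # make predictions
        loss = float(np.mean((y - p) ** 2))   # compute loss (MSE)
        if previous_loss - loss < tolerance:
            break                             # no significant decrease
        grad = 2.0 / len(y) * X.T @ (p - y)   # gradient of the loss
        w = w - learning_rate * grad          # small step against the gradient
        previous_loss = loss
    return w, loss


w, final_loss = train(X, y)
print(w, final_loss)
```

On this toy data the weights approach 1 and 0.5. Trying `learning_rate=1.0` should make the loss grow instead of shrink, since each step then overshoots the minimum.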

Implementation example with every component

Why OOP?

encapsulation of features in a single component — more convenient for humans to classify components of a program
reusability of components, extensibility
