# Laboratory Assignment N. 5

*Solve the following assignments, whose completion is required to access the oral examination. Send the assignments all-together (once you have completed all the labs, not only this single one) as a compressed folder including one subfolder for each laboratory (e.g. the name of the subfolder should be lab1 for the first laboratory, then lab2, etc..).*

*The subfolder for this lab should include all the Matlab scripts requested in the assignments below. You can organize the code as you wish, implementing all the helper functions that you need provided that these are included in the subfolder and are appropriately called in the scripts. To check that you have successfully completed the assignments I will only run the requested scripts (no debugging, no function chasing, the scripts should work like a charm when I run them).*

*Bonus track assignments are meant to be for those who finish early, but they are not formally required for completing the Lab Assignment.*

## Restricted Boltzmann Machines

Write down a Matlab-script *rbm.m* implementing a Restricted Boltzmann Machine receiving input from the *data* matrix (samples are on the rows, inputs on the columns) in the *digitRBM.mat* file compressed in this archive.

Once loaded the *digitRBM.mat* file in Matlab you fill find two variables in your workspace

*data*- a 10000×784 matrix containing 10000 graylevel images (28×28 pixels) of handwritten digits in $[0,9]$;*targets*- a 10000×10 matrix associating a class to each image using a one-of-10 encoding, i.e. each sample is a 1×10 vector with all zeros except for the element corresponding to the class of the image that is set to 1. For instance, if the first element of this vector is 1, the corresponding image portrays a 0, if the second element is 1, it portrays a 1, etc…

The graylevel image vectors in *data* are normalized in $[0,1]$: the i-th image in the dataset can be easily visualized by `imshow(reshape(data(i,:),28,28));`

Your assignment is to write a script *rbm.m* that implements training of a binary RBM model using Contrastive Divergence of length 1 ($CD_1$) to estimate the two terms of the weight gradient:
\[
\Delta M_{ij}^{n} = <v_i^n h_j^n>_0 - <v_i^n h_j^n>_1
\]
where $M_{ij}$ is the weight between visible unit $i$ and hidden unit $j$. The number of input units is fixed by the problem: the number of hidden units in your code must be set in variable `nhidden`

which you can set to 10 (why?).

The term $\Delta M_{ij}^{n}$ is the weight update in response to image $n$ from the dataset. To obtain the overall update to the weight you must sum the contribution for all images:
\[
M_{ij}' = M_{ij} + \epsilon \sum_{n} \Delta M_{ij}^{n}
\]
where $\epsilon$ is an appropriately chosen learning rate. Note that you have to update the weights only after having seen all the data (batch update): in other words $M_{ij}'$ is the new weight value obtained after one *epoch* of training (i.e. a full swipe through the data). Similarly $M_{ij}$ is the old weight value which have been used to compute $\Delta M_{ij}^{n}$.

You can initialize the weight matrix with small (normally distributed) random values (use `randn(…)`

). Hidden and visible biases can be initialized to 0.

To implement the training process, iterate for a number of epochs (i.e. full swipes on the data) given by `maxepoch`

(e.g. equal to 10, 20.. experiment a little). At each iteration, compute, for each image $n$

- The stochastic output of each visible unit $v_i^n$ (remember, this is a binary value obtained by confronting the visible input with a randomly drawn sample).
- The hidden unit firing probability $p_j = P(h_j^n = 1 | v_1^n, \dots, v_i^n, \dots)$ and the corresponding stochastic output $h_j^n$ (same idea as for the input).
- The visible unit reconstruction probability $P(v_i^n = 1 | h_1^n, \dots, h_j^n, \dots)$ and the corresponding stochastic output for the reconstruction ($\overline{v}_i^n$).
- The hidden unit firing in response to the reconstruction $ \overline{p}_j = P(h_j^n = 1 | \overline{v}_1^n, \dots, \overline{v}_i^n, \dots)$.

Rember that the operator < > is performing an expectation on the data samples (i.e. averaging if you want) mediated by the hidden state posterior probability. This considered, you might want to try using probability $p_j$ (in place of the stochastic output $h_j$) to compute \[ <v_i h_j>_0 \approx v_i \cdot p_j \] and something similar (guess what) for $<v_i h_j>_1$ (think about what advantages you can expect by such an approach).

For each image compute the mean error between the original input and the reconstructed input. Then use it to compute the total error on the dataset for the current epoch. Once training is completed, plot the total error as a function of the epochs.

Then, plot the $784$ bottom-up final weights $M_{\cdot j}$ for each hidden unit $j$, using one different figure for each hidden unit (hint, try reshaping the weight vector as a matrix and plot it as an image).

Finally, sample one image from each class (number) and perform again contrastive divergence to obtain its reconstruction $\overline{v}_i^n$. Plot both the original and reconstructed images.

## Bonus Track

Try generalizing the script to implement a $CD_k$, the Contrastive Divergence algorithm with $k$ steps. Note that (like in $CD_1$) to compute the weight update you will be using only the initial time and final time statistics, i.e. $<v_i h_j>_0$ and $<v_i h_j>_k$