1. Consider the sigmoid transformation at the output layer of a deep model trained by optimizing: i) mean squared error, ii) negative log-likelihood. Derive the gradients of both loss functions with respect to the input of the sigmoid layer. Which of the two losses is more suitable for a classification task? Why? Provide code for computing the gradients using the numpy library. Assume that the data comes in variable-size minibatches.
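A minimal numpy sketch of the requested gradient computation is given below. It assumes a binary output of shape (N, 1) with both losses averaged over the minibatch; the function names (grad_mse_wrt_logits, grad_nll_wrt_logits) are illustrative only, not a prescribed interface.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse_wrt_logits(z, y):
    """Gradient of the mean squared error w.r.t. the sigmoid-layer input z.
    z, y: arrays of shape (N, 1), N is the (variable) minibatch size.
    L = mean((sigmoid(z) - y)**2)  ->  dL/dz = (2/N) * (s - y) * s * (1 - s)
    """
    N = z.shape[0]
    s = sigmoid(z)
    return 2.0 / N * (s - y) * s * (1.0 - s)

def grad_nll_wrt_logits(z, y):
    """Gradient of the negative log-likelihood (binary cross-entropy)
    w.r.t. the sigmoid-layer input z.
    L = -mean(y*log(s) + (1-y)*log(1-s))  ->  dL/dz = (s - y) / N
    """
    N = z.shape[0]
    s = sigmoid(z)
    return (s - y) / N

# usage with a variable-size minibatch
rng = np.random.default_rng(0)
N = int(rng.integers(1, 8))              # minibatch size is not fixed
z = rng.normal(size=(N, 1))              # inputs to the sigmoid layer
y = rng.integers(0, 2, size=(N, 1))      # binary targets
print(grad_mse_wrt_logits(z, y))
print(grad_nll_wrt_logits(z, y))
```

Note that the NLL gradient has no s * (1 - s) factor, so it does not vanish for saturated but badly classified examples; this is the usual argument for preferring the negative log-likelihood in classification.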
-------------------
2. Consider training multiclass logistic regression by optimizing the cross-entropy loss using stochastic gradient descent with the learning rate set to δ = 1. The training data is:
x1 = (1, 1, 1, 0), y1 = [1, 0]; x2 = (0, 1, 1, 1), y2 = [0, 1].
The weights and bias are initialized to:
W0 = [[0.5, -0.5, -0.5, 0.5], [0.5, -0.5, 0.5, -0.5]], b0 = [0, 0].
(a) Derive the equations for the gradients with respect to all parameters.
(b) Compute the gradients and update the parameters using the data sample (x1, y1), without any kind of regularization.
(c) Compute one learning epoch starting from the parameters W0 and b0, with dropout regularization of the input features with probability p(mu_x_i) = 0.5, i = 1, 2, 3, 4. Assume the following dropout schedule: the first minibatch is (x1, y1) with the first and third inputs switched off; the second minibatch is (x2, y2) with the second and fourth inputs switched off. Compute the gradients and update the parameters.
(d) Compute the outputs of the trained network for the inputs xa = (0, 1, 0, 1) and xb = (1, 0, 1, 0).
-------------------
3. Consider a restricted Boltzmann machine with two units in the visible layer and three units in the hidden layer. Gibbs sampling for a minibatch of three samples v^(1), v^(2) and v^(3) is shown in the picture. Compute the corrections: i) for the weight w13 connecting the units v1 and h3, and ii) for the weight w23 connecting the units v2 and h3. Use the CD-3 algorithm and the learning rate eta = 0.01.
-------------------
4. We chose the Rayleigh distribution for the output layer of a variational autoencoder. The distribution is:
p(x | sigma) = (x / sigma^2) * exp(-x^2 / (2 * sigma^2))
Derive the expression for the second component of the loss function:
E_{q_phi(z | x^(i))}[ log p_theta(x^(i) | z) ]
You can assume that the minibatches are large enough that it suffices to sample the hidden layer only once. Draw a sketch/diagram of the corresponding decoder.
-------------------
5. A teacher called Svjetlana has a big problem: her class is really mischievous. When the children stand in line so the teacher can count them, some of them go back to the end of the line after being counted and the teacher, forgetful as she is, counts them again. Because of that, it sometimes appears that more children are present than the total number of children in the class, which is impossible. Svjetlana read about a new technology called RNNs in the newspaper and decided to try it out to solve her problem. She set the problem up as follows: every index/neuron of the hidden state of the RNN corresponds to one student and holds the information of whether that student has appeared in class that day (1) or not (0). In addition, she wants the RNN to output (1 - number of missing children). So, for instance, if all the children showed up to class, the network would output 1; if one child was missing, it would output 0; if two children were missing, it would output -1, and so on. In Svjetlana's class there are only three children, and each of them has been assigned a one-hot code. The inputs the RNN receives are the one-hot codes of the children.
For simplicity, the activation function at the output is the identity (instead of the softmax), and the dimensionality of the output is 1. The hidden state of the network is initialized to a vector of all zeros. Given that the matrix U is the identity matrix and the matrix W is diagonal, determine the bias vectors b and c, the weight matrices W and V, and the activation function of the hidden layer f.
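Below is a minimal numpy sketch of one parameter choice that satisfies the stated constraints (U = I is given, W is diagonal). It assumes a binary-step hidden activation; the choice W = I, b = 0, V = [1, 1, 1], c = -2 is one consistent answer, not necessarily the unique intended one.

```python
import numpy as np

# One consistent parameter choice (assumption, see lead-in):
# U = I (given), W = I (diagonal), b = 0, V = [1, 1, 1], c = -2,
# f = binary step, so the hidden state stays a 0/1 presence indicator
# even if a child is counted twice.
U = np.eye(3)
W = np.eye(3)                 # diagonal, as required
b = np.zeros(3)
V = np.ones((1, 3))
c = np.array([-2.0])

def f(a):
    # hidden activation: 1 once the student has been seen at least once, else 0
    return (a > 0).astype(float)

def run_day(one_hot_sequence):
    """Feed one day's sequence of one-hot student codes through the RNN
    and return the output: 1 - number of missing children."""
    h = np.zeros(3)                       # hidden state starts at zero
    for x in one_hot_sequence:
        h = f(U @ x + W @ h + b)          # h_t = f(U x_t + W h_{t-1} + b)
    return V @ h + c                      # identity output activation

# Child 2 is counted twice, child 3 never shows up -> one child missing -> output 0
day = [np.array([1.0, 0.0, 0.0]),
       np.array([0.0, 1.0, 0.0]),
       np.array([0.0, 1.0, 0.0])]
print(run_day(day))   # [0.]
```

The step activation clamps each hidden unit at 1 once its child has been counted, so counting the same child twice cannot raise the output above 1 - (number of missing children).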