1. Consider the sigmoid transformation at the output layer of a deep model trained by optimizing: i) mean squared error, ii) negative log-likelihood. Derive the gradients of both loss functions with respect to the input of the sigmoid layer. Which of the two losses is more suitable for a classification task? Why? Provide code for computing the gradients using the numpy library. Assume that the data comes in variable-size minibatches.
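A minimal numpy sketch of the requested gradient computation is given below. It assumes a binary output of shape (N, 1) with both losses averaged over the minibatch; the function names (grad_mse_wrt_logits, grad_nll_wrt_logits) are illustrative only, not a prescribed interface.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse_wrt_logits(z, y):
    """Gradient of the mean squared error w.r.t. the sigmoid-layer input z.
    z, y: arrays of shape (N, 1), N is the (variable) minibatch size.
    L = mean((sigmoid(z) - y)**2)  ->  dL/dz = (2/N) * (s - y) * s * (1 - s)
    """
    N = z.shape[0]
    s = sigmoid(z)
    return 2.0 / N * (s - y) * s * (1.0 - s)

def grad_nll_wrt_logits(z, y):
    """Gradient of the negative log-likelihood (binary cross-entropy)
    w.r.t. the sigmoid-layer input z.
    L = -mean(y*log(s) + (1-y)*log(1-s))  ->  dL/dz = (s - y) / N
    """
    N = z.shape[0]
    s = sigmoid(z)
    return (s - y) / N

# usage with a variable-size minibatch
rng = np.random.default_rng(0)
N = int(rng.integers(1, 8))              # minibatch size is not fixed
z = rng.normal(size=(N, 1))              # inputs to the sigmoid layer
y = rng.integers(0, 2, size=(N, 1))      # binary targets
print(grad_mse_wrt_logits(z, y))
print(grad_nll_wrt_logits(z, y))
```

Note that the NLL gradient has no s * (1 - s) factor, so it does not vanish for saturated but badly classified examples; this is the usual argument for preferring the negative log-likelihood in classification.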
-------------------
2. Consider training multiclass logistic regression by optimizing the cross-entropy loss using stochastic gradient descent with the learning rate set to δ = 1. The training data is:
x1 = (1, 1, 1, 0), y1 = [1, 0]; x2 = (0, 1, 1, 1), y2 = [0, 1].
The weights and bias are initialized to:
W0 = [[0.5, -0.5, -0.5, 0.5], [0.5, -0.5, 0.5, -0.5]], b0 = [0, 0].
(a) Derive the equations for the gradients with respect to all parameters.
(b) Compute the gradients and update the parameters using the data sample (x1, y1), without any kind of regularization.
(c) Compute one learning epoch starting from the parameters W0 and b0, with dropout regularization of the input features with probability p(mu_x_i) = 0.5, i = 1, 2, 3, 4. Assume the following dropout schedule: the first minibatch is (x1, y1) with the first and third inputs switched off; the second minibatch is (x2, y2) with the second and fourth inputs switched off. Compute the gradients and update the parameters.
(d) Compute the outputs of the trained network for the inputs xa = (0, 1, 0, 1) and xb = (1, 0, 1, 0).
-------------------
3. Consider a restricted Boltzmann machine with two units in the visible layer and three units in the hidden layer. Gibbs sampling for a minibatch of three samples v^(1), v^(2) and v^(3) is shown in the picture. Compute the corrections: i) for the weight w13 connecting the units v1 and h3, and ii) for the weight w23 connecting the units v2 and h3. Use the CD-3 algorithm and the learning rate eta = 0.01.
-------------------
4. We chose the Rayleigh distribution for the output layer of a variational autoencoder. The distribution is:
p(x | sigma) = (x / sigma^2) * exp(-x^2 / (2 * sigma^2))
Derive the expression for the second component of the loss function:
E_{q_phi(z | x^(i))}[ log p_theta(x^(i) | z) ]
You can assume that the minibatches are large enough that it suffices to sample the hidden layer only once. Draw a sketch/diagram of the corresponding decoder.
-------------------
5. A teacher called Svjetlana has a big problem: her class is really mischievous. When the children stand in line so the teacher can count them, some of them go back to the end of the line after being counted and the teacher, forgetful as she is, counts them again. Because of that, it sometimes appears that more children are present than the total number of children in the class, which is impossible. Svjetlana read about a new technology called RNNs in the newspaper and decided to try it out to solve her problem. She set the problem up as follows: every index/neuron of the hidden state of the RNN corresponds to one student and holds the information of whether that student has appeared in class that day (1) or not (0). In addition, she wants the RNN to output (1 - number of missing children). So, for instance, if all the children showed up to class, the network would output 1; if one child was missing, it would output 0; if two children were missing, it would output -1, and so on. In Svjetlana's class there are only three children, and each of them has been assigned a one-hot code. The inputs the RNN receives are the one-hot codes of the children.
For simplicity, the activation function at the output is the identity (instead of the softmax), and the dimensionality of the output is 1. The hidden state of the network is initialized to a vector of all zeros. Given that the matrix U is the identity matrix and the matrix W is diagonal, determine the bias vectors b and c, the weight matrices W and V, and the activation function of the hidden layer f.
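Below is a minimal numpy sketch of one parameter choice that satisfies the stated constraints (U = I is given, W is diagonal). It assumes a binary-step hidden activation; the choice W = I, b = 0, V = [1, 1, 1], c = -2 is one consistent answer, not necessarily the unique intended one.

```python
import numpy as np

# One consistent parameter choice (assumption, see lead-in):
# U = I (given), W = I (diagonal), b = 0, V = [1, 1, 1], c = -2,
# f = binary step, so the hidden state stays a 0/1 presence indicator
# even if a child is counted twice.
U = np.eye(3)
W = np.eye(3)                 # diagonal, as required
b = np.zeros(3)
V = np.ones((1, 3))
c = np.array([-2.0])

def f(a):
    # hidden activation: 1 once the student has been seen at least once, else 0
    return (a > 0).astype(float)

def run_day(one_hot_sequence):
    """Feed one day's sequence of one-hot student codes through the RNN
    and return the output: 1 - number of missing children."""
    h = np.zeros(3)                       # hidden state starts at zero
    for x in one_hot_sequence:
        h = f(U @ x + W @ h + b)          # h_t = f(U x_t + W h_{t-1} + b)
    return V @ h + c                      # identity output activation

# Child 2 is counted twice, child 3 never shows up -> one child missing -> output 0
day = [np.array([1.0, 0.0, 0.0]),
       np.array([0.0, 1.0, 0.0]),
       np.array([0.0, 1.0, 0.0])]
print(run_day(day))   # [0.]
```

The step activation clamps each hidden unit at 1 once its child has been counted, so counting the same child twice cannot raise the output above 1 - (number of missing children).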