Deconvolution Layer
Batch Normalization
Q-learning v SARSA
Policy Iteration v Value Iteration
Q Learning
Policy Gradients
Actor Critic methods
Trust Region Methods
Monte Carlo Tree Search
Inverse Reinforcement Learning
One Shot Imitation Learning
Meta learning
A3C
Distributed DL
MAC vs Digital Signatures
MLE and KL Divergence
Lipschitz Continuity
Exposure bias problem
Gini coefficient
Pareto distribution
Deconvolution Layer
- torch.nn.ConvTranspose2d in PyTorch
- the name is misleading; no true (mathematical) deconvolution is performed
- a deconvolution layer maps from a lower to higher dimension, a sort of upsampling
- the transpose of a non-padded convolution is equivalent to convolving a zero-padded input
- zeroes are inserted between the inputs, which makes the kernel move more slowly over them; hence it is also called a fractionally strided convolution
- deconv layers allow the model to use every point in the small image to “paint” a square in the larger one
- deconv layers have uneven overlap in the output, conv layers have overlap in the input
- leads to the problem of checkerboard artifacts
- use a resize-convolution instead of a transposed convolution to avoid checkerboard artifacts (see the sketch below)
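A minimal PyTorch sketch (shapes and channel counts are arbitrary) contrasting a transposed convolution with the resize-convolution alternative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 16, 32, 32)  # (batch, channels, H, W)

# Transposed convolution: stride 2 doubles the spatial size (32 -> 64)
deconv = nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1)
print(deconv(x).shape)  # torch.Size([1, 8, 64, 64])

# Resize-convolution: upsample first, then an ordinary convolution,
# which avoids the uneven-overlap / checkerboard problem
resize_conv = nn.Conv2d(16, 8, kernel_size=3, padding=1)
y = F.interpolate(x, scale_factor=2, mode="nearest")
print(resize_conv(y).shape)  # torch.Size([1, 8, 64, 64])
```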
References
Batch Normalization
- torch.nn.BatchNorm2d in PyTorch
- normalizes the data in each batch to have zero mean and unit variance
- provides some consistency between layers by reducing internal covariate shift
- allows a higher learning rate to be used, reduces the learning time
- after normalizing the input, it is scaled and shifted by an affine (linear) function with learnable parameters gamma and beta
- output of batchnorm = gamma * normalized_input + beta
- having gamma and beta allows the network to choose how much ‘normalization’ it wants for every feature; shift and scale
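A small sketch (shapes chosen arbitrarily) verifying that BatchNorm2d normalizes per channel and then applies the learnable gamma (weight) and beta (bias):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3, 16, 16)   # (batch, channels, H, W)
bn = nn.BatchNorm2d(3)          # gamma = bn.weight, beta = bn.bias

# Manual computation: statistics over (batch, H, W) for each channel
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mean) / torch.sqrt(var + bn.eps)
manual = bn.weight.view(1, -1, 1, 1) * x_hat + bn.bias.view(1, -1, 1, 1)

print(torch.allclose(bn(x), manual, atol=1e-5))  # True
```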
References
Q-learning v SARSA
- SARSA stands for state-action-reward-state-action
- SARSA is on-policy; that is, it sticks to the policy it is learning. Q-learning is off-policy
- SARSA improves the estimate of Q by using transitions from the policy derived from Q
- Q-learning updates the Q estimate using the observed reward and the maximum Q-value \(\max_{a\prime} Q(s\prime, a\prime)\) over actions in the next state
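A sketch of the two tabular updates side by side, assuming Q is a NumPy array indexed by (state, action):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy action in s_next,
    # regardless of which action the behaviour policy will actually take
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action actually chosen by the
    # current (e.g. epsilon-greedy) policy in s_next
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```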
References
Policy Iteration v Value Iteration
- PI: trying to converge the policy to optimal; VI: trying to converge the value function to optimal
- PI: policy evaluation (calculating the value function using \(v(s) \gets \sum_{s\prime} p(s\prime \mid s, \pi (s)) [r(s, \pi (s), s\prime) + \gamma v(s\prime)]\)) + policy improvement; repeat until the policy is stable
- VI: value update (calculating the value function using \(v(s) \gets \max_a \sum_{s\prime} p(s\prime \mid s,a) [r(s,a,s\prime) + \gamma v(s\prime)]\)); a single policy extraction at the end
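A minimal value-iteration sketch, assuming the MDP is given as arrays P[s, a, s'] (transition probabilities) and R[s, a, s'] (rewards):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # q[s, a] = sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * v[s'])
        q = np.einsum("ijk,ijk->ij", P, R + gamma * v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    # optimal values and the greedy policy extracted once at the end
    return v_new, q.argmax(axis=1)
```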
References
Q Learning
- Model free learning: the agent has no idea about the state transition and reward functions; it learns everything from experience by interacting with the environment
- Q-Learning is based on temporal-difference (TD) learning
- The update rule:
\[Q(s_t, a_t) \gets Q(s_t, a_t) + \alpha \left[ r(s_t, a_t) + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]\]
- See notes on Q-Learning v SARSA
- \(\epsilon\)-greedy approach: choose a random action with probability \(\epsilon\), otherwise choose the action with the highest current Q-value estimate; this controls the exploration vs exploitation trade-off
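A tabular Q-learning loop with an epsilon-greedy behaviour policy, written against the Gymnasium-style env.reset()/env.step() API (treat the exact API as an assumption):

```python
import numpy as np

def train_q_learning(env, n_episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy: explore with probability eps, exploit otherwise
            if np.random.rand() < eps:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # TD update towards r + gamma * max_a' Q(s', a')
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```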
References
Policy Gradients
- Run a policy for a while; see what actions led to higher rewards; increase their probability
- Take the gradient of log probability of trajectory, then weight it by the final reward
- Increase the probability of actions that lead to higher reward
- With \(J(\theta)\) as the policy objective function
\(\nabla_{\theta} J(\theta) = \sum_{t \geq 0} r(\tau) \;\nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t)\)
- This suffers from high variance and is a simplistic view; credit assignment problem is hard
- Baseline: whether a reward is better or worse than what you expect to get
- A simple baseline: constant moving average of rewards experienced so far from all trajectories; Vanilla REINFORCE
- Reducing variance further using better baselines -> Actor critic algorithm
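A sketch of REINFORCE with a constant moving-average baseline, assuming log-probabilities and rewards were collected for one trajectory in PyTorch:

```python
import torch

def reinforce_loss(log_probs, rewards, baseline):
    """log_probs: list of log pi(a_t | s_t) tensors; rewards: list of r_t."""
    total_return = sum(rewards)
    # Weight the log-probability of the trajectory by (return - baseline);
    # subtracting a baseline lowers variance without biasing the gradient
    advantage = total_return - baseline
    return -advantage * torch.stack(log_probs).sum()

# baseline kept as a moving average of past returns, e.g. after each episode:
# baseline = 0.9 * baseline + 0.1 * total_return
```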
References
Actor Critic Methods
- Works well with continuous (effectively infinite) state and action spaces
- Requires much less training time than policy gradient methods
- Actor => takes in the environment states and determines the best action to take
- Critic => takes in the environment and the action from the actor and returns a score that represents how good the action is for the state
- Both the actor (policy) and critic (Q function) are different neural networks
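A minimal sketch (layer sizes are placeholders) of the two networks described above: the actor maps a state to a distribution over actions, and the critic scores a state-action pair.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))
    def forward(self, state):
        # distribution over actions for the given state
        return torch.distributions.Categorical(logits=self.net(state))

class Critic(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + n_actions, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, state, action_onehot):
        # scalar score for how good the action is in this state: Q(s, a)
        return self.net(torch.cat([state, action_onehot], dim=-1))
```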
References
Trust Region Methods
- A kind of local policy search algorithm
- ‘local’ because each new policy is constrained to stay close to the previous one
- TRPO uses policy gradients but puts a constraint on how the policies are updated
- Each new policy has to be close to the older one in terms of the KL-divergence
- Since policies are nothing but probability distributions over actions, KL divergence is a natural way to measure the distance between them
- Constrained Policy Optimization (CPO) is another trust-region method; it uses constraints on a cost function to keep the agent's behaviour within limits while maintaining optimal performance
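A small sketch of the quantity that TRPO constrains: the mean KL divergence between the old and new policies over visited states (discrete action case; the logits tensors are assumed inputs).

```python
import torch

def mean_kl(old_logits, new_logits):
    # KL(pi_old || pi_new), averaged over a batch of visited states
    old = torch.distributions.Categorical(logits=old_logits)
    new = torch.distributions.Categorical(logits=new_logits)
    return torch.distributions.kl_divergence(old, new).mean()

# TRPO (roughly): maximise the surrogate objective subject to
#   mean_kl(old_logits, new_logits) <= delta   (the trust-region radius)
```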
References
Monte Carlo Tree Search
- MCTS is based on two ideas:
- the true value of an action may be estimated using random simulations
- these values may be used to efficiently adjust the policy towards a best-first strategy
- The algorithm builds a search tree until a computational budget (time or memory) is exhausted
- The algorithm has four steps, applied per iteration
- Selection: descend from the root node until an expandable, non-terminal node is reached
- Expansion: add one or more child nodes to the tree
- Simulation: simulate the default policy from the new node(s) to produce an outcome
- Backpropagation: the simulation result is ‘backed up’ through the selected nodes
- Selection + Expansion => tree policy; Simulation => default policy
- The backpropagation step informs future tree-policy decisions
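A sketch of the selection step using the common UCT rule (UCB1 applied to trees); the node fields (visits, total_value, children) are assumed names.

```python
import math

def uct_select(node, c=1.41):
    """Pick the child maximising exploitation + exploration."""
    def score(child):
        exploit = child.total_value / child.visits          # average simulation result
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children, key=score)
```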
References
Inverse Reinforcement Learning
- Learning the reward function by observing expert behaviour
- Imitation learning (behaviour cloning and IRL) tries to copy the teacher’s actions
- Learning the reward function can make the system robust to changes in the environment’s transition mechanics
- Learning the reward function is also transferable from one type of agent to another, as it encodes all that is needed to excel in the environment
- Think of IRL as a way to learn an abstraction or latent representation of the target
- Another big motivation for IRL is that it is extremely difficult to manually specify a reward function for an agent, as in a self-driving car
- Instead of simply copying the expert behavior, we can then try to learn the underlying reward function which the expert is trying to optimize
References
One Shot Imitation Learning
- Trying to learn with very limited demonstrations
- The model is trained on demonstrations of many tasks and, at test time, is conditioned on a single demonstration of a new task in order to perform that task
- Generalise the understanding of various tasks
References
Meta Learning
- The agent learns a policy to learn policies
- Given a task and a model, the agent can learn a policy to master that task
- But it may fail if the task is altered
- Meta Learning tries to devise methods to learn policies which can learn policies further and can therefore perform multiple tasks
References
Asynchronous Actor-Critic Agents (A3C)
- Asynchronous Advantage Actor-Critic
- Asynchronous: unlike single-worker algorithms such as DQN, A3C has multiple worker agents interacting with their own copies of the environment, which provides more diverse experience during learning
- Advantage: the policy update is weighted by the advantage (return minus a value baseline), as in policy gradient methods with a baseline
- Actor-Critic: same as Actor Critic
- The workers independently work by learning from the environment and update the global network
References
Distributed DL
- Synchronous Distributed SGD (Centralised)
- Parameter Server
- Gradients are sent to the parameter server that computes the updates
- Workers receive updated models
- Synchronous Distributed SGD (Decentralised)
- All-Reduce the gradients to every worker
- Models on each node are updated with the same average gradients
- Asynchronous Distributed SGD (Centralised)
- Asynchronous parameter updates
- Lag problem
- Workers update when they complete their gradient calculation
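A sketch of the decentralised synchronous variant with torch.distributed: after backward(), all-reduce the gradients so every worker applies the same averaged update (assumes the process group is already initialised).

```python
import torch.distributed as dist

def average_gradients(model):
    """All-reduce each parameter's gradient and divide by the world size."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

# per step on every worker:
# loss.backward(); average_gradients(model); optimizer.step()
```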
References
MAC vs Digital Signatures
- Message Authentication Codes (MAC): detects modification of messages based on a ‘shared key’
- Symmetric key based algorithms for protecting integrity
- Example: HMAC (key-hashed MAC), CBC-MAC / CMAC (block cipher based)
- Digital Signatures: detects modification of messages based on an asymmetric key pair
- Asymmetric keys: public key and private key
- The sender signs with its private key, the receiver can verify the signature with the sender’s public key
- MACs are faster and produce smaller tags; but Digital Signatures provide non-repudiation (if the recipient passes the message and the proof to a third party, can the third party be confident that the message originated from the sender?)
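A quick HMAC example using Python's standard hmac and hashlib modules; the shared key and message are placeholders. A digital signature would instead sign with the sender's private key and verify with the public key.

```python
import hmac
import hashlib

shared_key = b"a-placeholder-shared-secret"
message = b"transfer 100 to alice"

# Sender attaches the tag; the receiver recomputes it with the same key
tag = hmac.new(shared_key, message, hashlib.sha256).hexdigest()
expected = hmac.new(shared_key, message, hashlib.sha256).hexdigest()
print(hmac.compare_digest(tag, expected))  # True -> not modified by anyone without the key
```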
References
MLE and KL Divergence
- Maximum likelihood estimation (MLE):
Given a dataset \(\mathcal{D}\) of size n drawn from a distribution \(P_{\theta} \in \mathcal{P}\), the MLE estimate of \(\theta\) is defined as
\(\hat{\theta} = \arg\max_{\theta} \; \mathcal{L}(\theta; \mathcal{D})\) or
\(\hat{\theta} = \arg\max_{\theta} \; \log \; P(\mathcal{D} \mid \theta)\)
- Equivalently, for iid samples, this can be written as minimizing the negative log-likelihood
\(NLL(\theta) = - \sum^N_{i=1} \log \; p(y_i \mid x_i, \theta)\)
- KL Divergence:
Also called relative entropy; it measures the dissimilarity of two probability distributions
\(\mathcal{KL} (p||q) = \sum^K_{k=1} \; p_k \; \log \; \frac{p_k}{q_k}\)
- Expanding the above gives \(\mathcal{KL}(p||q) = -\mathcal{H}(p) + \mathcal{H}(p,q)\), i.e., the cross-entropy of (p, q) minus the entropy of p
- In the limit of infinite data, minimizing the KL divergence between the empirical distribution and the model is the same as MLE
- In generative models, MLE isn’t suitable as the probability density under the trained model for any actual input is almost always zero
- For more details: blog
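A small numerical check of the connection for the categorical case: maximising the average log-likelihood under the model q is the same as minimising \(\mathcal{KL}(p_{emp} || q)\), since the entropy of the empirical distribution is a constant. The distributions below are made up for illustration.

```python
import numpy as np

p_emp = np.array([0.5, 0.3, 0.2])   # empirical distribution of the data
q = np.array([0.4, 0.4, 0.2])       # model distribution with parameters theta

# average log-likelihood under q of data distributed as p_emp
avg_loglik = np.sum(p_emp * np.log(q))

# KL(p_emp || q) = -H(p_emp) + H(p_emp, q) = -H(p_emp) - avg_loglik
kl = np.sum(p_emp * np.log(p_emp / q))
entropy = -np.sum(p_emp * np.log(p_emp))
print(np.isclose(kl, -entropy - avg_loglik))  # True

# so argmax_q avg_loglik == argmin_q KL(p_emp || q), since H(p_emp) is constant
```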
References
Lipschitz Continuity
- This property is often used in deep learning and in differential equations to rule out ‘funny’ (badly behaved) functions
- Lipschitz continuity is a simple way to bound how fast the function values can change
\(|f(x) - f(y)| \leq K \, |x-y|\) for all \(x, y\) in the domain
- Refer to the wiki page for a more generalized definition
- Notice that the Lipschitz constant \(K\) is a bound on the slope, AKA the derivative, of the function over the specified domain
- This condition also gives a safe way to talk about differentiability: by Rademacher's theorem, a Lipschitz function is differentiable almost everywhere
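A crude numerical sketch: estimate (a lower bound on) the Lipschitz constant of a 1-D function by taking the largest secant slope over sampled pairs.

```python
import numpy as np

def lipschitz_lower_bound(f, xs):
    """max |f(x) - f(y)| / |x - y| over sampled pairs: a lower bound on K."""
    xs = np.asarray(xs, dtype=float)
    fx = f(xs)
    diffs_f = np.abs(fx[:, None] - fx[None, :])
    diffs_x = np.abs(xs[:, None] - xs[None, :])
    mask = diffs_x > 0
    return np.max(diffs_f[mask] / diffs_x[mask])

print(lipschitz_lower_bound(np.sin, np.linspace(-5, 5, 1000)))  # close to 1
```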
References
Exposure bias problem
- Recurrent models are trained to predict the next word given the previous ground truth words as input
- At test time, they are used to generate an entire sequence by predicting one word at a time and feeding the generated word back as input at the next time step
- This is not good because the model was trained on a different distribution of inputs, namely, words drawn from the data distribution, as opposed to words drawn from the model distribution
- The errors made along the way will quickly accumulate
- This is known as exposure bias, which occurs when a model is only exposed to the training data distribution, instead of its own predictions
- This is the discrepancy between training and inference stages
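A rough sketch of the discrepancy (the model, tokens, and shapes are placeholder assumptions): training consumes ground-truth prefixes (teacher forcing), while inference feeds the model's own predictions back in.

```python
import torch

def teacher_forcing_step(model, gt_tokens):
    # Training: inputs are ground-truth words shifted by one position
    logits = model(gt_tokens[:, :-1])   # assumed shape (batch, seq, vocab)
    return torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), gt_tokens[:, 1:].reshape(-1))

def free_running_decode(model, bos_token, max_len=20):
    # Inference: each predicted word is fed back as the next input,
    # so errors compound on a distribution never seen during training
    tokens = [bos_token]
    for _ in range(max_len):
        logits = model(torch.tensor([tokens]))
        tokens.append(int(logits[0, -1].argmax()))
    return tokens
```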
References
Gini Coefficient
- Gini coefficient is a single number aimed at measuring the degree of inequality in a distribution.
- Given a group of people producing posts/comments, this can be used to estimate the dispersion in content production, i.e., whether most posts/comments come from a select few or from a diverse set of users.
- A Gini coefficient of 0 means perfect equality, and 1 means perfect concentration in a single individual.
\[G = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}|x_i-x_j|}{2n^2\bar{x}}\]
- where n is the number of participating members, \(x_i\) is the content produced (or wealth) by member i, and \(\bar{x}\) is the mean.
- Alternatively, Gini coefficient can be thought of as the ratio of the area that lies between the line of equality and the Lorenz curve over the total area under the line of equality.
- A point on the Lorenz curve gives the proportion of overall income or wealth held by the bottom x% of the people [economics]. See the income distribution graph on the Lorenz curve wiki page.
- Palma ratio is another measure of inequality.
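A direct implementation of the formula above (O(n^2), fine for small groups):

```python
import numpy as np

def gini(x):
    """Gini coefficient of a 1-D array of wealth / content counts."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    total_abs_diff = np.abs(x[:, None] - x[None, :]).sum()
    return total_abs_diff / (2 * n**2 * x.mean())

print(gini([1, 1, 1, 1]))    # 0.0  -> perfect equality
print(gini([0, 0, 0, 100]))  # 0.75 -> approaches 1 as concentration grows
```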
Pareto distribution
- Pareto optimality is a situation that cannot be modified so as to make any one individual or preference criterion better off without making at least one other individual or preference criterion worse off.
- Write down the value-model equations and constraints. Define the objective function. Run a convex-hull optimizer or a simple grid search to get a set of candidate solutions to the equations. Use a tie-breaker (a way to decide on the trade-off, either objectively coded or using product sense) to choose amongst the solutions.
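A small sketch of the grid-search variant: evaluate candidates on two objectives and keep the non-dominated (Pareto-optimal) set, then apply the tie-breaker to that set. The candidate values are made up.

```python
def pareto_front(points):
    """points: list of (obj1, obj2) to maximise; keep the non-dominated ones."""
    def dominates(q, p):
        # q is at least as good on both objectives and strictly better on one
        return q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
    return [p for p in points if not any(dominates(q, p) for q in points)]

candidates = [(1, 5), (2, 4), (3, 3), (2, 2), (4, 1)]
print(pareto_front(candidates))  # [(1, 5), (2, 4), (3, 3), (4, 1)]
```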