# Autodiff

We use the autodiff algorithm to generate the *backprop* computations for derivatives. Autodiff is based on symbolic differentiation via the chain rule.

## Computing derivatives in op-graph

Each `Op` node in our graph implements the `generate_adjoints` method, which defines the local gradient for that operation and propagates the deltas to its arguments. To compute the derivatives, we first perform a topological sort on the graph, then traverse the graph in order, calling each node's `generate_adjoints` method to add the required backprop computations to the graph.
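The traversal can be sketched as follows. This is an illustrative sketch, not the real ngraph API: `Node`, `topological_sort`, and `backprop_order` are hypothetical names, and the sketch processes nodes in reverse topological order so that each node's adjoint is complete before its deltas are propagated.

```python
# A minimal sketch of the traversal order, using a hypothetical Node class
# (not the real Op API): topological sort lists each node after its
# arguments, and backprop walks that order in reverse.

class Node:
    def __init__(self, name, *args):
        self.name = name
        self.args = args

def topological_sort(root):
    """List every node reachable from `root` after all of its arguments."""
    order, seen = [], set()

    def visit(node):
        if node not in seen:
            seen.add(node)
            for arg in node.args:
                visit(arg)
            order.append(node)

    visit(root)
    return order

# q uses p and x, and p uses x and y; walking the sorted list in reverse
# visits q before p, and p before x and y, so every node's adjoint is
# fully accumulated before the node propagates it onward.
x, y = Node("x"), Node("y")
p = Node("p", x, y)
q = Node("q", p, x)
backprop_order = [n.name for n in reversed(topological_sort(q))]
```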

The `generate_adjoints` method accepts `adjoints`, `delta`, and the arguments of the op. The `adjoints` argument contains the partially computed backprop `Ops`, while `delta` is the complete adjoint of the `Op`.

To implement `generate_adjoints` for an `Op` computing the function

\[y = f(x_1, x_2, \ldots, x_n),\]

write out the differential

\[dy = a_1\,dx_1 + a_2\,dx_2 + \cdots + a_n\,dx_n.\]

Then:

```
def generate_adjoints(self, adjoints, delta, x1, x2, ..., xn):
    x1.generate_add_delta(adjoints, a1 * delta)
    x2.generate_add_delta(adjoints, a2 * delta)
    ...
    xn.generate_add_delta(adjoints, an * delta)
```

For example, multiplication \(f(x, y) = xy\) has \(df = y\,dx + x\,dy\). So:

```
def generate_adjoints(self, adjoints, delta, x, y):
    x.generate_add_delta(adjoints, y * delta)
    y.generate_add_delta(adjoints, x * delta)
```
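As a sanity check, the multiplication rule can be exercised end to end with a toy numeric version of the scheme. This is a hedged sketch: `generate_adjoints` and `generate_add_delta` follow the names in the text, but everything else is illustrative rather than the real ngraph API, and the adjoints here hold numbers rather than backprop `Ops` for clarity.

```python
# Toy numeric backprop: Op/Var/Mul are illustrative stand-ins for the
# real graph classes; adjoints map each op to a number, not to new Ops.

class Op:
    def __init__(self, *args):
        self.args = args

    def generate_add_delta(self, adjoints, delta):
        # "increment": accumulate one contribution to this op's adjoint.
        adjoints[self] = adjoints.get(self, 0.0) + delta

class Var(Op):
    def __init__(self, value):
        super().__init__()
        self.value = value

    def generate_adjoints(self, adjoints, delta):
        pass  # independent variable: nothing to propagate further

class Mul(Op):
    def __init__(self, x, y):
        super().__init__(x, y)
        self.value = x.value * y.value

    def generate_adjoints(self, adjoints, delta, x, y):
        # d(xy) = y dx + x dy
        x.generate_add_delta(adjoints, y.value * delta)
        y.generate_add_delta(adjoints, x.value * delta)

def topological_sort(root):
    order, seen = [], set()

    def visit(node):
        if node not in seen:
            seen.add(node)
            for arg in node.args:
                visit(arg)
            order.append(node)

    visit(root)
    return order

def backprop(y):
    # Seed dy/dy = 1, then visit ops in reverse topological order so each
    # op's adjoint is complete before it propagates deltas to its arguments.
    adjoints = {y: 1.0}
    for op in reversed(topological_sort(y)):
        op.generate_adjoints(adjoints, adjoints.get(op, 0.0), *op.args)
    return adjoints
```

For `z = Mul(x, y)`, `backprop(z)` yields the adjoints `y.value` for `x` and `x.value` for `y`, matching the rule above; when a variable is used more than once, its contributions accumulate.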

## Technical details

Although we write computations in a program as a series of expressions, the computations are converted into a series of steps that are each a function that produces a value from previously computed values. We will use the notation \(t_{mj}\) for a value that is computed from \(\{t_{ij} \mid i<m\}\). We start with the following independent variables:

\[t_{00}, t_{01}, \ldots, t_{0n_0}.\]

Then we apply functions to these variables to obtain:

\[t_{1i} = f_{1i}(t_{00}, t_{01}, \ldots, t_{0n_0}).\]

It is not necessary for each function to use all of the values, but it must use only previous values.

From the original values and the newly computed values we can compute new values:

\[t_{2i} = f_{2i}(t_{00}, \ldots, t_{0n_0}, t_{10}, \ldots, t_{1n_1}).\]

We proceed until we finally have \(y=t_{n0}\).

For each computation, we have:

\[dt_{mi} = \sum_{j<m,\,k} D_{jk}f_{mi}\,dt_{jk},\]

where \(D_{jk}f_{mi}\) is the derivative of \(f_{mi}\) with respect to argument \(jk\).

If we continue expanding the \(dt_{mi}\) expressions we will eventually have an expression of the form:

\[dy = \sum_{j} a_{0j}\,dt_{0j}.\]

Since the layer 0 values are independent, each \(dt_{0j}\) can vary freely. Therefore:

\[\frac{\partial y}{\partial t_{0j}} = a_{0j}.\]

If we expand the computation level by level we find that at level \(m\) we have:

\[dy = \sum_{j} a_{mj}\,dt_{mj} + \sum_{i<m,\,j} a_{ij}\,dt_{ij},\]

where \(a_{mj}\) is called the *adjoint* of \(dt_{mj}\).

If we expand the \(dt_{mj}\) terms using the chain rule expression above, we will be left with only \(dt_{ij}\) terms for \(i<m\). During the expansion, we can push the \(a_{mj}\) adjoints down to the next level.

### Example

To compute \(y = ax^2+bx+c\) we have:

\[t_{00} = a, \quad t_{01} = b, \quad t_{02} = c, \quad t_{03} = x,\]
\[t_{10} = t_{00}t_{03} = ax, \quad t_{11} = t_{01}t_{03} = bx,\]
\[t_{20} = t_{10}t_{03} = ax^2, \quad t_{21} = t_{11} + t_{02} = bx + c,\]
\[t_{30} = t_{20} + t_{21} = y.\]

The derivatives of these computations are:

\[dt_{10} = t_{03}\,dt_{00} + t_{00}\,dt_{03} = x\,da + a\,dx,\]
\[dt_{11} = t_{03}\,dt_{01} + t_{01}\,dt_{03} = x\,db + b\,dx,\]
\[dt_{20} = t_{03}\,dt_{10} + t_{10}\,dt_{03} = x\,dt_{10} + ax\,dx,\]
\[dt_{21} = dt_{11} + dt_{02} = dt_{11} + dc,\]
\[dt_{30} = dt_{20} + dt_{21}.\]

Now we start expanding:

\[dy = dt_{30} = dt_{20} + dt_{21}.\]

In the expansion, we pushed the adjoint of 1 on \(dt_{30}\) down to the terms in the expansion.

We then expand the \(dt_{2i}\) terms to get:

\[dy = x\,dt_{10} + ax\,dx + dt_{11} + dc.\]

And finally, we expand the first level terms to get this:

\[dy = x(x\,da + a\,dx) + ax\,dx + (x\,db + b\,dx) + dc = x^2\,da + x\,db + dc + (2ax+b)\,dx.\]
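As a quick numeric check, the same level-by-level push-down can be carried out on concrete values. The particular staging of \(y = ax^2+bx+c\) into intermediate values below is one illustrative choice; the variable names follow the \(t_{mj}\) notation.

```python
# Numeric check of the expansion, using one possible staging of
# y = a x^2 + b x + c into levels.
a, b, c, x = 2.0, 3.0, 5.0, 7.0

# Forward pass, level by level.
t10 = a * x        # level 1
t11 = b * x
t20 = t10 * x      # level 2: a x^2
t21 = t11 + c      # level 2: b x + c
t30 = t20 + t21    # y

# Reverse pass: push adjoints down one level at a time.
adj = {"t30": 1.0}               # adjoint of 1 on dt30
adj["t20"] = adj["t30"]          # dt30 = dt20 + dt21
adj["t21"] = adj["t30"]
adj["t10"] = x * adj["t20"]      # dt20 = x dt10 + t10 dx
adj["x"]   = t10 * adj["t20"]
adj["t11"] = adj["t21"]          # dt21 = dt11 + dc
adj["c"]   = adj["t21"]
adj["a"]   = x * adj["t10"]      # dt10 = x da + a dx
adj["x"]  += a * adj["t10"]
adj["b"]   = x * adj["t11"]      # dt11 = x db + b dx
adj["x"]  += b * adj["t11"]
```

With \(a=2\), \(b=3\), \(x=7\), the accumulated adjoints come out to \(x^2 = 49\) for \(a\), \(x = 7\) for \(b\), \(1\) for \(c\), and \(2ax+b = 31\) for \(x\), matching the final expansion.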

### The Algorithm

Every intermediate value in the computation supports three adjoint methods: initialize, increment, and finalize. The *initialize* step is performed when the intermediate value is computed, the *increment* method is called when a node that uses the value sends a contribution to the adjoint, and *finalize* is called when there will be no more contributions to the adjoint and processing at its level is complete.

There are two ways to approach implementing the three methods.

- The *initialize* and *finalize* methods do nothing, while the *increment* method propagates each contribution directly to the increment methods at lower levels.
- We associate an adjoint array of the same kind as the value. *Initialize* initializes the adjoint to 0 (possibly also allocating it), *increment* increments the adjoint, and *finalize* propagates the appropriate values to the increment methods for lower-level adjoints, possibly also freeing the adjoint storage.

For values at level 0 that we want derivatives for, we use the second approach. The remaining values at level 0 use the first approach, which ignores the updates. At higher levels, the approach depends on the computation and how many computations use the value. If the update is simple, or if the value is only used once, the first approach should be used. If it is cheaper to accumulate the adjoint and process it all at once, the second approach is used.
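The two approaches can be sketched as follows. The class names are hypothetical, and `push_down` stands in for whatever sends a delta to the increment methods at the next level.

```python
# Illustrative sketches of the two adjoint strategies described above.

class PropagateAdjoint:
    """First approach: initialize/finalize do nothing; every contribution
    is forwarded immediately to the lower-level adjoints."""

    def __init__(self, push_down):
        self.push_down = push_down  # sends a delta down one level

    def initialize(self):
        pass

    def increment(self, delta):
        self.push_down(delta)  # propagate each contribution at once

    def finalize(self):
        pass

class AccumulateAdjoint:
    """Second approach: accumulate a stored adjoint, then propagate the
    total once when processing at this level is complete."""

    def __init__(self, push_down):
        self.push_down = push_down
        self.value = None

    def initialize(self):
        self.value = 0.0  # allocate/zero the adjoint storage

    def increment(self, delta):
        self.value += delta

    def finalize(self):
        self.push_down(self.value)  # single propagation of the total
```

With two contributions, the first approach calls `push_down` twice (once per contribution), while the second calls it once with the accumulated sum.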

For example, if we have a computation \(t_m = t_a t_b\) then, since \(dt_m = t_b\,dt_a + t_a\,dt_b\), we perform:

\[\overline{t_a} \mathrel{+}= t_b\,\overline{t_m}, \qquad \overline{t_b} \mathrel{+}= t_a\,\overline{t_m},\]

where we use \(\overline{t}\) to denote the adjoint we are accumulating for \(t\).

We use the second method so that we only need to perform the multiplication once. Compare this with \(t_m=t_a+t_b\) with derivative \(dt_a+dt_b\). If there are two uses of the value, using the second approach requires allocating and initializing an array for the adjoint (we could have the first update perform the initialization), followed by one addition to the adjoint, and then two additions as the adjoint is passed to the next level. The first approach requires four additions to the adjoints at the next level, but no additional storage.
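A small operation count makes the trade-off concrete for \(t_m = t_a t_b\) when \(t_m\) has two uses (the bookkeeping below is illustrative):

```python
# Count multiplications for t_m = t_a * t_b when t_m is used twice,
# comparing immediate propagation against accumulate-then-propagate.
t_a, t_b = 2.0, 3.0
deltas = [0.5, 0.25]  # two contributions to the adjoint of t_m

# First approach: multiply each contribution through immediately.
adj_a = adj_b = 0.0
first_mults = 0
for d in deltas:
    adj_a += t_b * d
    adj_b += t_a * d
    first_mults += 2  # one multiplication per argument, per contribution

# Second approach: accumulate the adjoint of t_m, then multiply once
# per argument when finalizing.
acc = sum(deltas)
adj_a2 = t_b * acc
adj_b2 = t_a * acc
second_mults = 2
```

Both approaches produce the same adjoints, but the second performs half as many multiplications here, which is why multiplication favors accumulation while simple additions favor direct propagation.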