# Autodiff¶

We use the autodiff algorithm to generate the *backprop* computations for derivatives. Autodiff is based on symbolic differentiation via the chain rule.

## Computing Derivatives in Op-Graph¶

Each `Op` node in our graph implements the `generate_adjoints` method, which defines the local gradient for that operation and propagates the deltas to its arguments. To compute the derivatives, we first perform a topological sort on the graph, then traverse the graph in order, calling each node's `generate_adjoints` method to add the required backprop computations to the graph.
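
The sort-and-traverse scheme can be sketched as follows. This is a hypothetical stand-in, not the op-graph implementation; the `Op` class here carries only the pieces needed for the traversal:

```python
# Invented sketch of the backprop pass: topologically sort the graph,
# then walk it in reverse, handing each node its completed adjoint.

class Op:
    def __init__(self, *args):
        self.args = args

    def generate_add_delta(self, adjoints, delta):
        # Accumulate a contribution to this op's adjoint.
        adjoints[self] = adjoints.get(self, 0.0) + delta

    def generate_adjoints(self, adjoints, delta, *args):
        pass  # a leaf has no arguments to propagate deltas to

def topological_sort(output):
    """List the ops reachable from `output`, arguments before users."""
    order, seen = [], set()

    def visit(op):
        if op not in seen:
            seen.add(op)
            for arg in op.args:
                visit(arg)
            order.append(op)

    visit(output)
    return order

def backprop(output):
    """Visit ops in reverse topological order, letting each one add its
    backprop computations via generate_adjoints."""
    adjoints = {output: 1.0}  # d(output)/d(output) = 1
    for op in reversed(topological_sort(output)):
        if op in adjoints:  # skip ops that received no contributions
            op.generate_adjoints(adjoints, adjoints[op], *op.args)
    return adjoints
```

Subclasses would override `generate_adjoints` to push `delta` to their arguments, as shown below for multiplication.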

The `generate_adjoints` method accepts `adjoints`, `delta`, and the arguments of the op. The `adjoints` argument contains the partially computed backprop `Ops`, while `delta` is the complete adjoint of the `Op`.

To implement `generate_adjoints` for an `Op` for the function

\[f(x_1, x_2, \ldots, x_n),\]

write out

\[df = a_1\,dx_1 + a_2\,dx_2 + \cdots + a_n\,dx_n.\]

Then:

```
def generate_adjoints(adjoints, delta, x1, x2, ..., xn):
    x1.generate_add_delta(adjoints, a1 * delta)
    x2.generate_add_delta(adjoints, a2 * delta)
    ...
    xn.generate_add_delta(adjoints, an * delta)
```

For example, if the `Op` computes

\[f(x, y) = xy,\]

then

\[df = y\,dx + x\,dy.\]

So:

```
def generate_adjoints(self, adjoints, delta, x, y):
    x.generate_add_delta(adjoints, y * delta)
    y.generate_add_delta(adjoints, x * delta)
```
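
A self-contained version of this example can be run directly. The `Var` and `Mul` classes below are invented for illustration and only mimic the op-graph interface:

```python
# Minimal stand-ins for op-graph classes, invented for this sketch.

class Var:
    """A leaf: an independent variable holding a concrete value."""
    def __init__(self, value):
        self.value = value

    def generate_add_delta(self, adjoints, delta):
        # Accumulate a contribution to this variable's adjoint.
        adjoints[self] = adjoints.get(self, 0.0) + delta

class Mul:
    """The product f(x, y) = x * y from the example above."""
    def __init__(self, x, y):
        self.x, self.y = x, y
        self.value = x.value * y.value

    def generate_adjoints(self, adjoints, delta, x, y):
        # df = y dx + x dy: each argument gets its local gradient
        # scaled by the incoming delta.
        x.generate_add_delta(adjoints, y.value * delta)
        y.generate_add_delta(adjoints, x.value * delta)

x, y = Var(3.0), Var(4.0)
f = Mul(x, y)
adjoints = {f: 1.0}  # df/df = 1
f.generate_adjoints(adjoints, adjoints[f], x, y)
print(adjoints[x], adjoints[y])  # 4.0 3.0, i.e. (y, x)
```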

## Technical Details¶

Although we write a computation in a program as a series of expressions, they are converted into a series of steps, each a function producing a value from previously computed values. We will use the notation \(t_{mj}\) for a value which is computed from \(\{t_{ij} \mid i<m\}\). We start with the independent variables,

\[t_{00}, t_{01}, \ldots, t_{0k_0}.\]

Then we apply functions to these variables to obtain

\[t_{1j} = f_{1j}(t_{00}, t_{01}, \ldots, t_{0k_0}).\]

It is not necessary for each function to use all of the values, but it must only use previous values.

From the original values and the newly computed values we compute new values,

\[t_{2j} = f_{2j}(t_{00}, \ldots, t_{0k_0}, t_{10}, \ldots, t_{1k_1}).\]

We proceed until we finally have \(y=t_{n0}\).

For each computation, we have

\[dt_{mi} = \sum_{j<m} \sum_{k} D_{jk}f_{mi}\,dt_{jk},\]

where \(D_{jk}f_{mi}\) is the derivative of \(f_{mi}\) with respect to argument \(jk\).

If we continue expanding the \(dt_{mi}\) expressions we will eventually have an expression that can be written as

\[dy = \sum_{j} a_{0j}\,dt_{0j}.\]

Since the layer 0 values are independent,

\[\frac{\partial t_{0j}}{\partial t_{0k}} = \begin{cases} 1 & \text{if } j = k, \\ 0 & \text{otherwise,} \end{cases}\]

so

\[\frac{dy}{dt_{0j}} = a_{0j}.\]

If we expand the computation level by level we find that at level \(m\) we have

\[dy = \sum_{j} a_{mj}\,dt_{mj},\]

where \(a_{mj}\) is called the *adjoint* of \(dt_{mj}\).

If we expand the \(dt_{mj}\) terms, we will be left with only \(dt_{ij}\) terms for \(i<m\). During the expansion, we can push the \(a_{mj}\) adjoints down to the next level.

### Example¶

To compute \(y = ax^2+bx+c\), with the level 0 values \(a\), \(b\), \(c\), and \(x\), we have:

\[\begin{aligned}
t_{10} &= x \cdot x, & t_{11} &= b \cdot x,\\
t_{20} &= a \cdot t_{10}, & t_{21} &= t_{11} + c,\\
y = t_{30} &= t_{20} + t_{21}.
\end{aligned}\]

The derivatives of these computations are:

\[\begin{aligned}
dt_{10} &= x\,dx + x\,dx = 2x\,dx,\\
dt_{11} &= x\,db + b\,dx,\\
dt_{20} &= t_{10}\,da + a\,dt_{10},\\
dt_{21} &= dt_{11} + dc,\\
dt_{30} &= dt_{20} + dt_{21}.
\end{aligned}\]

Now we start expanding:

\[dy = dt_{30} = 1 \cdot dt_{20} + 1 \cdot dt_{21}.\]

In the expansion, we pushed the adjoint of 1 on \(dt_{30}\) down to the terms in the expansion.

We then expand the \(dt_{2j}\) terms to get:

\[dy = t_{10}\,da + a\,dt_{10} + dt_{11} + dc.\]

Finally, we expand the first level terms to get

\[dy = t_{10}\,da + 2ax\,dx + x\,db + b\,dx + dc = x^2\,da + x\,db + dc + (2ax + b)\,dx.\]
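
Since \(y = ax^2 + bx + c\), the adjoints read off from this expansion are \(dy/da = x^2\), \(dy/db = x\), \(dy/dc = 1\), and \(dy/dx = 2ax + b\). As a quick sanity check (not part of the op-graph), they can be compared against finite differences:

```python
# Check the symbolic adjoints against finite-difference estimates.

def y(a, b, c, x):
    return a * x**2 + b * x + c

point = {"a": 2.0, "b": 3.0, "c": 5.0, "x": 7.0}
a, b, c, x = point["a"], point["b"], point["c"], point["x"]
analytic = {"a": x**2, "b": x, "c": 1.0, "x": 2 * a * x + b}

eps = 1e-6
for name in point:
    bumped = dict(point)
    bumped[name] += eps
    numeric = (y(**bumped) - y(**point)) / eps
    assert abs(numeric - analytic[name]) < 1e-3, name
```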

### The Algorithm¶

Every intermediate value in the computation supports three adjoint methods: initialize, increment, and finalize. Initialize is performed when the intermediate value is computed, increment is called when a node that uses the value sends a contribution to the adjoint, and finalize is called when there will be no more contributions to the adjoint, i.e. when processing at its level is complete.

There are two ways to implement the three methods:

1. The initialize and finalize methods do nothing, while the increment method propagates directly to increment methods at lower levels.
2. We associate with the value an adjoint array of the same kind as the value. Initialize sets the adjoint to 0 (possibly also allocating it), increment increments the adjoint, and finalize propagates the appropriate values to increment methods for lower-level adjoints, possibly freeing the adjoint storage.

For values at level 0 that we want derivatives for we use the second approach, and the remaining values at level 0 use the first approach, which ignores the updates. At higher levels, the approach depends on the computation and how many computations use the value. If the update is simple, or if the value is only used once, the first approach should be used, while if it is cheaper to accumulate the adjoint and process it all at once, the second approach is used.
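
A sketch of the second approach, with invented names (the real bookkeeping for deciding when contributions are complete will differ):

```python
class Accumulated:
    """Adjoint storage for a value with a known number of users."""

    def __init__(self, users):
        self.users = users    # downstream ops that will contribute
        self.adjoint = None
        self.pending = 0

    def initialize(self):
        # Called when the value itself is computed; allocate and zero.
        self.adjoint = 0.0
        self.pending = self.users

    def increment(self, contribution):
        # Called once per user; finalize after the last contribution.
        self.adjoint += contribution
        self.pending -= 1
        if self.pending == 0:
            self.finalize()

    def finalize(self):
        # Here we would propagate self.adjoint to the increment
        # methods of lower-level adjoints and possibly free the
        # adjoint storage; omitted in this sketch.
        pass
```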

For example, if we have a computation \(t_m = t_a t_b\) then, since \(dt_m = t_b\,dt_a + t_a\,dt_b\), we perform

\[\overline{t_a} \mathrel{+}= t_b\,\overline{t_m}, \qquad \overline{t_b} \mathrel{+}= t_a\,\overline{t_m},\]

where we use \(\overline{t}\) to denote the adjoint we are accumulating for \(t\).

We use approach 2 here so that each multiplication is performed only once. Compare this with \(t_m = t_a + t_b\), whose derivative is \(dt_m = dt_a + dt_b\). If there are two uses of the value, approach 2 requires allocating and initializing an array for the adjoint (though the first update could perform the initialization), followed by one addition to the adjoint and then two additions as the adjoint is passed to the next level, while approach 1 requires four additions to the adjoints at the next level but no additional storage.
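
The operation counts in this comparison can be tallied explicitly for \(t_m = t_a + t_b\) when two downstream ops contribute deltas \(d_1\) and \(d_2\); the functions below are purely illustrative:

```python
def approach_1(d1, d2, adj_a, adj_b):
    """Propagate each contribution straight through to the next
    level: four additions, no extra storage."""
    adds = 0
    for d in (d1, d2):
        adj_a += d; adds += 1
        adj_b += d; adds += 1
    return adj_a, adj_b, adds

def approach_2(d1, d2, adj_a, adj_b):
    """Accumulate an adjoint for t_m first: the first contribution
    doubles as initialization, one addition folds in the second,
    and two more pass the total on: three additions plus storage."""
    adds = 0
    adj_m = d1
    adj_m += d2; adds += 1
    adj_a += adj_m; adds += 1
    adj_b += adj_m; adds += 1
    return adj_a, adj_b, adds

print(approach_1(1.0, 2.0, 0.0, 0.0))  # (3.0, 3.0, 4)
print(approach_2(1.0, 2.0, 0.0, 0.0))  # (3.0, 3.0, 3)
```

Both approaches produce the same adjoints; the choice is purely a time/space trade-off.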