Logistic Regression

This example performs logistic regression. The corresponding jupyter notebook is found here. We want to classify an observation \(x\) into one of two classes, denoted by \(y=0\) and \(y=1\). Using a simple linear model:

\[\hat{y}=\sigma(Wx)\]

we want to find the optimal values for \(W\). Here, we use gradient descent with a learning rate of \(\alpha\) and the cross-entropy as the error function.

Axes

The nervana graph uses Axes to attach shape information to tensors. The identity of Axis objects are used to pair and specify dimensions in symbolic expressions. The function ng.make_axis will create an Axis object with an optionally supplied name argument. For example:

import ngraph as ng
import ngraph.transformers as ngt

N = ng.make_axis(length=128, name='N')
C = ng.make_axis(length=4)

We add batch as a property to N to indicate that the axis is a batch axis. A batch axis is held out of the default set of axes reduced in reduction operations such as sums.

Building the graph

Our model has three placeholders: X, Y, and alpha, each of which need to have axes defined. alpha is a scalar, so we pass in empty axes:

alpha = ng.placeholder(axes=())

X and Y are tensors for the input and output data, respectively. Our convention is to use the last axis for samples. The placeholders can be specified as:

X = ng.placeholder(axes=[C, N])
Y = ng.placeholder(axes=[N])

We also need to specify the training weights, W. Unlike a placeholder, W should retain its value from computation to computation (for example, across mini-batches of training). Following TensorFlow, we call this a variable. We specify the variable with both Axes and also an initial value:

W = ng.variable(axes=[C], initial_value=0)

Now we can estimate y as Y_hat and compute the average loss L:

Y_hat = ng.sigmoid(ng.dot(W, X))
L = ng.cross_entropy_binary(Y_hat, Y, out_axes=()) / ng.batch_size(Y_hat)

Here we use several ngraph functions, including ng.dot and ng.sigmoid. Since a tensor can have multiple axes, we need a way to mark which axes in the first argument of ng.dot are to act on which axes in the second argument. Please also note that the W has been defined with one axis, while X has two axis. Every tensor component along C axis in X is being dot-producted with W, and the N results are stored in Y_hat, that has only one axis, the N axis.

Once Y_hat has been computed (the whole batch computation was defined above), we can move on and update the weights in W. Gradient descent requires computing the gradient, \(\frac{dL}{dW}\)

grad = ng.deriv(L, W)

The ng.deriv function computes the backprop using autodiff. We are almost done as we are now ready to update W. The update step (which is an Op that will be carried out at the time of real computation on the device) computes the new weight and assigns it to W:

update = ng.assign(W, W - alpha * grad / ng.tensor_size(Y_hat))

We will also need a way to generate input data. Below the input data, XS and YS, is synthetically generated as a mixture of two Gaussian distributions in 4-d space. We shape our entire dataset as 10 mini-batches of 128 samples each, which we create with a convenient function:

import gendata

g = gendata.MixtureGenerator([.5, .5], (C.length,))
XS, YS = g.gen_data(N.length, 10)

Computation

Now we create a transformer and define a computation for learning. In order to do so, we pass the ops from which we want to retrieve the results for, followed by the placeholders.

Here, the computation will return three values for the L, W, and update, given inputs to fill the placeholders, \(\alpha\) (Learning Rate), X (inputs), Y (expected outputs).

from contextlib import closing

with closing(ngt.make_transformer()) as transformer:
    update_fun = transformer.computation([L, W, update], alpha, X, Y)

    for i in range(10):
        for xs, ys in zip(XS, YS):
            loss_val, w_val, _ = update_fun(5.0 / (1 + i), xs, ys)
            print("W: %s, loss %s" % (w_val, loss_val))

Finally, we train the model across the 10 epochs, printing the loss and updated weights. Please note that we are using a decreasing policy (with the epoch number) for \(\alpha\). Also note that there is no need to specify the outputs when invoking update_fun, as they were specified at definition time. Now we need only to feed the inputs into the update_fun call:

Also see Part 2 of logistic regressions, which walks users through adding additional variables, computations, and dimensions.

Walk-through

This walk-through guides users through several key concepts for using the nervana graph. The corresponding jupyter notebook is found here.

Let’s begin with a very simple example: computing x+1 for several values of x using the ngraph API. We should think of the computation as being invoked from the host, but possibly taking place somewhere else, which we will refer to as the device.

The nervana graph currently uses a compilation model. Users first define the computations by building a graph of operations, then they are compiled and run. In the future, we plan an even more compiler-like approach, where an executable is produced that can later be run on various platforms, in addition to an interactive version.

Our first program will use ngraph to compute x+1 for each x provided.

The x+1 program

The complete program, which we will walk through, is:

from __future__ import print_function
from contextlib import closing
import ngraph as ng
import ngraph.transformers as ngt

# Build the graph
x = ng.placeholder(axes=())
x_plus_one = x + 1

# Select a transformer
with closing(ngt.make_transformer()) as transformer:

    # Define a computation
    plus_one = transformer.computation(x_plus_one, x)

    # Run the computation
    for i in range(5):
        print(plus_one(i))

We begin by importing ngraph, the Python module for graph construction, and ngraph.transformers, the module for transformer operations.

import ngraph as ng
import ngraph.transformers as ngt

Next, we create a computational graph, which we refer to as ngraph, for the computation. Following TensorFlow terminology, we use placeholder to define a port for transferring tensors between the host and the device. Axes are used to tell the graph the tensor shape. In this example, x is a scalar so the axes are empty.

x = ng.placeholder(axes=())

x can be thought as a dummy node of the ngraph, providing an entry point for data into the computational graph. The ngraph graph construction API uses functions to build a graph of Op objects, the ngraph. Each function may add operations to the ngraph, and will return an Op that represents the computation. Here below, using implicitly ngraph as it will be made evident at the next step, we are adding an Op to the ngraph that takes as input the variable tensor x just defined, and the constant number 1.

x_plus_one = x + 1

A bit of behind the scenes magic occurs with the Python number 1 in the expression above, which is not an Op. When an argument to a graph constructor is not an Op, nervana graph will attempt to convert it to an Op using ng.constant, the graph function for creating a constant.

Thus, what it is really happening (when we are defining x_plus_one as above) is:

x_plus_one = ng.add(x, ng.constant(1))

For more information about the Op hierarchy please visit: https://ngraph.nervanasys.com/docs/latest/building_graphs.html At this point, our computational graph has been defined with only one function to compute represented by x_plus_one. Once the ngraph is defined, we can compile it with a transformer. Here we use make_transformer to make a default transformer. We tell the transformer the function to compute, x_plus_one, and the associated input parameters, only x in our example. The constant needs not to be repeated here, as it is part of the definition of the function to compute. The current default transformer uses NumPy for execution.

# Select a transformer
with closing(ngt.make_transformer()) as transformer:

    # Define a computation
    plus_one = transformer.computation(x_plus_one, x)

    # Run the computation
    for i in range(5):
        print(plus_one(i))

The first time the transformer executes a computation, the ngraph is analyzed and compiled, and storage is allocated and initialized on the device. Once compiled, the computations are callable Python objects residing on the host. On each call to plus_one the value of x is copied to the device, 1 is added, and then the result is copied back from the device to the host.

The Compiled x + 1 Program

The compiled code, to be executed on the device, can be examined (currently located in /tmp folder) to view the runtime device model. Here we show the code with some clarifying comments.

class Model(object):
    def __init__(self):
        self.a_AssignableTensorOp_0_0 = None
        self.a_AssignableTensorOp_0_0_v_AssignableTensorOp_0_0_ = None
        self.a_AssignableTensorOp_1_0 = None
        self.a_AssignableTensorOp_1_0_v_AssignableTensorOp_1_0_ = None
        self.a_AddZeroDim_0_0 = None
        self.a_AddZeroDim_0_0_v_AddZeroDim_0_0_ = None
        self.be = NervanaObject.be

    def alloc_a_AssignableTensorOp_0_0(self):
        self.update_a_AssignableTensorOp_0_0(np.empty(1, dtype=np.dtype('float32')))

    def update_a_AssignableTensorOp_0_0(self, buffer):
        self.a_AssignableTensorOp_0_0 = buffer
        self.a_AssignableTensorOp_0_0_v_AssignableTensorOp_0_0_ = np.ndarray(shape=(), dtype=np.float32,
            buffer=buffer, offset=0, strides=())

    def alloc_a_AssignableTensorOp_1_0(self):
        self.update_a_AssignableTensorOp_1_0(np.empty(1, dtype=np.dtype('float32')))

    def update_a_AssignableTensorOp_1_0(self, buffer):
        self.a_AssignableTensorOp_1_0 = buffer
        self.a_AssignableTensorOp_1_0_v_AssignableTensorOp_1_0_ = np.ndarray(shape=(), dtype=np.float32,
            buffer=buffer, offset=0, strides=())

    def alloc_a_AddZeroDim_0_0(self):
        self.update_a_AddZeroDim_0_0(np.empty(1, dtype=np.dtype('float32')))

    def update_a_AddZeroDim_0_0(self, buffer):
        self.a_AddZeroDim_0_0 = buffer
        self.a_AddZeroDim_0_0_v_AddZeroDim_0_0_ = np.ndarray(shape=(), dtype=np.float32,
            buffer=buffer, offset=0, strides=())

    def allocate(self):
        self.alloc_a_AssignableTensorOp_0_0()
        self.alloc_a_AssignableTensorOp_1_0()
        self.alloc_a_AddZeroDim_0_0()

    def Computation_0(self):
        np.add(self.a_AssignableTensorOp_0_0_v_AssignableTensorOp_0_0_,
               self.a_AssignableTensorOp_1_0_v_AssignableTensorOp_1_0_,
               out=self.a_AddZeroDim_0_0_v_AddZeroDim_0_0_)

    def init(self):
        pass

Tensors have two components: - storage for their elements (using the convention a_ for the allocated storage of a tensor) and - views of that storage (denoted as a_...v_).

The alloc_ methods allocate storage and then create the views of the storage that will be needed. The view creation is separated from the allocation because storage may be allocated in multiple ways.

Each allocated storage can also be initialized to, for example, random Gaussian variables. In this example, there are no initializations, so the method init, which performs the one-time device initialization, is empty. Constants, such as 1, are copied to the device as part of the allocation process.

The method Computation_0 handles the plus_one computation. Clearly this is not the optimal way to add 1 to a scalar, so let’s look at a more complex example next in the Logistic Regression walk-through.

Logistic Regression Part 2

In this example, we extend the code from Part 1 with several important features: - Instead of just updating the weight matrix W, we add a bias b and use the .variables() method to compactly update both variables. - We attach an additional computation to the transformer to compute the loss on a held-out validation dataset. - We switch from a flat C-dimensional feature space to a W x H feature space to demonstrate multi-dimensional logistic regression.

The corresponding jupyter notebook is found here.

import ngraph as ng
import ngraph.transformers as ngt
import gendata

The axes creation is conceptually the same as before, except we now add a new axes H to represent the new feature space.

ax_W = ng.make_axis(length=2)
ax_H = ng.make_axis(length=2)  # new axis added.
ax_N = ng.make_axis(length=128, name='N')

Building the graph

Our model, as in the previous example, has three placeholders: X, Y, and alpha. But now, the the input X has shape (W, H, N):

alpha = ng.placeholder(())
X = ng.placeholder([ax_W, ax_H, ax_N])  # now X has shape (W, H, N)
Y = ng.placeholder([ax_N])

Similarly, the weight matrix is now multi-dimensional, with shape (W, H), and we add a new scalar bias variable. We want also to specify that, for the weight matrix W, both axes will be reduced when computing the element-wise product and summation with the inputs (so we add -1 to specify this)

W = ng.variable([ax_W, ax_H], initial_value=0).named('W')  # now the Weight Matrix W has shape (W, H)
b = ng.variable((), initial_value=0).named('b')

Our predicted output will now be including the bias b. Please note there here the + operation implicitly broadcasts b to the batch size N, the size of the only axis of Y_hat:

Y_hat = ng.sigmoid(ng.dot(W, X) + b)
L = ng.cross_entropy_binary(Y_hat, Y, out_axes=())

For the parameter updates, instead of explicitly specifying the variables W and b, we can call L.variables() to retrieve all the variables that the loss function depends on:

print([var.name for var in L.variables()])

For complicated ngraphs, the variables() method makes it easy to iterate over all its dependant variables. Our new parameter update is then:

updates = [ng.assign(v, v - alpha * ng.deriv(L, v) / ng.batch_size(Y_hat))
           for v in L.variables()]

Please note that this time we embedded the (call to the) gradient computation inside the definition of the weight update computation. As stated in the previous example, the ng.deriv function computes the backprop using autodiff. The update step computes the new weight and assigns it to W:

all_updates = ng.doall(updates)

Computation

We have our update computation as before, but we also add an evaluation computation that computes the loss on a separate dataset without performing the updates. Since the evaluation computation does not perform any update operation, we need not pass in the learning rate alpha

For convenience, we define a function that computes the average cost across the validation set.

def avg_loss(xs, ys):
    total_loss = 0
    for x, y in zip(xs, ys):
        loss_val = eval_fun(x, y)
        total_loss += loss_val
    return total_loss / x.shape[-1]

We then generate our training and evaluation sets and perform the updates with the same technique that we used in the previous example. We emit the average loss on the validation set during training. Please note that because the length of the axes W and H is 2 now (for both; before we had only one axis of lenght 4), the number of weights is the same as in the previous example

from contextlib import closing

with closing(ngt.make_transformer()) as transformer:

    update_fun = transformer.computation([L, W, b, all_updates], alpha, X, Y)
    eval_fun = transformer.computation(L, X, Y)

    g = gendata.MixtureGenerator([.5, .5], (ax_W.length, ax_H.length))
    XS, YS = g.gen_data(ax_N.length, 10)
    EVAL_XS, EVAL_YS = g.gen_data(ax_N.length, 4)

    print("Starting avg loss: {}".format(avg_loss(EVAL_XS, EVAL_YS)))
    for i in range(10):
        for xs, ys in zip(XS, YS):
            loss_val, w_val, b_val, _ = update_fun(5.0 / (1 + i), xs, ys)
        print("After epoch %d: W: %s, b: %s, avg loss %s" % (i, w_val.T, b_val, avg_loss(EVAL_XS, EVAL_YS)))