Derive a trainable model

This section describes one possible way to turn a DL model built for inference into one that can be used for training.

To provide a more complete walk-through that also trains the model, our example additionally uses a simple data loader for uncompressed MNIST data.

Automating graph construction

In a Machine Learning ecosystem, it makes sense to use automation and abstraction where possible. nGraph was designed to construct graphs automatically from the tensor ops provided by a framework. However, nGraph’s graph-construction API operates at a fundamentally lower level than a typical framework’s API, and writing a model directly in nGraph would be somewhat akin to programming in assembly language: not impossible, but not the easiest thing for humans to do.

To make the task easier for developers who need to customize the “automatic” construction of graphs, we’ve provided some demonstration code showing how this can be done. We know, for example, that a trainable model can be derived from any graph that has been constructed with weight-based updates.

The following example, mnist_mlp.cpp, shows a hand-designed inference model being converted into a model that can be trained with nGraph.

Model overview

Due to the lower-level nature of the graph-construction API, the example we’ve selected to document here is a relatively simple model: a fully-connected topology with one hidden layer followed by Softmax.

Remember that in nGraph, the graph is stateless; values for the weights must be provided as parameters along with the normal inputs. Starting with the graph for inference, we will use it to create a graph for training. The training function will return tensors for the updated weights.
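
In symbols, the inference graph described below computes

    \mathrm{softmax}\bigl(\phi(X W_0 + b_0)\, W_1 + b_1\bigr)

where W_0, b_0, W_1, and b_1 are the weights and biases supplied as parameters, and \phi denotes the hidden-layer activation.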

Note

This example illustrates how to convert an inference model into one that can be trained. Depending on the framework, bridge code may do something similar, or the framework might do this operation itself. Here we do the conversion with nGraph because the computation for training a model is significantly larger than for inference, and doing the conversion manually is tedious and error-prone.

Code structure

Inference

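The snippets below use the input parameter X, the label parameter Y, and several size constants that are set up earlier in mnist_mlp.cpp. A minimal sketch of those declarations follows; the batch and hidden-layer sizes and the label element type are illustrative assumptions, while the input and output sizes follow from MNIST's 28x28 images and 10 classes:

    // Sizes used throughout the snippets below (batch and hidden sizes are
    // illustrative; input and output sizes are fixed by the MNIST data)
    size_t batch_size = 128;
    size_t input_size = 28 * 28;
    size_t l0_size = 500;
    size_t output_size = 10;
    size_t l1_size = output_size;
    // Clip value used later to keep log(softmax) finite
    float log_min = static_cast<float>(std::exp(-50.0));

    // The data input: one row of pixels per example in the batch
    auto X = std::make_shared<op::Parameter>(element::f32,
                                             Shape{batch_size, input_size});
    // The labels: one class index per example (element type assumed here)
    auto Y = std::make_shared<op::Parameter>(element::f32, Shape{batch_size});
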
We begin by building the graph, starting with the input parameter X. We also define a fully-connected layer, including parameters for weights and bias:

    // Layer 0
    auto W0 = std::make_shared<op::Parameter>(element::f32,
                                              Shape{input_size, l0_size});
    auto b0 =
        std::make_shared<op::Parameter>(element::f32, Shape{l0_size});
    auto l0_dot = std::make_shared<op::Dot>(X, W0, 1);
    auto b0_broadcast = std::make_shared<op::Broadcast>(
        b0, Shape{batch_size, l0_size}, AxisSet{0});
    // Hidden-layer output consumed by layer 1 below (ReLU activation assumed)
    auto l0 = std::make_shared<op::Relu>(l0_dot + b0_broadcast);

Repeat the process for the next layer,

    // Layer 1
    auto W1 = std::make_shared<op::Parameter>(element::f32,
                                              Shape{l0_size, l1_size});
    auto b1 =
        std::make_shared<op::Parameter>(element::f32, Shape{l1_size});
    auto l1_dot = std::make_shared<op::Dot>(l0, W1, 1);
    auto b1_broadcast = std::make_shared<op::Broadcast>(
        b1, Shape{batch_size, l1_size}, AxisSet{0});
    auto l1 = l1_dot + b1_broadcast;

and normalize everything with a softmax.

    // Softmax
    auto softmax = std::make_shared<op::Softmax>(l1, AxisSet{1});

Loss

We use cross-entropy to compute the loss. nGraph does not currently have a core op for cross-entropy, so we implement it directly, adding clipping to prevent underflow.

    auto labels =
        std::make_shared<op::OneHot>(Y, Shape{batch_size, output_size}, 1);
    auto softmax_clip_value = std::make_shared<op::Constant>(
        element::f32, Shape{}, std::vector<float>{log_min});
    auto softmax_clip_broadcast = std::make_shared<op::Broadcast>(
        softmax_clip_value, Shape{batch_size, output_size}, AxisSet{0, 1});
    auto softmax_clip =
        std::make_shared<op::Maximum>(softmax, softmax_clip_broadcast);
    auto softmax_log = std::make_shared<op::Log>(softmax_clip);
    auto prod = std::make_shared<op::Multiply>(softmax_log, labels);
    auto N = std::make_shared<op::Parameter>(element::f32, Shape{});
    auto loss = std::make_shared<op::Divide>(
        std::make_shared<op::Sum>(prod, AxisSet{0, 1}), N);
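
Written out, the snippet above computes

    \mathrm{loss} = \frac{1}{N} \sum_{i,j} \mathrm{onehot}(Y)_{ij} \, \log\bigl(\max(\mathrm{softmax}_{ij},\; c)\bigr)

where c is the clip constant log_min and N is the batch size; up to sign, this is the mean cross-entropy between the one-hot labels and the clipped softmax output.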

Backprop

We want to reduce the loss by adjusting the weights. We compute the adjustments using the reverse-mode autodiff algorithm, commonly referred to as “backprop” because of the way it is implemented in interpreted frameworks. In nGraph, we augment the loss computation with computations for the weight adjustments. This allows the calculations for the adjustments to be further optimized.

    // Each of W0, b0, W1, and b1
    auto learning_rate =
        std::make_shared<op::Parameter>(element::f32, Shape{});
    auto delta = -learning_rate * loss;

For any node N, if the update for loss is delta, the update computation for N will be given by the node

    auto update = loss->backprop_node(N, delta);
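
In the full example, the per-parameter updates below come from a single adjoints object built from loss and delta; a minimal sketch of that construction, assuming nGraph's autodiff::Adjoints helper (whose exact constructor signature varies across nGraph versions):

    // Reverse-mode autodiff bookkeeping (constructor signature assumed); each
    // adjoints.backprop_node(p) call below yields the update term for p.
    autodiff::Adjoints adjoints(NodeVector{loss}, NodeVector{delta});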
    auto W0_next = W0 + adjoints.backprop_node(W0);
    auto b0_next = b0 + adjoints.backprop_node(b0);
    auto W1_next = W1 + adjoints.backprop_node(W1);
    auto b1_next = b1 + adjoints.backprop_node(b1);

The different update nodes will share intermediate computations. So to get the updated values for the weights as computed with the specified backend:

    // Get the backend
    auto backend = runtime::Backend::create("CPU");

    // Allocate and randomly initialize variables
    auto t_W0 = make_output_tensor(backend, W0, 0);
    auto t_b0 = make_output_tensor(backend, b0, 0);
    auto t_W1 = make_output_tensor(backend, W1, 0);
    auto t_b1 = make_output_tensor(backend, b1, 0);

    std::function<float()> rand(
        std::bind(std::uniform_real_distribution<float>(-1.0f, 1.0f),
                  std::default_random_engine(0)));
    randomize(rand, t_W0);
    randomize(rand, t_b0);
    randomize(rand, t_W1);
    randomize(rand, t_b1);

    // Allocate inputs
    auto t_X = make_output_tensor(backend, X, 0);
    auto t_Y = make_output_tensor(backend, Y, 0);

    auto t_learning_rate = make_output_tensor(backend, learning_rate, 0);
    auto t_N = make_output_tensor(backend, N, 0);
    set_scalar(t_N, static_cast<float>(batch_size), 0);

    // Allocate updated variables
    auto t_W0_next = make_output_tensor(backend, W0_next, 0);
    auto t_b0_next = make_output_tensor(backend, b0_next, 0);
    auto t_W1_next = make_output_tensor(backend, W1_next, 0);
    auto t_b1_next = make_output_tensor(backend, b1_next, 0);

    auto t_loss = make_output_tensor(backend, loss, 0);
    auto t_softmax = make_output_tensor(backend, softmax, 0);
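
The make_output_tensor, randomize, and set_scalar helpers above belong to the example's utility code rather than to the nGraph core API. As a rough sketch of what make_output_tensor might look like, assuming a helper that allocates a backend tensor matching one output of a node (the real helper in the example may differ):

    // Hypothetical sketch: allocate a backend tensor whose element type and
    // shape match the given output of node.
    std::shared_ptr<runtime::Tensor> make_output_tensor(
        const std::shared_ptr<runtime::Backend>& backend,
        const std::shared_ptr<Node>& node,
        size_t output_index)
    {
        return backend->create_tensor(node->get_output_element_type(output_index),
                                      node->get_output_shape(output_index));
    }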

Update

Since nGraph is stateless, we train by making a function that has the original weights among its inputs and the updated weights among the results. For training, we’ll also need the labeled training data as inputs, and we’ll return the loss as an additional result. We’ll also want to track how well we are doing; for that we use a second function that returns the loss and takes the labeled testing data as input. Although we can use the same nodes in different functions, nGraph currently does not allow the same nodes to be compiled in different functions, so we compile clones of the nodes.

    // Train
    // X, Y, learning_rate, W0, b0, W1, b1 -> loss, softmax, W0_next, b0_next, W1_next, b1_next
    NodeMap train_node_map;
    auto train_function = clone_function(
        Function(
            NodeVector{loss, softmax, W0_next, b0_next, W1_next, b1_next},
            ParameterVector{X, Y, N, learning_rate, W0, b0, W1, b1}),
        train_node_map);
    auto train_exec = backend->compile(train_function);
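
Once compiled, a training step binds tensors to the cloned function's parameters and results in the same order used to build the Function above. A sketch of how one step might be invoked; the learning-rate value and the tensor swapping here are illustrative assumptions, and the full example also builds and runs a separate evaluation function cloned in the same way:

    // One training step (sketch). t_X and t_Y are assumed to already hold a
    // batch from the MNIST loader; the learning-rate value is illustrative.
    set_scalar(t_learning_rate, 0.1f, 0);
    train_exec->call_with_validate(
        {t_loss, t_softmax, t_W0_next, t_b0_next, t_W1_next, t_b1_next},
        {t_X, t_Y, t_N, t_learning_rate, t_W0, t_b0, t_W1, t_b1});

    // Before the next step, the updated weights must become the current
    // weights, e.g. by swapping the tensor pointers.
    std::swap(t_W0, t_W0_next);
    std::swap(t_b0, t_b0_next);
    std::swap(t_W1, t_W1_next);
    std::swap(t_b1, t_b1_next);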