Train using multiple nGraph CPU backends with data parallelism

In the previous section, we described the steps needed to create a “trainable” nGraph model. Here we demonstrate how to train that model in data-parallel mode across multiple devices.

To use this mode of training, create an nGraph build with the cmake flag -DNGRAPH_DISTRIBUTED_ENABLE=TRUE.

To deploy data-parallel training on backends supported by the nGraph API, add an AllReduce op for each gradient after the steps that complete the backpropagation.

    // delta is the loss adjoint defined in the previous section
    ngraph::autodiff::Adjoints adjoints(NodeVector{loss},
                                        NodeVector{delta});
    auto grad_W0 = adjoints.backprop_node(W0);
    auto grad_b0 = adjoints.backprop_node(b0);
    auto grad_W1 = adjoints.backprop_node(W1);
    auto grad_b1 = adjoints.backprop_node(b1);

    auto avg_grad_W0 = std::make_shared<op::AllReduce>(grad_W0);
    auto avg_grad_b0 = std::make_shared<op::AllReduce>(grad_b0);
    auto avg_grad_W1 = std::make_shared<op::AllReduce>(grad_W1);
    auto avg_grad_b1 = std::make_shared<op::AllReduce>(grad_b1);

    auto W0_next = W0 + avg_grad_W0;
    auto b0_next = b0 + avg_grad_b0;
    auto W1_next = W1 + avg_grad_W1;
    auto b1_next = b1 + avg_grad_b1;
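
To make the role of the AllReduce op concrete, the following standalone sketch simulates an allreduce over per-worker gradient buffers in plain C++. It is not nGraph code: the helper simulate_allreduce_sum is hypothetical, and the sketch assumes a sum reduction, which this section does not confirm for op::AllReduce.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of nGraph: simulates a sum-allreduce.
// grads[w] is the gradient buffer held by worker w. After the call, every
// worker holds the element-wise sum of all workers' buffers.
inline void simulate_allreduce_sum(std::vector<std::vector<float>>& grads)
{
    if (grads.empty())
    {
        return;
    }
    const std::size_t n = grads[0].size();
    std::vector<float> total(n, 0.0f);

    // Reduce: accumulate every worker's contribution.
    for (const auto& g : grads)
    {
        assert(g.size() == n); // all workers must hold same-shaped gradients
        for (std::size_t i = 0; i < n; ++i)
        {
            total[i] += g[i];
        }
    }

    // Broadcast: every worker receives the same reduced result.
    for (auto& g : grads)
    {
        g = total;
    }
}
```

With two workers holding {1, 2} and {3, 4}, both end up holding {4, 6}. Because every worker then applies the same update to the same weights, the model copies stay in sync from step to step.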

We need to initialize and finalize distributed training with a Distributed object; see the full raw code.
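
One common way for a single object to cover both initialization and finalization is the RAII pattern: set up in the constructor, tear down in the destructor. The sketch below illustrates that pattern only. It assumes, without confirmation from this section, that the Distributed object works this way; ScopedDistributed and fake_runtime are hypothetical names and are not part of nGraph or MLSL.

```cpp
#include <cassert>

// Hypothetical stand-in for a communication runtime's setup/teardown calls.
namespace fake_runtime
{
    inline bool& active()
    {
        static bool flag = false;
        return flag;
    }
    inline void init()
    {
        assert(!active()); // must not initialize twice
        active() = true;
    }
    inline void finalize()
    {
        assert(active()); // must have been initialized
        active() = false;
    }
}

// Hypothetical RAII guard: the runtime is live exactly as long as the
// guard object is in scope.
class ScopedDistributed
{
public:
    ScopedDistributed() { fake_runtime::init(); }
    ~ScopedDistributed() { fake_runtime::finalize(); }

    // Non-copyable, so finalize() runs exactly once per init().
    ScopedDistributed(const ScopedDistributed&) = delete;
    ScopedDistributed& operator=(const ScopedDistributed&) = delete;
};
```

Declaring such a guard at the top of main keeps the runtime alive for the whole training loop and tears it down on every exit path, including early returns.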

Finally, to run the training on two nGraph devices, invoke mpirun, which is distributed with the Intel MLSL library. This launches two nGraph CPU backends.

$ mpirun -np 2 dist_mnist_mlp