Implementing a neural network with MegEngine#

What this tutorial covers

  • Think about the limitations of linear models, consider how to solve the linear inseparability problem, and introduce the concept of “activation function”;

  • Gain a basic understanding of neuron models and fully connected network structures;

  • Get exposed to different parameter initialization strategies, and learn to use the module module to improve model design efficiency;

  • Building on the above, use MegEngine to implement a two-layer fully connected neural network to complete the Fashion-MNIST image classification task.

Get the original dataset#

In the last tutorial, we used MegEngine's data.dataset module to obtain the MNIST dataset and achieved over 90% classification accuracy with a linear classifier. Next, we will use the exact same model structure and optimization strategy on the similar Fashion-MNIST dataset to see whether the linear model can still achieve equally good results.

After 5 epochs of training, a linear classifier (logistic regression model) usually achieves about 83% accuracy on Fashion-MNIST.

Such results are not ideal compared to human-level performance, so we need to design better machine learning models.

Limitations of Linear Models#

Linear models are simple and therefore have many limitations. For example, when dealing with classification problems, the decision boundary they produce is a hyperplane, which means that, ideally, the sample points must be linearly separable in the feature space. The most typical counter-example is the exclusive-or (XOR) problem, which a linear model cannot solve:

A dataset of four samples

Input \(\boldsymbol{x}_i\)    feature \(x_{i1}\)    feature \(x_{i2}\)    output \(y_i\)
\(\boldsymbol{x}_1\)          0                     0                     0
\(\boldsymbol{x}_2\)          0                     1                     1
\(\boldsymbol{x}_3\)          1                     0                     1
\(\boldsymbol{x}_4\)          1                     1                     0

2D space representation

../../_images/XOR-Problem.png

Since the data are linearly inseparable, we cannot find a linear decision boundary that separates the two classes of samples well.

Note

We are about to transition from the linear model to a neural network model, and you will find that the change has already quietly taken place.

Introduce nonlinear factors#

Recall from the last tutorial that our linear classifier's forward computation included an operation (the link function) that maps the linear prediction to a probability value. Since this step has no effect on the model's decision boundary, we can still regard it as a generalized linear model (a more formal interpretation is that we assume the observed samples follow an exponential family distribution, which is not discussed in this tutorial because it would involve too many mathematical details).

It's time to tell you a secret: the linear model itself can be thought of as the simplest single-layer neural network!

../../_images/nn-2.svg

If the flattened feature vector \(\boldsymbol{x}\) of an MNIST image sample is regarded as the input layer (Input Layer), then for the binary classification problem the output can be regarded as a single neuron, responsible for the linear prediction \(z = \boldsymbol{w}^T \boldsymbol{x} + b\) and the associated link-function computation \(a=\sigma(z)\). For binary classification the output layer only needs one neuron, while multi-class problems require multiple neurons.#
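As a minimal sketch of this single-neuron computation (the 784-dimensional input and the all-zero parameters below are made up for illustration):

import megengine
import megengine.functional as F

x = F.ones((784,))         # flattened input feature vector
w = F.zeros((784,))        # weights of the single output neuron
b = megengine.Tensor(0.0)  # bias

z = F.dot(w, x) + b        # linear prediction: z = w^T x + b
a = F.nn.sigmoid(z)        # link function: a = sigmoid(z)
>>> a.item()
0.5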

Activation function and hidden layer#

In the neuron computing model, this nonlinear function is called the activation function (Activation function); historically, a frequently used activation function is the Sigmoid function \(\sigma(\cdot)\) we have already seen. This gives us some inspiration: the activation function is nonlinear, which means that nonlinearity can be introduced into our computational model through the connections between multiple neurons. We can do this by introducing a hidden layer (Hidden Layer):

../../_images/nn-mlp.svg

There are 12 neurons in the hidden layer, and each neuron needs to be responsible for the corresponding nonlinear calculation and decide whether to activate or not.#

Let's standardize the terminology: in the neural network model, the network above is called a 2-layer fully connected neural network. The input layer holds the sample features and no actual computation happens there, so it is not counted in the number of model layers. There can be multiple hidden layers, and the number of neurons in each layer needs to be set manually. Since every neuron of a linear layer is connected to all the outputs of the previous layer, such a layer is also called a fully connected layer (FC Layer). It is precisely because an activation function follows the fully connected layer that the neural network gains its nonlinear computing ability.

Let's use MegEngine to simulate this calculation process and get an intuitive feel for it (focus only on the shape changes here):

import megengine.functional as F

x = F.ones((16,))

W1 = F.zeros((16, 12))
b1 = F.zeros((12,))
z1 = F.matmul(x, W1) + b1  # Linear (fully connected) layer
a1 = F.nn.sigmoid(z1)  # Activation

W2 = F.zeros((12, 10))
b2 = F.zeros((10,))
z2 = F.matmul(a1, W2) + b2  # Linear (fully connected) layer: logits
output = F.softmax(z2)  # Softmax maps the logits to class probabilities
>>> output.shape
(10,)

In MegEngine, common nonlinear activation functions are implemented in the functional.nn module. Since a neural network may involve a large number of nonlinear calculations, and since, unlike the classifier output, hidden layers do not need to be mapped to a probability interval, a more commonly used activation function than the sigmoid is ReLU: it is simple to compute and differentiate, and it satisfies the nonlinearity requirement. For a more specific explanation, check the API documentation of the different activation functions.

In the field of deep learning, there is a great deal of research on the design of activation functions. For simplicity, this tutorial uses the ReLU activation function.
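A minimal sketch of applying these activation functions element-wise (the input values below are made up for illustration):

import megengine
import megengine.functional as F

z = megengine.Tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(F.nn.sigmoid(z))  # squashes every value into (0, 1)
print(F.nn.relu(z))     # max(0, z): zeroes the negative values, keeps the positive ones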

Multilayer Neural Network#

In addition to choosing the activation function, defining a fully connected neural network mainly involves deciding the number of hidden layers and the number of neurons in each layer.

We can give the model stronger learning and expressive ability by stacking more hidden layers. By contrast, the transformations in a linear model (strictly speaking, affine transformations, but here we only emphasize the difference between linear and nonlinear), no matter how many are stacked, can always be collapsed into a single equivalent transformation, i.e. the matrix product \(C=AB\) (see the sketch after the following list). Although we introduce nonlinearity through the activation function, new problems also arise:

  • We need to manage more model parameters; the MegEngine solution will be given later in this tutorial;

  • Neural network models can theoretically approximate arbitrary functions, and using deeper networks usually means stronger approximation capability. But we cannot design arbitrarily: we also need to trade off the number of parameters (and the amount of computation) against the final performance of the model.
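A minimal sketch of the "stacked linear layers collapse" argument above (all shapes below are chosen arbitrarily): two linear layers without an activation in between are equivalent to a single layer whose weight matrix is \(C = AB\), while inserting a nonlinear activation breaks this equivalence.

import megengine.functional as F
import megengine.random as rand

rand.seed(0)
x = rand.normal(0, 1, (8,))    # input vector
A = rand.normal(0, 1, (8, 6))  # first linear "layer"
B = rand.normal(0, 1, (6, 4))  # second linear "layer"

two_layers = F.matmul(F.matmul(x, A), B)  # x -> xA -> (xA)B
one_layer = F.matmul(x, F.matmul(A, B))   # x -> x(AB): the same mapping, so stacking adds no capacity
print(F.max(F.abs(two_layers - one_layer)))  # ~0, only floating point error

# With a nonlinear activation in between, the two matrices can no longer be merged:
with_activation = F.matmul(F.nn.relu(F.matmul(x, A)), B)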

See also

At present, a neural network architecture composed only of fully connected layers (and activation functions) is also called a Multilayer perceptron (MLP) in some materials. Its starting point is to improve the perceptron algorithm, which leads to the MLP; the two essentially refer to the same thing. We used Logistic regression to solve the binary classification problem, so we do not use the term perceptron in this tutorial.

Random initialization strategy#

We need to pay attention to some characteristics of fully connected layers, such as the fact that every neuron in the layer operates on all the inputs from the previous layer. Recall that when we introduced the linear model, we mentioned that model parameters are iteratively optimized from some initial values, and we adopted the simplest initialization strategy, all-zero initialization, i.e. all parameter values in the model are initialized to 0. This approach works for a single-layer model output + loss function, but it causes problems for multi-layer neural networks.

Suppose we initialize all the neuron parameters in the hidden layer to zero; this actually means that all the neurons are doing the same thing:

  • During the forward calculation, since the inputs from the previous layer are the same, all neurons in the same layer produce the same output;

  • When passing through the activation function, since ReLU has no randomness, the same outputs are obtained and passed on to the next layer;

  • During the backward calculation, all parameters therefore receive the same gradient, and with the same learning rate they remain identical after the update.

As a result, all neurons in the fully connected layer end up doing the same thing, and the model's expressive power is greatly reduced. The solution is to use a random initialization strategy.

MegEngine generates random data

The random module is provided in MegEngine to generate random Tensors:

import megengine.random as rand

rand.seed(20200325)
x = rand.normal(0, 1, (3, 3))
>>> x
Tensor([[ 0.014   0.3366  0.877 ]
[ 0.4073 -0.0031  0.2638]
[-0.1826  1.4192  0.2758]], device=xpux:0)

Here, random.seed sets the random seed, which makes it convenient to reproduce the random state in some cases. Using the normal interface, we generated a Tensor of shape \((3, 3)\) from a standard normal distribution (mean 0, standard deviation 1).
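To see why random initialization matters, here is a minimal sketch (reusing the shapes from the earlier two-layer simulation) contrasting identical initialization with random initialization of a hidden layer's weights:

import megengine.functional as F
import megengine.random as rand

rand.seed(20200325)
x = rand.normal(0, 1, (16,))              # a random input sample

W_zero = F.zeros((16, 12))                # identical (all-zero) initialization
h_zero = F.nn.relu(F.matmul(x, W_zero))   # every hidden neuron computes exactly the same value
print(h_zero)                             # 12 identical numbers: the neurons are interchangeable

W_rand = rand.normal(0, 0.01, (16, 12))   # random initialization breaks the symmetry
h_rand = F.nn.relu(F.matmul(x, W_rand))
print(h_rand)                             # each neuron now computes something different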

As a deep learning framework, MegEngine naturally needs to provide a convenient way to design models.

Use Module to define the model#

See also

The introduction of this section of this tutorial is relatively concise, please refer to Use Module to define the model structure for the complete content.

The module module in MegEngine provides an abstraction mechanism for the structures in a neural network model: everything is a Module. Besides implementing a default random initialization strategy for common modules, it also provides common methods (such as Module.parameters, which we will use shortly). This allows users to focus on designing the network structure, freeing them from details such as repetitive parameter initialization and parameter management.

For example, a linear layer operation like the matmul above can be represented with Linear:

import megengine.functional as F
import megengine.module as M

num_features, num_hidden = 784, 256  # flattened image size and hidden layer width

x = F.ones((num_features,))  # (784,)
fc = M.Linear(num_features, num_hidden)  # (784, 256)

You can see that the fc module has weight (Weight) and bias (Bias) parameters of the corresponding shapes, and their initialization has been completed automatically.

>>> fc.weight.shape
(256, 784)
>>> fc.bias.shape
(256,)

According to the API documentation, for the input sample \(x\), the calculation process is \(y=xW^T+b\).

Simply verify that the operation result of Linear is consistent with the result obtained by matmul (within the floating point error range):

>>> a = fc(x)
>>> b = F.matmul(x, fc.weight.transpose()) + fc.bias
>>> print(a.shape, b.shape)
(256,) (256,)
>>> F.abs(a - b) < 1e-6
Tensor([ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True], dtype=bool, device=xpux:0)

Module allows nested implementation like building blocks, so the fully connected neural network in this tutorial can be implemented as follows:

  • All network structures are derived from the base class M.Module. In the constructor, you must first call super().__init__().

  • In the constructor, declare all layers/modules to be used;

  • In the forward function, define how the model will run, from input to output.

class NN(M.Module):
   def __init__(self):
      super().__init__()
      self.fc = M.Linear(num_features, num_hidden)
      self.classifier = M.Linear(num_hidden, num_classes)

   def forward(self, x):
      x = F.nn.relu(self.fc(x))
      x = self.classifier(x)
      return x
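A quick usage sketch of the model's forward pass (the layer sizes below are the values assumed in this tutorial, defined explicitly here only so the snippet is self-contained):

num_features, num_hidden, num_classes = 784, 256, 10

model = NN()
out = model(F.ones((num_features,)))
>>> out.shape
(10,)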

An iterator of model parameters can be obtained with the help of Module.parameters, which is provided to the gradient manager and optimizer:

>>> gm = autodiff.GradManager().attach(model.parameters())
>>> optimizer = optim.SGD(model.parameters(), lr=0.01)
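Module.parameters walks the module and its submodules and yields every parameter it finds; a small sketch for inspecting them (the iteration order is an implementation detail):

for p in model.parameters():
    print(p.shape)  # the weights and biases of fc and classifier, e.g. (256, 784), (256,), (10, 256), (10,)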

Practice: Feedforward Neural Networks#

A feedforward neural network is the simplest type of neural network. Its neurons are arranged in layers, and each neuron is connected only to neurons in the previous layer: it receives the output of the previous layer and passes its own output to the next layer, with no feedback between layers. It is one of the most widely used and fastest-growing types of artificial neural networks.

In short, the feedforward neural network here contains no structures other than fully connected Linear layers (and activation functions); let's implement it with MegEngine:

import megengine
import megengine.data as data
import megengine.data.transform as T
import megengine.functional as F
import megengine.module as M
import megengine.optimizer as optim
import megengine.autodiff as autodiff

def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    import os
    import gzip
    import numpy as np

    labels_path = os.path.join(path,
                               '%s-labels-idx1-ubyte.gz'
                               % kind)
    images_path = os.path.join(path,
                               '%s-images-idx3-ubyte.gz'
                               % kind)

    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,
                               offset=8)

    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 784)

    return images, labels


# Get train and test dataset and prepare dataloader
# Make sure that you have downloaded data and placed it in `DATA_PATH`
# GitHub link: https://github.com/zalandoresearch/fashion-mnist

from os.path import expanduser
DATA_PATH = expanduser("~/data/datasets/Fashion-MNIST")

X_train, y_train = load_mnist(DATA_PATH, kind='train')
X_test, y_test = load_mnist(DATA_PATH, kind='t10k')

mean, std = X_train.mean(), X_train.std()

train_dataset = data.dataset.ArrayDataset(X_train, y_train)
test_dataset = data.dataset.ArrayDataset(X_test, y_test)

train_sampler = data.RandomSampler(train_dataset, batch_size=64)
test_sampler = data.SequentialSampler(test_dataset, batch_size=64)

transform = T.Normalize(mean, std)

train_dataloader = data.DataLoader(train_dataset, train_sampler, transform)
test_dataloader = data.DataLoader(test_dataset, test_sampler, transform)

num_features = train_dataset[0][0].size
num_hidden = 256
num_classes = 10


# Define model
class NN(M.Module):
    def __init__(self):
        super().__init__()
        self.fc = M.Linear(num_features, num_hidden)
        self.classifier = M.Linear(num_hidden, num_classes)

    def forward(self, x):
        x = F.nn.relu(self.fc(x))
        x = self.classifier(x)
        return x


model = NN()

# GradManager and Optimizer setting
gm = autodiff.GradManager().attach(model.parameters())
optimizer = optim.SGD(model.parameters(), lr=0.01)


# Training and validation
nums_epoch = 5
for epoch in range(nums_epoch):
    training_loss = 0
    nums_train_correct, nums_train_example = 0, 0
    nums_val_correct, nums_val_example = 0, 0

    for step, (image, label) in enumerate(train_dataloader):
        image = F.flatten(megengine.Tensor(image), 1)
        label = megengine.Tensor(label).astype("int32")

        with gm:
            score = model(image)
            loss = F.nn.cross_entropy(score, label)
            gm.backward(loss)
            optimizer.step().clear_grad()

        training_loss += loss.item() * len(image)

        pred = F.argmax(score, axis=1)
        nums_train_correct += (pred == label).sum().item()
        nums_train_example += len(image)

    training_acc = nums_train_correct / nums_train_example
    training_loss /= nums_train_example

    for image, label in test_dataloader:
        image = F.flatten(megengine.Tensor(image), 1)
        label = megengine.Tensor(label).astype("int32")
        pred = F.argmax(model(image), axis=1)

        nums_val_correct += (pred == label).sum().item()
        nums_val_example += len(image)

    val_acc = nums_val_correct / nums_val_example

    print(f"Epoch = {epoch}, "
          f"train_loss = {training_loss:.3f}, "
          f"train_acc = {training_acc:.3f}, "
          f"val_acc = {val_acc:.3f}")

See also

The corresponding source code of this tutorial: examples/beginner/neural-network.py

After 5 epochs of training, we usually obtain a neural network model whose accuracy exceeds the roughly 83% achieved by the linear classifier. In this tutorial we just want to demonstrate that introducing nonlinearity into the model leads to better expressiveness and predictive performance, so we did not spend time tuning hyperparameters or optimizing the model further. The Fashion-MNIST project officially maintains a benchmark of test results (http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/); among them is a result obtained with an MLPClassifier, which reaches 87.7% accuracy. Following the description of that model, we can reproduce the experiments and obtain a neural network model with comparable performance.

In fact, from the moment we touch the neural network model, we have already entered “tuning hyperparameters” mode. Neural network models require the right code and hyperparameter design to achieve very good results. For MegEngine beginners who have just come into contact with neural networks, writing and experimenting with more code is the most recommended way to improve. Inside the Megvii Technology Research Institute, the process of training a model is called “alchemy”, and this term has since become industry slang. In working through the MegEngine beginner tutorials, you are actually accumulating the most basic alchemy experience.

Summary: exploring the computational graph further#

We mentioned computational graphs in the first tutorial, now let’s recall:

  • MegEngine is a deep learning framework based on computing graphs;

  • In the field of deep learning, any complex deep neural network model can essentially be represented by a computational graph.

When we run the neural network training code in this tutorial, we can go one step further and imagine: how should this fully connected neural network be expressed as a computational graph? When we visualize the structure of a neural network model, we usually focus on how the data nodes change, but the computing nodes, i.e. the operators, in the computational graph are just as critical. If our operators could only be linear operations such as matmul / Linear, the model would also impose restrictions on the shape of the input data, which would have to be expressed as a feature vector (i.e. a 1-d tensor). When we face more complex data representations, such as 3-channel RGB color images, can we continue to use fully connected neural networks?

In the next tutorial we will experiment with the CIFAR 10 color image dataset and see whether a fully connected network still works. There we will introduce a new operator (let's keep it a mystery for now), and we will find that designing a neural network is actually like building with blocks: there are many different effective structures suitable for different scenarios, and when designing an operator, some traditional domain knowledge can sometimes be helpful.

Extended material#

References#