MegEngine Virtual Alchemy Challenge#

What this tutorial covers

  • Improve the convolutional neural network model in the previous tutorial and understand the official implementation of ResNet in the MegEngine/Models library;

  • Learn how to save and load models, and how to use pretrained models for inference;

  • Integrate common skills in model development, summarize this series of tutorials, and provide several possible learning directions for reference.

Warning

  • The style of this tutorial is rather unique: imagine that you are using MegEngine to participate in a competition, and we hope you get the feeling of being there;

  • We don’t actually end up training a ResNet model from scratch (it would take too long); the goal is to master the ideas behind it.

See also

Most of the demo code comes from official/vision/classification/resnet (our purpose is to read it).

ImageNet dataset#

Recall that the previous tutorials all followed a similar pattern, with a lot of repeated structure:

  1. Usually we introduce a simple model, implement it, and apply it to the dataset of the current tutorial;

  2. In the next tutorial, we switch to a more complex dataset and find that the model no longer works well;

  3. Therefore, we design a better model (taking the opportunity to introduce relevant new knowledge), then train and validate it.

In step 2, what we mainly need to modify is the data loading part, such as calling a different dataset acquisition interface; in step 3, data loading is no longer a concern, and we focus on model design and hyperparameter tuning. In fact, for the same type of task, the same model design may be reused across different datasets, and even a little fine-tuning of the last trained model can achieve good results on similar tasks. This makes one wonder: how can the model development process be optimized so that it is more efficient in the face of different tasks?

This tutorial will also prompt you to think about some modularity and engineering issues, and introduce you to workflows you may encounter in production.

Learn about dataset information#

Warning

The most frequently used ILSVRC2012 image classification and localization dataset is currently available on Kaggle <https://www.kaggle.com/c/imagenet-object-localization-challenge/overview/description>. However, the complete ImageNet and other commonly used subsets cannot be downloaded directly. Although MegEngine provides the corresponding ImageNet processing interface, it is only used for processing local data, not for downloading the original dataset.

ImageNet evaluation metrics

  • Top-1 accuracy: the category with the highest predicted probability matches the ground-truth label, in which case the prediction is considered correct;

  • Top-5 accuracy: the five categories with the highest predicted probabilities contain the ground-truth label, in which case the prediction is considered correct.

This corresponds to the topk_accuracy interface in MegEngine; expressed as an error rate, it is 1 - accuracy.
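The following sketch shows one way to compute these metrics with topk_accuracy; the logits and labels here are made up for illustration:

import numpy as np

import megengine
import megengine.functional as F

# Made-up logits for a batch of 4 samples over 10 classes
logits = megengine.Tensor(np.random.randn(4, 10).astype("float32"))
labels = megengine.Tensor(np.array([1, 3, 5, 7], dtype="int32"))

# Top-1 and Top-5 accuracy in one call
top1, top5 = F.topk_accuracy(logits, labels, topk=(1, 5))
print(top1, top5)  # the corresponding error rates are 1 - top1 and 1 - top5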

Attend Virtual ILSVRC#

Go back to 2015 in a parallel universe. As a member of the MegEngine alchemy furnace team, you have just learned about convolutional neural networks, and you originally planned to enter the ILSVRC competition with the LeNet5 model. Returning empty-handed is definitely not the team’s goal, and coasting along as a sightseer is even more unacceptable. Now you need to find ways to improve the model and strive to make a good showing in this year’s challenge.

Kaiming, Xiangyu, and Shaoqing have decided to join your team, with the experienced teacher Sun Jian serving as mentor. On to the ILSVRC2015 Challenge!

../../_images/ILSVRC.jpg

So the question is: how do we improve it? The way you approach a problem matters, so everyone decides to think about solutions from different angles.

Mr. Sun said: “Let’s first take a look at what the ILSVRC image classification champions and runners-up of the past few years can teach us.”

Xiangyu is already thoroughly familiar with the related papers, and he quickly listed several that deserve close attention: AlexNet, VGGNet, GoogLeNet… “The processing ideas and model structures in these papers are quite novel and worth a look.” So everyone decided to start with AlexNet, in chronological order.

Increase the firepower of alchemy#

Traditional neural networks use sigmoid or tanh as the activation function; AlexNet uses ReLU, which you have already applied. You also noticed that AlexNet used 2 GPUs for training! “We need more GPUs to save time!” you exclaim excitedly.

See also

official/vision/classification/resnet/train.py#L112 supports training ResNet with a single GPU or multiple GPUs, where each GPU device is seen as a worker. When training with multiple GPU devices, you need to pay attention to various data synchronization strategies, such as:

# Sync parameters and buffers
if dist.get_world_size() > 1:
    dist.bcast_list_(model.parameters())
    dist.bcast_list_(model.buffers())

# Autodiff gradient manager
gm = autodiff.GradManager().attach(
    model.parameters(),
    callbacks=dist.make_allreduce_cb("mean") if dist.get_world_size() > 1 else None,
)

Going from a single card to multiple cards, you need to use the launcher decorator. For more details, please refer to Distributed Training.
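As a minimal sketch (assuming a machine with more than one GPU device), the launcher decorator spawns one worker process per device, and each worker can query its own rank:

import megengine.distributed as dist

@dist.launcher
def main():
    # Each spawned process is one worker bound to one GPU device
    print(f"worker {dist.get_rank()} of {dist.get_world_size()} started")

if __name__ == "__main__":
    main()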

Teacher Sun waved his hand grandly: “No problem, I’ll give you eight cards of excellent physique and strong performance. Let’s turn the firepower all the way up.”

Improve the quality of spiritual materials#

You are still indulging in the whimsy of multi-GPU magic when Shaoqing reminds everyone: “AlexNet also did data augmentation. We can try it too.”

Data augmentation#

As the saying goes, “the more you see, the more you know”: larger datasets usually lead to better model performance, so data augmentation is a very common preprocessing step. However, the ImageNet competition does not allow the use of other datasets, so what we can do is apply some random processing to the pictures in the original dataset, such as random translation, flipping and so on. To the computer, such pictures look different, and the random factors make each batch of data different. For example, the effect is as follows:

../../_images/chai-data-augmentation.png

MegEngine’s data.transform module implements common image data transformations, which can be applied when loading data:

train_dataloader = data.DataLoader(
    train_dataset,
    sampler=train_sampler,
    transform=T.Compose(
        [  # Baseline augmentation for small models
            T.RandomResizedCrop(224),
            T.RandomHorizontalFlip(),
            T.Normalize(
                mean=[103.530, 116.280, 123.675], std=[57.375, 57.120, 58.395]
            ),  # BGR
            T.ToMode("CHW"),
        ]
    ),
)

“Okay, this way the data is randomly cropped to 224×224 and randomly flipped horizontally as it is loaded.” Shaoqing quickly checked MegEngine’s API documentation and added these operations without fuss. At the same time, he normalized the data with Normalize and marked the channel order of the pictures. “Good habit, there is really something to learn here,” you silently gave him a thumbs up in your heart.

Data cleaning#

In addition to data augmentation, you also think of another possibility: “Is there a problem with the quality of the ImageNet dataset itself?”

The content and labeling quality of the data have a hard-to-remedy impact on the effectiveness of the model. Since ImageNet is essentially a dataset of web images, it contains a lot of dirty data: poor-quality image content, inconsistent formats (grayscale images mixed with color images), labeling errors, and so on. You become a data cleaning expert and try to clean this dirty data manually, but the efficiency is too low, so you rationally give up.

See also

In fact, there are already many good data cleaning tools that can help us do similar work. For example, cleanlab <cleanlab/cleanlab> can help us find wrong labels in a dataset; its authors report finding close to 100,000 label errors in the ImageNet dataset!

../../_images/imagenet_train_label_errors.jpg

Top label issues in the 2012 ILSVRC ImageNet train set identified using cleanlab. Label Errors are boxed in red. Ontological issues in green. Multi-label images in blue.#

In industry, some machine learning teams have dedicated data teams responsible for improving dataset quality. Long live the cooperation!

Secret high-end pill recipe#

After you and Shaoqing finished the data-related processing, a quick comparison of the experimental results showed that it did gain a few points. Effective! But this only puts you on the same starting line as everyone else; after all, similar operations are available to every team, and there is nothing special about them. So everyone began to study the differences between the AlexNet and LeNet5 models. You found that AlexNet uses Dropout to prevent overfitting. Its idea is very simple: in each iteration, randomly “drop” some neurons by making them inactive so that they do not participate in the computation. In other words, only a part of the neurons are trained in each step, which in the end is equivalent to bagging many weak classifiers that work together.
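As a minimal sketch (the layer sizes are made up, loosely following AlexNet’s classifier head), a Dropout layer can be placed between fully connected layers:

import megengine.module as M

# Hypothetical classifier head: Dropout randomly zeroes activations
# with probability 0.5, during training only
classifier = M.Sequential(
    M.Linear(4096, 4096),
    M.ReLU(),
    M.Dropout(drop_prob=0.5),
    M.Linear(4096, 1000),
)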

“But convolutional layers have far fewer neurons than fully connected layers, so the benefit of using Dropout on them is not so obvious,” Kaiming murmured.

Batch Normalization#

Everyone decided to keep thinking at the level of data features. When inputting data, a transform.Normalize normalization is usually performed, which makes the feature distribution of the data more uniform and speeds up convergence. However, there are inevitably differences between samples, and parameter updates in a hidden layer cause the distribution of its outputs to change; as the number of layers grows, this shift becomes more and more severe. “If the essence of a neural network is to learn the distribution of the data, then whenever a hidden layer’s computation changes that distribution, the following layers have to learn the changed distribution, which may make training unstable and hard to converge,” Kaiming pointed out. At this point Xiangyu added: “I remember the VGGNet model structure goes as deep as 19 layers, but deepening it further makes it hard to get better results.” Then he showed the model structures from the VGGNet paper:

../../_images/vgg-config.png

VGGNet configuration, from the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”#

“Besides the deeper structure, the difference from AlexNet is that the convolution kernel_size is not as large: it is uniformly a fixed \(3 \times 3\) convolution kernel, with \(1 \times 1\) in some places.” You take a closer look and think back to the discussion about data just now.

At this point Teacher Sun offered a suggestion: “Since normalization is useful in preprocessing, wouldn’t it also help to normalize again after each layer changes the data?” Xiangyu immediately began writing code: after the Conv2d layer computes its output, the mean and variance are computed from the current batch of data and a Normalize operation is performed. He then added this interface to MegEngine, named BatchNorm2d.
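In a model definition this looks roughly like the following sketch (the channel numbers are made up): a BatchNorm2d layer placed right after each convolution:

import megengine.module as M

# Hypothetical Conv-BN-ReLU block: BN normalizes each channel
# with the mean/variance computed from the current batch
block = M.Sequential(
    M.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),
    M.BatchNorm2d(64),
    M.ReLU(),
)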

After adding the BN layer to VGGNet, although each epoch involves more computation, the whole training process converges faster, and in the end the score really does go up!

See also

In the VGG model code of BaseCls, you can see both the VGGNet with BN layers and the VGGNet model without them.

Warning

One thing BN and Dropout have in common: they behave one way during training and another way during actual evaluation and use. MegEngine’s Module provides train and eval methods to switch between training and evaluation modes.
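A minimal sketch of switching modes (the module here is made up):

import megengine.module as M

m = M.Sequential(M.Conv2d(3, 8, 3), M.BatchNorm2d(8), M.Dropout(0.5))
m.train()  # training mode: BN uses batch statistics, Dropout drops activations
m.eval()   # evaluation mode: BN uses running statistics, Dropout is a no-op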

Fantastic initialization strategy#

Everyone realizes that a neural network has to keep getting deeper if it wants to improve its capability, but the deeper it gets, the harder it is to make it converge; the improvement from BN is only a small step.

One day in the middle of the night, the MegEngine alchemy furnace WeChat group suddenly received a message from Xiangyu: “I have an idea!” Ten minutes later, a few more pictures arrived in the group: scribbled handwritten drafts. Xiangyu said: “I made some independence assumptions and derived a set of parameter initialization rules. Want to try it?” Sure enough, after using Xiangyu’s initialization strategy, the overall results got better. You were about to send an emoji to celebrate when Kaiming, Shaoqing, and Teacher Sun all poked Xiangyu’s “paper that hasn’t been read yet” with WeChat’s nudge feature, saying, “We haven’t slept yet.”

Then… strike while the iron is hot. Everyone resumed the research together, designed a new activation function, PReLU, to model nonlinear features, and derived a theoretically grounded initialization method. You applied the new method in the competition, and the result brought the error rate down to 4.94%. This surpassed human-level image classification (5.1% error rate), something Google failed to do in 2014: their best score was only 6.67%!

for m in self.modules():
    if isinstance(m, M.Conv2d):
        # MSRA (Kaiming) initialization for convolution weights
        M.init.msra_normal_(m.weight, mode="fan_out", nonlinearity="relu")
        if m.bias is not None:
            fan_in, _ = M.init.calculate_fan_in_and_fan_out(m.weight)
            bound = 1 / math.sqrt(fan_in)
            M.init.uniform_(m.bias, -bound, bound)
    elif isinstance(m, M.BatchNorm2d):
        # BN starts out as an identity transform: scale 1, shift 0
        M.init.ones_(m.weight)
        M.init.zeros_(m.bias)
    elif isinstance(m, M.Linear):
        M.init.msra_uniform_(m.weight, a=math.sqrt(5))
        if m.bias is not None:
            fan_in, _ = M.init.calculate_fan_in_and_fan_out(m.weight)
            bound = 1 / math.sqrt(fan_in)
            M.init.uniform_(m.bias, -bound, bound)

See also

The relevant implementations are msra_normal_ and the other functions in the init module.

The birth of residual connections#

Breaking the record is gratifying, but the challenge has slowly turned into an engineering problem. Xiangyu said seriously: “Actually, I am personally quite dissatisfied. Although we beat humans, it is more of a gimmick. We all know that these methods do not really work that well; they mainly rely on parameter tuning and model stacking.”

So Xiangyu replayed the competition again. He found that the GoogLeNet model with which Google won ImageNet 2014 was not very complicated, yet achieved a very high accuracy. “GoogLeNet may be the path the other models will have to take,” he said with a firm gaze. After several months of research, Xiangyu found that the essence of GoogLeNet is its 1×1 shortcut. “To put it bluntly, if you simplify it down to its simplest form, you find that GoogLeNet has only two paths: one is a 1×1, and the other is a 1×1 followed by a 3×3.”

../../_images/inception_module.png

Inception module in GoogLeNet, picture from “Going Deeper with Convolutions”#

“What is it that supports such high performance of GoogLeNet at very low complexity?”

Xiangyu conjectured that its performance is determined by its depth, and that for the 22-layer GoogLeNet network to be trained successfully, it must have a short enough direct path. Based on this idea, Xiangyu began to design a model, using a structural unit to divide it up repeatedly. Although the structure of such a model becomes very complicated, no matter how complicated it is there is always a short path through it, while the depth can be very deep. Xiangyu found his teammates and shared this view: “I think this structure can maintain sufficient accuracy and is also easy to train. I call this network a fractal network.”

But Kaiming felt that this structure was still too complicated: “Complex things often fail to capture the essence.”

Kaiming suggested simplifying the model further and using its most reduced form. So Xiangyu extended the earlier assumption: “The shortest path determines how easy the model is to optimize; the longest path determines the model’s capacity. So can we make the shortest path as short as zero layers, while letting the longest path grow arbitrarily deep?” Following this idea, a structure containing a path with no parameters at all, whose shortest path can be regarded as zero layers deep, was born: ResNet.

../../_images/residual-module.jpg

Residual connection in ResNet, picture from “Deep Residual Learning for Image Recognition”#

The teammates decided to let you implement the model structure of ResNet, and you put the implementation code in official/vision/classification/resnet/model.py.

The forward computation logic of a residual connection is concise and elegant, and the backward differentiation is completed automatically by autodiff:

def forward(self, x):
    identity = x  # keep the input as the shortcut branch
    x = self.conv1(x)
    x = self.bn1(x)
    x = F.relu(x)
    x = self.conv2(x)
    x = self.bn2(x)
    identity = self.downsample(identity)  # match shapes when stride/channels change
    x += identity  # the residual connection: add the shortcut back in
    x = F.relu(x)
    return x

With this basic structure, you start trying to deepen the network and train it to see if it works as expected.

18 layers, 34 layers… Compared with the plain structure, the Top-1 error rate on the validation set drops by 3.5%, and with residual connections the network converges and can keep being deepened! 50 layers, 101 layers, and the results keep getting better! In the end you stop at a 152-layer structure… and submit the test results to the competition platform.

The final results of the ILSVRC’15 classification challenge are out: the Top-5 error of BN-Inception (Google’s improved GoogLeNet) on the ImageNet test set is 4.82%, while ResNet’s final score is 3.57%! Morale soars: using ResNet, your team wins first place in five tracks. Under the guidance of Mr. Sun, you submit the ResNet paper to CVPR 2016 and win the best paper award!

“Mr. He Kaiming’s research ideas have inspired me a lot: distilling the essential, truly effective attributes out of a complicated structure. This idea of extreme simplification is the core of ResNet, and it gives ResNet strong generalization ability: all kinds of modifications can be built on top of it, which in turn inspires other people’s research,” Xiangyu said. Later, Xiangyu and Mr. Sun went to a start-up company in China, determined to build the strongest computer vision research institute in the Eastern Hemisphere. Kaiming went to a technology giant in the United States to continue his own research. Shaoqing decided to join the field of autonomous driving… With the experience of this competition behind you, you now use MegEngine with ease; whether in research or engineering, you are full of confidence, and everyone has a bright future.

Improve fire control technology#

During the competition, you also used common tricks related to model optimization algorithms. For example, with the SGD optimizer you used the weight decay technique and introduced momentum (the meaning of these parameters can be found in the documentation):

# Optimizer: the learning rate is scaled linearly with the number of workers
opt = optim.SGD(
    model.parameters(),
    lr=args.lr * dist.get_world_size(),
    momentum=args.momentum,
    weight_decay=args.weight_decay,
)
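Combined with the gradient manager shown earlier, one training step might look like the following sketch (images and labels are assumed to come from the dataloader):

# One training step, reusing `gm`, `model` and `opt` from the snippets above
with gm:
    logits = model(images)
    loss = F.nn.cross_entropy(logits, labels)
    gm.backward(loss)
opt.step().clear_grad()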

When using these techniques, the heat is always hard to control, and the elixirs keep coming out grotesque. But the old ginger is hotter, as the saying goes: Teacher Sun had seen this situation countless times. He called out in a deep voice, “Tune!”, then used his inner strength to guide the pill as it slowly took shape in the flames of the furnace. The rest of the team was overjoyed. Mr. Sun wiped the sweat from his forehead, took out a copy of the “Gradient Descent Sunflower Manual”, and said: “Optimization algorithms have a great deal of practice and theory behind them. This is a survey; please study it when you have time.”

../../_images/loss-surface-optimization.gif
../../_images/sgd-saddle-point.gif

Postpartum care of elixirs#

In the field of deep learning, it is very common for researchers to reproduce experimental results from other people’s papers. If the original author of a paper does not provide the source code, others can only try to reproduce the results by themselves, and small differences in details may lead to huge differences in the final experimental results. Even when the source code for the model structure and the training and testing process is provided, it still takes considerable time for others to obtain the same model, and random state in some code may lead to different results. Therefore, to make it easier for others to reproduce and verify their experimental results, most people publish the model file along with the pretrained model parameter file.

See also

  • There are many cutting-edge experimental results with supporting code on the Papers with Code website; you can try to reproduce them with MegEngine for reference;

  • How to Save and Load Models (S&L) locally has already been introduced in detail in the user guide, so it won’t be repeated here (a minimal sketch follows this list);

  • To make it easy for users to quickly obtain pretrained models, MegEngine also provides the hub module, which can share and download trained models through GitHub/GitLab; refer to Use Hub to publish and load pre-trained models.
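As a quick reminder, a minimal save-and-load sketch (the file name is made up; see the user guide for details):

import megengine

# Save only the parameters (state dict) of a trained model
megengine.save(model.state_dict(), "checkpoint.pkl")

# Later: rebuild the same model structure, then restore the parameters
model.load_state_dict(megengine.load("checkpoint.pkl"))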

Let’s use the Hub to get the official pretrained ResNet18 model from MegEngine:

import cv2
import numpy as np

import megengine
import megengine.data.transform as T
import megengine.functional as F
import megengine.hub as hub


image_path = "/path/to/example.jpg"  # Select the same image to read
image = cv2.imread(image_path)

transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.Normalize(mean=[103.530, 116.280, 123.675],
                std=[57.375, 57.120, 58.395]),
    T.ToMode("CHW"),
])

model = hub.load('megengine/models', 'resnet18', pretrained=True)
model.eval()

processed_img = transform.apply(image)  # Still NumPy ndarray here
processed_img = F.expand_dims(megengine.Tensor(processed_img), 0)  # -> 1CHW
logits = model(processed_img)
probs = F.softmax(logits)
print(probs)
A few points to note:

  • The input data needs the corresponding preprocessing (and the channel order must be consistent);

  • Remember to switch the loaded pre-trained model to evaluation mode through the eval interface;

  • The input accepted by the model should be Tensor input (for some models, ndarray will be automatically converted to Tensor).

This allows us to load and use someone else’s pretrained ResNet model to make probabilistic predictions on 1000 classes.
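To turn the probabilities into readable predictions, here is a sketch of picking the five most likely classes, continuing from the variables above:

# Take the five most probable classes from `probs` above
top_probs, top_classes = F.topk(probs, k=5, descending=True)
print(top_classes, top_probs)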

Summary: Welcome to the world of deep learning#

This set of tutorials is now over, and our learning journey has come to an end for the moment. By now, you should have mastered the basic concepts of machine learning and deep learning, and be able to use MegEngine to write neural network code for simple scenarios as well as read other people’s work written with MegEngine. It may still take some time to become fully at home with these concepts, but… welcome to the world of deep learning!

Here are some directions for you to explore next (depending on whether you are research-oriented, engineering-oriented, or well-rounded):

Pick a subfield to research, and use MegEngine to reproduce classic experiments

All the code in this set of tutorials revolves around the image classification task in computer vision; you can try to reproduce ResNet yourself. Now is the time to explore more types of computer vision tasks, such as object detection, instance segmentation, super-resolution, and so on; you can also try other traditional domains addressed with deep learning, such as natural language processing and recommender systems. Besides reproducing other people’s code, you can also take part in some competitions; doing more will give you more inspiration.

To try out cutting-edge ideas, you may find yourself actively extending MegEngine’s existing functionality; to save model training time, you may want a deeper understanding of MegEngine’s distributed training; and when computing resources are limited, you may learn about MegEngine’s recomputation technique to save GPU memory.

Understanding MLOps in Engineering Practice

Everything we’ve learned so far can only be considered the foundations of MegEngine model development. Reliably and efficiently deploying and maintaining machine learning models in production requires an understanding of the whole MLOps process. Take the inference process of the pretrained ResNet model above as an example: we used the same Python (+GPU) environment as for training, but the real goal is to deploy the model to other platforms and devices so it can perform highly efficient inference under constrained conditions. See Model Deployment Overview and Process Recommendations for more details.

To serve real business scenarios, we will come into contact with more of MegEngine’s inference-related functionality.

From zero basics to MegEngine newcomer, this has been an unforgettable journey: we learned how to do the basic things right and how to think proactively. Becoming an experienced MegEnginer who consistently gets things right will take a long time of continued practice.

Extended material#

References#