Implementing Linear Classification with MegEngine#

What this tutorial covers

  • Gain a basic understanding of how images are encoded in computer vision, and of classification tasks in machine learning;

  • Think about what changes the linear regression model needs in order to solve a classification task, and which evaluation metrics should be used;

  • Following the introduction above, use MegEngine to implement a linear classifier that completes the MNIST handwritten digit classification task.

Visualization

There is some data visualization in this tutorial. If you have no experience with Matplotlib, just focus on the output.

Get the original dataset#

In the previous tutorial, we used a Scikit-learn interface to obtain the California housing data, and mentioned that many Python libraries and frameworks wrap interfaces for fetching common datasets, which is convenient for users. MegEngine is no exception: the data.dataset module provides ready-to-use dataset interfaces for obtaining original datasets, such as the MNIST dataset used in this tutorial.

from megengine.data.dataset import MNIST
from os.path import expanduser

DATA_PATH = expanduser("~/data/datasets/MNIST")

train_dataset = MNIST(DATA_PATH, train=True)
test_dataset = MNIST(DATA_PATH, train=False)

Different from the data format obtained in the previous tutorial, what we get here are the already split training set train_dataset and test set test_dataset, both wrapped as MegEngine Dataset objects that can be fed to a DataLoader (for details, see Use Dataset to define a dataset). Each element is a tuple \((X_i, y_i)\) consisting of a single sample \(X_i\) and its label \(y_i\):

>>> type(train_dataset[0])
tuple

To make the following discussion easier, we deliberately split the training data into samples and labels:

import numpy as np

X_train, y_train = map(np.array, train_dataset[:])
>>> print(X_train.shape, y_train.shape)
(60000, 28, 28, 1) (60000,)

Compute the mean and standard deviation statistics of the training samples, which will be used to normalize the data during preprocessing:

>>> mean, std = X_train.mean(), X_train.std()
>>> print(mean, std)
33.318421449829934 78.56748998339798
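
These statistics are later reused for preprocessing; below is a minimal sketch of the normalization itself (the exercise at the end of this tutorial applies the same statistics through data.transform.Normalize instead, so this snippet is only for illustration):

X_normalized = (X_train - mean) / std
print(X_normalized.mean(), X_normalized.std())   # approximately 0 and 1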

Note

For classic datasets such as MNIST, these statistics are usually provided on the dataset homepage, or you can find data that others have already calculated on the Internet with the help of search engines.

Learn about dataset information#

Recall that in the last tutorial, the housing sample data (before splitting) had a shape of \((20640, 8)\), meaning there were 20640 samples in total, each with 8 attribute values; in other words, the feature vector has dimension 8. The training data here has shape \((60000, 28, 28, 1)\): if 60000 is the total number of training samples, what does the trailing \((28, 28, 1)\) stand for?

Unstructured data: images#

In the California housing dataset, we could describe the characteristics of a house with a feature vector \(\mathrm{x} \in \mathbb{R}^d\), and the entire dataset could be expressed and stored as a two-dimensional table (a matrix). We call such data structured data. In real life, however, humans mostly encounter and understand unstructured data, such as video, audio, and images.

How do computers understand, represent, and store such data? This tutorial illustrates it with a sample handwritten digit image:

>>> import matplotlib.pyplot as plt
>>> idx = 28204
>>> plt.title("The label is: %s" % y_train[idx])
>>> plt.imshow(X_train[idx].squeeze(), cmap="gray")
>>> plt.show()
../../_images/mnist-single-example.png

The images in the MNIST dataset are bitmaps (also called raster images, as opposed to vector graphics); each image is composed of many points called pixels (picture elements). Taking the sample X_train[idx], labelled 3, for observation, we find that its height and width are both 28 pixels, corresponding to the single-sample shape \((28, 28, 1)\); the last dimension is the number of channels. Since MNIST images are grayscale, the number of channels is 1. For color images in the RGB color model the number of channels is 3, and we will see such data in the next tutorial.

Now let’s visualize this grayscale image in 3 dimensions to help develop an intuitive understanding:

>>> from mpl_toolkits.mplot3d import Axes3D
>>> ax = plt.axes(projection='3d')
>>> ax.set_zlim(-10, 255)
>>> ax.view_init(elev=45, azim=10)
>>> X, Y = np.meshgrid(np.arange(28), np.arange(28))
>>> Z = np.squeeze(X_train[idx])
>>> surf = ax.plot_surface(Y, X, Z, cmap="gray")
>>> plt.colorbar(surf, shrink=0.5, aspect=8)
>>> plt.show()
../../_images/mnist-single-example-3d.png

In a grayscale image, each pixel value lies between 0 (black) and 255 (white), which is exactly the range an unsigned 8-bit integer (uint8) can represent.
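
If you want to verify this on the data itself, a quick check of the raw arrays (the exact dtype may depend on the dataset version; the comment shows the values typically expected):

print(X_train.dtype, X_train.min(), X_train.max())   # e.g. uint8 0 255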

Data of shape \((60000, 28, 28, 1)\) is said to be in the NHWC layout; we will see more layout formats later.
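
As a small preview (not needed in this tutorial), switching between layouts is just an axis permutation; a minimal NumPy sketch converting the NHWC training data from above to NCHW:

nchw = X_train.transpose(0, 3, 1, 2)   # move the channel axis before height and width
print(X_train.shape, nchw.shape)       # (60000, 28, 28, 1) (60000, 1, 28, 28)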

Simple processing of image features#

Differences in features between different types of data

  • In the previous tutorial, the input to the linear model was a single feature vector \(\mathrm{x} \in \mathbb{R}^d\);

  • Whereas the feature space of an image \(\mathsf{I}\) in the MNIST dataset here is \(\mathbb{R}^{H \times W \times C}\).

To meet the linear model's requirements on the input shape, the features of the input samples need some processing. The simplest and crudest way to implement \(\mathbb{R}^{H \times W \times C} \mapsto \mathbb{R}^d\) without losing the numerical information of any element is the flatten operation:

The original image

original_image = np.linspace(0, 256, 9).reshape((3, 3))
plt.axis('off')
plt.title(f"Original Image {original_image.shape}:")
plt.imshow(original_image, cmap="gray")
plt.show()
../../_images/flatten-before.png

After Flatten processing

flattened_image = original_image.flatten()
plt.axis('off')
plt.title(f"Flattened Image {flattened_image.shape}:")
plt.imshow(np.expand_dims(flattened_image, 0), cmap="gray")
plt.show()
../../_images/flatten-after.png

The corresponding interfaces in MegEngine are Tensor.flatten and functional.flatten. As with the expand_dims and squeeze operations mentioned in the previous tutorial, the flatten operation can also be implemented with reshape, but for the sake of code semantics you should prefer flatten over reshape.
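
A small sketch of this equivalence; both calls produce the same result, and the difference lies only in how clearly the intent is expressed:

from megengine import Tensor
import megengine.functional as F
import numpy as np

x = Tensor(np.random.random((28, 28, 1)).astype("float32"))
a = F.flatten(x)         # flatten all axes: shape (784,)
b = x.reshape(-1)        # same result, but the intent is less explicit
print(a.shape, b.shape)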

The flatten operation in MegEngine can be applied directly to multiple samples (in order to support vectorized computation):

import megengine.functional as F

x = F.ones((10, 28, 28, 1))
out = F.flatten(x, start_axis=1, end_axis=-1)
>>> print(x.shape, out.shape)
(10, 28, 28, 1) (10, 784)

After this simple processing, the flattened features can be connected with the single-sample linear prediction model \(\hat{y} = \boldsymbol{w} \cdot \boldsymbol{x}+b\) from the previous tutorial.

The output form of the classification task#

In a linear regression task, the predicted output \(\hat{y}\) is continuous over the real numbers \(\mathbb R\). In classification tasks, the labels are usually given as a set of discrete values, such as \(y \in \left\{ 0, 1, \ldots , 9 \right\}\) here. How can we make the model's predicted output consistent with the form of the labels?

Let’s simplify the problem form first, starting with the simplest classification case, the binary classification task.

Binary classification task#

Suppose our handwritten digit classification task is simplified so that the labels contain only 0 and 1, i.e. \(y \in \left\{ 0, 1 \right\}\), where 1 means the picture is the handwritten digit 0 and 0 means it is not. For discrete labels, we can introduce a non-linear link function \(g(\cdot)\) that maps the linear output to categories. For example, we can split the output of \(f(\boldsymbol{x})=\boldsymbol{x} \cdot \boldsymbol{w}+b\) with 0 as the threshold: any sample whose output is greater than 0 is predicted to be the handwritten digit 0, and any sample whose output is less than 0 is predicted not to be. Such a prediction model can be written as:

\[\begin{split}\hat{y}=\mathbb{I}(f(\boldsymbol{x})-\text { threshold })= \left\{\begin{array}{lll} 1 & \text { if } & f(\boldsymbol{x})>\text { threshold } \\ 0 & \text { if } & f(\boldsymbol{x})<\text { threshold } \end{array}\right.\end{split}\]

Here \(\mathbb I\) is the indicator function (also called the characteristic function), which serves as the link function in this case, but it is rarely used in practice. There are several reasons: for optimization problems its mathematical properties are poor and it is not suitable for the gradient descent algorithm; and for linear outputs of -100 and -1000 this piecewise function assigns both samples to the class marked 0, without reflecting the difference between them: although both are predicted not to be a 0, the sample with output -1000 should look less like a 0 than the one with output -100 (the former might be a 1, the latter a 6).
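
To make this concrete, here is a tiny NumPy sketch of such an indicator-function classifier; both of the linear outputs mentioned above collapse into the same class, which is exactly the drawback described:

import numpy as np

def hard_classify(f_x, threshold=0.0):
    # Indicator function: predict 1 ("is a handwritten 0") when f(x) > threshold
    return (np.asarray(f_x) > threshold).astype(np.int32)

print(hard_classify([-100.0, -1000.0]))   # [0 0] -- the difference between them is lost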

In practice, the link function we use more often is the Sigmoid function \(\sigma(\cdot)\) (also called the Logistic function):

\[\sigma(x)=\frac{1}{1+\exp (-x)}\]
def sigmoid(x):
    return 1. / (1. + np.exp(-x))

x = np.arange(-10, 10, 0.2)
y = sigmoid(x)
plt.plot(x, y)
plt.show()

where \(\exp\) refers to the exponential function with the natural constant \(e\) as its base.

../../_images/sigmoid.png

We found that for binary classification tasks, the sigmoid function has the following advantages:

  • Easy to differentiate: \(\sigma^{\prime}(x)=\sigma(x)(1-\sigma(x))\) (checked numerically in the sketch after this list);

  • The output of the linear calculation part is mapped to the range of \((0, 1)\) and can be expressed as a probability.
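
The differentiation property in the first bullet can be checked numerically with a central difference, as in this small self-contained sketch:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central-difference approximation
analytic = sigmoid(x) * (1 - sigmoid(x))                # sigma'(x) = sigma(x) * (1 - sigma(x))
print(np.allclose(numeric, analytic, atol=1e-6))        # True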

We can then express the probability that sample \(\boldsymbol{x}\) is predicted to belong to a certain class (assuming class 1 here) as:

\[p(y=1 \mid \boldsymbol{x})=\sigma(f(\boldsymbol{x}))=\frac{1}{1+\exp (-f(\boldsymbol{x}))}\]

Continuing the example of linear model outputs given above: after the Sigmoid mapping, \(\sigma(-1000)\) is closer to 0 than \(\sigma(-100)\), indicating that the former sample is more likely not a 0. In this way the differences between the predictions for different samples are reflected effectively. At the same time, for samples that are already correctly classified, the sigmoid saturates, so stretching the linear model's output range ever larger brings little benefit.

Warning

The final output of binary logistic regression is a single value, which represents the probability of predicting one particular class. Next we will introduce multi-class logistic regression (multinomial logistic regression).

Multi-classification tasks#

Converting the output of a classification model into a probabilistic representation over the target classes is a very common (though not the only) approach. The final output of logistic regression is a single probability \(p\), representing the probability of one class; the probability of the other class is simply \(1-p\). Generalizing to multi-class tasks, how do we transform the output of a linear model into probabilistic predictions over multiple classes?

One-hot encoding of labels

Astute readers may have noticed that the labels in the MNIST dataset are all scalars, so some additional processing is required to turn them into probability vectors over the classes. The common practice is One-hot encoding. For example, if two handwritten digit samples carry the labels 3 and 5, One-hot encoding (the corresponding interface is functional.nn.one_hot) yields the following representation:

from megengine import Tensor
import megengine.functional as F

inp = Tensor([3, 5])  # two labels
out = F.nn.one_hot(inp, num_classes=10)
>>> out.numpy()
array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]], dtype=int32)

After this processing, the label of each sample becomes a 10-class probability vector that can be compared with the model prediction output.

Recalling the form of linear regression: for a single sample \(\boldsymbol x \in \mathbb{R}^d\), with the help of the weight vector \(\boldsymbol{w} \in \mathbb{R}^d\) and the bias \(b\), we obtain a linear output \(\boldsymbol {w} \cdot \boldsymbol {x} + b\). We can now regard this output as a score \(s\), i.e. the confidence with which the current sample is predicted to belong to a certain class. For a \(c\)-class classification task we want \(c\) such scores, which can be obtained with the help of matrix operations:

\[\begin{split}\left[\begin{array}{c} -\boldsymbol{w}_{1}^{T}- \\ -\boldsymbol{w}_{2}^{T}- \\ \vdots \\ -\boldsymbol{w}_{c}^{T}- \end{array}\right] \left[\begin{array}{c} x_{1} \\ x_{2} \\ \vdots \\ x_{d} \end{array}\right] + \left[\begin{array}{c} b_{1} \\ b_{2} \\ \vdots \\ b_{c} \end{array}\right] = \left[\begin{array}{c} \boldsymbol{w}_{1}^{T} \boldsymbol{x} + b_1 \\ \boldsymbol{w}_{2}^{T} \boldsymbol{x} + b_2 \\ \vdots \\ \boldsymbol{w}_{c}^{T} \boldsymbol{x} + b_c \end{array}\right] = \left[\begin{array}{c} s_{1} \\ s_{2} \\ \vdots \\ s_{c} \end{array}\right]\end{split}\]

Note that the matrix form from linear algebra is written here for ease of understanding; in actual computation it is not necessary to promote vectors to matrices.

Let’s simply verify that the output score is a 10-dimensional vector (using the matrix@vector operation in the previous tutorial):

num_features = 784  # Flatten(28, 28, 1)
num_classes = 10

x = X_train[0].flatten()
W = np.zeros((num_classes, num_features))
b = np.zeros((num_classes,))
s = W @ x + b
>>> s.shape
(10,)

We can understand this intuitively: a \(d\)-dimensional feature vector is transformed by a weight matrix and a bias into a \(c\)-dimensional score vector.

The sigmoid function maps a single output from the real domain \(\mathbb R\) to the probability interval \((0,1)\). For a score vector, applying the sigmoid to each class score does yield multiple probability values, but the problem is that these probabilities do not sum to 1, so another approach is needed.
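
A quick numeric illustration of this problem, reusing the sigmoid function defined earlier on a 4-class score vector:

scores = np.array([10.0, -10.0, 30.0, 20.0])
per_class = sigmoid(scores)    # element-wise sigmoid of each class score
print(per_class.sum())         # clearly not 1, so this is not a probability distribution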

The easiest idea is to imitate the One-hot representation: set the probability at the maximum position (corresponding to the interface functional.argmax) to 1 and all other positions to 0. For example, \((10, -10, 30, 20)\) becomes \((0, 0, 1, 0)\). This \(\operatorname{Argmax}\) approach is a hard classification and has the same problems as the indicator function \(\mathbb{I}(\cdot)\) seen earlier in the binary classification task: its mathematical properties are relatively poor, the differences between samples within a class and between classes cannot be reflected, and so on.
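
The hard "argmax as one-hot" mapping described above can be sketched in a few lines of NumPy:

scores = np.array([10.0, -10.0, 30.0, 20.0])
hard = np.zeros_like(scores)
hard[np.argmax(scores)] = 1.0
print(hard)                    # [0. 0. 1. 0.]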

In multi-classification tasks, the more commonly used function is \(\operatorname{Softmax}\) (corresponding to interface functional.nn.softmax ):

\[p(y=c \mid \boldsymbol{x}) =\operatorname{Softmax}\left(s_{c}\right) =\frac{\exp s_{c}}{\sum_{i} \exp s_{i}}\]
  • It can be understood as follows: the scores over the classes (usually called logits) are exponentially normalized to obtain the target class probabilities;

  • This exponential form of normalization has some advantages. Compared with mean normalization it has a "Matthew effect": a larger original value also gets a larger probability after normalization. Compared with the hard \(\operatorname{Argmax}\) classification it is softer, reflecting the differences between the predicted values of samples within and between classes. It also makes it easy to find the Top-k classification candidates, i.e. the \(k\) classes with the highest probabilities, which is useful for evaluating model performance (a small sketch follows the example below).

>>> score = Tensor([1, 2, 3, 4])
>>> F.softmax(score)
Tensor([0.0321 0.0871 0.2369 0.6439], device=xpux:0)
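
The softmax output is a valid probability distribution (it sums to 1), and the Top-k candidates mentioned above can be extracted directly; the sketch below assumes a functional.topk interface that returns values and indices and accepts a descending flag:

probs = F.softmax(Tensor([1., 2., 3., 4.]))
print(probs.sum())                                     # approximately 1
values, indices = F.topk(probs, k=2, descending=True)  # assumed signature
print(indices)                                         # the two most probable classes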

At this point, we’ve been able to get our model to output a probability vector of predictions on the target class based on the input samples.

Optimization Objectives for Classification Tasks#

We have obtained the probability vector \(\hat{\boldsymbol{y}}\) predicted by the model over the classes, and, by One-hot encoding the real label, the probability vector \(\boldsymbol{y}\), which can likewise be viewed as a probability distribution. Our optimization goal is to make the predicted values as close as possible to the ground truth, so we need to design a suitable loss function.

Relative Entropy and Cross Entropy in Information Theory

In information theory, if there are two separate probability distributions \(p(x)\) and \(q(x)\) for the same random variable \(x\), the relative entropy (KL divergence) can be used to measure the difference between the two distributions:

\[\begin{split}\begin{aligned} \mathrm{KL}(p \| q) &=-\int p(x) \ln q(x) d x-\left(-\int p(x) \ln p(x) dx\right ) \\ &=H(p, q)-H(p) \end{aligned}\end{split}\]

The characteristic of relative entropy is that when two probability distributions are exactly the same, its value is zero. The greater the difference between the two distributions, the greater the relative entropy value.

This coincides with our goal of designing a loss function that evaluates the difference between the probability distribution \(q(x)\) of the current prediction and the probability distribution \(p(x)\) of the true labels. Moreover, since the sample labels do not change during training, their distribution \(p(x)\) is a fixed constant, so the \(H(p)\) term in the formula above does not vary with the training samples and does not affect the gradient computation; it can therefore be omitted.

The remaining \(H(p, q)\) term is defined as the commonly used cross entropy (Cross Entropy, CE), which, in terms of the discrete probability values in our current example, is:

\[\ell_{\mathrm{CE}}=H(\boldsymbol{y}, \hat{\boldsymbol{y}})=-\sum_{i=1}^{c} y_{i} \ln \hat{y}_{i}\]

This is the loss function often used in classification tasks, corresponding to the functional.nn.cross_entropy interface in MegEngine.
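
As a consistency check, the cross entropy formula above can be evaluated by hand and compared with the built-in interface. This is only a sketch and assumes the default behavior of functional.nn.cross_entropy (logits as input, mean reduction):

from megengine import Tensor
import megengine.functional as F

logits = Tensor([[1.0, 2.0, 3.0, 4.0]])               # scores for one sample, 4 classes
label = Tensor([3])                                   # ground-truth class index

prob = F.softmax(logits, axis=1)                      # predicted distribution y_hat
onehot = F.nn.one_hot(label, num_classes=4).astype("float32")
manual = -(onehot * F.log(prob)).sum(axis=1).mean()   # -sum_i y_i * ln(y_hat_i)

builtin = F.nn.cross_entropy(logits, label)           # expects logits by default
print(manual.item(), builtin.item())                  # the two values should match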

Evaluation Metrics for Classification Tasks#

Commonly used evaluation metrics in regression problems are MAE, MSE, etc., but they are not suitable for classification tasks.

In this tutorial, we use Accuracy as the evaluation metric for handwritten digit image classification. It is defined as the proportion of correctly classified samples among all samples. For example, if 6,000 out of 10,000 samples are classified correctly, then the accuracy of the current classifier is 0.6, i.e. 60%.

For the model's multiple outputs over the classes, we usually use the argmax interface (functional.argmax) to take the label at the position of the largest value as the predicted class, and compare it with the ground-truth label (for a vectorized implementation over a batch of samples, the axis parameter usually needs to be specified):

logits = Tensor([0., 0., 0.3, 0.7, 0.,
                 0., 0., 0., 0., 0.])
pred = F.argmax(logits)
>>> pred
Tensor(3, dtype=int32, device=xpux:0)

For the outputs obtained on multiple samples, we can compare with the == operator; taking the prediction results on 5 samples as an example:

>>> pred = Tensor([0, 3, 2, 4, 5])
>>> label = Tensor([0, 3, 3, 5, 2])
>>> pred == label
Tensor([ True  True False False False], dtype=bool, device=xpux:0)

Then, with the help of the sum interface, we can get the number of correctly classified samples:

>>> (pred == label).sum().item()
2
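
Dividing the number of correct predictions by the total number of samples then gives the accuracy (here 2 out of 5):

>>> (pred == label).sum().item() / pred.shape[0]
0.4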

While training the linear classifier, the classification accuracy of the current classifier on the training and test samples can be checked after each epoch as a means of evaluating model performance. If, after many epochs, the accuracy on the training set keeps rising while the accuracy on the test set starts to drop, it may be a sign of overfitting.

See also

  • Accuracy is not the only evaluation metric for classification tasks, and we will touch on more evaluation metrics in later tutorials.

  • MegEngine also provides topk_accuracy to calculate Top-k accuracy; in this tutorial, in order to demonstrate the calculation process (corresponding to the case k=1), this interface is not used directly (a usage sketch follows this list).
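
For reference, a minimal usage sketch of that interface; the signature (logits, targets, and a topk argument) is assumed from the interface name and may differ slightly between versions:

from megengine import Tensor
import megengine.functional as F

logits = Tensor([[0.1, 0.2, 0.7],
                 [0.8, 0.1, 0.1]])
target = Tensor([2, 1])
# Assumed call; Top-1 accuracy here should be 0.5 (first sample correct, second wrong)
print(F.topk_accuracy(logits, target, topk=1))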

Exercise: Linear Classifier#

Combining all the concepts mentioned above, we now give a complete implementation of the linear classifier:

  • You can modify the corresponding parts of the code following the flow of the previous tutorial to get a feel for the whole logic;

  • Note again that in this tutorial the test set provided by MNIST is used as the validation set (despite being named "test");

  • For batched data, remember to use a vectorized implementation instead of a for loop over single samples.

import megengine
import megengine.data as data
import megengine.data.transform as T
import megengine.functional as F
import megengine.optimizer as optim
import megengine.autodiff as autodiff


# Get train and test dataset and prepare dataloader
from os.path import expanduser
DATA_PATH = expanduser("~/data/datasets/MNIST")

train_dataset = data.dataset.MNIST(DATA_PATH, train=True)
test_dataset = data.dataset.MNIST(DATA_PATH, train=False)

train_sampler = data.RandomSampler(train_dataset, batch_size=64)
test_sampler = data.SequentialSampler(test_dataset, batch_size=64)

transform = T.Normalize(33.318421449829934, 78.56748998339798)

train_dataloader = data.DataLoader(train_dataset, train_sampler, transform)
test_dataloader = data.DataLoader(test_dataset, test_sampler, transform)

num_features = train_dataset[0][0].size
num_classes = 10


# Parameter initialization
W = megengine.Parameter(F.zeros((num_features, num_classes)))
b = megengine.Parameter(F.zeros((num_classes,)))


# Define linear classification model
def linear_cls(X):
    return F.matmul(X, W) + b


# GradManager and Optimizer setting
gm = autodiff.GradManager().attach([W, b])
optimizer = optim.SGD([W, b], lr=0.01)


# Training and validation
nums_epoch = 5
for epoch in range(nums_epoch):
    training_loss = 0
    nums_train_correct, nums_train_example = 0, 0
    nums_val_correct, nums_val_example = 0, 0

    for step, (image, label) in enumerate(train_dataloader):
        image = F.flatten(megengine.Tensor(image), 1)
        label = megengine.Tensor(label)

        with gm:
            score = linear_cls(image)
            loss = F.nn.cross_entropy(score, label)
            gm.backward(loss)
            optimizer.step().clear_grad()

        training_loss += loss.item() * len(image)

        pred = F.argmax(score, axis=1)
        nums_train_correct += (pred == label).sum().item()
        nums_train_example += len(image)

    training_acc = nums_train_correct / nums_train_example
    training_loss /= nums_train_example

    for image, label in test_dataloader:
        image = F.flatten(megengine.Tensor(image), 1)
        label = megengine.Tensor(label)
        pred = F.argmax(linear_cls(image), axis=1)

        nums_val_correct += (pred == label).sum().item()
        nums_val_example += len(image)

    val_acc = nums_val_correct / nums_val_example

    print(f"Epoch = {epoch}, "
          f"train_loss = {training_loss:.3f}, "
          f"train_acc = {training_acc:.3f}, "
          f"val_acc = {val_acc:.3f}")

After 5 rounds of training, a linear classifier with an accuracy rate of about 92% is usually obtained.

See also

The corresponding source code for this tutorial: examples/beginner/linear-classification.py

Summary: the flow of data#

So far, we’ve done line fitting, house price prediction, and handwritten digit recognition tasks with linear models, so it’s time for a comparison.

| Task | Forward computation (single sample) | Loss function | Evaluation metric |
| --- | --- | --- | --- |
| Fitting a straight line | \(x \in \mathbb{R} \stackrel{w,b}{\mapsto} y \in \mathbb{R}\) | MSE | Observed fit |
| House price prediction | \(\boldsymbol{x} \in \mathbb{R}^{d} \stackrel{\boldsymbol{w},b}{\mapsto} y \in \mathbb{R}\) | MSE | MAE |
| Handwritten digit recognition | \(\boldsymbol{x} \in \mathbb{R}^{d} \stackrel{W,\boldsymbol{b}}{\mapsto} \boldsymbol{y} \in \mathbb{R}^{c}\) | CE | Accuracy |

Note: the image information is assumed to have been flattened into a vector beforehand, i.e. \(\mathsf{I} \in \mathbb{R}^{H \times W \times C} \stackrel{\operatorname{flatten}}{\mapsto} \boldsymbol{x} \in \mathbb{R}^{d}\).

When implementing this with MegEngine, we need to pay attention to how a single sample flows through the whole computation graph, especially how the Tensor shape changes (essentially these are common transformations from linear algebra); this is precisely what we design in the machine learning modeling process. Besides the linear model itself, the whole machine learning pipeline also requires designing an appropriate loss function and selecting scientific evaluation metrics.
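
The table can be traced concretely by following the Tensor shapes through the handwritten digit model; a minimal sketch mirroring the exercise code above:

import megengine
import megengine.functional as F

x = F.zeros((64, 28, 28, 1))              # a batch of 64 images in NHWC layout
x = F.flatten(x, 1)                       # -> (64, 784): one feature vector per image
W = megengine.Parameter(F.zeros((784, 10)))
b = megengine.Parameter(F.zeros((10,)))
score = F.matmul(x, W) + b                # -> (64, 10): one score (logit) per class
prob = F.softmax(score, axis=1)           # -> (64, 10): probabilities over 10 classes
print(x.shape, score.shape, prob.shape)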

In other respects, the practices of the different tasks are the same: for batched data, a vectorized implementation is used for speed; the gradient descent algorithm iteratively optimizes the parameters of the model; and model performance needs to be evaluated in time. Hopefully you now have a better understanding of these steps.

It seems that linear models can do everything?

The answer is no. The datasets in these tutorials are simple enough that even the simplest linear model can eventually achieve fairly good results. In the next tutorial, we will use a slightly more complex dataset (only a bit more complex than MNIST) on the same classification task, and see whether the linear model is still as powerful.

Extended material#

References#