Custom Op#

MegEngine provides a rich set of functions and modules for machine learning, neural networks, and tensor computation. In the course of developing models, however, researchers often design new operations, such as new neural network layers, so MegEngine needs to give users the ability to define such operations themselves.

In most cases, researchers can implement the functionality they need through MegEngine's Python interface by extending Function and Module. For users with higher performance requirements, MegEngine also provides Custom Op, a set of tools for quickly integrating custom C++/CUDA operators into MegEngine.

This document first walks through a simple example of writing a Custom Op and integrating it into MegEngine, and then introduces the interfaces in more detail.

Overall process#

Suppose we need to add an operator named MatMulScale to MegEngine. This operator first performs a matrix multiplication on its two input Tensors, lhs and rhs, and then multiplies the result of this matrix multiplication by a scalar scale.

The pseudocode of the operator's mathematical behavior is as follows:

def MatMulScale(lhs, rhs, scale):
    result = lhs.dot(rhs)
    result = result * scale
    return result

For such an operation, suppose we have already written a CUDA kernel for it and exposed the following interface function to call it:

void matmul_scale(const float *lhs, const float *rhs, float *result, size_t M, size_t K, size_t N, float scale);

Among these parameters, lhs, rhs, and result are float pointers representing the two input Tensors and the one output Tensor of this Op; each of them must point to a block of CUDA memory that has already been allocated. M, K, and N are the matrix dimensions, meaning an M*K matrix is multiplied by a K*N matrix. scale is the coefficient by which the result of the matrix multiplication is multiplied.
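
For reference, one naive way such a kernel might be implemented in matmul_scale.cu is sketched below; the kernel body and launch configuration are illustrative only, and any correct CUDA implementation of the interface above would work the same way:

__global__ void matmul_scale_kernel(const float *lhs, const float *rhs, float *result,
                                    size_t M, size_t K, size_t N, float scale) {
    size_t row = blockIdx.y * blockDim.y + threadIdx.y;   // row index in result
    size_t col = blockIdx.x * blockDim.x + threadIdx.x;   // column index in result
    if (row < M && col < N) {
        float acc = 0.f;
        for (size_t k = 0; k < K; ++k)
            acc += lhs[row * K + k] * rhs[k * N + col];
        result[row * N + col] = acc * scale;
    }
}

void matmul_scale(const float *lhs, const float *rhs, float *result,
                  size_t M, size_t K, size_t N, float scale) {
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    matmul_scale_kernel<<<grid, block>>>(lhs, rhs, result, M, K, N, scale);
}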

Based on this interface, we can write the following C++ code to encapsulate it into a MegEngine Op.

#include "megbrain/custom/custom.h"

CUSTOM_OP_REG_BEGIN(MatMulScale)

void shape_infer(const std::vector<Shape> &inputs, const Param &params, std::vector<Shape> &outputs) {
    outputs[0] = {inputs[0][0], inputs[1][1]};
}

void compute(const std::vector<Tensor> &inputs, const Param &params, std::vector<Tensor> &outputs) {
    matmul_scale(                       // interface function that calls the kernel
        inputs[0].data<float>(),        // lhs
        inputs[1].data<float>(),        // rhs
        outputs[0].data<float>(),       // result
        inputs[0].shape()[0],           // M
        inputs[0].shape()[1],           // K
        inputs[1].shape()[1],           // N
        params["scale"].as<float>()     // scale
    );
}

CUSTOM_OP_REG(MatMulScale)              // define an Op named MatMulScale
    .add_inputs(2)                      // two input Tensors
    .add_outputs(1)                     // one output Tensor
    .add_param("scale", 1.0f)           // a Param named scale, with a default value of 1.0f
    .set_shape_infer(shape_infer)       // set the shape inference function of this Op
    .set_compute("cuda", compute);      // set the compute function of this Op on cuda

CUSTOM_OP_REG_END(MatMulScale)

This code first includes the Custom Op header file, and then uses the two macros CUSTOM_OP_REG_BEGIN() and CUSTOM_OP_REG_END() to open a scope in which the body of the Custom Op is written. The body consists of two parts. The first part defines some functions, namely the attribute inference functions of the output Tensors and the compute function: the former derive the attributes of the output Tensors (such as shape) from the attributes of the input Tensors, while the latter calls the CUDA kernel to perform the actual computation. The second part is the registration of the Op, which declares how many input and output Tensors and how many Params the Op has, and registers the attribute inference functions and compute function defined above with the Op.

Afterwards we can use build_and_load, the compile-and-load helper provided by Custom Op, to compile the CUDA kernel together with the C++ file above into a single library. In Python, we can write the following code to complete the compilation and loading:

from megengine.core._imperative_rt.core2 import apply
from megengine.core.ops import custom
from megengine.utils import custom_op_tools
from megengine.tensor import Tensor
import numpy as np

# compile the cpp/cu files we wrote into a .so and load it
custom_op_tools.build_and_load("matmul_scale", ["matmul_scale.cpp", "matmul_scale.cu"])
op = custom.MatMulScale(scale=0.1)   # custom.your_op_name, i.e. the name of the Op we defined in C++
lhs = Tensor(np.random.uniform(size=(128, 256)))
rhs = Tensor(np.random.uniform(size=(256, 512)))
result = apply(op, lhs, rhs)

Of course, we can also combine a Custom Op with existing MegEngine Python components such as autodiff.Function and module.Module to support training and to build it into a larger model:

from megengine.autodiff import Function
from megengine.module import Module
from megengine import Parameter

class MatMulScaleFunc(Function):        # wrap the Op we defined into an autodiff.Function to support backward training
    def __init__(self, scale):
        super().__init__()
        self.scale = scale

    def forward(self, lhs, rhs):
        self.lhs = lhs
        self.rhs = rhs
        op = custom.MatMulScale(scale=self.scale)   # custom.your_op_name, i.e. the name of the Op we defined in C++
        return apply(op, lhs, rhs)

    def backward(self, ograd):                              # assume we have defined another Custom Op, MatMulScaleBackward,
        op = custom.MatMulScaleBackward(scale=self.scale)   # which computes the backward of MatMulScale; its C++ code is omitted for brevity
        return apply(op, ograd, self.lhs, self.rhs)

class MatMulScaleModule(Module):                            # further wrap the autodiff.Function above into a Module
    def __init__(self, ic, oc, scale, **kwargs):
        super().__init__(**kwargs)
        self.scale = scale
        self.weight = Parameter(np.zeros(shape=(ic, oc), dtype=np.float32))
        self.func = MatMulScaleFunc(scale=scale)

    def forward(self, inp):
        return self.func(inp, self.weight)

Interface introduction#

Attribute inference function#

The output Tensor attribute inference of a Custom Op derives the attributes of the output Tensors from the attributes of the input Tensors (Shape, DType, Device) and the Op's Params. Here Shape describes the dimensions of a Tensor, DType its data type, and Device the device (cpu/gpu) on which the Tensor resides. For example, in a convolution we can compute the Shape of the output Tensor from the Shape of the input Tensor together with parameters such as stride and padding.

Currently these derivations need to be provided by the user as C++ functions, whose signatures (i.e. the types of the parameters and the return value) are fixed, as follows:

void(*)(const std::vector<Device>&, const Param&, std::vector<Device>&);    // device infer
void(*)(const std::vector<Shape>&,  const Param&, std::vector<Shape>&);     // shape infer
void(*)(const std::vector<DType>&,  const Param&, std::vector<DType>&);     // dtype infer

When writing the attribute inference functions of our own Custom Op, we must make sure their signatures match the corresponding signatures above. These signatures are all very similar. Taking Shape inference as an example, the function receives the Shapes of the input Tensors and the Op's Param, together with a reference to the output Shapes; the lengths of the two vectors are the number of input Tensors and the number of output Tensors, respectively. Inside the function, we compute the Shapes of the output Tensors and assign them to the corresponding references.
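
For instance, the Shape inference of the convolution example mentioned above might be written roughly as follows; this is only a sketch, and the param names (stride, padding), the NCHW input layout, and the (OC, IC, KH, KW) weight layout are illustrative assumptions rather than fixed MegEngine conventions:

void conv_shape_infer(const std::vector<Shape> &inputs, const Param &params, std::vector<Shape> &outputs) {
    // inputs[0]: feature map (N, C, H, W); inputs[1]: weight (OC, IC, KH, KW)
    size_t stride  = params["stride"].as<int32_t>();
    size_t padding = params["padding"].as<int32_t>();
    size_t oh = (inputs[0][2] + 2 * padding - inputs[1][2]) / stride + 1;
    size_t ow = (inputs[0][3] + 2 * padding - inputs[1][3]) / stride + 1;
    outputs[0] = {inputs[0][0], inputs[1][0], oh, ow};
}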

Device

Currently, the device types supported by Custom Op's Device include x86 and cuda. A Device can be used much like a string; here are several use cases of Device:

Device device = "x86";                  // create a device of type x86
device = "cuda";                        // change the device type to cuda
bool equal = (device == "cuda");        // check whether the device is cuda
std::string device_str = device.str();  // get the readable string representation of the device

Custom Op also provides a default behavior for inferring the Device of the output Tensors: the Device of every output Tensor equals the Device of the 0th input Tensor, and if there is no input Tensor, the Device of all output Tensors is x86. In the MatMulScale example above we did not define a Device inference function, so it uses this default behavior.
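
If we did want to set it explicitly, a Device inference function equivalent to this default could look roughly like the following sketch (assuming the Op has at least one input Tensor):

void device_infer(const std::vector<Device> &inputs, const Param &params, std::vector<Device> &outputs) {
    for (auto &device : outputs)
        device = inputs[0];     // every output lives on the same device as input 0
}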

DType

Currently, the data types supported by Custom Op's DType include float16, bfloat16, float32, uint8, int8, int16, uint16, int32, as well as the four quantized types qint8, quint8, qint16, and qint32. Among them, quint8 is an asymmetrically quantized data type, while the other three are symmetrically quantized. A DType can also be used much like a string; here are a few use cases of DType:

DType dtype1 = "float32", dtype2 = "int8";  // define two dtypes
bool equal = (dtype1 == dtype2);            // check whether the two dtypes are equal
dtype1 = "int16";                           // change the data type of dtype1
std::string dtype_str = dtype1.str();       // get the readable string representation of dtype1

DType dtype3("qint8", 0.32);                // create a symmetric 8-bit quantized data type with scale 0.32
DType dtype4("quint8", 0.32, 32);           // create an asymmetric 8-bit quantized data type with scale 0.32 and zero_point 32

float scale = dtype3.scale();               // get the scale of dtype3
uint8_t zero_point = dtype4.zero_point();   // get the zero_point of dtype4

Similar to Device, Custom Op provides a default behavior for inferring the DType of the output Tensors: the DType of every output Tensor equals the DType of the 0th input Tensor, and if there is no input Tensor, the DType of all output Tensors is float32. In the MatMulScale example above we did not define a DType inference function either, so it also uses this default behavior.

Shape

In Custom Op, a Shape can be constructed and used much like a std::vector or a native C++ array. Here are several use cases of Shape:

Shape shape1 = {16, 3, 224, 224}, shape2 = {16, 32};    // create two shapes
bool equal = (shape1[3] == 224);                        // read the length of dimension 3 of shape1 and compare it
shape2[1] = 16;                                         // modify the length of the second dimension of shape2
shape1 = {16, 16};                                      // assign a new value to shape1
bool shapes_equal = (shape1 == shape2);                 // check whether the two shapes are now equal
size_t ndim = shape1.ndim();                            // get the number of dimensions of shape1

The default behavior Custom Op provides for Shape inference is to make the Shape of every output Tensor equal to the Shape of the 0th input Tensor; if there is no input Tensor, the Shape of all output Tensors is [1]. In the MatMulScale example above this default obviously does not meet our needs, so we defined the Shape inference function ourselves.

Calculation function#

The compute function of a Custom Op mainly describes how to call the interface function of the kernel we have written. It also needs to be provided by the user as a C++ function, and its signature is likewise fixed:

void(*)(const std::vector<Tensor>&, const Param&, std::vector<Tensor>&);

Like the inference functions, the compute function of a Custom Op has no return value: it receives the input Tensors and the Param, computes the values of the output Tensors, and writes them back through the reference. Two concepts are involved here, Tensor and Param, which are introduced separately below.
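
As a concrete example, an x86 version of the MatMulScale compute function (which would be registered via set_compute("x86", ...)) might look like the following naive sketch; it assumes contiguous row-major Tensors, and a real implementation would normally call an optimized CPU kernel instead of a triple loop:

void matmul_scale_compute_x86(const std::vector<Tensor> &inputs, const Param &params, std::vector<Tensor> &outputs) {
    const float *lhs = inputs[0].data<float>();
    const float *rhs = inputs[1].data<float>();
    float *result = outputs[0].data<float>();
    size_t M = inputs[0].shape()[0], K = inputs[0].shape()[1], N = inputs[1].shape()[1];
    float scale = params["scale"].as<float>();
    for (size_t m = 0; m < M; ++m) {
        for (size_t n = 0; n < N; ++n) {
            float acc = 0.f;
            for (size_t k = 0; k < K; ++k)
                acc += lhs[m * K + k] * rhs[k * N + n];
            result[m * N + n] = acc * scale;    // matmul result scaled by the scale param
        }
    }
}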

Tensor

A Tensor in Custom Op can be regarded as the combination of data and the attributes of that data (i.e. the Device, DType, and Shape described above). We can use the following code to obtain this information from a Tensor:

Device device = tensor.device();                    // get the device of the tensor
DType dtype = tensor.dtype();                       // get the dtype of the tensor
Shape shape = tensor.shape();                       // get the shape of the tensor

size_t size = tensor.size();                        // get the number of elements in the tensor
std::vector<ptrdiff_t> strides = tensor.stride();   // get the stride of each dimension of the tensor
float scale = tensor.scale();                       // get the scale of the data; only valid for quantized data
uint8_t zero_point = tensor.zero_point();           // get the zero_point of the data; only valid for asymmetrically quantized data

With the functions above we can obtain the attributes of a Tensor such as its Device, DType, and Shape, as well as more detailed information such as the number of elements and the stride of each dimension, and then use this information when writing the kernel.

In addition, we can use the following code to obtain the data stored in a Tensor:

void *data = tensor.data();
float *float_data = tensor.data<float>();

Two data() functions are provided here, one without a template parameter and one with a template parameter; both return a pointer to the actual data.

The former returns a void* pointer, which we can cast to the actual type we need when using it; this gives us the freedom to work with our own data types.
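
For example (MyComplex here is a hypothetical user-defined element type, not something provided by MegEngine):

MyComplex *typed_data = static_cast<MyComplex *>(tensor.data());   // reinterpret the raw buffer as a user-defined type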

The latter returns a pointer of the type given by the template parameter. In this example the template parameter is float, so it returns a float* pointer. In this case Custom Op checks that the template parameter is correct, i.e. the data type actually stored in the Tensor must also be float, otherwise an error is raised. The returned pointer points to a piece of memory on the device given by the Device attribute of the Tensor.

With this raw pointer, combined with the information obtained above such as the Shape and stride, we can compute the index of each element, read and write the data, and thus write the kernel and complete the computation. However, index calculation is cumbersome and error-prone, so Custom Op also provides a tool called TensorAccessor, which lets us access the elements of a Tensor in a way similar to C++ arrays. The following code shows how to use a TensorAccessor to access element (n, c, h, w) of a 4-dimensional Tensor:

auto accessor = tensor.accessor<float, 4>();        // obtain the accessor
accessor[n][c][h][w] = 1.f;                         // access the corresponding element through the accessor
float val = accessor[n][c][h][w];

The accessor() function generally takes two template parameters: the first is the data type of the Tensor and the second is its number of dimensions. In this example, because tensor is a 4-dimensional Tensor of type float, the two template parameters are float and 4 respectively.

To use a TensorAccessor, we can pass it to the kernel as a kernel parameter and then access the data through it inside the kernel. Of course, compared with computing element indices by hand, a TensorAccessor introduces a little extra overhead, so you can decide whether to use it according to your needs.
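
A rough sketch of passing an accessor into a CUDA kernel is shown below; since the concrete accessor type is obtained via auto, it is taken here as a template parameter, and the kernel name, element type, and launch configuration are illustrative assumptions only:

template <typename Accessor>
__global__ void scale_inplace_kernel(Accessor acc, size_t N, size_t C, size_t H, size_t W, float scale) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N * C * H * W)
        return;
    size_t w = idx % W, h = (idx / W) % H;
    size_t c = (idx / (W * H)) % C, n = idx / (W * H * C);
    acc[n][c][h][w] *= scale;               // element access goes through the accessor
}

// launched from the compute function, for example:
// scale_inplace_kernel<<<blocks, threads>>>(outputs[0].accessor<float, 4>(), N, C, H, W, scale);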

Finally, it should be emphasized that, to simplify memory management, constructing a Tensor inside Custom Op code is currently not allowed. MegEngine automatically constructs the Tensors for the Custom Op, allocates their memory, and passes the constructed Tensors to us; we then operate on them through the interfaces above.

Param

Param records the non-Tensor inputs of a Custom Op, such as the padding and stride of a convolution. It is essentially a map whose key is of type std::string and names a param element, and whose value is of type ParamVal, a class that can be seen as supporting a limited set of arbitrary types. The following code briefly shows some features of ParamVal:

ParamVal a = 1, b = 1.0, c = true, d = "string";    // values of many types can be assigned to a ParamVal directly
ParamVal e = {1, 2, 3, 4};                          // std::vector is supported as well

ParamVal f = a + b;                                 // ParamVal supports arithmetic; the result is still a ParamVal
ParamVal g = d + "abc";                             // ParamVal can be combined directly with C++ built-in types

bool equal = (a == b);                              // ParamVal supports comparison; the result is a bool
a = "string";                                       // the actual type held by a ParamVal can change at runtime
std::string str = a.as<std::string>();              // convert a ParamVal back to a C++ type

Currently the types supported by ParamVal include int32_t, uint32_t, int64_t, uint64_t, float, double, bool, std::string, and the corresponding std::vector versions (such as std::vector<int32_t>).

Param supports the [] operator, which returns the element (of type ParamVal) with the given name; we can read and write the data as follows:

param["scale"] = 0.1;                       // set the element named scale in param to 0.1
float scale = param["scale"].as<float>();   // read the element named scale in param as a float

Registration of Custom Op#

Above we defined attribute inference functions, a compute function, and other information for the Custom Op, but these pieces are isolated from one another. Registering the Custom Op combines them into a whole.

Op registration

Custom Op provides the macro CUSTOM_OP_REG(your_op_name), with which we can define a Custom Op with the given name.

CUSTOM_OP_REG(MatMulScale);     // define an Op named MatMulScale

Add inputs and outputs to the Op

We can use the add_input() function to add an input Tensor to the Op and the add_output() function to add an output Tensor. add_inputs() and add_outputs() can also be used to add inputs and outputs in batches.

CUSTOM_OP_REG(MatMulScale)
    .add_input("lhs", {"float32"}, 2)       // add an input named lhs, with dtype float32 and 2 dimensions
    .add_input("rhs")                       // use the default behavior of add_input: dtype float32, ndim -1 (any number of dimensions)
    .add_output("result", {"float32"}, 2);  // add an output to the Op

// another way to register input/output Tensors: register them in batches
CUSTOM_OP_REG(MatMulScale)
    .add_inputs(2)      // add two default inputs: dtype float32, ndim -1
    .add_outputs(1);    // add one default output: dtype float32, ndim -1

Add a Param to the Op

We can use the add_param() function to add a Param element to the Op; sample code is as follows:

CUSTOM_OP_REG(MatMulScale)
    .add_param("scale", 1.0f);  // add a param named scale to the Op, with a default value of 1.0f

Here we add a param named "scale" to the MatMulScale Op with a default value of 1.0f; we can then access this parameter via param["scale"] in the attribute inference functions and the compute function.

Add attribute inference and compute functions to the Op

For attribute inference, Custom Op provides three functions, set_shape_infer(), set_device_infer(), and set_dtype_infer(), which set the Shape, Device, and DType inference functions respectively. For the compute function, Custom Op provides set_compute(). Each attribute inference function can only be set once through the corresponding interface, while set_compute() can be called multiple times to add compute functions for different platforms. Sample code is as follows:

CUSTOM_OP_REG(MatMulScale)
    .set_shape_infer(matmul_scale_shape_infer)       // set the Shape inference function of the Op
    .set_dtype_infer(matmul_scale_dtype_infer)       // set the DType inference function of the Op
    .set_device_infer(matmul_scale_device_infer)     // set the Device inference function of the Op
    .set_compute("x86", matmul_scale_compute_x86)    // set the compute function of the Op on x86
    .set_compute("cuda", matmul_scale_compute_cuda); // set the compute function of the Op on cuda

Here the MatMulScale operator does not rely on the default attribute inference: the Shape, Device, and DType inference functions are each set through the corresponding interface. At the same time, the compute functions of MatMulScale on x86 and cuda are also registered.