using lite::StartCallback = std::function<void(const std::unordered_map<std::string, std::pair<IO, std::shared_ptr<Tensor>>>&)>

the start/finish callback function type

Param unordered_map: the map from the IO tensor name to the pair of the user configuration information (IO) and the actual input or output tensor.

using lite::FinishCallback = std::function<void(const std::unordered_map<std::string, std::pair<IO, std::shared_ptr<Tensor>>>&)>
using lite::AsyncCallback = std::function<void(void)>

the network async callback function type

using lite::ThreadAffinityCallback = std::function<void(int thread_id)>

the thread affinity callback function type

Param thread_id: the ID of the current thread. Thread IDs run from 0 to nr_threads - 1; the thread with ID nr_threads - 1 is the main worker thread.

struct Options

the inference options which can optimize the network forwarding performance

Param weight_preprocess: whether to optimize inference performance by preprocessing the weights of the network ahead of time.

Param fuse_preprocess: whether to fuse preprocess patterns, such as astype + pad_channel + dimshuffle.

Param fake_next_exec: whether to perform only non-computing tasks (such as memory allocation and queue initialization) for the next exec, i.e. warm up without doing real inference. This is reset to false after the graph is executed.

Param var_sanity_check_first_run: whether to perform var sanity check on the first run. Var sanity check is enabled on the first execution by default and can be used to find potential memory access errors in operators.

Param const_shape: whether the model shape stays fixed. Fixing the shape reduces memory usage and improves performance, since some static inference data structures can be omitted and some operators can be computed before forwarding.

Param force_dynamic_alloc: whether to force dynamic memory allocation for all vars.

Param force_output_dynamic_alloc: whether to force dynamic memory allocation for output tensors that are used as the input of the CallbackCaller operator.

Param no_profiling_on_shape_change: whether to skip re-profiling when the input tensor shape changes. If set, the best implementation algorithm is not re-selected on shape change and the previous algorithm is reused.

Param jit_level: the JIT level. Execute supported operators with JIT (MLIR and NVRTC are supported). Can only be used on NVIDIA GPUs and x86 CPUs. Level 1: JIT execution of basic elemwise operators; level 2: JIT execution of elemwise and reduce operators.

Param comp_node_seq_record_level: the record level, a flag to optimize inference performance by recording the kernel tasks in the first run; thereafter, inference only needs to execute the recorded tasks. Level 0: normal inference; level 1: record inference; level 2: record inference and free the extra memory.

Param graph_opt_level: the graph optimization level. 0: disabled; 1: inplace arithmetic transformations during graph construction; 2: level 1 plus global optimization before graph compiling; 3: level 2 plus JIT.

Param async_exec_level: the level of dispatch on separate threads for different comp nodes. 0: do not perform async dispatch; 1: dispatch async if there is more than one comp node with a limited queue; mask 0b10: async if there are multiple comp nodes; mask 0b100: always async.


bool weight_preprocess = false
bool fuse_preprocess = false
bool fake_next_exec = false
bool var_sanity_check_first_run = true
bool const_shape = false
bool force_dynamic_alloc = false
bool force_output_dynamic_alloc = false
bool force_output_use_user_specified_memory = false
bool no_profiling_on_shape_change = false
uint8_t jit_level = 0
uint8_t comp_node_seq_record_level = 0
uint8_t graph_opt_level = 2
uint16_t async_exec_level = 1
bool enable_nchw44 = false

layout transform options

bool enable_nchw44_dot = false
bool enable_nchw88 = false
bool enable_nhwcd4 = false
bool enable_nchw4 = false
bool enable_nchw32 = false
bool enable_nchw64 = false
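A short sketch of how these options are typically filled before loading a model. The header path "lite/network.h" is an assumption; the field and enum names are taken from this document.

```cpp
// Sketch only: assumes the MegEngine Lite headers from this document.
#include "lite/network.h"  // header path is an assumption

lite::Config make_cpu_config() {
    lite::Options opts;
    opts.weight_preprocess = true;            // preprocess weights ahead of time
    opts.var_sanity_check_first_run = false;  // skip the first-run sanity check
    opts.comp_node_seq_record_level = 1;      // record kernel tasks on first run
    opts.graph_opt_level = 2;                 // default global graph optimization

    lite::Config config;
    config.device_type = LiteDeviceType::LITE_CPU;
    config.options = opts;
    return config;
}
```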
struct IO

configure a network input or output item; the input or output tensor information is described here


  • If the input tensor has been set to another layout before forwarding, the layout configured here will not take effect.

  • If no layout is set before forwarding, the model's original layout is used during inference.

  • If an output tensor layout is set here, it is used to check the correctness of the layout computed by the network.

Param name: the input/output tensor name.

Param is_host: marks where the input tensor comes from and where the output tensor will be copied to. If is_host is true, the input comes from the host and the output is copied to the host; otherwise they are on the device. Mainly used for input/output tensor data synchronization; sometimes the input comes from the device and the output need not be copied to the host. Default is true.

Param io_type: the IO type, which can be SHAPE or VALUE. When SHAPE is set, the value of the input or output tensor is invalid and only the shape will be set. Default is VALUE.

Param config_layout: the configured layout of the input or output tensor.


std::string name
bool is_host = true
LiteIOType io_type = LiteIOType::LITE_IO_VALUE
Layout config_layout = {}
struct Config

Configuration used when loading and compiling a network.

Param has_compression: whether the model weights are compressed; the compression method is stored in the model.

Param device_id: the device ID used for network inference.

Param device_type: the device type used for network inference; see the definition of LiteDeviceType.

Param backend: the inference backend of the network; see the definition of LiteBackend. Currently only megengine is supported.

Param bare_model_cryption_name: the name of the bare-model encryption method. A bare model does not pack JSON information data inside.

Param options: the model inference optimization options; see Options.


bool has_compression = false
int device_id = 0
LiteDeviceType device_type = LiteDeviceType::LITE_CPU
LiteBackend backend = LiteBackend::LITE_DEFAULT
std::string bare_model_cryption_name = {}
Options options = {}
struct NetworkIO

the input and output information used when loading the network; the NetworkIO remains in the network until the network is destroyed

Param inputs: the list of input IO items; all the input tensor information that will be configured into the network.

Param outputs: the list of output IO items; all the output tensor information that will be configured into the network.


std::vector<IO> inputs = {}
std::vector<IO> outputs = {}
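The IO and NetworkIO structs above are typically combined as follows. The tensor names "data" and "prob" are hypothetical, and the header path is an assumption.

```cpp
// Sketch only: assumes the MegEngine Lite headers from this document.
#include "lite/network.h"  // header path is an assumption

lite::NetworkIO make_io() {
    lite::IO input;
    input.name = "data";   // hypothetical: must match a tensor name in the model
    input.is_host = true;  // input data lives on the host
    input.io_type = LiteIOType::LITE_IO_VALUE;

    lite::IO output;
    output.name = "prob";  // hypothetical output tensor name
    output.is_host = true;

    lite::NetworkIO io;
    io.inputs.push_back(input);
    io.outputs.push_back(output);
    return io;
}
```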
class Allocator

A user-implemented allocator interface. A user can register an allocator with MegEngine, after which all runtime memory is allocated by this allocator.


virtual ~Allocator() = default
virtual void *allocate(LiteDeviceType device_type, int device_id, size_t size, size_t align) = 0

allocate memory of the given size on the given device with the given alignment

  • device_type – the device type to allocate the memory from

  • device_id – the device id to allocate the memory from

  • size – the byte size of the memory to be allocated

  • align – the alignment required when allocating the memory

virtual void free(LiteDeviceType device_type, int device_id, void *ptr) = 0

free the memory pointed to by ptr on the given device

  • device_type – the device type the memory was allocated from

  • device_id – the device id the memory was allocated from

  • ptr – the memory pointer to be freed

class Network

The Network is the main class for performing forwarding. It is constructed from a model and implements model loading, initialization, forwarding, and displaying some model information.


Construct a network with given configuration and IO information

Param config: the configuration used to create the network.

Param networkio: the NetworkIO describing the input and output tensors of the network.

friend class NetworkHelper
Network(const Config &config = {}, const NetworkIO &networkio = {})
Network(const NetworkIO &networkio, const Config &config = {})
void load_model(void *model_mem, size_t size)

load the model from memory

void load_model(std::string model_path)

load the model from a model path

void compute_only_configured_output()

only compute the output tensor configured by the IO information

std::shared_ptr<Tensor> get_io_tensor(std::string io_name, LiteTensorPhase phase = LiteTensorPhase::LITE_IO)

get a network input or output tensor, whose layout is synced from the megengine tensor; when an input tensor and an output tensor share the same name, use LiteTensorPhase to distinguish them

  • io_name – the name of the tensor

  • phase – indicates whether the tensor is an input tensor or an output tensor; an input tensor name may be the same as an output tensor name

std::shared_ptr<Tensor> get_input_tensor(size_t index)

get the network input tensor by index

std::shared_ptr<Tensor> get_output_tensor(size_t index)

get the network output tensor by index

Network &set_async_callback(const AsyncCallback &async_callback)

set the network to forward in async mode and set the AsyncCallback callback function

Network &set_start_callback(const StartCallback &start_callback)

set the start forwarding callback function of type StartCallback, which will be executed before forwarding; this can be used to check network inputs or dump model inputs for debugging

Network &set_finish_callback(const FinishCallback &finish_callback)

set the finish forwarding callback function of type FinishCallback, which will be executed after forwarding; this can be used to dump model outputs for debugging
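The StartCallback type defined at the top of this section can be installed as below; a minimal sketch, assuming the lite headers (the header path is an assumption), that logs every IO tensor name before forwarding.

```cpp
// Sketch only: assumes the MegEngine Lite headers from this document.
#include "lite/network.h"  // header path is an assumption

#include <cstdio>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

void install_logging_callback(std::shared_ptr<lite::Network> net) {
    // Matches lite::StartCallback: map from tensor name to (IO config, tensor).
    net->set_start_callback(
            [](const std::unordered_map<
                    std::string,
                    std::pair<lite::IO, std::shared_ptr<lite::Tensor>>>& io_map) {
                for (auto& kv : io_map)
                    std::printf("about to forward, io tensor: %s\n",
                                kv.first.c_str());
            });
}
```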

void forward()

forward the network with filled input data and fill the output data to the output tensor

void wait()

wait until forwarding finishes in sync mode
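Putting the methods above together, a typical synchronous load-and-forward flow looks like the following sketch. The header path, the model path "./model.mge", the tensor name "data", and the use of Tensor::get_memory_ptr() are assumptions.

```cpp
// Sketch only: assumes the MegEngine Lite headers from this document.
#include "lite/network.h"  // header path is an assumption

#include <cstring>
#include <memory>

void run_once(const float* src, size_t src_bytes) {
    auto network = std::make_shared<lite::Network>();  // default Config/NetworkIO
    network->load_model("./model.mge");                // hypothetical model path

    // Fill the input tensor ("data" is a hypothetical tensor name).
    auto input = network->get_io_tensor("data");
    std::memcpy(input->get_memory_ptr(), src, src_bytes);

    network->forward();
    network->wait();  // sync mode: block until forwarding finishes

    // Read the result from the first output tensor.
    auto output = network->get_output_tensor(0);
    (void)output;  // e.g. read via output->get_memory_ptr()
}
```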

std::string get_input_name(size_t index) const

get the input tensor name by index

std::string get_output_name(size_t index) const

get the output tensor name by index

std::vector<std::string> get_all_input_name() const

get all the input tensor names

std::vector<std::string> get_all_output_name() const

get all the output tensor names

Network &set_device_id(int device_id)

set the network forwarding device id, default device id = 0

int get_device_id() const

get the network forwarding device id

Network &set_stream_id(int stream_id)

set the network stream id, default stream id = 0

int get_stream_id() const

get the network stream id

void enable_profile_performance(std::string profile_file_path)

enable profiling of the network; a profile file will be generated at the given path

const std::string &get_model_extra_info()

get model extra info, the extra information is packed into model by user

LiteDeviceType get_device_type() const

get the network device type

void get_static_memory_alloc_info(const std::string &log_dir = "logs/test") const

get the static peak memory info shown by graph visualization

void extra_configure(const ExtraConfig &extra_config)

apply the extra configuration to the network


extra_config – the extra configuration to set into the network


class Runtime

All runtime configuration functions are defined in the Runtime class as static member functions.


static void set_cpu_threads_number(std::shared_ptr<Network> dst_network, size_t nr_threads)

The multithread number setter and getter interface. When the device is CPU, this interface sets the network to run in multithread mode with the given thread number.

  • dst_network – the target network to set/get the thread number

  • nr_threads – the thread number set to the target network

static size_t get_cpu_threads_number(std::shared_ptr<Network> dst_network)
static void set_runtime_thread_affinity(std::shared_ptr<Network> network, const ThreadAffinityCallback &thread_affinity_callback)

set the thread affinity callback

  • dst_network – the target network to set the thread affinity callback

  • thread_affinity_callback – the ThreadAffinityCallback callback to set the thread affinity

static void set_cpu_inplace_mode(std::shared_ptr<Network> dst_network)

Set CPU inplace mode when the device is CPU. On low-computation or single-core devices, this mode gets good performance.


dst_network – the target network to set/get cpu inplace mode

static bool is_cpu_inplace_mode(std::shared_ptr<Network> dst_network)
static void use_tensorrt(std::shared_ptr<Network> dst_network)

Set the network to forward using TensorRT.

static void set_network_algo_policy(std::shared_ptr<Network> dst_network, LiteAlgoSelectStrategy strategy, uint32_t shared_batch_size = 0, bool binary_equal_between_batch = false)

set opr algorithm selection strategy in the target network

  • dst_network – the target network to set the algorithm strategy

  • strategy – the algorithm strategy to set to the network; if multiple strategies should be set, they can be combined with the | operator

  • shared_batch_size – the batch size used by fast-run. A non-zero value means fast-run uses this batch size regardless of the model's batch size; zero means fast-run uses the model's batch size.

  • binary_equal_between_batch – if set to true, then when the content of each input batch is binary equal, the content of each output batch is guaranteed to be equal as well; otherwise it is not

static void set_network_algo_workspace_limit(std::shared_ptr<Network> dst_network, size_t workspace_limit)

set the operator workspace limitation in the target network. Some operators may use a large workspace to get good performance; setting a workspace limitation can save memory but may influence the performance.

  • dst_network – the target network to set/get workspace limitation

  • workspace_limit – the byte size of workspace limitation

static void set_memory_allocator(std::shared_ptr<Network> dst_network, std::shared_ptr<Allocator> user_allocator)

set the network runtime memory Allocator. The Allocator is defined by the user; through this method, a user can implement a memory pool for network forwarding.

  • dst_network – the target network

  • user_allocator – the user defined Allocator

static void share_runtime_memory_with(std::shared_ptr<Network> dst_network, std::shared_ptr<Network> src_network)

share the runtime memory with another network; the weights are not shared


The src network and the dst network must not run at the same time.

  • dst_network – the target network that shares runtime memory from src_network

  • src_network – the source network whose runtime memory is shared with dst_network

static void enable_io_txt_dump(std::shared_ptr<Network> dst_network, std::string io_txt_out_file)

dump all input/output tensors of all operators to the output file in txt format; users can use this function to debug computation errors

  • dst_network – the target network to dump its tensors

  • io_txt_out_file – the txt file

static void enable_io_bin_dump(std::shared_ptr<Network> dst_network, std::string io_bin_out_dir)

dump all input/output tensors of all operators to the output files in binary format; users can use this function to debug computation errors

  • dst_network – the target network to dump its tensors

  • io_bin_out_dir – the directory for the binary files

static void shared_weight_with_network(std::shared_ptr<Network> dst_network, const std::shared_ptr<Network> src_network)

load a new network that shares weights with the src network; this can reduce memory usage when a user wants to load the same model multiple times

  • dst_network – the target network that shares weights from src_network

  • src_network – the source network whose weights are shared with dst_network

static void enable_global_layout_transform(std::shared_ptr<Network> network)

set global layout transform optimization for the network; global layout optimization can automatically determine the layout of every operator in the network by profiling, thus improving network forwarding performance

static void dump_layout_transform_model(std::shared_ptr<Network> network, std::string optimized_model_path)

dump the network after global layout transform optimization to the specified path

static NetworkIO get_model_io_info(const std::string &model_path, const Config &config = {})

get the model IO information from a model path, before the model is loaded

  • model_path – the model path to get the model IO information

  • config – the model configuration


Returns: the model NetworkIO information.

static NetworkIO get_model_io_info(const void *model_mem, size_t size, const Config &config = {})

get the model IO information from model memory, before the model is loaded

  • model_mem – the model memory to get the model IO information

  • size – model memory size in byte

  • config – the model configuration


Returns: the model NetworkIO information.