megengine.distributed#

>>> import megengine.distributed as dist

backend

Get or set the backend of collective communication.

Group#

Server

Distributed Server for distributed training.

Group

Includes the ranked nodes that run collective communication (see megengine.distributed).

init_process_group

Initialize the distributed process group and specify the device used in the current process.
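
For example, a minimal manual-initialization sketch for rank 0 of a two-process job (the address, port, and device index are illustrative placeholders; most users rely on launcher instead):

>>> server = dist.Server(port=23456)  # rank 0 hosts the coordination server
>>> dist.init_process_group(
...     master_ip="localhost",        # illustrative address of the master node
...     port=23456,
...     world_size=2,
...     rank=0,
...     device=0,                     # device index bound to this process
... )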

new_group

Build a subgroup containing certain ranks.

group_barrier

Block until all ranks in the group reach this barrier.
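
Assuming the process group has been initialized on every rank, a subgroup and a barrier can be combined roughly as follows (ranks [0, 1] are an illustrative choice):

>>> group = dist.new_group([0, 1])   # subgroup containing ranks 0 and 1
>>> if dist.get_rank() in [0, 1]:
...     pass                         # collectives restricted to `group` go here
...
>>> dist.group_barrier()             # block until all ranks reach this point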

override_backend

Override the distributed backend.
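
A sketch of its use as a context manager (the backend name is an assumption; availability depends on your build):

>>> with dist.override_backend("nccl"):
...     pass   # collectives issued in this block use the overridden backend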

is_distributed

Return True if the distributed process group has been initialized.

get_backend

Get the backend string.

get_client

Get the client of the Python XML-RPC server.

get_mm_server_addr

Get the master_ip and port of the C++ mm_server.

get_py_server_addr

Get the master_ip and port of the Python XML-RPC server.

get_rank

Get the rank of the current process.

get_world_size

Get the total number of processes participating in the job.
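
Inside a worker these queries are typically combined, for example:

>>> if dist.is_distributed():
...     print(f"rank {dist.get_rank()} / {dist.get_world_size()}, backend: {dist.get_backend()}")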

Launcher#

launcher

Decorator for launching multiple processes in single-machine multi-GPU training.
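
A minimal sketch of the decorator form (n_gpus=2 is an illustrative value):

>>> @dist.launcher(n_gpus=2)
... def worker():
...     print("rank", dist.get_rank(), "of", dist.get_world_size())
...
>>> worker()   # spawns one process per GPU and runs worker in each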

Helper#

bcast_list_

Broadcast tensors within the given group.
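
A common use is broadcasting a module's parameters from rank 0 so every process starts from identical weights (the toy Linear module is for illustration):

>>> import megengine.module as M
>>> model = M.Linear(4, 4)                      # toy module for illustration
>>> dist.bcast_list_(list(model.parameters()))  # in-place broadcast within the default group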

synchronized

Decorator; the decorated function synchronizes all ranks in the group when it finishes.

make_allreduce_cb

Alias of AllreduceCallback.

helper.AllreduceCallback

Allreduce callback with tensor fusion optimization.
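
A sketch of the usual data-parallel pattern: attach the callback to a GradManager so gradients are reduced across ranks during backward (the toy Linear module and the "mean" reduce method are illustrative):

>>> import megengine.autodiff as autodiff
>>> import megengine.module as M
>>> model = M.Linear(4, 4)   # toy module for illustration
>>> gm = autodiff.GradManager().attach(
...     model.parameters(),
...     callbacks=dist.make_allreduce_cb("mean", dist.WORLD),
... )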

helper.param_pack_split

Splits a packed tensor into a list of tensors according to the described offsets and shapes; only used for parameter packing (parampack).

helper.param_pack_concat

Returns the concatenated tensor; only used for parameter packing (parampack).

helper.pack_allreduce_split