megengine.functional.distributed.all_reduce_sum

all_reduce_sum(inp, group=WORLD, device=None)

Reduce tensors across the specified group by summing the values elementwise.

Note

The inp tensor must have the same shape on every process in the group.

Parameters:

inp (Tensor) – tensor to be reduced.

Keyword Arguments:
  • group (Group or sequence of ints) – the process group to work on. Default: WORLD. The WORLD group selects all available processes. Passing a list of process ranks instead creates a new group containing those processes.

  • device (Tensor.device) – the specific device on which to execute this operator. Default: None. None selects the device of inp. For GPU devices, a particular stream can be chosen by appending a colon and a stream number to the device name (e.g. gpu0:1); :0 denotes the default GPU stream, and omitting the stream number also uses the default stream.

Return type:

Tensor

Returns:

A tensor containing the elementwise sum of inp over all processes in the group.

The output tensor has the same shape as inp and is bitwise identical on all processes in the group.

Examples

>>> # We execute all_reduce_sum on rank 0 and rank 1
>>> input = F.arange(2) + 1 + 2 * rank 
>>> input 
Tensor([1. 2.], device=xpux:0) # Rank 0
Tensor([3. 4.], device=xpux:0) # Rank 1
>>> F.distributed.all_reduce_sum(input, group=[0, 1]) 
Tensor([4. 6.], device=xpux:0) # Rank 0
Tensor([4. 6.], device=xpux:0) # Rank 1
>>> # We execute all_reduce_sum on gpu0 with CUDA stream 1
>>> megengine.set_default_device("gpu0") 
>>> input = F.arange(2) + 1 + 2 * rank 
>>> input  
Tensor([1. 2.], device=gpu0:0) # Rank 0
Tensor([3. 4.], device=gpu0:0) # Rank 1
>>> F.distributed.all_reduce_sum(input, device="gpu0:1") 
Tensor([4. 6.], device=gpu0:0) # Rank 0
Tensor([4. 6.], device=gpu0:0) # Rank 1
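
The doctests above assume they are already running inside two worker processes (one per rank). For reference, the following is a minimal self-contained sketch of how such an example might be launched; it assumes two GPUs are available and uses megengine.distributed.launcher to spawn one worker process per device, so the WORLD group is set up automatically.

import megengine.distributed as dist
import megengine.functional as F

@dist.launcher(n_gpus=2)  # spawn one worker process per GPU
def worker():
    rank = dist.get_rank()
    # Rank 0 holds [1., 2.], rank 1 holds [3., 4.]
    inp = F.arange(2) + 1 + 2 * rank
    # Every rank receives the elementwise sum [4., 6.]
    out = F.distributed.all_reduce_sum(inp)
    print(f"rank {rank}: {out.numpy()}")

worker()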