GradManager#

class GradManager[source]#

GradManager computes gradients or, more generally, vector-Jacobian products, by reverse mode automatic differentiation (a.k.a. back propagation).

Reverse mode autodiff normally reuses many intermediate tensors for best computation efficiency. In a read-eval-print-loop (REPL) environment, however, it is impossible to know how the user will take gradients later, and thus which tensors to keep. To solve this problem, the user must somehow declare beforehand which gradients could possibly be taken. With GradManager, users are required to call the attach method on a tensor if they want to take gradients with respect to it later. Furthermore, any computation on a tensor before it is attached is completely ignored from the autodiff perspective, so attach must be called before any computation that needs differentiation.

For example, the following symbolic differentiation code

x = get_x()
y = f(x)
dy = ones_like(y)
dx = vjp(y, x, dy) # vector-Jacobian product

can be rewritten using GradManager for a REPL environment as

with GradManager() as gm:
    x = get_x()
    gm.attach(x) # must be placed before any computation on x that needs differentiation
    y = f(x)
    dy = ones_like(y)
    gm.backward(y, dy) # doesn't need x, already known via attach()
    dx = x.grad # backward() saves result to .grad attribute

A more realistic example of training a neural network would look like

gm = GradManager()
gm.attach(model.parameters())

for data in dataset:
    with gm:
        loss = model(data)
        gm.backward(loss)
    # gradients w.r.t. parameters are accumulated into their .grad attributes

You can also use the record() and release() methods instead of the with statement:

gm = GradManager()
gm.attach(model.parameters())

for data in dataset:
    gm.record()
    loss = model(data)
    gm.backward(loss)
    # backward() will clear recorded history and free resources
    # call release() if backward() is not called
    # gm.release()

For your convenience, a GradManager may (but need not) be reused. As shown in the examples, you only need to attach a tensor once and the GradManager will remember it afterwards. However, a single GradManager can record only one computation history at a time. To run multiple differentiations simultaneously or to perform higher order differentiation, create as many GradManagers as you need, as sketched below.
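For instance, two managers can record the same computation independently, each producing gradients with respect to its own attached tensors. The following is a minimal sketch of that pattern (x, w and the loss below are made up for illustration):

from megengine import Tensor
from megengine.autodiff import GradManager

x = Tensor([1.0, 2.0])
w = Tensor([3.0, 4.0])

gm_x = GradManager()  # differentiates w.r.t. x
gm_w = GradManager()  # differentiates w.r.t. w
gm_x.attach(x)
gm_w.attach(w)

with gm_x, gm_w:              # both managers record the same forward pass
    y = (x * w).sum()
    gm_x.backward(y)          # accumulates dy/dx into x.grad
    gm_w.backward(y)          # accumulates dy/dw into w.grad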

Note

Mutable tensors introduce ambiguities when doing symbolic differentiation: which version of the tensor are we referring to? For attached tensors, GradManager resolves this ambiguity by “snapshotting” them on first encounter, either on record (or on entering the with statement) if the tensor is attached before record, or on attach if the GradManager is already recording. Attached tensors are then interpreted as their snapshotted versions for differentiation purposes. The same ambiguity on the first parameter of backward is simply resolved by using the latest version.

Typically, in data parallel training, we would like to average the gradients across processes. Users will get the averaged gradients if an “AllReduce” callback is registered as follows:

import megengine.distributed as dist

gm = GradManager()
gm.attach(model.parameters(), callbacks=dist.make_allreduce_cb("MEAN"))
attach(tensors, callbacks=None)[source]#

Instruct GradManager to track operations on tensors, so that gradients with respect to those tensors can be evaluated later.

attach also accepts a list of callbacks, which will be called with the tensor and its gradient during backward. The signature of a callback should look like:

def callback(tensor: Tensor, grad: Tensor) -> Tensor:
    ...
    # returned grad is passed to subsequent callbacks
    # and finally accumulated to the .grad attribute of tensor
    return grad
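For example, a callback can rescale the gradient before it reaches .grad. The following is a hedged sketch (model and the averaging factor n_accum are illustrative assumptions, as in the earlier examples) that averages gradients over several accumulation steps:

from megengine import Tensor
from megengine.autodiff import GradManager

n_accum = 4  # number of micro-batches whose gradients are averaged

def scale_grad(tensor: Tensor, grad: Tensor) -> Tensor:
    # divide each incoming gradient so that .grad accumulates the average
    return grad / n_accum

gm = GradManager()
gm.attach(model.parameters(), callbacks=[scale_grad])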

Calls to attach with overlapping tensors result in their callbacks being concatenated, independently for each tensor. For example,

gm.attach([x, y], callbacks=[f])
gm.attach([y], callbacks=[g])

is equivalent to

gm.attach([x], callbacks=[f])
gm.attach([y], callbacks=[f, g])

The effect of attach persists across multiple uses of the GradManager. When reusing a GradManager, it is likely a mistake to call attach on the same set of tensors and callbacks repeatedly, which may grow the callback list indefinitely; attach once and reuse the manager, as sketched below.
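For instance, attaching inside the training loop would append the callback again on every iteration. A minimal sketch of the intended usage (model, dataset and cb are illustrative, as in the earlier examples):

gm = GradManager()

# Anti-pattern: the callback list of each parameter would grow every iteration
# for data in dataset:
#     gm.attach(model.parameters(), callbacks=[cb])
#     ...

# Intended usage: attach (and register callbacks) once, then reuse the manager
gm.attach(model.parameters(), callbacks=[cb])
for data in dataset:
    with gm:
        loss = model(data)
        gm.backward(loss)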

Note

When reusing a GradManager, it is sometimes desirable to attach temporary tensors each time, e.g. for computing gradients of the inputs of a neural network (see the sketch after this note). GradManager tries to accommodate such usage by holding weak references to attached tensors. Most of the time, this should be enough to prevent resource leaks. Unfortunately, there are still some pitfalls left:

  • Callbacks should not hold strong references, directly or indirectly, to attached tensors. Any strong reference, including those from callbacks, will prevent garbage collection (even by the cycle collector!) of an attached tensor until the GradManager object is itself garbage collected.

Please also note that GradManager might hold additional strong references to attached tensors while it is in use. This note only covers potential resource leaks across multiple uses of a GradManager, which is unrelated to whether resources are released in a timely manner within a single use.
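As a sketch of the temporary-tensor use case mentioned above (model and dataset are assumed, as in the earlier examples), an input tensor can be attached anew on every iteration to obtain input gradients, and the weak reference lets each old input be freed:

from megengine import Tensor
from megengine.autodiff import GradManager

gm = GradManager()
gm.attach(model.parameters())

for data in dataset:
    x = Tensor(data)      # temporary input tensor
    gm.attach(x)          # attached anew each iteration; held only weakly
    with gm:
        loss = model(x)
        gm.backward(loss)
    # x.grad now holds the gradient of loss w.r.t. the input x;
    # once x goes out of scope, GradManager does not keep it alive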

Parameters:
  • tensors (Iterable[Tensor]) – tensor or list of tensors to track

  • callbacks – callback or list of callbacks

attached_tensors()[source]#

Return the list of tensors attached via attach.

backward(y=None, dy=None)[source]#

Compute gradients (or vector-Jacobian product) for all attached tensors, accumulate to corresponding .grad attribute, and release resources along the way.

backward computes the vector-Jacobian product \(dx_j = \sum_{i} dy_i J_{ij}\) where \(J_{ij} = \partial y_i / \partial x_j\) is the Jacobian matrix between the vector variables \(y\) and \(x\), with all vectors involved represented as a list of tensors, in the sense of direct sums (or flatten-and-concatenate). \(y\) and \(dy\) are passed as the first and second parameter respectively, whereas \(x\) is taken directly from the list of all attached tensors. The result \(dx\) is not returned; instead, it is accumulated directly into the .grad attribute of the matching attached tensors (a.k.a. \(x\)). This can be done unambiguously since \(dx\), as a list of tensors, has the same structure as \(x\).

If \(y\) is a scalar and \(dy\) is chosen to be 1, the vector-Jacobian product yields the gradient of \(y\) with respect to \(x\) as a special case. In that case, you may omit the \(dy\) parameter and backward will automatically use 1 for it and compute the gradient.

backward consumes all resources held by this GradManager and releases them in the process of this call. When the call successfully finishes, the GradManager will be put back to an inactive state.
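As a sketch of the non-scalar case (shapes and values are illustrative), passing dy explicitly yields a vector-Jacobian product rather than a plain gradient:

from megengine import Tensor
from megengine.autodiff import GradManager

x = Tensor([1.0, 2.0, 3.0])
gm = GradManager()
gm.attach(x)

with gm:
    y = x * x                     # non-scalar output, shape (3,)
    dy = Tensor([1.0, 0.0, 0.0])  # the vector in the vector-Jacobian product
    gm.backward(y, dy)            # accumulates dy_i * J_ij into x.grad

# J = diag(2 * x), so x.grad is [2., 0., 0.]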

Parameters:
  • y – tensor or list of tensors whose gradients (or vector-Jacobian products) are requested

  • dy – tensor or list of tensors serving as the vector in the vector-Jacobian product; may be omitted (treated as 1) when y is a scalar

record()[source]#

Start recording operations.

After this call, you will be able to call backward.

release()[source]#

Stop recording operations and release the resources kept for gradient computation.

After this call, you will not be able to call backward.