Detailed Explanation of Copy and In-Place Operations in PyTorch

  • 2021-08-21 20:51:23
  • OfStack

Preface

In PyTorch we often use NumPy to process data and then convert it to a Tensor. When the data is later modified, we need to pay attention to whether the operation shares the underlying memory, because this affects how the whole network is updated. This article summarizes the points to watch for with in-place operations and copy operations.

In-place operation

In-place operations in PyTorch carry a trailing underscore, such as .add_() or .scatter_(). They change the contents of a given Tensor directly without making a copy, i.e. no new memory is allocated for the variable. Python augmented assignments such as += or *= are also in-place operations. (Note added by the author.)
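As a quick illustrative sketch (not from the original text), data_ptr() can be used to confirm that an in-place add reuses the existing storage while an out-of-place add allocates new memory:


import torch

x = torch.ones(3)
ptr_before = x.data_ptr()     # address of the underlying storage

y = x.add(1)                  # out-of-place: returns a new tensor
x.add_(1)                     # in-place: modifies x itself

print(x.data_ptr() == ptr_before)  # True  - x still uses the same storage
print(y.data_ptr() == ptr_before)  # False - y lives in newly allocated memory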

Why in-place operations can help reduce memory usage when processing high-dimensional data is illustrated by the example below, which defines a simple function to measure the memory allocated by PyTorch's out-of-place ReLU and in-place ReLU:


import torch # import main library
import torch.nn as nn # import modules like nn.ReLU()
import torch.nn.functional as F # import torch functions like F.relu() and F.relu_()

def get_memory_allocated(device, inplace=False):
    '''
    Measure the memory allocated before and after the ReLU call.
    INPUT:
    - device: gpu device to run the operation on
    - inplace: True - run ReLU in-place, False - normal (out-of-place) ReLU call
    '''
    # Create a large tensor
    t = torch.randn(10000, 10000, device=device)

    # Measure allocated memory before the call
    torch.cuda.synchronize()
    start_max_memory = torch.cuda.max_memory_allocated() / 1024**2
    start_memory = torch.cuda.memory_allocated() / 1024**2

    # Call in-place or normal ReLU
    if inplace:
        F.relu_(t)
    else:
        output = F.relu(t)

    # Measure allocated memory after the call
    torch.cuda.synchronize()
    end_max_memory = torch.cuda.max_memory_allocated() / 1024**2
    end_memory = torch.cuda.memory_allocated() / 1024**2

    # Return the amount of memory allocated for the ReLU call
    return end_memory - start_memory, end_max_memory - start_max_memory

# Set up the device
device = torch.device('cuda:0' if torch.cuda.is_available() else "cpu")
# Begin testing 
# Call the function to measure the allocated memory for the out-of-place ReLU
memory_allocated, max_memory_allocated = get_memory_allocated(device, inplace = False)
print('Allocated memory: {}'.format(memory_allocated))
print('Allocated max memory: {}'.format(max_memory_allocated))
'''
Allocated memory: 382.0
Allocated max memory: 382.0
'''
# Then call the in-place ReLU as follows:
memory_allocated_inplace, max_memory_allocated_inplace = get_memory_allocated(device, inplace = True)
print('Allocated memory: {}'.format(memory_allocated_inplace))
print('Allocated max memory: {}'.format(max_memory_allocated_inplace))
'''
Allocated memory: 0.0
Allocated max memory: 0.0
'''

It seems that in-place operations can help us save some GPU memory. However, extra care should be taken when using them.

According to the official documentation, there are two main drawbacks of in-place operations:

1. They may overwrite values that are needed to compute gradients, which breaks the model's training process (a sketch of this failure mode is shown below).

2. Every in-place operation actually requires the implementation to rewrite the computational graph. Out-of-place operations allocate new objects and keep references to the old graph, whereas in-place operations require changing the creator of all inputs to the Function representing the operation.

Supporting in-place operations in autograd is difficult, and their use is discouraged in most cases. Autograd's aggressive buffer freeing and reuse already makes it very efficient, and there are very few occasions when in-place operations lower memory usage by a significant amount. Unless you are operating under heavy memory pressure, you may never need them.
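As a minimal illustration of point 1 (an illustrative sketch, not taken from the official documentation): torch.sigmoid() saves its output for the backward pass, so modifying that output in place makes backward() fail the version check.


import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.sigmoid(x)   # sigmoid saves its output to compute the gradient later
y.add_(1.0)            # in-place edit invalidates the saved value

try:
    y.sum().backward()
except RuntimeError as e:
    print(e)  # "... has been modified by an inplace operation ..."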

Summary: Autograd already manages memory very well on its own, so in-place operations should be used with caution.

Copy method

Shallow copy methods share the underlying data memory, so the data changes synchronously:

* a.numpy()  # Tensor --> NumPy array

* view()  # changes the shape of the tensor but shares the data memory; do not judge sharing by id() alone

* y = x[:]  # indexing / slicing

* torch.from_numpy()  # NumPy array --> Tensor

* detach()  # the new tensor is detached from the computational graph and does not participate in gradient computation

* model.forward()

Note that selection functions such as index_select(), masked_select() and gather() return new tensors; according to the official documentation they do not share storage with the source tensor. The in-place operations discussed earlier, on the other hand, operate directly on the original memory. The short check below illustrates which operations actually share storage.
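Whether two tensors actually share storage can be verified by comparing data_ptr() values; a short sketch (my own check, not from the original article):


import torch

x = torch.arange(6.)

print(x.view(2, 3).data_ptr() == x.data_ptr())    # True  - view() shares storage
print(x[:].data_ptr() == x.data_ptr())            # True  - slicing shares storage
print(x.detach().data_ptr() == x.data_ptr())      # True  - detach() shares storage
print(x.index_select(0, torch.tensor([0, 1])).data_ptr() == x.data_ptr())  # False - copy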

Deep copy method:

* clone()  # the new tensor no longer shares memory with the source, but it remains in the computational graph and participates in gradient computation

Let us verify this, starting with a shallow copy:


import torch as t
import numpy as np
a = np.ones(4)
b = t.from_numpy(a) # Numpy->Tensor
print(a)
print(b)
''' Output: 
[1. 1. 1. 1.]
tensor([1., 1., 1., 1.], dtype=torch.float64)
'''
b.add_(1)  # add_ modifies b in place
print(a)
print(b)
''' Output: 
[2. 2. 2. 2.]
tensor([2., 2., 2., 2.], dtype=torch.float64)
After the add_ operation on b, both a and b have changed synchronously.
'''

The Tensor and the NumPy array share memory here (a shallow copy), so conversion between them is fast and they change synchronously.
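By contrast, torch.tensor() always copies its input data, so it does not track later changes to the NumPy array; a small comparison sketch (my own addition):


import torch
import numpy as np

a = np.ones(3)
b = torch.from_numpy(a)   # shares memory with a
c = torch.tensor(a)       # always copies the data

a += 1
print(b)  # tensor([2., 2., 2.], dtype=torch.float64) - follows a
print(c)  # tensor([1., 1., 1.], dtype=torch.float64) - unaffected copy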

An operation like y = x + y in torch allocates new memory and then points y to it. To verify this we can use Python's built-in id() function: if two instances have the same id, they refer to the same memory address. Note, however, that PyTorch has a peculiarity here: two tensors may share the underlying data and yet print different ids.


x = torch.tensor([1, 2])
y = torch.tensor([3, 4])
id_0 = id(y)
y = y + x
print(id(y) == id_0) 
# False 

Indexed assignment, by contrast, does not allocate new memory. If we want to write the result into the memory of the original y, we can use an indexed assignment: for example, the result of x + y can be written into the memory corresponding to y via y[:].


x = torch.tensor([1, 2])
y = torch.tensor([3, 4])
id_0 = id(y)
y[:] = y + x
print(id(y) == id_0) 
# True

In addition, the following also write the result into the original memory:

torch.add(x, y, out=y), y += x, and y.add_(x).

x = torch.tensor([1, 2])
y = torch.tensor([3, 4])
id_0 = id(y)
torch.add(x, y, out=y) 
# y += x, y.add_(x)
print(id(y) == id_0) 
# True

Comparison between clone() and detach()

To speed things up, Torch makes vector and matrix assignments point to the same memory, which is different from MATLAB. If you need to keep the old tensor, i.e. allocate new storage rather than just another reference, you can use clone() to make a deep copy.

First, let us print the tensors after a clone() operation and see how they change:

(1) Simply printing the values


import torch

a = torch.tensor(1.0, requires_grad=True)
b = a.clone()
c = a.detach()
a.data *= 3
b += 1

print(a) # tensor(3., requires_grad=True)
print(b)
print(c)

'''
 Output: 
tensor(3., requires_grad=True)
tensor(2., grad_fn=<AddBackward0>)
tensor(3.)  # the value of the detach()-ed tensor changes along with a
'''

grad_fn=<CloneBackward> indicates that the value returned by clone() is an intermediate variable and therefore supports gradient backpropagation. The clone operation can, to some extent, be regarded as an identity-mapping function.

After a detach() operation, the tensor shares data memory with the original tensor, so when the original tensor is modified, the value of the detach()-ed tensor changes as well.

Note: in PyTorch we should not judge whether two tensors share memory simply by comparing their ids. Equal ids are only a sufficient condition: a new tensor object, such as the one returned by detach(), may still share the underlying data memory while having a different id. If we print the ids directly, we see the following.


import torch as t
a = t.tensor([1.0,2.0], requires_grad=True)
b = a.detach()
#c[:] = a.detach()
print(id(a))
print(id(b))
# 140568935450520
# 140570337203616

Clearly the directly printed ids differ even though the data is shared, so it is better to judge by observing whether the data changes after a simple assignment, or by comparing the underlying storage addresses as sketched below.
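A more direct check than watching printed ids is to compare the underlying storage addresses; a minimal sketch (my own, using data_ptr()):


import torch

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = a.detach()

print(id(a) == id(b))                # False - different Python objects
print(a.data_ptr() == b.data_ptr())  # True  - same underlying data memory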

(2) Gradient backpropagation of clone()

The detach() function returns a tensor identical to the original, sharing memory with it, but detached from the computational graph and not involved in gradient computation.

clone(), on the other hand, acts as an intermediate variable: it passes the gradient back to the source tensor, where it is accumulated, but it does not keep its own grad, whose value stays None.


import torch
a = torch.tensor(1.0, requires_grad=True)
a_ = a.clone()
y = a**2
z = a ** 2+a_ * 3
y.backward()
print(a.grad) # 2
z.backward()
print(a_.grad)     # None: a_ is an intermediate (non-leaf) variable, so its grad is not retained
print(a.grad) 
'''
 Output: 
tensor(2.) 
None
tensor(7.) # 2 (from y) + 2 + 3 (from z, including the clone branch) = 7
'''

The new tensor obtained with clone() no longer shares memory with the original data, but it remains in the computational graph: the clone operation supports gradient propagation and accumulation without sharing data memory. It is therefore often used when some unit of a neural network needs to be reused.
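Both properties of clone() (new memory, but gradients still flow back) can be seen in the following minimal sketch (an illustrative example, not from the original article):


import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
xc = x.clone()

print(xc.data_ptr() == x.data_ptr())  # False - clone() allocates new memory

loss = (xc * 3).sum()   # use only the clone in the forward pass
loss.backward()
print(x.grad)           # tensor([3., 3.]) - the gradient flows back to the source x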

Normally, if the original tensor has requires_grad=True, then:

the tensor returned by clone() has requires_grad=True, while the tensor returned by detach() has requires_grad=False.

import torch
torch.manual_seed(0)

x= torch.tensor([1., 2.], requires_grad=True)
clone_x = x.clone() 
detach_x = x.detach()
clone_detach_x = x.clone().detach() 

f = torch.nn.Linear(2, 1)
y = f(x)
y.backward()

print(x.grad)
print(clone_x.requires_grad)
print(clone_x.grad)
print(detach_x.requires_grad)
print(clone_detach_x.requires_grad)
'''
 The output is as follows: 
tensor([-0.0053, 0.3793])
True
None
False
False
'''

Another special case: when the source tensor has requires_grad=False and the cloned tensor is then set to requires_grad=True, no gradient flows back to the source, but the gradient of the cloned tensor itself can still be obtained.

As follows:


import torch
a = torch.tensor(1.0)
a_ = a.clone()
a_.requires_grad_()  # requires_grad is now True
y = a_ ** 2
y.backward()
print(a.grad) # None
print(a_.grad) 
'''
 Output: 
None
tensor(2.)
'''

Summary:

detach(): the new tensor is removed from the computational graph and does not participate in gradient computation.

clone(): the new tensor acts as an intermediate variable; it remains in the computational graph and participates in gradient computation (gradients propagate back and are accumulated on the source), but it generally does not retain its own gradient.

In-place operations on such tensors (e.g. resize_ / resize_as_ / set_ / transpose_) will raise errors or warnings.

If you use in-place operations and no error is raised, you can be sure that your gradient computation is correct. Even so, try to avoid in-place operations whenever possible.

