python - Reshaping before as_strided for optimisation

import numpy as np
from numpy.lib.stride_tricks import as_strided

def forward(x, f, s):
    B, H, W, C = x.shape   # e.g. 64, 16, 16, 3
    Fh, Fw, C, _ = f.shape # e.g. 4, 4, 3, 3
    # C is re-declared to emphasise that the dimension is the same
    
    Sh, Sw = s # e.g. 2, 2

    strided_shape = B, 1 + (H - Fh) // Sh, 1 + (W - Fw) // Sw, Fh, Fw, C

    x = as_strided(x, strided_shape, strides=(
        x.strides[0], 
        Sh * x.strides[1], 
        Sw * x.strides[2], 
        x.strides[1], 
        x.strides[2], 
        x.strides[3]), 
    )

    # print(x.flags, f.flags)

    # The reshaping changes the einsum from 'wxyijk,ijkd->wxyd' to 'wxyz,zd->wxyd'
    f = f.reshape(-1, f.shape[-1])
    x = x.reshape(*x.shape[:3], -1) # Bottleneck!
    
    return np.einsum('wxyz,zd->wxyd', x, f, optimize='optimal')

(By contrast, the variant without the reshapes uses return np.einsum('wxyijk,ijkd->wxyd', x, f).)
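As a quick sanity check, the routine can be exercised like this (my own hedged sketch; the random inputs and expected shape are illustrative, not from the original post):

x = np.random.rand(64, 16, 16, 3)
f = np.random.rand(4, 4, 3, 3)

out = forward(x, f, (2, 2))
print(out.shape)  # (64, 7, 7, 3): B, 1 + (16 - 4) // 2, 1 + (16 - 4) // 2, output channels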

For reference, here are the flags for x and f before reshaping:

x.flags:

C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False


f.flags:

C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False

Interestingly, the major bottleneck in the routine is not the einsum but the reshaping (flattening) of x. f does not suffer from this problem: its memory is C-contiguous, so its reshape amounts to a cheap update of the array's metadata without touching the data. x, however, is not C-contiguous (and does not own its data, for that matter), so its reshape is far more expensive: it must materialise a copy, frequently fetching data that is not cache-aligned. This, in turn, is a consequence of the as_strided call on x, whose modified strides disturb the natural memory ordering. (FYI, as_strided itself is incredibly fast, and should remain fast regardless of the strides passed to it.)
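To make the view-versus-copy asymmetry concrete, here is a small self-contained illustration (my own sketch with toy sizes; np.shares_memory tells us whether a reshape returned a view or forced a copy):

import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(2 * 8 * 8 * 3).reshape(2, 8, 8, 3)  # contiguous "image" batch
windows = as_strided(
    a,
    shape=(2, 3, 3, 4, 4, 3),  # B, H', W', Fh, Fw, C with stride 2
    strides=(a.strides[0], 2 * a.strides[1], 2 * a.strides[2],
             a.strides[1], a.strides[2], a.strides[3]),
)
print(np.shares_memory(a, a.reshape(-1, 3)))                    # True: contiguous reshape is a view
print(np.shares_memory(windows, windows.reshape(2, 3, 3, -1)))  # False: overlapping windows force a copy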

Is there a way to achieve the same result without incurring the bottleneck? Perhaps by reshaping x before using as_strided?


Also note that for almost 100% of applications, the dimensions fall in these ranges: B: 1-64; H, W: 1-60; C: 1-8; Fh, Fw: 1-12.

I'm also including some graphs of how the timing varies with the tensor dimensions B (batch size) and H, W (image size) on my device (as you can see, the reshape variant is already reasonably competitive with TensorFlow):

[Graphs: variation with batch size; variation with image size]


EDIT: An interesting find - the reshape algorithm beats the non-reshape algorithm by a factor of 5 on the CPU, but when I use the GPU (i.e. CuPy instead of NumPy), both algorithms are equally fast (around twice as fast as TensorFlow's forward pass).
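For reference, the GPU runs boiled down to swapping the array module (a minimal sketch, assuming a working CUDA device and a CuPy version that ships the NumPy-compatible cupy.lib.stride_tricks.as_strided; nothing else in the routine changes):

import cupy as cp
from cupy.lib.stride_tricks import as_strided  # CuPy's counterpart of the NumPy helper

x = cp.random.rand(64, 16, 16, 3, dtype=cp.float32)
f = cp.random.rand(4, 4, 3, 3, dtype=cp.float32)
Sh, Sw = 2, 2
B, H, W, C = x.shape
Fh, Fw, _, _ = f.shape
shape = (B, 1 + (H - Fh) // Sh, 1 + (W - Fw) // Sw, Fh, Fw, C)
x = as_strided(x, shape, (x.strides[0], Sh * x.strides[1], Sw * x.strides[2],
                          x.strides[1], x.strides[2], x.strides[3]))
out = cp.einsum('wxyijk,ijkd->wxyd', x, f)  # same subscripts as the NumPy version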

Question from: https://stackoverflow.com/questions/65516926/reshaping-before-as-strided-for-optimisation


1 Reply


Reshaping the strided array is somewhat costly, for the reasons you've mentioned (a copy of a non-contiguous array), but not as costly as you think. Depending on the tensor sizes, np.einsum itself can be the bottleneck in your application. As mentioned in Convolutional layer in Python using Numpy, np.tensordot can be a good candidate to replace np.einsum.

Just to give you a quick example:

import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.arange(64*221*221*3).reshape((64, 221, 221, 3))
f = np.arange(4*4*3*5).reshape((4, 4, 3, 5))
s = (2, 2)

B, H, W, C = x.shape   # 64, 221, 221, 3
Fh, Fw, C, _ = f.shape # 4, 4, 3, 5
Sh, Sw = s # e.g. 2, 2
strided_shape = B, 1 + (H - Fh) // Sh, 1 + (W - Fw) // Sw, Fh, Fw, C
print(strided_shape)
# (64, 109, 109, 4, 4, 3)

After initializing the variables, we can time the individual parts of the code:

%timeit x_strided = as_strided(x, strided_shape, strides=(x.strides[0], Sh * x.strides[1], Sw * x.strides[2], x.strides[1], x.strides[2], x.strides[3]), )
>>> 7.11 μs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit f_reshaped = f.reshape(-1, f.shape[-1])
>>> 450 ns ± 7.43 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit x_reshaped = x_strided.reshape(*x_strided.shape[:3], -1) # Bottleneck!
>>> 94.6 ms ± 896 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# einsum without reshape
%timeit np.einsum('wxy...,...d->wxyd', x_strided, f, optimize='optimal')
>>> 809 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# einsum with reshape
%%timeit
f_reshaped = f.reshape(-1, f.shape[-1])
x_reshaped = x_strided.reshape(*x_strided.shape[:3], -1) # Bottleneck!
k = np.einsum('wxyz,zd->wxyd', x_reshaped, f_reshaped, optimize='optimal')
>>> 549 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# tensordot without reshape
%timeit k = np.tensordot(x_strided, f, axes=3)
>>> 271 ms ± 4.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# tensordot with reshape
%%timeit
f_reshaped = f.reshape(-1, f.shape[-1])
x_reshaped = x_strided.reshape(*x_strided.shape[:3], -1) # Bottleneck!
k = np.tensordot(x_reshaped, f_reshaped, axes=(3, 0))
>>> 266 ms ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I got similar relative results with the tensor sizes from your code (i.e. 64, 16, 16, 3 and 4, 4, 3, 3).

As you can see, the reshape operation adds overhead, but it makes the subsequent matrix operations faster because they then work on contiguous data. Please be aware that the results will vary depending on CPU speed, CPU architecture/generation, etc.
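As a final cross-check (my own addition, reusing the arrays defined above), all three variants agree numerically; the cast to float64 just puts every path on the same floating-point footing:

xf = x.astype(np.float64)
ff = f.astype(np.float64)
x_strided = as_strided(xf, strided_shape, strides=(
    xf.strides[0], Sh * xf.strides[1], Sw * xf.strides[2],
    xf.strides[1], xf.strides[2], xf.strides[3]))

ref = np.einsum('wxy...,...d->wxyd', x_strided, ff, optimize='optimal')
print(np.allclose(ref, np.tensordot(x_strided, ff, axes=3)))        # True
print(np.allclose(ref, np.einsum('wxyz,zd->wxyd',
                                 x_strided.reshape(*x_strided.shape[:3], -1),
                                 ff.reshape(-1, ff.shape[-1]),
                                 optimize='optimal')))              # True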

