I have no idea what 'code preload' means. There is data preloading with the pld
instruction. It would make sense in the context of the sample code.
Here is the basic 'C' version given the assumptions,
The size of the array n is large, and is a multiple of 64 elements and exclusive-OR bit-wise operation with the 8-bit mask and produces an output array outArr.
The code is probably not perfect, but meant to illustrate.
// Gray scale image pixel inversion
void invert(unsigned char *outArr, unsigned char *inArr,
unsigned char k, int n)
{
unsigned int *out = (void*)outArr;
unsigned int *in = (void*)inArr;
unsigned int mask = k<<24|k<<16|k<<8|k;
/* Check arguments */
if( n % 64 != 0) return;
if((int)outArr & 3) return;
if((int)inArr & 3) return;
assert(sizeof(int)==4);
for(int i=0; i<n/sizeof(int); i+=64/(sizeof(int)) {
/* 16 transfers per loop 64/4 */
*out++ = *in++ ^ k; // 1
*out++ = *in++ ^ k;
*out++ = *in++ ^ k;
*out++ = *in++ ^ k;
*out++ = *in++ ^ k; // 5
*out++ = *in++ ^ k;
*out++ = *in++ ^ k;
*out++ = *in++ ^ k;
*out++ = *in++ ^ k; // 9
*out++ = *in++ ^ k;
*out++ = *in++ ^ k;
*out++ = *in++ ^ k;
*out++ = *in++ ^ k; // 13
*out++ = *in++ ^ k;
*out++ = *in++ ^ k;
*out++ = *in++ ^ k;
}
}
You can view the output on godbolt.
The ldm
and stm
instructions can be used to load consecutive memory addresses to registers. We can not use all 16 ARM registers, so the core of the loop in assembler would look like this,
ldmia [r1], {r4,r5,r6,r7,r8,r9,r10,r11} # r1 is inArr
eor r4,r4,r2 # r2 is expanded k
eor r5,r5,r2
eor r6,r6,r2
eor r7,r7,r2
eor r8,r8,r2
eor r9,r9,r2
eor r10,r10,r2
eor r11,r11,r2
stmia [r0], {r4,r5,r6,r7,r8,r9,r10,r11} # r0 is outArr
This is repeated twice and the R0
or R1
can be checked against the array limits stored in R3
. You need to save all of the callee saved registers if you want to be EABI compliant. The register set r4
-r11
can generally be used, but it will depend on the system. You can also use lr
, fp
, etc if you save them and are not exception safe.
From the comments,
I am trying to find that how many cycles does this subroutine take per
array element when it is optimized and when it isn't.
Cycle counts are extremely difficult on modern CPUs. However you have five instructions in the core with a simple loop,
.L3:
ldrb r3, [r1], #1 @ zero_extendqisi2
eor r3, r3, r2
strb r3, [r0], #1
cmp ip, r0
bne .L3
To do 32 bytes, this is 32 * 5 (160) instructions. With 32 * 2 memory accesses.
The expanded options is just one 32byte memory read and write. These will complete, with the lowest value available first. Then just a single EOR
instruction. So it is just 10 instructions versus 160. On modern processors the memory will be the limiting factor. Because of memory stalls, it maybe better to only process four words at a time such as,
ldmia [r1], {r4,r5,r6,r7} # r1 is inArr
eor r4,r4,r2 # r2 is expanded k
eor r5,r5,r2
eor r6,r6,r2
eor r7,r7,r2
ldmia [r1], {r8,r9,r10,r11} # r1 is inArr
stmia [r0], {r4,r5,r6,r7} # r0 is outArr
...
This (or some permutation) will allow the load/store unit and the 'eor' to not block each other, but this will depend on the particular CPU type. This topic is called instruction scheduling; it is more powerful than pld
or data preloading. As well, you can use NEON or ARM64 instructions so that the body of the loop can do more eor
operations before a load/store.