I decided to run some performance tests on the various bit-reversal methods. Using Chad's link, I wrote the following methods:
// 256-entry look-up table: BitReverseTable[b] is b with its bits reversed
public static readonly byte[] BitReverseTable =
{
0x00, 0x80, 0x40, 0xc0, 0x20, 0xa0, 0x60, 0xe0,
0x10, 0x90, 0x50, 0xd0, 0x30, 0xb0, 0x70, 0xf0,
0x08, 0x88, 0x48, 0xc8, 0x28, 0xa8, 0x68, 0xe8,
0x18, 0x98, 0x58, 0xd8, 0x38, 0xb8, 0x78, 0xf8,
0x04, 0x84, 0x44, 0xc4, 0x24, 0xa4, 0x64, 0xe4,
0x14, 0x94, 0x54, 0xd4, 0x34, 0xb4, 0x74, 0xf4,
0x0c, 0x8c, 0x4c, 0xcc, 0x2c, 0xac, 0x6c, 0xec,
0x1c, 0x9c, 0x5c, 0xdc, 0x3c, 0xbc, 0x7c, 0xfc,
0x02, 0x82, 0x42, 0xc2, 0x22, 0xa2, 0x62, 0xe2,
0x12, 0x92, 0x52, 0xd2, 0x32, 0xb2, 0x72, 0xf2,
0x0a, 0x8a, 0x4a, 0xca, 0x2a, 0xaa, 0x6a, 0xea,
0x1a, 0x9a, 0x5a, 0xda, 0x3a, 0xba, 0x7a, 0xfa,
0x06, 0x86, 0x46, 0xc6, 0x26, 0xa6, 0x66, 0xe6,
0x16, 0x96, 0x56, 0xd6, 0x36, 0xb6, 0x76, 0xf6,
0x0e, 0x8e, 0x4e, 0xce, 0x2e, 0xae, 0x6e, 0xee,
0x1e, 0x9e, 0x5e, 0xde, 0x3e, 0xbe, 0x7e, 0xfe,
0x01, 0x81, 0x41, 0xc1, 0x21, 0xa1, 0x61, 0xe1,
0x11, 0x91, 0x51, 0xd1, 0x31, 0xb1, 0x71, 0xf1,
0x09, 0x89, 0x49, 0xc9, 0x29, 0xa9, 0x69, 0xe9,
0x19, 0x99, 0x59, 0xd9, 0x39, 0xb9, 0x79, 0xf9,
0x05, 0x85, 0x45, 0xc5, 0x25, 0xa5, 0x65, 0xe5,
0x15, 0x95, 0x55, 0xd5, 0x35, 0xb5, 0x75, 0xf5,
0x0d, 0x8d, 0x4d, 0xcd, 0x2d, 0xad, 0x6d, 0xed,
0x1d, 0x9d, 0x5d, 0xdd, 0x3d, 0xbd, 0x7d, 0xfd,
0x03, 0x83, 0x43, 0xc3, 0x23, 0xa3, 0x63, 0xe3,
0x13, 0x93, 0x53, 0xd3, 0x33, 0xb3, 0x73, 0xf3,
0x0b, 0x8b, 0x4b, 0xcb, 0x2b, 0xab, 0x6b, 0xeb,
0x1b, 0x9b, 0x5b, 0xdb, 0x3b, 0xbb, 0x7b, 0xfb,
0x07, 0x87, 0x47, 0xc7, 0x27, 0xa7, 0x67, 0xe7,
0x17, 0x97, 0x57, 0xd7, 0x37, 0xb7, 0x77, 0xf7,
0x0f, 0x8f, 0x4f, 0xcf, 0x2f, 0xaf, 0x6f, 0xef,
0x1f, 0x9f, 0x5f, 0xdf, 0x3f, 0xbf, 0x7f, 0xff
};
public static byte ReverseWithLookupTable(byte toReverse)
{
return BitReverseTable[toReverse];
}
// Reverse the bits in a byte with 4 operations (64-bit multiply and mask, no division)
public static byte ReverseBitsWith4Operations(byte b)
{
return (byte)(((b * 0x80200802ul) & 0x0884422110ul) * 0x0101010101ul >> 32);
}
// Reverse the bits in a byte with 3 operations (64-bit multiply and modulus)
public static byte ReverseBitsWith3Operations(byte b)
{
return (byte)((b * 0x0202020202ul & 0x010884422010ul) % 1023);
}
// Reverse the bits in a byte with 7 operations (32-bit arithmetic only)
public static byte ReverseBitsWith7Operations(byte b)
{
return (byte)(((b * 0x0802u & 0x22110u) | (b * 0x8020u & 0x88440u)) * 0x10101u >> 16);
}
public static byte ReverseBitsWithLoop(byte v)
{
byte r = v; // r will be reversed bits of v; first get LSB of v
int s = 7; // extra shift needed at end
for (v >>= 1; v != 0; v >>= 1)
{
r <<= 1;
r |= (byte)(v & 1);
s--;
}
r <<= s; // shift when v's highest bits are zero
return r;
}
// Same bit-by-bit approach as ReverseBitsWithLoop, but with the seven iterations unrolled by hand
public static byte ReverseWithUnrolledLoop(byte b)
{
byte r = b;
b >>= 1;
r <<= 1;
r |= (byte)(b & 1);
b >>= 1;
r <<= 1;
r |= (byte)(b & 1);
b >>= 1;
r <<= 1;
r |= (byte)(b & 1);
b >>= 1;
r <<= 1;
r |= (byte)(b & 1);
b >>= 1;
r <<= 1;
r |= (byte)(b & 1);
b >>= 1;
r <<= 1;
r |= (byte)(b & 1);
b >>= 1;
r <<= 1;
r |= (byte)(b & 1);
b >>= 1; // redundant: b is not used after this point
return r;
}
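Before timing them, it's worth a quick sanity check that all the implementations agree. A minimal check over the 256 possible byte values (just a sketch, assuming the methods above live in the same class) looks like this:
// Sanity-check sketch: every method should return the same value for each possible input.
public static void CheckAllMethodsAgree()
{
    for (int i = 0; i <= 255; i++)
    {
        byte b = (byte)i;
        byte expected = ReverseWithLookupTable(b);
        if (ReverseBitsWith3Operations(b) != expected ||
            ReverseBitsWith4Operations(b) != expected ||
            ReverseBitsWith7Operations(b) != expected ||
            ReverseBitsWithLoop(b) != expected ||
            ReverseWithUnrolledLoop(b) != expected)
        {
            Console.WriteLine("Mismatch for input 0x{0:x2}", b);
        }
    }
}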
Then I tested it (a rough sketch of the timing loop is shown below, after the setup details), and here are the results:
Test setup:
- 100,000,000 random bytes to reverse
- OS: Windows 7 x64
- CPU: AMD Phenom II 955 (4 cores @ 3.2 GHz)
- RAM: 4 GB
- IDE: Visual Studio 2010
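The timing harness itself isn't shown here; a minimal Stopwatch-based version along these lines would do the job (just a sketch, not the exact code behind the numbers below). Note that calling through a delegate adds some overhead of its own, so a real run would more likely call each method directly in its own loop.
// Timing sketch: run one reversal method over the same random input and return the elapsed ticks.
public static long TimeMethod(Func<byte, byte> reverse, byte[] input)
{
    var sw = System.Diagnostics.Stopwatch.StartNew();
    byte sink = 0;
    for (int i = 0; i < input.Length; i++)
        sink ^= reverse(input[i]); // fold results into a sink so the calls aren't optimized away
    sw.Stop();
    Console.WriteLine(sink);       // keep the sink observable
    return sw.ElapsedTicks;
}
// Usage sketch: 100,000,000 random bytes, matching the setup above.
// var data = new byte[100000000];
// new Random(12345).NextBytes(data);  // seed chosen arbitrarily for this sketch
// long lookupTicks = TimeMethod(ReverseWithLookupTable, data);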
Target framework 3.5
-------------------------------------------------------
| Method        | Ticks (x64 mode) | Ticks (x86 mode) |
-------------------------------------------------------
| Loop          |          4861859 |          4079554 |
| Unrolled Loop |          3241781 |          2948026 |
| Look-up table |           894809 |           312410 |
| 3-Operations  |          2068072 |          6757008 |
| 4-Operations  |           893924 |          1972576 |
| 7-Operations  |          1219189 |           303499 |
-------------------------------------------------------
Target framework 4
-------------------------------------------------------
| Method        | Ticks (x64 mode) | Ticks (x86 mode) |
-------------------------------------------------------
| Loop          |          4682654 |          4147036 |
| Unrolled Loop |          3154920 |          2851307 |
| Look-up table |           602686 |           313940 |
| 3-Operations  |          2067509 |          6661542 |
| 4-Operations  |           893406 |          2018334 |
| 7-Operations  |          1193200 |           991792 |
-------------------------------------------------------
So, the look-up table method is not always the fastest :)
That makes sense: memory access is slower than access to CPU registers, so when a method is compiled and optimized well enough to avoid memory access (and to use only a few operations), it can end up faster. (That said, the gap is greatly reduced by the CPU's memory caching.)
It's also interesting to see how differently the methods behave in x64 versus x86 mode, and how the 3.5 and 4.0 frameworks apply different optimizations.