I'm allocating both the input and output MTLBuffer using posix_memalign, according to the shared CPU/GPU memory documentation provided by memkite.

Aside: it is easier to just use the latest API than to muck around with posix_memalign:
let metalBuffer = self.metalDevice.newBufferWithLength(byteCount, options: .StorageModeShared)
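For comparison, here is a minimal sketch of the posix_memalign route (Swift 2-era API; the page-size rounding is my own addition, since newBufferWithBytesNoCopy requires page-aligned storage and a length that is a multiple of the page size):

// Sketch: wrap page-aligned memory in an MTLBuffer without copying.
let pageSize = Int(getpagesize())
let alignedByteCount = (byteCount + pageSize - 1) / pageSize * pageSize
var memory: UnsafeMutablePointer<Void> = nil
posix_memalign(&memory, pageSize, alignedByteCount)
let metalBuffer = self.metalDevice.newBufferWithBytesNoCopy(memory,
    length: alignedByteCount,
    options: .StorageModeShared,
    deallocator: { pointer, _ in free(pointer) })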
My kernel function operates on roughly 16 million complex value structs and writes out an equal number of complex value structs to memory.
I've performed some experiments: my Metal kernel's 'complex math section' executes in 0.003 seconds (Yes!), but writing the result to the buffer takes more than 0.05 seconds (No!). When I commented out the math part and just assigned zero to memory, it took 0.05 seconds; commenting out the assignment and adding the math back in, 0.003 seconds.
Is the shared memory slow in this case, or is there some other tip or trick I might try?
Additional detail
Test platforms
- iPhone 6S: ~0.039 seconds per frame
- iPad Air 2: ~0.130 seconds per frame
The streaming data
Each update to the shader receives approximately 50000 complex numbers in the form of a pair of float types in a struct.
struct ComplexNumber {
    float real;
    float imaginary;
};
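On the CPU side, a Swift struct with the same memory layout can be used when filling the shared buffer. A sketch (the Swift declaration is mine, not from the original project):

// Swift-side mirror of the shader struct: two 32-bit floats, 8 bytes total,
// matching the layout the kernel reads from the shared buffer.
struct ComplexNumber {
    var real: Float = 0.0
    var imaginary: Float = 0.0
}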
Kernel signature
kernel void processChannelData(const device Parameters *parameters [[ buffer(0) ]],
                               const device ComplexNumber *inputSampleData [[ buffer(1) ]],
                               const device ComplexNumber *partAs [[ buffer(2) ]],
                               const device float *partBs [[ buffer(3) ]],
                               const device int *lookups [[ buffer(4) ]],
                               device float *outputImageData [[ buffer(5) ]],
                               uint threadIdentifier [[ thread_position_in_grid ]]);
All the buffers currently contain unchanging data, except inputSampleData, which receives the 50000 samples I'll be operating on. The other buffers contain roughly 16 million values (128 channels x 130000 pixels) each. I perform some operations on each 'pixel', sum the complex result across channels, and finally take the absolute value of the complex number and assign the resulting float to outputImageData.
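A runnable Swift sketch of the shape of that per-pixel work, reusing the Swift ComplexNumber mirror above; channelContribution() is a hypothetical stand-in for the real per-channel math, and channelCount would be 128 here:

import Darwin

// Hypothetical stand-in for the real per-channel math that combines
// inputSampleData, partAs, partBs and lookups into a complex value.
func channelContribution(channel: Int, _ pixel: Int) -> ComplexNumber {
    return ComplexNumber(real: 0.0, imaginary: 0.0)
}

// Per pixel: sum the complex contributions across all channels,
// then keep only the magnitude of the sum as a single float.
func processPixel(pixel: Int, channelCount: Int) -> Float {
    var sum = ComplexNumber()
    for channel in 0..<channelCount {
        let c = channelContribution(channel, pixel)
        sum.real += c.real
        sum.imaginary += c.imaginary
    }
    return sqrtf(sum.real * sum.real + sum.imaginary * sum.imaginary)
}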
Dispatch
commandEncoder.setComputePipelineState(pipelineState)
commandEncoder.setBuffer(parametersMetalBuffer, offset: 0, atIndex: 0)
commandEncoder.setBuffer(inputSampleDataMetalBuffer, offset: 0, atIndex: 1)
commandEncoder.setBuffer(partAsMetalBuffer, offset: 0, atIndex: 2)
commandEncoder.setBuffer(partBsMetalBuffer, offset: 0, atIndex: 3)
commandEncoder.setBuffer(lookupsMetalBuffer, offset: 0, atIndex: 4)
commandEncoder.setBuffer(outputImageDataMetalBuffer, offset: 0, atIndex: 5)
let threadExecutionWidth = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(width: threadExecutionWidth, height: 1, depth: 1)
// Integer division truncates, so this assumes numberOfPixels is a multiple of threadExecutionWidth.
let threadGroups = MTLSize(width: self.numberOfPixels / threadsPerThreadgroup.width, height: 1, depth: 1)
commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
metalCommandBuffer.commit()
metalCommandBuffer.waitUntilCompleted()
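For context, one way to obtain per-frame numbers like those quoted above is to time around the commit/wait pair (the instrumentation here is mine, not necessarily what the project uses):

// Illustrative frame timing around the dispatch above.
let startTime = CFAbsoluteTimeGetCurrent()
metalCommandBuffer.commit()
metalCommandBuffer.waitUntilCompleted()
print("Frame time: \(CFAbsoluteTimeGetCurrent() - startTime) seconds")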
GitHub example
I've written an example called Slow and put it up on GitHub. The bottleneck seems to be the write of the values into the input MTLBuffer. So, I guess the question becomes: how can I avoid this bottleneck?
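One direction is to write each update's samples straight into the shared buffer's storage rather than staging them elsewhere and copying. A sketch of the idea, using the buffer names from the dispatch code above (sampleCount and the fill values are mine):

// Fill the shared input buffer in place; contents() returns the same storage
// the GPU will read, so no separate copy into the MTLBuffer is needed.
let samples = UnsafeMutablePointer<ComplexNumber>(inputSampleDataMetalBuffer.contents())
for index in 0..<sampleCount {
    samples[index] = ComplexNumber(real: 1.0, imaginary: 0.0) // illustrative values
}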
Memory copy
I wrote up a quick test to compare the performance of various byte copying methods.
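For reference, a minimal sketch of that kind of comparison (the sizes and timing scaffolding are mine; cblas_scopy comes from the Accelerate framework):

import Foundation
import Accelerate

let count = 16_000_000
let source = UnsafeMutablePointer<Float>.alloc(count)      // left uninitialized; fine for a copy-throughput test
let destination = UnsafeMutablePointer<Float>.alloc(count)

// Plain libc copy of the whole payload.
var start = CFAbsoluteTimeGetCurrent()
memcpy(destination, source, count * sizeof(Float))
print("memcpy:      \(CFAbsoluteTimeGetCurrent() - start) s")

// BLAS single-precision copy, stride 1 on both sides.
start = CFAbsoluteTimeGetCurrent()
cblas_scopy(Int32(count), source, 1, destination, 1)
print("cblas_scopy: \(CFAbsoluteTimeGetCurrent() - start) s")

source.dealloc(count)
destination.dealloc(count)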
Current status
I've reduced the execution time to around 0.02 seconds, which doesn't sound like a lot, but it makes a big difference in the number of frames per second. Currently the biggest improvements are a result of switching to cblas_scopy().