For instance, I have kernel code that I launch with a number of threads, say N. So, in Amdahl's law the number of processors will be N, right?
Not exactly. A GPU does not have as many physical cores (K) as the number of threads you can launch (N): usually K is around 10^3, while N is in the range 10^4 -- 10^6. However, a significant portion of kernel time is (usually) spent just waiting for data to be read from or written to global memory, so one core can seamlessly handle several threads. This way the device can handle up to N0 threads without them interfering with each other, where N0 is usually several times bigger than K but actually depends on your kernel function.
In my opinion, the best way to determine this N0 is to experimentally measure the performance of your application and then use that data to fit the parameters of Amdahl's law :)
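For example, here is a minimal sketch of such a fit: plain C++ host code that brute-force searches for the parallel fraction p of the simple curve S(N) = 1/((1-p) + p/N) that best matches measured speedups. The data points here are made up purely for illustration:

```cpp
#include <cstdio>

// Simple Amdahl model: predicted speedup for N effective processors,
// with parallel fraction p (serial fraction 1 - p).
double amdahl(double p, double N) {
    return 1.0 / ((1.0 - p) + p / N);
}

int main() {
    // Hypothetical measurements: (effective processors N, measured speedup S).
    const double N[] = {1, 2, 4, 8, 16, 32};
    const double S[] = {1.0, 1.9, 3.5, 6.0, 9.1, 12.3};
    const int n = sizeof(N) / sizeof(N[0]);

    // Brute-force grid search for the p that minimizes squared error.
    double bestP = 0.0, bestErr = 1e100;
    for (double p = 0.0; p <= 1.0; p += 0.0001) {
        double err = 0.0;
        for (int i = 0; i < n; ++i) {
            double d = amdahl(p, N[i]) - S[i];
            err += d * d;
        }
        if (err < bestErr) { bestErr = err; bestP = p; }
    }
    printf("fitted parallel fraction p = %.4f\n", bestP);
    return 0;
}
```

With real data you would measure wall times at several launch configurations and could also fit a constant overhead term alongside p.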
Also, for any CUDA program using a large number of threads, is it safe for me to assume that Amdahl's law reduces to 1/(1-p), where p stands for the parallel portion of the code?
This assumption basically means that you neglect the time for the parallel part of your code (it is executed infinitely fast) and only consider the time for the serial part.
E.g., if you compute the sum of two 100-element vectors on a GPU, then initializing the device, copying the data, the kernel launch overhead, etc. (the serial part) take much more time than the kernel execution (the parallel part). Usually, however, this is not the case.
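As an illustration, here is a hedged sketch of that vector sum in CUDA, using events to time the host-to-device copies (part of the serial overhead) separately from the kernel itself; error checking is omitted and the numbers you get will be device-specific:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Adds two n-element vectors; each thread handles one element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 100;                     // tiny problem: overhead dominates
    const size_t bytes = n * sizeof(float);
    float ha[n], hb[n], hc[n];
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);   // serial: transfer in
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);     // parallel: the kernel
    cudaEventRecord(t2);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);   // serial: transfer out
    cudaEventSynchronize(t2);

    float msCopyIn = 0, msKernel = 0;
    cudaEventElapsedTime(&msCopyIn, t0, t1);
    cudaEventElapsedTime(&msKernel, t1, t2);
    printf("copy-in: %.3f ms, kernel: %.3f ms\n", msCopyIn, msKernel);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```

For n = 100 the copy and launch overhead will typically dwarf the kernel time; the ratio only flips for much larger vectors.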
Also, an individual GPU core does not have the same performance as a CPU core, so you should do some scaling, making Amdahl's law 1 / [(1-p) + k*p/N], where at its simplest k = Frequency(CPU) / Frequency(GPU); sometimes k is increased even more to take into account architectural differences, like a CPU core having a SIMD block.
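As a quick worked example with made-up numbers: take p = 0.95, k = 3 (one GPU core roughly 3x slower than a CPU core) and N = 1000. The scaled law then predicts a speedup of only about 19, not 1000:

```cpp
#include <cstdio>

// Scaled Amdahl's law: 1 / ((1 - p) + k * p / N), where k accounts
// for a single GPU core being slower than a CPU core.
double scaledAmdahl(double p, double k, double N) {
    return 1.0 / ((1.0 - p) + k * p / N);
}

int main() {
    // Hypothetical values: 95% parallel code, GPU core 3x slower, 1000 cores.
    printf("speedup = %.1f\n", scaledAmdahl(0.95, 3.0, 1000.0)); // ~18.9
    return 0;
}
```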
I could also argue against literally applying Amdahl's law to real systems. Sure, it shows the general trend, but it does not capture some non-trivial effects.
First, Amdahl's law assumes that, given an infinite number of cores, the parallel part executes instantly. This assumption is not true (though sometimes it might be pretty accurate). Even if you calculate the sum of two vectors, you can't compute it faster than it takes to add two bytes. One can neglect this "quantum", or include it in the serial portion of the algorithm, but it somewhat "breaks" the idea.
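One hand-wavy way to write this down is to bound the parallel term from below by a minimal time $\tau$ (the "quantum" above, as a fraction of total single-core time), so the speedup saturates instead of growing without bound; this is an illustrative modification, not a standard formulation:

$$ S(N) = \frac{1}{(1-p) + \max\left(\frac{p}{N},\ \tau\right)} \;\longrightarrow\; \frac{1}{(1-p) + \tau} \quad \text{as } N \to \infty $$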
How to correctly estimate, within Amdahl's law, the effect of barrier synchronization, critical sections, atomic operations, etc. is, to the best of my knowledge, an unresolved mystery. Such operations belong to the parallel part, but the walltime of their execution is at best independent of the number of threads and at worst positively dependent on it.
Simple example: broadcast time between computational nodes in a CPU cluster scales as O(log N). Some initialization can take up to O(N) time.
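For illustration, one could informally fold such a cost into the denominator as a term that grows with the number of threads, so that past some N adding threads actually hurts; again, this is a sketch, not an established law:

$$ S(N) = \frac{1}{(1-p) + \frac{p}{N} + c \log N} $$

where c is the per-level cost of the broadcast tree relative to the total single-core time.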
In simple cases one can roughly estimate the benefit of parallelising an algorithm, but (as is often the case with CUDA) the static overhead of using parallel processing might cost more time than the parallel processing itself saves.
So, in my opinion, it is usually simpler to write the application, measure its performance, and use that to plot Amdahl's curve than to try to correctly estimate a priori all the nuances of the algorithm and hardware. In cases where such estimates are easy to make, they are usually obvious without any "laws".