Let me leave cuda dynamic parallelism out of the discussion for the moment (i.e. assume compute capability 3.0 or prior).
remember __ global__ is used for cuda functions that will (only) be called from the host (but execute on the device). If you instantiate this object on the device, it won't work. Furthermore, to get device-accessible private data to be available to the member function, the object would have to be instantiated on the device.
So you could have a kernel invocation (ie. mykernel<<<blocks,threads>>>(...);
embedded in a host object member function, but the kernel definition (i.e. the function definition with the __ global__ decorator) would normally precede the object definition in your source code. And as stated already, such a methodology could not be used for an object instantiated on the device. It would also not have access to ordinary private data defined elsewhere in the object. (It may be possible to come up with a scheme for a host-only object that does create device data, using pointers in global memory, that would then be accessible on the device, but such a scheme seems quite convoluted to me at first glance).
Normally, device-usable member functions would be preceded by the __ device__ decorator. In this case, all the code in the device member function executes from within the thread that called it.
This question gives an example (in my edited answer) of a C++ object with a member function callable from both the host and the device, with appropriate data copying between host and device objects.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…