Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
939 views
in Technique[技术] by (71.8m points)

cuda - How to use GPUDirect RDMA with Infiniband

I have two machines. There are multiple Tesla cards on each machine. There is also an InfiniBand card on each machine. I want to communicate between GPU cards on different machines through InfiniBand. Just point to point unicast would be fine. I surely want to use GPUDirect RDMA so I could spare myself of extra copy operations.

I am aware that there is a driver available now from Mellanox for its InfiniBand cards. But it doesn't offer a detailed development guide. Also I am aware that OpenMPI has support for the feature I am asking. But OpenMPI is too heavy weight for this simple task and it does not support multiple GPUs in a single process.

I wonder if I could get any help with directly using the driver to do the communication. Code sample, tutorial, anything would be good. Also, I would appreciate it if anyone could help me find the code dealing with this in OpenMPI.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

For GPUDirect RDMA to work, you need the following installed:

All of the above should be installed (by the order listed above), and the relevant modules loaded. After that, you should be able to register memory allocated on the GPU video memory for RDMA transactions. Sample code will look like:

void * gpu_buffer;
struct ibv_mr *mr;
const int size = 64*1024;
cudaMalloc(&gpu_buffer,size); // TODO: Check errors
mr = ibv_reg_mr(pd,gpu_buffer,size,IBV_ACCESS_LOCAL_WRITE|IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_REMOTE_READ);

This will create (on a GPUDirect RDMA enabled system) a memory region, with a valid memory key that you can use for RDMA transactions with our HCA.

For more details about using RDMA and InfiniBand verbs in your code, you can refer to this document.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...