Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


cluster computing - How to get multiple GPUs of the same type on Slurm?

How can I create a job that uses multiple GPUs of the same type without naming that type explicitly? My experiment requires that all GPUs be of the same type, but which type it is does not matter.

Currently, I can only create a multi-GPU experiment by stating exactly which type I want:

--gres=gpu:gres_type:amount

If I don't specify gres_type, I sometimes get a mixed set of GPUs (say, 2x titan V and 2x titan X).



1 Reply


If you are fortunate enough that the cluster is consistent in the types of nodes that host the GPUs, and that the features of the nodes are properly specified and make it possible to distinguish the nodes hosting the different GPU types, you can use the --constraint parameter.

For the sake of argument, let's assume that the nodes hosting the titanV have haswell CPUs, that those hosting the titanX have skylake CPUs, and that both are defined as node features. Then you can request

--gres=gpu:2
--constraint=[haswell|skylake]
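
As a minimal sketch of how the request above might be assembled on the command line: the square brackets and the | are special characters to the shell, so the constraint has to be quoted. The helper name build_constraint is hypothetical (it is not part of Slurm), and haswell and skylake are only the example feature names assumed above; the real names depend on how the cluster is configured.

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm command): join feature names into a
# bracketed "matching OR" constraint, e.g. [haswell|skylake].
build_constraint() {
    local IFS='|'          # "$*" joins the arguments with the first char of IFS
    printf '[%s]\n' "$*"
}

# Example use, assuming the features haswell and skylake exist on the cluster.
# The quotes keep the shell from interpreting the brackets and the pipe:
# sbatch --gres=gpu:2 --constraint="$(build_constraint haswell skylake)" submit.sh
```

To find out which features are actually defined, something like sinfo -o "%N %f %G" should list each node with its features and generic resources.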

If the above does not apply to your use case, you can submit two jobs and keep only the one that starts first. To do that, give both jobs the same name and use the singleton dependency.

Write a submission script like this one:

#!/bin/bash
#SBATCH --dependency=singleton
#SBATCH --job-name=gpujob
# Other options

# Once this job starts, cancel the still-pending jobs that share its name.
scancel --state=PENDING --jobname=gpujob

# etc.

and submit it twice, once for each GPU type, with

$ sbatch --gres=gpu:titanX:2 submit.sh
$ sbatch --gres=gpu:titanV:2 submit.sh

Each job will be assigned only one type of GPU, and the first one that starts will cancel the other. This approach also scales to more than two GPU types.
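
With more than two GPU types, the submissions could be generated in a loop. The following is only a sketch: the function name submit_per_gpu_type and the SBATCH_CMD variable are hypothetical conveniences, and the GPU type names must match the GRES types defined on your cluster.

```shell
#!/bin/bash
# Hypothetical helper: submit the same script once per GPU type.
# Usage: submit_per_gpu_type <gpu_count> <script> <gpu_type>...
# Set SBATCH_CMD="echo sbatch" to print the commands instead of running them.
submit_per_gpu_type() {
    local count=$1 script=$2 gpu_type
    shift 2
    for gpu_type in "$@"; do
        # SBATCH_CMD is left unquoted on purpose so "echo sbatch" splits into two words.
        ${SBATCH_CMD:-sbatch} --gres="gpu:${gpu_type}:${count}" "$script"
    done
}

# Example (assumes GRES types titanX and titanV, as in the commands above):
# submit_per_gpu_type 2 submit.sh titanX titanV
```

Running it first with SBATCH_CMD="echo sbatch" is a cheap way to check the generated commands before anything is queued.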

