I have a large cluster of computing nodes, each with 6 GPUs, and I want to start, let's say, 100 workers on it, each with access to exactly one GPU.
What I do now is like this:
sbatch --gres=gpu:6 --gpus-per-task=1 --ntasks='100' main.sh
And inside the main.sh:
srun --gpus-per-task=1 --gres=gpu:1 -n 100 worker.sh
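To make the setup concrete, main.sh is essentially just the line above (a simplified sketch; the echo is only there to show where I checked the variable):

#!/bin/bash
# main.sh - started once by sbatch; at this point the job sees all 6 GPUs of the node
echo "main.sh sees CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"   # prints 0,1,2,3,4,5
# fan out the 100 workers, requesting one GPU per task
srun --gpus-per-task=1 --gres=gpu:1 -n 100 worker.sh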
This way, I get 100 workers started (fully using about 17 nodes). But I have a problem: CUDA_VISIBLE_DEVICES is not set properly.
sbatch --gres=gpu:6 --gpus-per-task=1 --ntasks='100' main.sh
# CUDA_VISIBLE_DEVICES in main.sh: 0,1,2,3,4,5 (that's fine)
srun --gpus-per-task=1 --gres=gpu:1 -n 100 worker.sh
# CUDA_VISIBLE_DEVICES in worker.sh: 0,1,2,3,4,5 (this is my problem: how do I assign exactly one GPU to each worker, and to that worker alone?)
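To double-check, a stripped-down worker.sh along these lines is what produces the output above (the real worker does more; SLURM_PROCID and SLURM_LOCALID are printed only to tell the tasks apart):

#!/bin/bash
# worker.sh (diagnostic sketch) - every one of the 100 tasks prints the full
# list 0,1,2,3,4,5 instead of the single device I expect with --gpus-per-task=1
echo "task ${SLURM_PROCID} (local rank ${SLURM_LOCALID}) on $(hostname): CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"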
It might be a misunderstanding on my part of how Slurm actually works, since I'm quite new to programming on such HPC systems. But does anyone have a clue how to achieve what I want (each worker having exactly one GPU assigned to it and only it)?
We use SLURM 20.02.2.
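For what it's worth, I could probably work around this manually inside worker.sh by deriving the device index from SLURM_LOCALID, roughly as sketched below (this assumes exactly 6 tasks land on each node so the local IDs run 0..5, and my_worker is just a placeholder for the real worker command), but I'd much rather have Slurm do the assignment itself so it stays consistent with the actual allocation:

#!/bin/bash
# worker.sh with a manual pinning workaround (sketch) - assumes the number of
# tasks per node matches the 6 GPUs, so SLURM_LOCALID maps 1:1 onto GPU indices
export CUDA_VISIBLE_DEVICES=${SLURM_LOCALID}
exec ./my_worker "$@"   # placeholder for the real worker command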