I have a large cluster of computing nodes, each with 6 GPUs, and I want to start, let's say, 100 workers on it, each with access to exactly one GPU.
What I do now is like this:
sbatch --gres=gpu:6 --gpus-per-task=1 --ntasks='100' main.sh
And inside the main.sh:
srun --gpus-per-task=1 --gres=gpu:1 -n 100 worker.sh
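To make the setup concrete, main.sh is essentially just the line above (a simplified sketch; the echo is only there to show where I checked the variable):

#!/bin/bash
# main.sh - started once by sbatch; at this point the job sees all 6 GPUs of the node
echo "main.sh sees CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"   # prints 0,1,2,3,4,5
# fan out the 100 workers, requesting one GPU per task
srun --gpus-per-task=1 --gres=gpu:1 -n 100 worker.sh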
This way, I get 100 workers started (fully using about 17 nodes). But I have a problem: CUDA_VISIBLE_DEVICES is not set properly.
sbatch --gres=gpu:6 --gpus-per-task=1 --ntasks='100' main.sh
# CUDA_VISIBLE_DEVICES in main.sh: 0,1,2,3,4,5 (that's fine)
srun --gpus-per-task=1 --gres=gpu:1 -n 100 worker.sh
# CUDA_VISIBLE_DEVICES in worker.sh: 0,1,2,3,4,5 (this is my problem: how do I assign exactly one GPU to each worker, and to that worker alone?)
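To double-check, a stripped-down worker.sh along these lines is what produces the output above (the real worker does more; SLURM_PROCID and SLURM_LOCALID are printed only to tell the tasks apart):

#!/bin/bash
# worker.sh (diagnostic sketch) - every one of the 100 tasks prints the full
# list 0,1,2,3,4,5 instead of the single device I expect with --gpus-per-task=1
echo "task ${SLURM_PROCID} (local rank ${SLURM_LOCALID}) on $(hostname): CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"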
It might be a misunderstanding on my part of how Slurm actually works, since I'm quite new to programming on such HPC systems. But does anyone have a clue how to achieve what I want (each worker having exactly one GPU assigned to it and only it)?
We use SLURM 20.02.2.
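For what it's worth, I could probably work around this manually inside worker.sh by deriving the device index from SLURM_LOCALID, roughly as sketched below (this assumes exactly 6 tasks land on each node so the local IDs run 0..5, and my_worker is just a placeholder for the real worker command), but I'd much rather have Slurm do the assignment itself so it stays consistent with the actual allocation:

#!/bin/bash
# worker.sh with a manual pinning workaround (sketch) - assumes the number of
# tasks per node matches the 6 GPUs, so SLURM_LOCALID maps 1:1 onto GPU indices
export CUDA_VISIBLE_DEVICES=${SLURM_LOCALID}
exec ./my_worker "$@"   # placeholder for the real worker command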