It's relatively simple.
Under the simplifying assumption that you request one process per host, Slurm provides all the information you need in environment variables, specifically SLURM_PROCID, SLURM_NPROCS and SLURM_NODELIST.
For example, you can initialize your task index, the number of tasks and the nodelist as follows:
import os
from hostlist import expand_hostlist

task_index = int(os.environ['SLURM_PROCID'])   # rank of this process
n_tasks = int(os.environ['SLURM_NPROCS'])      # total number of processes
# one "host:port" entry per allocated node
tf_hostlist = [("%s:22222" % host) for host in
               expand_hostlist(os.environ['SLURM_NODELIST'])]
Note that Slurm gives you the host list in its compressed format (e.g., "myhost[11-99]"), which you need to expand. I do that with the hostlist module by Kent Engström, available here: https://pypi.python.org/pypi/python-hostlist
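To make the compressed format concrete, here is a minimal sketch of what the expansion does. The helper name expand_simple_hostlist is hypothetical and not part of python-hostlist; it assumes at most one bracket group per entry, so the real module remains the safer choice for anything more complex.

```python
import re

def expand_simple_hostlist(nodelist):
    """Expand e.g. 'myhost[11-13,20],other1' into a list of host names.

    Hypothetical helper: handles at most one bracket group per entry.
    """
    hosts = []
    # Split on commas that are not inside square brackets.
    for entry in re.findall(r'[^,\[]+(?:\[[^\]]*\])?[^,]*', nodelist):
        match = re.fullmatch(r'([^\[]*)\[([^\]]*)\](.*)', entry)
        if match is None:
            hosts.append(entry)  # plain host name, nothing to expand
            continue
        prefix, ranges, suffix = match.groups()
        for part in ranges.split(','):
            lo, _, hi = part.partition('-')
            hi = hi or lo      # a single number such as "20"
            width = len(lo)    # preserve zero padding, e.g. node[001-003]
            for n in range(int(lo), int(hi) + 1):
                hosts.append('%s%0*d%s' % (prefix, width, n, suffix))
    return hosts
```

If you would rather not write or install anything, running `scontrol show hostnames` on the node list from a shell performs the same expansion.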
At that point, you can go right ahead and create your TensorFlow cluster specification and server with the information you have available, e.g.:
import tensorflow as tf

cluster = tf.train.ClusterSpec({"your_taskname": tf_hostlist})
server = tf.train.Server(cluster.as_cluster_def(),
                         job_name="your_taskname",
                         task_index=task_index)
And you're set! You can now perform TensorFlow node placement on a specific host of your allocation with the usual syntax:
for idx in range(n_tasks):
    with tf.device("/job:your_taskname/task:%d" % idx):
        ...
A flaw in the code above is that every job instructs TensorFlow to start servers listening on the fixed port 22222. If several such jobs happen to be scheduled on the same node, the second one will fail to bind to port 22222.
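To see this failure mode in isolation, the following sketch tries to bind a plain TCP socket to a port and reports whether it succeeded. The helper name port_is_free is hypothetical, and this only illustrates the conflict; it is not a reliable way to pick a port, since another process can grab the port between the check and the moment TensorFlow starts listening.

```python
import socket

def port_is_free(port, host=''):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except OSError:
        # Typically "address already in use": someone else holds the port.
        return False
    finally:
        sock.close()
```

For example, if another job's server on the same node already listens on 22222, port_is_free(22222) would be expected to return False there.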
A better solution is to let Slurm reserve ports for each job. You need to bring your Slurm administrator on board and ask them to configure Slurm so that you can request ports with the --resv-ports option. In practice, this means asking them to add a line like the following to slurm.conf:
MpiParams=ports=15000-19999
Before you bug your Slurm admin, check which options are already configured, e.g., with:
scontrol show config | grep MpiParams
If your site already uses an old version of OpenMPI, there's a chance an option like this is already in place.
Then, amend my first snippet of code as follows:
import os
from hostlist import expand_hostlist

task_index = int(os.environ['SLURM_PROCID'])
n_tasks = int(os.environ['SLURM_NPROCS'])
# first port of the range Slurm reserved for this job step
port = int(os.environ['SLURM_STEP_RESV_PORTS'].split('-')[0])
tf_hostlist = [("%s:%s" % (host, port)) for host in
               expand_hostlist(os.environ['SLURM_NODELIST'])]
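If you want the same script to run both with and without reserved ports, a small wrapper along these lines may help. This is only a sketch: the function name first_reserved_port is hypothetical, the fallback to 22222 is my assumption, and it assumes the variable holds either a single port or a range such as "15000-15003".

```python
import os

def first_reserved_port(env=os.environ, default=22222):
    """Return the first port in SLURM_STEP_RESV_PORTS, or `default` if unset."""
    value = env.get('SLURM_STEP_RESV_PORTS')
    if not value:
        # No reservation was made (e.g. the step ran without --resv-ports).
        return default
    # Take the first item, then the low end of its range.
    return int(value.split(',')[0].split('-')[0])
```

Passing the environment as an argument keeps the function easy to try outside a Slurm allocation, e.g. first_reserved_port({'SLURM_STEP_RESV_PORTS': '15000-15003'}).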
Good luck!