First, a quick note:
> The files load instantly using `xr.open_dataset`
At this point you probably have not actually loaded the data, only the metadata: `xr.open_dataset` opens files lazily by default. Depending on your I/O bandwidth and compression/encoding, actually reading the data can take considerable CPU time and memory. It helps to have a rough estimate of how long the read ought to take with a single CPU thread.
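A minimal sketch of the distinction (the file name `data.nc` is a placeholder):

```python
import xarray as xr

# Opening is lazy: only metadata (dimensions, coordinates, attributes)
# is read from disk here, which is why it appears instant.
ds = xr.open_dataset("data.nc")

# The actual read happens only when the values are needed,
# e.g. when you call .load() or .compute().
ds = ds.load()
```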
To answer your question:
netCDF (HDF5) does not play nicely with parallel writes. You will likely find that only one task writes at a time because of file locking, or even that all of the output data is funneled to a single task before being written, regardless of your chunking. Check your dask dashboard to see what is actually happening!
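One way to observe this, sketched below with a local `dask.distributed` cluster (the file names and the `time` chunking are placeholders):

```python
import xarray as xr
from dask.distributed import Client

client = Client()                 # local cluster for demonstration
print(client.dashboard_link)      # open this URL to watch the tasks

ds = xr.open_dataset("data.nc", chunks={"time": 100})  # dask-backed

# Watch the task stream while this runs: with netCDF (HDF5) the writes
# are typically serialized by a lock rather than running in parallel.
ds.to_netcdf("out.nc")
```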
May I recommend that you try the zarr format, which works well in parallel applications because each chunk is stored as a separate file. You still need to decide on an appropriate chunking for your data (example).
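A minimal sketch of a parallel zarr write, under the same placeholder names as above:

```python
import xarray as xr

# Chunk sizes are placeholders; pick them to match your access pattern.
ds = xr.open_dataset("data.nc", chunks={"time": 100})

# Each dask chunk is written as a separate file inside the store,
# so the writes can proceed in parallel.
ds.to_zarr("out.zarr", mode="w")
```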