Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


javascript - Node.js performance in file system I/O across multiple disk drives: worker threads or not?

I've read several questions and answers here comparing Node.js' ability to handle file I/O efficiently in a non-blocking way against using worker threads with either blocking or non-blocking requests, but none seem to answer the question I have.

I'm writing a Node.js application that will be opening, hashing, and writing very large files (multiple gigabytes) that are stored on multiple hard drives. I'm exploring the idea of worker threads, as they'd allow me to isolate commands to a particular hard drive. For example: assume I have one thread copying a file from hard drive A to hard drive B, and another thread copying a file from hard drive C to hard drive D.

Assuming I scale this to many more hard drives all at the same time, does it make more sense to just use Node.js without worker threads and let it handle all these requests, or do worker threads make more sense if I can isolate I/O by drive and handle multiple drives' worth of requests at the same time?

Given what I've read, worker threads seem like the obvious solution, but I've also seen that just letting the single Node.js process handle a queue of file I/O is generally faster. Thanks for any guidance you can offer!

question from:https://stackoverflow.com/questions/65713419/node-js-performance-in-file-system-i-o-across-multiple-disk-drives-worker-threa


1 Reply


Edit: Apparently (based on a comment below), nodejs has only one thread pool shared across all the worker threads. If that's the case, then the only way to get a separate pool per disk would be to use multiple processes, not multiple threads.

Or, you could enlarge the worker pool and then make your own queuing system that only puts a couple requests for each separate disk into the worker pool at a time, giving you more parallelism across separate drives.
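As a rough sketch of that queuing idea: the `PerDiskLimiter` class below is a hypothetical helper (not part of Node.js), and the `diskKey` labels and the limit of 2 per disk are assumptions you would tune yourself. It only controls how many tasks per drive are handed to the pool at once; it does not change the pool itself.

```javascript
'use strict';

// Hypothetical helper: caps the number of in-flight tasks per disk so that
// one busy drive cannot occupy every thread in the shared libuv pool.
class PerDiskLimiter {
  constructor(maxPerDisk = 2) {
    this.maxPerDisk = maxPerDisk;
    this.active = new Map();  // diskKey -> number of running tasks
    this.waiting = new Map(); // diskKey -> queued starter functions
  }

  // Runs `task` (a function returning a promise) once `diskKey` has a free slot.
  run(diskKey, task) {
    return new Promise((resolve, reject) => {
      const start = () => {
        this.active.set(diskKey, (this.active.get(diskKey) || 0) + 1);
        Promise.resolve()
          .then(task)
          .then(resolve, reject)
          .finally(() => this._release(diskKey));
      };
      if ((this.active.get(diskKey) || 0) < this.maxPerDisk) {
        start();
      } else {
        if (!this.waiting.has(diskKey)) this.waiting.set(diskKey, []);
        this.waiting.get(diskKey).push(start);
      }
    });
  }

  // Frees a slot and starts the next queued task for the same disk, if any.
  _release(diskKey) {
    this.active.set(diskKey, this.active.get(diskKey) - 1);
    const queue = this.waiting.get(diskKey);
    if (queue && queue.length > 0) queue.shift()();
  }
}
```

You would call it as something like `limiter.run('driveA', () => fs.promises.copyFile(src, dest))`. To enlarge the pool, set the `UV_THREADPOOL_SIZE` environment variable (the default is 4); the safest place to set it is in the environment when launching `node`, since the pool reads it when it is first initialized.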

Original answer (some of which still applies):

Without worker threads, a single libuv thread pool serves all disk I/O requests. Every request goes into the same pool, and once the pool's threads are busy (regardless of which disk they are serving), new requests are queued in the order they arrive. This is less than ideal: if you have 5 requests for drive A, 1 for drive B, and 1 for drive C, you don't want the pool to fill up with the drive A requests first, because the requests for drives B and C then have to wait until several drive A requests finish before they can start. That loses opportunities for parallelism across the separate drives. Of course, whether you truly get parallelism across drives also depends on the drive controller implementation and whether the drives actually sit on separate SATA controllers.
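To make the 5/1/1 example concrete, here is a small simulation (no real I/O). It assumes the default pool size of 4; `interleaveByDrive` is a hypothetical helper that reorders requests so the drives take turns before they are submitted.

```javascript
'use strict';

// Hypothetical helper: reorders requests round-robin by drive,
// preserving arrival order within each drive.
function interleaveByDrive(requests) {
  const byDrive = new Map();
  for (const req of requests) {
    if (!byDrive.has(req.drive)) byDrive.set(req.drive, []);
    byDrive.get(req.drive).push(req);
  }
  const queues = [...byDrive.values()];
  const ordered = [];
  while (ordered.length < requests.length) {
    for (const q of queues) {
      if (q.length > 0) ordered.push(q.shift());
    }
  }
  return ordered;
}

const POOL_SIZE = 4; // libuv's default thread pool size
const arrivals = ['A', 'A', 'A', 'A', 'A', 'B', 'C'].map((drive, id) => ({ id, drive }));

// Which drives occupy the pool's threads first under each ordering?
const fifoFirst = arrivals.slice(0, POOL_SIZE).map((r) => r.drive);
const fairFirst = interleaveByDrive(arrivals).slice(0, POOL_SIZE).map((r) => r.drive);

console.log(fifoFirst.join(''), fairFirst.join('')); // prints "AAAA ABCA"
```

In arrival order, all four threads start on drive A; with the interleaved order, drives B and C each get a thread right away.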

If you did use worker threads, one nodejs worker thread for each disk, the idea was that each disk would get its own pool of OS threads, making it much more likely that no set of requests for one drive keeps the requests for the other drives from starting and running in parallel. As the edit at the top notes, that only holds if each worker really has its own pool; since the libuv thread pool is shared across the whole process, you would need separate processes to get that guarantee.

Now, of course, all of this discussion is theoretical. In the world of disk drives, controller cards, operating systems on top of the controllers with libuv on top of that with nodejs on top of that, there are lots of opportunities for the theoretical discussion to not bear out in real world measurements.

So, the only way to know for sure is to implement the worker thread option and benchmark it against a non-worker-thread option under several different disk usage scenarios, including a couple you think might be worst cases. As with any important performance question, you have to measure to know one way or the other. The benchmark itself also needs careful construction for the results to be useful.
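A minimal starting point for such a benchmark might look like the sketch below. The function names `copyAndHash` and `timeJobs` are hypothetical, SHA-256 is an assumed choice of hash, and this only covers the plain single-process case; you would time the worker or multi-process variant the same way and compare.

```javascript
'use strict';
const fs = require('node:fs');
const crypto = require('node:crypto');
const { Transform } = require('node:stream');
const { pipeline } = require('node:stream/promises');

// Copies src to dest while hashing the bytes as they stream through,
// so a multi-gigabyte file is never held in memory all at once.
async function copyAndHash(src, dest) {
  const hash = crypto.createHash('sha256');
  const tap = new Transform({
    transform(chunk, _encoding, callback) {
      hash.update(chunk);
      callback(null, chunk); // pass the data through unchanged
    },
  });
  await pipeline(fs.createReadStream(src), tap, fs.createWriteStream(dest));
  return hash.digest('hex');
}

// Starts all jobs concurrently and reports the wall-clock time in milliseconds.
// `jobs` is an array of { src, dest } paths (ideally on different drives).
async function timeJobs(jobs) {
  const started = process.hrtime.bigint();
  const digests = await Promise.all(jobs.map(({ src, dest }) => copyAndHash(src, dest)));
  const elapsedMs = Number(process.hrtime.bigint() - started) / 1e6;
  return { elapsedMs, digests };
}
```

Keep in mind that the OS file cache can make repeated runs over the same files look much faster than the disks really are, so use files large enough (or fresh enough) that reads actually hit the drives.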

