Is there really no asynchronous block I/O on Linux?

posted Oct 6, 2021 in Technique by 深蓝 (71.8m points)

Consider an application that is CPU bound, but also has high-performance I/O requirements.

I'm comparing Linux file I/O to Windows, and I can't see how epoll will help a Linux program at all. The kernel will tell me that the file descriptor is "ready for reading," but I still have to call blocking read() to get my data, and if I want to read megabytes, it's pretty clear that that will block.

On Windows, I can create a file handle with OVERLAPPED set, and then use non-blocking I/O, and get notified when the I/O completes, and use the data from that completion function. I need to spend no application-level wall-clock time waiting for data, which means I can precisely tune my number of threads to my number of cores, and get 100% efficient CPU utilization.

If I have to emulate asynchronous I/O on Linux, then I have to allocate some number of threads to do this, and those threads will spend a little bit of time doing CPU things, and a lot of time blocking for I/O, plus there will be overhead in the messaging to/from those threads. Thus, I will either over-subscribe or under-utilize my CPU cores.
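To make the emulation concrete, here is a minimal sketch of the thread-based approach described above: blocking `os.pread()` calls are pushed onto a small pool of I/O threads and the results are collected as they finish. The function names, pool size and the temporary file are assumptions made for illustration only.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed


def read_chunks_with_threads(path, chunk_size, num_chunks, io_threads=4):
    """Emulate async file reads by running blocking pread() calls on a pool."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=io_threads) as pool:
            # Each task blocks inside pread() on an I/O thread, so the
            # submitting thread stays free until it collects the results.
            futures = {
                pool.submit(os.pread, fd, chunk_size, i * chunk_size): i
                for i in range(num_chunks)
            }
            results = {}
            for fut in as_completed(futures):
                results[futures[fut]] = fut.result()
        return results
    finally:
        os.close(fd)


def demo_thread_pool():
    # Hypothetical sample file: 16 KiB of repeating byte values.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 64)
        path = tmp.name
    try:
        return read_chunks_with_threads(path, 4096, 4)
    finally:
        os.unlink(path)


if __name__ == "__main__":
    chunks = demo_thread_pool()
    print(sorted(chunks), [len(c) for c in chunks.values()])
```

The extra threads are exactly the cost the question complains about: they mostly sit blocked in the kernel, and every result has to be handed back across a thread boundary.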

I looked at mmap() + madvise() (WILLNEED) as a "poor man's async I/O" but it still doesn't get all the way there, because I can't get a notification when it's done -- I have to "guess" and if I guess "wrong" I will end up blocking on memory access, waiting for data to come from disk.
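A rough sketch of the mmap() + madvise() idea, assuming Python 3.8+ on Linux (where `mmap.madvise` and `mmap.MADV_WILLNEED` are available). The helper names and the 1 MiB temporary file are made up for the example.

```python
import mmap
import os
import tempfile


def prefetch_and_read(path):
    """Map a file, hint the kernel to read it ahead, then touch the data."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # MADV_WILLNEED is only a hint: the call returns right away and
            # nothing reports when (or whether) the pages have arrived.
            if hasattr(mapped, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
                mapped.madvise(mmap.MADV_WILLNEED)
            # ... CPU work would go here while readahead (hopefully) runs ...
            # This access blocks on a page fault if the data is not resident.
            return bytes(mapped[:16]), len(mapped)


def demo_prefetch():
    # Hypothetical sample file of 1 MiB.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * (1 << 20))
        path = tmp.name
    try:
        return prefetch_and_read(path)
    finally:
        os.unlink(path)


if __name__ == "__main__":
    head, size = demo_prefetch()
    print(head, size)
```

Note that there is no completion signal anywhere in this code, which is the gap described above.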

Linux seems to have the starts of async I/O in io_submit, and it seems to also have a user-space POSIX aio implementation, but it's been that way for a while, and I know of nobody who would vouch for these systems for critical, high-performance applications.

The Windows model works roughly like this:

1. Issue an asynchronous operation.
2. Tie the asynchronous operation to a particular I/O completion port.
3. Wait on operations to complete on that port.
4. When the I/O is complete, the thread waiting on the port unblocks, and returns a reference to the pending I/O operation.

Steps 1/2 are typically done as a single thing. Steps 3/4 are typically done with a pool of worker threads, not (necessarily) the same thread as issues the I/O. This model is somewhat similar to the model provided by boost::asio, except boost::asio doesn't actually give you asynchronous block-based (disk) I/O.
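The control flow of that model can be sketched with a toy completion queue. This is not the Windows API: `CompletionPort`, `submit_read` and `wait` are hypothetical names, and the "kernel" is faked with helper threads doing blocking `os.pread()` calls. It only shows the shape of the interface: submit first, then wait for finished operations.

```python
import os
import queue
import tempfile
import threading


class CompletionPort:
    """Toy completion queue: reads are submitted, finished results dequeued."""

    def __init__(self, io_threads=2):
        self._requests = queue.Queue()
        self._completions = queue.Queue()
        self._threads = [
            threading.Thread(target=self._io_loop, daemon=True)
            for _ in range(io_threads)
        ]
        for t in self._threads:
            t.start()

    def submit_read(self, fd, length, offset, key):
        # Steps 1 and 2: issue the operation and tie it to this port.
        self._requests.put((fd, length, offset, key))

    def wait(self, timeout=None):
        # Steps 3 and 4: block until some operation has completed, then
        # return the key identifying it together with its result.
        return self._completions.get(timeout=timeout)

    def _io_loop(self):
        # Stand-in for the kernel: perform the I/O, then post a completion.
        while True:
            req = self._requests.get()
            if req is None:
                return
            fd, length, offset, key = req
            try:
                result = os.pread(fd, length, offset)
            except OSError as exc:
                result = exc
            self._completions.put((key, result))

    def close(self):
        for _ in self._threads:
            self._requests.put(None)
        for t in self._threads:
            t.join()


def demo_completion_port():
    with tempfile.TemporaryFile() as f:
        f.write(b"A" * 10 + b"B" * 10 + b"C" * 10)
        f.flush()
        port = CompletionPort()
        try:
            for i, key in enumerate(("a", "b", "c")):
                port.submit_read(f.fileno(), 10, i * 10, key)
            # By the time wait() returns, the data is already in hand.
            return dict(port.wait(timeout=5) for _ in range(3))
        finally:
            port.close()


if __name__ == "__main__":
    print(demo_completion_port())
```

With a real completion port the helper threads disappear; the waiting workers are the only threads the application needs.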

The difference from epoll on Linux is that when the waiting thread unblocks (step 4), no I/O has happened yet: epoll only reports readiness, so issuing the operation (step 1) ends up coming after step 4, which is "backwards" if you already know exactly what you need.
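For contrast, a small sketch of the readiness model using Python's `select.epoll` on Linux. The function names are invented for the example. The pipe case shows that the notification arrives before any data has been transferred. The second helper relies on documented epoll_ctl(2) behaviour: registering a descriptor that does not support epoll, such as a regular file, fails with EPERM.

```python
import os
import select
import tempfile


def epoll_pipe_demo():
    """Show that epoll reports readiness; the read still happens afterwards."""
    r, w = os.pipe()
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN)
        os.write(w, b"hello")
        events = ep.poll(1.0)    # readiness notification: nothing read yet
        data = os.read(r, 4096)  # the actual I/O only happens here
        return events, data
    finally:
        ep.close()
        os.close(r)
        os.close(w)


def epoll_accepts_regular_file():
    """Return True if a regular file can be registered with epoll."""
    with tempfile.TemporaryFile() as f:
        ep = select.epoll()
        try:
            ep.register(f.fileno(), select.EPOLLIN)
            return True
        except PermissionError:
            # EPERM: this descriptor does not support epoll.
            return False
        finally:
            ep.close()


if __name__ == "__main__":
    print(epoll_pipe_demo())
    print(epoll_accepts_regular_file())
```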

Having programmed a large number of embedded, desktop, and server operating systems, I can say that this model of asynchronous I/O is very natural for certain kinds of programs. It is also very high-throughput and low-overhead. I think this is one of the remaining real shortcomings of the Linux I/O model, at the API level.

Question from: https://stackoverflow.com/questions/13407542/is-there-really-no-asynchronous-block-i-o-on-linux


1 Reply

深蓝 · Answer 1 · 2021-10-06T05:57:34+0000

(2020) If you're using a 5.1 or above kernel you can use the io_uring interface for file-like I/O and get excellent asynchronous operation.

Compared to the existing libaio/KAIO interface, io_uring has the following advantages:

Retains asynchronous behaviour when doing buffered I/O (and not just when doing direct I/O)
Easier to use (especially when using the liburing helper library)
Can optionally work in a polled manner (but you'll need higher privileges to enable this mode)
Less bookkeeping space overhead per I/O
Lower CPU overhead due to fewer userspace/kernel syscall context switches (a big deal these days due to the impact of Spectre/Meltdown mitigations)
File descriptors and buffers can be pre-registered to save mapping/unmapping time
Faster (can achieve higher aggregate throughput, I/Os have a lower latency)
"Linked mode" that can be used to express dependencies between groups of I/Os (>=5.3 kernel)
Rapidly improving support for socket based I/O (recvmsg()/sendmsg() are supported from >=5.3, see messages mentioning the word support in io_uring.c's git history)
Supports attempted cancellation of queued I/O (>=5.5)
Growing support for performing asynchronous operations beyond read/write (e.g. fsync (>=5.1), fallocate (>=5.6), splice (>=5.7) and more)
Doesn't become blocking each time the stars aren't perfectly aligned
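The kernel versions quoted in the list above can be collected into a small lookup table. This is only a sketch: the feature labels and function names are made up, and a version string is a rough guide at best, since a given kernel build may differ from mainline. Probing at runtime is more reliable than comparing version numbers.

```python
import platform
import re

# Minimum mainline kernel versions, as quoted in the list above.
IO_URING_FEATURES = {
    "basic read/write, fsync": (5, 1),
    "linked mode": (5, 3),
    "sendmsg/recvmsg": (5, 3),
    "cancellation": (5, 5),
    "fallocate": (5, 6),
    "splice": (5, 7),
}


def parse_kernel_release(release):
    """Turn a release string such as '5.4.0-42-generic' into (5, 4)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def features_available(release):
    """List the features whose quoted minimum version is satisfied."""
    version = parse_kernel_release(release)
    return sorted(
        name
        for name, minimum in IO_URING_FEATURES.items()
        if version >= minimum
    )


if __name__ == "__main__":
    release = platform.release()
    try:
        print(release, features_available(release))
    except ValueError:
        print("could not parse kernel release:", release)
```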

Compared to glibc's POSIX AIO, io_uring has the following advantages:

Much faster and more efficient (the lower overhead benefits from above apply even more here)
Interface is kernel backed and DOESN'T use a userspace thread pool
Fewer copies of the data are made when doing buffered I/O
Glibc's POSIX AIO can't have more than one I/O in flight on a single file descriptor whereas io_uring most certainly can!

The "Efficient IO with io_uring" document is periodically updated and goes into far more detail as to io_uring's benefits and usage. The "What's new with io_uring" document describes new features added to io_uring since its inception, while the "The rapid growth of io_uring" LWN article describes which features were available in each of the 5.1 - 5.5 kernels, with a forward glance to what was going to be in 5.6 (also see LWN's list of io_uring articles). There's also a "Faster IO through io_uring" videoed presentation (slides) from late 2019 by io_uring author Jens Axboe. Finally, the "Lord of the io_uring" guide gives an introductory tutorial on io_uring usage.

The io_uring community can be reached via the io_uring mailing list and the io_uring mailing list archives show daily traffic at the start of 2021.

Re "support partial I/O in the sense of recv() vs read()": a patch went into the 5.3 kernel that will automatically retry io_uring short reads, and a further commit went into the 5.4 kernel that tweaks the behaviour to only automatically take care of short reads when working with "regular" files on requests that haven't set the REQ_F_NOWAIT flag (it looks like you can request REQ_F_NOWAIT via IOCB_NOWAIT or by opening the file with O_NONBLOCK). Thus you can get recv()-style "short" I/O behaviour from io_uring too.
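To illustrate what "short" I/O means, here is a sketch that uses a plain pipe rather than io_uring itself. A single `os.read()` returns whatever is available (recv()-style), while the hypothetical `read_full` helper does by hand the kind of retrying the kernel patch above does automatically for regular files.

```python
import os


def read_full(fd, length):
    """Keep calling os.read() until `length` bytes arrive or EOF is hit."""
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = os.read(fd, remaining)
        if not chunk:  # EOF: the writer closed its end
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def short_read_demo():
    r, w = os.pipe()
    try:
        os.write(w, b"abc")
        first = os.read(r, 8)  # short read: fewer than 8 bytes are available
        os.write(w, b"defgh")
        os.write(w, b"ijk")
        os.close(w)            # signal EOF to the reader
        w = -1
        rest = read_full(r, 100)
        return first, rest
    finally:
        os.close(r)
        if w != -1:
            os.close(w)


if __name__ == "__main__":
    print(short_read_demo())
```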

Software/projects using `io_uring`

Though the interface is still new (its first incarnation arrived in May 2019), some open-source software is using io_uring "in the wild":

fio (which is also authored by Jens Axboe) has an io_uring ioengine backend (in fact it was introduced back in fio-3.13 from February 2019!). The "Improved Storage Performance Using the New Linux Kernel I/O Interface" SNIA presentation (slides) by two Intel engineers states they were able to get double the IOPS on one workload and less than half the average latency at a queue depth of 1 on another workload when comparing the io_uring engine to the libaio engine on an Optane device.
The SPDK project added support for using io_uring (!) for block device access in its v19.04 release (but obviously this isn't the backend you'd typically use SPDK for other than benchmarking). More recently, they also seem to have added support for using it with sockets in v20.04...
Ceph committed an io_uring backend in Dec 2019 which was part of its 15.1.0 release. The commit author posted a GitHub comment showing the io_uring backend has some wins and losses versus the libaio backend (in terms of IOPS, bandwidth and latency) depending on workload.
RocksDB committed an io_uring backend for MultiRead in Dec 2019 which was part of its 6.7.3 release. Jens states io_uring helped to dramatically cut latency.
libev released 4.31 with an initial io_uring backend in Dec 2019 but the libev author is waiting for 5.6+ kernels before improving it further (the libev author's notes make it sound like all of the kernel issues/concerns will have been addressed by 5.7)
QEMU committed an io_uring backend in Jan 2020 which was part of the QEMU 5.0 release. In the "io_uring in QEMU: high-performance disk IO for Linux" PDF presentation Julia Suvorova shows the io_uring backend outperforming the threads and aio backends on one workload of random 16K blocks.
Samba merged an io_uring VFS backend in Feb 2020 (and it was part of the Samba 4.12 release). In the "Linux io_uring VFS backend." Samba mailing list thread, Stefan Metzmacher (the commit author) says the io_uring module was able to push roughly 19% more throughput (compared to some unspecified backend) in a synthetic test. You can also read the "Async VFS Future" PDF presentation by Stefan for some of the motivation behind the changes.
Facebook's experimental C++ libunifex uses it (but you will also need a 5.6+ kernel)
The Rust folk have been writing wrappers to make io_uring more accessible to pure Rust. rio is one library talked about a bit and the author says they achieved higher throughput compared to using sync calls wrapped in threads. The author gave a presentation about his database and library at FOSDEM 2020 which included a section extolling the virtues of io_uring.
The Rust library glommio exclusively uses io_uring. The author (Glauber Costa) published
