
c - Writing programs to cope with I/O errors causing lost writes on Linux

TL;DR: If the Linux kernel loses a buffered I/O write, is there any way for the application to find out?

I know you have to fsync() the file (and its parent directory) for durability. The question is if the kernel loses dirty buffers that are pending write due to an I/O error, how can the application detect this and recover or abort?

Think database applications, etc., where write ordering and write durability can be crucial.
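(For reference, here is a minimal sketch of the durability sequence I mean above, on already-open descriptors; the error handling shown is only illustrative, and what to do on failure is the whole question.)

    /* Sketch: fsync() the file, then its parent directory, so both the data
     * and the directory entry are durable. Error handling is illustrative. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int make_durable(const char *dirpath, int fd)
    {
        if (fsync(fd) != 0) {                  /* flush the file's data */
            perror("fsync(file)");
            return -1;
        }
        int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
        if (dfd < 0 || fsync(dfd) != 0) {      /* flush the directory entry */
            perror("fsync(dir)");
            if (dfd >= 0) close(dfd);
            return -1;
        }
        close(dfd);
        return 0;
    }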

Lost writes? How?

The Linux kernel's block layer can, under some circumstances, lose buffered I/O requests that have been submitted successfully by write(), pwrite(), etc., reporting only a kernel-log error like:

Buffer I/O error on device dm-0, logical block 12345
lost page write due to I/O error on dm-0

(See end_buffer_write_sync(...) and end_buffer_async_write(...) in fs/buffer.c).

On newer kernels the error will instead contain "lost async page write", like:

Buffer I/O error on dev dm-0, logical block 12345, lost async page write

Since the application's write() will have already returned without error, there seems to be no way to report an error back to the application.

Detecting them?

I'm not that familiar with the kernel sources, but I think that for an async write it sets AS_EIO on the mapping of the file whose buffer failed to be written out:

    set_bit(AS_EIO, &page->mapping->flags);
    set_buffer_write_io_error(bh);
    clear_buffer_uptodate(bh);
    SetPageError(page);

but it's unclear to me if or how the application can find out about this when it later fsync()s the file to confirm it's on disk.

It looks like wait_on_page_writeback_range(...) in mm/filemap.c might be called by do_sync_mapping_range(...) in fs/sync.c, which is in turn called by sys_sync_file_range(...). It returns -EIO if one or more buffers could not be written.

If, as I'm guessing, this propagates to fsync()'s result, then if the app panics and bails out when it gets an I/O error from fsync(), and knows how to re-do its work when restarted, that should be a sufficient safeguard?

There's presumably no way for the app to know which byte offsets in a file correspond to the lost pages so it can rewrite them if it knows how, but if the app repeats all its pending work since the last successful fsync() of the file, and that rewrites any dirty kernel buffers corresponding to lost writes against the file, that should clear any I/O error flags on the lost pages and allow the next fsync() to complete - right?
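(To make that concrete, a rough sketch of the bail-out-and-redo policy I have in mind is below; apply_pending_work() is a hypothetical placeholder for replaying the application's own log of work since the last successful fsync().)

    /* Sketch of the "abort on fsync() error, redo on restart" policy.
     * apply_pending_work() is a hypothetical placeholder that rewrites
     * everything done since the last fsync() that succeeded, which also
     * re-dirties any pages whose earlier write-out was lost. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* hypothetical: replay the application's own log of pending work */
    static void apply_pending_work(int fd) { (void)fd; }

    static void checkpoint(int fd)
    {
        apply_pending_work(fd);
        if (fsync(fd) != 0) {
            perror("fsync");   /* e.g. EIO: some write-back was lost */
            abort();           /* never retry fsync() and carry on */
        }
        /* only now is it safe to discard the pending-work log */
    }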

Are there then any other, harmless, circumstances where fsync() may return -EIO where bailing out and redoing work would be too drastic?

Why?

Of course such errors should not happen. In this case the error arose from an unfortunate interaction between the dm-multipath driver's defaults and the sense code used by the SAN to report failure to allocate thin-provisioned storage. But this isn't the only circumstance where they can happen - I've also seen reports of it with thin-provisioned LVM, for example, as used by libvirt, Docker, and more. A critical application like a database should try to cope with such errors, rather than blindly carrying on as if all is well.

If the kernel thinks it's OK to lose writes without dying with a kernel panic, applications have to find a way to cope.

The practical impact is that I found a case where a multipath problem with a SAN caused lost writes that ended up causing database corruption because the DBMS didn't know its writes had failed. Not fun.


1 Reply

fsync() returns -EIO if the kernel lost a write

(Note: early part references older kernels; updated below to reflect modern kernels)

It looks like failures in async buffer write-out, in end_buffer_async_write(...), set an -EIO flag on the failed dirty buffer's page and on the file's mapping:

set_bit(AS_EIO, &page->mapping->flags);
set_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
SetPageError(page);

which is then detected by wait_on_page_writeback_range(...) as called by do_sync_mapping_range(...) as called by sys_sync_file_range(...) as called by sys_sync_file_range2(...) to implement the C library call fsync().

But only once!

This comment on sys_sync_file_range:

 * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any
 * I/O errors or ENOSPC conditions and will return those to the caller, after
 * clearing the EIO and ENOSPC flags in the address_space.

suggests that when fsync() returns -EIO or (undocumented in the manpage) -ENOSPC, it will clear the error state so a subsequent fsync() will report success even though the pages never got written.

Sure enough wait_on_page_writeback_range(...) clears the error bits when it tests them:

        /* Check for outstanding write errors */
        if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
                ret = -ENOSPC;
        if (test_and_clear_bit(AS_EIO, &mapping->flags))
                ret = -EIO;

So if the application expects it can re-try fsync() until it succeeds and trust that the data is on-disk, it is terribly wrong.

I'm pretty sure this is the source of the data corruption I found in the DBMS. It retries fsync() and thinks all will be well when it succeeds.
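(As a sketch, this is the broken pattern: on affected kernels the first fsync() after a lost write-back returns EIO, the test-and-clear above wipes AS_EIO, and the retry then reports success even though nothing was written.)

    /* BROKEN: retrying fsync() until it "succeeds". After a lost write-back
     * the first call returns -1 with errno == EIO, the kernel clears AS_EIO,
     * and the second call returns 0 although the data never reached disk. */
    #include <stdio.h>
    #include <unistd.h>

    static int fsync_until_success_broken(int fd)
    {
        while (fsync(fd) != 0)
            perror("fsync failed, retrying");  /* 1st attempt: EIO */
        return 0;                              /* 2nd attempt: 0, data lost */
    }

About the only failure worth retrying here is EINTR; an EIO or ENOSPC from fsync() has to be treated as possible data loss, not something to loop on.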

Is this allowed?

The POSIX/SuS docs on fsync() don't really specify this either way:

If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed.

Linux's man-page for fsync() just doesn't say anything about what happens on failure.

So it seems that the meaning of fsync() errors is "I don't know what happened to your writes, might've worked or not, better try again to be sure".

Newer kernels

On 4.9, end_buffer_async_write still marks the mapping with -EIO, just via mapping_set_error now:

    buffer_io_error(bh, ", lost async page write");
    mapping_set_error(page->mapping, -EIO);
    set_buffer_write_io_error(bh);
    clear_buffer_uptodate(bh);
    SetPageError(page);

On the sync side I think it's similar, though the structure is now pretty complex to follow. filemap_check_errors in mm/filemap.c now does:

    if (test_bit(AS_EIO, &mapping->flags) &&
        test_and_clear_bit(AS_EIO, &mapping->flags))
            ret = -EIO;

which has much the same effect. The error checks all seem to funnel through filemap_check_errors and its test-and-clear, so the error is again reported only once.

I'm using btrfs on my laptop, but when I create an ext4 loopback for testing on /mnt/tmp and set up perf probes on it:

sudo dd if=/dev/zero of=/tmp/ext bs=1M count=100
sudo mke2fs -j -T ext4 /tmp/ext
sudo mount -o loop /tmp/ext /mnt/tmp

sudo perf probe filemap_check_errors
sudo perf probe end_buffer_async_write

sudo perf record -g -e probe:end_buffer_async_write -e probe:filemap_check_errors dd if=/dev/zero of=/mnt/tmp/test bs=4k count=1 conv=fsync

I find the following call stack in perf report -T:

        ---__GI___libc_fsync
           entry_SYSCALL_64_fastpath
           sys_fsync
           do_fsync
           vfs_fsync_range
           ext4_sync_file
           filemap_write_and_wait_range
           filemap_check_errors

A read-through suggests that yeah, modern kernels behave the same.

This seems to mean that if fsync() (or presumably write() or close()) returns -EIO, the file is in some undefined state between when you last successfully fsync()d or close()d it and its most recently write()ten state.

Test

I've implemented a test case to demonstrate this behaviour.
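(The gist of it, assuming a mount whose backing block device can be flipped into a failing state from outside the program, e.g. with a device-mapper "error" target, looks roughly like this; /mnt/faulty/test is a made-up path.)

    /* Rough sketch of the test idea. It assumes the file lives on a
     * filesystem whose backing device is switched to a failing state
     * (e.g. a dm "error" target) after the write() but before the first
     * fsync(). Expected result on affected kernels: the first fsync()
     * fails with EIO, the second one reports success anyway. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define TEST_FILE "/mnt/faulty/test"   /* made-up path */

    int main(void)
    {
        char buf[4096];
        memset(buf, 'x', sizeof buf);

        int fd = open(TEST_FILE, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || write(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
            return 1;

        /* ...operator breaks the backing device here, then presses Enter... */
        getchar();

        printf("first  fsync: %d\n", fsync(fd));   /* expected: -1 (EIO) */
        printf("second fsync: %d\n", fsync(fd));   /* expected: 0 */
        return 0;
    }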

Implications

A DBMS can cope with this by entering crash recovery. How on earth is a normal user application supposed to cope with this? The fsync() man page gives no warning that it means "fsync-if-you-feel-like-it" and I expect a lot of apps won't cope well with this behaviour.
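(For whole-file saves, one coping strategy is the usual write-temp-then-rename dance with every step checked; here is a sketch, with all names illustrative, that treats any failure as "the save did not happen" and keeps the old copy rather than retrying fsync().)

    /* Sketch of a coping strategy for ordinary applications that can afford
     * to rewrite the whole file: write a temporary copy, fsync it, rename it
     * over the original, fsync the directory, and on any failure report an
     * error and keep the old file. Names are illustrative. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int save_file(const char *dirpath, const char *tmppath,
                         const char *finalpath, const void *buf, size_t len)
    {
        int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmppath);     /* discard the possibly-lost copy */
            return -1;           /* the old file is still the good version */
        }
        close(fd);

        if (rename(tmppath, finalpath) != 0)
            return -1;

        int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
        if (dfd < 0)
            return -1;
        if (fsync(dfd) != 0) {   /* make the rename durable */
            close(dfd);
            return -1;
        }
        close(dfd);
        return 0;
    }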

Further reading

lwn.net touched on this in the article "Improved block-layer error handling".

postgresql.org mailing list thread.

