Need for simplification ----------------------- - Code has become complex/tricky and difficult to understand and maintain - Stability has been a concern, bugs keep cropping up, and there still are some potential loopholes in current implementation, esp AIO-DIO e.g. - AIO-DIO has some potential races in handling of IO errors, e.g. EIO. - For AIO-DIO writes, invalidate_inode_pages may be called before write completes Considerations for simplification --------------------------------- - Reduce the number of special cases and conditions to check for - Bring locking model closer to that followed for regular read/writes - Unify logic as far as possible for conceptual simplicity - Optimize for the most common situation, i.e. non-extending DIO to already allocated blocks, with no concurrent buffered IO on the same file, and no concurrent DIO to the same part of the file. I. DIO read ----------- Protection against buffered IO hole overwrite, where uninstantiated blocks may be exposed until writeback completes. Buffered reads go through page cache and hence never see stale data. Direct reads however go directly to disk. (a) Current Solution: - i_alloc_sem held for read => protects against truncate - i_mutex held => no new blocks get created until DIO getblocks completes - filemap_write_and_wait => ensures any existing uninstantiated blocks complete writeback to disk - AIO error handling paths differ for errors during submission vs during IO - in one case aio_complete issued by higher layer, in another case within the io completion path in finished_one_bio (b) Alternative Solution (Follow similar locking rules as buffered read) - i_alloc_sem held for read => protects against truncate - Lookup and lock pages in the range by tagging the radix tree slots as "TAG_LOCKED", and locking pages that were present. Modify find_get_page et al to return a dummy locked cache page if TAG_LOCKED. (Alternatively put the check in add_to_page_cache, instead) => no new writes or blocks instantiated until DIO getblocks completes - Issue equiv of filemap_write_and_wait_range that acts on already locked pages (also likely to be useful for mpage_writepages) => ensures existing uninstantiated blocks written to disk - On IO completion release lock on pages and the range, including unlocking and releasing dummy cache page. => This will wake up readers and writers blocked on locked pages. Since on wakeup most code paths check mapping once again and also hold an extra ref count, this should be safe. (hopefully !) - With retry based AIO-DIO => aio_complete can happen in the same way during submission and during a retry following completion, thus error handling takes the same path in both cases. II. DIO write Protection against buffered IO reads or DIO reads being exposed to uninstantiated blocks during a DIO hole overwrite/extend. Holding i_sem in DIO writes doesn't help protect buffered IO reads, since i_sem is not acquired in read paths. (a) Current Solution - i_alloc_sem held for read => protects against truncate - Hold i_mutex at least till completion of DIO getblocks and submission - Force file extending DIO writes to be i_mutex holding until i_size increase (i.e till DIO completes) to avoid exposing new blocks - Fallback to buffered IO for hole overwrites, so that the default page based locking protects buffered IO reads from seeing uninstantiated blocks - filemap_write_and_wait => ensures any pending writes complete writeback to disk - Perform DIO write - invalidate pages => subsequent reads should see newly written data - Force extending AIO-DIO writes to be synchronous - Force AIO-DIO overwrites that cross holes to be synchronous for issued DIO before fallback to buffered IO. - AIO Error handling takes different paths for errors during submission and during IO completion. Fallback to buffered adds another level of complexity, as does forced synchronous behaviour for extends. (b) Alternative Solution (Follow similar locking rules as buffered IO) - i_alloc_sem held for read => protects against truncate - check if this is an extending write, and if not release i_mutex - Lookup and lock pages in the range by tagging the radix tree slots as "TAG_LOCKED", and locking pages that were present. => reads are blocked until DIO write completes, i.e uninstantiated blocks not exposed - Issue equiv of filemap_write_and_wait_range that acts on already locked pages and then invalidate the pages. => subsequent reads should see newly written data - Perform DIO write (including block allocations etc) - On IO completion release lock on pages and unlock the range, including unlocking and releasing dummy cache page, release i_alloc_sem and i_mutex. [i_mutex and i_alloc_sem handling can be moved entirely into higher levels, which avoids DIO_LOCKING checks inside the DIO code; since i_mutex is held only for extending writes it should not be a concurrency issue; also in the AIO case, it is ok to update i_size and release i_mutex (perhaps even i_alloc_sem) w/o having to wait for IO completion since pages are still locked ] - With retry based AIO-DIO => aio_complete happens in the same way during submission and during a retry following completion. Unlocking the range and release of pages can likewise happen in retry context (not in interrupt context).