Linux Kernel Filesystem Write I/O Flow Code Analysis (Part 1)
The earlier post Linux VFS机制简析(二) introduced the operations in struct address_space_operations that an underlying filesystem needs to implement. During actual coding, however, it was not at all obvious what these functions do or when they are called, especially the write-I/O-related operations, including the write_begin, write_end, writepage, writepages, direct_IO, and set_page_dirty function pointers.
To understand these function pointers, one has to look at where they are invoked across the entire write path. This post therefore analyzes and organizes the Linux filesystem write I/O code flow, as an aid to implementing the read/write interfaces of an underlying filesystem.
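As a point of reference before diving in, here is a minimal sketch of how a hypothetical filesystem might populate these pointers (all myfs_* names are placeholders, not real kernel symbols; __set_page_dirty_nobuffers is a common default for set_page_dirty):

static const struct address_space_operations myfs_aops = {
    .write_begin    = myfs_write_begin,   /* prepare page-cache pages for a write */
    .write_end      = myfs_write_end,     /* commit copied data, update i_size */
    .writepage      = myfs_writepage,     /* write back one dirty page */
    .writepages     = myfs_writepages,    /* write back a range of dirty pages */
    .direct_IO      = myfs_direct_IO,     /* transfer data bypassing the page cache */
    .set_page_dirty = __set_page_dirty_nobuffers, /* mark a page dirty */
};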
Overview
First, a diagram to anchor the discussion. Note that this flow chart does not cover the bdi_writeback writeback mechanism (which will be shown in the next post):
[Figure: VFS write I/O flow]
sys_write()
The write() function provided by glibc is implemented by the kernel's write system call; the corresponding system call function sys_write() is declared as follows:
asmlinkage long sys_write(unsigned int fd, const char __user *buf,
                          size_t count);
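From user space this is simply the familiar write(2) path; a trivial caller (the file name is arbitrary) looks like:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "hello\n";
    int fd = open("/tmp/demo.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return 1;
    /* glibc's write() issues the write syscall, which lands in sys_write() */
    ssize_t n = write(fd, buf, sizeof(buf) - 1);

    close(fd);
    return n < 0;
}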
sys_write() is implemented in fs/read_write.c:
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
                size_t, count)
{
    struct fd f = fdget_pos(fd);
    ssize_t ret = -EBADF;

    if (f.file) {
        loff_t pos = file_pos_read(f.file);
        ret = vfs_write(f.file, buf, count, &pos);
        file_pos_write(f.file, pos);
        fdput_pos(f);
    }

    return ret;
}
The function takes a reference on the struct fd and the pos lock, reads the current file position, and does the actual writing mainly by calling vfs_write().
vfs_write()
The vfs_write() function is defined as follows:
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
    ssize_t ret;

    if (!(file->f_mode & FMODE_WRITE))
        return -EBADF;
    if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write))
        return -EINVAL;
    if (unlikely(!access_ok(VERIFY_READ, buf, count)))
        return -EFAULT;

    ret = rw_verify_area(WRITE, file, pos, count);
    if (ret >= 0) {
        count = ret;
        file_start_write(file);
        if (file->f_op->write)
            ret = file->f_op->write(file, buf, count, pos);
        else
            ret = do_sync_write(file, buf, count, pos);
        if (ret > 0) {
            fsnotify_modify(file);
            add_wchar(current, ret);
        }
        inc_syscw(current);
        file_end_write(file);
    }

    return ret;
}
The function first calls rw_verify_area() to check whether the region described by pos and count may be written (e.g. whether the write lock can be obtained). Then, if the underlying filesystem has set the write() pointer in struct file_operations, file->f_op->write() is called; otherwise the generic VFS write routine do_sync_write() is called directly.
do_sync_write()
The VFS routine do_sync_write() is called by default when the underlying filesystem supplies no f_op->write() pointer; many filesystems also point f_op->write directly at it. It is defined as follows:
ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos)
{
    struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = len };
    struct kiocb kiocb;
    ssize_t ret;

    init_sync_kiocb(&kiocb, filp);
    kiocb.ki_pos = *ppos;
    kiocb.ki_left = len;
    kiocb.ki_nbytes = len;

    ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
    if (-EIOCBQUEUED == ret)
        ret = wait_on_sync_kiocb(&kiocb);
    *ppos = kiocb.ki_pos;
    return ret;
}
As the code above shows, the function mainly builds a struct kiocb, submits it to f_op->aio_write(), and waits for that kiocb to complete. The underlying filesystem therefore must implement the f_op->aio_write() function pointer.
Most underlying filesystems implement their own f_op->aio_write(), but some (e.g. ext4 and nfs) point it directly at the generic write routine generic_file_aio_write(). We will use that function's code to analyze the overall write flow.
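A filesystem that leans entirely on the generic paths can therefore wire things up roughly as in this sketch (the myfs_* names are hypothetical; the generic_* and do_sync_* helpers are real symbols in the 3.x kernels discussed here):

const struct file_operations myfs_file_operations = {
    .llseek    = generic_file_llseek,
    .read      = do_sync_read,
    .write     = do_sync_write,           /* synchronous wrapper over ->aio_write */
    .aio_read  = generic_file_aio_read,
    .aio_write = generic_file_aio_write,  /* generic page-cache write path */
    .mmap      = generic_file_mmap,
    .fsync     = myfs_fsync,              /* reached via generic_write_sync() */
};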
generic_file_aio_write()
The VFS (actually the mm subsystem) provides a generic aio_write() implementation, defined as follows:
ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
        unsigned long nr_segs, loff_t pos)
{
    struct file *file = iocb->ki_filp;
    struct inode *inode = file->f_mapping->host;
    ssize_t ret;

    BUG_ON(iocb->ki_pos != pos);

    mutex_lock(&inode->i_mutex);
    ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
    mutex_unlock(&inode->i_mutex);

    if (ret > 0) {
        ssize_t err;

        err = generic_write_sync(file, pos, ret);
        if (err < 0 && ret > 0)
            ret = err;
    }
    return ret;
}
After taking the inode lock, the function calls __generic_file_aio_write() to write the data. If ret > 0, i.e. the data was written successfully, and the write needs to be synchronized to disk (e.g. O_SYNC is set), it calls generic_write_sync(), which in turn invokes the f_op->fsync() function pointer to flush the data to disk.
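For reference, in the 3.x kernels this analysis is based on, generic_write_sync() looks roughly like the following: it is a no-op unless the file was opened with O_SYNC/O_DSYNC or the inode is marked sync, and otherwise it flushes the just-written range:

int generic_write_sync(struct file *file, loff_t pos, loff_t count)
{
    /* nothing to do for plain buffered writes */
    if (!(file->f_flags & O_DSYNC) && !IS_SYNC(file->f_mapping->host))
        return 0;
    /* vfs_fsync_range() ends up calling file->f_op->fsync() */
    return vfs_fsync_range(file, pos, pos + count - 1,
                           (file->f_flags & __O_SYNC) ? 0 : 1);
}

(On Linux, O_SYNC is defined as __O_SYNC | O_DSYNC, so the O_DSYNC test above also covers O_SYNC.)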
__generic_file_aio_write() is fairly long, so only the key fragment is shown here:
ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
        unsigned long nr_segs, loff_t *ppos)
{
    ...
    if (io_is_direct(file)) {
        loff_t endbyte;
        ssize_t written_buffered;

        written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
                                            ppos, count, ocount);
        ...
    } else {
        written = generic_file_buffered_write(iocb, iov, nr_segs,
                                              pos, ppos, count, written);
    }
    ...
As the code above shows, for Direct I/O the function calls generic_file_direct_write(), which writes to disk directly without going through the page cache; otherwise it calls generic_file_buffered_write() to write into the page cache.
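Note that Direct I/O also puts alignment requirements on the caller: the user buffer, file offset, and transfer length typically must be aligned to the device's logical block size. A minimal user-space sketch of a direct write (path and sizes are arbitrary):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    int fd = open("/tmp/direct.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);

    if (fd < 0)
        return 1;
    /* O_DIRECT usually requires an aligned buffer, offset and length */
    if (posix_memalign(&buf, 4096, 4096))
        return 1;
    memset(buf, 'x', 4096);

    /* on filesystems using the generic path, this write reaches
       generic_file_direct_write() via f_op->aio_write() */
    if (write(fd, buf, 4096) < 0)
        return 1;

    free(buf);
    close(fd);
    return 0;
}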
Direct I/O Implementation
generic_file_direct_write()
The main code of generic_file_direct_write() is shown below:
ssize_t
generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
        unsigned long *nr_segs, loff_t pos, loff_t *ppos,
        size_t count, size_t ocount)
{
    ...
    if (count != ocount)
        *nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count);

    write_len = iov_length(iov, *nr_segs);
    end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT;

    written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
    if (written)
        goto out;

    if (mapping->nrpages) {
        written = invalidate_inode_pages2_range(mapping,
                        pos >> PAGE_CACHE_SHIFT, end);
        if (written) {
            if (written == -EBUSY)
                return 0;
            goto out;
        }
    }

    written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);

    if (mapping->nrpages) {
        invalidate_inode_pages2_range(mapping,
                        pos >> PAGE_CACHE_SHIFT, end);
    }

    if (written > 0) {
        pos += written;
        if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
            i_size_write(inode, pos);