--- title: "How we spent two weeks hunting an NFS bug in the Linux kernel" categories: engineering author: Stan Hu author_gitlab: stanhu author_twitter: stanhu categories: engineering image_title: '/images/blogimages/nfs-debug/nfs-bug-hunt-detective.jpg' description: "Here's an in-depth recap of debugging a GitLab issue that culminated in a patch for the Linux kernel." twitter_text: "How @gitlab spent two weeks hunting an NFS kernel bug" tags: community, git, inside GitLab, open source featured: yes postType: content marketing ee_cta: false --- UPDATE 2019-08-06: This bug has now been resolved in the following distributions: * [Red Hat Enterprise Linux 7](https://access.redhat.com/errata/RHSA-2019:2029) * [Ubuntu](https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1802585) * Linux mainline: Backported to [4.14-stable](https://lkml.org/lkml/2019/8/2/562) and [4.19-stable](https://lkml.org/lkml/2019/8/2/639) On Sep. 14, the GitLab support team escalated a critical problem encountered by one of our customers: GitLab would run fine for a while, but after some time users encountered errors. When attempting to clone certain repositories via Git, users would see an opaque `Stale file error` message. The error message persisted for a long time, blocking employees from being able to work, unless a system administrator intervened manually by running `ls` in the directory itself. Thus launched an investigation into the inner workings of Git and the Network File System (NFS). The investigation uncovered a bug with the Linux v4.0 NFS client and culiminated with a [kernel patch that was written by Trond Myklebust](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=be189f7e7f03de35887e5a85ddcf39b91b5d7fc1) and [merged in the latest mainline Linux kernel](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=c7a2c49ea6c9eebbe44ff2c08b663b2905ee2c13) on Oct. 26. This post describes the journey of investigating the issue and details the thought process and tools by which we tracked down the bug. It was inspired by the fine detective work in [How I spent two weeks hunting a memory leak in Ruby](http://www.be9.io/2015/09/21/memory-leak/) by Oleg Dashevskii. More importantly, this experience exemplifies how open source software debugging has become a team sport that involves expertise across multiple people, companies, and locations. The GitLab motto "[everyone can contribute](/company/strategy/#mission)" applies not only to GitLab itself, but also to other open source projects, such as the Linux kernel. ## Reproducing the bug While we have run NFS on GitLab.com for many years, we have stopped using it to access repository data across our application machines. Instead, we have [abstracted all Git calls to Gitaly](/blog/2018/09/12/the-road-to-gitaly-1-0/). Still, NFS remains a supported configuration for our customers who manage their own installation of GitLab, but we had never seen the exact problem described by the customer before. [Our customer gave us a few important clues](https://gitlab.com/gitlab-org/gitlab-ce/issues/51437): 1. The full error message read, `fatal: Couldn't read ./packed-refs: Stale file handle`. 2. The error seemed to start when they started a manual Git garbage collection run via `git gc`. 3. The error would go away if a system administrator ran `ls` in the directory. 4. The error also would go away after `git gc` process ended. The first two items seemed obviously related. 
When you push to a branch in Git, Git creates a loose reference, a fancy name for a file that points your branch name to the commit. For example, a push to `master` will create a file called `refs/heads/master` in the repository:

```bash
$ cat refs/heads/master
2e33a554576d06d9e71bfd6814ee9ba3a7838963
```

`git gc` has several jobs, but one of them is to collect these loose references (refs) and bundle them up into a single file called `packed-refs`. This makes things a bit faster by eliminating the need to read lots of little files in favor of reading one large one. For example, after running `git gc`, an example `packed-refs` might look like:

```
# pack-refs with: peeled fully-peeled sorted
564c3424d6f9175cf5f2d522e10d20d781511bf1 refs/heads/10-8-stable
edb037cbc85225261e8ede5455be4aad771ba3bb refs/heads/11-0-stable
94b9323033693af247128c8648023fe5b53e80f9 refs/heads/11-1-stable
2e33a554576d06d9e71bfd6814ee9ba3a7838963 refs/heads/master
```

How exactly is this `packed-refs` file created? To answer that, we ran `strace git gc` with a loose ref present. Here are the pertinent lines from that run:

```
28705 open("/tmp/libgit2/.git/packed-refs.lock", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 3
28705 open(".git/packed-refs", O_RDONLY) = 3
28705 open("/tmp/libgit2/.git/packed-refs.new", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 4
28705 rename("/tmp/libgit2/.git/packed-refs.new", "/tmp/libgit2/.git/packed-refs") = 0
28705 unlink("/tmp/libgit2/.git/packed-refs.lock") = 0
```

The system calls showed that `git gc` did the following:

1. Open `packed-refs.lock`. This tells other processes that `packed-refs` is locked and cannot be changed.
1. Open `packed-refs.new`.
1. Write loose refs to `packed-refs.new`.
1. Rename `packed-refs.new` to `packed-refs`.
1. Remove `packed-refs.lock`.
1. Remove loose refs.

The fourth step is the key here: the rename where Git puts the new `packed-refs` into action.

In addition to collecting loose refs, `git gc` also performs a more expensive task of scanning for unused objects and removing them. This task can take over an hour for large repositories. That made us wonder: for a large repository, does `git gc` keep the file open while it's running this sweep? Looking at the `strace` logs and probing the process with `lsof`, we found that it did the following:

![Git Garbage Collection](/images/blogimages/nfs-debug/git-gc-diagram.svg)

Notice that `packed-refs` is closed only at the end, after the potentially long `Garbage collect objects` step takes place.

That raised the next question: how does NFS behave when one node has `packed-refs` open while another renames over that file?

To find out, we asked the customer to run the following experiment on two different machines (Alice and Bob):

1. On the shared NFS volume, create two files, `test1.txt` and `test2.txt`, with different contents to make them easy to distinguish:

    ```bash
    alice $ echo "1 - Old file" > /path/to/nfs/test1.txt
    alice $ echo "2 - New file" > /path/to/nfs/test2.txt
    ```

2. On machine Alice, keep a file open to `test1.txt`:

    ```bash
    alice $ irb
    irb(main):001:0> File.open('/path/to/nfs/test1.txt')
    ```

3. On machine Alice, show the contents of `test1.txt` continuously:

    ```bash
    alice $ while true; do cat test1.txt; done
    ```

4. Then on machine Bob, run:

    ```bash
    bob $ mv -f test2.txt test1.txt
    ```

This last step emulates what `git gc` does with `packed-refs` by overwriting the existing file.
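For convenience, the whole experiment can be condensed into a pair of small shell snippets, with a held file descriptor standing in for the irb session (the paths are illustrative; use whatever your shared NFS mount point is):

```sh
# On machine Alice: create the files, hold test1.txt open, and read it in a loop
echo "1 - Old file" > /path/to/nfs/test1.txt
echo "2 - New file" > /path/to/nfs/test2.txt
exec 3< /path/to/nfs/test1.txt          # keep an open file descriptor, like the irb session
while true; do cat /path/to/nfs/test1.txt; sleep 1; done

# On machine Bob, once Alice's loop is running: rename over the open file
mv -f /path/to/nfs/test2.txt /path/to/nfs/test1.txt
```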
On the customer's machine, the result looked something like:

```
1 - Old file
1 - Old file
1 - Old file
cat: test1.txt: Stale file handle
```

Bingo! We seemed to have reproduced the problem in a controlled way. However, the same experiment against a Linux NFS server did not have this problem. The result was what you would expect: the new contents were picked up after the rename:

```
1 - Old file
1 - Old file
1 - Old file
2 - New file <--- RENAME HAPPENED
2 - New file
2 - New file
```

Why the difference in behavior? It turns out the customer was using an [Isilon NFS appliance](https://www.dellemc.com/en-us/storage/isilon/index.htm) that only supported NFS v4.0. When we switched the mount parameters to v4.0 via the `vers=4.0` option in `/etc/fstab`, the test showed a different result with the Linux NFS server:

```
1 - Old file
1 - Old file
1 - Old file
1 - Old file <--- RENAME HAPPENED
1 - Old file
1 - Old file
```

Instead of a `Stale file handle` error, the Linux NFS v4.0 server showed stale *contents*. It turns out this difference in behavior can be explained by the NFS spec. From [RFC 3010](https://tools.ietf.org/html/rfc3010#page-153):

> A filehandle may or may not become stale or expire on a rename.
> However, server implementors are strongly encouraged to attempt to keep
> file handles from becoming stale or expiring in this fashion.

In other words, NFS servers can choose how to behave when a file is renamed; it's perfectly valid for an NFS server to return a `Stale file handle` error when that happens. We surmised that even though the visible results were different, both behaviors likely stemmed from the same underlying issue. We suspected a cache validation problem, because running `ls` in the directory would "clear" the error. Now that we had a reproducible test case, we asked the experts: the Linux NFS maintainers.

## False path: NFS server delegations

With a clear set of reproduction steps, I [sent an email to the Linux NFS mailing list](https://marc.info/?l=linux-nfs&m=153721785231614&w=2) describing what we had found. Over the following week, I went back and forth with Bruce Fields, the Linux NFS server maintainer, who suggested this was an NFS bug and that it would be useful to look at the network traffic. He thought there might be an issue with NFS server delegations.

### What is an NFS server delegation?

In a nutshell, NFS v4 introduced server delegations as a way to speed up file access. A server can delegate read or write access to a client so that the client doesn't have to keep asking the server whether the file has been changed by another client. In simpler terms, a write delegation is akin to someone lending you a notebook and saying, "Go ahead and write in here, and I'll take it back when I'm ready." Instead of having to ask to borrow the notebook every time you want to write a new paragraph, you have free rein until the owner reclaims the notebook. In NFS terms, this reclamation process is called a delegation recall.

Indeed, a bug in the NFS delegation recall might explain the `Stale file handle` problem. Remember that in the earlier experiment, Alice had an open file handle to `test1.txt` when it was replaced by `test2.txt`. It's possible that the server failed to recall the delegation on `test1.txt`, resulting in an incorrect state. To check whether this was the issue, we turned to `tcpdump` to capture NFS traffic and used Wireshark to visualize it. [Wireshark](https://www.wireshark.org/) is a wonderful open source tool for analyzing network traffic, and it's especially good for viewing NFS in action.
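Before capturing any traffic, it's worth confirming which NFS version each client actually negotiated, since that turned out to be the variable that changed the behavior. On a standard Linux client, a quick check might look like:

```sh
# Show each NFS mount with its effective options, including the negotiated vers=
nfsstat -m

# The same information straight from the kernel's mount table
grep nfs /proc/mounts
```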
We captured a trace using the following command on the NFS server:

```
tcpdump -s 0 -w /tmp/nfs.pcap port 2049
```

This command captures all NFS traffic, which typically runs on TCP port 2049. Because our experiment worked properly with NFS v4.1 but not with NFS v4.0, we could compare and contrast how NFS behaved in a non-working and a working case. With Wireshark, we saw the following behavior:

### NFS v4.0 (stale file case)

![NFS v4.0 flow](/images/blogimages/nfs-debug/nfs-4.0-flow.svg)

In this diagram, we can see in step 1 that Alice opens `test1.txt` and gets back an NFS file handle along with a `stateid` of 0x3000. When Bob attempts to rename the file, the NFS server tells Bob to retry via the `NFS4ERR_DELAY` message while it recalls the delegation from Alice via the `CB_RECALL` message (step 3). Alice then returns her delegation via `DELEGRETURN` (step 4), and Bob sends another `RENAME` message (step 5). The `RENAME` completes in both the v4.0 and v4.1 cases, but Alice continues to read using the same file handle.

### NFS v4.1 (working case)

![NFS v4.1 flow](/images/blogimages/nfs-debug/nfs-4.1-flow.svg)

The main difference happens at the bottom at step 6. Notice that in NFS v4.0 (the stale file case), Alice attempts to reuse the same `stateid`. In NFS v4.1 (the working case), Alice performs an additional `LOOKUP` and `OPEN`, which causes the server to return a different `stateid`. In v4.0, these extra messages are never sent, which explains why Alice continues to see stale content: she keeps using the old file handle.

What makes Alice decide to do the extra `LOOKUP`? The delegation recall seemed to work fine, but perhaps there was still an issue, such as a missing invalidation step. To rule that out, we disabled NFS delegations by issuing this command on the NFS server itself:

```sh
echo 0 > /proc/sys/fs/leases-enable
```

We repeated the experiment, but the problem persisted. All this convinced us that this was neither an NFS server issue nor a problem with NFS delegations, which led us to look at the NFS client within the kernel.

## Digging deeper: the Linux NFS client

The first question we had to answer for the NFS maintainers:

### Was this problem still in the latest upstream kernel?

The issue occurred with both CentOS 7.2 and Ubuntu 16.04 kernels, which used versions 3.10.0-862.11.6 and 4.4.0-130, respectively. However, both of those kernels lagged behind the most recent kernel, which was 4.19-rc2 at the time.

We deployed a new Ubuntu 16.04 virtual machine on Google Cloud Platform (GCP), cloned the latest Linux kernel, and set up a kernel development environment. After generating a `.config` file via `make menuconfig`, we checked two items:

1. The NFS driver was compiled as a module (`CONFIG_NFSD=m`).
2. The [required GCP kernel settings](https://cloud.google.com/compute/docs/images/building-custom-os) were set properly.

Just as a geneticist would use fruit flies to study evolution in real time, the first item allowed us to make quick changes in the NFS client without having to reboot the kernel. The second item was required to ensure that the kernel would actually boot after it was installed. Fortunately, the default configuration had all of these settings right out of the box.

With our custom kernel, we verified that the stale file problem still existed in the latest version. That raised a number of questions:

1. Where exactly was this problem happening?
2. Why was this problem happening with NFS v4.0 but not in v4.1?

To answer these questions, we began to investigate the NFS source code.
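For reference, that initial kernel build and install boils down to roughly the following (a sketch only; the exact package prerequisites, config tweaks, and bootloader steps vary by distro):

```sh
# Build and boot a custom mainline kernel for NFS client debugging (sketch)
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
make menuconfig                     # confirm the NFS client is built as a module
make -j"$(nproc)"                   # build the kernel and its modules
sudo make modules_install install   # install modules and kernel image
sudo reboot                         # boot into the freshly built kernel
```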
Since we didn't have a kernel debugger available, we sprinkled the source code with two main types of calls: 1. `pr_info()` ([what used to be `printk`](https://lwn.net/Articles/487437/)). 2. `dump_stack()`: This would show the stack trace of the current function call. For example, one of the first things we did was hook into the `nfs4_file_open()` function in `fs/nfs/nfs4file.c`: ```c static int nfs4_file_open(struct inode *inode, struct file *filp) { ... pr_info("nfs4_file_open start\n"); dump_stack(); ``` Admittedly, we could have [activated the `dprintk` messages with the Linux dynamic debug](https://www.kernel.org/doc/html/v4.15/admin-guide/dynamic-debug-howto.html) or used [`rpcdebug`](https://www.thegeekdiary.com/how-to-enable-nfs-debug-logging-using-rpcdebug/), but it was nice to be able to add our own messages to verify changes were being made. Every time we made changes, we recompiled the module and reinstalled it into the kernel via the commands: ```sh make modules sudo umount /mnt/nfs-test sudo rmmod nfsv4 sudo rmmod nfs sudo insmod fs/nfs/nfs.ko sudo mount -a ``` With our NFS module installed, repeating the experiments would print messages that would help us understand the NFS code a bit more. For example, you can see exactly what happens when an application calls `open()`: ``` Sep 24 20:20:38 test-kernel kernel: [ 1145.233460] Call Trace: Sep 24 20:20:38 test-kernel kernel: [ 1145.233462] dump_stack+0x8e/0xd5 Sep 24 20:20:38 test-kernel kernel: [ 1145.233480] nfs4_file_open+0x56/0x2a0 [nfsv4] Sep 24 20:20:38 test-kernel kernel: [ 1145.233488] ? nfs42_clone_file_range+0x1c0/0x1c0 [nfsv4] Sep 24 20:20:38 test-kernel kernel: [ 1145.233490] do_dentry_open+0x1f6/0x360 Sep 24 20:20:38 test-kernel kernel: [ 1145.233492] vfs_open+0x2f/0x40 Sep 24 20:20:38 test-kernel kernel: [ 1145.233493] path_openat+0x2e8/0x1690 Sep 24 20:20:38 test-kernel kernel: [ 1145.233496] ? mem_cgroup_try_charge+0x8b/0x190 Sep 24 20:20:38 test-kernel kernel: [ 1145.233497] do_filp_open+0x9b/0x110 Sep 24 20:20:38 test-kernel kernel: [ 1145.233499] ? __check_object_size+0xb8/0x1b0 Sep 24 20:20:38 test-kernel kernel: [ 1145.233501] ? __alloc_fd+0x46/0x170 Sep 24 20:20:38 test-kernel kernel: [ 1145.233503] do_sys_open+0x1ba/0x250 Sep 24 20:20:38 test-kernel kernel: [ 1145.233505] ? do_sys_open+0x1ba/0x250 Sep 24 20:20:38 test-kernel kernel: [ 1145.233507] __x64_sys_openat+0x20/0x30 Sep 24 20:20:38 test-kernel kernel: [ 1145.233508] do_syscall_64+0x65/0x130 ``` What are the `do_dentry_open` and `vfs_open` calls above? Linux has a [virtual filesystem (VFS)](https://www.kernel.org/doc/Documentation/filesystems/vfs.txt), an abstraction layer which provides a common interface for all filesystems. The VFS documentation explains: > The VFS implements the open(2), stat(2), chmod(2), and similar system > calls. The pathname argument that is passed to them is used by the VFS > to search through the directory entry cache (also known as the dentry > cache or dcache). This provides a very fast look-up mechanism to > translate a pathname (filename) into a specific dentry. Dentries live > in RAM and are never saved to disc: they exist only for performance. ### This gave us a clue: what if this was a problem with the dentry cache? We noticed a lot of dentry cache validation was done in `fs/nfs/dir.c`. In particular, `nfs4_lookup_revalidate()` sounded promising. 
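If you want to follow along in the kernel source, a quick grep surfaces these revalidation hooks and shows where they plug into the VFS (run from the top of a mainline checkout):

```sh
# Locate the NFS dentry-revalidation functions and their wiring into the VFS
grep -rn 'lookup_revalidate' fs/nfs/
grep -rn '\.d_revalidate' fs/nfs/
```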
As an experiment, we hacked that function to bail out early:

```diff
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 8bfaa658b2c1..ad479bfeb669 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1159,6 +1159,7 @@ static int nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags)
 	trace_nfs_lookup_revalidate_enter(dir, dentry, flags);
 	error = NFS_PROTO(dir)->lookup(dir, &dentry->d_name, fhandle, fattr, label);
 	trace_nfs_lookup_revalidate_exit(dir, dentry, flags, error);
+	goto out_bad;
 	if (error == -ESTALE || error == -ENOENT)
 		goto out_bad;
 	if (error)
```

That made the stale file problem in our experiment go away! Now we were onto something.

To answer, "Why does this problem not happen in NFS v4.1?", we added `pr_info()` calls to every `if` block in that function. After running our experiments with NFS v4.0 and v4.1, we found this special condition being hit in the v4.1 case:

```c
if (NFS_SB(dentry->d_sb)->caps & NFS_CAP_ATOMIC_OPEN_V1) {
	goto no_open;
}
```

What is `NFS_CAP_ATOMIC_OPEN_V1`? [This kernel patch](https://patchwork.kernel.org/patch/2300511/) mentioned that it was an NFS v4.1-specific feature, and the code in `fs/nfs/nfs4proc.c` confirmed that this flag was a capability present in v4.1 but not in v4.0:

```c
static const struct nfs4_minor_version_ops nfs_v4_1_minor_ops = {
	.minor_version = 1,
	.init_caps = NFS_CAP_READDIRPLUS
		| NFS_CAP_ATOMIC_OPEN
		| NFS_CAP_POSIX_LOCK
		| NFS_CAP_STATEID_NFSV41
		| NFS_CAP_ATOMIC_OPEN_V1
```

That explained the difference in behavior: in the v4.1 case, the `goto no_open` would cause more validation to happen in `nfs_lookup_revalidate()`, but in v4.0, `nfs4_lookup_revalidate()` would return earlier.

Now, how do we actually solve the problem?

## The solution

I reported the [findings to the NFS mailing list](https://marc.info/?l=linux-nfs&m=153782129412452&w=2) and proposed [a naive patch](https://marc.info/?l=linux-nfs&m=153807208928650&w=2). A week after the report, Trond Myklebust sent a [patch series to the list that fixed this bug and addressed another related issue he found in NFS v4.1](https://marc.info/?l=linux-nfs&m=153816500525563&w=2).

It turns out the fix for the NFS v4.0 bug was deeper in the code base than we had looked. Trond summarized it well in the [patch](https://marc.info/?l=linux-nfs&m=153816500525564&w=2):

> We need to ensure that inode and dentry revalidation occurs correctly
> on reopen of a file that is already open. Currently, we can end up not
> revalidating either in the case of NFSv4.0, due to the 'cached open'
> path. Let's fix that by ensuring that we only do cached open for the
> special cases of open recovery and delegation return.

We confirmed that this fix made the stale file problem go away and filed bug reports with [Ubuntu](https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1802585) and [Red Hat](https://bugzilla.redhat.com/show_bug.cgi?id=1648482).

Knowing full well that kernel changes can take a while to make their way into stable distribution releases, we also added a [workaround in Gitaly](https://gitlab.com/gitlab-org/gitaly/merge_requests/924) to deal with this issue. Our experiments showed that calling `stat()` on the `packed-refs` file causes the kernel to revalidate the dentry cache entry for the renamed file. For simplicity, Gitaly does this regardless of whether the filesystem is NFS; it happens only once before Gitaly "opens" a repository, and there are already other `stat()` calls that check for other files.
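Reduced to shell, the idea behind the workaround looks roughly like this (illustrative only; the repository path is hypothetical, and Gitaly performs the equivalent `stat()` in its own code when it opens a repository):

```sh
# Hypothetical bare repository on an NFS mount
repo=/var/opt/gitlab/git-data/repositories/group/project.git

# stat() first: on affected NFS v4.0 clients this revalidates the dentry,
# so subsequent reads see the packed-refs file that was renamed into place
stat "$repo/packed-refs" > /dev/null 2>&1 || true
git --git-dir="$repo" for-each-ref --count=5
```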
## What we learned

A bug can be anywhere in your software stack, and sometimes you have to look beyond your application to find it. Having helpful partners in the open source world makes that job much easier.

We are extremely grateful to Trond Myklebust for fixing the problem, and to Bruce Fields for responding to questions and helping us understand NFS. Their responsiveness and professionalism truly reflect the best of the open source community.

Photo by [dynamosquito](https://www.flickr.com/photos/dynamosquito) on [Flickr](https://www.flickr.com/photos/dynamosquito/4265771518)
{: .note}