NFS, NFS, and more NFS

I recently read the 1985 paper "Design and Implementation of the Sun Network Filesystem", but didn't find much in there that was interesting to me.

Sunday, in a quest to figure out how chmod is actually used out in the wild, I went through log output we've been getting from a kernel patch of Andreas's that logs uses of the chmod system call. I also grepped through the entire filesystem looking for scripts that call "chmod".

Why are there so many programs that create a file and then immediately chmod it? Why couldn't they just get the umask and the create mode right?
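Out of curiosity, here's a small sketch of the alternative (in Python; the 0o666 mode and 0o022 umask are just illustrative values, not anything from the logs): pass the desired mode to the create call and let the umask filter it, so no follow-up chmod is needed.

```python
import os
import stat
import tempfile

d = tempfile.mkdtemp()
path = os.path.join(d, "example")

old_umask = os.umask(0o022)          # a typical default umask
try:
    # The mode passed to os.open() is filtered by the umask:
    # 0o666 & ~0o022 == 0o644, so no separate chmod is needed.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
    os.close(fd)
finally:
    os.umask(old_umask)              # restore the previous umask

mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))  # 0o644
```

Creating the file with the right mode up front also avoids the brief window in which the file exists with the wrong permissions.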

Today was the first day of the Bakeathon here in Ann Arbor. I did a little more ACL work, some debugging, tried to help people get set up, and talked with Greg Banks a little.

Several of us went out to the Arbor Brewing Company afterwards. It was pleasant enough, though a little noisy.

suicidal panniers

I was rolling down East Medical Center Drive at a pretty good clip this afternoon when my steering suddenly went a little funny and I heard a little "wump" and some metallic clangs.

Looking back confirmed my suspicions: the pannier hung on the left side of my rear wheel had thrown itself off onto the street.

Fortunately, the closest car was far enough back to have plenty of time to drive around my stuff. And all that had actually scattered out of the pannier was my keys and the two halves of my bicycle lock.

This isn't the first time something like this has happened, though it's probably the worst. Some day I'm afraid I'm going to lose control of the bike, or my laptop is going to slide into the next lane and get run over.

So I need to figure out some better way to carry stuff. Or maybe these panniers just need some more adjustment.


Kleiman's "Vnodes: An Architecture for Multiple File System Types in Sun UNIX" describes a basic architecture that's pretty familiar at this point.

One design goal I wouldn't have thought of:

All file system operations should be atomic. In other words, the set of interface operations should be at a high enough level so that there is no need for locking (hard locking, not user advisory locking) across several operations. Locking, if required, should be left up to the file system implementation dependent layer. For example, if a relatively slow computer running a remote file system requires a supercomputer server to lock a file while it does several operations, the users of the supercomputer would be noticeably affected. It is much better to give the file system dependent code full information about what operation is being done and let it decide what locking is necessary and practical.

Linux's VFS doesn't really work that way--the VFS operations are too fine-grained. But that does mean we end up having to do special things for distributed filesystems (e.g. the intents stuff).

Note that the VFS described in this paper has no common inode lock, for example; that's left up to the filesystem.
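The dispatch idea itself is simple enough to sketch. Here's a toy version in Python (the class and method names are mine, not Sun's actual interface): a filesystem-independent layer calls through a per-filesystem operations table, and each operation is high-level enough to be complete in one call.

```python
class VnodeOps:
    """Operations every filesystem type must supply (illustrative subset)."""
    def lookup(self, name):
        raise NotImplementedError
    def create(self, name):
        raise NotImplementedError

class RamFS(VnodeOps):
    """A toy in-memory filesystem; any locking it needed would live here,
    not in the generic layer above it."""
    def __init__(self):
        self.entries = {}

    def lookup(self, name):
        return self.entries.get(name)

    def create(self, name):
        # One high-level call: the generic layer never has to hold a lock
        # across a separate lookup-then-create pair.
        self.entries.setdefault(name, object())
        return self.entries[name]

root = RamFS()
root.create("motd")
print(root.lookup("motd") is not None)  # True
```

The point of the quote above is visible even in this toy: because "create" is one operation rather than "lookup, then insert if absent", the generic layer needs no cross-call locking.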

Demolition Derby

Monday night we went with Paul, Dave, and Bill C. to the Saline Community Fair for the demolition derby. Well, it seemed like one of those things that's worth trying once.

The basic idea is really simple: 10 cars, give or take, all start at once. The last car still moving wins. Referees may disqualify cars or stop the whole thing if (as happens once or twice) it looks like an engine fire might get out of control. Then I guess there's a whole bunch of rules about how the cars are prepared. The one obvious rule was that the doors all had to be welded shut.

I think there may have been five rounds altogether, the fifth featuring winners (well, survivors anyway) of the previous four. That tried my patience a bit. But it was still pretty funny.

Paul described it as like an action movie but real, and with all the in-between non-action parts removed. There was lots of noise and confusion. It's surprising how completely the body of a car can be mutilated without stopping it from running. Even the occasional missing wheel or shredded tire didn't always stop them.

So while I don't think I need to see another, I definitely don't regret going.

world's best newspaper article

A few years ago the Michigan Daily's Crime Notes section included an item under the headline "Boyfriend reports missing girlfriend, later found":

A male resident of Cozen's Residence Hall became worried Saturday evening after he was unable to locate his girlfriend, DPS reports state. The man had previously made plans with his girlfriend, who was later located.

You read the article, and you think, that could not possibly be improved on--love lost and regained all in 35 words. And then you realize the headline does it all in 6 words. It's a masterpiece.

I just have the clipping now, stuck to my office door, and don't even know what year it came from.

fire, paper, scissors, rock

Josh, Valree, and two of their friends are all leaving for Los Angeles, so they had a party at their friends' house in Ypsi on Saturday night. We didn't know many of the people there, but they had both a fire-eater and a band, which had been formed just days before for the party and had only 3 songs, the first of which was "paper, scissors,... rock!", also the name of the band.

So you can't complain about a party like that.

Dinner music

The other night Sara and I had one of those instant Indian meals that come in a box containing a foil pouch with your channa masala or palak paneer ready to be heated.

This time we were surprised to find a CD stuck to the foil pouch, with a note to buy more of their meals and collect all five volumes. So of course we put it on and listened to it with dinner. And actually it was very good--it was a few classical Indian pieces, a genre of music I like but know almost nothing about.

The standard for supermarket convenience food has now been raised--I'll always expect dinner music packaged with my dinner from now on.

Sangria, hideous names, readahead

My day at work alternated between trying to put out a new release of my kernel patches (complicated by a failing machine at citi making some of our services unreliable), working on various ACL-related stuff (mainly talking over things with Andreas and then trying to read through some of his code), and helping Fred with some debugging (a little; Fred was the one that actually figured it out).

Some days a problem really grabs me and I just can't stop thinking about it; this wasn't such a day.

After work I met Trond, Laura, Sara, and (eventually) Paul at Dominicks. Trond suggested the nfsd readahead problem as the sort of thing that should keep a person up at night.

In particular, the problem is this: every time an application asks for data from a file, you have to go read the data from disk. But that's terribly inefficient. It takes eons (ok, milliseconds--but that's eons on today's hardware, where a processor executes an instruction every nanosecond) to move the disk head to the right place, wait for the disk to spin around to the right spot, and read the data. So, ideally, you'd like the data to already have been read. How is that possible? Well, it's not always possible for the operating system to predict what's going to be asked for next. But often it is, because often applications just read through whole files sequentially from beginning to end.

So any modern operating system recognizes when an application is reading straight through a file from beginning to end, and starts anticipating by performing "readahead"--reading the next few chunks of the file before they're requested, assuming that they'll be needed soon.

The problem comes when you throw NFS into the mix--now the application is separated from the disk by a network, and the read requests may sometimes be reordered in transit and arrive in a different order. So even though the application is reading through the file in order, to the NFS server the requests look like they're jumping back and forth a little, and the standard readahead algorithm fails to recognize that the access pattern is still essentially sequential.
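Here's a toy sketch of what a reorder-tolerant heuristic might look like (this is not nfsd's actual algorithm--the window size and all the names are invented): treat a request as still sequential if it's either the next expected block or only a few blocks ahead of it.

```python
WINDOW = 4  # how far ahead of the expected block we tolerate (invented value)

class ReadaheadDetector:
    """Toy detector: is a stream of block reads 'sequential enough'?"""
    def __init__(self):
        self.next_expected = 0   # next block if the stream were strictly in order
        self.pending = set()     # blocks that arrived early, ahead of schedule

    def observe(self, block):
        """Record a read of `block`; return True if it still looks sequential."""
        if block == self.next_expected:
            self.next_expected += 1
            # absorb any early arrivals that are now in order
            while self.next_expected in self.pending:
                self.pending.remove(self.next_expected)
                self.next_expected += 1
            return True
        if 0 < block - self.next_expected <= WINDOW:
            self.pending.add(block)   # early, but within the tolerance window
            return True
        return False                  # looks like genuinely random access

det = ReadaheadDetector()
# Requests reordered in transit: 0, 2, 1, 3 is still "sequential enough".
results = [det.observe(b) for b in [0, 2, 1, 3]]
print(results)  # [True, True, True, True]
```

A strict detector would have given up at the out-of-order "2"; the window lets the server keep reading ahead despite network reordering.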

I actually read a paper recently where they dealt with this problem, but can't for the life of me remember where it was....

When I got home I tried reading through Pike and Weinberger's "the Hideous Name", but found it kind of a pointless paper.


McKusick and Kowalski's "Fsck -- The UNIX File System Check Program" (1985) rehashes some details from the FFS paper before talking about corruption.

Note that all writes required to deallocate a block from an inode are done synchronously, and all directory operations are done synchronously, to prevent, for example, a block ever being referenced from two different inodes (no longer an automatically recoverable situation since you can't decide which it should belong to).

The rest of the paper is a list of the various sanity checks performed by fsck; mostly fairly obvious. There's no discussion of performance, which must be terrible--multiple passes through the whole filesystem seem to be required, and memory requirements for some of the calculations are probably unbounded.
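The duplicate-block check mentioned above, for instance, can be sketched as a single pass over all inodes' block lists (the inode and block numbers here are made up):

```python
# Toy version of one fsck sanity check: flag any disk block claimed by
# more than one inode. A real fsck walks on-disk structures, of course.
inodes = {
    1: [10, 11, 12],
    2: [13, 14],
    3: [12, 15],   # block 12 is claimed twice -- not automatically recoverable
}

seen = {}          # block number -> first inode that claimed it
duplicates = []    # (block, first claimant, second claimant)
for ino, blocks in inodes.items():
    for b in blocks:
        if b in seen:
            duplicates.append((b, seen[b], ino))
        else:
            seen[b] = ino

print(duplicates)  # [(12, 1, 3)]
```

Even this toy shows where the memory concern comes from: the `seen` table grows with the number of allocated blocks in the filesystem.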

"A Fast File System for UNIX"

I ran across Filesystems Reading List recently and thought I should skim through some of the references.

McKusick, Joy, Leffler, and Fabry, "A Fast File System for UNIX" describes improvements to "the" UNIX file system, including:

  • increasing the minimum block size to 4096, to make it possible to address larger files without adding more indirection, and to reduce seeks (and overhead of data transfers?),
  • adding redundant copies of the superblock (which never changes) to aid recovery
  • representing free/allocated blocks using bitmaps (one for each "cylinder group") instead of just a single list of free blocks, to allow smarter allocations that manage fragmentation,
  • adding the rule reserving the last few percent of free space to the system administrator, mainly to prevent fragmentation,
  • adding more complex allocation policies designed to group inodes from the same directory together, and file data for the same inode together, to reduce seeks.

To prevent wasting space on (typical) filesystems with lots of small files, they actually allocate space in units of fragments, with size set to some fixed fraction of the block size. To keep down fragmentation (I guess) they'll allocate a new block and copy a bunch of fragments into it if a new write would create enough fragments to fill a block. Since that copying is inefficient, they discourage it by introducing what I guess must be the st_blksize stat field.
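As a concrete illustration of the fragment accounting (with a 4096-byte block and 1/4-block fragments picked arbitrarily--the fragment ratio is a filesystem parameter), a file's allocated space is whole blocks plus fragments for the tail:

```python
BLOCK = 4096
FRAG = BLOCK // 4   # fragments are a fixed fraction of the block size (here 1/4)

def space_used(size):
    """Bytes allocated for a file of `size` bytes: full blocks, then
    just enough fragments to cover the remainder."""
    full_blocks, rest = divmod(size, BLOCK)
    frags = -(-rest // FRAG)   # ceiling division for the tail
    return full_blocks * BLOCK + frags * FRAG

# A 4700-byte file takes one full block plus one 1024-byte fragment,
# instead of two full blocks.
print(space_used(4700))  # 5120
```

The saving is exactly what the paper is after: the tail of a small file costs a fragment, not a whole block.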

The basic motivation for all of this was to increase reliability and performance (the old filesystem was using only a few percent of the theoretically available bandwidth to storage).

They were limited by disk drivers that would only transfer one block at a time; so reading two contiguous blocks ends up not being as fast as expected, since by the time you're ready to read the second you've already spun past it. So they have some bizarre allocation policies that try to locate blocks that are spaced at the right intervals around the platter so that you don't overshoot them when you do a big sequential read.

They also introduce long file names (and discuss the directory layout), symlinks, "flock" locks, and quotas.

They also add the rename operation, motivated by the desire to atomically replace a file with a new version of it. If that has to be done by removing the old file, linking in the new file at the old file's location, then unlinking the new file's temporary location, then there are times when neither the old nor the new version is visible at the target location, etc. So "rename" is added to do all of this atomically.
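The idiom rename enabled is still the standard one today; a sketch in Python (the file names are arbitrary):

```python
import os
import tempfile

d = tempfile.mkdtemp()
target = os.path.join(d, "config")

with open(target, "w") as f:
    f.write("old contents\n")

# Write the new version under a temporary name first...
tmp = target + ".tmp"
with open(tmp, "w") as f:
    f.write("new contents\n")
    f.flush()
    os.fsync(f.fileno())   # make sure the new data hits the disk first

# ...then rename it over the old one. Readers always see either the
# complete old file or the complete new file, never nothing.
os.rename(tmp, target)

content = open(target).read()
print(content)  # new contents
```

Without an atomic rename, any reader that opened "config" between the unlink and the link would find it missing.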

The paper is from 1984, so I would have been in seventh grade.

