Making my home NFS server go faster for $22

Fri, 2021-01-29 22:52 — bfields

It was time to replace my home file server, but first I needed to figure out how to solve a performance problem:

NFS is generally pretty good at reading and writing big files. If you're just copying a big file to an NFS partition, usually all you need to know is the bandwidth your network and your drives are capable of, and that will tell you how long the transfer will take.

So people are sometimes surprised when they need to do something that *isn't* just reading or writing a big file, and suddenly get much worse performance.

For example, let's extract an archive of the Linux kernel source. It's 176M. My client is connected to my server by gigabit ethernet, and the exported filesystem is striped across a couple of hard drives. It takes a couple seconds to copy the archive over, but a simple "tar -xzf ~/linux-5.9.tar.gz" onto the NFS filesystem takes nearly 2 hours.

The problem isn't the 176M; the problem is that it's trying to create about 75,000 files. Each file create requires a round trip to the server. Worse, the NFS protocol forbids the server from replying until it has guaranteed that the file is safely on disk. (The reasons for this are a bit subtle, but it's basically a consequence of the NFS protocol's guarantee to provide correct behavior even if the server crashes and comes back up while you're using it.)

Hard drives have RAM caches which allow them to respond quickly to writes, but those RAM caches are lost on power failure, so the NFS server code waits for the file create to actually reach disk, an operation which takes the typical hard drive 10ms or more.

And "tar" isn't smart enough to use any parallelism--it's literally just creating a file, then setting some attributes, then writing the data, then closing it, then moving on to the next file. Do that 75,000 times, with several 10ms+ waits at each step, and it adds up.
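To see how it adds up, here's a back-of-the-envelope sketch. The number of synchronous waits per file is an assumption picked for illustration (the real count depends on the client, the server, and the filesystem); the 75,000 files and the 10ms per wait are the figures from above.

```shell
# Estimated seconds for a strictly serial extract:
# files x synchronous waits per file x milliseconds per wait.
total_seconds() {
    files=$1; waits=$2; latency_ms=$3
    echo $(( files * waits * latency_ms / 1000 ))
}

total_seconds 75000 3 10   # 2250 seconds, about 37 minutes
total_seconds 75000 9 10   # 6750 seconds, a bit under 2 hours
```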

So, you need faster storage. In particular, storage that's faster at committing data to "stable storage". (By which we mean, storage that will survive a power failure.) The traditional solution is a big fancy disk array with a RAM cache that's backed by battery, but those turn out to be expensive, and not something I want in my basement.

SSDs are fast, right? Well, yes, they're faster at most stuff, but it turns out that at this particular thing--committing data to stable storage--they're not necessarily much faster than hard drives.

The exception is "enterprise SSDs"--look for a feature that's usually called something like "enhanced power loss protection". These drives have big capacitors that work like batteries, so when the power dies unexpectedly, they have enough energy to get the data in their cache to stable storage before they shut down. They're a lot cheaper than big drive arrays. (And also much less annoying to have running in your basement 24/7.) But they're still kind of expensive.

But these days it turns out there's another option for cheapskates: the Intel 16GB Optane Memory module. You can use it with Windows to do some special tricks if you have the right kind of motherboard, but it also works as a perfectly standard M.2 NVMe SSD. It's small, but it's only $22, and it turns out it's pretty fast at this one particular thing we need it to do--it can commit a write to stable storage in a fraction of a millisecond instead of needing 10ms or more.
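If you want to check how a particular device does at this, one crude test is to time small synced file creates on a filesystem that lives on it. What follows is only a sketch, assuming GNU coreutils (dd with conv=fsync, date with %N); "avg_commit_us" is a made-up helper name. It fsyncs only the file, not the directory, and the result includes the cost of starting dd for every file, so treat the number as a rough upper bound rather than exactly what the NFS server will see.

```shell
# Average microseconds to create a 4k file and force it to stable storage.
# Arguments: directory on the filesystem under test, number of files to create.
avg_commit_us() {
    dir=$1
    count=$2
    start=$(date +%s%N)
    i=0
    while [ "$i" -lt "$count" ]; do
        # conv=fsync makes dd fsync() the output file before it exits.
        dd if=/dev/zero of="$dir/commit-test.$i" bs=4k count=1 conv=fsync status=none
        i=$((i + 1))
    done
    end=$(date +%s%N)
    rm -f "$dir"/commit-test.*
    # Nanoseconds elapsed, divided per file, converted to microseconds.
    echo $(( (end - start) / count / 1000 ))
}
```

For example, "avg_commit_us /mnt/test 100", where /mnt/test is a placeholder for wherever the device is mounted.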

Also, recent Linux kernels have a feature called dm-writecache that allows you to use a fast SSD as a write cache in front of slower drives. In my case I set up striping across two hard drives and the write cache on the Optane device with:

    pvcreate /dev/sdb
    pvcreate /dev/sdc
    vgcreate export /dev/sdb /dev/sdc
    lvcreate -i 2 -l 100%VG -n stripehds export
    vgextend export /dev/nvme0n1
    lvcreate -n optane -l 100%FREE export /dev/nvme0n1
    lvchange -a n /dev/export/optane
    lvconvert --type writecache --cachevol optane export/stripehds

Then I put a filesystem on /dev/export/stripehds and NFS-exported it.

All those file creates then go to the Optane, which responds very quickly, and then they get flushed out to the hard drives over time.

The result is that the "tar -xzf linux-5.9.tar.gz" is down to 4-5 minutes. Still not great (it takes about 12s to a local disk on my machine), but a major improvement over 2 hours.
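Dividing those times by the file count gives a rough per-file cost. This is just arithmetic on the numbers above: 7200 seconds stands in for "nearly 2 hours" (so it slightly overestimates), 300 seconds for the 5-minute end of the new range, and 75,000 is the approximate file count.

```shell
# Average microseconds spent per file, given total seconds and file count.
per_file_us() {
    seconds=$1; files=$2
    echo $(( seconds * 1000000 / files ))
}

per_file_us 7200 75000   # 96000 us (96 ms) per file before
per_file_us 300 75000    # 4000 us (4 ms) per file with the write cache
per_file_us 12 75000     # 160 us per file on a local disk
```

So even with the cache, each file still costs roughly 20 to 25 times what it does locally.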

There's another way to do this, by the way: ext4 and xfs both allow you to put the journal onto a separate device from the device that hosts the rest of the filesystem. If you put the journal on the right kind of SSD (either an Optane, or an SSD with power loss protection), then you'll get a similar boost, though I find dm-writecache is doing better.
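For ext4, the external-journal setup might look something like the sketch below. It is not what I ran: the device arguments are placeholders, "make_external_journal" is a made-up helper name, and the explicit 4096-byte block size is an assumption (the journal device and the filesystem need matching block sizes). Both commands destroy whatever is on the devices, so the helper only prints them unless RUN is set to an empty string.

```shell
# Print (or, with RUN="", actually run) the commands that create an ext4
# filesystem whose journal lives on a separate fast device.
make_external_journal() {
    journal_dev=$1   # fast device, e.g. the Optane (placeholder)
    data_dev=$2      # slow device holding the rest of the filesystem (placeholder)
    run=${RUN-echo}  # defaults to echo, i.e. a dry run
    # Format the fast device as a dedicated external journal.
    $run mke2fs -O journal_dev -b 4096 "$journal_dev"
    # Create the main filesystem with its journal on that device.
    $run mkfs.ext4 -b 4096 -J device="$journal_dev" "$data_dev"
}
```

XFS has an analogous external log option (logdev, given both to mkfs.xfs and at mount time).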

(Also, the 5.10 kernel I was using had a bug which limited performance; thanks to Mikulas Patocka for figuring it out. Hopefully the patch fixing the bug will be available in any kernel you'd use by the time you read this.)

There are still ways we could do better:

A high-end enterprise NVMe drive would probably be faster. I was going for cheap.
Programs like tar could be rewritten to use parallelism.
If the Linux NFS server supported write delegations, that might help.
We've also considered adding support for write directory delegations to the protocol, which would allow the client to create files without waiting for the server. But that's a complicated project that would take a long time.
