ZFS primarycache: all versus metadata

In my previous post, I wrote about tuning a ZFS storage for MySQL. For InnoDB storage engine, I've tuned the primarycache property so that only metadata would get cached by ZFS. It makes sense for this particular use, but in most cases you'll want to keep the default primarycache setting (all).
Here is a real world example showing how a non-MySQL workload is affected by this setting.

On a virtual server, 2 vCPU, 8 GB RAM, running FreeBSD 9.1-RELEASE-p7, I have a huge zpool of about 4 TB. It uses gzip compression, and stores 1.8TB of emails (compressratio 1.61x) and 1TB documents (compressratio 1.15x). Documents and emails have their own dataset, on the same zpool.
As it's a secondary backup server, isolated from production, I can easily make some tests.

I've launched clamscan (the binary part of ClamAV virus scanner) against a small branch of the email storage tree (about 1/1000 of total emails) and measured the zpool IOs, CPU usage and total runtime of the scan.
Before each run, I've rebooted the server to clear cache.
clamscan is set up so that every temporary files are written into a UFS2 filesystem (/tmp).

One run was made with property primarycache set to all, the other run was made with primarycache set to metadata.

Total runtime with default settings primarycache=all is less than 15 minutes, for 20518 files:

Scanned directories: 1
Scanned files: 20518
Infected files: 2
Data scanned: 7229.92 MB
Data read: 2664.59 MB (ratio 2.71:1)
Time: 892.431 sec (14 m 52 s)

Total runtime with default settings primarycache=metadata is more than 33 minutes:

Scanned directories: 1
Scanned files: 20518
Infected files: 2
Data scanned: 7229.92 MB
Data read: 2664.59 MB (ratio 2.71:1)
Time: 2029.921 sec (33 m 49 s)

zpool iostat every 5 seconds, with different primarycache settings, ~10 minutes range.

zpool IO stats

CPU usage for clamscan process, and for kernel{zio_read_intr_0} kernel thread. 5 seconds sampling, with different primarycache settings, ~10 minutes range.

CPU stats

In both tests, the server is freshly rebooted, cache is empty. Nevertheless, when primarycache=all the kernel{zio_read_intr_0} thread consumes very few CPU cycles, and the clamscan process run's more than twice as fast as the same process with primarycache=metadata.
More importantly, clamscan manages to read the exact same amount of data in both tests, using 10 times less IO throughput when primarycache is set to all.

There is something weird. Let's make another test:

I create 2 brand new datasets, both with primarycache=none and compression=lz4, and I copy in each one a 4.8GB file (2.05x compressratio). Then I set primarycache=all on the first one, and primarycache=metadata on the second one.
I cat the first file into /dev/null with zpool iostat running in another terminal. And finally, I cat the second file the same way.

The sum of read bandwidth column is (almost) exactly the physical size of the file on the disk (du output) for the dataset with primarycache=all: 2.44GB.
For the other dataset, with primarycache=metadata, the sum of the read bandwidth column is ...wait for it... 77.95GB.

There is some sort of voodoo under the hood that I can't explain. Feel free to comment if you have any idea on the subject.

A FreeBSD user has posted an interesting explanation to this puzzling behavior in FreeBSD forums:

clamscan reads a file, gets 4k (pagesize?) of data and processes it, then it reads the next 4k, etc.

ZFS, however, cannot read just 4k. It reads 128k (recordsize) by default. Since there is no cache (you've turned it off) the rest of the data is thrown away.

128k / 4k = 32
32 x 2.44GB = 78.08GB


À lire aussi :


  1. No, I have not tried a smaller block size, but the result would be mathematically coherent. The main reason I have not tried is because with small blocksize, the compress ratio is not that good.
    Having a block the same size than your I/O is very relevant when you don't have compression on your ZFS dataset (see my post about MySQL on ZFS for example).

Laisser un commentaire

Votre adresse de messagerie ne sera pas publiée. Les champs obligatoires sont indiqués avec *