ZFS primary cache is good

Last year I've written a post about ZFS primarycache setting, showing how it's not a good idea to mess with it. Here is a new example based on real world application.
Recently, my server crashed, and at launch-time Splunk decided it was a good idea to re-index a huge apache log file. Apart from exploding my daily index quota, this misbehavior filed the index with duplicated data. Getting rid of 1284408 events in Splunk can be a little bit resource-intensive. I won't detail the Splunk part of the operation: I've ended up having 1285 batches of delete commands that I've launched with a simple for/do/done bash loop. After a while, I noticed that the process was slow and was making lots of disk IOs. Annoying. So I checked:

# zfs get primarycache zdata/splunk
NAME          PROPERTY      VALUE         SOURCE
zdata/splunk  primarycache  metadata      local

Uncool. This setting was set locally so that my toy (Splunk) would not harvest all ARC from the server, hurting production. For efficiency's sake, I've switched back the primary cache to all:

# zfs set primarycache=all zdata/splunk

Effect was almost instantaneous: ARC filled with Splunk data and disk IOs plummeted.

primarycache # of deletes per second
metadata 10.06
all 22.08

A x2.2 speedup on a very long operation (~20 hours here) is a very good argument in favor of primarycache=all for any ZFS user.

acceleration of a repetitive splunk operation thanks to ZFS primarycache setting

Acceleration of a repetitive splunk operation thanks to ZFS primarycache setting

ZFS primarycache: all versus metadata

In my previous post, I wrote about tuning a ZFS storage for MySQL. For InnoDB storage engine, I've tuned the primarycache property so that only metadata would get cached by ZFS. It makes sense for this particular use, but in most cases you'll want to keep the default primarycache setting (all).
Here is a real world example showing how a non-MySQL workload is affected by this setting.

On a virtual server, 2 vCPU, 8 GB RAM, running FreeBSD 9.1-RELEASE-p7, I have a huge zpool of about 4 TB. It uses gzip compression, and stores 1.8TB of emails (compressratio 1.61x) and 1TB documents (compressratio 1.15x). Documents and emails have their own dataset, on the same zpool.
As it's a secondary backup server, isolated from production, I can easily make some tests.

I've launched clamscan (the binary part of ClamAV virus scanner) against a small branch of the email storage tree (about 1/1000 of total emails) and measured the zpool IOs, CPU usage and total runtime of the scan.
Before each run, I've rebooted the server to clear cache.
clamscan is set up so that every temporary files are written into a UFS2 filesystem (/tmp).

One run was made with property primarycache set to all, the other run was made with primarycache set to metadata.

Total runtime with default settings primarycache=all is less than 15 minutes, for 20518 files:

Scanned directories: 1
Scanned files: 20518
Infected files: 2
Data scanned: 7229.92 MB
Data read: 2664.59 MB (ratio 2.71:1)
Time: 892.431 sec (14 m 52 s)

Total runtime with default settings primarycache=metadata is more than 33 minutes:

Scanned directories: 1
Scanned files: 20518
Infected files: 2
Data scanned: 7229.92 MB
Data read: 2664.59 MB (ratio 2.71:1)
Time: 2029.921 sec (33 m 49 s)

zpool iostat every 5 seconds, with different primarycache settings, ~10 minutes range.

zpool IO stats

CPU usage for clamscan process, and for kernel{zio_read_intr_0} kernel thread. 5 seconds sampling, with different primarycache settings, ~10 minutes range.

CPU stats

In both tests, the server is freshly rebooted, cache is empty. Nevertheless, when primarycache=all the kernel{zio_read_intr_0} thread consumes very few CPU cycles, and the clamscan process run's more than twice as fast as the same process with primarycache=metadata.
More importantly, clamscan manages to read the exact same amount of data in both tests, using 10 times less IO throughput when primarycache is set to all.

There is something weird. Let's make another test:

I create 2 brand new datasets, both with primarycache=none and compression=lz4, and I copy in each one a 4.8GB file (2.05x compressratio). Then I set primarycache=all on the first one, and primarycache=metadata on the second one.
I cat the first file into /dev/null with zpool iostat running in another terminal. And finally, I cat the second file the same way.

The sum of read bandwidth column is (almost) exactly the physical size of the file on the disk (du output) for the dataset with primarycache=all: 2.44GB.
For the other dataset, with primarycache=metadata, the sum of the read bandwidth column is ...wait for it... 77.95GB.

There is some sort of voodoo under the hood that I can't explain. Feel free to comment if you have any idea on the subject.

A FreeBSD user has posted an interesting explanation to this puzzling behavior in FreeBSD forums:

clamscan reads a file, gets 4k (pagesize?) of data and processes it, then it reads the next 4k, etc.

ZFS, however, cannot read just 4k. It reads 128k (recordsize) by default. Since there is no cache (you've turned it off) the rest of the data is thrown away.

128k / 4k = 32
32 x 2.44GB = 78.08GB

Damnit.

MySQL on ZFS (on FreeBSD)

logofreebsdLots on FreeBSD users are coming to ZFS since the release of FreeBSD 10, as it's so easy to install the system on top of that powerful filesystem. ZFS default behavior and settings are perfect for a wide range of workloads and uses, but it's not exactly what you need for databases hosting.
I'm running a FreeBSD 9.x server hosting http, php, mysql, mail, and many other things, so a databases-only optimization would be counterproductive. After a good dive into experts do's & don't's here are few steps I've taken to tune my server.

Datasets

If like me you've started hosting MySQL databases a long time ago (say ~15 years ago) you might have some DB using the old MyISAM engine, and some newer ones using InnoDB. Guess what. They use different block sizes, and they use different caching mechanisms.
On top of that, the block size for InnoDB is not the same when you deal with data files, and when you deal with log files.
In order to get the best block size and cache tuning possible, I've created 3 datasets: one for InnoDB data files, one for innodb logs, and one for everything else, including MyISAM databases.

Block size

The block size is engine dependent. MyISAM uses a 8k block size, InnoDB uses a 16k block size for data and 128k for logs. It's easy to setup datasets for a particular block size at creation with -o option, or after creation with zfs set. You must set the proper block size before putting any data on the dataset, otherwise pre-existing data won't use the desired block size.

Caching

MyISAM engine relies on the underlying filesystem caching mechanism, so you must ensure ZFS will cache both data and metadata (that's the default behavior). On the other hand, InnoDB uses an internal cache, so it would be a waste on memory to cache the same data into ZFS and InnoDB. On a dedicated MySQL server, the proper tuning would require to limit ARC size, and to disable data caching in ZFS. On a general-purpose server, ARC size should not be tweaked, but on InnoDB dedicated datasets it's easy to disable ZFS cache for data by setting the property primarycache to "metadata".

On MySQL's side

You must of course tell mysqld where to find data files and logs. I've set those values in my /var/db/mysql/my.cnf file:

innodb_data_home_dir = /var/db/mysql-innodb
innodb_log_group_home_dir = /var/db/mysql-innodb-logs

where /var/db/mysql-innodb and /var/db/mysql-innodb-logs are mount points for the dataset dedicated to InnoDB data files, and the dataset dedicated to InnoDB log files.

As my zpool is a mirror, I've added this setting too:

skip-innodb_doublewrite

Step by step

I've followed this course of action:

1st step: mailing to users, "MySQL will be unavailable for about 5 minutes, hang on."
2nd step: backup /var/db/mysql and edit your my.cnf
3rd step: shutdown your mysql server

sudo service mysql-server stop

4th step: move /var/db/mysql to /var/db/mysql-origin
5th step: create appropriate datasets:

zfs create -o recordsize=16k -o primarycache=metadata zmirror/var/db/mysql-innodb
zfs create -o recordsize=128k -o primarycache=metadata zmirror/var/db/mysql-innodb-logs
zfs create -o recordsize=8k zmirror/var/db/mysql

6th step: move data from /var/db/mysql-origin to your new datasets

cd /var/db/
sudo mv mysql-origin/ib_logfile* mysql-innodb-logs/
sudo mv mysql-origin/ibdata1 mysql-innodb/
sudo mv mysql-origin/* mysql/

and set proper rights:

sudo chown mysql:mysql mysql-innodb-logs mysql-innodb mysql
sudo chmod o= mysql mysql-innodb-logs mysql-innodb

7th step: restart your mysql server

sudo service mysql-server start & tail -f mysql/${HOSTNAME}.err

8th step: mailing to users, "MySQL's back online maintenance duration 5 min 14 sec."

Further reading & references

MySQL Innodb ZFS Best Practices
A look at MySQL on ZFS
Optimizing MySQL performance with ZFS
ZFS for Databases