BSD | Cognitive Overhead

9 mars 2014

MySQL on ZFS (on FreeBSD)

logofreebsd Lots of FreeBSD users are coming to ZFS since the release of FreeBSD 10, as it's so easy to install the system on top of that powerful filesystem. ZFS default behavior and settings are perfect for a wide range of workloads and uses, but it's not exactly what you need for databases hosting.
I'm running a FreeBSD 9.x server hosting http, php, mysql, mail, and many other things, so a databases-only optimization would be counterproductive. After a good dive into experts do's & don't's here are few steps I've taken to tune my server.

Datasets

If like me you've started hosting MySQL databases a long time ago (say ~15 years ago) you might have some DB using the old MyISAM engine, and some newer ones using InnoDB. Guess what. They use different block sizes, and they use different caching mechanisms.
On top of that, the block size for InnoDB is not the same when you deal with data files, and when you deal with log files.
In order to get the best block size and cache tuning possible, I've created 3 datasets: one for InnoDB data files, one for innodb logs, and one for everything else, including MyISAM databases.

Block size

The block size is engine dependent. MyISAM uses a 8k block size, InnoDB uses a 16k block size for data and 128k for logs. It's easy to setup datasets for a particular block size at creation with -o option, or after creation with zfs set. You must set the proper block size before putting any data on the dataset, otherwise pre-existing data won't use the desired block size.

Caching

MyISAM engine relies on the underlying filesystem caching mechanism, so you must ensure ZFS will cache both data and metadata (that's the default behavior). On the other hand, InnoDB uses an internal cache, so it would be a waste on memory to cache the same data into ZFS and InnoDB. On a dedicated MySQL server, the proper tuning would require to limit ARC size, and to disable data caching in ZFS. On a general-purpose server, ARC size should not be tweaked, but on InnoDB dedicated datasets it's easy to disable ZFS cache for data by setting the property primarycache to "metadata".

On MySQL's side

You must of course tell mysqld where to find data files and logs. I've set those values in my /var/db/mysql/my.cnf file:

innodb_data_home_dir = /var/db/mysql-innodb
innodb_log_group_home_dir = /var/db/mysql-innodb-logs

where /var/db/mysql-innodb and /var/db/mysql-innodb-logs are mount points for the dataset dedicated to InnoDB data files, and the dataset dedicated to InnoDB log files.

As my zpool is a mirror, I've added this setting too:

skip-innodb_doublewrite

Step by step

I've followed this course of action:

1st step: mailing to users, "MySQL will be unavailable for about 5 minutes, hang on."
2nd step: backup /var/db/mysql and edit your my.cnf
3rd step: shutdown your mysql server

sudo service mysql-server stop

4th step: move /var/db/mysql to /var/db/mysql-origin
5th step: create appropriate datasets:

zfs create -o recordsize=16k -o primarycache=metadata zmirror/var/db/mysql-innodb
zfs create -o recordsize=128k -o primarycache=metadata zmirror/var/db/mysql-innodb-logs
zfs create -o recordsize=8k zmirror/var/db/mysql

6th step: move data from /var/db/mysql-origin to your new datasets

cd /var/db/
sudo mv mysql-origin/ib_logfile* mysql-innodb-logs/
sudo mv mysql-origin/ibdata1 mysql-innodb/
sudo mv mysql-origin/* mysql/

and set proper rights:

sudo chown mysql:mysql mysql-innodb-logs mysql-innodb mysql
sudo chmod o= mysql mysql-innodb-logs mysql-innodb

7th step: restart your mysql server

sudo service mysql-server start & tail -f mysql/${HOSTNAME}.err

8th step: mailing to users, "MySQL's back online maintenance duration 5 min 14 sec."

Further reading & references

MySQL Innodb ZFS Best Practices
A look at MySQL on ZFS
Optimizing MySQL performance with ZFS
ZFS for Databases

21 octobre 2013

Dtrace on FreeBSD 9.2

As of FreeBSD 9.2-RELEASE, Dtrace is enabled by default in the GENERIC kernel. We where already able to activate Dtrace, on earlier releases, but we would have to make a custom kernel. Not always what you want because it breaks binary updates.

Now that Dtrace is enabled (and updated) in FreeBSD, everyone can use it without the hassle of a custom kernel. You still have to load Dtrace's modules. Let's see how it works.

first you go root and try a Dtrace related command:

$ sudo -Es
Password:
# dtruss ls
dtrace: failed to initialize dtrace: DTrace device not available on system

Looks like Dtrace is not loaded:

# kldstat 
Id Refs Address            Size     Name
 1   23 0xffffffff80200000 15b93c0  kernel
 2    1 0xffffffff817ba000 23d078   zfs.ko
 3    2 0xffffffff819f8000 84e0     opensolaris.ko
 4    1 0xffffffff81a01000 4828     coretemp.ko
 5    1 0xffffffff81c12000 2bce     pflog.ko
 6    1 0xffffffff81c15000 3078d    pf.ko
 7    1 0xffffffff81c46000 57bcf    linux.ko
 8    1 0xffffffff81c9e000 3c3d     wlan_xauth.ko

lets find Dtrace related modules:

# ls /boot/kernel/dtrace*
/boot/kernel/dtrace.ko			/boot/kernel/dtrace_test.ko.symbols
/boot/kernel/dtrace.ko.symbols		/boot/kernel/dtraceall.ko
/boot/kernel/dtrace_test.ko		/boot/kernel/dtraceall.ko.symbols

/boot/kernel/dtraceall.ko looks like a winner.

# kldload dtraceall
# kldstat 
Id Refs Address            Size     Name
 1   59 0xffffffff80200000 15b93c0  kernel
 2    1 0xffffffff817ba000 23d078   zfs.ko
 3   16 0xffffffff819f8000 84e0     opensolaris.ko
 4    1 0xffffffff81a01000 4828     coretemp.ko
 5    1 0xffffffff81c12000 2bce     pflog.ko
 6    1 0xffffffff81c15000 3078d    pf.ko
 7    1 0xffffffff81c46000 57bcf    linux.ko
 8    1 0xffffffff81c9e000 3c3d     wlan_xauth.ko
 9    1 0xffffffff81ca2000 ba2      dtraceall.ko
10    1 0xffffffff81ca3000 4ed3     profile.ko
11    3 0xffffffff81ca8000 402c     cyclic.ko
12   12 0xffffffff81cad000 23dbaa   dtrace.ko
13    1 0xffffffff81eeb000 fb3d     systrace_freebsd32.ko
14    1 0xffffffff81efb000 109bd    systrace.ko
15    1 0xffffffff81f0c000 45bb     sdt.ko
16    1 0xffffffff81f11000 4926     lockstat.ko
17    1 0xffffffff81f16000 bf0b     fasttrap.ko
18    1 0xffffffff81f22000 6673     fbt.ko
19    1 0xffffffff81f29000 4eeb     dtnfsclient.ko
20    1 0xffffffff81f2e000 1dd92    nfsclient.ko
21    1 0xffffffff81f4c000 47c0     nfs_common.ko
22    1 0xffffffff81f51000 55f2     dtnfscl.ko
23    1 0xffffffff81f57000 45cd     dtmalloc.ko
24    1 0xffffffff81f5c000 44fc     dtio.ko

tadaaaam:

# dtruss ls
SYSCALL(args) 		 = return
mmap(0x0, 0x8000, 0x3)		 = 6418432 0
issetugid(0x0, 0x0, 0x0)		 = 0 0
lstat("/etc\0", 0x7FFFFFFFC3C0, 0x0)		 = 0 0
lstat("/etc/libmap.conf\0", 0x7FFFFFFFC3C0, 0x0)		 = 0 0
open("/etc/libmap.conf\0", 0x0, 0x81FA20)		 = 3 0
...

Happy debugging/tracing/auditing.

22 septembre 2013

Munin plugins for CRM114

I'm using CRM114-based SpamAssassin plugin for spam filtering at work, and at home. Client-side, I'm able to check CRM114 contribution by a simple look at headers of an email message. But that won't tell you how CRM114 is behaving on the server side. The main concern on the server, is to check the two "databases" spam.css and nonspam.css.
I've chosen to monitor both average packing density and documents learned metrics for both files. The first one goes from 0 to 1. When its value reaches 0.9 and beyond, you must make sure your antispam filtering does not become sluggish.
The second one represents the number of emails from which CRM114 has been trained. Basically, CRM114 is only trained from its mistakes : if an email is flagged as spam by mistake, you can train CRM114 to learn it as nonspam, but if an email is properly flagged as spam, you can't train CRM114 to learn it as spam.

I've designed two Munin plugins, they run on FreeBSD but portage to another UNIX is just a matter of path.

Monitor "average packing density" (crm114_packingdensity): shows how much your CRM114 bases are encumbered

#!/usr/local/bin/bash
#
# Parameters:
#
# 	config   (required)
# 	autoconf (optional - used by munin-config)
#
# Magick markers (optional - used by munin-config and som installation
# scripts):
#%# family=auto
#%# capabilities=autoconf

export PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# config
if [ "$1" = "config" ]; then
    echo 'graph_title CRM114 css file stat average packing density'
    echo 'graph_vlabel Packing density'
    echo 'graph_category ANTISPAM'
    echo 'graph_args --upper-limit 1 --lower-limit 0'
    echo 'graph_info This graph shows the average packing density for CRM114 css files'
    echo 'spam.label spam'
    echo 'nonspam.label nonspam'
    exit 0
fi

cssutil -r -b /var/amavis/.crm114/spam.css | awk '/Average packing density/ {print "spam.value "$5}'
cssutil -r -b /var/amavis/.crm114/nonspam.css | awk '/Average packing density/ {print "nonspam.value "$5}'

Monitor "documents learned" (crm114_documentslearned): shows how many emails CRM114 has learned from

#!/usr/local/bin/bash
#
# Parameters:
#
# 	config   (required)
# 	autoconf (optional - used by munin-config)
#
# Magick markers (optional - used by munin-config and som installation
# scripts):
#%# family=auto
#%# capabilities=autoconf

export PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# config
if [ "$1" = "config" ]; then
    echo 'graph_title CRM114 css file stat documents learned'
    echo 'graph_vlabel Documents learned'
    echo 'graph_category ANTISPAM'
    echo 'graph_args --lower-limit 0'
    echo 'graph_info This graph shows the documents learned count for CRM114 css files'
    echo 'spam.label spam'
    echo 'nonspam.label nonspam'
    exit 0
fi

cssutil -r -b /var/amavis/.crm114/spam.css | awk '/Documents learned/ {print "spam.value "$4}'
cssutil -r -b /var/amavis/.crm114/nonspam.css | awk '/Documents learned/ {print "nonspam.value "$4}'

Both plugins need read access to spam.css and nonspam.css, depending on your setup it will require special privileges. You might want to add in /usr/local/etc/munin/plugin-conf.d/plugins.conf:

[crm114_*]
user root

19 août 2013

On s’amuse avec IPMI

AOC-IPMI20-E-copyright-supermicro J'ai passé un week end plutôt divertissant en compagnie d'IPMI. Tout a commencé avec la lecture d'un article édifiant sur arstechnica.com présentant IPMI comme un parasite, une sangsue.
J'ai toujours su instinctivement que l'IPMI pouvait, quelque part, poser un souci en terme de sécurité. Mais jamais je n'aurai imaginé que la situation fût aussi dramatique.
Les failles sont à tous les niveaux, tant dans le protocole IPMI, que dans les implémentations semi-propriétaires des fabricants. Elles sont dans les logiciels utilisés, elles sont dans le verrouillage que les fabricants imposent aux BMC, les rendant impossible à auditer, etc. Sysadmin de tout poil, je vous invite fortement à lire l'article d'Ars, et les proses de Dan Farmer et HD Moore.

Bref, j'ai tout lu, et disposant d'exemplaires HP et SuperMicro, j'ai tenté moi-même d'exploiter les failles décrites. Je ne rentre pas dans les détails, tout est déjà très bien expliqué dans la littérature sus-mentionnée. Par contre je vais donner quelques pistes pour l'installation des outils sous FreeBSD.

ipmitool, l'outil de base doit être installé avec FreeIPMI, sinon les manipulations permettant d'exploiter les BMC vulnérables ne fonctionneront pas. Sous FreebSD on peut simplement utiliser les ports :

# cd /usr/ports/sysutils/ipmitool
# make install clean

On active l'option "FREEIPMI" dans la configuration, et c'est parti.

Ensuite il est possible d'installer metasploit, outils puissant et complexe. Pour gagner du temps, et si vous ne souhaitez pas approfondir l'usage de ce logiciel, vous pouvez décocher l'option "DB" au moment de l'installation du port :

# cd /usr/ports/security/metasploit
# make install clean

Si vous souhaitez jouer un peu avec les outils et scripts développés par Dan Farmer, vous devrez installer en plus le module perl CaptureOutput.pm, nécessaire au fonctionnement de rak-the-ripper.pl :

# cd /usr/ports/devel/p5-IO-CaptureOutput
# make install clean

Si vous avez en plus une machine disposant d'une BMC qui tourne sur FreeBSD, voici ce qu'il faut faire pour accéder à ce contrôleur parasite via ipmitool :

# kldload ipmi

Cela charge le module ipmi, si ce n'est déjà fait. Vous aurez quelque chose de ce style dans la sortie de dmesg -a :

ipmi0: <IPMI System Interface> port 0xca2,0xca3 on acpi0
ipmi0: KCS mode found at io 0xca2 on acpi
ipmi0: IPMI device rev. 1, firmware rev. 2.35, version 2.0
ipmi0: Number of channels 2
ipmi0: Attached watchdog
ipmi1: <IPMI System Interface> on isa0
device_attach: ipmi1 attach returned 16

Et vous pourrez ensuite vous livrer à toutes sortes d'expériences en local (donc une fois de plus sans authentification…).
La prudence veut que vous désactiviez immédiatement ce module ensuite :

# kldunload ipmi

même si, ne nous voilons pas la face, ça retarderait juste de quelques dizaines de secondes un éventuel attaquant.

Mettez bien à jour le firmware de vos BMC. Et quand je dis bien à jour, c'est très sérieux, puisque des firmwares récents chez HP (juin) ne corrigent pas l'énorme faille du Cipher-0. Il faut utiliser le firmware 1.60 publié fin juillet pour avoir une chance de corriger cette faille de sécurité.

4 août 2013

Running source dedicated server on FreeBSD 9.x

I've covered this subject in french back in 2010, but things have evolved, and installing scrds on FreeBSD is not as straightforward as it used to be. Prerequisites are the same, you must first install the linux compatibility layer:

# As root, load the module & make sure it will be loaded after reboot:
kldload /boot/kernel/linux.ko
echo 'linux_enable="YES"' >> /etc/rc.conf

Then, and only after loading the linux module, install linux_base:

portinstall -PP linux_base-f10
(or portinstall linux_base-f10 if the command above fails)

Add linproc to /etc/fstab,

linproc         /compat/linux/proc      linprocfs       rw 0 0

and mount it:

mount -a

Install linux-steam:

portinstall linux-steam

Add a steamuser user (or whatever name you want). The user must be unprivileged, and must not be able to log into the machine. Set its home to /usr/local/steam and its shell to /usr/sbin/nologin.
Then, update the client linux-steam:

chown -R steamuser /usr/local/steam
cd /usr/local/steam
sudo -u steamuser ./steam
sudo -u steamuser ./steam (yes, do it twice, to make sure it's up to date)

Back in past, you would have used the steam client to install and update games, but people at steam thought it was way too easy. So they made it more complicated. Now you have to install a dedicated tool in order to install games and keep them updated: steamcmd.
Point your browser to the SteamCMD wiki page at Valvesoftware and read it. Then, download the linux version and extract in a dedicated directory (/usr/games for example).

cd /usr/games
fetch http://media.steampowered.com/client/steamcmd_linux.tar.gz
tar -xzf steamcmd_linux.tar.gz

By default, the steamcmd command will maintain your game library into /usr/local/steam/Steam/SteamApps, you might want to create a soft link in order to put your game library somewhere else (the force_install_dir option of steamcmd would not work on my server). At least, change owners for /usr/games/SteamCMD directory, to make sure steamuser can update its content:

mkdir /usr/local/steam/Steam
mkdir /usr/games/SteamApps
ln -s /usr/games/SteamApps /usr/local/steam/Steam/SteamApps
chown steamuser /usr/games/SteamApps
chown -R steamuser /usr/games/SteamCMD

Then you might have to change shebangs in SteamCMD/steam.sh and SteamCMD/steamcmd.sh to use your own bash (probably /usr/local/bin/bash instead of /bin/bash). After what you can launch steamcmd.sh:

cd /usr/games/SteamCMD
sudo -u steamuser ./steamcmd.sh

The program should auto-update, and present you with a Steam> prompt.
Then, you must login. For L4D2, and most games, you can login anonymously:

login anonymous

Choose your game from the list on the wiki, and use its ID to install/update:

app_update 222860 validate

It's possible to automate SteamCMD, for daily update of your games. For example you can create a shell script like this one:

#!/usr/local/bin/bash
cd /usr/games/SteamCMD || exit 1
/usr/local/bin/sudo -u steamuser ./steamcmd.sh +login anonymous +app_update 222860 validate +quit

Running your game server does not change much from my previous post. The path of game folder is the only important modification. I've created a shell script to launch L4D2 server:

#!/usr/local/bin/bash
ROOT="/patpro/games/SteamApps/common/Left 4 Dead 2 Dedicated Server"
SUDO="/usr/local/bin/sudo -u steamuser"
SCREEN=/usr/local/bin/screen
NICE="/usr/bin/nice -n -5"
STEAMRUNARGS="-ip PUT-YOUR-IP-ADDRESS-HERE -fps_max 0 -sys_ticrate 1000"

cd "${ROOT}" || exit 1
${SCREEN} ${NICE} ${SUDO} ./srcds_run ${STEAMRUNARGS}

Open TCP and UDP ports 26901 and 27015 in your firewall, and edit SteamApps/common/Left 4 Dead 2 Dedicated Server/left4dead2/cfg/server.cfg to tweak your settings.

Happy gaming!

9 juin 2013

À vendre : serveur bi Xeon dual core low voltage

Je vends mon ancienne passerelle internet (serveur maison, routeur…), puisque je l'ai remplacée il y a quelques temps par un modèle plus compact.

C'est un boitier tour ATX noir, contenant :

1 carte mère serveur Tyan Tiger i7520SD S5365 bi processeur ;
2 processeurs Intel Xeon LV dual core "sossaman" 1,67 GHz ;
1 carte WIFI DLink DWL-G520, chipset Atheros ;
1 Go de RAM ;
1 alimentation hybride NX-9003 SFB 350W ;
1 lecteur CD.

La carte dispose d'un slot Compact Flash, idéal pour booter un OS léger. La ventilation silencieuse est à base de Nexustek.

Le matériel est robuste, conçu pour tourner 24/24. C'est une configuration 100% compatible FreeBSD/Linux, idéale pour un serveur "maison" multi-usage, une passerelle internet, ou une station Linux/Unix.
Les processeurs Intel Xeon LV dual core "sossaman" ont une enveloppe thermique de 35W, permettant un refroidissement passif.

LOT COMPLET : 350 euros, remise en mains propres uniquement
LOT PARTIEL (carte mère, 2 CPU, RAM, radiateurs) : 300 euros.

29 avril 2013

Relia Fanless Industrial PC : fanless oui, industriel pas trop.

Depuis un moment, pour des raisons de silence et surtout d'encombrement, je voulais remplacer mon routeur (une tour ATX avec ventilation relativement silencieuse) par un boitier compact sans ventilateur. Les conditions à remplir étaient les suivantes :

ventilation passive,
puissance CPU d'actualité, et 64bits
2 ports ethernet, grand minimum,
WIFI,
port série,
2 emplacements pour disque dur, grand minimum,
ne pas coûter un bras.

En cherchant des petits PC fanless "industriels" je suis tombé sur Aleutia, qui propose le modèle Relia. Celui-ci rempli approximativement le cahier des charges. Surtout, les gens chez Aleutia ont bien assuré l'avant-vente, répondant à mes nombreuses questions avec rapidité et précision. Pour la somme pas vraiment négligeable de 848,32 euros j'ai :

un mini PC sans ventilateur,
Core i3 2.8 GHz,
8 Go de RAM,
carte Wifi/Bluetooth avec deux antennes de 30cm,
(seulement) 2 ports ethernet,
SSD de 64 Go mini-PCIe,
2 emplacements pour disques durs de 2,5".

Aleutia Relia crédit photo : Aleutia

Comme l'engin n'a que 2 ports ethernet, il m'a fallu ajouter un mini switch ethernet silencieux (Netgear GS105E) pour 25 euros de plus. J'ai aussi monté dans le boitier 2 disques HGST Travelstar 7K1000 en SATA III, 1To chaque.
Montant total : 1032 euros et 19 cents.

Sur le plan physique, je suis satisfait de mon achat. L'alimentation externe est petite et discrète et le PC est compact, très robuste, d'un beau noir mat. Il est incontestablement d'aspect industriel, à tel point que je ne laisserai pas des gamins jouer avec : en le manipulant on peut se faire mal sur les angles des radiateurs latéraux. Il est livré dans la boite d'origine du fabricant Streacom.

Sur le plan du silence par contre, on repassera. La machine émet une sorte de couinement aigu, d'intensité variable et inversement proportionnelle à la charge CPU. Ce bruit d'origine électronique est désagréable, car il porte loin et que j'y suis sensible. Moralité, pas besoin de ventilateur pour polluer l'ambiance. Par dessus, j'ai rajouté les deux disques qui font, même inactifs, un léger bruit de ventilateur.

À l'intérieur ça se corse. La carte mère est un simple modèle de bureau, la Intel DQ77KB. Et ça, ce n'est pas une bonne nouvelle (même si je le savais avant d'acheter). Pour faire simple, et court, je découvre les joies des "interrupt storms", des chipsets qui font tout sauf le café, et qui donc ne font pas forcément bien le travail. Je me retrouve aussi face à un BIOS pas très robuste et manquant de souplesse.
Il n'y a que sur une carte mère grand public que le système d'exploitation peut perdre les pédales quand on branche un écran. Là où une carte de serveur aurait une électronique bien mieux cloisonnée, et un simple port VGA bien robuste, on a ici sans doute trop de gadgets sur la même ligne d'IRQ. Au hasard : l'IRQ 16 c'est 6 ports USB3 et USB2, un port Displayport et un port HDMI.
Alors bien sûr, il y a vraisemblablement un bug du côté de FreeBSD, mais avec une carte mère Tyan ou Supermicro, je n'aurais eu aucun problème. Résultat, le port HDMI est totalement inutilisable, sauf en cas d'urgence. C'est une machine headless, donc j'en fais le deuil, mais c'est pénible.
Autre souci, et non des moindres : la carte Wifi vendue avec le Relia est une Intel Centrino Advanced-N 6230. Elle fonctionne, mais elle ne permet pas la création d'un point d'accès (mode hostap). C'est uniquement dans ce but que j'ai besoin d'une carte Wifi. Je dois donc trouver une alternative compatible hostap, qui rentre dans un slot half-mini PCIe. Je cherche encore...
Et pour finir, les câbles SATA fournis par Aleutia ont des prises droites, qui s'avèrent problématiques pour brancher les disques sur les ports SATA III. Heureusement, j'avais des câbles coudés en réserve.

Comme évoqué plus haut, Aleutia a bien assuré la phase d'avant-vente. Par contre, l'après-vente est très décevante. J'ai contacté une fois le support technique : pas de réponse. J'ai donné des info à l'ingénieur avant vente au sujet de la compatibilité de la carte Wifi avec le mode hostap : pas de réponse. En gros, j'ai le sentiment qu'une fois qu'on a payé, on n'est plus trop intéressant pour eux.

Un peu de positif tout de même : j'ai pu installer assez facilement FreeBSD 9.1 sur le SSD, intégralement sur ZFS (même le swap). Les deux disques durs, dont on me reprochera le fait qu'ils ne sont pas très fiables, servent à stocker mes sauvegardes d'autres machines. J'ai donc volontairement choisi un bon rapport volume/prix, la fiabilité est un peu secondaire. Ils sont aussi formatés en ZFS avec un alignement de 4K.
Je constate néanmoins des soucis de performance. Si j'envoie un gros volume de données entre un des disques et le SSD, la charge CPU monte facilement à 15-20, et je peux perdre la main jusqu'à la fin du transfert. La compression gzip que j'utilise sur de nombreux volumes est sans doute en cause.

Le bilan est plus que mitigé. Le Relia remplace un PC bi xeon LV dual core (Sossaman 32 bits), carte mère Tyan, avec 3 ports ethernet et une carte Wifi en access point. Certes il prenait de la place, et était ventilé, mais il avait aussi un port série pour le branchement de ma sonde de température et mon lecteur de iButton. Finalement, je gagne ZFS sur du 64bits, et je multiplie ma RAM par 8, je supprime une tour du salon, mais pour le reste je ne me sens pas gagnant !

30 mars 2013

Track mpm-itk problems with truss

Some background

I've some security needs on a shared hosting web server at work and I've ended up installing Apache-mpm-itk in place of my old vanilla Apache server. MPM-ITK is a piece of software (a set of patches in fact) you apply onto Apache source code to change it's natural behavior.
Out of the box, Apache spawns a handful of children httpd belonging to user www:www or whatever your config is using. Only the parent httpd belongs to root.
Hence, every single httpd must be able to read (and sometimes write) web site files. Now imagine you exploit a vulnerability into a php CMS, successfully injecting a php shell. Now through this php shell, you are www on the server, you can do everything www can, and it's bad, because you can even hack the other web sites of the server that have no known vulnerability.
With MPM-ITK, Apache spawns a handfull of master processes running as root, and accordingly to your config files, each httpd serving a particular virtual host or directory will switch from root to a user:group of your choice. So, one httpd process currently serving files from web site "foo" cannot access file from web site "bar": an attacker exploiting a vulnerability of one particular web site won't be able to hack every other web site on the server.

More background

That's a world you could dream of. In real world, that's not so simple. In particular, you'll start having troubles as soon as you make use of fancy features, especially when you fail to provide a dedicated virtual host per user.
On the shared server we host about 35 vhosts for 250 web sites, and we can't afford to provide every user with his dedicated vhost. The result is a given virtual host with a default value for the fallback user:group (say www:www), and each web site configured via Directory to use a different dedicated user.

When a client GET a resource (web page, img, css...) it generally keeps the connection opened with the httpd process. And it can happen that a resource linked from a web page sits into another directory, belonging to another user. The httpd process has already switched from root to user1 to serve the web page, it can't switch to user2 to serve the linked image from user2's directory. So Apache drops the connection, spawns a new httpd process, switches to user2, and serves the requested resource.
When it happens, you can read things like this into your Apache error log:

[warn] (itkmpm: pid=38130 uid=1002, gid=80) itk_post_perdir_config(): 
initgroups(www, 80): Operation not permitted
[warn] Couldn't set uid/gid/priority, closing connection.

That's perfectly "legal" behavior, don't be afraid, unless you read hundreds of new warning every minute.
If you host various web sites, belonging to various users, into the same vhost, you're likely to see many of these triggered by the /favicon.ico request.

Where it just breaks

When things are getting ugly is the moment a user tries to use one of your available mod_auth* variant to add some user authentication (think .htaccess). Remember, I host many web sites in a single vhost, each one into its own directory with its own user:group.

Suddenly every single visitor trying to access the protected directory or subdirectory is disconnected. Their http client reports something like this:

the server unexpectedly dropped the connection...

and nothing else is available. The error, server-side, is the same initgroups error as above, and it does not help at all. How would you solve this? truss is your friend.

Where I fix it

One thing I love about FreeBSD is the availability of many powerful tools out of the box. When I need to track down a strange software behavior, I feel very comfortable on FreeBSD (it doesn't mean I'm skilled). truss is one of my favorites, it's simple, straightforward and powerful.
What you need to use truss is the PID of your target. With Apache + MPM-ITK, processes won't stay around very long, and you can't tell which one you will connect to in advance. So the first step is to buy yourself some precious seconds so that you can get the PID of your target before the httpd process dies. Remember, it dies as soon as the .htaccess file is parsed. Being in production, I could not just kill everything and play alone with the server, so I choose another way. I've created a php script that would run for few seconds before ending. Server side, I've prepared a shell command that would install the .htaccess file I need to test, and start truss while grabbing the PID of my target. On FreeBSD, something like this should do the trick:

cd /path/to/user1/web/site
mv .htaccess_inactive .htaccess && truss -p $(ps auxw|awk '/^user1/ {print $2}')

First http GET request, the .htaccess file is not present, an httpd process switches from root to user1, starts serving the php script. I launch my command server-side: it puts .htaccess in place, gets the PID of my httpd process, and starts truss.
The php script ends and returns its result, client-side I refresh immediately (second GET request), so that I stay on the same httpd process. My client is disconnected as soon as the httpd process has parsed the .htaccess file. At this point, truss should already be dead. I've the complete trace of the event. The best is to read the trace backward from the point where httpd process issue an error about changing UID or GID:

01: setgroups(0x3,0x80a8ff000,0x14,0x3,0x566bc0,0x32008) 
    ERR#1 'Operation not permitted'
02: getgid()					 = 80 (0x50)
03: getuid()					 = 8872 (0x22a8)
04: getpid()					 = 52942 (0xcece)
05: gettimeofday({1364591872.453335 },0x0)		 = 0 (0x0)
06: write(2,"[Fri Mar 29 22:17:52 2013] [warn"...,142) = 142 (0x8e)
07: gettimeofday({1364591872.453583 },0x0)		 = 0 (0x0)
08: write(2,"[Fri Mar 29 22:17:52 2013] [warn"...,85) = 85 (0x55)
09: gettimeofday({1364591872.453814 },0x0)		 = 0 (0x0)
10: shutdown(51,SHUT_WR)				 = 0 (0x0)

Line 01 is the one I'm looking for: the httpd process is trying to change groups and fails, line 02 to 05 it's gathering data for the log entry, line 06 it's writing the error to the log file. 07 & 08: same deal for the second line of log.

From that point in time, moving up shows that it tried to access an out-of-directory resource, and that resource is an html error page! Of course, it makes sense, and it's an hard slap on the head (RTFM!).

01: stat("/user/user1/public_html/bench.php",{ 
    mode=-rw-r--r-- ,inode=4121,size=7427,blksize=7680 }) = 0 (0x0)
02: open("/user/user1/public_html/.htaccess",0x100000,00) = 53 (0x35)
03: fstat(53,{ mode=-rw-r--r-- ,inode=4225,size=128,blksize=4096 }) = 0 (0x0)
04: read(53,"AuthType Basic\nAuthName "Admin "...,4096) = 128 (0x80)
05: read(53,0x80a8efd88,4096)			 = 0 (0x0)
06: close(53)					 = 0 (0x0)
07: open("/user/user1/public_html/bench.php/.htaccess",0x100000,00) 
    ERR#20 'Not a directory'
08: getuid()					 = 8872 (0x22a8)
09: getgid()					 = 80 (0x50)
10: stat("/usr/local/www/apache22/error/HTTP_UNAUTHORIZED.html.var",{ 
    mode=-rw-r--r-- ,inode=454787,size=13557,blksize=16384 }) = 0 (0x0)
11: lstat("/usr/local/www/apache22/error/HTTP_UNAUTHORIZED.html.var",{ 
    mode=-rw-r--r-- ,inode=454787,size=13557,blksize=16384 }) = 0 (0x0)
12: getuid()					 = 8872 (0x22a8)
13: setgid(0x50,0x805d43d94,0x64,0x800644767,0x101010101010101,0x808080808080
    8080) = 0 (0x0)

line 13 shows the beginning of setgid process, and 10/11 shows the culprit. Up from here is the regular processing of the .htaccess file.

RTFM

When you use mod_auth* to present visitors with authentication, the server issues an error, and most of the time, this error is sent to the client with a dedicated header, and a dedicated html document (think "404"). When the error is about authentication (error 401), most clients hide the html part, and present the user with an authentication popup.
But the html part is almost always a physical file somewhere in the server directory tree. And it's this particular file the httpd process was trying to reach, issuing an initgroups command, and dying for not being allowed to switch users.
I've found in my Apache config the definition of ErrorDocument:

ErrorDocument 400 /error/HTTP_BAD_REQUEST.html.var
ErrorDocument 401 /error/HTTP_UNAUTHORIZED.html.var
ErrorDocument 403 /error/HTTP_FORBIDDEN.html.var
...

and replaced them all by a file-less equivalent, so Apache won't have any error file to read and will just send a plain ASCII error body (it saves bandwidth too):

ErrorDocument 400 "400 HTTP_BAD_REQUEST"
ErrorDocument 401 "401 HTTP_UNAUTHORIZED"
ErrorDocument 403 "403 HTTP_FORBIDDEN"
...

I've restarted Apache, and authentication from mod_auth* started to work as usual.
Same approach applies to almost any mpm-itk problem when it's related to a connection loss with Couldn't set uid/gid/priority, closing connection error log. You locate the resource that makes your server fail, and you find a way to fix the issue.