Track mpm-itk problems with truss

Some background

I've some security needs on a shared hosting web server at work and I've ended up installing Apache-mpm-itk in place of my old vanilla Apache server. MPM-ITK is a piece of software (a set of patches in fact) you apply onto Apache source code to change it's natural behavior.
Out of the box, Apache spawns a handful of children httpd belonging to user www:www or whatever your config is using. Only the parent httpd belongs to root.
Hence, every single httpd must be able to read (and sometimes write) web site files. Now imagine you exploit a vulnerability into a php CMS, successfully injecting a php shell. Now through this php shell, you are www on the server, you can do everything www can, and it's bad, because you can even hack the other web sites of the server that have no known vulnerability.
With MPM-ITK, Apache spawns a handfull of master processes running as root, and accordingly to your config files, each httpd serving a particular virtual host or directory will switch from root to a user:group of your choice. So, one httpd process currently serving files from web site "foo" cannot access file from web site "bar": an attacker exploiting a vulnerability of one particular web site won't be able to hack every other web site on the server.

More background

That's a world you could dream of. In real world, that's not so simple. In particular, you'll start having troubles as soon as you make use of fancy features, especially when you fail to provide a dedicated virtual host per user.
On the shared server we host about 35 vhosts for 250 web sites, and we can't afford to provide every user with his dedicated vhost. The result is a given virtual host with a default value for the fallback user:group (say www:www), and each web site configured via Directory to use a different dedicated user.

When a client GET a resource (web page, img, css...) it generally keeps the connection opened with the httpd process. And it can happen that a resource linked from a web page sits into another directory, belonging to another user. The httpd process has already switched from root to user1 to serve the web page, it can't switch to user2 to serve the linked image from user2's directory. So Apache drops the connection, spawns a new httpd process, switches to user2, and serves the requested resource.
When it happens, you can read things like this into your Apache error log:

[warn] (itkmpm: pid=38130 uid=1002, gid=80) itk_post_perdir_config(): 
initgroups(www, 80): Operation not permitted
[warn] Couldn't set uid/gid/priority, closing connection.

That's perfectly "legal" behavior, don't be afraid, unless you read hundreds of new warning every minute.
If you host various web sites, belonging to various users, into the same vhost, you're likely to see many of these triggered by the /favicon.ico request.

Where it just breaks

When things are getting ugly is the moment a user tries to use one of your available mod_auth* variant to add some user authentication (think .htaccess). Remember, I host many web sites in a single vhost, each one into its own directory with its own user:group.

Suddenly every single visitor trying to access the protected directory or subdirectory is disconnected. Their http client reports something like this:

the server unexpectedly dropped the connection...

and nothing else is available. The error, server-side, is the same initgroups error as above, and it does not help at all. How would you solve this? truss is your friend.

Where I fix it

One thing I love about FreeBSD is the availability of many powerful tools out of the box. When I need to track down a strange software behavior, I feel very comfortable on FreeBSD (it doesn't mean I'm skilled). truss is one of my favorites, it's simple, straightforward and powerful.
What you need to use truss is the PID of your target. With Apache + MPM-ITK, processes won't stay around very long, and you can't tell which one you will connect to in advance. So the first step is to buy yourself some precious seconds so that you can get the PID of your target before the httpd process dies. Remember, it dies as soon as the .htaccess file is parsed. Being in production, I could not just kill everything and play alone with the server, so I choose another way. I've created a php script that would run for few seconds before ending. Server side, I've prepared a shell command that would install the .htaccess file I need to test, and start truss while grabbing the PID of my target. On FreeBSD, something like this should do the trick:

cd /path/to/user1/web/site
mv .htaccess_inactive .htaccess && truss -p $(ps auxw|awk '/^user1/ {print $2}')

First http GET request, the .htaccess file is not present, an httpd process switches from root to user1, starts serving the php script. I launch my command server-side: it puts .htaccess in place, gets the PID of my httpd process, and starts truss.
The php script ends and returns its result, client-side I refresh immediately (second GET request), so that I stay on the same httpd process. My client is disconnected as soon as the httpd process has parsed the .htaccess file. At this point, truss should already be dead. I've the complete trace of the event. The best is to read the trace backward from the point where httpd process issue an error about changing UID or GID:

01: setgroups(0x3,0x80a8ff000,0x14,0x3,0x566bc0,0x32008) 
    ERR#1 'Operation not permitted'
02: getgid()					 = 80 (0x50)
03: getuid()					 = 8872 (0x22a8)
04: getpid()					 = 52942 (0xcece)
05: gettimeofday({1364591872.453335 },0x0)		 = 0 (0x0)
06: write(2,"[Fri Mar 29 22:17:52 2013] [warn"...,142) = 142 (0x8e)
07: gettimeofday({1364591872.453583 },0x0)		 = 0 (0x0)
08: write(2,"[Fri Mar 29 22:17:52 2013] [warn"...,85) = 85 (0x55)
09: gettimeofday({1364591872.453814 },0x0)		 = 0 (0x0)
10: shutdown(51,SHUT_WR)				 = 0 (0x0)

Line 01 is the one I'm looking for: the httpd process is trying to change groups and fails, line 02 to 05 it's gathering data for the log entry, line 06 it's writing the error to the log file. 07 & 08: same deal for the second line of log.

From that point in time, moving up shows that it tried to access an out-of-directory resource, and that resource is an html error page! Of course, it makes sense, and it's an hard slap on the head (RTFM!).

01: stat("/user/user1/public_html/bench.php",{ 
    mode=-rw-r--r-- ,inode=4121,size=7427,blksize=7680 }) = 0 (0x0)
02: open("/user/user1/public_html/.htaccess",0x100000,00) = 53 (0x35)
03: fstat(53,{ mode=-rw-r--r-- ,inode=4225,size=128,blksize=4096 }) = 0 (0x0)
04: read(53,"AuthType Basic\nAuthName "Admin "...,4096) = 128 (0x80)
05: read(53,0x80a8efd88,4096)			 = 0 (0x0)
06: close(53)					 = 0 (0x0)
07: open("/user/user1/public_html/bench.php/.htaccess",0x100000,00) 
    ERR#20 'Not a directory'
08: getuid()					 = 8872 (0x22a8)
09: getgid()					 = 80 (0x50)
10: stat("/usr/local/www/apache22/error/HTTP_UNAUTHORIZED.html.var",{ 
    mode=-rw-r--r-- ,inode=454787,size=13557,blksize=16384 }) = 0 (0x0)
11: lstat("/usr/local/www/apache22/error/HTTP_UNAUTHORIZED.html.var",{ 
    mode=-rw-r--r-- ,inode=454787,size=13557,blksize=16384 }) = 0 (0x0)
12: getuid()					 = 8872 (0x22a8)
13: setgid(0x50,0x805d43d94,0x64,0x800644767,0x101010101010101,0x808080808080
    8080) = 0 (0x0)

line 13 shows the beginning of setgid process, and 10/11 shows the culprit. Up from here is the regular processing of the .htaccess file.

RTFM

When you use mod_auth* to present visitors with authentication, the server issues an error, and most of the time, this error is sent to the client with a dedicated header, and a dedicated html document (think "404"). When the error is about authentication (error 401), most clients hide the html part, and present the user with an authentication popup.
But the html part is almost always a physical file somewhere in the server directory tree. And it's this particular file the httpd process was trying to reach, issuing an initgroups command, and dying for not being allowed to switch users.
I've found in my Apache config the definition of ErrorDocument:

ErrorDocument 400 /error/HTTP_BAD_REQUEST.html.var
ErrorDocument 401 /error/HTTP_UNAUTHORIZED.html.var
ErrorDocument 403 /error/HTTP_FORBIDDEN.html.var
...

and replaced them all by a file-less equivalent, so Apache won't have any error file to read and will just send a plain ASCII error body (it saves bandwidth too):

ErrorDocument 400 "400 HTTP_BAD_REQUEST"
ErrorDocument 401 "401 HTTP_UNAUTHORIZED"
ErrorDocument 403 "403 HTTP_FORBIDDEN"
...

I've restarted Apache, and authentication from mod_auth* started to work as usual.
Same approach applies to almost any mpm-itk problem when it's related to a connection loss with Couldn't set uid/gid/priority, closing connection error log. You locate the resource that makes your server fail, and you find a way to fix the issue.

Cognitive Overhead

Track mpm-itk problems with truss

Laisser un commentaire