This is a pretty long story, so buckle in...
One of the reporting setups I'm in charge of here at work involves a trio of machines; let's call them:
WDX is my local box, where my development happens;
WSX hosts a reasonably high-usage (by internal standards) MySQL database with some pertinent statistics;
LXA is a Linux box which, among other things, hosts the Apache/MySQL setup for the whole gig.
The report itself consists of a bare-bones HTML framework, with AJAX posting data into the skeleton. Two separate scripts provide different datasets, one refreshing once per minute and the other every ten seconds. In development, this setup worked flawlessly.
WDX connected to WSX via mysql_connect(...) and dumped the data, returning in ~250ms; the display was fresh and there was much joy. This success in hand, we pushed "live" (internally) to LXA and were confronted by an immediate 20s latency on server returns. No biggie — it ruins the point of 10-second refreshes, but it's still bearable, and could easily be attributed to other loads on the server.
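For the curious, each data script is little more than a thin PHP wrapper around a single query. Here's a minimal sketch of the idea (the credentials, database, and table names are all made up):

    <?php
    // stats.php: hypothetical sketch of one of the AJAX data scripts.
    // Connect, run one query, emit XML for the report skeleton.
    $link = mysql_connect('WSX', 'read', 'secret')
        or die('Could not connect: ' . mysql_error());
    mysql_select_db('reporting', $link);

    $result = mysql_query('SELECT label, value FROM stats', $link);

    header('Content-Type: text/xml');
    echo "<?xml version=\"1.0\"?>\n<stats>\n";
    while ($row = mysql_fetch_assoc($result)) {
        printf("  <stat label=\"%s\">%s</stat>\n",
               htmlspecialchars($row['label']),
               htmlspecialchars($row['value']));
    }
    echo "</stats>\n";

    mysql_free_result($result);
    mysql_close($link);
    ?>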
Time goes by, summer turns to autumn turns to winter, and we forget about the issues. Two weeks ago, I receive a panicked e-mail: “[reporting system] is broken! Nothing's showing up!” It's nothing mission-critical, so I'm not terribly concerned, but people seem to enjoy the numbers, so I decide to be kind and look into it. I fire up the reporting tool and — Firebug is a beautiful thing — discover that my little 20s latencies have grown to 60s, and valid XML has become empty server timeouts.
mysqld-nt is showing a consistent 99% CPU, so we kill it and restart. We heave a quick sigh of relief, until a few hours later we're back in troubletown. Rinse and repeat a few times, and it's apparent we've got an issue. Using Win32::Process::Info, we whip up a quick Perl script to diagnose the problem; in the process, we discover that usage jumps around lunchtime and eventually subsides from 99% after everyone has gone home. Coincidence, much?
Armed with this knowledge, we send out the obvious e-mail: “Please refrain from using [reporting system] unless absolutely necessary.” Lo and behold, everything is fine. For a day. I should state that I attack MySQL from the PHP end, not the other way around; that's why we're still trying to fix the webserver end at this point. We swap out all of the mysql_* calls in the PHP for their MySQLi equivalents, and notice a 25% (give or take) improvement. Sadly, not enough of an improvement to actually work.
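The conversion is mechanical; here's a hedged before-and-after sketch (connection details made up) of what changed:

    <?php
    // Before: the old mysql_* API, as used in development.
    $link   = mysql_connect('WSX', 'read', 'secret');
    mysql_select_db('reporting', $link);
    $result = mysql_query('SELECT label, value FROM stats', $link);

    // After: the MySQLi equivalents. Note that the argument order flips
    // (the link comes first) and the database is chosen at connect time.
    $link   = mysqli_connect('WSX', 'read', 'secret', 'reporting');
    $result = mysqli_query($link, 'SELECT label, value FROM stats');
    ?>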
We finally begin traversing the MySQL documentation, and hit upon the (obvious if you know it) command show processlist, which, unsurprisingly, lists active database connections and their states. Here's a capture of the results during this period:
+----+----------------------+----------------+---------+------------------+
| Id | User                 | Host           | Command | Info             |
+----+----------------------+----------------+---------+------------------+
|  2 | write                | localhost:2085 | Query   | show processlist |
| 13 | unauthenticated user | LXA:41333      | Connect | NULL             |
| 14 | unauthenticated user | LXA:41334      | Connect | NULL             |
| 15 | unauthenticated user | LXA:41337      | Connect | NULL             |
| 16 | unauthenticated user | LXA:41339      | Connect | NULL             |
| 17 | unauthenticated user | LXA:41341      | Connect | NULL             |
| 18 | unauthenticated user | LXA:41342      | Connect | NULL             |
| 21 | unauthenticated user | LXA:41344      | Connect | NULL             |
| 22 | unauthenticated user | LXA:41349      | Connect | NULL             |
| 23 | unauthenticated user | LXA:41350      | Connect | NULL             |
| 24 | unauthenticated user | LXA:41354      | Connect | NULL             |
| 25 | unauthenticated user | LXA:41355      | Connect | NULL             |
| 27 | unauthenticated user | LXA:41356      | Connect | NULL             |
| 28 | unauthenticated user | LXA:41357      | Connect | NULL             |
| 29 | unauthenticated user | LXA:41358      | Connect | NULL             |
| 30 | unauthenticated user | LXA:41359      | Connect | NULL             |
| 31 | unauthenticated user | LXA:41360      | Connect | NULL             |
| 32 | unauthenticated user | LXA:41361      | Connect | NULL             |
+----+----------------------+----------------+---------+------------------+
18 rows in set (0.00 sec)
(edit: scrubbed a little bit to remove unneeded information) What? 17 connections stuck in the Connect state? We test the WSX link: ping is <1ms. Hm. We find that all queries are executing near-instantly, as expected. Is the problem only that the connections won't take? Some quick research into this revealed no known solution, but I expect to be proven wrong at some point (help!).
After fumbling for a bit, we found the mysql_pconnect(...) function; it's not hidden, by any means, but it's not really advertised. What mysql_pconnect(...) offers is a persistent database connection, which is hand-wavingly claimed to be unique by host, username, and password. If you're not familiar with persistent database connections, here's a primer: every time you connect to a database, a certain overhead is incurred. If the database is on the same machine, or even on a small network, that overhead is generally minimal and not much of a worry. If, however, the connection overhead is high, there is a possible advantage to never throwing out old connections: allowing them to persist (hence, “persistent”). There are definite risks to this scheme, and PHP actually recommends against relying on it in most cases.
Their section on Persistent Database Connections, meanwhile, has this little gem of direction: persistent connections are good when the overhead of creating a link to your SQL server is high.
That said, our specific situation — high connection time with no other symptoms, and an otherwise little-used database server — seemed ripe for this solution. The implementation is identical to that of the original PHP MySQL API, so we rolled back the MySQLi changes (yes, it's incompatible; no, that makes no sense) and ran %s/mysql_connect/mysql_pconnect/g in vi. We quickly pushed the code live, as it couldn't get much worse, and latency dropped from 90s to 0.2s. WOW.
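The drop-in compatibility is the whole appeal. A minimal sketch (credentials made up, as before):

    <?php
    // mysql_pconnect() takes the same arguments as mysql_connect(), but
    // instead of tearing the link down when the script ends, PHP parks it
    // and hands it to a later request with the same host/user/password.
    $link = mysql_pconnect('WSX', 'read', 'secret')
        or die('Could not connect: ' . mysql_error());
    mysql_select_db('reporting', $link);
    // ...queries exactly as before. Note that mysql_close() will not
    // close a persistent link; it simply outlives the request.
    ?>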
Excited, we ran show processlist; on WSX:
+-------+-------+----------------+---------+------------------+
| Id    | User  | Host           | Command | Info             |
+-------+-------+----------------+---------+------------------+
|     7 | write | localhost:2453 | Sleep   | NULL             |
| 97744 | write | localhost:2954 | Query   | show processlist |
| 98986 | read  | LXA:49417      | Sleep   | NULL             |
| 99000 | read  | LXA:51757      | Sleep   | NULL             |
| 99002 | read  | LXA:51770      | Sleep   | NULL             |
| 99034 | read  | LXA:36885      | Sleep   | NULL             |
| 99040 | read  | LXA:36921      | Sleep   | NULL             |
| 99042 | read  | LXA:36924      | Sleep   | NULL             |
| 99080 | read  | LXA:59473      | Sleep   | NULL             |
| 99094 | read  | LXA:46164      | Sleep   | NULL             |
| 99102 | read  | LXA:46306      | Sleep   | NULL             |
| 99124 | read  | LXA:43139      | Sleep   | NULL             |
| 99196 | read  | LXA:59307      | Sleep   | NULL             |
| 99248 | read  | LXA:48019      | Sleep   | NULL             |
| 99427 | read  | LXA:48323      | Sleep   | NULL             |
| 99433 | read  | LXA:48387      | Sleep   | NULL             |
+-------+-------+----------------+---------+------------------+
16 rows in set (0.00 sec)
Wait, what? PHP says the connection should be reused if it's the same host (check), user (check), and password (check). After some discussion, we add some IP logging to the database records (to avoid the nasty Apache logfile, which we don't have access to anyway), and discover 7 simultaneous viewers of the report. Aha! 7 viewers times 2 AJAX scripts per viewer == 14 connections! With the scripts refreshing every 10-60s, and connections taking >60s to create, a whole new persistent connection must have been created for each view (as there were none in the cache). We place the XML on standby, restart the MySQL server, and run a one-time script to create a single database link to WSX. One connection in place, we bring the XML back and wait with bated breath...
Immediately there are 13 new connections. What? This one took some time, and I'm still not 100% on it: PHP reuses a persistent link only if it can find one that isn't already in use. Because these scripts execute so close together, the existing links are always occupied by other viewers, so each new request spawns its own; there will be as many database links as simultaneous viewers! (It probably doesn't help that persistent links are cached per Apache child process, so requests served by different children can't share a link at all.)
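Worth knowing: if the per-viewer link count ever became a problem in itself, the old mysql extension can be throttled from php.ini. These directives are real; the limits below are arbitrary:

    ; php.ini: cap the mysql extension's link usage (per process)
    mysql.allow_persistent = On   ; permit mysql_pconnect() at all
    mysql.max_persistent   = 4    ; max persistent links; -1 means no limit
    mysql.max_links        = 8    ; max links overall, persistent included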
In our case, this doesn't matter. With a small number of viewers, it's no problem — and if it ever becomes one, we can work out a caching scheme (sketched below). Using mysql_pconnect(...) for database link creation happened to solve all of our symptoms, but we have still failed to address any of the root issues.
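For what it's worth, the caching scheme could be as simple as parking the generated XML in a file and serving that until it goes stale, so only about one request per interval ever touches MySQL. A sketch, with the cache path and TTL made up:

    <?php
    // cached_stats.php: hypothetical sketch of the caching scheme above.
    // Every viewer sees the same numbers, so share one result per interval.
    $cache = '/tmp/stats.xml';  // made-up cache location
    $ttl   = 10;                // seconds; match the script's refresh rate

    header('Content-Type: text/xml');
    if (file_exists($cache) && time() - filemtime($cache) < $ttl) {
        readfile($cache);       // still fresh: no database work at all
        exit;
    }

    $link = mysql_pconnect('WSX', 'read', 'secret');  // made-up credentials
    mysql_select_db('reporting', $link);
    $result = mysql_query('SELECT label, value FROM stats', $link);

    $xml = "<?xml version=\"1.0\"?>\n<stats>\n";
    while ($row = mysql_fetch_assoc($result)) {
        $xml .= sprintf("  <stat label=\"%s\">%s</stat>\n",
                        htmlspecialchars($row['label']),
                        htmlspecialchars($row['value']));
    }
    $xml .= "</stats>\n";

    file_put_contents($cache, $xml);  // refresh the cache for the next viewer
    echo $xml;
    ?>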
For now, we're happy and in green pastures. But still, we're concerned (not enough to put an actual DBA on it, though; if time is money, DBA time is money and a half). Any thoughts on the root issue?