Last weekend I gave a presentation at FOSDEM 2012 on some optimizations we made in Zarafa to increase performance. You can find the slides here, but this is the general idea:
Groupware applications are almost always I/O-bound; most highly-loaded servers are doing I/O most of the time, leaving all 8 of your CPU cores idle, waiting for data to process. This is true for Zarafa as well. In fact, load simulations on SSD still show the bottleneck to be at the I/O layer, even though this doesn't happen until you have thousands of users pounding a server with requests and sending and receiving over 50 emails per second.
To use the hardware most effectively, we need to balance the I/O load and the CPU load more evenly. Since 99% of all the I/O load on Zarafa is generated by MySQL's InnoDB storage engine, the simplest way to reduce I/O load is to add more RAM. The reason this works well is that MySQL is very good at deferring I/O until later, or even merging multiple I/Os together, when it has lots of memory to work with.
So we set out to make use of the available RAM in a server in the best possible way. This means that InnoDB's main cache, the buffer pool, should be catching as much I/O as possible before it actually goes to disk. You can look at this another way: if we look at the contents of our buffer pool, then all the data in there that we have never read or never written to is a waste of space. Ideally, every page in the buffer pool is filled to the rim with data, and we have read or written to each and every one of the records. So we set out to make it so.
Well, not really. In practice it is of course impossible to have *no* unread records in the buffer pool. But it *is* possible to make it as efficient as possible. So that's what we set out to do.
A useful tool for all this is the InnoDB buffer pool inspection statistics available in MariaDB and Percona Server. They allow you to see the number of pages that are in the buffer pool for each table, whether they've been modified, and which index they belong to. We can't actually see how many records we read from them, so we have to do some guessing from here on: the way we did it in practice was to run a load simulation and then stare at the buffer pool statistics until we saw something out of the ordinary.
One nice example of this is a table that we never really paid much attention to: Zarafa has a relatively small, so-called id-to-GUID mapping table, called 'indexedproperties'. This table basically assigns a large random ID to an object in the database, which is used by clients to identify objects like an email or folder. What was interesting was that although the total size of this table was only about 1% of total data size, we were seeing more than 20% of the buffer pool being used by this single table. So what was up there?
Well, GUIDs are evil like that. Because they are so random, any lookup of a GUID in our table was doing a random access into that table to get the record. Since this happens each time you open an email or folder, this is a pretty frequent operation. This means that getting the internal database ID of a single 16-byte GUID was actually pulling a whole page into the InnoDB buffer pool. To give an idea of how bad this is: to get 10000 GUIDs which are almost guaranteed to be on different pages, we have to read 10000 pages, and a page is 16KB. That's 160MB of your precious RAM, almost all wasted, since the data we got was only 160KB.
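The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and the `lookups_per_page` parameter are made up for illustration, and it assumes the default 16KB InnoDB page size and counts only the 16 GUID bytes per record as useful data.

```python
PAGE_SIZE = 16 * 1024  # bytes; the default InnoDB page size
GUID_SIZE = 16         # bytes; one GUID

def buffer_pool_cost(lookups, lookups_per_page=1):
    """Return (bytes pulled into the buffer pool, bytes of GUID data wanted).

    lookups_per_page is a hypothetical knob: how many of the requested
    records happen to share a page. 1 is the fully random worst case.
    """
    pages = -(-lookups // lookups_per_page)  # ceiling division
    return pages * PAGE_SIZE, lookups * GUID_SIZE

cached, useful = buffer_pool_cost(10_000)
print(f"cached: {cached / 1e6:.0f} MB, useful: {useful / 1e3:.0f} KB")
print(f"cached/useful ratio: {cached // useful}x")
```

With one lookup per page, roughly a thousand times more memory is occupied than was actually asked for. Calling `buffer_pool_cost(10_000, lookups_per_page=10)` shows what happens when ten of the wanted records share a page.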
The key to fixing this is to order the data in such a way that when we're reading 10000 records, they are more likely to be on the same 16KB page. For GUIDs, this can be done by using a single fixed prefix and a counter that just increments the last part of the GUID. This makes the GUIDs at least 'group together' based on age, since the records are stored in sorted order in the database and the counter is always increasing. Because the counter is at the end of the fixed GUID, each new object will be written directly after the previous object. Since you are quite likely to be reading GUIDs that were written around the same time (e.g. emails from the last week), we'd probably only need to read one page for every 10 records or so. This simple change makes our InnoDB buffer pool 10x more effective for this table.
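To make the effect concrete, here is a small simulation in Python. It is an idealized model, not Zarafa's actual code: the 8-byte prefix, the 8-byte big-endian counter, the 100 records per page and all function names are assumptions for illustration. Records are laid out on pages in sorted key order, and we count how many distinct pages hold the most recently created GUIDs.

```python
import random

RECORDS_PER_PAGE = 100  # assumed; the real number depends on the row size

def random_guids(n, seed=0):
    """n fully random 16-byte GUIDs (seeded so the run is repeatable)."""
    rng = random.Random(seed)
    return [rng.getrandbits(128).to_bytes(16, "big") for _ in range(n)]

def sequential_guids(n, prefix=b"\x5a" * 8):
    """Fixed 8-byte prefix plus an 8-byte counter in the last part.

    The counter is big-endian so that byte order matches creation order.
    """
    return [prefix + i.to_bytes(8, "big") for i in range(n)]

def pages_touched(all_guids, lookups):
    """Count the distinct pages that must be read to find all lookups.

    Model: records are stored sorted by GUID, RECORDS_PER_PAGE per page.
    """
    ordered = sorted(all_guids)
    page_of = {g: pos // RECORDS_PER_PAGE for pos, g in enumerate(ordered)}
    return len({page_of[g] for g in lookups})

n = 100_000
rand, seq = random_guids(n), sequential_guids(n)

# "Recent" objects are the last 1000 created in each scheme.
print("random GUIDs:    ", pages_touched(rand, rand[-1000:]), "pages")
print("sequential GUIDs:", pages_touched(seq, seq[-1000:]), "pages")
```

In this model the sequential scheme packs the 1000 newest records into 10 full pages, which is better than a real table would manage; the estimate of one page per 10 records above is the more realistic figure, since real reads are never perfectly contiguous. The random scheme scatters the same 1000 records over hundreds of pages.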
We did a lot more of these kinds of optimizations, doing guesstimations and looking at the buffer pool, boosting performance simply because we can fit more in the cache with the same amount of RAM.
This is just one small part of the I/O optimizations we have done in Zarafa 7.0 but it has been an important one. Together with other optimizations that reduce the average number of IOPS needed for a network request, Zarafa 7.0 has been able to reduce I/O load by around a factor of two depending on the workload.
Of course, one day I should write a software solution for InnoDB so we can get the per-page 'records read' statistics directly and use those...