RT @jvoskuhl: @zarafagroupware Today i received my #Zarafa Certified Engineer certificate... Happy :-) http://t.co/ozMCGuCvCA
Being able to search quickly in your email is really important for users these days. To do this, you either have to have a very small amount of data, or you need an index over your data. This sounds simple since there are a lot of indexing technologies already out there, but quite some care is needed to make sure that you are offering the right kind of performance in the right places.
Back in Zarafa 6.40, we introduced fulltext indexing services in Zarafa's backend. This was aimed at supporting fast fulltext searches on a per-store basis to speed up searching in both our WebAccess and Outlook. We took the logical approach there, and built our indexing services into a separate service, using the CLucene library as an indexing backend, and 'normal' client MAPI calls to the server to get the data to index.
Over the years though, we have seen an opportunity to improve search speeds. Indeed, it is a lot faster than just trolling through all of your data, but it also wasn't giving us millisecond searches all the time. So for Zarafa 7.1, a new implementation was scheduled which has been developed over the past few months.
To take a next step, we first had to find out why CLucene wasn't performing in the expected way all the time. To make a long story short, we are using CLucene in a way that it wasn’t really intended for; This is due to the way that MAPI and Outlook do searches; When you do a search, for say 'foo', normally a CLucene search client would request to find all documents matching 'foo', and then show the 'top 20' of hits in the document.
And indeed, this is as fast as you'd expect it to be with an index. However, in our case, we do not only want the 'top 20' hits, we want *all* matching documents, since sorting of those documents will be done in the MAPI core, and is normally done by date for example. This means that we need to query CLucene for all the matching documents, and we need to retrieve the unique identifier of each of those documents.
Due to the way that this works in CLucene, an uncached search can generate a massive amount of I/O; each hit in your search will generate a single lookup of the unique ID in the CLucene database, causing lots of random I/O for any search. The final effect is that the time it takes to do a search is proportional to the number of hits your search generates, and each hit is (worst case) a single IOP. Your I/O subsystem will probably give you about 200 random IOPS, so that would mean a search with 10000 hits would take a whopping 50 seconds, which is obviously unacceptable. This usually doesn't happen though, due to various caching effects. In practice this has the effect that the 'first' search is rather slow, and subsequent (cached) searches are much, much faster.
So we set out to change things here, and decided that the existing search engines out there were too untransparent or lacking of specific features that we need. This meant that we'd have to implement fulltext indexing ourselves. Since there are entire projects based around making search engines, this sounds rather insane. Luckily we can still reuse various parts of CLucene that are not involved in I/O and we can also use an existing database for storing the data and other libraries for other parts, so all we had to do is choose an appropriate storage system.
Due to the nature of indexing, which basically handles large lists of items (the list of documents containing the word 'foo'), relational database like MySQL go out of the window quite quickly. A quick test of indexing two million documents from the dutch wikipedia dump into an index stored in innodb (via mysql), gave us a 22GB index, while the total data size was only 30GB, over ten times the storage requirements of CLucene. To drop the overhead you need to look at key/value stores. Now there are a lot of those out there, and this is what we need:
You're writing to the index way more often than you're searching it, most people receive more emails on one day than the number of times they search in the same space of time.
We need to access keys ('f', 'fo', 'foo') in order to do prefix searches.
Shouldn't corrupt, but can 'lose' some data as long as it's consistent in a transaction-like manner.
Most of the data is rather repetitive, so compression would be great.
When you find another document containing 'foo', you want to append that to the list.
Since the indexer is already running on a cluster node, and clustering is done on a different level, we want a simple file-based database that needs no configuration or management.
Some form of online defragmentation, Statistics tools, Documented performance characteristics
.. and of course the normal requirements like license, documentation, project activity, etc.
This led us to look at various databases including MySQL, embedded InnoDB, bdb, MongoDB, levelDB, Tokyo Cabinet and Kyoto Cabinet. None of the databases ticked ALL the boxes, and finally we went with Kyoto Cabinet which ticks all the boxes EXCEPT append support. The nice things about Kyoto cabinet are: single-file databases, documented crashproofness, page-level compression, and benchmarks showed that performance was at par with theory; benchmarks showed steady performance all the way through to huge indexes.
We then still had the non-support of 'append' to address. Although Kyoto cabinet, like many other databases, offers an 'append' call, internally it is just doing 'read/append/write', which is rather bad if you consider that you will often be doing 'read 2MB, decompress, append 16 bytes, compress, write 2MB'. So we had to make a tradeoff there and implemented a side-by-side write ourselves, which means we are never appending, but always writing a new value. This has a small tradeoff on the read side of things, but it's cheap enough.
Exporting our 50 million records to our new Kyoto cabinet based index was now a staggering 890 MB (down from 22GB in MySQL/InnoDB, and 2.2GB with CLucene). Search speeds are usually under 100ms for any amount of hits (even matching almost all of the 2 million documents), and indexing speed is just as high as in CLucene. So the final mix was Kyoto Cabinet for storage, CLucene for text analysis, and ICU for text normalization.
We also added a nifty tweaking setting for the advanced user: a 'slider' that allows you to optimize more towards better write performance or better read performance. Indexes are normally built by taking a search term as key, and putting all the documents containing that term in the value. In our case the slider allows you to put only a small prefix of the search term in the key, improving write performance since you have less keys, which in turn improves write-cache performance, but making read performance somewhat worse. The reason for doing this is that on large I/O-bound workloads (e.g. a hosted server), it is interesting from an administrator's point of view to minimize total I/O on the system. Since you are normally writing more to the index than you're reading, it would be better for such a system to optimize writing over reading. For other systems which are lightly loaded, the slider should be set to 'read preference' since a little more I/O will not hurt, and that will give your users the best search speeds.
Some other new features of the indexer in 7.1 will be: