Lessons Learned Scaling the Audiogalaxy Search Engine

Photo by flickr user vadaniaTackling a big, new programming challenge frequently means that you don’t actually understand the problem until after you solve it. Because of that, one of the most important things to do after nailing down a solution is to think about better solutions. Post-mortems seem to be more popular from a “how could we have managed this project better” angle, but the technical aspect should not be overlooked.

Sure, constant criticism of everything you do may not make you a happier person, but even if you don’t actually implement your new ideas, reevaluating your designs will make you a stronger architect. Just try to keep the habit from spilling over too much into your personal life. :)

In that spirit, I’ve put together a follow-up to my post about the Audiogalaxy.com MySQL search engine. To keep from getting discouraged about all the improvements I overlooked, I’ve split the list up into 2 categories. The first contains ideas I was correct in not spending time on during the initial development. Although these ideas may have been interesting programming challenges, at the time, they simply wouldn’t have made the company any money. (I know it sucks, but sometimes programmers forget their job is to make money for the company–not to write cool software.) The second category contains things that I probably should have done. They may have been easy to do if I had put them into the initial code base and tested them along with everything else, or they may have been things that would have saved the company money. These are the ones to really think about. If I find any clever ideas or algorithms that would have helped, I like to add them to my toolbox to consider for next time.

Fun things I couldn’t justify doing

Silky Smooth Searching
I’m not sure if anyone is actually using this technique (or what the official name is), but the basic idea is to start caching results for a search before the user actually clicks the search button. I’m sure all of you really savvy internet types know you can hit enter milliseconds after typing a search term, but I know there are plenty of people that don’t. I suspect that with some Ajax magic, you could take advantage of the precious few hundred milliseconds it takes to move the mouse over to the “submit” button after typing a query. During that time, the backend could be busily building a cache of search results or pre-fetching them from disk.

As neat as this would have been, I don’t think I could have ever justified the development time. A smoother search experience may have made users on fast connections slightly happier, but we had more important things to work on (of course, for some companies, shaving milliseconds off response times really is important).

Better relevance
We sorted results by historical popularity, current availability, and a few other tricks. One thing I would have really liked to do is take into account more information about the ordering of search terms. For example, when a user searched for “A B”, results that contained “A B” possibly should have been ahead of results that contained “B A.” This would have been neat, but I really had no incentive to add the extra complexity. Achieving the right weight balance on something like this would have taken a lot of experimentation. Experimentation means downtime, and the potential payoff didn’t look big enough to justify it.

Search power tools
This includes things like supporting quoted phrases to match text exactly, removing results using -, magic words to restrict results to certain bit rates, etc. While programmers and technical folks love this sort of thing, I would bet that most people don’t. And as much as I wanted to put stuff like this in, the existing search was good enough. It is always important to remember that adding any feature has an ongoing cost in code maintenance and debuggability, so you really have to be careful when adding non-essential ones.

Sharded Indexes
One technique to speed up non-cached searches (which were primarily disk IO bound) would have been to partition my index across multiple servers. For a single search, multiple index servers would each return a chunk of search results that the cache servers would aggregate into a single result set. This would have become a requirement once our index grew too large, but we did not get to that point.

This was a tempting project for me because I didn’t know how to do it and I love learning about this sort of thing. However, we had such a high hit rate in our caching layer that most of our searches were lightening fast, so adding this huge chunk of complexity wouldn’t have helped the company.

Things I should have done

Faster indexing
Every few months we had to take the site down overnight to clean out our multi-hundred million row databases (I assure you this will be a bullet point when I do a critique of the entire system design). After removing all the stale data, it was faster to recreate the search index than to edit the existing one. The indexing process was fairly straightforward–it simply walked through all the rows in a single database server. As you might imagine, this tended to bottleneck on disk IO (both in reading rows from the DB and writing them out in the index). An indexing process that ran on multiple index servers and took advantage of more disk spindles could have shaved hours of downtime off the site each month, which would have been a big win.

As a more general rule, whenever you are thinking about doing anything with disk IO, ask yourself whether or not you can speed things up by using more spindles. Disk IO is going to be a bottleneck in the datacenter for a long time. Even after solid state disks get mass acceptance, it will never hurt to at least consider how you can use more than one of them at once. As a side note, it is always a good idea to have a dollar figure in mind for what downtime costs per minute. Knowing that can help you make better decisions about how you invest your developer resources.

Redundant cache servers
Each cache server handled a unique subset of searches, which meant that some percentage of searches would fail when one of them went down. This design was easy to implement and the cheapest way to get good performance, but it periodically led to avoidable downtime. Fixing this would have helped work around another problem–the IO crunch if the cache for one machine was wiped or rebooted. Designing for this sort of failure is so important and so easy that I’ll cover the best way to do it in a future article.

More compression
As I’ve mentioned, one common bottleneck throughout the system was disk IO. One way we could have addressed this is by compressing anything we stored on disk. Raw lists of IDs don’t compress very well, but the compression ratio can be improved by sorting the list and storing it as an initial offset plus a set of deltas. While “1,2,3,4″ might not compress well, “1,1,1,1″ will compress fabulously. This optimization is slightly iffy — we only had 4 index servers and 4 cache servers, so the savings would have to be pretty spectacular to save a significant amount of money. Despite that, I put this into the thumbs-up list because it would have been easy to do at the very beginning of the project.

Zlib is already easy to use, but it is wise to integrate it at the very beginning of your project. One obvious performance problem with this sort of service is buffer copying, and the sooner you think about compression, the saner the buffer management will be. One final caveat though–compressing data to send across a network may not be worth it. Gigabit ethernet is awfully fast, and I found it was generally less of a bottleneck than the disk.

More caching
I mentioned that the cache servers only provided a list of match IDs to the web server’s PHP script, which then had to hit one of the database slaves before rendering the page. Those queries were indexed by ID, so they were able to use the higher performance slaves, which were capable of something like 1000 queries per second. However, the cache server could have easily done those final queries once and saved the results alongside the cached IDs.

I always like to go back to the numbers for decisions like this. If it took 2 queries per page to fetch the info and the cluster was handling 2000 searches per second, saving that information in the cache layer could have saved us about 4 database slaves. Total costs associated with those servers would be at least $10K, so this would have been an interesting change assuming the design didn’t increase IO usage in the cluster layer significantly. Of course, the change would also have to have been simple enough that it didn’t require any big changes to the architecture, introduce time consuming new bugs, or take more than a few days to code and test.

A project like this is exactly what startups are good at doing. I would have been excited to build this over a weekend I might otherwise have taken off because I know I could put it into production and see the effects right away. I would have initially enabled it only for some small percentage of queries on a single server and rolled it out at 2 or 3 in the morning.

Better Monitoring
Usually, I could easily determine if there was a problem with the site simply by watching the rate of incoming requests and the rate of completed transfers. Those two numbers were very predictable, and any problem on the front or back end was likely to cause a drop in one or both of them. The tricky part came when I had to narrow down which machine was having the problem.

Good monitoring is a key part of scalability. One best practice I recommend for custom services is to publish a set of values via HTTP that can be easily parsed and dumped into RRDTool or viewed in lynx from the command line. Our search cluster didn’t support this, and finding problems usually meant secure shelling into each box and checking the error log. This was fine with a small number of servers, but it would have been much easier to see anomalies with a set of graphs tracking how many queries per second each server was doing.

Leave a Reply