How we quadrupled the performance of Elasticsearch

Posted on July 19, 2015 by

Well, that’s a misleading title. We actually quadrupled the performance of our brand monitoring alert system that uses Elasticsearch’s Percolator, but that would have been a much longer title.

Some background

Buzzilla has two main products. The first is Webhose.io which provides businesses worldwide access to structured data from the open web, and the second is the leading brand monitoring system in Israel.

 

The brand monitoring system

Although Israel is a small country, Israelis usually create complex queries that put a lot of stress on our servers (this could also be attributed to the complexities of the Hebrew language, but that’s for another post). One of the most popular features of the system is its ability to send push notifications (usually by email) when a post matches a Boolean query.

Configuration

As I mentioned, we use the Elasticsearch Percolator to register our queries (about 3,500 of them) and run each post we crawl against them. We run about 1 million posts a day against those queries, and when a post matches, it is sent to our clients. The system is distributed and uses RabbitMQ to pull posts from our crawlers’ queue.

We made some optimizations in the past, where we didn’t run the Boolean query against a document if we knew beforehand it wouldn’t match. We did that by comparing some properties of the query and the document. For example, if the language didn’t match, there was no need to check the rest of the query.
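To make the idea concrete, here is a minimal sketch of that kind of pre-filter. The `Alert` and `Post` classes and the `language` field are hypothetical names made up for illustration, not our actual data model; the point is only that a cheap property comparison runs before any Boolean query does.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    query: str      # the Boolean query text
    language: str   # hypothetical metadata stored next to the query


@dataclass
class Post:
    text: str
    language: str   # hypothetical metadata produced at crawl time


def candidate_alerts(post, alerts):
    """Keep only the alerts whose cheap properties are compatible with the post."""
    return [alert for alert in alerts if alert.language == post.language]


alerts = [Alert("quick fox", "en"), Alert("lazy dog", "he")]
post = Post("The quick brown fox jumps over the lazy dog", "en")

# Only the English alert is worth running against this post.
print([alert.query for alert in candidate_alerts(post, alerts)])
```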

The problem we faced

With our old configuration, each server (and a strong one at that) could run about 30 documents per minute against all of our queries. As the volume of crawled data and the number of queries grew, we began to have trouble keeping up, at times causing delays of a few hours between crawl time and alert match. We found ourselves adding more and more hardware to try to solve the problem.

The solution

What did the trick was creating a pre-percolation process that concatenates multiple posts and runs the queries against the concatenated string (you must, of course, remove the Boolean NOT clause of the query; I will explain why later on). If there is no match, great: you just saved the time of checking each individual post. If there is a match, bummer: you wasted time checking the concatenated string. Fortunately, the former is much more frequent than the latter.

So now I will explain why it worked. Let’s take two phrases, or posts, as an example:

  1. First post: “The quick brown fox jumps over the lazy dog”
  2. Second post: “This is a quick example since I’m lazy”

The combined text would be: “The quick brown fox jumps over the lazy dog This is a quick example since I’m lazy”

It’s obvious that a query (with its NOT clauses removed) that didn’t match the combined text wouldn’t match any of the individual posts. So by running the query once against a long chunk of text, we didn’t need to run it against two shorter chunks of text. If, on the other hand, it did match, we would then need to run it against each post to see which query matched which post. But even then we know which queries matched, so we don’t have to run all the queries again on each post.
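The two-stage flow described above can be sketched in a few lines. This is a toy stand-in, not the Percolator: `matches` just checks that every required term appears in the text, and the query names and terms are invented for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"\w+", text.lower()))


def matches(required_terms, text):
    """Toy stand-in for a percolator query: every term must occur in the text."""
    return set(required_terms) <= tokenize(text)


def percolate_batch(posts, queries):
    """Stage 1: run every query once against the concatenated posts.
    Stage 2: run only the surviving queries against each individual post."""
    combined = " ".join(posts)
    survivors = {name: terms for name, terms in queries.items()
                 if matches(terms, combined)}
    results = {index: [] for index in range(len(posts))}
    for name, terms in survivors.items():
        for index, post in enumerate(posts):
            if matches(terms, post):
                results[index].append(name)
    return results, sorted(survivors)


posts = ["The quick brown fox jumps over the lazy dog",
         "This is a quick example since I'm lazy"]
queries = {"fox": ["quick", "fox"],      # matches the first post
           "cat": ["cat"],               # matches nothing, dropped in stage 1
           "cross": ["fox", "example"]}  # matches only across the two posts

results, survivors = percolate_batch(posts, queries)
print(survivors)  # queries that needed a second, per-post check
print(results)    # per-post matches after the second stage
```

Note the "cross" query: it matches the concatenated text only because its terms come from different posts, so it survives stage 1 and is rejected in stage 2. That kind of false positive costs time but never loses a match.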

So why is running a query against one large chunk of text faster than running it against two short chunks of text? Because we run the query against the index, and the index of the concatenated text is smaller than the indexes of the individual posts combined, since terms that appear in more than one post are stored only once:

SizeOfIndex(Post A + Post B) < SizeOfIndex(Post A) + SizeOfIndex(Post B)
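A rough way to see this is to count distinct terms, assuming (as a simplification) that the size of an index tracks the number of distinct terms it holds. The real in-memory index stores more than that, so treat this only as an illustration of the inequality using the two example posts.

```python
import re


def distinct_terms(text):
    """Return the set of distinct lowercase words in the text."""
    return set(re.findall(r"\w+", text.lower()))


post_a = "The quick brown fox jumps over the lazy dog"
post_b = "This is a quick example since I'm lazy"

separate = len(distinct_terms(post_a)) + len(distinct_terms(post_b))
combined = len(distinct_terms(post_a + " " + post_b))

# "quick" and "lazy" appear in both posts, so they are counted once in the
# combined text but twice when the posts are indexed separately.
print(separate, combined)  # prints: 17 15
```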

Why stop at two posts combined? Why not 100? You can, and of course should, concatenate more than two posts, but be careful: once a query matches the concatenated text, you have actually wasted resources, as you now need to run it against each post (or do a binary search). You want to reach a balance point where the chance of no match is much greater than the chance of a match, as at that point your system is optimized.
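One way to reason about that balance point is a back-of-the-envelope cost model. The sketch below rests on assumptions that are not measurements from our system: each post matches a given query independently with probability `p`, every query evaluation costs the same regardless of text length, and a batch match is followed by checking every post in the batch (no binary search).

```python
def cost_per_post(batch_size, p):
    """Expected query evaluations per post under the assumptions above.

    One evaluation for the concatenated batch, plus batch_size more whenever
    at least one post in the batch matches.
    """
    if batch_size == 1:
        return 1.0  # no batching: one evaluation per post
    chance_batch_matches = 1.0 - (1.0 - p) ** batch_size
    return 1.0 / batch_size + chance_batch_matches


def best_batch_size(p, max_size=1000):
    """Batch size with the lowest expected cost per post, up to max_size."""
    return min(range(1, max_size + 1), key=lambda n: cost_per_post(n, p))


for p in (0.001, 0.01, 0.1):
    n = best_batch_size(p)
    print(p, n, round(cost_per_post(n, p), 3))
```

Under this model, the rarer a match is, the larger the batch can be before the wasted batch checks start to dominate. With a match probability of 1% the model favors batches of 11 posts, while at 50% batching never pays off.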

I mentioned earlier that you must remove the Boolean NOT clause of the query. If you don’t remove it, you might miss relevant posts. Take the query “quick -example” and run it against the concatenated text above. It won’t match, because the keyword “example” exists in the text, but it should have matched, since the first post matches the query.
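Here is the same example as a small sketch. It assumes a flat query syntax where terms are separated by spaces and a leading "-" marks a NOT clause; real Boolean queries with nesting need a proper parser, and `strip_negations` and `matches_flat` are names invented for this illustration.

```python
def strip_negations(query):
    """Drop the '-term' clauses from a flat, space-separated query."""
    return " ".join(term for term in query.split() if not term.startswith("-"))


def matches_flat(query, text):
    """Toy matcher: plain terms must appear, '-' terms must not appear."""
    words = set(text.lower().split())
    for term in query.lower().split():
        if term.startswith("-"):
            if term[1:] in words:
                return False
        elif term not in words:
            return False
    return True


first = "The quick brown fox jumps over the lazy dog"
second = "This is a quick example since I'm lazy"
combined = first + " " + second
query = "quick -example"

print(matches_flat(query, first))                      # True: a relevant post
print(matches_flat(query, combined))                   # False: the post would be missed
print(matches_flat(strip_negations(query), combined))  # True: safe pre-check
```

The stripped query is only used for the pre-percolation pass. The original query, NOT clause included, still runs against each individual post afterwards, which is why the second post is correctly rejected at that stage.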

That’s it. The solution takes more memory as we are now running two percolators (pre-percolator and the actual alert percolator), but it’s 4 times faster! Hooray!



5 thoughts on “How we quadrupled the performance of Elasticsearch”

  1. elasticmark says:

    If you used nested docs would that remove the issue with the NOT?

  2. Ran Geva says:

    Why would it remove the issue? The problem is that concatenating text might cause a false positive when you run a query against it, so you might miss relevant posts.

  3. elasticmark says:

    I’m assuming some of your efficiency gains are from batching multiple docs into the one memory-based index. Nested docs/queries are a way of doing this without introducing the cross-matching problem you encountered with your not clause. I know Percolator was fixed to work with nested docs a while back (https://github.com/elastic/elasticsearch/issues/6540).

  4. Ran Geva says:

    Well we didn’t try it, but I’m not certain it will do the trick as the main reason we see a boost is because we unite indexes. With nested docs, each document still has its own index.
