The Blog

Vertical aggregation & Pattern matching crawlers

Posted on November 27, 2014 by

After bashing various crawling techniques, I would like to describe the technique we use here at webhose.io, a technology that has been developed over the past 8 years.

Our crawlers were developed with the following requirements in mind:

  • Efficient use of server resources, i.e. CPU and bandwidth
  • Fast fetching and content extraction
  • Easy addition of new sites to the crawling cycle
  • A simple but powerful way of “teaching” the crawler about new site structures
  • Robustness to changes in a site’s layout


We started by developing our crawlers in Python due to its dynamic module loading. This was important, as we wanted to write new parsers and add or fix them quickly, without needing to restart the system.
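
A minimal sketch of this kind of dynamic loading, assuming a hypothetical parsers package (the module layout and function names are illustrative, not a description of the actual system):

    import importlib

    # Cache of already-loaded parser modules, keyed by parser name.
    _loaded_parsers = {}

    def get_parser(name):
        """Load a parser module at runtime, or reload it to pick up fixes without a restart."""
        if name in _loaded_parsers:
            _loaded_parsers[name] = importlib.reload(_loaded_parsers[name])
        else:
            _loaded_parsers[name] = importlib.import_module("parsers." + name)
        return _loaded_parsers[name]

    # Usage: parser = get_parser("phpbb"); data = parser.parse(html)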

The crawler downloads only the HTML content, not the image/JS/CSS files. It doesn’t wander around the site, but chooses the exact links to fetch, keeping bandwidth consumption to a minimum.
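
To illustrate that selective fetching (the forum URL, the link pattern and the use of the requests library are assumptions made for the example):

    import re
    import requests
    from urllib.parse import urljoin

    LISTING_URL = "https://example-forum.com/viewforum.php?f=6"   # hypothetical listing page

    # Fetch only the HTML of the listing page; images, JS and CSS are never requested.
    listing_html = requests.get(LISTING_URL, timeout=30).text

    # Choose only the thread links we actually care about instead of wandering around the site.
    for href in sorted(set(re.findall(r'href="([^"]*viewtopic\.php\?[^"]*)"', listing_html))):
        thread_url = urljoin(LISTING_URL, href.replace("&amp;", "&"))
        thread_html = requests.get(thread_url, timeout=30).text
        # ...hand thread_html to the relevant parser...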

We don’t use headless browsers to parse the content, nor a DOM parser. We extract the content using regular expressions and various heuristic functions, which results in a solution that is robust to changes in HTML structure.
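
A rough sketch of what one such heuristic function might look like (the patterns are illustrative guesses, ordered from most to least specific):

    import re

    def extract_title(html):
        """Try several patterns in order of specificity and return the first match."""
        patterns = [
            r'<meta\s+property="og:title"\s+content="([^"]+)"',   # Open Graph metadata
            r'<h1[^>]*>(.*?)</h1>',                               # main on-page heading
            r'<title>(.*?)</title>',                              # last-resort fallback
        ]
        for pattern in patterns:
            match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
            if match:
                return re.sub(r'<[^>]+>', '', match.group(1)).strip()  # drop any inner tags
        return None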

We established knowledge about multiple content platforms, and we leverage this knowledge to easily add new sources without the need to write new parsers, as the system recognizes the basic structure of the platform.
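
For example, recognizing the platform can be as simple as matching a few fingerprints against the page (the signatures below are deliberately simplified; a real system would rely on many more signals):

    import re

    # Simplified platform fingerprints: a generator meta tag or a telltale URL pattern.
    PLATFORM_SIGNATURES = {
        "wordpress": [r'<meta\s+name="generator"\s+content="WordPress', r'/wp-content/'],
        "phpbb":     [r'<meta\s+name="generator"\s+content="phpBB',     r'viewtopic\.php\?'],
        "vbulletin": [r'vBulletin',                                      r'showthread\.php\?'],
    }

    def detect_platform(html):
        """Return the first platform whose fingerprint appears in the page, if any."""
        for platform, signatures in PLATFORM_SIGNATURES.items():
            if any(re.search(sig, html, re.IGNORECASE) for sig in signatures):
                return platform
        return None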

Since the crawlers are written in Python, writing a parser can take anywhere from a few minutes, when you only need to fill out a template with a regular expression, to considerably longer for a powerful parser that handles a combination of JSON responses retrieved via AJAX, cookies, and different HTTP headers.
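
The quick end of that spectrum might look something like this minimal parser template, the kind of module a dynamic loader could pick up (the field names and patterns are assumptions for the sake of the example):

    import re

    # A "fill in the blanks" parser template: one regular expression per field.
    FIELD_PATTERNS = {
        "title":  r'<h1[^>]*class="entry-title"[^>]*>(.*?)</h1>',
        "author": r'<span[^>]*class="author"[^>]*>(.*?)</span>',
        "body":   r'<div[^>]*class="entry-content"[^>]*>(.*?)</div>',
    }

    def parse(html):
        """Extract each field with its pattern; fields that do not match come back as None."""
        result = {}
        for field, pattern in FIELD_PATTERNS.items():
            match = re.search(pattern, html, re.DOTALL)
            result[field] = match.group(1).strip() if match else None
        return result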

True, our solution requires basic knowledge of Python and regular expressions, but in return it provides power and efficiency unmatched by any other technique.



This entry was posted in Technology.

11 thoughts on “Vertical aggregation & Pattern matching crawlers”

  1. Amir says:

    Please describe the following paragraph with a simple example, I don’t understand it!
    “We established knowledge about multiple content platforms, and we leverage this knowledge to easily add new sources without the need to write new parsers, as the system recognizes the basic structure of the platform.”

  2. Ran Geva says:

    There are many message board platforms like phpBB, vBulletin, Invision and others. Let’s take phpBB for example. Here is a discussion thread:
    https://www.phpbb.com/community/viewtopic.php?f=6&t=2296141
    If you view its source, you will notice that each post’s content is wrapped by:

    <div class="content">some content</div>

    Every forum that relies on phpBB will have the same structure. There are of course those who modify the HTML, but they usually do it based on the same underlying structure. You will need to use regular expressions and wildcards to tackle those modifications (so minor changes won’t break your parser), but that’s the basic idea.
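
    A minimal sketch of such a lenient pattern, assuming the phpBB-style wrapper above (the extra wildcards let added attributes or extra classes through without breaking the match):

        import re

        # Tolerate extra attributes and extra classes around the phpBB "content" div.
        POST_PATTERN = re.compile(
            r'<div[^>]*class="[^"]*\bcontent\b[^"]*"[^>]*>(.*?)</div>',
            re.DOTALL | re.IGNORECASE,
        )

        def extract_posts(thread_html):
            """Return the raw inner HTML of every post-like content block on the page."""
            # Nested <div>s inside a post would need a smarter pattern; kept simple here.
            return [match.strip() for match in POST_PATTERN.findall(thread_html)]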

  3. Amir says:

    Ahaa, thank you Ran.

  4. adibalcan says:

    Why not a DOM parser like PyQuery (based on lxml)?

  5. adibalcan says:

    Why not a DOM parser?

  6. Ran Geva says:

    Two main reasons:
    1. A DOM parser needs to do just that – parse the complete DOM – which is very heavy, as it builds an object model for the entire document when you often need only a fraction of it.
    2. Malformed HTML documents can cause the parser to fail – that will never happen with pattern matching.

  7. adibalcan says:

    Sometimes it’s easier to write a CSS selector instead of a regex, and I know that multiple regexes can be heavy on the CPU; on the other hand, a DOM parser uses a lot of memory…

  8. Ran Geva says:

    It’s much easier doing it with a selector, that’s for sure, but it will break faster if the owner of the site changes the HTML code. With regex you can create more lenient patterns that will keep working even if the owner changes the site’s structure.

  9. Avi Hayun says:

    Hi Ran,

    Went over your blog – thanks for the content.

    If I understand you correctly then you do the following:
    1. Recognize the platform the site is using (WordPress, phpBB, Joomla, etc.)
    2. Download the HTML
    3. Parse the HTML using regexes that were developed for each platform, so you get clean content.

    As you are downloading the HTML anyway, I think that using CSS selectors will work much more precisely and quickly (JSoup is a great HTML parser which also overcomes malformed HTML) – did you check that the CPU cost of building the DOM is a bottleneck?

    Try testing it, I am sure you will find that your bottleneck is networking, which consumes much more time than building the DOM and parsing it. While the network is in use your CPU is idle, so just multithread those two tasks: fetching a page uses the network while, at the same time, parsing already-fetched pages uses the CPU (a rough sketch of this idea appears below). I myself have checked, and CPU was never an issue.

    I would also use a well-known and robust web crawler instead of reinventing the wheel, as crawlers have been around for a long time and have already overcome the pitfalls any new crawler will need to struggle with – from proper URL normalization to crawler traps in websites (and lots of other issues).
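
    A rough sketch of that fetch/parse overlap with a thread pool (the URLs and the parse function are placeholders, just to illustrate the idea):

        import re
        import requests
        from concurrent.futures import ThreadPoolExecutor

        URLS = ["https://example.com/page1", "https://example.com/page2"]   # placeholder URLs

        def fetch(url):
            """Network-bound step: threads mostly wait on I/O here."""
            return requests.get(url, timeout=30).text

        def parse(html):
            """CPU-bound step: a trivial stand-in for the real extraction logic."""
            match = re.search(r'<title>(.*?)</title>', html, re.DOTALL)
            return match.group(1).strip() if match else None

        # While some threads wait on the network, others keep the CPU busy
        # parsing pages that have already arrived.
        with ThreadPoolExecutor(max_workers=8) as pool:
            for title in pool.map(lambda url: parse(fetch(url)), URLS):
                print(title)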

  10. Ran Geva says:

    Hi Avi,

    I’ve used JSoup and BeautifulSoup – they are both great and very easy to use, but they break easily when the site changes. With regular expressions you have more power to generalize your pattern than with specific selectors. Furthermore, there are cases where both parsers will break due to malformed HTML.

    Regarding using a well-known crawler – we have been developing our crawlers since ’07, and therefore we have implemented many optimizations that are type-specific (i.e. specific to forums, to blogs, and to news sites).

    Usually, crawler software is generalized for “crawling the web”. When you design a crawler for a specific mission, you end up much more efficient than when you take a generic piece of software and bend it to your needs.

  11. Avi Hayun says:

    Makes sense.

    I will be happy to read a post of yours about platform-specific optimizations if you decide to share such info.

    Avi.
