
To crawl or not to crawl, that is the question

Posted on August 24, 2015 by

In order to write an efficient crawler, you must be smart about the content you download. When your crawler downloads an HTML page it uses bandwidth, memory and CPU, not only its own, but also those of the server the resource resides on.

Knowing when not to download a resource is more important than downloading it: at massive crawling scale it saves you time and money, and it probably decreases the chances of getting your crawler blocked.

So how would you go about deciding whether or not to crawl a page? The HTTP protocol to the rescue. What you want to do is send an HTTP GET request with an If-Modified-Since header stating when you last crawled the resource (the server usually reports that time in its Last-Modified response header). If the resource hasn't been modified since then, the server will return a 304 status code (rather than 200) and the content won't be sent. You just saved yourself and the crawled server some resources, hooray!

Another important header the server sends your way is the Expires header: if the copy you already have hasn't expired yet, you shouldn't request the resource again.
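
For illustration, here is a minimal sketch of that check. It is not a full implementation; how you store the Expires value between crawls is up to you, this only shows the comparison:

import time
import urllib2
from email.utils import parsedate_tz, mktime_tz

# First crawl: remember the Expires header the server sent (if any)
response = urllib2.build_opener().open("http://techcrunch.com/")
expires = response.info().getheader("Expires")

# Later, before re-crawling: skip the request if that copy has not expired yet
parsed = parsedate_tz(expires) if expires else None
if parsed and mktime_tz(parsed) > time.time():
    print 'Cached copy has not expired yet - no need to re-crawl'
else:
    html = urllib2.build_opener().open("http://techcrunch.com/").read()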

Here is some sample code to try the If-Modified-Since approach:

import time
import urllib2

request = urllib2.Request("http://techcrunch.com/")
# In practice, use the timestamp of your last crawl (the current time is used here for demonstration)
timestamp = time.time()
since = time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime(timestamp))
request.add_header('If-Modified-Since', since)
try:
    html = urllib2.build_opener().open(request).read()
except urllib2.HTTPError, e:
    # urllib2 raises HTTPError for a 304 response
    if e.code == 304:
        print 'Not modified - nothing to download'
    else:
        print e

But what if the server you are crawling doesn’t support the “If-Modified-Since” request header? Or doesn’t return the “Last-Modified” value in its response headers?

ETag to the rescue

The ETag (Entity Tag) works similarly to the Last-Modified header. The ETag value is an opaque identifier of the resource’s current version, typically a digest of its content (for instance, an MD5 hash). By sending the server the last ETag value you have (i.e. the one obtained from the ETag response header when you last crawled the content), the server can tell whether the version you currently have matches the most recent one.

Here is sample code for the ETag variant. Note that this time you are sending an HTTP GET request with an If-None-Match header stating the latest ETag value you got for this resource:

import urllib2

# 'etag' holds the ETag value the server returned on your previous crawl
request = urllib2.Request("http://techcrunch.com/")
request.add_header("If-None-Match", etag)
try:
    html = urllib2.build_opener().open(request).read()
except urllib2.HTTPError, e:
    # urllib2 raises HTTPError for a 304 response
    if e.code == 304:
        print 'Not modified - nothing to download'
    else:
        print e

But what if the server you are crawling doesn’t support the ETag header either?

Content-Length to the rescue?

This is a last resort and it’s quite aggressive. If the server you are accessing doesn’t support any of the above headers, a possible solution is to use the Content-Length header. This header is quite self-explanatory, as it indicates the size of the content you are about to receive. The assumption here (again, an aggressive assumption) is that if the content changes, its size most likely changes too. By requesting only the resource’s headers (an HTTP HEAD request) and comparing the current Content-Length with the previous value you obtained, you can guess whether the resource has been modified. Since this assumption isn’t always true (i.e. two different versions of the content might have the same length), using this method your crawler might skip resources that did change.
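
Here is a minimal sketch of that check. It assumes previous_length holds the Content-Length you recorded on your last crawl (the variable name is hypothetical), and it uses a small Request subclass because urllib2 has no built-in HEAD support:

import urllib2

class HeadRequest(urllib2.Request):
    # urllib2 has no built-in HEAD support, so override the method name
    def get_method(self):
        return "HEAD"

# 'previous_length' is assumed to hold the Content-Length recorded on the
# last crawl of this resource (hypothetical variable)
response = urllib2.build_opener().open(HeadRequest("http://techcrunch.com/"))
current_length = response.info().getheader("Content-Length")
if current_length is not None and int(current_length) == previous_length:
    print 'Same Content-Length - most likely unchanged, skipping the download'
else:
    html = urllib2.build_opener().open("http://techcrunch.com/").read()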

Hope this little article helped; I will be happy to answer any questions you might have.



3 thoughts on “To crawl or not to crawl, that is the question”

  1. Avi Hayun says:

    Your methods are ok, and might even save you some time and bandwidth.

    But they will save you much less than you think.

    Most of the networking time goes on connection handling, handshaking, etc., so that heavy overhead is paid whether you download the full page content or only the headers.
    Not parsing a page does save you some time as well, but it is minor, as the networking is much heavier than the content parsing (which takes a fraction of the time compared to the page fetch).

    Not only that, but if you decide that you do need to download the page, then the earlier header fetch was wasted, since you will download the whole thing now anyway. So when a page does need to be downloaded, checking the headers beforehand actually costs you more time.

    And I don’t think that fetching headers will change anything about getting banned by the host: from the server’s point of view, a fetch of headers looks just like a fetch of content.

    The methods you are writing about here are good, but they must be combined with other methods.

    The main thing a good crawler that checks for site changes must do is learn the change rate of every specific page of a site (and not rely on the robots.txt recommendations).
    If, upon checking, you see that a page’s content changes a lot, then you need to increase the crawl rate for that page.
    If you find that a page doesn’t change, there is no need to check it twice an hour.

    That’s the theory in a nutshell.

    In practice, you need to implement well-researched page change-frequency algorithms, something like this:
    http://stackoverflow.com/questions/10331738/strategy-for-how-to-crawl-index-frequently-updated-webpages

    But there are full articles about the subject…
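
    A minimal sketch of the core idea (a hypothetical helper, just for illustration): shrink the revisit interval when a page changed, grow it when it did not.

    def next_interval(current_interval, page_changed):
        # All values are in seconds: revisit changed pages sooner and
        # stable pages less and less often, within sane bounds
        min_interval, max_interval = 600, 7 * 24 * 3600
        if page_changed:
            return max(min_interval, current_interval // 2)
        return min(max_interval, current_interval * 2)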

  2. Ran Geva says:

    Hi Avi,

    Thank you for your feedback. I agree that an important optimization would be to know if you should even approach a site, by its update frequency.

    But I do not agree that fetching the headers is as bad as fetching the content itself. It’s true that the HTTP handshake takes time, but nothing compared to downloading tens or hundreds of kilobytes of data (some pages are heavy). Remember, you are not downloading only one page per site; you may download hundreds or thousands of pages per site. That adds up to tens or hundreds of megabytes, both on your side and, even worse, on the side of the source you are crawling. You might be willing to spend the money on the bandwidth you consume, but the owner of the site will not, and will be happy to block you if you waste too much of its resources.
