To write an efficient crawler, you must be smart about the content you download. When your crawler downloads an HTML page, it uses bandwidth, memory and CPU, not only its own but also those of the server the resource resides on.
Knowing when not to download a resource is more important than downloading one: with massive crawling it will save you time and money, and will probably decrease the chances of getting your crawler blocked.
So how would you go about deciding whether or not to crawl a page? HTTP protocol to the rescue. What you want to do is send an HTTP GET request with an If-Modified-Since header stating when you last crawled the resource (this timestamp is usually taken from the Last-Modified HTTP header of the previous response). If the resource hasn’t been modified since then, the server will return a 304 (Not Modified) status code rather than 200, and the content won’t be sent. You just saved yourself and the crawled server some resources. Hooray!
Another important header the server sends your way is the Expires header, which states the date and time after which the response is considered stale. You shouldn’t request a resource again if it hasn’t expired.
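The Expires rule can be expressed as a small check before any request is sent. This is only a sketch, assuming Python 3 and that you stored the Expires value from the previous response; the function name is_still_fresh is made up for illustration. Missing or unparseable values are treated as already expired, which errs on the side of re-requesting.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_still_fresh(expires_header, now=None):
    """Return True if the stored Expires time has not been reached yet."""
    if not expires_header:
        return False  # no Expires header: nothing says the copy is fresh
    try:
        expires_at = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        return False  # unparseable value such as "0": treat as expired
    if expires_at.tzinfo is None:
        # HTTP dates are in GMT; make naive results timezone-aware
        expires_at = expires_at.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now < expires_at
```

A crawler would call is_still_fresh(stored_expires) for each URL in its queue and skip the ones that return True.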
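Here is some sample code for the If-Modified-Since request described above:
    import time
    import urllib2

    request = urllib2.Request("http://techcrunch.com/")
    # In a real crawler this is the time of the last crawl;
    # the current time is used here only as a placeholder.
    timestamp = time.time()
    since = time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime(timestamp))
    request.add_header('If-Modified-Since', since)
    try:
        html = urllib2.build_opener().open(request).read()
    except Exception as e:
        # urllib2 raises an HTTPError for a 304 (Not Modified) response
        print(e)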
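One detail worth making explicit is that the standard library reports a 304 response as an error, so "not modified" has to be told apart from real failures. The following is a minimal sketch, assuming Python 3 (where the functionality of urllib2 lives in urllib.request). The function name fetch_if_modified and its optional opener parameter are illustrative, not part of any library.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def fetch_if_modified(url, last_crawled, opener=None):
    """Return the body as bytes, or None if the server answers 304.

    last_crawled is a Unix timestamp (seconds since the epoch).
    """
    request = urllib.request.Request(url)
    # formatdate(..., usegmt=True) produces an HTTP date ending in "GMT"
    request.add_header("If-Modified-Since", formatdate(last_crawled, usegmt=True))
    if opener is None:
        opener = urllib.request.build_opener()
    try:
        with opener.open(request) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return None  # not modified: nothing to download
        raise  # any other HTTP error is a real failure
```

Calling fetch_if_modified(url, last_crawled) with the timestamp of your previous crawl returns None when there is nothing new to download.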
But what if the server you are crawling doesn’t support the If-Modified-Since request header? Or doesn’t return a Last-Modified value in its response headers?
ETag to the rescue
The ETag (Entity Tag) works similarly to the Last-Modified header. The ETag value identifies a specific version of the resource and is typically a digest of its content (for instance, an MD5 hash). By sending the server the last ETag value you have (i.e. the one you obtained from the ETag response header when you last crawled the resource), you let the server check whether the version you have matches the most recent one. If it does, the server returns a 304 status code, just as with If-Modified-Since.
Here is some sample code for the ETag implementation. Note that this time you are sending an HTTP GET request with an If-None-Match header stating the latest ETag value you got for this resource:
    import urllib2

    request = urllib2.Request("http://techcrunch.com/")
    # etag holds the ETag value saved from the previous response for this URL
    request.add_header("If-None-Match", etag)
    try:
        html = urllib2.build_opener().open(request).read()
    except Exception as e:
        # a 304 (Not Modified) response is raised as an HTTPError
        print(e)
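For either header to work, the crawler has to remember what the server sent last time. Here is one possible way to keep track of both validators, sketched in Python 3 with a plain dictionary as the store. The names remember_validators and conditional_headers are hypothetical, and response_headers can be any mapping with a get method (for example the headers attribute of a urllib response).

```python
def remember_validators(cache, url, response_headers):
    """Store the ETag and Last-Modified values the server sent for url."""
    cache[url] = {
        "etag": response_headers.get("ETag"),
        "last_modified": response_headers.get("Last-Modified"),
    }


def conditional_headers(cache, url):
    """Build the request headers for re-crawling url from stored values."""
    saved = cache.get(url, {})
    headers = {}
    if saved.get("etag"):
        # send the ETag back exactly as received, quotes included
        headers["If-None-Match"] = saved["etag"]
    if saved.get("last_modified"):
        headers["If-Modified-Since"] = saved["last_modified"]
    return headers
```

A URL that was never crawled, or whose server sent neither header, yields an empty dictionary, so the request simply goes out unconditionally.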
But what if the server you are crawling doesn’t support the ETag header?
Content-Length to the rescue?
This is a last resort, and it’s quite aggressive. If the server you are accessing doesn’t support the above headers, a possible solution is to use the Content-Length header. This header is quite self-explanatory: it indicates the size, in bytes, of the content you are about to receive. The assumption here (again, an aggressive one) is that if the content changes, its size most likely changes too. By requesting only the resource’s headers (an HTTP HEAD request) and comparing the current content length with the length you obtained previously, you can guess whether the resource has been modified. As this assumption isn’t always true (two different versions of the content might have the same length), your crawler might skip resources that did change when using this method.
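The size comparison could look roughly like this. It is a sketch under the same aggressive assumption, written for Python 3. The function names fetch_content_length and probably_changed and the optional opener parameter are invented for the example.

```python
import urllib.request


def fetch_content_length(url, opener=None):
    """Send a HEAD request and return Content-Length as an int, or None."""
    request = urllib.request.Request(url, method="HEAD")
    if opener is None:
        opener = urllib.request.build_opener()
    with opener.open(request) as response:
        value = response.headers.get("Content-Length")
    if value is None or not value.isdigit():
        return None  # header missing or malformed
    return int(value)


def probably_changed(previous_length, current_length):
    """Guess whether the resource changed, based only on its size."""
    if previous_length is None or current_length is None:
        return True  # nothing to compare against: crawl to be safe
    return previous_length != current_length
```

When either length is unknown the sketch chooses to crawl, since skipping a page on no evidence at all would be even more aggressive than the size heuristic itself.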
Hope this little article helped. I will be happy to answer any questions you might have.