One of the fastest, simplest and, unfortunately, wrong ways of extracting content from a website is reading its RSS feeds. I will show you how it’s done and why it’s useless.
Each RSS feed already contains the data, structured and ready for harvesting, so content extraction is indeed simple and fast. Take, for example, the RSS feed from TechCrunch. (You can often find the RSS feed URL by reading the <link rel="alternate" type="application/rss+xml"…> tag in the main HTML page. In TechCrunch’s case, it’s http://techcrunch.com/feed/.) The output is an XML document made up of <item> elements, in each of which you can find the author name, the post date, images and even part of the content.
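To make this concrete, here is a minimal sketch in Python using only the standard library. It assumes an RSS 2.0 feed whose author lives in a <dc:creator> element, which is common but not universal. The sample page, the sample feed, the example.com URLs and the helper names (find_feed_urls, parse_items) are all made up for illustration; this is not TechCrunch’s actual markup.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumption: the feed stores the author in <dc:creator> (Dublin Core namespace).
NS = {"dc": "http://purl.org/dc/elements/1.1/"}


class FeedLinkFinder(HTMLParser):
    """Collects the href of every <link rel="alternate" type="application/rss+xml">."""

    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        if "alternate" in rel and mime == "application/rss+xml" and attr.get("href"):
            self.feed_urls.append(attr["href"])


def find_feed_urls(html):
    """Return the RSS feed URLs advertised in an HTML page."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feed_urls


def parse_items(rss_xml):
    """Return one dict per <item>; fields missing from the feed come back as None."""
    items = []
    for item in ET.fromstring(rss_xml).iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "author": item.findtext("dc:creator", namespaces=NS),
            "published": item.findtext("pubDate"),
            "summary": item.findtext("description"),  # usually only an excerpt
        })
    return items


# Hypothetical page and feed, for illustration only.
SAMPLE_HTML = """<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/rss+xml" title="Feed" href="http://example.com/feed/">
</head><body></body></html>"""

SAMPLE_RSS = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
  <title>Example Blog</title>
  <item>
    <title>First article</title>
    <link>http://example.com/posts/first-article</link>
    <dc:creator>Jane Doe</dc:creator>
    <pubDate>Mon, 06 Jan 2014 10:00:00 +0000</pubDate>
    <description>The first two or three lines of the article [...]</description>
  </item>
</channel>
</rss>"""

if __name__ == "__main__":
    print(find_feed_urls(SAMPLE_HTML))
    for entry in parse_items(SAMPLE_RSS):
        print(entry)
```

In a real crawler you would download the page and the feed over HTTP instead of using the inline samples. Note that the summary field holds whatever the publisher chose to put in <description>, which is typically just an excerpt.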
So why is this wrong, you ask? Because getting only part of the content defeats the purpose of a good crawler. Getting 2-3 lines out of the complete article is useless, not to mention that you don’t get the comments for the article (some sites provide a comments feed, but again it contains only a fraction of the comment content).
True, it’s fast, simple, very low on bandwidth, and you get structured data, but you don’t get the complete data, and in my book that disqualifies this method as a valid crawling option. You can use an RSS crawler as a starting point for discovering article URLs, but not as a content extractor.
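Here is a rough sketch of that "starting point" role, again in Python with only the standard library. The feed is used purely to discover article URLs, which are then queued so that a separate crawler can fetch and extract the full pages. The function name discover_article_urls, the seen set and the sample feed are my own illustrative assumptions, not part of any particular crawler.

```python
import xml.etree.ElementTree as ET
from collections import deque


def discover_article_urls(rss_xml, seen):
    """Return article URLs from the feed that are not yet in `seen`.

    `seen` is updated in place, so polling the same feed again
    only yields articles published since the previous poll.
    """
    new_urls = []
    for item in ET.fromstring(rss_xml).iter("item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            new_urls.append(link)
    return new_urls


# Hypothetical feed, for illustration only.
SAMPLE_FEED = """<rss version="2.0">
<channel>
  <item><link>http://example.com/posts/a</link></item>
  <item><link>http://example.com/posts/b</link></item>
  <item><link>http://example.com/posts/a</link></item>
</channel>
</rss>"""

seen = set()
# URLs waiting to be fetched in full by the real content extractor.
crawl_queue = deque(discover_article_urls(SAMPLE_FEED, seen))
```

The queue ends up holding each article URL once, and a second call with the same feed and the same seen set returns an empty list. The full-page download and extraction still has to be done by a proper crawler.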