After bashing various crawling techniques, I would like to describe the technique we use here at webhose.io, a technology developed over the past 8 years. Our crawlers were built with the following demands in mind:

- Efficient use of server resources, i.e. CPU & bandwidth
- Fast fetching and extraction of content
- Easy addition of new sites
So if RSS crawlers are bad and browser scraping isn’t efficient, what about computer vision web-page analyzers? This technology uses machine learning and computer vision to extract information from web pages, interpreting them visually much as a human being would.
In my previous blog post, I wrote about RSS crawlers and why they don’t really work. In this post I want to discuss the technique of using a headless browser to parse a website and extract its content. A headless browser is a web browser without a graphical user interface. The logic behind using a headless browser is that it renders the page just as a regular browser would, JavaScript included, so the crawler can extract whatever a human visitor would see.
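To make the idea concrete, here is a minimal sketch of headless-browser scraping using Selenium driving headless Chrome. The URL and the CSS selector are hypothetical placeholders, not part of the original post, and the exact elements to target would depend on the site being scraped.

```python
# Minimal headless-browser scraping sketch (assumes Selenium 4 and
# Chrome/chromedriver are installed; URL and selectors are hypothetical).
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")  # run Chrome without a GUI

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/article")  # hypothetical URL
    # The browser executes the page's JavaScript, so we extract
    # content from the fully rendered DOM, not the raw HTML source.
    title = driver.find_element(By.TAG_NAME, "h1").text
    body = driver.find_element(By.CSS_SELECTOR, "article").text  # hypothetical selector
    print(title, body[:80])
finally:
    driver.quit()
```

The cost of this fidelity is exactly the inefficiency discussed here: every page load spins up a full browser engine, consuming far more CPU and memory than fetching the HTML directly.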
One of the fastest, simplest, and unfortunately wrong ways of extracting content out of a website is by reading its RSS feeds. I will show you how it’s done and why it’s useless. Each RSS feed already contains the data, structured and ready for harvesting, so content extraction is indeed simple and fast. Let’s take a look at an example.
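To illustrate just how simple feed-based extraction is, here is a minimal sketch that parses an RSS 2.0 feed with Python’s standard library. The sample feed content is hypothetical, used only to show the structure.

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical RSS 2.0 feed for illustration.
RSS_SAMPLE = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <description>Hello world</description>
    </item>
  </channel>
</rss>"""

def parse_rss(xml_text):
    """Extract title, link, and description from each <item> in an RSS feed."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "description": item.findtext("description"),
        }
        for item in root.iter("item")
    ]

articles = parse_rss(RSS_SAMPLE)
print(articles[0]["title"])
```

The catch, of course, is everything the feed does *not* contain: most feeds carry only a handful of recent items, often truncated, and many sites publish no feed at all, which is why this approach falls short in practice.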