Crawling Horrors – Computer Vision Crawlers

So if RSS crawlers are bad and browser scraping isn’t efficient, what about computer vision web-page analyzers? This technology uses machine learning and computer vision to extract information from web pages, interpreting each page visually much as a human being would.

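To make that concrete, here is a minimal sketch of the approach, assuming Playwright for rendering and Tesseract OCR as a crude stand-in for a trained visual-extraction model (the URL and function name are placeholders): render the page in a real browser, screenshot it, and read the content from the pixels rather than the HTML.

```python
# A minimal sketch, not a production system: Playwright renders the
# page, and Tesseract OCR stands in for a learned visual model.
# Requires: pip install playwright pytesseract pillow
#           playwright install chromium
from io import BytesIO

import pytesseract
from PIL import Image
from playwright.sync_api import sync_playwright


def extract_text_visually(url: str) -> str:
    """Render the page as a browser would, then 'read' the pixels."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Screenshot the fully rendered page -- this is what a vision
        # model "sees"; the underlying HTML is never parsed.
        png = page.screenshot(full_page=True)
        browser.close()
    return pytesseract.image_to_string(Image.open(BytesIO(png)))


print(extract_text_visually("https://example.com"))
```
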
Computer vision crawlers present some great advantages over RSS, browser-based, or even code-based crawlers. They offer simplicity when it comes to DIY crawlers, i.e., letting non-developers teach the system which content needs to be extracted. In many cases they do a decent job of extracting structured content from sources they have no prior knowledge of.

So how am I going to ruin this one for you? Well, it suffers from some of the same downfalls as browser-based crawlers:

  • Slow, heavy, and resource-hungry, as it has to download all the content and render the page in order to “see” it (see the timing sketch after this list).
  • It won’t know what to do if content is revealed only by an action (like clicking on a comment to expand it).
  • In many cases, if the page is complicated enough, e.g., a discussion thread or a site with dynamic ads, it can get “confused” and extract the wrong content.

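To put a rough number on the first point, here’s a hedged timing sketch contrasting a plain HTML fetch with the full download-and-render pass a vision crawler needs before it can “see” anything. The URL is a placeholder, and the exact numbers will vary wildly from page to page:

```python
# A rough timing sketch, assuming requests and Playwright are
# installed; the URL is a placeholder and absolute timings will vary.
# It contrasts fetching raw HTML with the full render a vision
# crawler needs.
import time

import requests
from playwright.sync_api import sync_playwright

URL = "https://example.com"

start = time.perf_counter()
requests.get(URL, timeout=10)  # raw HTML only: a single request
print(f"plain fetch: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Wait for scripts, styles, images -- everything the page loads
    # before there is anything to "see".
    page.goto(URL, wait_until="networkidle")
    page.screenshot(full_page=True)
    browser.close()
print(f"full render: {time.perf_counter() - start:.2f}s")
```
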
As with many machine learning systems, there is a precision/recall tradeoff: if you want high precision (and you do), your recall will be low, which means that for many pages you won’t be able to extract the right content.

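Here’s a toy illustration of that tradeoff with made-up confidence scores (none of this comes from a real model): raising the confidence threshold buys precision at a direct cost in recall.

```python
# Toy data: (model confidence, 1 = correct extraction, 0 = wrong).
# The scores are invented purely to illustrate the tradeoff.
scored = [
    (0.95, 1), (0.90, 1), (0.85, 0), (0.80, 1),
    (0.70, 1), (0.60, 0), (0.55, 1), (0.40, 0),
]
total_correct = sum(label for _, label in scored)

for threshold in (0.5, 0.9):
    kept = [label for score, label in scored if score >= threshold]
    precision = sum(kept) / len(kept)   # of what we kept, how much was right
    recall = sum(kept) / total_correct  # of what was right, how much we kept
    print(f"threshold {threshold}: precision {precision:.2f}, recall {recall:.2f}")
```

Raising the threshold from 0.5 to 0.9 here pushes precision from about 0.71 to 1.00 while recall drops from 1.00 to 0.40: the crawler is almost never wrong, but it misses most of the content.
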
Computer vision crawlers are great for DIY missions and for specific sites that look roughly the same, but I’m afraid they’re not suited for large-scale, precise crawling.
