How to Extract Data from Websites: Scraping Tools, DIY or DaaS

Posted on September 7, 2017 by

This is part 2 of our guide to web data extraction. Read part 1 to learn about the questions to ask before you start, or download the complete Web Data Extraction Playbook (PDF).

Now that you’ve covered both the business and technical requirements for your web data extraction project (and if you haven’t, check out the previous post), you should already have a firm understanding of your goals and challenges. The next step is to start considering the various tools, technologies and techniques that are available to get the data you need.

There are dozens of free, freemium and premium tools that might be relevant for your web data extraction project, but we can schematically divide them into three subgroups:

DIY for Complete Control

The first option, which might be appealing to the more gung-ho developers among us, would be to simply write your own web crawler, scrape whatever data you need and run it as often as you need. You could write such a crawler yourself from scratch in Python, PHP or Java, or use one of many open source options.

The main advantage of this approach is the level of flexibility and customizability you have: you can define exactly what data you want to grab, at what frequency, and how you would like to parse and store the data in your own databases. This allows you to tailor your web extraction solution to the exact scope of your initiative. If you’re trying to answer a specific, relatively narrow question, or monitor a very specific group of websites on an ad-hoc basis, this could be a good and simple solution.
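To make the DIY option a little more concrete, here is a minimal sketch of the parsing half of a home-grown scraper, using only the Python standard library. The names PageParser, parse_page and fetch are our own illustrative choices, not part of any particular tool, and a real project would also need error handling, politeness delays and robots.txt checks.

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects the page title and every link found in <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Download a page as text (not called in this sketch, to keep it offline)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse_page(html, base_url):
    """Return a simple record you could store in your own database."""
    parser = PageParser(base_url)
    parser.feed(html)
    return {"url": base_url, "title": parser.title.strip(), "links": parser.links}
```

You decide which fields end up in the record, which is exactly the flexibility the DIY route buys you. It also means every new field or site layout is more code for you to write and maintain.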

However, manual crawling and scraping is not without its downsides, especially when it comes to more complex projects.

If you’re looking to understand wider trends across a large group of sites, some of which you might not even know about in advance, DIY crawling becomes much more complex – requiring larger investments in computational resources and developer hours that could be better spent on the core aspects of your business.
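The jump in complexity is easier to see in code. Below is a rough sketch of a breadth-first crawl that discovers sites outside the original seed list by following links. The function names crawl and domains, and the get_links callback, are hypothetical; in practice get_links would fetch and parse each page, and you would also need queuing, deduplication and storage that scale beyond a single machine.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seed_urls, get_links, max_pages=100):
    """Breadth-first crawl that follows links to pages not in the seed list.

    get_links(url) is supplied by the caller and returns the outgoing
    links of a page.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                # Remember every URL we have queued so we never visit it twice.
                seen.add(link)
                queue.append(link)
    return visited


def domains(urls):
    """Return the distinct host names covered by a list of URLs."""
    return sorted({urlparse(u).netloc for u in urls})
```

Even in this toy form, the set of seen URLs and the queue grow with every page visited, which is where the extra computational resources and developer hours start to add up.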

To learn more about the pros and cons of building your own web crawling infrastructure, check out our Build vs Buy comparison guide.

Scraping Tools for Ad-Hoc Analysis

Another common technique to turn websites into data is to purchase a commercial scraping tool and use it to crawl, extract and parse whichever areas of the web you need for your project. There are dozens of scraping tools available, with features and pricing varying wildly – from simple browser-based tools that mimic a regular user’s behavior to highly sophisticated visual and AI-based products.

Scraping tools remove some of the complications of the DIY approach, since your developers will be able to focus on their (and your company’s) core competencies rather than spending precious time and resources on developing crawlers. However, they are still best suited for an ad-hoc project – i.e., scraping a specific group of websites at specific time intervals, to answer a specific set of questions. Scraping tools are very useful for these types of ad-hoc analyses, and they have the added advantage of generally being easy to use and allowing you to customize the way the extracted data is parsed and stored.

On the other hand, if you’re looking to set up a larger scale operation in which the focus is not on custom parsing but rather on comprehensive coverage of the open web, frequent data refresh rates and easy access to massive datasets, web scraping tools are less viable as you run into several types of limitations:

  • By definition, web scraping tools only grab the data from whichever website you’ve “pointed” them at. If you don’t know exactly where to look in advance, you could miss important data – e.g., in a media monitoring use case where you’re not aware of every possible publication that could mention your clients.
  • Advanced scraping tools are built for customized extraction, and often have very advanced capabilities in terms of identifying and parsing the data for analytical usage. However, this often manifests itself in pricing models that are based on the number of sites scraped – resulting in ballooning costs for larger projects.
  • Developer overhead still exists in the form of managing lists of crawled sites and maintaining the scraping tools.
  • Since the data is not collected before you activate the scraping tool, you won’t have access to historical data.

Modern scraping tools offer powerful solutions for ad-hoc projects, giving you highly sophisticated means of grabbing and parsing data from specific websites. However, they are less scalable and viable when it comes to building a comprehensive monitoring solution for a large “chunk” of the World Wide Web, and their advanced capabilities could become overkill in terms of pricing and time-to-production when all you really need is access to web data in machine-readable format.

Read more about the limitations of scraping tools.

Web Data as a Service for Scalable Operations

The third option is to forgo crawling, scraping and parsing entirely and rely on a data as a service (DaaS) provider. In this model you would purchase access to clean, structured and organized data extracted by the DaaS provider, enabling you to skip the entire process of building or buying your own extraction infrastructure and focus on the analysis, research or product you’re developing.

In this scenario you would generally have less ability to apply customized parsing on the data as it is extracted, instead relying on the data structure dictated by the provider. Additionally, you would need to contact your DaaS provider if you need to add sources (rather than simply point your purchased or in-house scraping tool at whichever source you’re interested in). These factors make web data as a service less viable for ad-hoc projects that require very specific sites to be extracted into very specific data structures.

However, for larger operations, web data as a service offers several unique advantages in terms of scale and ease of development:

  • Working with a proprietary provider allows you to leverage best-in-class crawling and scraping technologies, rather than having your own developers try to re-invent the wheel.
  • A reliable web DaaS provider will offer comprehensive coverage, enabling you to immediately access data from any relevant source on the web. Smart indexing and crawling enable new sources to be added automatically as content spreads across the web, rather than waiting for you to “point” at them.
  • Structured data is easily accessible via an API call, making integration dead simple. To see how this works, you can check out an example of the webhose.io web data API.
  • The ability to consume data on-demand gives you more flexibility to launch and grow your data-driven operations without making any large upfront investments.
  • You get comprehensive coverage of the web without having to maintain your own lists of sites to crawl.

These and other advantages make web data as a service the best solution for media monitoring, financial analysis, cyber security, text analytics and other use cases that center around fast access to comprehensive, frequently updated data feeds.
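As a rough illustration of what API-based integration can look like, here is a short sketch of building a query and reading a structured response. The endpoint URL, the parameter names (token, q, format, ts) and the response fields (posts, title, url) are all hypothetical placeholders, not the documented interface of any specific provider, so check your provider’s API reference for the real ones.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
API_URL = "https://api.example.com/v1/search"


def build_query_url(token, query, since_ts=None):
    """Build the request URL; parameter names here are assumptions."""
    params = {"token": token, "q": query, "format": "json"}
    if since_ts is not None:
        # Only ask for items newer than this timestamp.
        params["ts"] = since_ts
    return API_URL + "?" + urlencode(params)


def extract_posts(response_body):
    """Turn a JSON response body into a list of (title, url) tuples."""
    payload = json.loads(response_body)
    return [(post["title"], post["url"]) for post in payload.get("posts", [])]
```

The point of the sketch is what is missing: there is no crawler, no HTML parsing and no list of sites to maintain. The trade-off is that the record structure is whatever the provider returns.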

    Feature Comparison of Web Data Extraction Methods

    Feature            DIY                Scraping Tools  Data as a Service
    Typical Scale      Small              Small           Large
    Custom Parsing     Yes                Yes             No
    Historical Data    No                 No              Yes
    Price              Project-dependent  Tool-dependent  Based on usage
    Development Costs  High               Low             Low
    Coverage           Low                Low             High

    Want to learn more?

    Our comprehensive Guide to Extracting Data from Websites covers the topics that appear in this post, as well as gathering requirements before the project starts and gauging ROI when it’s up and running. Download the guide in glorious PDF format now.

