Category: Technology

How to Extract Data from Websites: Scraping Tools, DIY or DaaS

Posted on September 7, 2017 by

This is part 2 of our guide to web data extraction. Read part 1 to learn about the questions to ask before you start, or download the complete Web Data Extraction Playbook (PDF). Now that you’ve covered both the business and technical requirements for your web data extraction project (and if you haven’t, check out

Continue reading

Posted in API, Data Extraction, Technology

How to use rated reviews for sentiment classification

Posted on February 9, 2017 by

Sentiment classification is a fascinating use case for machine learning. Regardless of complexity, you need two core components to deliver meaningful results: a machine learning engine and a significant volume of structured data to train that engine. Last month, we added the new “rating” field for rated review sites covered in the Webhose.io threaded

Continue reading

Posted in Big Data, Technology
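
As a rough illustration of how rated reviews can double as labeled training data, the sketch below maps a numeric rating to a coarse sentiment label. Only the `rating` field name comes from the excerpt above; the 1-5 scale, the thresholds, the `text` key, and both function names are assumptions made for this example, not documented Webhose.io behavior.

```python
def rating_to_label(rating, low=2, high=4):
    """Map a numeric rating to a coarse sentiment label.

    Assumes a 1-5 scale: ratings <= low are negative, ratings >= high
    are positive, and anything in between is treated as neutral.
    """
    if rating <= low:
        return "negative"
    if rating >= high:
        return "positive"
    return "neutral"


def build_training_set(reviews):
    """Turn rated reviews into (text, label) pairs, skipping unrated ones."""
    pairs = []
    for review in reviews:
        rating = review.get("rating")
        if rating is None:
            continue  # an unrated post carries no label
        pairs.append((review["text"], rating_to_label(rating)))
    return pairs
```

The resulting pairs could then be fed to whichever classifier you prefer; the thresholds are worth tuning, since a 3-star review is not always neutral in practice.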

Web Data Visualization of The Hillary Clinton Top 100 Network Graph

Posted on October 20, 2016 by

The web data business can get pretty tricky, especially when your job is to extract the broadest possible dataset from the planet’s biggest database. Last week, Webhose CEO Ran Geva ran a fun experiment to visualize Hillary Clinton’s web network. More precisely, who are the top 100 people most frequently mentioned in news articles and blog

Continue reading

Posted in API, Big Data, Technology

Top 10 Big Data Stories Leading the Conversation

Posted on September 26, 2016 by

In the right hands, crawled web data can tell an amazing story. We were interested in the top 10 news stories, sorted by social shares on Facebook and LinkedIn, so we set up a simple news API request. We were looking for stories published over the past 30 days and returned by an exact match query for the term “big data”. Here

Continue reading

Posted in Big Data, Technology
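
The ranking step described in the excerpt above can be sketched locally, without calling any API. This is a minimal sketch that assumes each post has already been fetched into a dict; the keys `published`, `facebook_shares`, and `linkedin_shares` and the function name `top_stories` are hypothetical and do not reflect the actual response format.

```python
from datetime import datetime, timedelta, timezone


def top_stories(posts, now, days=30, limit=10):
    """Return the most-shared posts published within the last `days` days."""
    cutoff = now - timedelta(days=days)
    recent = [p for p in posts if p["published"] >= cutoff]

    def total_shares(post):
        # Missing counters are treated as zero shares.
        return post.get("facebook_shares", 0) + post.get("linkedin_shares", 0)

    return sorted(recent, key=total_shares, reverse=True)[:limit]
```

Passing `now` explicitly, for example `datetime.now(timezone.utc)`, keeps the function easy to test against a fixed date.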

5 Ways to Measure the Impact of Crawled Web Data on Your Business

Posted on July 27, 2016 by

The analysis you provide is only as good as the raw data you start with. Although data from the open web is often perceived as a commodity, not all crawled data is created equal. Whether you’re relying on a proprietary crawling technology, tapping into a vendor’s firehose, or implementing a combination of both strategies –

Continue reading

Posted in Big Data, Technology

The 15 Data Experts You Should be Following on Twitter

Posted on January 14, 2016 by

Twitter is a phenomenal place not only to connect with peers in the analytics industry but also to follow and learn from its leading authorities. Unfortunately, the Twitter marketplace is crowded, and trying to wade through it and research exactly who’s who on your own is overwhelming. Even worse is making your Twitter decisions based on

Continue reading

Posted in Big Data, Technology

How to Extract Data from a Website: 5 Steps to Transform Unstructured Data into Business Insights

Posted on December 8, 2015 by

Big data is big business. And for good reason. As Harvard Business Review recently reported, an exhaustive study of 330 North American companies led by the MIT Center for Digital Business in conjunction with McKinsey’s Business Technology Office revealed that the use of data in business decisions like product development, hiring and firing, as well

Continue reading

Posted in Big Data, Technology

Dead simple {for devs} python crawler (script) for extracting structured data from any website into CSV

Posted on August 16, 2015 by

In my previous post I described a very basic web crawler I wrote that can randomly scour the web and mirror/download websites. Today I want to share with you a very simple script that can extract structured data from <almost> any website. Use the following script to extract specific information from any website (e.g. prices, ids, titles,

Continue reading

Posted in Technology
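
The script from that post is not reproduced in this excerpt. As a stand-in, here is a minimal sketch of the general idea (pull named fields out of HTML and write them to CSV) using only the Python standard library. The class name `ClassTextExtractor`, the function `html_to_csv`, and the choice of CSS class names as field selectors are all assumptions for illustration, and the sketch assumes each field sits in a leaf element with no nested tags.

```python
import csv
import io
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of elements whose class attribute names a wanted field."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.rows = []        # one dict per completed record
        self._current = {}    # record currently being filled
        self._field = None    # field whose text is being captured

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self._field = name
                self._current[name] = ""
                break

    def handle_data(self, data):
        if self._field is not None:
            self._current[self._field] += data.strip()

    def handle_endtag(self, tag):
        if self._field is not None:
            self._field = None
            # A record is complete once every wanted field has been seen.
            if set(self._current) == self.wanted:
                self.rows.append(self._current)
                self._current = {}


def html_to_csv(html, fields):
    """Return CSV text with one row per complete record found in `html`."""
    parser = ClassTextExtractor(fields)
    parser.feed(html)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(parser.rows)
    return out.getvalue()
```

For a page that marks up products with `title` and `price` classes, `html_to_csv(page_source, ["title", "price"])` would return the CSV text, which can then be written to a file.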

Tiny basic multi-threaded web crawler in Python

Posted on August 12, 2015 by

If you need a simple web crawler that will scour the web for a while to download random sites’ content, this code is for you. Usage: $ python tinyDirtyIffyGoodEnoughWebCrawler.py http://cnn.com where http://cnn.com is your seed site. It could be any site that contains content and links to other sites. My colleagues described this piece of code I wrote

Continue reading

Posted in Technology

How we quadrupled the performance of Elasticsearch

Posted on July 19, 2015 by

Well, that’s a misleading title. We actually quadrupled the performance of our brand monitoring alert system that uses Elasticsearch’s Percolator, but that would have been a much longer title. Some background: Buzzilla has two main products. The first is Webhose.io, which provides businesses worldwide access to structured data from the open web, and the second

Continue reading

Posted in Technology

Webhose.io Tips & Tricks: Search for Reviews

Posted on December 10, 2014 by

Are you looking to focus your data search specifically on consumer-generated reviews? Here are a couple of simple Webhose.io tricks that might help. Limit your query to specific sites: you can limit your search to specific “review sites” like amazon.com, bestbuy.com, newegg.com, cnet.com, engadget.com, pcmag.com, etc. Here is an example of how you should

Continue reading

Posted in Technology
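
The post's own example is cut off in this excerpt, so the following is not it. As a hedged sketch of the "limit your query to specific sites" tip, the helper below builds a query string from a keyword phrase and a list of review sites. The `site:` filter prefix and the `OR` operator are assumptions about the query syntax, and `build_review_query` is a hypothetical name; check the Webhose.io documentation for the exact form.

```python
def build_review_query(keywords, sites):
    """Combine an exact-match phrase with an OR-ed list of site filters."""
    # Assumed syntax: site:<domain> restricts results to one site.
    site_filter = " OR ".join(f"site:{site}" for site in sites)
    return f'"{keywords}" ({site_filter})'
```

For example, `build_review_query("wireless headphones", ["amazon.com", "bestbuy.com"])` yields `"wireless headphones" (site:amazon.com OR site:bestbuy.com)`.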