Author Archives: Ran Geva

Dead simple {for devs} Python crawler (script) for extracting structured data from any website into CSV

Posted on August 16, 2015 by Ran Geva

In my previous post I wrote about a very basic web crawler that can randomly scour the web and mirror/download websites. Today I want to share a very simple script that can extract structured data from (almost) any website. Use the following script to extract specific information from any website (e.g. prices, ids, titles,

Continue reading

Posted in Technology

Tiny basic multi-threaded web crawler in Python

Posted on August 12, 2015 by Ran Geva

If you need a simple web crawler that will scour the web for a while to download random sites’ content, this code is for you. Usage: $ python tinyDirtyIffyGoodEnoughWebCrawler.py http://cnn.com where http://cnn.com is your seed site. It could be any site that contains content and links to other sites. My colleagues described this piece of code I wrote

Continue reading

Posted in Technology

How we quadrupled the performance of Elasticsearch

Posted on July 19, 2015 by Ran Geva

Well, that’s a misleading title. We actually quadrupled the performance of our brand monitoring alert system that uses Elasticsearch’s Percolator, but that would have been a much longer title. Some background: Buzzilla has two main products. The first is Webhose.io, which provides businesses worldwide with access to structured data from the open web, and the second

Continue reading

Posted in Technology

Building a Better Search Query

Posted on December 10, 2014 by Ran Geva

Many factors can affect streaming data relevancy. When the data you consume isn’t ordered by relevancy, but rather by the time it was crawled, getting the relevant posts is essential. I would like to share with you a few tips you can use to greatly increase the relevancy of the data you consume via the Webhose.io API

Continue reading

Posted in Technology

Webhose.io Tips & Tricks: Search for Reviews

Posted on December 10, 2014 by Ran Geva

Are you looking to focus your data search specifically on consumer-generated reviews? Here are a couple of simple Webhose.io tricks that might help. Limit your query to specific sites: You can limit your search to specific “review sites” like amazon.com, bestbuy.com, newegg.com, cnet.com, engadget.com, pcmag.com, etc. Here is an example of how you should

Continue reading

Posted in Technology

Vertical aggregation & Pattern matching crawlers

Posted on November 27, 2014 by Ran Geva

After bashing various crawling techniques, I would like to describe the technique we use here at Webhose.io, a technology that was developed over the past 8 years. Our crawlers were developed with the following demands in mind: Efficient on server resources, i.e. CPU and bandwidth; Fast in fetching and extracting content; Easily add new sites

Continue reading

Posted in Technology

Crawling Horrors – Browser Scraping

Posted on November 25, 2014 by Ran Geva

In my previous blog post, I wrote about RSS crawlers, and why they don’t really work. In this post I want to discuss the technique of using a headless browser to parse a website and extract its content. A headless browser is a web browser without a graphical user interface. The logic behind using a

Continue reading

Posted in Technology

Crawling Horrors – RSS Crawlers

Posted on November 24, 2014 by Ran Geva

One of the fastest, simplest and unfortunately wrong ways of extracting content out of a website is by reading its RSS feeds. I will show you how it’s done and why it’s useless. Each RSS feed already contains the data, structured and ready for harvesting, so content extraction is indeed simple and fast. Let’s take

Continue reading

Posted in Technology