
Dead simple {for devs} Python crawler (script) for extracting structured data from any website into CSV

Posted on August 16, 2015 by Ran

In my previous post I wrote about a very basic web crawler that can randomly scour the web and mirror/download websites. Today I want to share a very simple script that can extract structured data from (almost) any website.

Use the following script to extract specific information from any website (e.g. prices, IDs, titles, phone numbers, etc.). Populate the “fields” parameter with the names and the patterns (regular expressions) of the data you want to extract. In this specific example, I extract the product names, prices, ratings and images from Amazon.com.

import sys, thread, Queue, re, urllib2, urlparse, time, csv
### Set the site you want to crawl & the patterns of the fields you want to extract ###
siteToCrawl = "http://www.amazon.com/"
fields = {}
fields["Title"] = '<title>(.*?)</title>'
fields["Rating"] = 'title="(S+) out of 5 stars"'
fields["Price"] = 'data-price="(.*?)"'
fields["Image"] = 'src="(http://ecx.images-amazon.com/images/I/.*?)"'
########################################################################
dupcheck = set()  # URLs already queued, so we never crawl the same page twice
q = Queue.Queue(25)
q.put(siteToCrawl)
csvFile = open("output.csv", "w", 0)  # unbuffered, so rows appear in the file as they are written
csvTitles = dict(fields)
csvTitles["Link"] = ""
writer = csv.DictWriter(csvFile, fieldnames=csvTitles)
writer.writeheader()
def queueURLs(html, origLink):
    # Find the links on the page and queue the ones that belong to the crawled domain
    for url in re.findall('''<a[^>]+href=["'](.[^"']+)["']''', html, re.I):
        try:
            if url.startswith("http") and urlparse.urlparse(url).netlock !=  urlparse.urlparse(siteToCrawl).netlock: # Make sure we keep crawling the same domain
                continue
        except Exception:
            continue
        # Drop fragments; prefix relative links with the scheme and host of the page they came from
        link = url.split("#", 1)[0] if url.startswith("http") else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + url.split("#", 1)[0]
        if link in dupcheck:
            continue
        dupcheck.add(link)
        if len(dupcheck) > 99999:  # Crude cap so the seen-set does not grow without bound on long crawls
            dupcheck.clear()
        q.put(link)
def analyzePage(html, link):
    # Run every field's regular expression against the raw HTML and write one CSV row
    print "Analyzing %s" % link
    output = {}
    for key, pattern in fields.iteritems():
        m = re.search(pattern, html, re.I | re.S)
        if m:
            output[key] = m.group(1)
    output["Link"] = link
    writer.writerow(output)
def getHTML(link):
    # Fetch a page, extract the fields from it, then queue any same-domain links it contains
    try:
        request = urllib2.Request(link)
        request.add_header('User-Agent', 'Structured Data Extractor')
        html = urllib2.build_opener().open(request).read()
        analyzePage(html,link)
        queueURLs(html, link)
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception, e:
        print e
# Main loop: hand each queued URL to a new worker thread, throttled to roughly two requests per second
while True:
    thread.start_new_thread(getHTML, (q.get(),))
    time.sleep(0.5)
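
The script above is written for Python 2 (thread, Queue, urllib2 and the print statement). If you are on Python 3, here is a minimal sketch of just the extraction step, assuming the same field patterns as above and the standard-library urllib.request in place of urllib2; it is not a full port of the crawling loop:

import csv, re, urllib.request

siteToCrawl = "http://www.amazon.com/"
fields = {
    "Title": r'<title>(.*?)</title>',
    "Rating": r'title="(\S+) out of 5 stars"',
    "Price": r'data-price="(.*?)"',
    "Image": r'src="(http://ecx\.images-amazon\.com/images/I/.*?)"',
}

def extract(link):
    # Fetch the page with a custom User-Agent and apply each field's pattern to the raw HTML
    request = urllib.request.Request(link, headers={"User-Agent": "Structured Data Extractor"})
    html = urllib.request.urlopen(request).read().decode("utf-8", errors="replace")
    row = {"Link": link}
    for name, pattern in fields.items():
        m = re.search(pattern, html, re.I | re.S)
        if m:
            row[name] = m.group(1)
    return row

with open("output.csv", "w", newline="") as csvFile:
    writer = csv.DictWriter(csvFile, fieldnames=list(fields) + ["Link"])
    writer.writeheader()
    writer.writerow(extract(siteToCrawl))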

Some notes:

  • I have set a user-agent header, as some websites block crawling if no user agent is present
  • No external libraries are required; everything used ships with Python’s standard library
  • You can define as many fields to extract as you’d like. The field name is the “key” in the “fields” parameter (see the short example after these notes)
  • As I use regular expressions to define where the content is, no DOM parsing is performed, so malformed HTML pages are a non-issue
  • Each time you run the script it will overwrite the contents of output.csv
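
For example, to pull other kinds of data you only have to change the “fields” dictionary. The field names and patterns below are hypothetical and would need to be adapted to the markup of the site you are actually crawling:

fields = {}
fields["ProductID"] = 'data-product-id="(.*?)"'   # hypothetical attribute name
fields["Heading"] = '<h1[^>]*>(.*?)</h1>'         # first <h1> on the page
fields["Phone"] = r'(\+?\d[\d\-\s()]{7,}\d)'      # rough phone-number pattern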

Enjoy,

Ran


