Back to Blog

Tiny basic multi-threaded web crawler in Python

August 12, 2015

If you need a simple web crawler that will scour the web for a while to download random site’s content – this code is for you.

Usage:

$ python tinyDirtyIffyGoodEnoughWebCrawler.py https://cnn.com

1	$ python tinyDirtyIffyGoodEnoughWebCrawler.py https://cnn.com

Where https://cnn.com is your seed site. It could be any site that contains content and links to other sites.

My colleagues described this piece of code I wrote as “Dirty”, “Iffy”, “Bad”, “Not very good”. I say, it gets the job done and downloads thousands of pages from multiple pages in a matter of hours. No setup is required, no external imports, just run the following python code with a seed site and sit back (or go do something else because it could take a few hours, or days depending on how much data you need).

tinyDirtyIffyGoodEnoughWebCrawler.py

import sys, thread, Queue, re, urllib, urlparse, time, os, sys
dupcheck = set()  
q = Queue.Queue(100) 
q.put(sys.argv[1]) 
def queueURLs(html, origLink): 
    for url in re.findall('''&lt;a[^&gt;]+href=["'](.[^"']+)["']''', html, re.I): 
        link = url.split("#", 1)[0] if url.startswith("http") else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + url.split("#", 1)[0] 
        if link in dupcheck:
            continue
        dupcheck.add(link)
        if len(dupcheck) &gt; 99999: 
            dupcheck.clear()
        q.put(link) 
def getHTML(link): 
    try:
        html = urllib.urlopen(link).read() 
        open(str(time.time()) + ".html", "w").write("" % link  + "n" + html) 
        queueURLs(html, link) 
    except (KeyboardInterrupt, SystemExit): 
        raise
    except Exception:
        pass
while True:
    thread.start_new_thread( getHTML, (q.get(),)) 
    time.sleep(0.5)

import sys, thread, Queue, re, urllib, urlparse, time, os, sys

dupcheck = set()

q = Queue.Queue(100)

q.put(sys.argv[1])

def queueURLs(html, origLink):

for url in re.findall('''<a[^>]+href=["'](.[^"']+)["']''', html, re.I):

link = url.split("#", 1)[0] if url.startswith("http") else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + url.split("#", 1)[0]

if link in dupcheck:

continue

dupcheck.add(link)

if len(dupcheck) > 99999:

dupcheck.clear()

q.put(link)

def getHTML(link):

try:

html = urllib.urlopen(link).read()

open(str(time.time()) + ".html", "w").write("" % link + "n" + html)

queueURLs(html, link)

except (KeyboardInterrupt, SystemExit):

raise

except Exception:

pass

while True:

thread.start_new_thread( getHTML, (q.get(),))

time.sleep(0.5)

Features:

Multi-threaded – for fastness
Duplication elimination (kinda) – for link uniqueness
Saves both source and its link – for the purpose it was built
FREE

Enjoy,

Ran

code crawler python Web Data Collection

Spread the News

Subscribe to our newsletter for more news and updates!

By submitting you agree to Webz.io's Privacy Policy and further marketing communications.

Tiny basic multi-threaded web crawler in Python

Subscribe to our newsletter for more news and updates!

Feed Your Machines the Data They Need

Feed Your Machines the Data They Need

Tiny basic multi-threaded web crawler in Python

Subscribe to our newsletter for more news and updates!

Read More

AI Takeover? 4 Big Web Data Predictions for 2024

Structured or Unstructured Data? The Big Web Data Question for Businesses

Common Crawl vs. Webz.io Data: Which One Works Best for Large Language Models?

Feed Your Machines the Data They Need

Feed Your Machines the Data They Need