
Tiny basic multi-threaded web crawler in Python

Posted on August 12, 2015 by Ran Geva

If you need a simple web crawler that will scour the web for a while and download random sites’ content – this code is for you.

Usage:

$ python tinyDirtyIffyGoodEnoughWebCrawler.py http://cnn.com

Where http://cnn.com is your seed site. It could be any site that contains content and links to other sites.

My colleagues described this piece of code I wrote as “Dirty”, “Iffy”, “Bad”, “Not very good”. I say it gets the job done and downloads thousands of pages from multiple sites in a matter of hours. No setup is required and there are no external imports; just run the following Python code with a seed site and sit back (or go do something else, because it could take a few hours, or days, depending on how much data you need).

tinyDirtyIffyGoodEnoughWebCrawler.py

# Python 2: relies on the thread, Queue, urllib and urlparse modules
import sys, thread, Queue, re, urllib, urlparse, time

dupcheck = set()       # URLs we have already queued
q = Queue.Queue(100)   # crawl frontier
q.put(sys.argv[1])     # seed URL from the command line

def queueURLs(html, origLink):
    # Pull every href out of the page and queue it for crawling
    for url in re.findall('''<a[^>]+href=["'](.[^"']+)["']''', html, re.I):
        # Resolve relative links against the page they came from, and drop fragments
        link = url.split("#", 1)[0] if url.startswith("http") else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + url.split("#", 1)[0]
        if link in dupcheck:
            continue
        dupcheck.add(link)
        if len(dupcheck) > 99999:  # keep the dedup set from growing without bound
            dupcheck.clear()
        q.put(link)

def getHTML(link):
    try:
        html = urllib.urlopen(link).read()
        # Save the page, with its source URL on the first line as an HTML comment
        open(str(time.time()) + ".html", "w").write("<!-- %s -->\n%s" % (link, html))
        queueURLs(html, link)
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception:
        pass

while True:
    thread.start_new_thread(getHTML, (q.get(),))  # one new thread per URL
    time.sleep(0.5)
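
Each page is saved to the directory you run the script from, as a timestamp-named .html file whose first line is an HTML comment holding the source URL. After a short run the directory will look roughly like this (the filenames below are only an illustration):

$ ls
1439371201.12.html  1439371203.48.html  1439371205.91.html  tinyDirtyIffyGoodEnoughWebCrawler.py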

Features:

  • Multi-threaded – for fastness
  • Duplication elimination (kinda) – for link uniqueness
  • Saves both source and its link – for the purpose it was built
  • FREE
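
One caveat: the script as written is Python 2 only (thread, Queue, urllib.urlopen and urlparse were all renamed or moved in Python 3). If you are on Python 3, a rough, untested port might look like the sketch below; the logic is unchanged, only the module names and the bytes-to-text decoding differ.

# Rough Python 3 sketch of the same crawler (assumed port, not the original script)
import sys, _thread, queue, re, time
import urllib.request, urllib.parse

dupcheck = set()
q = queue.Queue(100)
q.put(sys.argv[1])

def queueURLs(html, origLink):
    for url in re.findall('''<a[^>]+href=["'](.[^"']+)["']''', html, re.I):
        link = url.split("#", 1)[0] if url.startswith("http") else \
            '{uri.scheme}://{uri.netloc}'.format(uri=urllib.parse.urlparse(origLink)) + url.split("#", 1)[0]
        if link in dupcheck:
            continue
        dupcheck.add(link)
        if len(dupcheck) > 99999:
            dupcheck.clear()
        q.put(link)

def getHTML(link):
    try:
        # urlopen returns bytes in Python 3, so decode before regexing and saving
        html = urllib.request.urlopen(link).read().decode("utf-8", "ignore")
        open(str(time.time()) + ".html", "w").write("<!-- %s -->\n%s" % (link, html))
        queueURLs(html, link)
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception:
        pass

while True:
    _thread.start_new_thread(getHTML, (q.get(),))
    time.sleep(0.5)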

Enjoy,

Ran



8 thoughts on “Tiny basic multi-threaded web crawler in Python”

  1. jimcen33 says:

    Where can I locate the result? I cannot find it.

  2. Ran Geva says:

    In the folder you run the script in.

  3. jimcen33 says:

    [2] 7245
    [3] 7246
    [3]+ Done fromSearch=true

    That’s the command log it returned. I cannot find any newly generated folder when I list all the files (ls -a).

  4. Ran Geva says:

    Originally I removed the exception printing to keep things clean, but you can replace

        except Exception:
            pass

    with

        except Exception as e:
            print e.message

    and see if you are getting an error. It might be that the site you are trying to crawl is blocking the crawler. What site are you trying to run it on?

  5. jimcen33 says:

    New message: not all arguments converted during string formatting.
    It’s a job search website that has dynamic response content. I guess this crawler does not work on this type of website then. Need to come up with my own. 🙁

  6. Ran Geva says:

    I’m afraid this simple crawler works only on static pages. There are multiple methods to work with dynamic content – the simplest one (although the heaviest) would be using a headless browser, but it would take more than a few lines of code to make it work :/ (there is a rough sketch of this after the comments below)

  7. Phat G.Ha says:

    Wouldn’t it be safer to use the newer threading module?

  8. Ran Geva says:

    Yes it would be safer but it won’t be tiny anymore 🙂
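
A couple of rough sketches tied to the comments above; both are illustrations only, not code from the post or the thread.

First, the headless-browser approach mentioned in comment 6, shown here with Selenium (an assumption; any headless browser would do, and it needs the selenium package plus a matching browser driver installed):

# Illustrative only: fetch a JavaScript-rendered page with a headless browser
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")     # run Chrome without a window
driver = webdriver.Chrome(options=options)

driver.get("http://example.com")       # the browser executes the page's JavaScript
html = driver.page_source              # rendered HTML, ready to feed to queueURLs()
driver.quit()

Second, the “not tiny anymore” version from comment 8: the same loop rewritten with the threading module, reusing q and getHTML from the script above and swapping thread.start_new_thread for a small pool of worker threads:

import threading, time

def worker():
    # Each worker pulls URLs off the shared queue and crawls them
    while True:
        getHTML(q.get())

# A small fixed pool of daemon threads replaces the per-URL thread spawning
for _ in range(10):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

while True:
    time.sleep(1)  # keep the main thread alive while the workers crawl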
