Does anyone know of a good Python-based web crawler that I could use?

I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open source crawlers but I'd prefer something written in Python. I realize that I could probably just use one of the tools on the Wikipedia page and wrap it in Python. I might end up doing that - if anyone has any advice about any of those tools, I'm open to hearing about them. I've used Heritrix via its web interface and I found it to be quite cumbersome. I definitely won't be using a browser API for my upcoming project.

Thanks in advance. Also, this is my first SO question!


Asked by: Kimberly481 | Posted: 01-10-2021






Answer 1

  • Mechanize is my favorite; great high-level browsing capabilities (super-simple form filling and submission).
  • Twill is a simple scripting language built on top of Mechanize.
  • BeautifulSoup + urllib2 also works quite nicely.
  • Scrapy looks like an extremely promising project; it's new.

Answered by: Carlos457 | Posted: 02-11-2021



Answer 2

Use Scrapy.

It is a Twisted-based web crawler framework. It is still under heavy development, but it already works. It has many goodies:

  • Built-in support for parsing HTML, XML, CSV, and JavaScript
  • A media pipeline for scraping items with images (or any other media) and downloading the image files as well
  • Support for extending Scrapy by plugging in your own functionality using middlewares, extensions, and pipelines
  • Wide range of built-in middlewares and extensions for handling compression, cache, cookies, authentication, user-agent spoofing, robots.txt, statistics, crawl depth restriction, etc.
  • Interactive scraping shell console, very useful for developing and debugging
  • Web management console for monitoring and controlling your bot
  • Telnet console for low-level access to the Scrapy process
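
To give a feel for what robots.txt handling involves, here is a minimal sketch using Python 3's standard urllib.robotparser. It is not Scrapy's middleware, just the same idea in isolation. The robots.txt content, the "example-bot" user agent, and the make_robot_checker helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_robot_checker(robots_txt, user_agent="example-bot"):
    """Return a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


allowed = make_robot_checker(ROBOTS_TXT)
print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data.html"))
```

A crawler would call a check like this before queuing each URL and skip anything the site disallows.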

Example code to extract information about all torrent files added today on the Mininova torrent site, using an XPath selector on the HTML returned:

class Torrent(ScrapedItem):
    pass

class MininovaSpider(CrawlSpider):
    domain_name = 'mininova.org'
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(RegexLinkExtractor(allow=[r'/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = Torrent()

        torrent.url = response.url
        torrent.name = x.x("//h1/text()").extract()
        torrent.description = x.x("//div[@id='description']").extract()
        torrent.size = x.x("//div[@id='info-left']/p[2]/text()[2]").extract()
        return [torrent]
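
If you want to see what XPath expressions like the ones above select without installing Scrapy, here is a rough Python 3 sketch using the standard xml.etree.ElementTree module instead of Scrapy's selector. The page markup is invented, and ElementTree only understands a subset of XPath and needs well-formed markup, so treat it as an illustration of the selection step, not a replacement for the spider. The positional text()[2] lookup is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for a torrent detail page.
PAGE = """
<html><body>
  <h1>Example torrent</h1>
  <div id="description">An example description.</div>
</body></html>
"""


def extract_fields(markup):
    """Pull the heading and description out of the page with XPath-style paths."""
    root = ET.fromstring(markup.strip())
    return {
        # './/h1' means "any h1 below the root", like '//h1' in full XPath
        "name": root.findtext(".//h1"),
        # Attribute predicates such as [@id='...'] are supported by ElementTree
        "description": root.findtext(".//div[@id='description']"),
    }


print(extract_fields(PAGE))
```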

Answered by: Eric474 | Posted: 02-11-2021



Answer 3

Check out HarvestMan, a multi-threaded web crawler written in Python, and also take a look at the spider.py module.

And here you can find code samples to build a simple web-crawler.

Answered by: Caroline527 | Posted: 02-11-2021



Answer 4

I've used Ruya and found it pretty good.

Answered by: Brianna373 | Posted: 02-11-2021



Answer 5

I hacked the above script to include a login page, as I needed it to access a Drupal site. It's not pretty, but it may help someone out there.

#!/usr/bin/python

import urllib
import urllib2
from cookielib import CookieJar
import sys
import re
from HTMLParser import HTMLParser

class miniHTMLParser( HTMLParser ):

  viewedQueue = []
  instQueue = []
  headers = {}
  opener = ""

  def get_next_link( self ):
    if self.instQueue == []:
      return ''
    else:
      return self.instQueue.pop(0)


  def gethtmlfile( self, site, page ):
    # Fetch a page through the cookie-aware opener created by loginSite()
    try:
      url = 'http://'+site+page
      response = self.opener.open(url)
      return response.read()
    except Exception, err:
      print " Error retrieving: "+page
      sys.stderr.write('ERROR: %s\n' % str(err))
    # Return an empty page on failure so the parser can carry on
    return ""

  def loginSite( self, site_url ):
    try:
      # Keep the session cookie so that later requests are authenticated
      cj = CookieJar()
      self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

      url = 'http://'+site_url
      # Credentials and form IDs are site-specific; replace them with your own
      params = {'name': 'customer_admin', 'pass': 'customer_admin123', 'opt': 'Log in', 'form_build_id': 'form-3560fb42948a06b01d063de48aa216ab', 'form_id':'user_login_block'}
      user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
      self.headers = { 'User-Agent' : user_agent }
      # Actually send the User-Agent header with every request
      self.opener.addheaders = self.headers.items()

      data = urllib.urlencode(params)
      response = self.opener.open(url, data)
      print "Logged in"
      return response.read()

    except Exception, err:
      print " Error logging in"
      sys.stderr.write('ERROR: %s\n' % str(err))

    return 1

  def handle_starttag( self, tag, attrs ):
    if tag == 'a':
      # Look up href by name; it is not always the first attribute
      href = dict(attrs).get('href')
      if href is None:
        return
      newstr = str(href)
      print newstr
      # Only follow site-relative links: skip absolute URLs, mailto links and anchors
      if re.search('http', newstr) == None:
        if re.search('mailto', newstr) == None:
          if re.search('#', newstr) == None:
            if (newstr in self.viewedQueue) == False:
              print "  adding", newstr
              self.instQueue.append( newstr )
              self.viewedQueue.append( newstr )
          else:
            print "  ignoring", newstr
        else:
          print "  ignoring", newstr
      else:
        print "  ignoring", newstr


def main():

  if len(sys.argv)!=3:
    print "usage is ./minispider.py site link"
    sys.exit(2)

  mySpider = miniHTMLParser()

  site = sys.argv[1]
  link = sys.argv[2]

  url_login_link = site+"/node?destination=node"
  print "\nLogging in", url_login_link
  x = mySpider.loginSite( url_login_link )

  while link != '':

    print "\nChecking link ", link

    # Get the file from the site and link
    retfile = mySpider.gethtmlfile( site, link )

    # Feed the file into the HTML parser
    mySpider.feed(retfile)

    # Search the retfile here

    # Get the next link in level traversal order
    link = mySpider.get_next_link()

  mySpider.close()

  print "\ndone\n"

if __name__ == "__main__":
  main()
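
For anyone on Python 3, here is a rough sketch of the two core pieces of the script above: the link filter and the cookie-keeping opener. It uses the renamed standard modules (html.parser, http.cookiejar, urllib.request). The names LinkCollector and make_cookie_opener and the sample HTML are mine, not part of the original script, and the login POST itself is left out.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


class LinkCollector(HTMLParser):
    """Collects site-relative links, skipping absolute, mailto and anchor links."""

    def __init__(self):
        super().__init__()
        self.seen = set()    # every link queued so far
        self.pending = []    # links still to visit, in discovery order

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Same filter as the script above: plain substring checks, not URL parsing.
        if "http" in href or "mailto" in href or "#" in href:
            return
        if href not in self.seen:
            self.seen.add(href)
            self.pending.append(href)


def make_cookie_opener():
    """Build an opener that keeps cookies between requests, e.g. after a login POST."""
    return build_opener(HTTPCookieProcessor(CookieJar()))


collector = LinkCollector()
collector.feed(
    '<a href="/node/1">one</a> <a href="http://other.example/">x</a> '
    '<a href="#top">top</a> <a class="x" href="/node/2">two</a> '
    '<a href="/node/1">again</a>'
)
print(collector.pending)
```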

Answered by: Grace813 | Posted: 02-11-2021



Answer 6

Trust me, nothing is better than curl. The following code can crawl 10,000 URLs in parallel in less than 300 seconds on Amazon EC2.

CAUTION: Don't hit the same domain at such a high speed.

#! /usr/bin/env python
# -*- coding: iso-8859-1 -*-
# vi:ts=4:et
# $Id: retriever-multi.py,v 1.29 2005/07/28 11:04:13 mfx Exp $

#
# Usage: python retriever-multi.py <file with URLs to fetch> [<# of
#          concurrent connections>]
#

import sys
import pycurl

# We should ignore SIGPIPE when using pycurl.NOSIGNAL - see
# the libcurl tutorial for more info.
try:
    import signal
    from signal import SIGPIPE, SIG_IGN
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)
except ImportError:
    pass


# Get args
num_conn = 10
try:
    if sys.argv[1] == "-":
        urls = sys.stdin.readlines()
    else:
        urls = open(sys.argv[1]).readlines()
    if len(sys.argv) >= 3:
        num_conn = int(sys.argv[2])
except:
    print "Usage: %s <file with URLs to fetch> [<# of concurrent connections>]" % sys.argv[0]
    raise SystemExit


# Make a queue with (url, filename) tuples
queue = []
for url in urls:
    url = url.strip()
    if not url or url[0] == "#":
        continue
    filename = "doc_%03d.dat" % (len(queue) + 1)
    queue.append((url, filename))


# Check args
assert queue, "no URLs given"
num_urls = len(queue)
num_conn = min(num_conn, num_urls)
assert 1 <= num_conn <= 10000, "invalid number of concurrent connections"
print "PycURL %s (compiled against 0x%x)" % (pycurl.version, pycurl.COMPILE_LIBCURL_VERSION_NUM)
print "----- Getting", num_urls, "URLs using", num_conn, "connections -----"


# Pre-allocate a list of curl objects
m = pycurl.CurlMulti()
m.handles = []
for i in range(num_conn):
    c = pycurl.Curl()
    c.fp = None
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    c.setopt(pycurl.CONNECTTIMEOUT, 30)
    c.setopt(pycurl.TIMEOUT, 300)
    c.setopt(pycurl.NOSIGNAL, 1)
    m.handles.append(c)


# Main loop
freelist = m.handles[:]
num_processed = 0
while num_processed < num_urls:
    # If there is an url to process and a free curl object, add to multi stack
    while queue and freelist:
        url, filename = queue.pop(0)
        c = freelist.pop()
        c.fp = open(filename, "wb")
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, c.fp)
        m.add_handle(c)
        # store some info
        c.filename = filename
        c.url = url
    # Run the internal curl state machine for the multi stack
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # Check for curl objects which have terminated, and add them to the freelist
    while 1:
        num_q, ok_list, err_list = m.info_read()
        for c in ok_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print "Success:", c.filename, c.url, c.getinfo(pycurl.EFFECTIVE_URL)
            freelist.append(c)
        for c, errno, errmsg in err_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print "Failed: ", c.filename, c.url, errno, errmsg
            freelist.append(c)
        num_processed = num_processed + len(ok_list) + len(err_list)
        if num_q == 0:
            break
    # Currently no more I/O is pending, could do something in the meantime
    # (display a progress bar, etc.).
    # We just call select() to sleep until some more data is available.
    m.select(1.0)


# Cleanup
for c in m.handles:
    if c.fp is not None:
        c.fp.close()
        c.fp = None
    c.close()
m.close()
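
On the caution about hammering one domain: here is one possible way to stay polite, sketched in Python 3 with the standard library only. It does not use pycurl and will not match its throughput. It groups URLs by host, crawls different hosts in parallel with a thread pool, and fetches each host's URLs one at a time with a pause in between. All function names, the one-second default delay, and the example URLs are assumptions for illustration.

```python
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit
from urllib.request import urlopen


def fetch_url(url, timeout=30):
    """Default fetcher: download one URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def group_by_host(urls):
    """Bucket URLs by host so that each host is handled by a single worker."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).netloc].append(url)
    return dict(groups)


def crawl_host(urls, fetch, delay):
    """Fetch one host's URLs one after another, pausing between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay)
        try:
            results[url] = fetch(url)
        except Exception as err:  # record the failure and keep going
            results[url] = err
    return results


def crawl(urls, fetch=fetch_url, max_workers=10, delay=1.0):
    """Crawl different hosts in parallel, but never hit one host concurrently."""
    results = {}
    groups = group_by_host(urls).values()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for host_results in pool.map(lambda g: crawl_host(g, fetch, delay), groups):
            results.update(host_results)
    return results


# Demo with a stand-in fetcher so that nothing is actually downloaded.
demo = crawl(
    ["http://a.example/1", "http://a.example/2", "http://b.example/1"],
    fetch=lambda url: b"stub",
    delay=0,
)
print(sorted(demo))
```

Passing the fetcher in as a parameter keeps the scheduling logic testable without network access; the default fetch_url is what a real run would use.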

Answered by: Rebecca105 | Posted: 02-11-2021



Answer 7

Another simple spider. It uses BeautifulSoup and urllib2. Nothing too sophisticated; it just reads all the a href's, builds a list, and goes through it.
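
The linked spider isn't reproduced here, but the idea (read every a href, build a list, work through it) can be sketched in Python 3 roughly as follows. This version swaps BeautifulSoup and urllib2 for the standard html.parser and urllib.request so it has no dependencies; the function names, the max_pages limit, and the same-host restriction are my own assumptions.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen


class HrefParser(HTMLParser):
    """Records the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs for all links in the page."""
    parser = HrefParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href))[0] for href in parser.hrefs]


def fetch_html(url):
    """Default fetcher: download a page and decode it leniently."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl of one host; returns the pages visited, in order."""
    host = urlsplit(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        visited.append(url)
        for link in extract_links(url, html):
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling crawl("http://example.com/") would walk that site using the default fetcher; passing your own fetch function lets you add headers, caching, or a test stub.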

Answered by: Blake282 | Posted: 02-11-2021



Answer 8

pyspider.py

Answered by: Chelsea250 | Posted: 02-11-2021


