What are the best prebuilt libraries for doing Web Crawling in Python [duplicate]

I need to crawl a finite list of websites and store their contents locally for future analysis. I basically want to slurp in all pages and follow all internal links to get the entire publicly available site.

Are there existing free libraries to get me there? I've seen Chilkat, but it's for pay. I'm just looking for baseline functionality here. Thoughts? Suggestions?

Exact Duplicate: Anyone know of a good python based web crawler that I could use?

Asked by: Chelsea657 | Posted: 28-01-2022

Answer 1

Use Scrapy.

It is a Twisted-based web crawling framework. It is still under heavy development, but it already works and has many goodies:

  • Built-in support for parsing HTML, XML, CSV, and JavaScript
  • A media pipeline for scraping items with images (or any other media) and downloading the media files as well
  • Support for extending Scrapy by plugging in your own functionality using middlewares, extensions, and pipelines
  • A wide range of built-in middlewares and extensions for handling compression, caching, cookies, authentication, user-agent spoofing, robots.txt, statistics, crawl depth restriction, etc.
  • Interactive scraping shell console, very useful for developing and debugging
  • Web management console for monitoring and controlling your bot
  • Telnet console for low-level access to the Scrapy process

Example code to extract information about all torrent files added today on the mininova torrent site, using an XPath selector on the returned HTML:

# Note: this snippet uses the pre-1.0 Scrapy API
# (ScrapedItem, domain_name, HtmlXPathSelector, RegexLinkExtractor)
class Torrent(ScrapedItem):
    pass

class MininovaSpider(CrawlSpider):
    domain_name = 'mininova.org'
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(RegexLinkExtractor(allow=[r'/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        # x.x() is the old shorthand for running an XPath query on the response
        x = HtmlXPathSelector(response)
        torrent = Torrent()

        torrent.url = response.url
        torrent.name = x.x("//h1/text()").extract()
        torrent.description = x.x("//div[@id='description']").extract()
        torrent.size = x.x("//div[@id='info-left']/p[2]/text()[2]").extract()
        return [torrent]

Answered by: Michelle867 | Posted: 01-03-2022

Answer 2

Do you really need a library? I strongly recommend Heritrix as a great general-purpose crawler that preserves the whole web page (as opposed to the more common crawlers that store only part of the text). It's a bit rough around the edges, but it works great.

That said, you could try the HarvestMan http://www.harvestmanontheweb.com/
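For the "baseline functionality" the question asks for, a same-site crawler can also be sketched with nothing but the standard library. This is a toy illustration, not a substitute for the tools above: it does no robots.txt handling, rate limiting, or retries, and the `fetch` hook exists only so the logic can be exercised without network access:

```python
# Minimal breadth-first crawler: fetch a page, collect <a href> links,
# and follow only those on the same host as the start URL.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(start_url, fetch=None, max_pages=100):
    """Breadth-first crawl of internal links; returns {url: html}."""
    if fetch is None:
        fetch = lambda url: urlopen(url).read().decode('utf-8', 'replace')
    site = urlparse(start_url).netloc
    queue, seen, pages = [start_url], {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing hosts
            absolute = urljoin(url, href).split('#')[0]
            if urlparse(absolute).netloc == site and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

The returned dict maps each visited URL to its raw HTML, which matches the "store locally for future analysis" requirement in the question.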

Answered by: Adelaide505 | Posted: 01-03-2022

Similar questions

How to find all built in libraries in Python

I've recently started with Python, and am enjoying the "batteries included" design. I've already found out I can import time, math, re, urllib, but don't know how to tell that something is built in rather than something I'd have to write from scratch. What's included, and where can I get other good-quality libraries?

OCSP libraries for python / java / c?

Going back to my previous question on OCSP, does anybody know of "reliable" OCSP libraries for Python, Java and C? I need "client" OCSP functionality, as I'll be checking the status of Certs against an OCSP responder, so responder functionality is not that important. Thanks

how do i use python libraries in C++?

I want to use the nltk libraries in C++. Is there a glue language/mechanism I can use to do this? Reason: I haven't done any serious programming in C++ for a while and want to revise NLP concepts at the same time. Thanks

How can I use Perl libraries from Python?

I have written a bunch of Perl libraries (actually Perl classes) and I want to use some of them in my Python application. Is there a natural way to do this without using SWIG or writing a Perl API for Python? I am asking for something similar to PHP's Perl interface. If there is no such kind of work for Perl in Python. What is the easiest way to use Perl cl...

Python vs. C# Twitter API libraries

Closed. This question does not meet Stack Overflow guid...

d - Calling gdc/dmd shared libraries from Python using ctypes

I've been playing around with the rather excellent ctypes library in Python recently. What I was wondering is: is it possible to create shared D libraries and call them in the same way? I'm assuming I would compile the .so files using -fPIC with dmd or gdc and call them the same way using the ctypes library. Has anyone tried this? ...

plot - Python plotting libraries

Closed. This question does not meet Stack Overflow guid...

HTML Agility Pack or HTML Screen Scraping libraries for Java, Ruby, Python?

I found the HTML Agility Pack useful and easy to use for screen scraping web sites. What's the equivalent library for HTML screen scraping in Java, Ruby, Python?

c - Building a Python shared object binding with cmake, which depends upon external libraries

We have a c file called dbookpy.c, which will provide a Python binding some C functions. Next we decided to build a proper .so with cmake, but it seems we are doing something wrong with regards to linking the external library 'libdbook' in the binding: The CMakeLists.txt is as follows: PROJECT(dbookpy) FIND_PACKAGE(PythonInterp) FIND_PACKAGE(PythonLibs) INCLUDE_DIRECTORIES(${PYTHON_INCLUDE...

shared libraries - Can two versions of the same library coexist in the same Python install?

The C libraries have a nice form of late binding, where the exact version of the library that was used during linking is recorded, and thus an executable can find the correct file, even when several versions of the same library are installed. Can the same be done in Python? To be more specific, I work on a Python project that uses some 3rd-party libraries, such as paramiko. Paramiko is now version 1.7.4, bu...
