URL tree walker in Python?
For URLs that show file trees, such as PyPI packages,
is there a small, solid module to walk the URL tree and list it like ls -lR?
I gather (correct me if I'm wrong) that there's no standard encoding of file attributes, link types, size, date ... in HTML,
so building a solid URLtree module on shifting sands is tough.
But surely this wheel (Unix file tree -> HTML -> treewalk API -> ls -lR or find) has been done?
(There seem to be several spiders / web crawlers / scrapers out there, but they look ugly and ad hoc so far, despite BeautifulSoup for parsing).
Asked by: Sarah710 | Posted: 06-12-2021
Apache servers are very common, and they have a relatively standard way of listing file directories.
Here's a simple script that does roughly that; you should be able to adapt it to your needs.
Usage: python list_apache_dir.py <url> [<url> ...]
Answered by: Kristian232 | Posted: 07-01-2022
import re
import sys
import urllib.request

# look for a link + a timestamp + a size ('-' for dir)
parse_re = re.compile(r'href="([^"]*)".*(..-...-.... ..:..).*?(\d+[^\s<]*|-)')

def list_apache_dir(url):
    try:
        html = urllib.request.urlopen(url).read().decode('utf-8', 'replace')
    except OSError as e:
        print('error fetching %s: %s' % (url, e))
        return
    if not url.endswith('/'):
        url += '/'
    files = parse_re.findall(html)
    dirs = []
    print(url + ' :')
    print('%4d file' % len(files) + 's' * (len(files) != 1))
    for name, date, size in files:
        if size.strip() == '-':
            size = 'dir'
        if name.endswith('/'):
            dirs += [name]
        print('%5s %s %s' % (size, date, name))
    for dir in dirs:
        print()
        list_apache_dir(url + dir)

for url in sys.argv[1:]:
    print()
    list_apache_dir(url)
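If the regex against the listing HTML feels brittle, the same link extraction can be done with the standard library's HTML parser instead. A minimal sketch, assuming a stock Apache "Index of /" page; the class name and the inline sample listing are mine:

```python
from html.parser import HTMLParser

class ApacheIndexParser(HTMLParser):
    """Collect href targets from an Apache-style directory index."""
    def __init__(self):
        super().__init__()
        self.entries = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            # skip the column-sort links ("?C=N;O=D") and the parent-dir link
            if href and not href.startswith("?") and href != "../":
                self.entries.append(href)

# made-up sample of an Apache listing page
listing = """
<html><body><h1>Index of /simple/achoo</h1>
<a href="?C=N;O=D">Name</a>
<a href="../">Parent Directory</a>
<a href="Achoo-1.0-py2.5.egg">Achoo-1.0-py2.5.egg</a>
<a href="docs/">docs/</a>
</body></html>
"""

parser = ApacheIndexParser()
parser.feed(listing)
print(parser.entries)  # entries ending in '/' are subdirectories
```

This trades the timestamp/size capture of the regex for robustness against markup variations; a real crawler would feed it the fetched page body instead of the inline sample.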
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or somewhere else that only allows pure-Python code.
It has CSS selectors as well, so this sort of thing is trivial.
Answered by: Vivian168 | Posted: 07-01-2022
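For instance, pulling the rows out of an Apache-style listing table with lxml might look like this. A minimal sketch against a made-up inline snippet; XPath is used here, and `doc.cssselect("table tr")` is the CSS-selector equivalent (it needs the small cssselect package):

```python
import lxml.html

# made-up sample of an Apache listing table
listing = """
<table>
  <tr><th>Name</th><th>Last modified</th><th>Size</th></tr>
  <tr><td><a href="Achoo-1.0-py2.5.egg">Achoo-1.0-py2.5.egg</a></td>
      <td>11-Aug-2008 07:40</td><td>8.9K</td></tr>
  <tr><td><a href="docs/">docs/</a></td>
      <td>11-Aug-2008 07:41</td><td>-</td></tr>
</table>
"""

doc = lxml.html.fromstring(listing)
rows = doc.xpath("//table//tr")
headers = [th.text_content() for th in rows[0].xpath(".//th")]
# one list of column strings per data row; '-' in the Size column marks a directory
data = [[td.text_content().strip() for td in row.xpath(".//td")]
        for row in rows[1:]]
print(headers)
for cols in data:
    print(cols)
```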
Turns out that BeautifulSoup one-liners like these can turn <table> rows into Python lists --
from BeautifulSoup import BeautifulSoup

def trow_cols( trow ):
    """ soup.table( "tr" ) row -> <td> strings
        like [None, u'Achoo-1.0-py2.5.egg', u'11-Aug-2008 07:40 ', u'8.9K']
    """
    return [td.next.string for td in trow( "td" )]

def trow_headers( trow ):
    """ soup.table( "tr" ) row -> <th> table header strings
        like [None, u'Name', u'Last modified', u'Size', u'Description']
    """
    return [th.next.string for th in trow( "th" )]

if __name__ == "__main__":
    ...  # fetch html for an Apache-style listing page
    soup = BeautifulSoup( html )
    if soup.table:
        trows = soup.table( "tr" )
        print "headers:", trow_headers( trows[0] )
        for row in trows[1:]:
            print trow_cols( row )
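The same row-to-list idea carries over to the modern bs4 package (pip install beautifulsoup4). A self-contained sketch against a made-up inline Apache-style snippet; calling a tag like `soup.table("tr")` is shorthand for find_all on its descendants:

```python
from bs4 import BeautifulSoup

# made-up sample of an Apache listing table
listing = """
<table>
  <tr><th>Name</th><th>Last modified</th><th>Size</th></tr>
  <tr><td><a href="Achoo-1.0-py2.5.egg">Achoo-1.0-py2.5.egg</a></td>
      <td>11-Aug-2008 07:40</td><td>8.9K</td></tr>
</table>
"""

soup = BeautifulSoup(listing, "html.parser")
trows = soup.table("tr")                        # all <tr> tags in the table
headers = [th.string for th in trows[0]("th")]  # header row
rows = [[td.get_text(strip=True) for td in row("td")]
        for row in trows[1:]]                   # data rows as lists of strings
print("headers:", headers)
for cols in rows:
    print(cols)
```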
Compared to sysrqb's one-line regexp above, this is ... longer; who said

"You can parse some of the html all of the time, or all of the html some of the time, but not ..."

Answered by: Daryl425 | Posted: 07-01-2022