Grabbing text from a webpage

I would like to write a program that will find bus stop times and update my personal webpage accordingly.

If I were to do this manually, I would:

  1. Visit www.calgarytransit.com
  2. Enter a stop number, e.g. 9510
  3. Click the button "next bus"

The results may look like the following:

10:16p Route 154
10:46p Route 154
11:32p Route 154

Once I've grabbed the times and routes, I will update my webpage accordingly.

I have no idea where to start. I know diddly squat about web programming but can write some C and Python. What are some topics/libraries I could look into?


Asked by: Kate552 | Posted: 01-10-2021






Answer 1

Beautiful Soup is a Python library designed for parsing web pages. Between it and urllib2 (urllib.request in Python 3) you should be able to figure out what you need.

Answered by: Madaline754 | Posted: 02-11-2021



Answer 2

What you're asking about is called "web scraping." I'm sure if you google around you'll find some stuff, but the core notion is that you want to open a connection to the website, slurp in the HTML, parse it and identify the chunks you want.
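
As a rough illustration of those steps, here is a minimal Python 3 sketch that uses only the standard library. The markup in SAMPLE_HTML and the class name "departure" are invented for the example; the real page on calgarytransit.com will be structured differently, so the parser would need adjusting after inspecting the actual HTML.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url):
    """Open a connection and slurp in the page as text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class DepartureParser(HTMLParser):
    """Collect the text of every <li class="departure"> element.

    The tag and class name are assumptions made for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.departures = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and dict(attrs).get("class") == "departure":
            self._inside = True
            self.departures.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.departures[-1] += data


def extract_departures(html):
    """Parse the HTML and return the chunks of text we care about."""
    parser = DepartureParser()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.departures]


# Invented sample markup standing in for the real page.
SAMPLE_HTML = """
<html><body>
<h1>Stop 9510</h1>
<ul>
  <li class="departure">10:16p Route 154</li>
  <li class="departure">10:46p Route 154</li>
  <li class="other">ignore me</li>
</ul>
</body></html>
"""

if __name__ == "__main__":
    print(extract_departures(SAMPLE_HTML))
```

Against a live page you would pass the result of fetch_html(url) to extract_departures instead of the sample string.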

The Python Wiki has a good lot of stuff on this.

Answered by: Anna517 | Posted: 02-11-2021



Answer 3

Since you write in C, you may want to check out cURL; in particular, take a look at libcurl. It's great.

Answered by: Emma662 | Posted: 02-11-2021



Answer 4

You can use the mechanize library, which is available for Python: http://wwwsearch.sourceforge.net/mechanize/

Answered by: Kimberly428 | Posted: 02-11-2021



Answer 5

You can use Perl to help you complete your task.

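use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Fetch the page; the response object records whether the request succeeded.
my $response = $browser->get("http://google.com");
die "Request failed: ", $response->status_line, "\n" unless $response->is_success;
print $response->content;

Your response object can tell you whether the request succeeded, as well as returning the content of the page. You can also use this same library to post to a page.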

Here is some documentation. http://metacpan.org/pod/LWP::UserAgent

Answered by: Adelaide512 | Posted: 02-11-2021



Answer 6

That site doesn't offer an API that would give you the data you need. In that case you'll need to parse the actual HTML page returned by, for example, a cURL request.
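
If the stop number is submitted through an HTML form, the request can be reproduced from a script. The Python 3 sketch below shows one way that might look. The URL and the field name stop_number are placeholders, not the site's real values; those would have to be read from the form in the page source.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder values: read the real form action and field name
# from the page source before using this.
FORM_URL = "http://www.example.com/nextbus"
FIELD_NAME = "stop_number"


def build_stop_request(stop_number, url=FORM_URL, field=FIELD_NAME):
    """Build a POST request that mimics submitting the stop-number form."""
    body = urlencode({field: str(stop_number)}).encode("ascii")
    return Request(url, data=body, headers={"User-Agent": "stop-times-sketch/0.1"})


def fetch_stop_page(stop_number):
    """Send the request and return the HTML, which still needs parsing."""
    with urlopen(build_stop_request(stop_number), timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    request = build_stop_request(9510)
    print(request.get_method(), request.full_url, request.data)
```

Supplying a data argument is what makes urllib send a POST rather than a GET. If the form on the real page uses GET, the encoded fields would go in the URL's query string instead.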

Answered by: Miranda255 | Posted: 02-11-2021



Answer 7

This is called Web scraping, and it even has its own Wikipedia article where you can find more information.

Also, you might find more details in this SO discussion.

Answered by: Arthur325 | Posted: 02-11-2021



Answer 8

As long as the layout of the web page you're trying to 'scrape' doesn't change regularly, you should be able to parse the HTML with any modern programming language.
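
To make that concrete, the Python 3 sketch below turns result lines like the ones in the question into (time, route) pairs and raises an error when nothing matches, which is usually the first sign that the page layout has changed. The line format is assumed from the sample output in the question; the real page may differ.

```python
import re

# Matches lines such as "10:16p Route 154" (format assumed from the question).
LINE_PATTERN = re.compile(
    r"^[ \t]*(\d{1,2}:\d{2}[ap])[ \t]+Route[ \t]+(\d+)[ \t]*$",
    re.MULTILINE,
)


def parse_stop_times(text):
    """Return a list of (time, route) tuples found in the text."""
    matches = LINE_PATTERN.findall(text)
    if not matches:
        # Fail loudly rather than silently publishing an empty schedule.
        raise ValueError("no stop times found; has the page layout changed?")
    return [(time, int(route)) for time, route in matches]


if __name__ == "__main__":
    sample = "10:16p Route 154\n10:46p Route 154\n11:32p Route 154\n"
    for time, route in parse_stop_times(sample):
        print(f"Route {route} departs at {time}")
```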

Answered by: David172 | Posted: 02-11-2021


