Getting international characters from a web page? [duplicate]

I want to scrape some information off a football (soccer) web page using simple Python regexps. The problem is that players such as the first chap, ÄÄRITALO, come out as &#196;&#196;RITALO!
That is, the HTML uses escaped markup for the special characters, such as &#196;.

Is there a simple way of reading the HTML into the correct Python string? If it were XML/XHTML it would be easy: the parser would do it.


Asked by: Blake427 | Posted: 27-01-2022






Answer 1

I would recommend BeautifulSoup for HTML scraping. With BeautifulSoup 3 (the Python 2 release imported below), you also need to tell it to convert HTML entities to the corresponding Unicode characters, like so:

>>> from BeautifulSoup import BeautifulSoup
>>> html = "<html>&#196;&#196;RITALO!</html>"
>>> soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
>>> print soup.contents[0].string
ÄÄRITALO!

(It would be nice if the standard codecs module included a codec for this, such that you could do "some_string".decode('html_entities') but unfortunately it doesn't!)
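For what it's worth, Python 3.4 and later do ship a standard-library function for this, html.unescape, though it is a plain function rather than a codec. A minimal sketch, assuming Python 3 and reusing the string from the example above:

```python
from html import unescape

# The escaped text as it appears in the page source.
raw = "&#196;&#196;RITALO!"

# unescape() converts decimal, hex and named character references to text.
decoded = unescape(raw)
print(decoded)  # ÄÄRITALO!
```

This needs no third-party package, so it may be enough if all you want is the decoding step and not a parsed document.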

EDIT: Another solution: Python developer Fredrik Lundh (author of ElementTree, among other things) has a function to unescape HTML entities on his website, which works with decimal, hex and named entities (BeautifulSoup will not work with the hex ones).
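To make the idea concrete, here is a rough Python 3 sketch of a regex-based unescape that covers all three forms. It is not Fredrik Lundh's code: the name unescape_entities and the pattern are my own assumptions for illustration, and only html.entities.name2codepoint comes from the standard library.

```python
import re
from html.entities import name2codepoint

# Matches &#196; (decimal), &#xC4; (hex) and &Auml; (named) references.
_ENTITY_RE = re.compile(r"&(#[xX]?[0-9a-fA-F]+|\w+);")

def unescape_entities(text):
    """Replace HTML character references in text with the characters they name.

    Hypothetical helper for illustration; unknown or malformed references
    are left untouched.
    """
    def replace(match):
        body = match.group(1)
        try:
            if body.startswith(("#x", "#X")):
                return chr(int(body[2:], 16))   # hex reference
            if body.startswith("#"):
                return chr(int(body[1:]))       # decimal reference
            return chr(name2codepoint[body])    # named entity
        except (KeyError, ValueError, OverflowError):
            return match.group(0)               # leave as-is
    return _ENTITY_RE.sub(replace, text)

print(unescape_entities("&#196;&#xC4;&Auml;RITALO!"))
```

Leaving unrecognised references untouched is a deliberate choice here, so that stray ampersands in scraped text are not mangled.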

Answered by: Patrick542 | Posted: 28-02-2022



Answer 2

Try using BeautifulSoup. It should do the trick and give you a nicely formatted DOM to work with as well.

This blog entry seems to have had some success with it.

Answered by: Sam557 | Posted: 28-02-2022



Answer 3

I haven't tried it myself, but have you tried

http://zesty.ca/python/scrape.html ?

It seems to have a method htmldecode(text) which would do what you want.
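Whichever decoder you settle on, it slots in right after the regexp match. A sketch, assuming Python 3, with html.unescape standing in for htmldecode (I have not checked how scrape.py actually behaves) and with made-up markup and a hypothetical pattern, since the real page is not shown in the question:

```python
import re
from html import unescape

# Made-up markup for illustration only; the real page will look different.
page = '<td class="player">&#196;&#196;RITALO</td><td class="player">SMITH</td>'

# Hypothetical pattern: capture the text inside each "player" cell.
PLAYER_RE = re.compile(r'<td class="player">(.*?)</td>')

def player_names(html_text):
    """Return the decoded text of every cell matched by PLAYER_RE."""
    return [unescape(raw) for raw in PLAYER_RE.findall(html_text)]

print(player_names(page))  # ['ÄÄRITALO', 'SMITH']
```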

Answered by: Michael995 | Posted: 28-02-2022



Similar questions

python - Scrapy output feed international unicode characters (e.g. Japanese chars)

I'm a newbie to python and scrapy and I'm following the dmoz tutorial. As a minor variant to the tutorial's suggested start URL, I chose a Japanese category from the dmoz sample site and noticed that the feed export I eventually get shows the unicode numeric values instead of the actual Japanese characters. It seems like I need to use


python - Tornado request handler mapping to international characters

I want to be able to match URL requests for some internationalized characters, like /Comisión. This is my setup: class Application(tornado.web.Application): def __init__(self): handlers = [ '''some handlers, and then this: ''' (r"/([\w\:\,]+)", InternationalizedHandler) ] tornado.web.Application.__init__(self, handlers, **settings)


python - Completer with international characters

I'm using the following code for text completion: class MyCompleter(object): # Custom completer def __init__(self, options): self.options = sorted(options) def complete(self, text, state): if state == 0: # on first trigger, build possible matches if text: # cache matches (entries that start with entered text) self.matches = [s for s in self.options ...


database - International characters in pyodbc - ODBC python library

I'm using pyodbc to connect to my *.mdb files and store them in a sqlite / spatialite database for further work and analysis. I'm passing DSN like this: DSN="Driver={%s};DBQ=%s;"%(self.find_mdb_driver(),self.mdbPath) and then: conn = pyodbc.connect(DSN) Problem is, that when I try to pass path with international characters cp1250 "čžš" I get error: ...


Handling international characters in email subject lines with Python 3

I’m writing a script to read the subject lines on unread emails. My first attempt: from imaplib import IMAP4_SSL from email.parser import HeaderParser # username = # password = # server = # port = M = IMAP4_SSL(server, port) M.login(username, password) M.select() typ, data = M.search(None, '(UNSEEN)') for num in data[0].split(): rv, data = M.fetch(num, '(BODY.PEEK[HEADER.FIELDS (SUBJECT FROM)])')...


python - How to encode international strings with emoticons and special characters for storing in database

I want to use a API from a game and store the player and clan names in a local database. The names can contain all sorts of characters and emoticons. Here are just a few examples I found: ⭐???? яαℓαηι نکل 窝猫 鐵擊道遊隊 ❤✖❤♠️♦️♣️✖ I use python for reading the api and write it into a mysql database. After that, I want to use the names on a Node.js we...


Showing International Characters using Python and Tkinter

Using Python v3 and Tkinter, I'm trying to read a text file that contains International characters and then display them on the screen (e.g. in a Menu) but the characters display wrongly. a simple example of data in the text file is:- Letzte VTR-Datei öffnen (which I think is German for 'Open recent VTR file' - or something similar) What I see is the ö character being replaced with something like a capital A ...


string - What's a good way to replace international characters with their base Latin counterparts using Python?

Say I have the string "blöt träbåt" which has a few a and o with umlaut and ring above. I want it to become "blot trabat" as simply as possibly. I've done some digging and found the following method: import unicodedata unicode_string = unicodedata.normalize('NFKD', unicode(string)) This will give me the string in unicode format with the i...


python - Preparing a web site for international usage

I am preparing to develop a web application that will (hopefully) be used by an audience with many different native languages. What should I do to prepare my software project to have the user interface be almost entirely internationalized? Are there any software stacks that make this easier?


python - Different national and international shipping rate in Satchmo?

I'm taking over a Satchmo site and need it to charge a different shipping rate for international versus local postage. Any idea what I need to do to enable this?


python - Email sender / recipient in international format

There are several examples of similar functionality, and I have been following those guidelines, but this is still not working. I am trying to get this test script to run: # -*- coding: utf-8 -*- import smtplib from email.header import Header from email.mime.multipart import MIMEMu...


python - Formatting a mobile number to international format

A user can input a phone number in any of the following formats: 07xxxxxxxxx 00447xxxxxxxxx 447xxxxxxxxx +447xxxxxxxxx I need help in creating a function that will take the number in any of the formats above and return it as international format +447xxxxxxxxx. This is ...


Python pandas Google finance international stocks - looking for way to get international stocks price history with Google

Closed. This question needs to be more focused. It ...






