Character reading from file in Python
In a text file, there is a string "I don't like this".
However, when I read it into a string, it becomes "I don\xe2\x80\x98t like this". I understand that \u2018 is the unicode representation of "'". I use
f1 = open (file1, "r")
text = f1.read()
command to do the reading.
Is it possible to read the file so that the resulting string is "I don't like this" instead of "I don\xe2\x80\x98t like this"?
Second edit: I have seen some people use a mapping to solve this problem, but is there really no built-in conversion that handles this kind of ANSI-to-Unicode (and vice versa) conversion?
Asked by: Aldus154 | Posted: 28-01-2022
Answer 1
Ref: http://docs.python.org/howto/unicode
Reading Unicode from a file is therefore simple:
import codecs
with codecs.open('unicode.rst', encoding='utf-8') as f:
    for line in f:
        print repr(line)
It's also possible to open files in update mode, allowing both reading and writing:
with codecs.open('test', encoding='utf-8', mode='w+') as f:
    f.write(u'\u4500 blah blah blah\n')
    f.seek(0)
    print repr(f.readline()[:1])
EDIT: I'm assuming that your intended goal is just to be able to read the file properly into a string in Python. If you're trying to convert to an ASCII string from Unicode, then there's really no direct way to do so, since the Unicode characters won't necessarily exist in ASCII.
If you're trying to convert to an ASCII string, try one of the following:
Replace the specific unicode chars with ASCII equivalents, if you are only looking to handle a few special cases such as this particular example
Use the unicodedata module's normalize() and the string.encode() method to convert as best you can to the next closest ASCII equivalent (Ref https://web.archive.org/web/20090228203858/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python):
>>> teststr
u'I don\xe2\x80\x98t like this'
>>> unicodedata.normalize('NFKD', teststr).encode('ascii', 'ignore')
'I donat like this'
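As a rough illustration of the normalize-then-encode approach, here is a minimal sketch in Python 3 syntax (an assumption, since the answers here use Python 2) applied to a correctly decoded string. The helper name to_ascii is hypothetical. U+2018 has no decomposition, so it is simply dropped, whereas an accented letter keeps its base letter.

```python
import unicodedata

def to_ascii(text):
    """Best-effort ASCII conversion: decompose, then drop what ASCII cannot hold."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# The accent is split off and discarded, the base letter survives.
print(to_ascii("caf\u00e9"))
# The curly quote has no ASCII fallback, so it vanishes entirely.
print(to_ascii("I don\u2018t like this"))
```

If losing the quote is not acceptable, an explicit replacement of that character is probably the safer route.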
Answer 2
There are a few points to consider.
The \u2018 sequence appears only in the representation (repr) of a unicode string in Python, e.g. if you write:
>>> text = u'‘'
>>> print repr(text)
u'\u2018'
Now if you simply want to print the unicode string prettily, just use unicode's encode method:
>>> text = u'I don\u2018t like this'
>>> print text.encode('utf-8')
I don‘t like this
To make sure that every line from any file is read as unicode, use the codecs.open function instead of plain open; it lets you specify the file's encoding:
>>> import codecs
>>> f1 = codecs.open(file1, "r", "utf-8")
>>> text = f1.read()
>>> print type(text)
<type 'unicode'>
>>> print text.encode('utf-8')
I don‘t like this
Answered by: Brad839 | Posted: 01-03-2022
Answer 3
It is also possible to read an encoded text file using Python 3's built-in open function:
f = open('file.txt', 'r', encoding='utf-8')
text = f.read()
f.close()
With this variation, there is no need to import any additional libraries.
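To make the difference concrete, here is a minimal sketch, assuming Python 3 and using a throwaway temporary file in place of the asker's file. It suggests that the \xe2\x80\x98 sequence in the question is just the UTF-8 encoding of U+2018, and that passing an encoding to open yields the decoded character.

```python
import os
import tempfile

# Create a throwaway file holding the UTF-8 bytes described in the question.
raw = b"I don\xe2\x80\x98t like this"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

# Binary mode returns the undecoded bytes.
with open(path, "rb") as f:
    as_bytes = f.read()

# Text mode with an explicit encoding returns decoded text.
with open(path, "r", encoding="utf-8") as f:
    as_text = f.read()

os.remove(path)

print(repr(as_bytes))
print(ascii(as_text))  # ascii() avoids console encoding problems when printing
```

The with statement also closes the file automatically, so no explicit close() call is needed.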
Answered by: Cherry619 | Posted: 01-03-2022
Answer 4
But it really is "I don\u2018t like this" and not "I don't like this". The character u'\u2018' is a completely different character than "'" (and, visually, should correspond more to '`').
If you're trying to convert encoded unicode into plain ASCII, you could perhaps keep a mapping of unicode punctuation that you would like to translate into ASCII.
punctuation = {
    u'\u2018': "'",
    u'\u2019': "'",
}
for src, dest in punctuation.iteritems():
    text = text.replace(src, dest)
There are an awful lot of punctuation characters in unicode, but I suppose you can count on only a few of them actually being used by whatever application is creating the documents you're reading.
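In Python 3 (an assumption; the loop in this answer is Python 2), the same mapping idea can be sketched with str.translate, which handles all replacements in one pass. The table contents and the helper name asciify_quotes are hypothetical and would need to be adapted to the characters your files actually contain.

```python
# Hypothetical mapping; extend it with whatever characters your files contain.
PUNCTUATION = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}
TABLE = str.maketrans(PUNCTUATION)

def asciify_quotes(text):
    """Replace curly quotes with their straight ASCII counterparts."""
    return text.translate(TABLE)

print(asciify_quotes("I don\u2018t like \u201cthis\u201d"))
```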
Answered by: Roland797 | Posted: 01-03-2022
Answer 5
There is a possibility that somehow you have a non-unicode string with unicode escape characters, e.g.:
>>> print repr(text)
'I don\\u2018t like this'
This actually happened to me once before. You can use a unicode_escape codec to decode the string to unicode and then encode it to any format you want:
>>> uni = text.decode('unicode_escape')
>>> print type(uni)
<type 'unicode'>
>>> print uni.encode('utf-8')
I don‘t like this
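For Python 3 (an assumption; the session above is Python 2), str has no decode method, so a rough equivalent goes through bytes first. This sketch assumes the text contains only ASCII characters plus escape sequences.

```python
# A str that contains a literal backslash, "u", and four hex digits
# rather than the character U+2018 itself.
escaped = "I don\\u2018t like this"

# unicode_escape is a bytes-to-str codec, so go through ASCII bytes first.
# encode("ascii") would raise if the text held any non-ASCII characters.
decoded = escaped.encode("ascii").decode("unicode_escape")

# The six-character escape sequence collapses into one real character.
print(len(escaped), len(decoded))
```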
Answered by: Elian647 | Posted: 01-03-2022
Answer 6
Leaving aside the fact that your text file is broken (U+2018 is a left quotation mark, not an apostrophe): iconv can be used to transliterate unicode characters to ascii.
You'll have to google for "iconvcodec", since the module seems not to be supported anymore and I can't find a canonical home page for it.
>>> import iconvcodec
>>> from locale import setlocale, LC_ALL
>>> setlocale(LC_ALL, '')
>>> u'\u2018'.encode('ascii//translit')
"'"
Alternatively you can use the iconv command line utility to clean up your file:
$ xxd foo
0000000: e280 980a ....
$ iconv -t 'ascii//translit' foo | xxd
0000000: 270a '.
Answered by: Vanessa239 | Posted: 01-03-2022
Answer 7
This is Python's way of showing you unicode-encoded strings. But I think you should be able to print the string on the screen or write it into a new file without any problems.
>>> test = u"I don\u2018t like this"
>>> test
u'I don\u2018t like this'
>>> print test
I don‘t like this
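The same distinction between the escaped representation and the actual content can be sketched in Python 3 (an assumption; the session above is Python 2), where ascii() plays roughly the role that repr() played for unicode strings in Python 2.

```python
text = "I don\u2018t like this"

# ascii() escapes every non-ASCII character, much like Python 2's repr() did.
escaped_view = ascii(text)

# The string itself still holds one real character at that position.
quote = text[5]

print(escaped_view)
print(hex(ord(quote)))
```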
Answered by: Vanessa838 | Posted: 01-03-2022
Answer 8
Actually, U+2018 is the Unicode representation of the special character ‘ . If you want, you can convert instances of that character to U+0027 with this code:
text = text.replace(u"\u2018", "'")
In addition, what are you using to write the file? f1.read() should return a string that looks like this:
'I don\xe2\x80\x98t like this'
If it's returning this string, the file is being written incorrectly:
'I don\u2018t like this'
Answered by: Darcy332 | Posted: 01-03-2022
Answer 9
Not sure about the (errors="ignore") option but it seems to work for files with strange Unicode characters.
with open(fName, "rb") as fData:
    lines = fData.read().splitlines()
    lines = [line.decode("utf-8", errors="ignore") for line in lines]
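To show what the errors argument does, here is a small Python 3 sketch comparing the error handlers on a byte string that is not valid UTF-8. The sample data is made up for illustration. Note that "ignore" silently discards data, so it can hide a wrong encoding guess.

```python
# \xe9 is the Latin-1 byte for "é"; on its own it is not valid UTF-8.
data = b"caf\xe9 ok"

try:
    data.decode("utf-8")  # the default handler is errors="strict"
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = data.decode("utf-8", errors="ignore")    # silently drops the bad byte
replaced = data.decode("utf-8", errors="replace")  # marks it with U+FFFD
latin1 = data.decode("latin-1")                    # right only if the file really is Latin-1

print(strict_failed, ascii(ignored), ascii(replaced), ascii(latin1))
```

When the replacement character shows up in the output, that is usually a hint to check which encoding the file was actually written in.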
Answered by: Haris336 | Posted: 01-03-2022