Dealing with a string containing multiple character encodings

I'm not exactly sure how to ask this question, and I'm nowhere close to finding an answer, so I hope someone can help me.

I'm writing a Python app that connects to a remote host and receives back byte data, which I unpack using Python's built-in struct module. My problem is with the strings, as they include multiple character encodings. Here is an example of such a string:

"^LThis is an example ^Gstring with multiple ^Jcharacter encodings"

Where each encoding starts and ends is marked with special escape characters:

  • ^L - Latin1
  • ^E - Central Europe
  • ^T - Turkish
  • ^B - Baltic
  • ^J - Japanese
  • ^C - Cyrillic
  • ^G - Greek

And so on... I need a way to convert this sort of string into Unicode, but I'm really not sure how to do it. I've read up on Python's codecs and string.encode/decode, but I'm none the wiser. I should also mention that I have no control over how the host outputs the strings.

I hope someone can help me with how to get started on this.


Asked by: Daryl694 | Posted: 05-10-2021






Answer 1

Here's a relatively simple example of how to do it...

# -*- coding: utf-8 -*-
import re

# Test Data
ENCODING_RAW_DATA = (
    ('latin_1',    'L', u'Hello'),        # Latin 1
    ('iso8859_2',  'E', u'dobrý večer'),  # Central Europe
    ('iso8859_9',  'T', u'İyi akşamlar'), # Turkish
    ('iso8859_13', 'B', u'Į sveikatą!'),  # Baltic
    ('shift_jis',  'J', u'今日は'),        # Japanese
    ('iso8859_5',  'C', u'Здравствуйте'), # Cyrillic
    ('iso8859_7',  'G', u'Γειά σου'),   # Greek
)

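# Map each control character (e.g. 'L' -> '\x0c', i.e. Ctrl-L) to its codec name.
CODE_TO_ENCODING = dict([(chr(ord(code)-64), encoding) for encoding, code, text in ENCODING_RAW_DATA])
EXPECTED_RESULT = u''.join([line[2] for line in ENCODING_RAW_DATA])
# Build the test input: each chunk is a control character followed by the encoded text.
# (Python 2: chr() and unicode.encode() both return byte strings.)
ENCODED_DATA = ''.join([chr(ord(code)-64) + text.encode(encoding) for encoding, code, text in ENCODING_RAW_DATA])

# A chunk is one control character followed by any run of non-control bytes.
FIND_RE = re.compile('[\x00-\x1A][^\x00-\x1A]*')

def decode_single(chunk):
    # The first byte selects the codec; the rest is the payload.
    return chunk[1:].decode(CODE_TO_ENCODING[chunk[0]])

result = u''.join([decode_single(chunk) for chunk in FIND_RE.findall(ENCODED_DATA)])

assert result==EXPECTED_RESULT, u"Expected %s, but got %s" % (EXPECTED_RESULT, result)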
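
The code above is written for Python 2, where str is a byte string. A rough Python 3 adaptation of the same idea (not the answerer's original code) might look like the sketch below. It assumes the data arrives as bytes and that the markers really are single control bytes (^L = 0x0C and so on); decode_mixed is a name made up for this sketch.

```python
import re

# Control byte -> codec name, using the same codecs as the test data above.
# ^L is ord('L') - 64 == 0x0C, ^G is 0x07, and so on.
CODE_TO_ENCODING = {
    ord(letter) - 64: codec
    for letter, codec in [
        ("L", "latin_1"),
        ("E", "iso8859_2"),
        ("T", "iso8859_9"),
        ("B", "iso8859_13"),
        ("J", "shift_jis"),
        ("C", "iso8859_5"),
        ("G", "iso8859_7"),
    ]
}

# One control byte followed by a run of non-control bytes.
FIND_RE = re.compile(rb"[\x00-\x1a][^\x00-\x1a]*")


def decode_mixed(data: bytes) -> str:
    """Decode each marked chunk with its own codec and join the results."""
    return "".join(
        # Indexing bytes gives an int in Python 3, so the map is keyed by int.
        chunk[1:].decode(CODE_TO_ENCODING[chunk[0]])
        for chunk in FIND_RE.findall(data)
    )


data = b"\x0cHello " + b"\x07" + "Γειά σου".encode("iso8859_7")
print(decode_mixed(data))
```

Note that in this sketch any bytes before the first control byte are silently skipped, and a control byte that is not in the table raises KeyError.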

Answered by: Kellan808 | Posted: 06-11-2021



Answer 2

There's no built-in functionality for decoding a string like this, since it is really its own custom codec. You simply need to split up the string on those control characters and decode it accordingly.

Here's a (very slow) example of such a function that handles latin1 and shift-JIS:

latin1 = "latin-1"
japanese = "Shift-JIS"

control_l = "\x0c"
control_j = "\n"

encodingMap = {
    control_l: latin1,
    control_j: japanese}

def funkyDecode(s, initialCodec=latin1):
    output = u""
    accum = ""
    currentCodec = initialCodec
    for ch in s:
        if ch in encodingMap:
            output += accum.decode(currentCodec)
            currentCodec = encodingMap[ch]
            accum = ""
        else:
            accum += ch
    output += accum.decode(currentCodec)
    return output

A faster version might use str.split, or regular expressions.

(Also, as you can see in this example, "^J" is the control character for "newline", so your input data is going to have some interesting restrictions.)
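
As a rough idea of what the regular-expression variant could look like, here is a Python 3 sketch (the function above is Python 2). It assumes the input is a bytes object and handles only the same two encodings; funky_decode and SPLIT_RE are names chosen for this sketch.

```python
import re

# Control bytes for the two encodings handled above.
ENCODING_MAP = {
    b"\x0c": "latin-1",    # ^L
    b"\n": "shift_jis",    # ^J, which is also the newline byte
}

# The capturing group makes re.split keep the control bytes in the result.
# The character class must list the same bytes as ENCODING_MAP.
SPLIT_RE = re.compile(rb"([\x0c\n])")


def funky_decode(data: bytes, initial_codec: str = "latin-1") -> str:
    parts = SPLIT_RE.split(data)
    # parts alternates: text, control, text, control, text, ...
    output = [parts[0].decode(initial_codec)]
    for control, chunk in zip(parts[1::2], parts[2::2]):
        output.append(chunk.decode(ENCODING_MAP[control]))
    return "".join(output)
```

Each chunk is decoded in one call instead of being accumulated a character at a time. Splitting on raw bytes is only safe here because neither control byte can appear inside a Latin-1 or Shift-JIS character that the host would otherwise send; a real newline in the text would still be misread as ^J.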

Answered by: Rafael426 | Posted: 06-11-2021



Answer 3

I would write a codec that incrementally scans the string and decodes the bytes as they come along. Essentially, you would split the string into chunks that each have a single, consistent encoding, decode each chunk, and concatenate the results in order.
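
A minimal Python 3 sketch of that incremental idea, built on the standard library's codecs.getincrementaldecoder. The class name MixedDecoder, its feed method, and the three-entry table are assumptions made for illustration; the codec choices for ^J and ^G are borrowed from the other answers.

```python
import codecs

# Hypothetical table for this sketch: control byte -> codec name.
CONTROL_CODECS = {
    0x0C: "latin_1",    # ^L
    0x0A: "shift_jis",  # ^J
    0x07: "iso8859_7",  # ^G
}


class MixedDecoder:
    """Feed bytes in arbitrary pieces; get text back as it becomes decodable."""

    def __init__(self, initial_codec="latin_1"):
        self._decoder = codecs.getincrementaldecoder(initial_codec)()

    def feed(self, data: bytes, final: bool = False) -> str:
        out = []
        start = 0
        for i, byte in enumerate(data):
            if byte in CONTROL_CODECS:
                # Flush the run before the control byte, then switch codecs.
                out.append(self._decoder.decode(data[start:i], final=True))
                codec = CONTROL_CODECS[byte]
                self._decoder = codecs.getincrementaldecoder(codec)()
                start = i + 1
        # The tail may end in the middle of a multi-byte character; the
        # incremental decoder keeps the partial bytes until the next call.
        out.append(self._decoder.decode(data[start:], final=final))
        return "".join(out)
```

The point of the incremental decoder is that a multi-byte character split across two network reads is held back until its remaining bytes arrive, instead of raising an error.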

Answered by: Roman610 | Posted: 06-11-2021



Answer 4

You definitely have to split the string into substrings with different encodings first, and decode each one separately. Just for fun, the obligatory "one-line" version:

import re

encs = {
    'L': 'latin1',
    'G': 'iso8859-7',
    ...
}

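# st is the raw input (a Python 2 byte string); each match is '^',
# an encoding letter, then the text up to the next '^'.
decoded = ''.join(substr[2:].decode(encs[substr[1]])
             for substr in re.findall('\^[%s][^^]*' % ''.join(encs.keys()), st))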

(No error checking here; you'll also want to decide how to handle '^' characters inside substrings.)
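
One possible way to add that error checking, sketched for Python 3 and following this answer's reading of the markers as a literal caret plus a letter. The policy is an assumption of the sketch, not something the question specifies: a caret followed by an uppercase letter is always a marker, any other caret is ordinary text, and unknown markers are rejected. decode_marked is a made-up name, and ENCS lists only the two entries shown above.

```python
import re

ENCS = {"L": "latin1", "G": "iso8859-7"}  # extend with the remaining markers

# A marker is a literal caret followed by one uppercase ASCII letter.
MARKER_RE = re.compile(rb"\^([A-Z])")


def decode_marked(data: bytes, errors: str = "strict") -> str:
    parts = MARKER_RE.split(data)
    # parts alternates: leading text, letter, text, letter, text, ...
    if parts[0]:
        raise ValueError("data found before the first encoding marker")
    out = []
    for letter, chunk in zip(parts[1::2], parts[2::2]):
        key = letter.decode("ascii")
        if key not in ENCS:
            raise ValueError("unknown encoding marker ^%s" % key)
        # errors="replace" would substitute U+FFFD instead of raising.
        out.append(chunk.decode(ENCS[key], errors))
    return "".join(out)
```

With this policy, b"^L2^3" decodes to "2^3" because the second caret is followed by a digit, while b"^Zabc" raises ValueError.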

Answered by: Stella647 | Posted: 06-11-2021



Answer 5

I don't suppose you have any way of convincing the person who hosts the other machine to switch to Unicode?

This is one of the reasons Unicode was invented, after all.

Answered by: Sam187 | Posted: 06-11-2021


