In the process of writing and maintaining a service that checks Russian word frequencies, I noticed peculiar phenomenon: certain words could not be located in a sqlite database that I knew actually contained them. For example, a query for the word - английский consistently failed, whereas other words would succeed. Eventually the commonality between the failures became obvious. All of the failures contained the letter й , which led me down a rabbit hole of character encoding and this specific case where it can go astray.
Background on Unicode support for Cyrillic characters
Cyrillic characters are found in the
00000400 range of the Unicode table.
The letter Russian й is related to its cousin и but is regarded as a separate letter. In other words, it’s not an и plus a combining diacritical mark; it has its own entry in the Unicode table:
A brief digression on Unicode vs UTF-8
True confession: I’ve harboured some confusion between the concepts of Unicode and UTF-8. Solving this problem meant diving into this a little deeper. First, a really basic distinction:
- Unicode is a character set that aims to encompass all possible characters.
- UTF-8 is a flexible way of encoding Unicode characters in strings
Looking at the й
entry above, we see that the Unicode value is
U+0439 which is the hexadecimal value 0h0439 and it has a UTF-8 encoding of
D0 B9. How do we get
D0 B9 from
U+0439? Herein lies the distinction between Unicode and UTF-8. In the old days, when computers just had to deal with the Latin alphabet, ASCII was sufficient. It was capable of encoding 127 characters which was more than enough to store English language text. But how to deal with characters in other languages? Clearly 8 bits is not going to be enough. So we’ll solve it by giving it an entry in the Unicode table. Problem solved. Not quite. We could just store all text as multi-byte sequences. For sure it’s necessary for some Asian languages, notably Chinese. But if would be inefficient for English, where most of what we write and store is rendered in the Latin alphabet (which resides in the lowest 127 entries in the Unicode table.)
One of the answers to this dilemma is UTF-8 which flexibly encodes characters of varying Unicode width. I’ll simply a bit here. For characters with Unicode values of 0-127, we can just store the character as a single byte. Nice and compact. For Unicode characters with larger values then we have to distribute their values into bytes that have fixed identifying “header bits” (my terminolgy, don’t quote me.) For Unicode values up to
0h07FF we take 11 bits of the Unicode value and insert it into two bytes with a fixed format:
|Byte 1||Byte 2||Byte 3||Byte 4||# free bits||Maximum Unicode value|
||5+6 → 11||0h7FF|
||4+6+6 → 16||0hFFFF|
||3+6+6+6 → 21||0h10FFFF|
From here, you’ll have to trust me. If you distribute the bits of U+0439 into the 11 free bits of the 2 byte UTF-8 sequence, you end up with
Now to why this matters for my problem of finding й -containing words.
After it dawned on me that some errant encoding might be the cause of the problem. I decided to do a little print debugging and capture what my API was seeing for characters
for l in word: print(l.encode('utf-8'))
So for the word численный for example, what were we seeing?
b'\xd1\x87' b'\xd0\xb8' b'\xd1\x81' b'\xd0\xbb' b'\xd0\xb5' b'\xd0\xbd' b'\xd0\xbd' b'\xd1\x8b' b'\xd0\xb8' b'\xcc\x86'
Whoa, were is the
D0 B9 (U+0439) that we would expect? The problem is right in front of us now. The last two characters in the UTF-8 string are
\xcc\x86. What are those guys? Well, it turns out that the letter и
has a Unicode value of U+0438 and a UTF-8 encoding, therefore of
D0 B8. Then what follows it is probably a diacritical mark. Which one? Knowing now how UTF-8 is constructed, we can work backwards.
11001100 10000110 in binary format. If we extract out the fixed bits of the two-byte sequence, we’re left with
01100000110 which is 0h306. Looking up U+0306 in the Unicode table, we find:
Therein lies the problem. Somewhere the й got turned into и plus a combining breve diacritical mark.
Since I’m not able to control the process that led to this bungling of the encoding, all I can do is fix it on the back end. Therefore, we just do a character substitution:
word = word.replace(u"\u0438\u0306", u"\u0439")
Any time we have this stupid encoding, we just fix it before using the word. Now I’m willing to bet that ё is going to get handled the same way.