unicode

Stripping Russian stress marks from text from the command line

Russian text intended for learners sometimes contains marks that indicate the syllabic stress. It is usually rendered as a vowel + a combining diacritical mark, typically the combining acute accent \u301. Here are a couple ways of stripping these marks on the command line: First is a version using Perl #!/bin/bash f='покупа́ешья́'; echo $f | perl -C -pe 's/\x{301}//g;' And then another using the sd tool: #!/bin/bash f='покупа́ешья́'; echo $f | sd "\u0301" "" Both rely on finding the combining diacritical mark and removing it with regex.

Encoding of the Cyrillic letter й - a UTF-8 gotcha

In the process of writing and maintaining a service that checks Russian word frequencies, I noticed peculiar phenomenon: certain words could not be located in a sqlite database that I knew actually contained them. For example, a query for the word - английский consistently failed, whereas other words would succeed. Eventually the commonality between the failures became obvious. All of the failures contained the letter й , which led me down a rabbit hole of character encoding and this specific case where it can go astray.