Stripping Russian syllabic stress marks in Python

I have written previously about stripping syllabic stress marks from Russian text using a Perl-based regex tool. But I needed a means of doing it solely in Python, so this just extends that idea.

    #!/usr/bin/env python3

    def strip_stress_marks(text: str) -> str:
        b = text.encode('utf-8')
        # correct error where latin accented ó is used
        b = b.replace(b'\xc3\xb3', b'\xd0\xbe')
        # correct error where latin accented á is used
        b = b.replace(b'\xc3\xa1', b'\xd0\xb0')
        # correct error where latin accented é is used
        b = b.
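A minimal sketch of the same idea, operating on Python strings rather than encoded bytes. The names `strip_stress` and `LATIN_TO_CYRILLIC` are hypothetical, and mapping é to Cyrillic е is an assumption, since the excerpt only shows the replacements for ó and á.

```python
COMBINING_ACUTE = "\u0301"  # combining acute accent used as the stress mark

# Precomposed Latin vowels that sometimes stand in for stressed Cyrillic ones.
# The mappings for ó and á mirror the excerpt; the é mapping is an assumption.
LATIN_TO_CYRILLIC = str.maketrans({
    "\u00f3": "\u043e",  # ó -> о
    "\u00e1": "\u0430",  # á -> а
    "\u00e9": "\u0435",  # é -> е (assumed)
})


def strip_stress(text: str) -> str:
    """Remove combining acute accents and repair Latin look-alike vowels."""
    return text.replace(COMBINING_ACUTE, "").translate(LATIN_TO_CYRILLIC)


stressed = "покупа\u0301ешь"
plain = strip_stress(stressed)
```

Only the combining acute accent is removed. The text is deliberately not decomposed with Unicode normalization first, because that would also split letters such as й and ё into a base letter plus a combining mark.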

Converting Cyrillic UTF-8 text encoded as Latin-1

This may be obvious to some, but recognizing a character encoding at a glance is not always easy. For example, pronunciation files downloaded from Forvo have the following appearance: pronunciation_ru_оÑ‚бывание.mp3 How can we extract the actual word from this gibberish? Ideally, the filename should reflect the actual word uttered in the pronunciation file, after all. Step 1 - Extracting the interesting bits The gibberish begins after the pronunciation_ru_ and ends before the file extension.
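The extraction and repair can be sketched in Python as follows. This is an illustration rather than the post's own code: `extract_word` and `fix_mojibake` are hypothetical names, and the garbled filename is built inside the example so that the round trip is exact.

```python
from pathlib import Path

PREFIX = "pronunciation_ru_"  # prefix shown in the example filename


def extract_word(filename: str) -> str:
    """Return the part between the prefix and the file extension."""
    stem = Path(filename).stem
    return stem[len(PREFIX):] if stem.startswith(PREFIX) else stem


def fix_mojibake(garbled: str, wrong_encoding: str = "latin-1") -> str:
    """Undo UTF-8 bytes that were mistakenly decoded as a one-byte encoding."""
    return garbled.encode(wrong_encoding).decode("utf-8")


# Simulate the damage: UTF-8 bytes of a Cyrillic word read as Latin-1.
original = "отбывание"
garbled_name = PREFIX + original.encode("utf-8").decode("latin-1") + ".mp3"

recovered = fix_mojibake(extract_word(garbled_name))
```

If the text was mis-decoded as Windows-1252 rather than Latin-1, passing `wrong_encoding="cp1252"` may be the better choice.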

accentchar: a command-line utility to apply Russian stress marks

I’ve written a lot about applying and removing syllabic stress marks in Russian text because I use it a lot when making Anki cards. This iteration is a command line tool for applying the stress mark at a particular character index. The advantage of these little shell tools is that they can be composable, integrating into different tools as the need arises.

    #!/usr/local/bin/zsh

    while getopts i:w: flag
    do
        case "${flag}" in
            i) index=${OPTARG};;
            w) word=${OPTARG};;
        esac
    done

    if [ $word ]; then
        temp=$word
    else
        read temp
    fi

    outword=""
    for (( i=0; i<${#temp}; i++ )); do
        thischar="${temp:$i:1}"
        if [ $i -eq $index ]; then
            thischar=$(echo $thischar | perl -C -pe 's/(.
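For comparison, the same operation can be sketched in Python. This is not the zsh tool itself: `apply_stress` is a hypothetical name, and the index is assumed to be zero-based, as in the shell loop.

```python
COMBINING_ACUTE = "\u0301"  # combining acute accent


def apply_stress(word: str, index: int) -> str:
    """Insert a stress mark after the character at the zero-based index."""
    if not 0 <= index < len(word):
        raise IndexError(f"index {index} is outside the word {word!r}")
    return word[: index + 1] + COMBINING_ACUTE + word[index + 1 :]


stressed = apply_stress("собака", 3)
```

The function assumes the input word carries no combining marks of its own, since those would shift the character positions.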

Stripping Russian stress marks from text from the command line

Russian text intended for learners sometimes contains marks that indicate the syllabic stress. Stress is usually rendered as a vowel followed by a combining diacritical mark, typically the combining acute accent U+0301. Here are a couple of ways of stripping these marks on the command line:

First is a version using Perl:

    #!/bin/bash
    f='покупа́ешья́'; echo $f | perl -C -pe 's/\x{301}//g;'

And then another using the sd tool:

    #!/bin/bash
    f='покупа́ешья́'; echo $f | sd "\u0301" ""

Both rely on finding the combining diacritical mark and removing it with regex.

Normalizing spelling in Russian words containing the letter ё

The Russian letters ё and е have a complex and troubled relationship. The two letters are pronounced differently, but usually appear the same in written text. This presents complications for Russian learners and for text-to-speech systems. In several recent projects, I have needed to normalize the spelling of Russian words. For example, if I have the written word определенно, is the word actually определенно? Or is it определённо?
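One way to approach this is a lookup against a list of words known to be spelled with ё. The sketch below assumes such a list exists; the excerpt does not say where the post gets its data, so `KNOWN_YO_WORDS` and the function names are hypothetical.

```python
import unicodedata


def fold_yo(word: str) -> str:
    """Return the word in composed form with every ё replaced by е."""
    composed = unicodedata.normalize("NFC", word)
    return composed.replace("\u0451", "\u0435").replace("\u0401", "\u0415")


def build_yo_index(words_with_yo):
    """Map each ё-less spelling to its spelling with ё."""
    return {fold_yo(w): unicodedata.normalize("NFC", w) for w in words_with_yo}


def restore_yo(word: str, index: dict) -> str:
    """Return the ё spelling if the word is in the index, else the word itself."""
    return index.get(fold_yo(word), word)


# Hypothetical sample data; a real list would come from a dictionary source.
KNOWN_YO_WORDS = ["определённо", "ёлка"]
yo_index = build_yo_index(KNOWN_YO_WORDS)

normalized = restore_yo("определенно", yo_index)
```

A plain lookup cannot resolve pairs where both spellings are real words, such as все and всё. Those need context.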

Scraping Russian word definitions from Wiktionary: utility for Anki

While my Russian Anki deck contains around 27,000 cards, I’m always making more. (There are a lot of words in the Russian language!) Over the years, I’ve become more and more efficient with card production but one of the missing pieces was finding a code-readable source of word definitions. There’s no shortage of dictionary sites, but scraping data from any site is complicated by the ways in which front-end developers spread the semantic content across multiple HTML tags arranged in deep and cryptic hierarchies.

Encoding of the Cyrillic letter й - a UTF-8 gotcha

In the process of writing and maintaining a service that checks Russian word frequencies, I noticed a peculiar phenomenon: certain words could not be located in a sqlite database that I knew actually contained them. For example, a query for the word английский consistently failed, whereas other words would succeed. Eventually the commonality between the failures became obvious. All of the failures contained the letter й, which led me down a rabbit hole of character encoding and this specific case where it can go astray.
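The kind of mismatch described here can be reproduced in a few lines of Python. The sketch assumes the database holds the precomposed form of й (U+0439) while the query arrives in decomposed form (и followed by the combining breve U+0306). The excerpt does not say which side was which, and the table name `freq` is hypothetical.

```python
import sqlite3
import unicodedata

# The same visible word in two different Unicode representations.
stored = unicodedata.normalize("NFC", "английский")  # й as U+0439
query = unicodedata.normalize("NFD", stored)         # й as U+0438 + U+0306

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE freq (word TEXT PRIMARY KEY, rank INTEGER)")
conn.execute("INSERT INTO freq VALUES (?, ?)", (stored, 1))


def lookup(word: str):
    """Return the row for a word, comparing the text exactly as given."""
    return conn.execute("SELECT rank FROM freq WHERE word = ?", (word,)).fetchone()


miss = lookup(query)                                # decomposed form: no match
hit = lookup(unicodedata.normalize("NFC", query))   # normalized form: match
```

Normalizing both the stored data and every incoming query to the same form (NFC here) avoids the problem.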

Extending the Anki Cloze Anything script for language learners

It’s possible to use cloze deletion cards within standard Anki note types using the Anki Cloze Anything setup. But additional scripts are required to allow it to function seamlessly in a typical language-learning environment. I’ll show you how to flexibly display a sentence with or without Anki Cloze Anything markup and also not break AwesomeTTS. Anki’s built-in cloze deletion system The built-in cloze deletion feature in Anki is an excellent way for language learners to actively test their recall.

Complete fix for broken Knowclip .apkg files

I think this is the last word on fixing Knowclip .apkg files. I’ve developed this in bits and pieces, but hopefully this is the final version. See my previous articles, here and here, for the details. The issue, again, is that Knowclip gives these notes and cards sequential id values starting at 1. But Anki uses the note id and the card id as the creation date. I logged it as an issue on GitHub, but as of 2021-04-15 no action has been taken.
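The renumbering can be sketched as follows. This is not the script from the post: it runs against a simplified in-memory stand-in for the `notes` and `cards` tables with only the columns needed here, and `reassign_ids` and `BASE_MS` are hypothetical names. Anki ids are timestamps in epoch milliseconds, so the sketch replaces each sequential id with a timestamp-like value.

```python
import sqlite3


def reassign_ids(conn: sqlite3.Connection, base_ms: int) -> None:
    """Replace sequential note and card ids with epoch-millisecond style ids.

    Assumes base_ms is larger than every existing id, so new ids never
    collide with rows that have not been updated yet.
    """
    cur = conn.cursor()
    note_ids = [r[0] for r in cur.execute("SELECT id FROM notes ORDER BY id").fetchall()]
    for offset, old in enumerate(note_ids):
        new = base_ms + offset
        # Keep each card pointing at its note, then renumber the note itself.
        cur.execute("UPDATE cards SET nid = ? WHERE nid = ?", (new, old))
        cur.execute("UPDATE notes SET id = ? WHERE id = ?", (new, old))
    card_ids = [r[0] for r in cur.execute("SELECT id FROM cards ORDER BY id").fetchall()]
    for offset, old in enumerate(card_ids):
        cur.execute("UPDATE cards SET id = ? WHERE id = ?", (base_ms + offset, old))
    conn.commit()


# Simplified stand-in tables with ids numbered sequentially from 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE notes (id INTEGER PRIMARY KEY, flds TEXT);
CREATE TABLE cards (id INTEGER PRIMARY KEY, nid INTEGER, ord INTEGER);
INSERT INTO notes VALUES (1, 'first'), (2, 'second');
INSERT INTO cards VALUES (1, 1, 0), (2, 2, 0);
""")

BASE_MS = 1_618_444_800_000  # 2021-04-15 00:00:00 UTC in epoch milliseconds
reassign_ids(conn, BASE_MS)
```

A real collection has more columns and should be backed up before any such change.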

Fixing Knowclip .apkg files: one more thing

(N.B. A much-improved version of this script is published in a later post) Fixing the Knowclip note files as I described previously, it turns out, is only half of the fix for the broken .apkg files. You also need to fix the cards table. Why? Same reason. The rows are numbered sequentially from 1. But since Anki uses the card id field as the date added, the added field is always wrong.