Russian

Fixing Knowclip Anki apkg creation dates

(N.B. A much-improved version of this script is published in a later post)

Language learners who want to develop their listening comprehension skills often turn to YouTube for videos that feature native language content. Often these videos have subtitles in the original language. A handful of applications allow users to take these videos along with their subtitles and chop them up into sentence-length bites that are suitable for Anki cards. One such application is Knowclip. Indeed, for macOS users, it’s one of the few viable options.

Pre-processing Russian text for the AwesomeTTS add-on in Anki

The Anki add-on AwesomeTTS has been a vital tool for language learners using the Anki application on the desktop. It allows you to have elements of the card read aloud using text-to-speech capabilities. The new developer of the add-on has added a number of voice options, including the Microsoft Azure voices. The neural voices for Russian are quite good. But they have one major issue: the syllabic stress marks that are sometimes seen in text intended for language learners cause the Microsoft Azure voices to grossly mispronounce the word.

Factor analysis of failed language cards in Anki

After developing a rudimentary approach to detecting resistant language learning cards in Anki, I began teasing out individual factors. Once I was able to adjust the number of lapses for the age of the card, I could examine the effect of different factors on the difficulty score that I described previously.

Findings

Some of the interesting findings from this analysis:

  • Prompt-answer direction - 62% of lapses were in the Russian → English (recognition) direction.
  • Part of speech - Over half (51%) of lapses were among verbs. Since the Russian verbal system is rich and complex, it’s not surprising to find that verb cards often fail.
  • Noun gender - Between a fifth and a quarter (22%) of all lapses were among neuter nouns. Among lapses involving nouns only, neuter nouns accounted for 69%. This, too, makes intuitive sense because neuter nouns often represent abstract concepts that are difficult to represent mentally. For example, the Russian words for community, representation, and indignation are all neuter nouns.

Interventions

With a better understanding of the factors that contribute to lapses, it is easier to anticipate failures before they accumulate. For example, I will immediately implement a plan to surround new neuter nouns with a larger variety of audio and sample sentence cards. For new verbs, I’ll do the same, ensuring that I include multiple forms of the verb, varying the examples by tense, number, person, aspect and so on.

Refactoring Anki language cards

Regardless of how closely you adhere to the 20 rules for formatting knowledge, there are cards that seem destined to leechdom. For me, part of the problem is that with languages, straight-up vocabulary cards take words out of the rich context in which they exist in the wild. With my maturing collection of Russian decks, I recently started to go through these resistant cards and figure out why they are so difficult.

Parsing Russian Wiktionary content using XPath

As readers of this blog know, I’m an avid user of Anki to learn Russian. I have a number of sources for reference content that go onto my Anki cards. Notably, I use Wiktionary to get word definitions and the word with the proper syllabic stress marked. (This is an aid to pronunciation for Russian language learners.)

Since I’m lazy to the core, I came up with a systematic way of grabbing the stress-marked word from the Wiktionary page using lxml and XPath.
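To give a feel for the idea, here is a minimal sketch that pulls a stress-marked headword out of a small markup fragment with an XPath-style query. It is not the lxml code from the post: it uses the standard library's ElementTree, which supports only a subset of XPath, and the fragment, the `headword` class name, and the `find_headword` helper are all assumptions made for illustration. The real Wiktionary markup is larger and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment standing in for a Wiktionary page.
# \u0301 is the combining acute accent used as the stress mark.
SAMPLE = """<div>
  <p><strong class="Cyrl headword" lang="ru">стоя\u0301ть</strong></p>
  <p><strong lang="ru">стоять</strong></p>
</div>"""


def find_headword(markup):
    """Return the text of the first Russian headword element, or None."""
    root = ET.fromstring(markup)
    # ElementTree handles simple attribute predicates like [@lang='ru'];
    # the class check is done in Python because contains() is unavailable.
    for node in root.findall(".//strong[@lang='ru']"):
        if "headword" in node.get("class", "").split():
            return "".join(node.itertext())
    return None


print(find_headword(SAMPLE))
```

With lxml, the class test could probably move into the XPath expression itself, since lxml implements full XPath 1.0.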

Removing stress marks from Russian text

Previously, I wrote about adding syllabic stress marks to Russian text. Here’s a method for doing the opposite - that is, removing such marks (ударение) from Russian text.

Although there may well be a more sophisticated approach, regex is well-suited to this task. The problem is that

def string_replace(replacements, text):
   # Swap each stressed vowel (vowel + combining acute accent) for the bare vowel
   for stressed, plain in replacements.items():
      text = text.replace(stressed, plain)
   return text

# Map of stressed vowels to their unstressed counterparts
# (named stress_map to avoid shadowing the built-in dict)
stress_map = { "а́" : "а", "е́" : "е", "о́" : "о", "у́" : "у",
      "я́" : "я", "ю́" : "ю", "ы́" : "ы", "и́" : "и",
      "ё́" : "ё", "А́" : "А", "Е́" : "Е", "О́" : "О",
      "У́" : "У", "Я́" : "Я", "Ю́" : "Ю", "Ы́" : "Ы",
      "И́" : "И", "Э́" : "Э", "э́" : "э"
   }

print(string_replace(stress_map, "Существи́тельные в шве́дском обычно де́лятся на пять склоне́ний."))

This should print: Существительные в шведском обычно делятся на пять склонений.
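Since regex is mentioned above, here is a minimal sketch of a regex-based alternative to a lookup table of stressed vowels. It assumes the stress mark is always the combining acute accent (U+0301); the function name `remove_stress_marks` is made up for this example. Normalizing first means letters like ё and й survive intact, because only the acute accent is removed before the text is recomposed.

```python
import re
import unicodedata

# Assumption: stress is marked with U+0301 COMBINING ACUTE ACCENT.
STRESS_MARK = re.compile("\u0301")


def remove_stress_marks(text):
    """Strip combining acute accents while leaving other diacritics alone."""
    # NFD splits letters into base + combining marks so the accent is exposed.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = STRESS_MARK.sub("", decomposed)
    # NFC recomposes the remaining pairs, e.g. е + diaeresis back into ё.
    return unicodedata.normalize("NFC", stripped)


print(remove_stress_marks("шве\u0301дском"))
```

One side effect worth knowing about: the output is always in NFC form, so text that arrived in decomposed form comes back recomposed.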

URL-encoding URLs in AppleScript

The AppleScript Safari API is apparently quite finicky and rejects Russian Cyrillic characters when loading URLs.

For example, the URL https://en.wiktionary.org/wiki/стоять#Russian throws an error in AppleScript. Instead, Safari requires URLs of the form https://en.wiktionary.org/wiki/%D1%81%D1%82%D0%BE%D1%8F%D1%82%D1%8C#Russian, whereas Chrome happily consumes whatever comes along. So, we just need to encode the URL thusly:

use framework "Foundation"

-- encode Cyrillic text as "%D0" type strings
on urlEncode(input)
   tell current application's NSString to set rawUrl to stringWithString_(input)
   -- 4 is NSUTF8StringEncoding
   set theEncodedURL to rawUrl's stringByAddingPercentEscapesUsingEncoding:4 
   return theEncodedURL as Unicode text
end urlEncode

When researching Russian words for vocabulary study, I use the URL encoding handler to load the appropriate words into several reference sites in sequential Safari tabs.
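For comparison, the same percent-encoding can be sketched in Python with the standard library. This is only an illustration of the encoding step, not part of the AppleScript workflow; the helper name `wiktionary_url` and its `section` parameter are assumptions made for this example.

```python
from urllib.parse import quote


def wiktionary_url(word, section="Russian"):
    """Build an English Wiktionary URL with the word percent-encoded as UTF-8."""
    # quote() encodes non-ASCII characters as UTF-8 bytes by default.
    return f"https://en.wiktionary.org/wiki/{quote(word)}#{section}"


print(wiktionary_url("стоять"))
# prints https://en.wiktionary.org/wiki/%D1%81%D1%82%D0%BE%D1%8F%D1%82%D1%8C#Russian
```

Only the word is encoded here; the fragment marker is appended afterwards so that it stays a literal `#`.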

Свидетельство того или тому?

I was puzzled by this sentence on the BBC Russian Service:

Нет свидетельств тому, что на нынешних выборах дело обстоит иначе.

ББС
  Мошенничество на выборах в США? Проверяем факты в речи Трампа

It means “There is no evidence that in the current election things are any different,” but the puzzle isn’t the meaning; it’s the grammatical case in which the author has placed the demonstrative pronoun то, which appears here in the dative as тому. The thing is that you see examples where either the genitive or the dative follows свидетельство. So what’s the difference?

Escaping "Anki hell" by direct manipulation of the Anki sqlite3 database

There’s a phenomenon that veteran Anki users are familiar with - the so-called “Anki hell” or “ease hell.”

Origins of ease hell

The descent into ease hell has to do with the way Anki handles correct and incorrect answers when it presents cards for review. Ease is a numerical score associated with every card in the database and represents a valuation of the difficulty of the card. By default, when cards graduate from the learning phase, an ease of 250% is applied to the card. If you continue to get the card correct, then the ease remains at 250% in perpetuity. As you see the card at its increasing intervals, the ease will remain the same. All good. Sort of.
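As a rough illustration of what looking at ease in the database might involve, here is a minimal sketch using Python's sqlite3 module. It runs against an in-memory table with made-up rows, not a real collection. The column names (`type`, `factor`), the convention that review cards have `type = 2`, and the storage of ease in permille (2500 for 250%) reflect my understanding of the Anki schema and should be treated as assumptions. The helper names are hypothetical, and resetting low ease to the default is just one possible intervention, not necessarily the one this post settles on.

```python
import sqlite3

DEFAULT_EASE = 2500  # assumed permille storage: 2500 means 250%

# Simplified stand-in for the Anki cards table, with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cards (id INTEGER PRIMARY KEY, type INTEGER, factor INTEGER)"
)
conn.executemany(
    "INSERT INTO cards (id, type, factor) VALUES (?, ?, ?)",
    [(1, 2, 2500), (2, 2, 1300), (3, 2, 1900), (4, 0, 0)],
)


def count_low_ease(db, threshold=DEFAULT_EASE):
    """Count review cards whose ease has fallen below the threshold."""
    row = db.execute(
        "SELECT COUNT(*) FROM cards WHERE type = 2 AND factor < ?",
        (threshold,),
    ).fetchone()
    return row[0]


def reset_low_ease(db, threshold=DEFAULT_EASE):
    """Raise the ease of low-ease review cards back to the threshold."""
    cur = db.execute(
        "UPDATE cards SET factor = ? WHERE type = 2 AND factor < ?",
        (threshold, threshold),
    )
    db.commit()
    return cur.rowcount  # number of cards changed


print(count_low_ease(conn))
```

Anyone trying this on a real collection would want to work on a backup with Anki closed, since a real collection likely needs more bookkeeping than this sketch shows.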

Typing Russian stress marks on macOS

While Russian text intended for native speakers doesn’t show accented vowel characters to point out the syllabic stress (ударение), many texts intended for learners do have these marks. But how do you apply these marks when typing?

Typically, for Latin keyboards on macOS, you can hold down a key (like a long-press on iOS) and a popup dialog will show you options for that character. But with the standard Russian phonetic keyboard it doesn’t work. Hold down the e key and you’ll get the option for the letter ё (yes, it’s regarded as a separate letter in Russian - the essential but misbegotten ё).