Anki

Pre-processing Russian text for the AwesomeTTS add-on in Anki

The Anki add-on AwesomeTTS has been a vital tool for language learners using the Anki desktop application. It lets you have elements of a card read aloud using text-to-speech. The new developer of the add-on has added a number of voice options, including the Microsoft Azure voices. The neural voices for Russian are quite good, but they have one major issue: the syllabic stress marks that sometimes appear in text intended for language learners cause the Azure voices to grossly mispronounce the word.
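As a rough illustration of this kind of pre-processing, here is a minimal Python sketch that strips stress marks before text is handed to a TTS voice. It assumes the stress is encoded as the combining acute accent (U+0301); the function name strip_stress is made up for this example and is not part of AwesomeTTS.

```python
import unicodedata

# Assumption: stress is marked with the combining acute accent (U+0301).
COMBINING_ACUTE = "\u0301"

def strip_stress(text):
    """Remove combining acute accents while leaving letters such as ё and й intact."""
    # Decompose so that any precomposed accented letters expose their accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop only the acute accent, then recompose (this restores ё and й).
    return unicodedata.normalize("NFC", decomposed.replace(COMBINING_ACUTE, ""))

print(strip_stress("соба\u0301ка"))
```

Note that this also removes acute accents from any Latin text in the same field (é becomes e), which may or may not be what you want.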

Factor analysis of failed language cards in Anki

After developing a rudimentary approach to detecting resistant language learning cards in Anki, I began teasing out individual factors. Once I was able to adjust the number of lapses for the age of the card, I could examine the effect of different factors on the difficulty score that I described previously.

Findings

Some of the interesting findings from this analysis:

  • Prompt-answer direction - 62% of lapses were in the Russian → English (recognition) direction.
  • Part of speech - Over half (51%) of lapses were among verbs. Since the Russian verbal system is rich and complex, it’s not surprising to find that verb cards often fail.
  • Noun gender - Between a fifth and a quarter (22%) of all lapses were among neuter nouns. Among lapses on noun cards alone, neuter nouns represented 69%. This, too, makes intuitive sense because neuter nouns often represent abstract concepts that are difficult to represent mentally. For example, the Russian words for community, representation, and indignation are all neuter nouns.

Interventions

With a better understanding of the factors that contribute to lapses, it is easier to anticipate failures before they accumulate. For example, I will immediately implement a plan to surround new neuter nouns with a larger variety of audio and sample sentence cards. For new verbs, I’ll do the same, ensuring that I include multiple forms of the verb, varying the examples by tense, number, person, aspect and so on.

Refactoring Anki language cards

Regardless of how closely you adhere to the 20 rules for formatting knowledge, there are cards that seem destined to leechdom. For me, part of the problem is that with languages, straight-up vocabulary cards take words out of the rich context in which they exist in the wild. With my maturing collection of Russian decks, I recently started to go through these resistant cards and figure out why they are so difficult.

Parsing Russian Wiktionary content using XPath

As readers of this blog know, I’m an avid user of Anki to learn Russian. I have a number of sources for reference content that go onto my Anki cards. Notably, I use Wiktionary to get word definitions and the word with the proper syllabic stress marked. (This is an aid to pronunciation for Russian language learners.)

Since I’m lazy to the core, I came up with a systematic way of grabbing the stress-marked word from the Wiktionary page using lxml and XPath.
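To give a feel for the XPath approach, here is a minimal sketch. It swaps in the standard library's ElementTree, which supports a limited XPath subset, in place of lxml so that it runs without extra installs. The markup in SAMPLE is invented for illustration and is not the real Wiktionary page structure, and the function name stressed_headword is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented fragment for illustration only; NOT the real Wiktionary markup.
SAMPLE = (
    "<div>"
    "<p><strong class='headword' lang='ru'>соба\u0301ка</strong> f anim</p>"
    "</div>"
)

def stressed_headword(markup):
    """Return the text of the first <strong class='headword'> element, or None."""
    root = ET.fromstring(markup)
    # ElementTree's XPath subset supports [@attrib='value'] predicates.
    node = root.find(".//strong[@class='headword']")
    return node.text if node is not None else None

print(stressed_headword(SAMPLE))
```

Real pages are HTML rather than well-formed XML, which is one reason to use an HTML-aware parser such as lxml for actual scraping.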

Directly setting an Anki card's interval in the sqlite3 database

It’s always best to let Anki set intervals according to its view of your performance on testing. That said, there are times when directly altering the interval makes sense. For example, to build out a complete representation of the entire Russian National Corpus, I’m forced to enter vocabulary terms that should be obvious to even elementary Russian learners but which aren’t yet in my nearly 24,000-card database. Therefore, I’m entering these cards gradually. When they come up as new cards, I pass them as “Easy” on the first appearance, converting them to review cards. But ideally, I’d like to send them away for years.
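A minimal sketch of what such an update could look like in Python follows. It runs against a throwaway in-memory table containing only the cards columns it touches (id, type, queue, due, ivl, mod, usn), not a real collection. The function name set_interval and the today parameter are assumptions made for this example; as I understand the schema, a review card's due value is a day number counted from the collection's creation date, and usn = -1 flags the row as needing sync. Back up your collection and close Anki before trying anything like this on real data.

```python
import sqlite3
import time

def set_interval(conn, card_id, new_ivl, today):
    """Set a review card's interval (in days) and move its due day to match.

    `today` is the collection's current day number; it is passed in here
    rather than derived from the collection's creation timestamp.
    """
    conn.execute(
        "UPDATE cards SET ivl = ?, due = ?, mod = ?, usn = -1 "
        "WHERE id = ? AND type = 2",  # type 2 = review card
        (new_ivl, today + new_ivl, int(time.time()), card_id),
    )
    conn.commit()

# Demo on an in-memory stand-in for the cards table (subset of columns only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cards (id INTEGER PRIMARY KEY, type INTEGER, queue INTEGER, "
    "due INTEGER, ivl INTEGER, mod INTEGER, usn INTEGER)"
)
conn.execute("INSERT INTO cards VALUES (1, 2, 2, 504, 4, 0, 0)")
set_interval(conn, 1, 1000, today=500)
row = conn.execute("SELECT ivl, due FROM cards WHERE id = 1").fetchone()
print(row)
```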

Regex to match a cloze

Anki and some other platforms use a particular format to signify cloze deletions in flashcard text. It has a format like any of the following:

  • {{c1::dog::}}
  • {{c2::dog::domestic canine}}

Here’s a regular expression that matches the content of cloze deletions in an arbitrary string, keeping only the main clozed word (in this case dog.)

{{c\d::(.*?)(::[^:]+)?}}

Here it is in action in a Python script:

import re

def stripCloze(searchText):
    # Replace each cloze deletion with its main text (group 1), dropping any hint
    return re.sub(r'{{c\d::(.*?)(::[^:]+)?}}', r"\1", searchText)

print(stripCloze("The {{c1::passengers::tourist riders}} spotted a breaching {{c2::whale}}."))

It should return The passengers spotted a breaching whale.
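One limitation worth noting is that \d matches a single digit, so a cloze numbered c10 or higher would not match. The following is a hedged variant, not the original expression: it accepts multi-digit cloze numbers and also returns the hint when there is one. The names CLOZE and cloze_parts are made up for this sketch.

```python
import re

# Variant pattern: \d+ allows c10 and above; the hint is an optional second group.
CLOZE = re.compile(r"\{\{c\d+::(.*?)(?:::(.*?))?\}\}")

def cloze_parts(text):
    """Return a list of (answer, hint) tuples; hint is None when absent."""
    return [(m.group(1), m.group(2)) for m in CLOZE.finditer(text)]

sentence = "The {{c1::passengers::tourist riders}} spotted a breaching {{c10::whale}}."
print(cloze_parts(sentence))
```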

An alternative method for keyboard input switching on macOS

macOS offers a variety of virtual keyboard layouts which are accessible through System Preferences > Keyboard > Input Sources. Because I spend about half of my time writing in Russian and half in English, rapid switching between keyboard layouts is important. Optionally, in the Input Sources preference pane, you can choose to use the Caps Lock key to toggle between sources. This almost always works well, with the exception of Anki. Presumably Anki’s non-standard text management system thwarts the built-in Caps Lock toggle mechanism, though the exact reason is not clear to me. Equally unclear is why this worked previously but now does not. I’ve not updated either Anki or the system software. It’s a mystery. Nonetheless, I began to search for an alternative method for switching between keyboard layouts. What I developed relies on several tools:

Sunday, September 16, 2018

Regex 101 is a great online regex tester.


Speaking of regular expressions, for the past year, I’ve used an automated process for building Anki flash cards. One of the steps in the process is to download Russian word pronunciations from Wiktionary. When Wiktionary began publishing transcoded mp3 files rather than just ogg files, they broke the URL scheme that I relied on to download content. The new regex for this scheme is: (?:src=.*:)?src=\"(\/\/.*\.mp3)
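Here is a small sketch of applying that expression in Python. The HTML fragment and the example.org URLs are invented for illustration and are not real Wiktionary markup; the function name find_mp3_url is also hypothetical. The two source tags sit on separate lines, which matters because . does not cross newlines by default.

```python
import re

# The expression from the text, unchanged.
MP3_PATTERN = re.compile(r'(?:src=.*:)?src=\"(\/\/.*\.mp3)')

# Invented sample markup; the host and paths are placeholders.
SAMPLE_HTML = (
    '<source src="//example.org/audio/word.ogg" type="audio/ogg">\n'
    '<source src="//example.org/transcoded/word.ogg.mp3" type="audio/mpeg">\n'
)

def find_mp3_url(html):
    """Return the first protocol-relative mp3 URL found, or None."""
    match = MP3_PATTERN.search(html)
    return match.group(1) if match else None

url = find_mp3_url(SAMPLE_HTML)
print(url)
```

The captured URL is protocol-relative, so a scheme such as https: has to be prepended before downloading.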

Peering into Anki using R

Yet another diversion to keep me from focusing on actually using Anki to learn Russian. I stumbled on the R programming language, a language that focuses on statistical analysis.

Here are a couple of snippets that begin to scratch the surface of what’s possible. Important caveat: I’m an R novice at best. There are probably much better ways of doing some of this…

Counting notes with a particular model type

Here we’ll use R to do what we did previously with Python.

Anki database adventures: Counting notes by model type

Continuing my series on accessing the Anki database outside of the Anki application environment, here’s a piece on accessing the note type model. You may wish to start here with the first article on accessing the Anki database. This is geared toward macOS. (If you’re not on macOS, then start here instead.)
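As a rough sketch of the kind of query involved, the example below counts notes per note type. It assumes the older collection schema, in which each row of notes carries a model id in mid and the col table stores the models as JSON keyed by that id. It runs on an in-memory stand-in with made-up model ids and names rather than a real collection, and the function name count_notes_by_model is hypothetical.

```python
import json
import sqlite3

def count_notes_by_model(conn):
    """Return {model name: note count}, joining notes.mid to the col.models JSON."""
    models = json.loads(conn.execute("SELECT models FROM col").fetchone()[0])
    counts = {}
    for mid, n in conn.execute("SELECT mid, COUNT(*) FROM notes GROUP BY mid"):
        # Fall back to the raw id if the model is missing from the JSON.
        name = models.get(str(mid), {}).get("name", str(mid))
        counts[name] = n
    return counts

# In-memory stand-in with invented model ids and names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE col (models TEXT)")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, mid INTEGER)")
conn.execute(
    "INSERT INTO col VALUES (?)",
    (json.dumps({"101": {"name": "Basic"}, "102": {"name": "Cloze"}}),),
)
conn.executemany("INSERT INTO notes (mid) VALUES (?)", [(101,), (101,), (102,)])
counts = count_notes_by_model(conn)
print(counts)
```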

The note type model

Since notes contain flexible fields in Anki, the model for a note type is in JSON. The best-guess definition of the JSON is: