A deep dive into my Anki language learning: Part I (Overview and philosophy)

Although I’ve been writing about Anki for years, it’s been in bits and pieces. Solving little problems. Creating efficiencies. But I realized that I’ve never taken a top-down approach to my Anki language learning system. So consider this post the launch of that overdue effort.

Caveats

A few caveats at the outset:

  • I’m not a professional language tutor or pedagogue of any sort, really. Much of what I’ve developed, I’ve done through trial-and-error, some intuition, and some reading on relevant topics.
  • People learn differently and have different goals. This series will be exclusively focused on language-learning. There are similarities between this type of learning and the memorization of bare facts. But there are important differences, too.
  • As I get further and further into the details, more and more of what I discuss will be macOS-specific. I’m not particularly opinionated about operating systems; my preference has more to do with the accumulated weight of what I’m accustomed to and, as a consequence, the potential pain of switching. In the sections that deal with macOS-specific solutions, feel free to skip that content or read it with a view toward parallel tools on whatever OS you are using.
  • I use Anki almost exclusively for Russian language acquisition and practice. Of necessity, some particularities of the language are going to dictate the specific issues you need to solve for. For example, if verbs of motion aren’t part of the grammar of your target language (TL), then rather than getting lost in those weeds, think about what unique features your TL does have and how you might adapt the approaches I’m presenting.

With that out of the way, let’s dive in!

About me, about my language learning

A native English speaker, I began studying Russian about 40 years ago. While I used Anki intermittently, I didn’t begin using it in earnest until 2015 - about seven years ago. In those seven years, I’ve amassed about 34,000 cards.1 Owing to the complexities of life, I’ve had a few gaps in my study streak, but right now it sits at 600 days. Occasionally I use the iOS version of Anki, but mostly I prefer the desktop experience.

Principles

Over the years, I’ve developed a set of principles that guide what kinds of decks I create, the type of content I put on the cards, and how I set up my workflow. In no particular order, these are my guiding principles for using Anki to study languages.

  1. There are lots of ways to use Anki to study languages, and the more diverse the “angles” you use to approach it, the better. - A lot of new Anki users download a pre-built deck and just start studying. With a few notable exceptions, these decks are straight L1 ←→ L2 single-word vocabulary decks. This is a mistake. Language comes at us in a variety of ways, but seldom one word at a time.
  2. Favour sentences. - There’s a role for vocabulary decks - I have a large vocabulary deck. But over time, I’ve focused more and more on integration of vocabulary into connected speech.
  3. Avoid using Anki to memorize grammar tables. - Anki can handle tables easily, but that doesn’t mean you should use it to memorize grammar tables. In fact, grammar tables are only useful as reference points. Otherwise, use Anki to learn grammar in the context of applied natural language.
  4. Make your own cards. - If you use a downloaded pre-made deck you’ll be constrained by the choices the creator has made. As often as not, they’re flawed. I’ve tried pre-built decks and ended up discarding every one.
  5. Be visual. - Where appropriate, use imagery on cards. But don’t force it. If an image is only tangentially connected to the word or concept on the card, omit it; it won’t add anything to your efforts.
  6. Learn in multiple directions. - I see questions about whether to learn L1 → L2, L2 → L1, or both. Some people have strong opinions about this. Mine is to learn in both directions to maximize the flexibility of your knowledge: you’ll be able both to recognize a word and to produce it. As I’ll describe in later posts, you can even extend this to monolingual cards, image-only cards, and other variations.
  7. Be audible. - In the same way that visual imagery is crucial, so is audible content. Every word should have a pronunciation. It further reinforces your knowledge.
  8. Develop a consistent workflow. - By being consistent with when you study, how you study, how many cards you create per day, the types of information that you extract for your cards, your language learning habit will be more secure and you’ll make more consistent progress.
  9. Read the Anki manual. - No joke. RTFM. In general, people on r/Anki are pretty nice about it, but the answers to many beginner questions can be found in the manual. Yet no one reads it. Don’t be that guy.
  10. Avoid frequency lists. - Everyone wants to learn words in frequency order. It seems logical, but it’s a poor way to acquire vocabulary. We learn vocabulary by relevance and context, not by frequency.
  11. Don’t use Anki as a standalone tool. - Use Anki to reinforce what you’re learning through consuming media, conversing with others, and reading.
  12. Even a little technical knowledge can speed up the card-making process. - It’s also a good way to learn some technical skills. I didn’t do much with JavaScript, for example, until I started solving problems in Anki. You can do the same with CSS, etc. (See the sketch after this list.)
  13. Consistency beats speed. - Maybe you can learn 50 words a day. Most people can’t. I can’t.
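To make principle 12 concrete, here is the kind of small helper I have in mind. This is a hypothetical sketch, not my actual note layout: it converts "word<TAB>definition" pairs on stdin into a file that Anki’s importer can read, adding an unaccented copy of each word so that searching and sorting aren’t confused by stress marks.

#!/usr/bin/env python3
# Hypothetical sketch for principle 12 (not my real note layout): turn
# "word<TAB>definition" pairs on stdin into a tab-separated file that
# Anki can import. The combining acute accent (U+0301) marking syllabic
# stress is stripped to produce a clean, searchable form of each word.
import csv
import re
import sys

with open('anki_import.tsv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out, delimiter='\t')
    for line in sys.stdin:
        word, definition = line.rstrip('\n').split('\t')
        bare = re.sub('\u0301', '', word)  # unaccented form for search/sort
        writer.writerow([bare, word, definition])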

Deck overview

The last part of this post is just a teaser: an overview of my deck setup. In subsequent posts, I’ll dive into what each of these decks does and how it fits into my strategy. Here it is:

From the 30,000-foot view, these decks are:

  1. MCD (“Massive-context cloze deletion”) - This approach comes up in the Japanese-learning community a lot. I’m not 100% sold on it yet. The idea (I think) is to use larger blocks of source material for your cloze cards.
  2. Грамматика (Grammar) - These are cards that emphasize particular grammatical points. Rarely will you find any conjugation or declension tables here. Mostly it’s applied grammar.
  3. Помоги мне (“Help me”) - Just a few cards here. The way a card gets here is that it has been a leech elsewhere and I need something more to help me remember the word - emphasizing the word’s roots, contrasting it with a similar-sounding word - something extra.
  4. Предложения (Sentences) - Like it says on the tin. Just sentences, all cloze deletion.
  5. Предложения - A/V (Sentences A/V) - Sentences that come from movie subtitles. The point is recognition of the sentence by listening.
  6. Словарный запас (Vocabulary) - Vocabulary terms. Most of the cards here originate in my most complex note type because I want to study from every angle of a word. (See principle 1 above.)
  7. Vocabulary filtered decks - These aren’t in active use. If I want to study transportation or banking terms as a group, this is where I would go.
  8. Числа (Numbers) - Numbers in Russian are somewhat difficult. This is where those cards live.

And that’s it for now. In the next post - Part II - I’ll tackle the most complex deck, the one where my vocabulary terms live.


  1. I have had a few people ask me for my decks. The answer as of right now is no. I’m not opposed to sharing a curated subset of my decks to help someone get started but I’m not there yet. Otherwise, there’s just too much in there for me to feel comfortable releasing into the wild. If I ever get to the point of releasing something, you’ll see it here first. ↩︎

A tool for scraping definitions of Russian words from Wiktionary

In my perpetual attempt to make my Anki-based language learning process more efficient, I’ve written a tool to extract English-language definitions of Russian words from Wiktionary. I wrote about the idea previously in Scraping Russian word definitions from Wiktionary: utility for Anki, but that version relied on the WiktionaryParser module, which is good but misses some important edge cases. So I rolled up my sleeves and crafted my own solution. As with WiktionaryParser, the heavy lifting is done by the Beautiful Soup parser. Much of the logic of this tool is around detecting the edge cases I mentioned. For example, the underlying HTML format changes when we’re dealing with a word that has multiple etymologies versus a single etymology. Whenever you’re doing web scraping, you have to account for those sorts of variations.

Code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from urllib.request import urlopen, Request
import urllib.error  # needed for the HTTPError handling below
import urllib.parse
from http.client import HTTPResponse
from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem, HardwareType
import copy
import re
import sys
from bs4 import BeautifulSoup, element

def remove_html_comments(html: str) -> str:
    """
    Strips HTML comments. See https://stackoverflow.com/a/57996414
    :param html: html string to process
    :return: html string with comments stripped
    """
    result = re.sub(r'(<!--.*?-->)|(<!--[\S\s]+?-->)|(<!--[\S\s]*?$)', "", html)
    return result

def extract_russian_soup(response: HTTPResponse) -> BeautifulSoup:
   new_soup = BeautifulSoup('', 'html.parser')
   # remove HTML comments before processing
   html_str = response.read().decode('UTF-8')
   cleaner_html = remove_html_comments(html_str)
   soup = BeautifulSoup(cleaner_html, 'html.parser')
   # get rid of certain tags to make the soup
   # lighter to work with
   for s in soup(['head', 'script', 'footer']):
      s.extract()
   for h2 in soup.find_all('h2'):
      for span in h2.children:
         try:
            if 'Russian' in span['id']:
               new_soup.append(copy.copy(h2))
               # capture everything in the Russian section
               for curr_sibling in h2.next_siblings:
                  if curr_sibling.name == "h2":
                     break
                  else:
                     new_soup.append(copy.copy(curr_sibling))
               break
         except (KeyError, TypeError):
            # not every child is a tag with an 'id' attribute; skip those
            pass
   return new_soup
   
def check_excluded_ids(span_id: str) -> bool:
   excluded = ['Pronunciation', 'Alternative_forms', 'Etymology']
   for ex in excluded:
      if re.search(ex, span_id, re.IGNORECASE):
         return True
   return False

def remove_dl_ul(li: element.Tag) -> element.Tag:
   # quotations/citations are often presented in <dl> tags; remove them
   try:
      li.dl.extract()
   except AttributeError:
      pass
   # sometimes citations are presented in <ul>, so remove those too
   try:
      li.ul.extract()
   except AttributeError:
      pass
   return li

def url_from_ru_word(raw_word:str) -> str:
   # strip syllabic stress diacritical marks
   raw_word = re.sub(r'\u0301|\u0300', "", raw_word)
   raw_word = raw_word.replace(" ", "_").strip()
   word = urllib.parse.quote(raw_word)
   return f'https://en.wiktionary.org/wiki/{word}#Russian'

def request_headers() -> dict:
   hn = [HardwareType.COMPUTER.value]
   user_agent_rotator = UserAgent(hardware_types=hn,limit=20)
   user_agent = user_agent_rotator.get_random_user_agent()
   return {'user-agent': user_agent}

if __name__ == "__main__":
   __version__ = 1.0
   
   # accept word as either argument or on stdin
   try:
      raw_word = sys.argv[1]
   except IndexError:
      raw_word = sys.stdin.read()
      
   url = url_from_ru_word(raw_word)
   headers = request_headers()
   
   try:
      response = urlopen(Request(url, headers = headers))
   except urllib.error.HTTPError as e:
      if e.code == 404:
         print("Error - no such word")
      else:
         print(f"Error: status {e.code}")
      sys.exit(1)
   
   # first extract the Russian content because
   # we may have other languages. This just
   # simplifies the parsing for the headword
   new_soup = extract_russian_soup(response)
            
   # use the derived soup to pick out the headword from
   # the Russian-specific content
   definitions = []
   
   # there are cases (as with the word 'бухта') where there are
   # multiple etymologies. In these cases, the page structure is
   # different. We will try both structures.
   
   for tag in ['h3', 'h4']:
      for h3_or_h4 in new_soup.find_all(tag):
         found = False
         for h3_or_h4_child in h3_or_h4.children:
            if h3_or_h4_child.name == 'span':
               if h3_or_h4_child.get('class'):
                  span_classes = h3_or_h4_child.get('class')
                  if 'mw-headline' in span_classes:
                     span_id = h3_or_h4_child.get('id')
                     # exclude any h3 whose span is not a part of speech
                     if not check_excluded_ids(span_id):
                        found = True
                     break
         if found:
            ol = h3_or_h4.find_next_sibling('ol')
            if ol is None:
               continue
            lis = ol.children
            for li in lis:
               # skip '\n' children
               if li.name != 'li':
                  continue
               # remove any extraneous detail tags + children, etc.
               li = remove_dl_ul(li)
               li_def = li.text.strip()
               definitions.append(li_def)
   definition_list = '; '.join(definitions)
   # strip any leading '; ' sequences left behind by empty definition items
   definition_list = re.sub(r'^(?:;\s)+(.*)$', '\\1', definition_list)
   # remove "see also" links
   definition_list = re.sub(r'\(see also[^\)]*\)+', "", definition_list)
   print(definition_list)

Usage

The script works flexibly, accepting a Russian word either from stdin or as the first argument. For example:

echo "собака" | ruendef # or
ruendef "собака"

Both print out:

dog; hound; (derogatory, figuratively) mongrel, cur, bastard (a detestable person); (colloquial, figuratively) fox (a clever, capable person); (Internet) @ (at sign); (computing slang) watchdog timer
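If you need to look up a whole list of words, a thin wrapper over the script is straightforward. Here’s a sketch, assuming the script above is saved on your PATH as ruendef:

#!/usr/bin/env python3
# Sketch: batch lookups by shelling out to ruendef (the script above).
# Reads whitespace-separated words on stdin and prints one
# word<TAB>definitions line per word.
import subprocess
import sys

for word in sys.stdin.read().split():
    result = subprocess.run(['ruendef', word], capture_output=True, text=True)
    print(f"{word}\t{result.stdout.strip()}")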

Getting plaintext into Anki fields on macOS: An update

A few years ago, I wrote about my problems with HTML in Anki fields. If you check out that previous post, you’ll get the backstory about my objection. The gist is this: if you copy something from the web, Anki tries to maintain the formatting; basically, it just pastes the HTML off the clipboard. Supposedly, Anki offers to strip the formatting with Shift-paste, but I’ve pointed out to the developer specific examples where this fails.
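One workaround - a minimal sketch, not necessarily the whole answer - is to flatten the clipboard to its plain-text flavor before pasting. On macOS, pbpaste emits only the plain-text representation of the clipboard, and pbcopy writes back just that flavor, so a round-trip discards the HTML that Anki would otherwise pick up:

#!/usr/bin/env python3
# Minimal sketch: flatten the macOS clipboard to plain text so that a
# subsequent paste into an Anki field carries no HTML formatting.
import subprocess

text = subprocess.run(['pbpaste'], capture_output=True, text=True).stdout
subprocess.run(['pbcopy'], input=text, text=True)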

Thursday, May 26, 2022

I would like to propose a constitutional amendment that prohibits Sen. Ted Cruz (F-TX)1 from speaking or tweeting for seven days after a national tragedy. I’d also be fine with an amendment that prohibits him from speaking ever.


  1. The “F” designation stands for Fascist. The party to which Cruz nominally belongs is more aligned with WW2-era Axis dictatorships than with those of a legitimate free civil democracy. ↩︎

Extracting the title of a web page from the command line

I was using a REST API at https://textance.herokuapp.com/title, but it seems awfully fragile. Sure enough, this morning the entire application is down. It’s also not open-source, and I have no idea who actually runs it. Here’s my replacement:

#!/bin/bash
url=$(pbpaste)
curl "$url" -so - | pup 'meta[property=og:title] attr{content}'

It does require pup; on macOS, you can install it via brew install pup. There are other approaches using regular expressions with no dependency on pup, but parsing HTML with regex is not such a good idea.
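If you’d rather stay in Python (and you already have Beautiful Soup installed for the Wiktionary scraper above), a hypothetical equivalent might look like this:

#!/usr/bin/env python3
# Hypothetical Python equivalent of the pup pipeline: read a URL from the
# macOS clipboard, fetch the page, and print its og:title, falling back to
# the <title> tag if the page has no Open Graph metadata.
import subprocess
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

url = subprocess.run(['pbpaste'], capture_output=True, text=True).stdout.strip()
html = urlopen(Request(url, headers={'user-agent': 'Mozilla/5.0'})).read()
soup = BeautifulSoup(html, 'html.parser')
meta = soup.find('meta', property='og:title')
if meta and meta.get('content'):
    print(meta['content'])
elif soup.title:
    print(soup.title.string)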

Friday, May 20, 2022

“Enlightenment is the absolute cooperation with the inevitable." - Anthony De Mello. Although he writes like a Buddhist, apparently he’s a Jesuit.

Three-line (though non-standard) interlinear glossing

Still thinking about interlinear glossing for my language learning project. The leizig.js library is great but my use case isn’t really what the author had in mind. I really just need to display a unit consisting of the word as it appears in the text, the lemma for that word form, and (possibly) the part of speech. For academic linguistics purposes, what I have in mind is completely non-standard. The other issue with leizig.