Welcome to Part III of a deep dive into my Anki language learning decks. In Part I I covered the principles that guide how I setup my decks and the overall deck structure. In the lengthy Part II I delved into my vocabulary deck. In this installment, Part III, we’ll cover my sentence decks. Principles First, sentences (and still larger units of language) should eventually take precedence in language study. What help is it to know the word for “tomato” in your L2, if you don’t know how to slice a tomato, how to eat a tomato, how to grow a tomato plant?
In Part I of my series on my Anki language-learning setup, I described the philosophy that informs my Anki setup and touched on the deck overview. Now I’ll tackle the largest and most complex deck(s), my vocabulary decks. First some FAQ’s about my vocabulary deck: Do you organize it as L1 → L2 or as L2 → L1, or both? Actually, it’s both and more. Keep reading. Do you have separate subdecks by language level, or source, or some other characteristic?
Although I’ve been writing about Anki for years, it’s been in bits and pieces. Solving little problems. Creating efficiencies. But I realized that I’ve never taken a top-down approach to my Anki language learning system. So consider the post the launch of that overdue effort. Caveats A few caveats at the outset: I’m not a professional language tutor or pedagogue of any sort really. Much of what I’ve developed, I’ve done through trial-and-error, some intuition, and a some reading on relevant topics.
In my perpetual attempt to make my language learning process using Anki more efficient, I’ve written a tool to extract English-language definitions from Russian words from Wiktionary. I wrote about the idea previously in Scraping Russian word definitions from Wikitionary: utility for Anki but it relied on the WiktionaryParser module which is good but misses some important edge cases. So I rolled up my sleeves and crafted my own solution. As with WiktionaryParser the heavy-lifting is done by the Beautiful Soup parser.
PUNCTnodes in interlinear glossing.
Still thinking about interlinear glossing for my language learning project. The leizig.js library is great but my use case isn’t really what the author had in mind. I really just need to display a unit consisting of the word as it appears in the text, the lemma for that word form, and (possibly) the part of speech. For academic linguistics purposes, what I have in mind is completely non-standard. The other issue with leizig.
leipzig.js is a library for applying interlinear gloss to texts for linguistic analysis. In this post, I experiment a little with this libary to evaluate whether it would work for a little project of mine.
Starting a new devlog about Hedghog, a new language learning app and some thoughts about the interlinear display of lemmas.
Splitting text into sentences is one of those tasks that looks simple but on closer inspection is more difficult than you think. A common approach is to use regular expressions to divide up the text on punction marks. But without adding layers of complexity, that method fails on some sentences. This is a method using spaCy.
I have written previously about stripping syllabic stress marks from Russian text using a Perl-based regex tool. But I needed a means of doing in solely in Python, so this just extends that idea. #!/usr/bin/env python3 def strip_stress_marks(text: str) -> str: b = text.encode('utf-8') # correct error where latin accented ó is used b = b.replace(b'\xc3\xb3', b'\xd0\xbe') # correct error where latin accented á is used b = b.replace(b'\xc3\xa1', b'\xd0\xb0') # correct error where latin accented é is used b = b.