Russian

Sunday, September 16, 2018

Regex 101 is a great online regex tester.


Speaking of regular expressions, for the past year, I’ve used an automated process for building Anki flash cards. One of the steps in the process is to download Russian word pronunciations from Wiktionary. When Wiktionary began publishing transcoded mp3 files rather than just ogg files, they broke the URL scheme that I relied on to download content. The new regex for this scheme is: (?:src=.*:)?src=\"(\/\/.*\.mp3)
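As a rough sketch of how that pattern might be applied, the snippet below runs it over a small page fragment in Python. The sample HTML, the upload.example.org host, and the helper name first_mp3_url are all made up for illustration; the actual Wiktionary markup and my download script may differ.

```python
import re

# The pattern from the post, written as a Python raw string.
MP3_SRC = re.compile(r'(?:src=.*:)?src="(//.*\.mp3)')

# Hypothetical page fragment; the host and path are invented for illustration.
SAMPLE_HTML = (
    '<source src="//upload.example.org/Ru-слово.ogg" type="audio/ogg">\n'
    '<source src="//upload.example.org/transcoded/Ru-слово.ogg.mp3" type="audio/mpeg">\n'
)


def first_mp3_url(html):
    """Return the first protocol-relative mp3 URL as an https URL, or None."""
    match = MP3_SRC.search(html)
    if match is None:
        return None
    # The captured group starts with "//", so only the scheme is missing.
    return "https:" + match.group(1)


print(first_mp3_url(SAMPLE_HTML))
```

Because `.*` is greedy and `.` does not cross newlines by default, this sketch assumes each src attribute sits on its own line; several src attributes on one line would be captured as a single run.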

Language word frequencies

Since one of the cornerstones of my approach to learning the Russian language has been to track how many words I’ve learned and their frequencies, I was intrigued by reading the following statistics today:

  • The 15 most frequent words in the language account for 25% of all the words in typical texts.
  • The first 100 words account for 60% of the words appearing in texts.
  • 97% of the words one encounters in an ordinary text will be among the first 4000 most frequent words.

In other words, if you learn the first 4000 words of a language, you’ll recognize nearly every word you meet in an ordinary text.
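Coverage figures like these come from a simple calculation: count every word in a corpus, then ask what share of all tokens the top N words account for. Here is a minimal sketch of that arithmetic; the function name coverage and the ten-token toy corpus are invented for illustration and say nothing about real Russian frequencies.

```python
from collections import Counter


def coverage(tokens, top_n):
    """Fraction of all tokens accounted for by the top_n most frequent words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total


# Toy corpus, invented for illustration: 10 tokens, 4 distinct words.
tokens = "и и и и в в в не не на".split()

for n in (1, 2, 3):
    print(n, coverage(tokens, n))
```

In the toy corpus the single most frequent word covers 4 of 10 tokens, the top two cover 7 of 10, and the top three cover 9 of 10, which is the same steeply rising curve the statistics above describe.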

How I use Anki to learn Russian

Learning the vocabulary of a non-native language is a daunting task. The Russian vocabulary encompasses an estimated 200,000 words. For a foreign speaker, learning a vocabulary this large is a Herculean undertaking.^[Fortunately, many words are rare or obsolete and my experience with other languages is that you can make yourself understood with far less than the complete vocabulary.] The average adult English speaker is said to use about 20,000 words. Presumably Russian speakers can get by with about that number too. Nonetheless, it remains an enormous task, one that can’t be conquered solely by brute force.

Detecting Russian letters with regex

How to identify Russian letters in a string? The short answer is: [А-Яа-яЁё] but depending on your regex flavor, [\p{Cyrillic}] might work. What in the world does this regex mean? It’s just like [A-Za-z] with a twist. The Ёё at the end adds support for ё (“yo”), which is a Cyrillic letter but sits outside the contiguous А-я range of code points, so the ranges alone would miss it.
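A small Python sketch of the character class in use follows. The helper name has_russian is made up for illustration. Note that Python's built-in re module does not support \p{Cyrillic}, so the explicit class is the one used here.

```python
import re

# Explicit class: the two contiguous ranges plus Ё/ё, which lie outside them.
RUSSIAN_LETTER = re.compile(r"[А-Яа-яЁё]")


def has_russian(text):
    """Return True if text contains at least one Russian letter."""
    return RUSSIAN_LETTER.search(text) is not None


print(has_russian("привет"))  # letters inside the а-я range
print(has_russian("ёж"))      # relies on the explicit ё in the class
print(has_russian("hello"))   # Latin only

# Code points show why ё needs special handling: it falls after я.
print(hex(ord("а")), hex(ord("я")), hex(ord("ё")))
```

Dropping Ёё from the class would make a string consisting only of ё fail to match, since its code point (U+0451) comes after я (U+044F).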

See this question on Stack Overflow.