Sunday, September 16, 2018

Regex 101 is a great online regex tester. Speaking of regular expressions, for the past year, I’ve used an automated process for building Anki flash cards. One of the steps in the process is to download Russian word pronunciations from Wiktionary. When Wiktionary began publishing transcoded mp3 files rather than just ogg files, they broke the URL scheme that I relied on to download content. The new regex for this scheme is: (?

Language word frequencies

Since one of the cornerstones of my approach to learning the Russian language has been to track how many words I’ve learned and their frequencies, I was intrigued by the following statistics, which I read today: The 15 most frequent words in the language account for 25% of all the words in typical texts. The first 100 words account for 60% of the words appearing in texts. 97% of the words one encounters in an ordinary text will be among the first 4000 most frequent words.
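Coverage figures like these can be checked against any text you have on hand. The following is only a rough sketch: the `coverage` function name and the whitespace-split toy sample are my own assumptions for illustration, and a real measurement would need a proper tokenizer, lemmatization, and a large corpus.

```python
from collections import Counter


def coverage(tokens, top_n):
    """Fraction of all tokens accounted for by the top_n most frequent words.

    `tokens` is assumed to be an already-tokenized list of words;
    how the text gets tokenized is left to the caller.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # most_common(top_n) returns (word, count) pairs, highest count first.
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total


# Toy sample, not real corpus data.
sample = "и в и не и в на".split()
print(coverage(sample, 1))
print(coverage(sample, 2))
```

Running `coverage` over a large tokenized corpus with `top_n` set to 15, 100, and 4000 would give numbers to compare against the statistics quoted above.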

Detecting Russian letters with regex

How to identify Russian letters in a string? The short answer is: [А-Яа-яЁё] but depending on your regex flavor, [\p{Cyrillic}] might work. What in the world does this regex mean? It’s just like [A-Za-z] with a twist. The Ёё at the end adds support for ё (“yo”), which is a Cyrillic letter but sits outside the contiguous А-я range of code points, so the ranges alone miss it. See this question on Stack Overflow.
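Here is a minimal sketch of that character class in Python. The helper names `contains_russian` and `russian_letters` are made up for this example. Note that Python's built-in `re` module does not support `\p{Cyrillic}`; as far as I know, that syntax needs the third-party `regex` package, so the explicit character class is used here.

```python
import re

# А-Я and а-я form one contiguous run of code points (U+0410 to U+044F).
# Ё (U+0401) and ё (U+0451) fall outside that run, so they are listed explicitly.
RUSSIAN_LETTER = re.compile(r"[А-Яа-яЁё]")


def contains_russian(text):
    """Return True if the string has at least one Russian letter."""
    return RUSSIAN_LETTER.search(text) is not None


def russian_letters(text):
    """Return every Russian letter in the string, in order."""
    return RUSSIAN_LETTER.findall(text)


print(contains_russian("hello"))
print(contains_russian("ёлка"))
print(russian_letters("Привет, world"))
```

Dropping the `Ёё` from the class is an easy way to see the twist: `re.search(r"[А-Яа-я]", "ё")` finds nothing.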