regex

A regex to remove Anki's cloze markup

Recently, someone asked a question on r/Anki about changing an existing cloze-type note to a regular note. Part of the solution involves stripping the cloze markup from the existing cloze’d field. A cloze sentence has the form Play {{c1::stupid}} games. or Play {{c1::stupid::pejorative adj}} games. To handle both of these cases, the following regular expression will work. Just replace each match with $1. {{c\d::([^:}]+)(?:::+[^}]*)*}} However, the Cloze Anything markup is different. It uses ( and ) instead of curly braces.
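As a rough illustration of the substitution described above, here is a minimal Python sketch. The function name `strip_cloze` is hypothetical, the braces are escaped for safety, and `\d+` is used instead of `\d` on the assumption that cloze numbers can run past 9.

```python
import re

# Clozed text in group 1, followed by an optional "::hint" part.
# Assumption: \d+ rather than \d, so that c10 and above also match.
CLOZE_RE = re.compile(r"\{\{c\d+::([^:}]+)(?:::+[^}]*)*\}\}")


def strip_cloze(text: str) -> str:
    """Replace every cloze deletion with its clozed text (group 1)."""
    return CLOZE_RE.sub(r"\1", text)


if __name__ == "__main__":
    print(strip_cloze("Play {{c1::stupid}} games."))
    print(strip_cloze("Play {{c1::stupid::pejorative adj}} games."))
```

Both calls should print the plain sentence with the markup removed. Python's `re.sub` uses `\1` where many editors use `$1`.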

Stripping Russian syllabic stress marks in Python

I have written previously about stripping syllabic stress marks from Russian text using a Perl-based regex tool. But I needed a means of doing it solely in Python, so this just extends that idea.

#!/usr/bin/env python3
def strip_stress_marks(text: str) -> str:
    b = text.encode('utf-8')
    # correct error where latin accented ó is used
    b = b.replace(b'\xc3\xb3', b'\xd0\xbe')
    # correct error where latin accented á is used
    b = b.replace(b'\xc3\xa1', b'\xd0\xb0')
    # correct error where latin accented é is used
    b = b.
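The excerpt above is cut off, so the following is not the original function. It is a small sketch of the same idea done on `str` rather than bytes, assuming the stress mark is the combining acute accent (U+0301) and that the Latin letters ó, á and é should become Cyrillic о, а and е. The name `strip_stress_marks_str` is hypothetical.

```python
# Translation table (an assumption, not the original post's full list):
# drop the combining acute accent and swap mistakenly used Latin
# accented vowels for their Cyrillic look-alikes.
_STRESS_TABLE = {
    0x0301: None,      # combining acute accent, removed
    0x00F3: "\u043e",  # Latin ó -> Cyrillic о
    0x00E1: "\u0430",  # Latin á -> Cyrillic а
    0x00E9: "\u0435",  # Latin é -> Cyrillic е
}


def strip_stress_marks_str(text: str) -> str:
    """Return text without stress marks, using str.translate."""
    return text.translate(_STRESS_TABLE)


if __name__ == "__main__":
    print(strip_stress_marks_str("соба\u0301ка"))
```

The text is deliberately not normalized to NFD first, since decomposition would also split й and ё into a base letter plus a combining mark.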

Regex to match a cloze

Anki and some other platforms use a particular format to signify cloze deletions in flashcard text. It takes either of the following forms: {{c1::dog::}} {{c2::dog::domestic canine}} Here’s a regular expression that matches cloze deletions in an arbitrary string and captures only the main clozed word (in this case, dog): {{c\d::(.*?)(::[^:]*)?}} Here it is in action in a Python script:
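The original script is not part of this excerpt. A minimal sketch of how such a match might look in Python follows; the function name `cloze_words` is hypothetical, and the pattern is a variant that allows an empty hint (`[^:}]*`) so that {{c1::dog::}} yields dog rather than dog::.

```python
import re

# Group 1 is the clozed word; the optional non-capturing group
# swallows a "::hint" part, including an empty hint.
CLOZE_CONTENT_RE = re.compile(r"\{\{c\d+::(.*?)(?:::[^:}]*)?\}\}")


def cloze_words(text: str) -> list[str]:
    """Return the main clozed word of every cloze deletion in text."""
    return CLOZE_CONTENT_RE.findall(text)


if __name__ == "__main__":
    sample = "The {{c1::dog::}} chased the {{c2::cat::domestic feline}}."
    print(cloze_words(sample))
```

Because the pattern has a single capturing group, `findall` returns just the clozed words as a list of strings.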

Sunday, September 16, 2018

Regex 101 is a great online regex tester. Speaking of regular expressions, for the past year, I’ve used an automated process for building Anki flash cards. One of the steps in the process is to download Russian word pronunciations from Wiktionary. When Wiktionary began publishing transcoded mp3 files rather than just ogg files, they broke the URL scheme that I relied on to download content. The new regex for this scheme is: (?

Detecting Russian letters with regex

How to identify Russian letters in a string? The short answer is: [А-Яа-яЁё] but depending on your regex flavor, [\p{Cyrillic}] might work. What in the world does this regex mean? It’s just like [A-Za-z] with a twist. The Ёё at the end adds support for ё (“yo”), which sits outside the contiguous А-я range in Unicode (Ё is U+0401 and ё is U+0451, while А-я spans U+0410 to U+044F) and so has to be listed separately. See this question on Stack Overflow.
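To make the character class concrete, here is a short Python sketch. The helper names `contains_russian` and `russian_letters` are hypothetical. Python's built-in `re` module does not support `\p{Cyrillic}`, so the explicit range is used; the third-party `regex` package is one option if the property syntax is preferred.

```python
import re

# Explicit class: А-Я and а-я plus Ё/ё, which fall outside that range.
RUSSIAN_LETTER_RE = re.compile(r"[А-Яа-яЁё]")


def contains_russian(text: str) -> bool:
    """True if text contains at least one Russian letter."""
    return RUSSIAN_LETTER_RE.search(text) is not None


def russian_letters(text: str) -> list[str]:
    """Return every Russian letter found in text, in order."""
    return RUSSIAN_LETTER_RE.findall(text)


if __name__ == "__main__":
    print(contains_russian("hello"))
    print(russian_letters("abc Ёж 123"))
    # Without the explicit Ёё, the letter ё is missed:
    print(re.search(r"[А-Яа-я]", "ё"))
```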