Regex

Recently, someone asked a question on r/Anki about changing an existing cloze-type note to a regular note. Part of the solution involves stripping the cloze markup from the existing cloze’d field. A cloze sentence has the form Play {{c1::stupid}} games. or Play {{c1::stupid::pejorative adj}} games.

To handle both of these cases, the following regular expression will work. Just replace each match with $1, the first capture group.

\{\{c\d::([^:\}]+)(?:::+[^\}]*)*\}\}
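As a quick check, here is a minimal Python sketch of that substitution. The function name strip_cloze_markup is my own invention, and Python's re.sub spells the $1 backreference as \1.

```python
import re

# The pattern from above: capture the clozed text, skip any ::hint parts.
CLOZE = re.compile(r'\{\{c\d::([^:\}]+)(?:::+[^\}]*)*\}\}')

def strip_cloze_markup(text: str) -> str:
    # Replace each cloze with the contents of its first capture group.
    return CLOZE.sub(r'\1', text)

print(strip_cloze_markup("Play {{c1::stupid}} games."))
print(strip_cloze_markup("Play {{c1::stupid::pejorative adj}} games."))
# both print "Play stupid games."
```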

However, the Cloze Anything markup is different. It uses ( and ) instead of curly braces. If we want to flexibly remove both the standard and Cloze Anything markup, then our pattern would look like:

I have written previously about stripping syllabic stress marks from Russian text using a Perl-based regex tool. But I needed a means of doing it solely in Python, so this just extends that idea.

#!/usr/bin/env python3

def strip_stress_marks(text: str) -> str:
    b = text.encode('utf-8')
    # correct error where latin accented ó is used
    b = b.replace(b'\xc3\xb3', b'\xd0\xbe')
    # correct error where latin accented á is used
    b = b.replace(b'\xc3\xa1', b'\xd0\xb0')
    # correct error where latin accented é is used
    b = b.replace(b'\xc3\xa9', b'\xd0\xb5')
    # correct error where latin accented ý is used
    b = b.replace(b'\xc3\xbd', b'\xd1\x83')
    # remove combining acute accent (U+0301)
    b = b.replace(b'\xcc\x81', b'')
    # decode the cleaned bytes back to a str
    return b.decode('utf-8')

text = "Том столкну́л Мэри с трампли́на для прыжко́в в во́ду."

print(strip_stress_marks(text))
# prints "Том столкнул Мэри с трамплина для прыжков в воду."

The approach is similar to the Perl-based tool we constructed before, but this time we are working on the bytes object after encoding as UTF-8. Since the bytes object has a replace method, we can use that to do all of the work. The first 4 replacements all deal with edge cases where accented Latin characters are used to show the placement of syllabic stress instead of the Cyrillic character plus the combining diacritical mark. In these cases, we just need to substitute the proper Cyrillic character. Then we just strip out the “combining acute accent” U+0301, which is \xcc\x81 in UTF-8. After these replacements, we just decode the bytes object back to a str.
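For comparison, the same substitutions can also be done on the str directly. This is only a sketch of an alternative, not the byte-based function above: the names STRESS_TABLE and strip_stress_marks_str are hypothetical, and the characters are written as escapes so the Latin and Cyrillic look-alikes can't be confused.

```python
# Hypothetical str-level variant of the byte replacements described above.
STRESS_TABLE = str.maketrans({
    "\u00f3": "\u043e",  # Latin ó -> Cyrillic о
    "\u00e1": "\u0430",  # Latin á -> Cyrillic а
    "\u00e9": "\u0435",  # Latin é -> Cyrillic е
    "\u00fd": "\u0443",  # Latin ý -> Cyrillic у
    "\u0301": None,      # drop the combining acute accent
})

def strip_stress_marks_str(text: str) -> str:
    return text.translate(STRESS_TABLE)

# The byte sequences in the replacements are just the UTF-8 encodings.
assert "\u0301".encode("utf-8") == b"\xcc\x81"
assert "\u00f3".encode("utf-8") == b"\xc3\xb3"
assert "\u043e".encode("utf-8") == b"\xd0\xbe"

print(strip_stress_marks_str("столкну\u0301л"))  # prints "столкнул"
print(strip_stress_marks_str("м\u00f3ре"))       # prints "море"
```

Like the byte version, this replaces every Latin ó, á, é and ý it finds, so it is only suitable for text that is supposed to be entirely Russian.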

Anki and some other platforms use a particular format to signify cloze deletions in flashcard text. A cloze deletion looks like either of the following:

{{c1::dog::}}
{{c2::dog::domestic canine}}

Here’s a regular expression that matches the content of cloze deletions in an arbitrary string, keeping only the main clozed word (in this case, dog).

{{c\d::(.*?)(::[^:]+)?}}

Here it is in action in a Python script:

import re

def stripCloze(searchText):
    return re.sub(r'{{c\d::(.*?)(::[^:]+)?}}', r"\1", searchText)

print(stripCloze("The {{c1::passengers::tourist riders}} spotted a breaching {{c2::whale}}."))

It should return The passengers spotted a breaching whale.
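One edge case worth checking: as far as I can tell, the pattern above keeps the trailing :: when the hint is empty, as in {{c1::dog::}}, and \d only matches single-digit cloze numbers. The sketch below is one possible variant, not a definitive fix; the name strip_cloze_variant and the adjusted pattern are my own assumptions.

```python
import re

# The pattern from the post, for comparison.
ORIGINAL = re.compile(r'{{c\d::(.*?)(::[^:]+)?}}')

# Assumed variant: \d+ allows c10 and up, [^:]* allows an empty hint.
VARIANT = re.compile(r'\{\{c\d+::(.*?)(?:::[^:]*)?\}\}')

def strip_cloze_variant(search_text: str) -> str:
    return VARIANT.sub(r'\1', search_text)

print(ORIGINAL.sub(r'\1', "{{c1::dog::}}"))                   # prints "dog::"
print(strip_cloze_variant("{{c1::dog::}}"))                   # prints "dog"
print(strip_cloze_variant("{{c12::dog::domestic canine}}"))   # prints "dog"
```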

Regex 101 is a great online regex tester.

Speaking of regular expressions, for the past year, I’ve used an automated process for building Anki flash cards. One of the steps in the process is to download Russian word pronunciations from Wiktionary. When Wiktionary began publishing transcoded mp3 files rather than just ogg files, they broke the URL scheme that I relied on to download content. The new regex for this scheme is: (?:src=.*:)?src=\"(\/\/.*\.mp3)
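To show how the capture group is meant to be used, here is a small sketch in Python. The sample markup and the example.org URL are made up for illustration and are not copied from a real Wiktionary page; the real page source may look different.

```python
import re

# The pattern from the post, unchanged.
MP3_SRC = re.compile(r'(?:src=.*:)?src=\"(\/\/.*\.mp3)')

# Made-up markup for illustration only.
sample = '<source src="//example.org/transcoded/Ru-slovo.ogg/Ru-slovo.ogg.mp3" type="audio/mpeg">'

match = MP3_SRC.search(sample)
if match:
    # The captured URL is protocol-relative, so add a scheme before downloading.
    url = "https:" + match.group(1)
    print(url)
```

Note that .* is greedy, so on a line containing several src attributes the capture may span more than one of them.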

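How to identify Russian letters in a string? The short answer is: [А-Яа-яЁё] but depending on your regex flavor, [\p{Cyrillic}] might work. What in the world does this regex mean? It’s just like [A-Za-z] with a twist. The Ёё at the end adds support for ё (“yo”), which is a Cyrillic letter but sits outside the contiguous А-я range in Unicode (Ё is U+0401 and ё is U+0451, while А through я covers U+0410 to U+044F), so the two ranges alone would miss it.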

See this question on Stack Overflow.
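Here is a minimal sketch of the character class in Python. The helper name has_russian is hypothetical. Python's built-in re module does not support \p{Cyrillic}, so the explicit ranges are used; the third-party regex module is one option if you want property classes.

```python
import re

# Explicit ranges plus Ё/ё, which fall outside А-я.
RUSSIAN_LETTER = re.compile(r'[А-Яа-яЁё]')

def has_russian(text: str) -> bool:
    return RUSSIAN_LETTER.search(text) is not None

print(has_russian("hello"))   # prints False
print(has_russian("ёж"))      # prints True

# Without Ёё in the class, a lone ё is not matched.
print(re.search(r'[А-Яа-я]', "ё"))   # prints None
```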

A regex to remove Anki's cloze markup

Stripping Russian syllabic stress marks in Python

Regex to match a cloze

Sunday, September 16, 2018

Detecting Russian letters with regex