Posts
Hedghog and interlinear lemmas
Splitting text into sentences: Russian edition
What's up with Pinboard? And an alternative
My favourite Cyrillic font
The cancellation of Russian music
Bash variable scope and pipelines
I alluded to this nuance involving variable scope in my post on automating PDF processing, but I wanted to expand on it a bit.
Consider this little snippet:
i=0
printf "foo:bar:baz:quux" | grep -o '[^:]\+' | while read -r line ; do
    printf "Inner scope: %d - %s\n" "$i" "$line"
    ((i++))
    [ "$i" -eq 3 ] && break
done
printf "====\nOuter scope\ni = %d\n" "$i"
If you run this as a script, rather than interactively in the shell, what will i be in the outer scope? And why?
Automating the handling of bank and financial statements
Bulk rename tags in DEVONthink 3
topic_, but it was really an unnecessary top-level complication. So, the first item on my to-do list was to get rid of all the tags with a topic_ first level.
Stripping Russian syllabic stress marks in Python
I have written previously about stripping syllabic stress marks from Russian text using a Perl-based regex tool. But I needed a means of doing it solely in Python, so this just extends that idea.
#!/usr/bin/env python3
def strip_stress_marks(text: str) -> str:
    b = text.encode('utf-8')
    # correct error where latin accented ó is used
    b = b.replace(b'\xc3\xb3', b'\xd0\xbe')
    # correct error where latin accented á is used
    b = b.replace(b'\xc3\xa1', b'\xd0\xb0')
    # correct error where latin accented é is used (U+00E9 is \xc3\xa9 in UTF-8)
    b = b.replace(b'\xc3\xa9', b'\xd0\xb5')
    # correct error where latin accented ý is used
    b = b.replace(b'\xc3\xbd', b'\xd1\x83')
    # remove combining diacritical mark
    b = b.replace(b'\xcc\x81', b'')
    return b.decode('utf-8')
text = "Том столкну́л Мэри с трампли́на для прыжко́в в во́ду."
print(strip_stress_marks(text))
# prints "Том столкнул Мэри с трамплина для прыжков в воду."
The approach is similar to the Perl-based tool we constructed before, but this time we are working on the bytes object after encoding as UTF-8. Since the bytes object has a replace method, we can use that to do all of the work. The first four replacements all deal with edge cases where accented Latin characters are used to show the placement of syllabic stress instead of the Cyrillic character plus the combining diacritical mark. In these cases, we just need to substitute the proper Cyrillic character. Then we strip out the “combining acute accent”, U+0301, which is \xcc\x81 in UTF-8. After these replacements, we decode the bytes object back to a str.
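For comparison, the same substitutions can probably be done without leaving str at all. The following is only a sketch of that variant, not the method described above: the names STRESS_TABLE and strip_stress_marks_str are hypothetical, and it assumes the input uses the same four accented Latin letters plus the combining acute accent U+0301.

```python
#!/usr/bin/env python3

# Hypothetical translation table (an assumption, not from the original post):
# each accented Latin letter maps to its Cyrillic look-alike, and the
# combining acute accent maps to None, which deletes it.
STRESS_TABLE = str.maketrans({
    "\u00f3": "\u043e",  # Latin ó -> Cyrillic о
    "\u00e1": "\u0430",  # Latin á -> Cyrillic а
    "\u00e9": "\u0435",  # Latin é -> Cyrillic е
    "\u00fd": "\u0443",  # Latin ý -> Cyrillic у
    "\u0301": None,      # combining acute accent: remove
})


def strip_stress_marks_str(text: str) -> str:
    """Strip stress marks using str.translate instead of bytes.replace."""
    return text.translate(STRESS_TABLE)


if __name__ == "__main__":
    sample = "столкну\u0301л с трампли\u0301на в во\u0301ду"
    print(strip_stress_marks_str(sample))
```

Because str.translate works on code points, there is no encode and decode step, and the table doubles as documentation of which characters are handled. The byte-level version remains the closer analogue of the Perl tool.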