Posts
The cancellation of Russian music
Bash variable scope and pipelines
I alluded to this nuance involving variable scope in my post on automating PDF processing, but I wanted to expand on it a bit.
Consider this little snippet:
i=0
printf "foo:bar:baz:quux" | grep -o '[^:]\+' | while read -r line ; do
    printf "Inner scope: %d - %s\n" $i $line
    ((i++))
    [ $i -eq 3 ] && break;
done
printf "====\nOuter scope\ni = %d\n" $i;

If you run this as a script, rather than interactively in the shell, what will i be in the outer scope? And why?
Automating the handling of bank and financial statements
Bulk rename tags in DEVONthink 3
topic_, but it was really an unnecessary top-level complication. So, the first item on my to-do list was to get rid of all the tags with a topic_ first level.
Stripping Russian syllabic stress marks in Python
I have written previously about stripping syllabic stress marks from Russian text using a Perl-based regex tool. But I needed a means of doing it solely in Python, so this just extends that idea.
#!/usr/bin/env python3
def strip_stress_marks(text: str) -> str:
    b = text.encode('utf-8')
    # correct error where latin accented ó is used
    b = b.replace(b'\xc3\xb3', b'\xd0\xbe')
    # correct error where latin accented á is used
    b = b.replace(b'\xc3\xa1', b'\xd0\xb0')
    # correct error where latin accented é is used (é is \xc3\xa9 in UTF-8)
    b = b.replace(b'\xc3\xa9', b'\xd0\xb5')
    # correct error where latin accented ý is used
    b = b.replace(b'\xc3\xbd', b'\xd1\x83')
    # remove combining diacritical mark
    b = b.replace(b'\xcc\x81', b'').decode()
    return b

text = "Том столкну́л Мэри с трампли́на для прыжко́в в во́ду."
print(strip_stress_marks(text))
# prints "Том столкнул Мэри с трамплина для прыжков в воду."

The approach is similar to the Perl-based tool we constructed before, but this time we are working on the bytes object after encoding as UTF-8. Since the bytes object has a replace method, we can use that to do all of the work. The first 4 replacements all deal with edge cases where accented Latin characters are used to show the placement of syllabic stress instead of the Cyrillic character plus the combining diacritical mark. In these cases, we just need to substitute the proper Cyrillic character. Then we just strip out the “combining acute accent” U+0301 → \xcc\x81 in UTF-8. After these replacements, we just decode the bytes object back to a str.
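Hand-typed byte literals are easy to get wrong, so one possible variation is to let Python derive the byte pairs from the characters themselves. This is only a sketch of the same idea, not a replacement for the function above; the names LOOKALIKES, COMBINING_ACUTE and strip_stress_marks_table are made up for illustration.

```python
# Hypothetical table: Latin accented letter -> Cyrillic look-alike.
LOOKALIKES = {
    "\u00f3": "\u043e",  # ó -> о
    "\u00e1": "\u0430",  # á -> а
    "\u00e9": "\u0435",  # é -> е
    "\u00fd": "\u0443",  # ý -> у
}
COMBINING_ACUTE = "\u0301"


def strip_stress_marks_table(text: str) -> str:
    """Same byte-level replacements, with the bytes derived from the table."""
    data = text.encode("utf-8")
    for latin, cyrillic in LOOKALIKES.items():
        data = data.replace(latin.encode("utf-8"), cyrillic.encode("utf-8"))
    # Drop the combining acute accent (b'\xcc\x81' in UTF-8).
    data = data.replace(COMBINING_ACUTE.encode("utf-8"), b"")
    return data.decode("utf-8")


if __name__ == "__main__":
    # Show the byte pairs so they can be compared with hand-written literals.
    for latin, cyrillic in LOOKALIKES.items():
        print(latin.encode("utf-8"), "->", cyrillic.encode("utf-8"))
    text = "Том столкну\u0301л Мэри с трампли\u0301на для прыжко\u0301в в во\u0301ду."
    print(strip_stress_marks_table(text))
    # prints "Том столкнул Мэри с трамплина для прыжков в воду."
```

Printing the pairs makes it easy to confirm, for example, that é encodes to b'\xc3\xa9' and the combining acute accent to b'\xcc\x81'.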
Accessing Anki collection models from Python
For one-off projects that target Anki collections, I often use Python in a standalone application rather than an Anki add-on. Since I’m not going to distribute these little creations that are specific to my own needs, there’s no reason to create an add-on. These are just a few notes - nothing comprehensive - on the process.
One thing to be aware of is that the version of the Python anki module must match the version of the Anki desktop application exactly. If you are running Anki 2.1.48 on the desktop but have the Python module built for 2.1.49, it will not work. This is a huge irritation and there’s no backwards compatibility; the versions must match precisely.
Converting Cyrillic UTF-8 text encoded as Latin-1
This may be obvious to some, but recognizing a character encoding at a glance is not always easy.
For example, pronunciation files downloaded from Forvo have the following appearance:
pronunciation_ru_оÑбÑвание.mp3
How can we extract the actual word from this gibberish? Optimally, the filename should reflect the actual word uttered in the pronunciation file, after all.
Step 1 - Extracting the interesting bits
The gibberish begins after the pronunciation_ru_ and ends before the file extension. Any regex tool can tease that out.
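As a rough sketch of that first step in Python: the pattern below assumes filenames always look like pronunciation_ru_<word>.mp3, and the helper names (FORVO_NAME, extract_stem, fix_mojibake) are hypothetical. The repair function further assumes, as the title suggests, that the gibberish is UTF-8 bytes that were decoded as Latin-1. The demo builds its own garbled name from a known word rather than using a real Forvo file.

```python
import re

# Assumed filename layout: pronunciation_ru_<word>.mp3
FORVO_NAME = re.compile(r"^pronunciation_ru_(?P<word>.+)\.mp3$")


def extract_stem(filename: str) -> str:
    """Return the part between the prefix and the file extension."""
    match = FORVO_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return match.group("word")


def fix_mojibake(garbled: str) -> str:
    """Undo a UTF-8 -> Latin-1 misreading by reversing the wrong decode."""
    return garbled.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    # Simulate the damage with a known word, then repair it.
    garbled = "вода".encode("utf-8").decode("latin-1")
    filename = f"pronunciation_ru_{garbled}.mp3"
    print(fix_mojibake(extract_stem(filename)))
    # prints "вода"
```

If the stem contains characters outside Latin-1, the encode step raises UnicodeEncodeError, which is a useful signal that the text was damaged in some other way.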
accentchar: a command-line utility to apply Russian stress marks
I’ve written a lot about applying and removing syllabic stress marks in Russian text because I use them often when making Anki cards.
This iteration is a command line tool for applying the stress mark at a particular character index. The advantage of these little shell tools is that they can be composable, integrating into different tools as the need arises.
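The core operation is small: place the combining acute accent (U+0301) right after the character at the given index. A minimal Python sketch of just that step, separate from the shell tool itself and with a made-up function name, might look like this:

```python
COMBINING_ACUTE = "\u0301"


def accent_char(word: str, index: int) -> str:
    """Insert a combining acute accent after the character at `index` (0-based)."""
    if not 0 <= index < len(word):
        raise IndexError(f"index {index} out of range for {word!r}")
    return word[: index + 1] + COMBINING_ACUTE + word[index + 1 :]


if __name__ == "__main__":
    # Stress the final vowel of "вода"; the result is one code point longer.
    stressed = accent_char("вода", 3)
    print(stressed, len(stressed))
```

The index counts Unicode code points, so a word that already carries combining marks would need its indices adjusted accordingly.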
#!/usr/local/bin/zsh
while getopts i:w: flag
do
    case "${flag}" in
        i) index=${OPTARG};;
        w) word=${OPTARG};;
    esac
done
if [ $word ]; then
    temp=$word
else
    read temp
fi
outword=""
for (( i=0; i<${#temp}; i++ )); do
    thischar="${temp:$i:1}"
    if [ $i -eq $index ]; then
        thischar=$(echo $thischar | perl -C -pe 's/(.)/\1\x{301}/g;')
    fi
    outword="$outword$thischar"
done
echo $outword

We can use it in a couple different ways. For example, we can provide all of the arguments in a declarative way:
sterilize-ng: a command-line URL sterilizer
Introducing sterilize-ng [GitHub link] - a URL sterilizer made to work flexibly on the command line.
Background
The surveillance capitalist economy is built on the relentless tracking of users. Imagine going about town running errands but everywhere you go, someone is quietly following you. When you pop into the grocery, they examine your receipt. They look into the bags to see what you bought. Then they hop in the car with you and keep careful records of where you go, how fast you drive, whom you talk with on the phone. This is surveillance capitalism - the relentless “digital exhaust” left by our actions online.