Cli
Converting Cyrillic UTF-8 text encoded as Latin-1
This may be obvious to some, but recognizing a character encoding at a glance is not always easy.
For example, pronunciation files downloaded from Forvo have the following appearance:
pronunciation_ru_оÑбÑвание.mp3
How can we extract the actual word from this gibberish? Ideally, the filename should reflect the actual word uttered in the pronunciation file, after all.
Step 1 - Extracting the interesting bits
The gibberish begins after the pronunciation_ru_ and ends before the file extension. Any regex tool can tease that out.
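As a minimal sketch of that extraction, here is one way to do it with `sed`. The function name `extract_word` is hypothetical, and the pattern assumes the filename always starts with `pronunciation_ru_` and ends in a single extension; the sample filename is a plain-ASCII placeholder, not a real download.

```shell
#!/bin/sh
# Hypothetical helper: print the part of a Forvo-style filename that sits
# between the fixed "pronunciation_ru_" prefix and the file extension.
extract_word() {
    # $1 is a bare filename such as pronunciation_ru_WORD.mp3
    printf '%s\n' "$1" | sed -E 's/^pronunciation_ru_(.*)\.[^.]+$/\1/'
}

# Placeholder filename for illustration only.
extract_word "pronunciation_ru_example.mp3"
```

The capture group keeps everything between the prefix and the final dot, so the mis-encoded word passes through untouched and can be handed to whatever conversion step comes next.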
accentchar: a command-line utility to apply Russian stress marks
I’ve written a lot about applying and removing syllabic stress marks in Russian text because I use them a lot when making Anki cards.
This iteration is a command-line tool for applying the stress mark at a particular character index. The advantage of these little shell tools is that they are composable, integrating into different workflows as the need arises.
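#!/usr/local/bin/zsh
# Add a combining acute accent (U+0301) after the character at a given index.
#   -i INDEX  zero-based index of the character to stress
#   -w WORD   word to mark; if omitted, the word is read from stdin

while getopts i:w: flag
do
    case "${flag}" in
        i) index=${OPTARG};;
        w) word=${OPTARG};;
    esac
done

# Use the -w argument if given; otherwise read the word from stdin.
if [ -n "$word" ]; then
    temp=$word
else
    read -r temp
fi

outword=""
for (( i=0; i<${#temp}; i++ )); do
    thischar="${temp:$i:1}"
    if [ "$i" -eq "$index" ]; then
        # Append the combining accent to the stressed character.
        thischar=$(echo "$thischar" | perl -C -pe 's/(.)/$1\x{301}/')
    fi
    outword="$outword$thischar"
done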
echo "$outword"
We can use it in a couple of different ways. For example, we can provide all of the arguments in a declarative way: