Interlinear glossing: dealing with punctuation
Still thinking about interlinear glossing for my language learning project. The leipzig.js library is great, but my use case isn't really what the author had in mind. I really just need to display a unit consisting of the word as it appears in the text, the lemma for that word form, and (possibly) the part of speech. For academic linguistics purposes, what I have in mind is completely non-standard.
The other issue with leipzig.js for my use case is that I need to be able to respond to click events on individual words so that they can be tagged, defined, or otherwise worked with. It's not straightforward how I could apply id attributes to word-level elements to support that functionality.
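To make the idea concrete, here is a minimal Python sketch of such a unit and of one way to emit word-level markup with id attributes that a click handler could target. This is purely illustrative; all names are hypothetical, and it is not how leipzig.js actually works.
#!/usr/bin/env python3
# Hypothetical sketch: a gloss unit (surface form, lemma, optional POS)
# rendered as HTML spans whose id attributes make each word addressable
# from a JavaScript click handler.
from dataclasses import dataclass
from html import escape
from typing import Optional

@dataclass
class GlossUnit:
    surface: str               # the word as it appears in the text
    lemma: str                 # the dictionary form
    pos: Optional[str] = None  # part of speech, if known

def render(units: list) -> str:
    spans = []
    for n, u in enumerate(units):
        pos = f"<span class='pos'>{escape(u.pos)}</span>" if u.pos else ""
        spans.append(
            f"<span class='gloss' id='w{n}'>"
            f"<span class='surface'>{escape(u.surface)}</span>"
            f"<span class='lemma'>{escape(u.lemma)}</span>{pos}</span>"
        )
    return "\n".join(spans)

print(render([GlossUnit("столкну́л", "столкнуть", "VERB")]))
A single delegated click listener on the containing element can then map event.target back to the id of the word that was clicked.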
I alluded to this nuance involving variable scope in my post on automating PDF processing, but I wanted to expand on it a bit.
Consider this little snippet:
i=0
printf "foo:bar:baz:quux" | grep -o '[^:]\+' | while read -r line ; do
    printf "Inner scope: %d - %s\n" "$i" "$line"
    ((i++))
    [ "$i" -eq 3 ] && break
done
printf "====\nOuter scope\ni = %d\n" "$i"
If you run this - not interactively in the shell, but as a script - what will i be in the outer scope? And why? The answer is that i will still be 0: each segment of a bash pipeline runs in its own subshell, so the while loop increments a copy of i that vanishes when the loop finishes.
…topic_, but it was really an unnecessary top-level complication. So, the first item on my to-do list was to get rid of all the tags with a topic_ first level.
I have written previously about stripping syllabic stress marks from Russian text using a Perl-based regex tool. But I needed a way of doing it solely in Python, so this just extends that idea.
#!/usr/bin/env python3
def strip_stress_marks(text: str) -> str:
    b = text.encode('utf-8')
    # correct error where Latin accented ó is used
    b = b.replace(b'\xc3\xb3', b'\xd0\xbe')
    # correct error where Latin accented á is used
    b = b.replace(b'\xc3\xa1', b'\xd0\xb0')
    # correct error where Latin accented é is used
    b = b.replace(b'\xc3\xa9', b'\xd0\xb5')
    # correct error where Latin accented ý is used
    b = b.replace(b'\xc3\xbd', b'\xd1\x83')
    # remove the combining diacritical mark (U+0301)
    b = b.replace(b'\xcc\x81', b'').decode()
    return b
text = "Том столкну́л Мэри с трампли́на для прыжко́в в во́ду."
print(strip_stress_marks(text))
# prints "Том столкнул Мэри с трамплина для прыжков в воду."
The approach is similar to the Perl-based tool we constructed before, but this time we are working on the bytes object after encoding the string as UTF-8. Since the bytes object has a replace method, we can use it to do all of the work. The first four replacements deal with edge cases where an accented Latin character is used to show the placement of syllabic stress instead of the Cyrillic character plus the combining diacritical mark. In these cases, we just need to substitute the proper Cyrillic character. Then we strip out the “combining acute accent” U+0301 → \xcc\x81 in UTF-8. After these replacements, we just decode the bytes object back to a str.
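As an aside, the same fixes can be expressed at the str level without the bytes round-trip. This is just an alternative sketch of the same idea, not the tool described above:
# alternative sketch: map the stray accented Latin vowels to their
# Cyrillic counterparts, then drop the combining acute accent (U+0301)
FIXES = str.maketrans({'ó': 'о', 'á': 'а', 'é': 'е', 'ý': 'у'})

def strip_stress_marks_str(text: str) -> str:
    return text.translate(FIXES).replace('\u0301', '')
Keeping the substitutions in a single translation table makes it easy to extend if other stray Latin characters turn up.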
For one-off projects that target Anki collections, I often use Python in a standalone application rather than an Anki add-on. Since I’m not going to distribute these little creations that are specific to my own needs, there’s no reason to create an add-on. These are just a few notes - nothing comprehensive - on the process.
One thing to be aware of is that there must be a perfect match between the Anki major and minor version numbers for the Python anki module to work. If you are running Anki 2.1.48 as your desktop application but have the Python module built for 2.1.49, it will not work. This is a huge irritation and there's no backwards compatibility; the versions must match precisely.
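For reference, here is a minimal sketch of what such a standalone script can look like. The collection path is hypothetical, and the calls shown (Collection, find_notes) are as in recent 2.1.x builds of the anki module, so adjust for your version:
#!/usr/bin/env python3
# standalone script, not an add-on; assumes the pip-installed anki
# package exactly matches the desktop version, and that the desktop
# app is closed so the collection file isn't locked
from anki.collection import Collection

# hypothetical path; adjust for your own profile
COLLECTION_PATH = "/home/user/.local/share/Anki2/User 1/collection.anki2"

col = Collection(COLLECTION_PATH)
try:
    # find_notes takes the same search syntax as the Anki browser
    note_ids = col.find_notes("deck:Russian")
    print(f"{len(note_ids)} matching notes")
finally:
    col.close()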