Python

Thursday, April 17, 2025

vim: Jump to specific character on a line

In Vim, to jump to a specific character on a line, you can use the following commands:

  • f{char} - Jump to the next occurrence of {char} on the current line
  • F{char} - Jump to the previous occurrence of {char} on the current line
  • t{char} - Jump until (one position before) the next occurrence of {char}
  • T{char} - Jump until (one position after) the previous occurrence of {char}

For your specific example of “go to first #”:
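
For example, pressing 0 to move to the start of the line and then f# jumps to the first # on the line. From there, ; repeats the motion to reach later occurrences and , repeats it in the opposite direction.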

An API (sort of) for adding links to ArchiveBox

I use ArchiveBox extensively to save web content that might change or disappear. While a REST API is apparently coming eventually, it doesn’t appear to have been merged into the main branch yet. So I cobbled together a little application to archive links via a POST request. It takes advantage of the archivebox command line interface. If you are impatient, you can skip to the full source code. Otherwise I’ll describe my setup to provide some context.
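
Here is a minimal sketch of the idea (not the full application; the path and port are placeholders): a tiny Flask app that accepts a POST containing a URL and shells out to archivebox add inside the data directory.

#!/usr/bin/env python3
# Minimal sketch: POST a URL to /add and hand it to the archivebox CLI.
# Assumes Flask is installed, archivebox is on the PATH, and ARCHIVEBOX_DIR
# points at an initialized ArchiveBox data directory (placeholder below).
import subprocess
from flask import Flask, request, jsonify

ARCHIVEBOX_DIR = "/path/to/archivebox/data"

app = Flask(__name__)

@app.route("/add", methods=["POST"])
def add_link():
    url = request.form.get("url") or (request.get_json(silent=True) or {}).get("url")
    if not url:
        return jsonify({"error": "missing url"}), 400
    # run `archivebox add <url>` in the data directory
    result = subprocess.run(
        ["archivebox", "add", url],
        cwd=ARCHIVEBOX_DIR,
        capture_output=True,
        text=True,
    )
    ok = result.returncode == 0
    status = 200 if ok else 500
    return jsonify({"ok": ok}), status

if __name__ == "__main__":
    app.run(port=8000)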

Scraping Forvo pronunciations

Most language learners are familiar with Forvo, a site that allows users to download and contribute pronunciations for words and phrases. For my Russian studies, I make daily use of the site. In fact, to facilitate my Anki card-making workflow, I am a paid user of the Forvo API. But that’s where the trouble started.

When the Forvo API works, it works OK, though it is often extremely slow. But lately, it has been down more than up. In an effort to patch my workflow and continue to download Russian word pronunciations, I wrote this little scraper. I’d prefer to use the API, but experience has now shown that the API is slow and unreliable. I’ll keep paying for the API access, because I support what the company does. As often as not, when a company offers a free service, it’s likely to be involved in surveillance capitalism, so I’d rather companies offer a reliable product at a reasonable price.

A tool for scraping definitions of Russian words from Wiktionary

In my perpetual attempt to make my language learning process using Anki more efficient, I’ve written a tool to extract English-language definitions of Russian words from Wiktionary. I wrote about the idea previously in Scraping Russian word definitions from Wiktionary: utility for Anki, but that approach relied on the WiktionaryParser module, which is good but misses some important edge cases. So I rolled up my sleeves and crafted my own solution. As with WiktionaryParser, the heavy lifting is done by the Beautiful Soup parser. Much of the logic of this tool is around detecting the edge cases that I mentioned. For example, the underlying HTML format changes when we’re dealing with a word that has multiple etymologies versus one with a single etymology. Whenever you’re doing web scraping, you have to account for those sorts of variations.
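
As a rough illustration of that edge-case detection (a sketch, not the tool itself, and it assumes the classic Wiktionary markup where section headings carry a mw-headline span), counting the etymology headings tells you which HTML shape you are dealing with:

#!/usr/bin/env python3
# Sketch: detect whether a Russian entry has one etymology section or several.
# Assumes requests and beautifulsoup4 are installed and that headings are
# rendered as <span class="mw-headline">Etymology</span> or "Etymology 1", etc.
import requests
from bs4 import BeautifulSoup

def etymology_headings(word: str) -> list:
    url = f"https://en.wiktionary.org/wiki/{word}"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return [
        span.get_text()
        for span in soup.find_all("span", class_="mw-headline")
        if span.get_text().startswith("Etymology")
    ]

# "замок" (castle / lock) has more than one etymology, so the parsing logic
# has to walk a different HTML structure than for single-etymology words
print(etymology_headings("замок"))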

Stripping Russian syllabic stress marks in Python

I have written previously about stripping syllabic stress marks from Russian text using a Perl-based regex tool. But I needed a means of doing it solely in Python, so this just extends that idea.

#!/usr/bin/env python3

def strip_stress_marks(text: str) -> str:
    b = text.encode('utf-8')
    # correct error where Latin accented ó is used
    b = b.replace(b'\xc3\xb3', b'\xd0\xbe')
    # correct error where Latin accented á is used
    b = b.replace(b'\xc3\xa1', b'\xd0\xb0')
    # correct error where Latin accented é is used
    b = b.replace(b'\xc3\xa9', b'\xd0\xb5')
    # correct error where Latin accented ý is used
    b = b.replace(b'\xc3\xbd', b'\xd1\x83')
    # remove the combining acute accent (U+0301) and decode back to str
    b = b.replace(b'\xcc\x81', b'').decode()
    return b

text = "Том столкну́л Мэри с трампли́на для прыжко́в в во́ду."

print(strip_stress_marks(text))
# prints "Том столкнул Мэри с трамплина для прыжков в воду."

The approach is similar to the Perl-based tool we constructed before, but this time we are working on the bytes object after encoding as UTF-8. Since the bytes object has a replace method, we can use that to do all of the work. The first 4 replacements all deal with edge cases where accented Latin characters are used to show the placement of syllabic stress instead of the Cyrillic character plus the combining diacritical mark. In these cases, we just need to substitute the proper Cyrillic character. Then we just strip out the “combining acute accent” U+0301 (\xcc\x81 in UTF-8). After these replacements, we just decode the bytes object back to a str.

Accessing Anki collection models from Python

For one-off projects that target Anki collections, I often use Python in a standalone application rather than an Anki add-on. Since I’m not going to distribute these little creations that are specific to my own needs, there’s no reason to create an add-on. These are just a few notes - nothing comprehensive - on the process.

One thing to be aware of is that the version of the Python anki module must match the desktop Anki version exactly. If you are running Anki 2.1.48 on your desktop application but have the Python module built for 2.1.49, it will not work. This is a huge irritation and there’s no backwards compatibility; the versions must match precisely.
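
With the versions matched, opening the collection from a standalone script looks roughly like this (a minimal sketch; the collection path is a placeholder, and Anki itself must be closed because the desktop app locks the collection while it is open):

#!/usr/bin/env python3
# Sketch: open an Anki collection directly and list its note types ("models").
# COLLECTION_PATH is a placeholder; adjust it to your own profile directory.
from anki.collection import Collection

COLLECTION_PATH = "/path/to/Anki2/User 1/collection.anki2"

col = Collection(COLLECTION_PATH)
for model in col.models.all():
    print(model["name"])
col.close()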

Generating HTML from Markdown in Anki fields

I write in Markdown because it’s much easier to keep the flow of writing going without taking my hands off the keyboard.

I also like to write content in Anki cards in Markdown. Over the years there have been various ways of supporting this through add-ons:

  • The venerable Power Format Pack was great but no longer supports Anki 2.1, so it became useless.
  • Auto Markdown worked for a while but as of Anki version 2.1.41 does not.
  • After Auto Markdown stopped working, I installed the supposed fix Auto Markdown - fix version but that didn’t work either.
  • It’s possible that the Mini Format Pack will work, but honestly I’m tired of the constant break-fix-break-fix cycle with Anki.

The problem

The real problem with Markdown add-ons for Anki is the same as with every other add-on: they are all hanging by a thread. Almost every minor point upgrade of Anki breaks at least one of my add-ons. It’s nearly impossible to determine in advance whether an Anki upgrade is going to break some key functionality that I rely on. And add-on developers, even prominent and prolific ones, come and go when they get busy, distracted, or lose interest. It’s one of the most frustrating parts of using Anki.
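
One way around the add-on churn (a sketch of the standalone approach, not any of the add-ons above) is to do the conversion outside Anki with the third-party markdown package, since Anki fields ultimately store HTML:

#!/usr/bin/env python3
# Sketch: convert Markdown-authored text into the HTML that an Anki field stores.
# Assumes the `markdown` package is installed (pip install markdown).
import markdown

def field_to_html(md_text: str) -> str:
    return markdown.markdown(md_text)

print(field_to_html("**Вода́** - water"))
# prints "<p><strong>Вода́</strong> - water</p>"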

Parsing Russian Wiktionary content using XPath

As readers of this blog know, I’m an avid user of Anki to learn Russian. I have a number of sources for reference content that go onto my Anki cards. Notably, I use Wiktionary to get word definitions and the word with the proper syllabic stress marked. (This is an aid to pronunciation for Russian language learners.)

Since I’m lazy to the core, I came up with a systematic way of grabbing the stress-marked word from the Wiktionary page using lxml and XPath.
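
The core of it looks something like this (a sketch, assuming the current en.wiktionary.org markup where the stress-marked headword sits in a strong element with class "Cyrl headword" and lang="ru"; if the markup changes, the XPath has to change with it):

#!/usr/bin/env python3
# Sketch: grab the stress-marked Russian headword from a Wiktionary entry.
# Assumes requests and lxml are installed; the XPath reflects my reading of
# the current page markup and is not guaranteed to be stable.
import requests
from lxml import html

def stressed_headword(word: str):
    page = requests.get(f"https://en.wiktionary.org/wiki/{word}")
    tree = html.fromstring(page.content)
    matches = tree.xpath('//strong[@class="Cyrl headword"][@lang="ru"]/text()')
    return matches[0] if matches else None

print(stressed_headword("вода"))  # ideally prints the stress-marked form вода́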

Wednesday, January 27, 2021

W3schools.com has a CSS library that’s quite nice. I often use Bootstrap, but I like some of the visual features here better. For example, I like their tags because they have more flexible use of colour.


If you want to fetch a value from a Python dictionary but need a default when the key is missing, this is how you do it:

upos_badge = {'noun': 'lime', 'verb': 'amber', 'adv': 'blue'}
badge_class_postfix = upos_badge.get(value.lower(), 'light-grey')

I recently learned about DeepL as an alternative to Google Translate. It seems really good.