Programming

Most language learners are familiar with Forvo, a site that allows users to download and contribute pronunciations for words and phrases. For my Russian studies, I make daily use of the site. In fact, to facilitate my Anki card-making workflow, I am a paid user of the Forvo API. But that’s where the trouble started.

When the Forvo API works, it works OK, though it is often extremely slow. But lately, it has been down more than up. In an effort to patch my workflow and continue to download Russian word pronunciations, I wrote this little scraper. I’d prefer to use the API, but experience has now shown that it is slow and unreliable. I’ll keep paying for API access, because I support what the company does. As often as not, when a company offers a free service, it’s likely to be involved in surveillance capitalism, so I’d rather companies offer a reliable product at a reasonable price.

Recently, someone asked a question on r/Anki about changing an existing cloze-type note to a regular note. Part of the solution involves stripping the cloze markup from the existing cloze’d field. A cloze sentence has the form Play {{c1::stupid}} games. or Play {{c1::stupid::pejorative adj}} games.

To handle both of these cases, the following regular expression will work. Just replace each match with $1, the first capture group.

\{\{c\d::([^:\}]+)(?:::+[^\}]*)*\}\}
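As a minimal sketch of applying this pattern from the command line, the function below translates it into the POSIX extended syntax that sed -E understands: [0-9] stands in for \d, and a plain group replaces (?:...). The function name strip_cloze is hypothetical, and the sketch assumes the standard curly-brace markup only.

```shell
#!/bin/bash

# strip_cloze (hypothetical name): read text on stdin and write it back
# with each standard Anki cloze replaced by the cloze text itself.
# \1 is sed's spelling of the $1 capture group.
strip_cloze() {
    sed -E 's/\{\{c[0-9]::([^:}]+)(::+[^}]*)*\}\}/\1/g'
}

# Both forms of the example sentence, with and without a hint
printf '%s\n' 'Play {{c1::stupid}} games.' | strip_cloze
printf '%s\n' 'Play {{c1::stupid::pejorative adj}} games.' | strip_cloze
```

Both lines should come out as Play stupid games.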

However, the Cloze Anything markup is different. It uses ( and ) instead of curly braces. If we want to flexibly remove both the standard and Cloze Anything markup, then our pattern would look like:

I was using a REST API at https://textance.herokuapp.com/title but it seems awfully fragile. Sure enough, this morning the entire application is down. It’s also not open-source and I have no idea who actually runs this thing.

Here’s the solution:

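#!/bin/bash

# Read the URL from the macOS clipboard
url=$(pbpaste)
# Fetch the page silently and print the content of its og:title meta tag
curl -s "$url" | pup 'meta[property=og:title] attr{content}'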

It does require pup. On macOS, you can install it via brew install pup.

There are other ways that use regular expressions and avoid the dependency on pup, but parsing HTML with regex is not such a good idea.

Making the Hugo → S3 upload process much more efficient by tracking file hashes.
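To make the idea of tracking file hashes concrete, here is a rough sketch, not the method from the post itself. It assumes the generated site lives in a directory such as public and that a manifest of hashes from the previous run is kept in a file. The helper names hash_tree and changed_files are hypothetical. Only the paths it prints would need to be uploaded.

```shell
#!/bin/bash

# hash_tree (hypothetical): print a sorted "hash  path" line for every file
# under a directory, using sha256sum where available and shasum otherwise.
hash_tree() {
    if command -v sha256sum >/dev/null 2>&1; then
        find "$1" -type f -exec sha256sum {} + | LC_ALL=C sort
    else
        find "$1" -type f -exec shasum -a 256 {} + | LC_ALL=C sort
    fi
}

# changed_files (hypothetical): print the paths that are new or modified
# since the manifest was last written, then update the manifest.
changed_files() {
    local site_dir=$1 manifest=$2 current
    current=$(mktemp)
    hash_tree "$site_dir" > "$current"
    touch "$manifest"
    # Lines found only in the fresh listing are new or changed files;
    # strip the leading hash so only the path remains.
    LC_ALL=C comm -13 "$manifest" "$current" | sed -E 's/^[0-9a-f]+ [ *]//'
    mv "$current" "$manifest"
}
```

A call such as changed_files public .upload-manifest (both names are placeholders) lists every file on the first run and only the changed ones afterwards. Deleted files are not reported in this sketch.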

Dealing with PUNCT nodes in interlinear glossing.

Still thinking about interlinear glossing for my language learning project. The leipzig.js library is great but my use case isn’t really what the author had in mind. I really just need to display a unit consisting of the word as it appears in the text, the lemma for that word form, and (possibly) the part of speech. For academic linguistics purposes, what I have in mind is completely non-standard.

The other issue with leipzig.js for my use case is that I need to be able to respond to click events on individual words so that they can be tagged, defined or otherwise worked with. It’s straightforward how I could apply CSS id attributes to word-level elements to support that functionality.

leipzig.js is a library for applying interlinear gloss to texts for linguistic analysis. In this post, I experiment a little with this library to evaluate whether it would work for a little project of mine.

Starting a new devlog about Hedghog, a new language learning app and some thoughts about the interlinear display of lemmas.

Splitting text into sentences is one of those tasks that looks simple but on closer inspection is more difficult than you think. A common approach is to use regular expressions to divide up the text on punctuation marks. But without adding layers of complexity, that method fails on some sentences. This is a method using spaCy.

I alluded to this nuance involving variable scope in my post on automating pdf processing, but I wanted to expand on it a bit.

Consider this little snippet:

i=0
printf "foo:bar:baz:quux" | grep -o '[^:]\+' | while read -r line ; do
   printf "Inner scope: %d - %s\n" $i $line
   ((i++))
   [ $i -eq 3 ] && break;
done
printf "====\nOuter scope\ni = %d\n" $i;

If you run this as a script, rather than interactively in the shell, what will i be in the outer scope? And why?
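For comparison, here is a sketch of one common variant of the snippet above. Under bash's default settings each command in a pipeline runs in a subshell, so the piped loop increments its own copy of i and the outer scope still reports i = 0. Feeding the loop through process substitution keeps the while loop in the current shell. This is a minimal illustration, not necessarily the fix the full post settles on.

```shell
#!/bin/bash

i=0
# The loop now reads from a process substitution instead of sitting on the
# right-hand side of a pipe, so it runs in the current shell and its
# changes to i persist after the loop ends.
while read -r line; do
    printf "Inner scope: %d - %s\n" "$i" "$line"
    ((i++))
    [ "$i" -eq 3 ] && break
done < <(printf "foo:bar:baz:quux" | grep -o '[^:]\+')
printf "====\nOuter scope\ni = %d\n" "$i"
```

Run as a script, this version reports i = 3 in the outer scope.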

Programming

Scraping Forvo pronunciations

A regex to remove Anki's cloze markup

Extracting the title of a web page from the command line

Hugo static site upload woes and a way forward

Interlinear glossing dealing with punctuation

Three-line (though non-standard) interlinear glossing

Experimenting with leipzig.js for interlinear gloss

Hedghog and interlinear lemmas

Splitting text into sentences: Russian edition

Bash variable scope and pipelines