Experimenting with leipzig.js for interlinear gloss

One of the key features of my language learning app Hedghog is the display of source text with interlinear gloss. This is of huge benefit in understanding highly inflected languages. Right now I’m playing around with different ways of achieving this sort of display. I stumbled on leipzig.js, a library for formatting interlinear gloss according to the Leipzig Glossing Rules.

I like what I see, but my first inclination is to get under the hood and fix some of the CSS. For example, the original text is displayed in italic. This is fine, and it may be the convention in linguistics circles, but some Russian letters are a little confusing to Russian learners when displayed in oblique type. It’s not difficult to fix.

Here’s what it looks like:

I just needed to apply some of my own CSS to achieve the desired appearance - Leipzig Rules or not.

.gloss__line--0 {
    font-family: "Georgia";
    font-size: 20px;
}

.gloss__line--1 {
    color: gray;
}

.gloss__word .gloss__line:first-child {
    font-style: normal !important;
}

And the minimal example in Russian:

<html>

  <head>
    <link rel="stylesheet" href="//cdn.jsdelivr.net/npm/leipzig@latest/dist/leipzig.min.css">
  </head>

  <body>
    <div data-gloss>
      <p>Дональд Трамп - нелепый болван, который был избран президентом.</p>
      <p>дональд трамп - нелепый болван который был избрать президент.</p>
      <p>‘Donald Trump is a ridiculous moron who was elected president.’</p>
    </div>
    <script src="//cdn.jsdelivr.net/npm/leipzig@latest/dist/leipzig.min.js"></script>
    <script>
      document.addEventListener('DOMContentLoaded', function() {
        var glosser = Leipzig();
        glosser.gloss();
      });
    </script>
  </body>
</html>

This minimal example as a JSFiddle

More on interlinear gloss

Hedghog and interlinear lemmas

I’ve been working on a side-project for a few weeks that I’m calling “Hedghog.” Here’s the elevator pitch.

This is a tool to aid in learning foreign languages. Adults learn languages best by consuming comprehensible content whose context is relevant to the learner. Reading is one of the ways of acquiring foreign language content, including vocabulary, phraseology and so forth. Hedghog is a tool for acquiring and storing foreign language texts for the purposes of language learning. It helps the user track new words and phrases from these texts and provides translation, lemmatization and tagging features. It also can export lists of new words and phrases to the spaced-repetition program Anki.

One of the features of Hedghog is interlinear display of lemmas. Often, interlinear displays are used to display bilingual text. This is difficult when the word order differs significantly from the first-language word order. I’m also skeptical that this sort of display helps the learner efficiently acquire an understanding of the second-language ways of idiomatic writing. Instead, in Hedghog, the display will show the original text in large type with each word’s lemma beneath. For Russian, this solves one of the slowdowns in reading that I encounter - which is the momentary hesitation in recognizing the inflected form. It’s particularly halting when I run into a participle in an oblique case. The term “interlinear” isn’t exactly right here, but I’m struggling to think of something better. Edit 2022-05-15: The better term is “interlinear gloss”.[1]

It looks something like this:

This is adapted from an approach demonstrated initially for reading classical Greek.

<h3>Interlinear lemmas</h3>
<div class="unit"><p class="ru">В</p><p class="lemma">в</p></div>
<div class="unit"><p class="ru">этом</p><p class="lemma">этот</p></div>
<div class="unit"><p class="ru">контексте</p><p class="lemma">контекст</p></div>
<div class="unit"><p class="ru">комментаторы</p><p class="lemma">комментатор</p></div>
<div class="unit"><p class="comma">,</p></div>
<div class="unit"><p class="ru">журналисты</p><p class="lemma">журналист</p></div>
<div class="unit"><p class="ru">политики</p><p class="lemma">политик</p></div>
<div class="unit"><p class="ru">чувствуют</p><p class="lemma">чувствовать</p></div>
<div class="unit"><p class="ru">себя</p><p class="lemma">себя</p></div>
<div class="unit"><p class="ru">свободными</p><p class="lemma">свободный</p></div>
<div class="unit"><p class="ru">в</p><p class="lemma">в</p></div>
<div class="unit"><p class="ru">бряцании</p><p class="lemma">бряцание</p></div>
<div class="unit"><p class="ru">ядерным</p><p class="lemma">ядерный</p></div>
<div class="unit"><p class="ru">оружием</p><p class="lemma">оружие</p></div>

And the CSS:

div.unit {
  float: left;
  margin-bottom: 1em;
  color: black;
}

div.comma {
    float: left;
    margin-bottom: 1em;
    color: black;
    
}

p.comma {
  font-size: 16pt;
  font-family: serif;
  margin: 0em;
  padding: 0em 0em;
}

p.ru {
  font-size: 16pt;
  font-family: serif;
  margin: 0em;
  padding: 0em 0.5em;
}

p.lemma {
  font-size: 10pt;
  font-family: sans-serif;
  color: gray;
  margin: 0em;
  padding: 0em 1em;
}
h3 {
    font-family: "HelveticaNeue";
}

Here’s a link to a JSFiddle to play around with this.
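
The repetitive div.unit markup above lends itself to generation. Here is a minimal sketch in Python, assuming the (word, lemma) pairs have already been produced by some lemmatizer. The function name render_units and the convention of passing None as the lemma for punctuation are hypothetical choices for illustration, not part of Hedghog.

```python
from html import escape


def render_units(tokens):
    """Render (word, lemma) pairs as div.unit markup.

    A lemma of None marks punctuation, which gets the "comma" class
    and no lemma line beneath it.
    """
    lines = []
    for word, lemma in tokens:
        if lemma is None:
            inner = f'<p class="comma">{escape(word)}</p>'
        else:
            inner = (
                f'<p class="ru">{escape(word)}</p>'
                f'<p class="lemma">{escape(lemma)}</p>'
            )
        lines.append(f'<div class="unit">{inner}</div>')
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [("комментаторы", "комментатор"), (",", None), ("журналисты", "журналист")]
    print(render_units(sample))
```

Escaping each word keeps stray angle brackets or ampersands in the source text from breaking the markup.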

More on interlinear text


  1. Wikipedia - Interlinear gloss

Splitting text into sentences: Russian edition

Splitting text into sentences is one of those tasks that looks simple but on closer inspection is more difficult than you think. A common approach is to use regular expressions to divide up the text on punctuation marks. But without adding layers of complexity, that method fails on some sentences. This is a method using spaCy.
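
To make the regex failure concrete, here is a small sketch of the naive approach. The pattern, the function name naive_split and the sample text are my own illustration, not taken from the post, and this does not show the spaCy method itself.

```python
import re

# Naive rule: a sentence ends at ., ! or ? followed by whitespace.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_split(text: str) -> list[str]:
    """Split text wherever sentence-final punctuation is followed by whitespace."""
    return NAIVE_BOUNDARY.split(text.strip())


if __name__ == "__main__":
    # Two sentences, but the initials "А. С." also end in periods.
    text = "Это написал А. С. Пушкин. Он родился в Москве."
    for piece in naive_split(text):
        print(repr(piece))
```

The two-sentence sample comes back as four pieces because each initial is treated as a sentence ending. Patching the pattern for initials, abbreviations and quoted speech is where the layers of complexity start to pile up.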

My favourite Cyrillic font

I’ve tried a lot of fonts for Cyrillic. My favourite is Georgia. As a non-native Russian speaker, there’s something about serif fonts, either on-screen or in print, that makes the text so much more legible.

The cancellation of Russian music

Free speech in Russia has never been particularly favoured. The Romanov dynasty remained in power long past their expiration date by suppressing waves of free thought, from the ideals of the Enlightenment, to the anti-capitalist ideals of Marx and Engels. At least, until the 1917 Revolution. And even then, the Bolsheviks continued to suppress dissent for the entire seventy-something year history of the Soviet Union. Perestroika and the collapse of the Soviet Union promised change.

Bash variable scope and pipelines

I alluded to this nuance involving variable scope in my post on automating PDF processing, but I wanted to expand on it a bit. Consider this little snippet:

i=0
printf "foo:bar:baz:quux" | grep -oE '[^:]+' | while read -r line ; do
    printf "Inner scope: %d - %s\n" "$i" "$line"
    ((i++))
    [ "$i" -eq 3 ] && break
done
printf "====\nOuter scope\ni = %d\n" "$i"

If you run this script - not in interactive mode in the shell - but as a script, what will i be in the outer scope?

Automating the handling of bank and financial statements

In my perpetual effort to get out of work, I’ve developed a suite of automation tools to help file statements that I download from banks, credit cards and others. While my setup described here is tuned to my specific needs, any of the ideas should be adaptable for your particular circumstances. For the purposes of this post, I’m going to assume you already have Hazel. None of what follows will be of much use to you without it.

Bulk rename tags in DEVONthink 3

In DEVONthink, I tag a lot. It’s an integral part of my strategy for finding things in my paperless environment. As I wrote about previously, hierarchical tags are a big part of my organizational system in DEVONthink. For many years, I tagged subject matter with tags that emanate from a single tag named topic_, but it was really an unnecessary top-level complication. So, the first item on my to-do list was to get rid of all the tags with a topic_ first level.

Stripping Russian syllabic stress marks in Python

I have written previously about stripping syllabic stress marks from Russian text using a Perl-based regex tool. But I needed a means of doing it solely in Python, so this just extends that idea.

#!/usr/bin/env python3

def strip_stress_marks(text: str) -> str:
    b = text.encode('utf-8')
    # correct error where latin accented ó is used
    b = b.replace(b'\xc3\xb3', b'\xd0\xbe')
    # correct error where latin accented á is used
    b = b.replace(b'\xc3\xa1', b'\xd0\xb0')
    # correct error where latin accented é is used
    b = b.
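
The excerpt above works on UTF-8 bytes. A different technique, which is not the one from this post, is to use Unicode normalization and delete the combining acute accent (U+0301) directly. The sketch below assumes stress is marked only with that combining character; the function name strip_combining_stress is hypothetical, and it does not perform the Latin-to-Cyrillic corrections shown in the excerpt.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # the stress mark that follows a stressed vowel


def strip_combining_stress(text: str) -> str:
    """Remove combining acute accents while leaving й and ё intact."""
    # Decompose so every accent becomes a separate combining code point.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop only the acute accent; the breve of й and the diaeresis of ё remain.
    stripped = decomposed.replace(COMBINING_ACUTE, "")
    # Recompose so й and ё are single code points again.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_combining_stress("замо\u0301к"))
```

Note that this removes the acute accent from any letter, so it is only suitable for text where the accent is used solely as a stress mark.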