Splitting text into sentences: Russian edition

Splitting text into sentences is one of those tasks that looks simple but turns out to be harder than you'd expect. A common approach is to use regular expressions to split the text on punctuation marks. But without added layers of complexity, that method fails on a sentence such as:

“Trapper John, M.D. was as fine as any Ph.D.”

It’s clearly a single sentence, but a regex that splits on periods will break it apart at the abbreviations.
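To make the failure concrete, here is a minimal sketch of the naive approach. The pattern is my own illustrative assumption (split on whitespace that follows ., ! or ?), not a rule taken from any particular library:

```python
import re

text = "Trapper John, M.D. was as fine as any Ph.D."

# Naive rule (an assumption for illustration): a sentence ends wherever
# '.', '!' or '?' is followed by whitespace.
naive = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

for piece in naive:
    print(piece)
# The single sentence comes back in two pieces, because the period
# that ends "M.D." is treated as a sentence boundary.
```

Handling this properly with regex means maintaining a list of abbreviations and exceptions, which is exactly the complexity a dedicated tool takes off your hands.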

A solution suggested on Stack Overflow is to use the spaCy natural language processing module along with its ‘sentencizer’ pipeline component to do the heavy lifting. The recommended solutions are all based on English-language processing, so I was curious to see if it would work on Russian text. The short answer is “yes.” This post is just to document the solution.

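from spacy.lang.ru import Russian

# blank Russian pipeline with only the rule-based sentencizer added
nlp_simple = Russian()
nlp_simple.add_pipe('sentencizer')

# text is the Russian string to be split
doc = nlp_simple(text)
sentences = [str(sent).strip() for sent in doc.sents]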

What's up with Pinboard? And an alternative

Beginning somewhere around April 2022, the bookmarking web application Pinboard began to suffer prolonged outages without any substantive commentary from the developer. Reports on Hacker News reveal a pattern of frequently broken functionality. As of this writing, the API is no longer functioning.

One of the great things about the HN community is that you can almost always find an open-source tool to get the job done. That’s how I discovered Espial. It’s a minimalist open-source self-hosted bookmarking tool that looks and works like Pinboard. It also imports the Pinboard export JSON format.

Espial installed readily for me on macOS and seems very usable. My advice is to export your Pinboard bookmarks while you can and spin up an instance of Espial.

My favourite Cyrillic font

I’ve tried a lot of fonts for Cyrillic. My favourite is Georgia.

As a non-native Russian speaker, I find that serif fonts, whether on-screen or in print, make the text so much more legible.

The cancellation of Russian music

Free speech in Russia has never been particularly favoured. The Romanov dynasty remained in power long past its expiration date by suppressing waves of free thought, from the ideals of the Enlightenment to the anti-capitalist ideals of Marx and Engels. At least, until the 1917 Revolution. And even then, the Bolsheviks continued to suppress dissent for the entire seventy-something-year history of the Soviet Union. Perestroika and the collapse of the Soviet Union promised change. But the change was fleeting.

Sunday, March 20, 2022

On Vladimir Putin

Some interesting writing on Vladimir Putin in Der Spiegel: an interview with political scientist Ivan Krastev on Putin’s motivations, psychological state and worldview.

“In 2008, during the war against Georgia, he [Putin] met with Alexei Venediktov, the editor-in-chief of the Ekho Moskvy radio station, which was one of the last critical media outlets in the country until it was shut down last week. Putin asked if Venediktov knew what he, Putin, had done in his previous job. Mr. President, Venediktov replied, we all know where you come from. Do you know, Putin said, what we did with traitors in my previous job? Yes, we know, said Venediktov. And do you know why I am speaking with you? Because you are an enemy and not a traitor!”

Bash variable scope and pipelines

I alluded to this nuance involving variable scope in my post on automating PDF processing, but I wanted to expand on it a bit.

Consider this little snippet:

i=0
printf "foo:bar:baz:quux" | grep -o '[^:]\+' | while read -r line ; do
   printf "Inner scope: %d - %s\n" "$i" "$line"
   ((i++))
   [ "$i" -eq 3 ] && break
done
printf "====\nOuter scope\ni = %d\n" "$i"

If you run this as a script, rather than interactively in the shell, what will i be in the outer scope? And why?

Automating the handling of bank and financial statements

In my perpetual effort to get out of work, I’ve developed a suite of automation tools to help file statements that I download from banks, credit cards and others. While my setup described here is tuned to my specific needs, any of the ideas should be adaptable for your particular circumstances. For the purposes of this post, I’m going to assume you already have Hazel. None of what follows will be of much use to you without it. I’ll also emphasize that this is a macOS-specific post. Bear in mind, too, that companies have the nasty habit of tweaking their statement formats. That fact alone makes any approach like this fragile; so be aware that maintaining these rules is just part of the game. With that out of the way, let’s dive in.

Bulk rename tags in DEVONthink 3

In DEVONthink, I tag a lot. It’s an integral part of my strategy for finding things in my paperless environment. As I wrote about previously, hierarchical tags are a big part of my organizational system in DEVONthink. For many years, I tagged subject matter with tags that emanate from a single tag named topic_, but it was really an unnecessary top-level complication. So, the first item on my to-do list was to get rid of all the tags with a topic_ first level.

Monday, January 24, 2022

Soviet maps of UK and North America

This is from an interesting collection of maps of UK and North American cities produced by the Soviet Union. You can search for maps of particular locations.


What we learned about each other

What I hate most about the past half decade is what we learned about each other. Call me naive, but I never imagined that uncles, cousins and friends I loved and respected would so effortlessly embrace Fascism. I never imagined they would elevate some TV conman to a deity.

Stripping Russian syllabic stress marks in Python

I have written previously about stripping syllabic stress marks from Russian text using a Perl-based regex tool. But I needed a means of doing it solely in Python, so this post just extends that idea.

#!/usr/bin/env python3

def strip_stress_marks(text: str) -> str:
   b = text.encode('utf-8')
   # correct error where latin accented ó is used
   b = b.replace(b'\xc3\xb3', b'\xd0\xbe')
   # correct error where latin accented á is used
   b = b.replace(b'\xc3\xa1', b'\xd0\xb0')
   # correct error where latin accented é (U+00E9, \xc3\xa9 in UTF-8) is used
   b = b.replace(b'\xc3\xa9', b'\xd0\xb5')
   # correct error where latin accented ý is used
   b = b.replace(b'\xc3\xbd', b'\xd1\x83')
   # remove combining acute accent (U+0301)
   b = b.replace(b'\xcc\x81', b'')
   return b.decode('utf-8')

text = "Том столкну́л Мэри с трампли́на для прыжко́в в во́ду."

print(strip_stress_marks(text))
# prints "Том столкнул Мэри с трамплина для прыжков в воду."

The approach is similar to the Perl-based tool we constructed before, but this time we are working on the bytes object after encoding as UTF-8. Since the bytes object has a replace method, we can use that to do all of the work. The first 4 replacements all deal with edge cases where accented Latin characters are used to show the placement of syllabic stress instead of the Cyrillic character plus the combining diacritical mark. In these cases, we just need to substitute the proper Cyrillic character. Then we just strip out the “combining acute accent” U+0301 (\xcc\x81 in UTF-8). After these replacements, we just decode the bytes object back to a str.
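As an aside, the same substitutions can also be expressed on the str itself rather than on the encoded bytes. The following is only a sketch of that alternative, not the method used above; the names LATIN_TO_CYRILLIC and strip_stress_marks_str are hypothetical, and the table assumes the same four Latin look-alikes plus the combining acute accent.

```python
# Alternative sketch (an assumption, not the bytes-based method above):
# one translation table applied with str.translate.
LATIN_TO_CYRILLIC = str.maketrans({
    "\u00f3": "\u043e",  # Latin ó -> Cyrillic о
    "\u00e1": "\u0430",  # Latin á -> Cyrillic а
    "\u00e9": "\u0435",  # Latin é -> Cyrillic е
    "\u00fd": "\u0443",  # Latin ý -> Cyrillic у
    "\u0301": None,      # combining acute accent: delete
})


def strip_stress_marks_str(text: str) -> str:
    """Return text with stress marks removed, working on code points."""
    return text.translate(LATIN_TO_CYRILLIC)


# Cyrillic "воду" with a combining acute accent after the "о"
print(strip_stress_marks_str("\u0432\u043e\u0301\u0434\u0443"))
```

Working with code points avoids hand-encoding each character to UTF-8 bytes, at the cost of departing from the bytes-based approach described here.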