
Removing stress marks from Russian text

Previously, I wrote about adding syllabic stress marks to Russian text. Here’s a method for doing the opposite: removing such marks (ударение, stress marks) from Russian text.

Although there may well be a more sophisticated approach, simple string replacement is well-suited to this task. The problem is that each stressed vowel is actually two code points: the base vowel followed by a combining acute accent (U+0301). A dictionary that maps each accented sequence back to its bare vowel handles this:

def remove_stress_marks(marks, text):
    # Replace each accented vowel (base vowel + combining acute accent)
    # with its unaccented counterpart.
    for accented, plain in marks.items():
        text = text.replace(accented, plain)
    return text

marks = {
    "а́": "а", "е́": "е", "о́": "о", "у́": "у",
    "я́": "я", "ю́": "ю", "ы́": "ы", "и́": "и",
    "ё́": "ё", "А́": "А", "Е́": "Е", "О́": "О",
    "У́": "У", "Я́": "Я", "Ю́": "Ю", "Ы́": "Ы",
    "И́": "И", "Э́": "Э", "э́": "э",
}

print(remove_stress_marks(marks, "Существи́тельные в шве́дском обычно де́лятся на пять склоне́ний."))

This should print: Существительные в шведском обычно делятся на пять склонений.
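Since every entry in the table above simply deletes the combining accent, a more compact alternative is to strip U+0301 itself. This sketch is my own illustration, not part of the original approach:

import unicodedata

def strip_stress(text):
    # Decompose precomposed characters so every accent is a separate code point,
    # drop the combining acute accent, then recompose.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = decomposed.replace("\u0301", "")
    return unicodedata.normalize("NFC", stripped)

print(strip_stress("Существи́тельные в шве́дском обычно де́лятся на пять склоне́ний."))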

Beginning to experiment with Stanza for natural language processing

After installing Stanza as a dependency of UDAR, which I recently described, I decided to play around with what it can do.

Installation

The installation is straightforward and is documented on the Stanza getting started page.

First,

sudo pip3 install stanza

Then install a model. For this example, I installed the Russian model:

#!/usr/local/bin/python3
import stanza
stanza.download('ru')

Usage

Part-of-speech (POS) and morphological analysis

Here’s a quick example of POS analysis for Russian. I used PrettyTable to clean up the presentation, but it’s not, strictly speaking, necessary.
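A minimal sketch along those lines, assuming the Russian model downloaded above (the sample sentence and table columns are my own illustration):

import stanza
from prettytable import PrettyTable

# Build a Russian pipeline with just the processors POS tagging needs.
nlp = stanza.Pipeline('ru', processors='tokenize,pos')
doc = nlp("Мы поедем завтра в Москву.")

table = PrettyTable(["Word", "UPOS", "Features"])
for sentence in doc.sentences:
    for word in sentence.words:
        # word.feats carries the morphological features, e.g. case and number.
        table.add_row([word.text, word.upos, word.feats])
print(table)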

Automated marking of Russian syllabic stress

One of the challenges that Russian learners face is the placement of syllabic stress, an essential determinant of pronunciation. Although most pedagogical texts for students have marks indicating stress, practically no texts intended for native speakers do. The placement of stress is inferred from memory and context.

I was delighted to discover Dr. Robert Reynolds’ work on natural language processing of Russian text to mark stress based on grammatical analysis of the text. What follows is a brief description of the installation and use of this work. The project page on GitHub has installation instructions, but I found a number of items that needed to be addressed that were not covered there. For example, this project (UDAR) depends on Stanza, which in turn requires a language-specific (Russian) model.

Scripting thumbnail image file creation on macOS

One of the sites that I manage uses a jQuery-based image gallery to display images in a grid. The script decides which thumbnail to use based on how large an image is needed. A series of suffixes à la Flickr^[Well, sort of. I don’t think this is exactly what Flickr uses; and I made up the _q suffix for the less than 500px image.] is used to signify classes of image size. I wrote the following script to automate the process of scanning a source folder and writing four thumbnail sizes to an output directory.
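As a sketch of the idea in Python with Pillow: the _q size comes from the footnote above; the folder names and the other suffix/size pairs are stand-ins for illustration.

#!/usr/local/bin/python3
from pathlib import Path
from PIL import Image

SOURCE_DIR = Path("source")   # stand-in for the gallery source folder
OUTPUT_DIR = Path("thumbs")   # stand-in for the output directory
# _q (< 500 px) is from the post; the remaining suffix/size pairs are stand-ins.
SIZES = {"_q": 500, "_z": 640, "_b": 1024, "_h": 1600}

OUTPUT_DIR.mkdir(exist_ok=True)
for src in SOURCE_DIR.glob("*.jpg"):
    for suffix, max_px in SIZES.items():
        with Image.open(src) as im:
            im.thumbnail((max_px, max_px))  # shrinks in place, keeps aspect ratio
            im.save(OUTPUT_DIR / f"{src.stem}{suffix}{src.suffix}")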

Anki database adventures: Counting notes by model type

Continuing my series on accessing the Anki database outside of the Anki application environment, here’s a piece on accessing the note type model. You may wish to start here with the first article on accessing the Anki database. This is geared toward macOS. (If you’re not on macOS, then start here instead.)

The note type model

Since notes contain flexible fields in Anki, the model for a note type is stored as JSON. The JSON structure is only loosely documented, so working with it involves some best-guess interpretation.
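As an illustration of putting the model JSON to work, here’s a sketch that counts notes by note type, assuming the older single-file schema in which the models live as JSON in the col table (COLLECTION_PATH is the collection path used throughout this series):

import json
from anki import Collection  # on newer versions: from anki.collection import Collection

col = Collection(COLLECTION_PATH)

# In this schema the note type models are one JSON blob in the col table,
# keyed by model id (mid).
models = json.loads(col.db.scalar("SELECT models FROM col"))
names = {int(mid): model["name"] for mid, model in models.items()}

# notes.mid links each note to its model; group and count.
for mid, count in col.db.all("SELECT mid, count(*) FROM notes GROUP BY mid"):
    print(f"{names.get(mid, mid)}: {count}")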

Accessing the Anki database with Python: Working with a specific deck

I previously wrote about accessing the Anki database using Python on macOS. In this short post, I’ll extend that to show how to work with a specific deck.

To use a named deck you’ll need its deck ID. Fortunately there’s a built-in method for finding a deck ID by name:

from anki import Collection  # on newer Anki versions: from anki.collection import Collection

col = Collection(COLLECTION_PATH)
dID = col.decks.id(DECK_NAME)

Now, in queries against the cards and notes tables, we can apply the deck ID to restrict results to a particular deck. For example, to find all of the cards currently in the learning stage:
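For instance, a sketch of that query, assuming Anki’s convention that queue = 1 marks cards in the learning stage:

# Cards in this deck that are currently in the learning queue (queue = 1).
learning = col.db.list("SELECT id FROM cards WHERE did = ? AND queue = 1", dID)
print(f"{len(learning)} cards in learning")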

Working with the Anki database on macOS using Python

Not long ago I ran across this post detailing a method for opening and inspecting the Anki database using Python outside the Anki application environment. However, the approach requires linking to the Anki code base, which is inaccessible on macOS since the Python code is packaged inside the Mac app on this platform.

The solution I’ve found is inelegant, but it just involves downloading the Anki code base to a location on your file system where you can link to it in your code. You can find the Anki code here on GitHub.
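In outline, the workaround looks like the sketch below; both paths are placeholders for wherever you put the Anki source and your collection file:

import sys

# Placeholder: wherever you downloaded the Anki code base.
sys.path.append("/path/to/anki")

from anki import Collection  # resolvable once the source is on sys.path

# Placeholder: the collection file inside your Anki profile folder.
col = Collection("/path/to/collection.anki2")
print(col.cardCount())  # quick sanity check that the collection opened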

Process automation in building Anki vocabulary cards

For the last two years, I’ve been working through a 10,000-word Russian vocabulary list ordered by frequency. My goal is to finish the list before the end of 2019. This requires not only stubborn persistence but also an efficient process for collecting the information that goes onto my Anki flash cards.

My manual process has been to work from a Numbers spreadsheet. As I collect information about each word from several websites, I log it in this table.

An approach to dealing with spurious sensor data in Indigo

Spurious sensor data can wreak havoc in an otherwise finely tuned home automation system. I use temperature data from an Aeotec MultiSensor 6 to monitor the environment in our greenhouse. Living in Canada, I cannot rely solely on passive systems to maintain the temperature, particularly at night. So, using the temperature and humidity measurements transmitted back to the controller over Z-Wave, I control devices inside the greenhouse that heat and humidify the environment.