Bulk rename tags in DEVONthink 3

In DEVONthink, I tag a lot. It’s an integral part of my strategy for finding things in my paperless environment. As I wrote about previously, hierarchical tags are a big part of my organizational system in DEVONthink. For many years, I tagged subject matter with tags that emanate from a single tag named topic_, but it was really an unnecessary top-level complication. So the first item on my to-do list was to get rid of all the tags with a topic_ first level.

I also began to despise the underscore as a separator symbol; a colon (:) takes up less screen real estate. So the second item on my to-do list was a symbol replacement.

Complicating all of this is my deeply-hierarchical tag tree, which means the renaming has to be done recursively. Since DEVONthink’s AppleScript support is excellent, I put together a little program to take care of the tag reorganization for me. Aside from the recursive processing, the other interesting bit is how to use the do shell script command to lean on command-line tools rather than resorting to AppleScript’s clumsy text-processing syntax.

I’ve posted the script here. It’s specific to my needs, but it’s here partly to remind my future self of how I did this and partly to serve as a jumping-off point for others who may have similar needs.

--
--   Created by: Alan Duncan
--   Created on: 2022-03-15
--
--   Copyright © 2022 OjisanSeiuchi, All Rights Reserved
--

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

global names
set names to {}

on processTag(thisTag)
   tell application id "DNtp"
      set db to the first database whose name is "leviathan"
      tell db
         set tagName to name of thisTag
         if tagName begins with "topic_" then
            -- remove the topic_ prefix
            set cmd to "echo " & quoted form of tagName & " | " & ¬
               "sed -E 's/topic_//g'"
            set newName to do shell script cmd
            -- change "_" to ":"
            set cmd to "echo " & quoted form of newName & " | " & ¬
               "tr \"_\" \":\""
            set newName to do shell script cmd
            -- rename
            tell thisTag
               set name to newName
            end tell
            set names to names & newName
         end if
         -- recurse into any child tags as needed
         tell thisTag
            repeat with childRecord in children
               set tagType to (tag type of childRecord as string)
               if tagType is "ordinary tag" then
                  set tagName to (name of childRecord as string)
                  processTag(childRecord) of me
               end if
            end repeat
         end tell
      end tell
   end tell
end processTag

tell application id "DNtp"
   set theSelection to the selection
   repeat with topTag in theSelection
      processTag(topTag) of me
   end repeat
   names
end tell

Stripping Russian syllabic stress marks in Python

I have written previously about stripping syllabic stress marks from Russian text using a Perl-based regex tool. But I needed a way of doing it solely in Python, so this just extends that idea.

#!/usr/bin/env python3

def strip_stress_marks(text: str) -> str:
   b = text.encode('utf-8')
   # correct error where latin accented ó is used
   b = b.replace(b'\xc3\xb3', b'\xd0\xbe')
   # correct error where latin accented á is used
   b = b.replace(b'\xc3\xa1', b'\xd0\xb0')
   # correct error where latin accented é is used
   b = b.replace(b'\xc3\xa9', b'\xd0\xb5')
   # correct error where latin accented ý is used
   b = b.replace(b'\xc3\xbd', b'\xd1\x83')
   # remove combining diacritical mark
   b = b.replace(b'\xcc\x81',b'').decode()
   return b

text = "Том столкну́л Мэри с трампли́на для прыжко́в в во́ду."

print(strip_stress_marks(text))
# prints "Том столкнул Мэри с трамплина для прыжков в воду."

The approach is similar to the Perl-based tool we constructed before, but this time we are working on the bytes object after encoding the string as UTF-8. Since the bytes object has a replace method, we can use that to do all of the work. The first four replacements deal with edge cases where accented Latin characters are used to show the placement of syllabic stress instead of the Cyrillic character plus the combining diacritical mark. In those cases, we just need to substitute the proper Cyrillic character. Then we strip out the combining acute accent, U+0301 (\xcc\x81 in UTF-8). After these replacements, we decode the bytes object back to a str.

Edit:

A little later, it occurred to me that there might be an easier way using the regex module (not re), which does a better job of handling Unicode. So here’s a version of the strip_stress_marks function that doesn’t involve taking a trip through a bytes object and back to a string:

import regex

def strip_stress_marks(text: str) -> str:
   # correct error where latin accented ó is used
   result = regex.sub('\u00f3', '\u043e', text)
   # correct error where latin accented á is used
   result = regex.sub('\u00e1','\u0430', result)
   # correct error where latin accented é is used
   result = regex.sub('\u00e9','\u0435', result)
   # correct error where latin accented ý is used
   result = regex.sub('\u00fd','\u0443', result)
   # remove combining diacritical mark
   result = regex.sub('\u0301', "", result)
   
   return result

I thought this might be faster, but instead using the regex module is about an order of magnitude slower. Oh well.

By compiling the regex, you can reclaim most of the difference, but the method using regular expressions is still about twice as slow as the approach of using the bytes object manipulation. For completeness, here is the version using compiled regular expressions:

o_pat = regex.compile(r'\u00f3')
a_pat = regex.compile(r'\u00e1')
e_pat = regex.compile(r'\u00e9')
y_pat = regex.compile(r'\u00fd')
diacritical_pat = regex.compile(r'\u0301')

def strip_stress_marks3(text: str) -> str:
   # correct error where latin accented ó is used
   result = o_pat.sub('\u043e', text)
   # correct error where latin accented á is used
   result = a_pat.sub('\u0430', result)
   # correct error where latin accented é is used
   result = e_pat.sub('\u0435', result)
   # correct error where latin accented ý is used
   result = y_pat.sub('\u0443', result)
   # remove combining diacritical mark
   result = diacritical_pat.sub("", result)
   
   return result
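
For reference, here is a rough sketch of how these timings might be compared with timeit. It assumes all three variants are defined in the same module under distinct names; strip_stress_marks_bytes and strip_stress_marks_regex are hypothetical renames of the first two versions to avoid the name collision.

import timeit

sample = "Том столкну́л Мэри с трампли́на для прыжко́в в во́ду."

# time 100,000 calls of each variant on the same sample sentence
for fn in (strip_stress_marks_bytes, strip_stress_marks_regex, strip_stress_marks3):
    elapsed = timeit.timeit(lambda: fn(sample), number=100_000)
    print(f"{fn.__name__}: {elapsed:.2f} s")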

Accessing Anki collection models from Python

For one-off projects that target Anki collections, I often use Python in a standalone application rather than an Anki add-on. Since I’m not going to distribute these little creations that are specific to my own needs, there’s no reason to create an add-on. These are just a few notes - nothing comprehensive - on the process.

One thing to be aware of is that there must be a perfect match between the Anki major and minor version numbers for the Python anki module to work. If you are running Anki 2.1.48 on your desktop application but have the Python module built for 2.1.49, it will not work. This is a huge irritation and there’s no backwards compatibility; the versions must match precisely.
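
As a minimal sketch of what such a standalone script can look like, the following opens a collection and lists its note types (models). It assumes the installed anki package version matches the desktop app, and the collection path shown is a hypothetical placeholder to be adjusted for your own profile.

from anki.collection import Collection

# hypothetical path; substitute your own profile's collection file
COLLECTION_PATH = "/Users/me/Library/Application Support/Anki2/User 1/collection.anki2"

col = Collection(COLLECTION_PATH)
try:
    # col.models manages the note types ("models"); all() returns them as dicts
    for model in col.models.all():
        field_names = [fld["name"] for fld in model["flds"]]
        print(model["name"], field_names)
finally:
    # close the collection so the database isn't left locked
    col.close()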

Converting Cyrillic UTF-8 text encoded as Latin-1

This may be obvious to some, but recognizing a character encoding at a glance is not always easy.

For example, pronunciation files downloaded from Forvo have the following appearance:

pronunciation_ru_Ð¾Ñ‚Ð±Ñ‹Ð²Ð°Ð½Ð¸Ðµ.mp3

How can we extract the actual word from this gibberish? Ideally, the filename should reflect the actual word uttered in the pronunciation file, after all.

Step 1 - Extracting the interesting bits

The gibberish begins after the pronunciation_ru_ prefix and ends before the file extension. Any regex tool can tease that out.
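
Here is a rough sketch of both steps in Python. The filename is illustrative, and the snippet assumes the bytes were misread as Windows-1252, the superset of Latin-1 that most tools actually apply; if strict Latin-1 was used, substitute "latin-1" for "cp1252".

import re

filename = "pronunciation_ru_Ð¾Ñ‚Ð±Ñ‹Ð²Ð°Ð½Ð¸Ðµ.mp3"  # illustrative mojibake

# Step 1: isolate the garbled portion between the fixed prefix and the extension
garbled = re.match(r"pronunciation_ru_(.+)\.mp3$", filename).group(1)

# Step 2: round-trip the characters back to bytes, then re-decode them as UTF-8
word = garbled.encode("cp1252").decode("utf-8")
print(word)  # отбывание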

accentchar: a command-line utility to apply Russian stress marks

I’ve written a lot about applying and removing syllabic stress marks in Russian text because I use it a lot when making Anki cards.

This iteration is a command-line tool for applying the stress mark at a particular character index. The advantage of these little shell tools is that they compose well, slotting into different workflows as the need arises.

#!/usr/local/bin/zsh

while getopts i:w: flag
do
    case "${flag}" in
        i) index=${OPTARG};;
        w) word=${OPTARG};;
    esac
done

if [ $word ]; then
    temp=$word
else
    read temp
fi

outword=""
for (( i=0; i<${#temp}; i++ )); do
    thischar="${temp:$i:1}"
    if [ $i -eq $index ]; then
        thischar=$(echo $thischar | perl -C -pe 's/(.)/\1\x{301}/g;')
    fi
    outword="$outword$thischar"
done

echo $outword

We can use it in a couple different ways. For example, we can provide all of the arguments in a declarative way:
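
# hypothetical invocation; assumes the script is saved as "accentchar" and is executable on the PATH
accentchar -w вода -i 3
# → вода́

Or, since the script also reads from standard input, the word can be piped in:

echo вода | accentchar -i 3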

sterilize-ng: a command-line URL sterilizer

Introducing sterilize-ng [GitHub link] - a URL sterilizer made to work flexibly on the command line.

Background

The surveillance capitalist economy is built on the relentless tracking of users. Imagine going about town running errands, but everywhere you go, someone is quietly following you. When you pop into the grocery, they examine your receipt. They look into the bags to see what you bought. Then they hop in the car with you and keep careful records of where you go, how fast you drive, and whom you talk with on the phone. This is surveillance capitalism: the relentless collection of the “digital exhaust” left by our actions online.

Using Perl in Keyboard Maestro macros

One of the things that I love about Keyboard Maestro is the ability to chain together disparate technologies to achieve some automation goal on macOS.

In most of my previous posts about Keyboard Maestro macros, I’ve used Python or shell scripts, but I decided to draw on some decades-old experience with Perl to do a little text processing for a specific need.

Background

I want this text from Wiktionary:

to look like this:

Stripping Russian stress marks from text from the command line

Russian text intended for learners sometimes contains marks that indicate the syllabic stress. The stress is usually rendered as a vowel plus a combining diacritical mark, typically the combining acute accent (U+0301). Here are a couple of ways of stripping these marks on the command line:

First is a version using Perl:

#!/bin/bash

f='покупа́ешья́';
echo $f | perl -C -pe 's/\x{301}//g;'

And then another using the sd tool:

#!/bin/bash

f='покупа́ешья́';
echo $f | sd "\u0301" ""

Both rely on finding the combining diacritical mark and removing it with regex.

Splitting a string on the command line - the search for the one-liner

It seems like the command line is one of those places where you can accomplish crazy efficient things with one-liners.

Here’s a perfect use case for a CLI one-liner:

In Anki, I often add lists of synonyms and antonyms to my vocabulary cards, but I like them formatted as a bulleted list. My usual route to that involves Markdown. But how to convert this:

известный, точный, определённый, достоверный

to

- `известный`
- `точный`
- `определённый`
- `достоверный`

After trying to come up with a single text replacement strategy to make this work, the best I could do was this:
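
# a sketch of one possible pipeline (a reconstruction, not necessarily the original one-liner)
echo "известный, точный, определённый, достоверный" \
  | tr ',' '\n' \
  | sed -E 's/^ *//; s/.*/- `&`/'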

A Keyboard Maestro macro to edit Anki sound file

Often when I import a pronunciation file into Anki, from Forvo for example, the volume isn’t quite right or there’s a lot of background noise, and I want to edit the sound file. How?

The solution for me, as is often the case, is a Keyboard Maestro macro.

Prerequisites

  • Keyboard Maestro - if you are a macOS power user and don’t have KM, then you’re missing out on a lot.
  • Audacity - the multi-platform FOSS audio editor

Outline of the approach

Since Keyboard Maestro won’t know the path to our file in Anki’s collection.media directory, we have to find it. But the first task is to extract the filename. In the Anki note field, it’s going to have this format:
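
[sound:some-file-name.mp3]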