Where the power lies in 2021

From a recent article on the BBC Russian Service:

Блокировка уходящего президента США в “Твиттере” и “Фейсбуке” привела к необычной ситуации: теоретически Трамп еще может начать ядерную войну, но не может написать твит.

“Blocking the outgoing U.S. President from Twitter and Facebook has led to an unusual situation: theoretically Trump can still start a nuclear war, but cannot write a tweet.”

In only a week, he won’t be able to do either. But while celebrating the deplatforming of this vicious clown, I have a tinge of worry about what it means for the future of democracy. It nearly goes without saying that social networks have become the de facto equals of representative government in the U.S.

More on integrating Hazel and DEVONthink

Since DEVONthink is my primary knowledge-management and repository tool on the macOS desktop, I constantly work with mechanisms for efficiently getting data into and out of it. I previously wrote about using Hazel and DEVONthink together. This post extends those ideas and looks into options for preprocessing documents in Hazel before importing them into DEVONthink, as a way of sidestepping some of the limitations of Smart Rules in the latter. I’m going to work from a particular use case to illustrate some of the options.

Use case

While preparing for tax season, I download all of my bank statements because I have to deal with foreign accounts for FATCA compliance. (Thanks a lot, U.S.!) It would be ideal if I could analyze the document content and rename the statement based on dates in the PDF. While Smart Rules in DEVONthink are quite robust, I have two problems with them:

  1. They don’t reliably trigger automatically. Often I find that the matching process works, but the actions aren’t triggered. Instead, the matched documents “accumulate” in the Smart Rule group and I have to select them and “Apply Rules” to get the actions started. Sometimes it works; sometimes it doesn’t.
  2. Options for extracting content from the PDF are limited. Specifically, I’ve not found a way to pull content from the OCR’d text of the PDF. Certainly, it’s possible to match against content; but extracting fields and using that data to, say, rename the document seems impossible.

Turning to Hazel, then, I can do much of the required pre-processing of the PDF document before it hits DEVONthink. In this particular use case, I want to extract the statement end date from the PDF content and use that data to rename the document before it reaches DEVONthink. Otherwise, all of the statements keep the same gibberish names they come with from the bank.

Using CAM::PDF to inspect the PDF

I like to work in Perl when I can because:

  • It plays nicely with the lower levels that we’re working in here.
  • I understand its regex model well.
  • Rarely having to deal with versioning issues is a plus over Python.

There are a few Perl packages that can inspect and manipulate PDF documents. Out of familiarity, I chose CAM::PDF. The first step is to dive into the text content of the PDF and see what’s there.

#!/usr/bin/perl

use strict;
use warnings;

use CAM::PDF;
use Data::Dumper;
$Data::Dumper::Indent = 1;
$Data::Dumper::Sortkeys = 1;

# dump the text content of the statement's first page
my $filename = "/Users/alan/blah2.pdf";
my $pdf = CAM::PDF->new($filename);
my $content = $pdf->getPageText(1);
print Dumper($content);

Now I can sort through the text and find the data of interest:

Account
No.
OCT
30/20
-
NOV
30/20
0026
0026-7247238

Don’t worry, I’ve obfuscated the account information here.

To extract NOV, 30, and 20, I can use a regex to pull them out of the content. Ideally, the content will remain stable between statements. To tune the regular expression, I use the excellent Patterns application on macOS, but there are many others. Here’s the extraction process laid out in a little more detail:

#!/usr/bin/perl

use strict;
use warnings;

use CAM::PDF;

my $filename = "/Users/alan/blah2.pdf";
my $pdf = CAM::PDF->new($filename);
my $content = $pdf->getPageText(1);

# capture the statement end date (the MON DD/YY after the hyphen),
# anchored on the account number
if( $content =~ m/-\n(\D+)\n(\d+)\/(\d+)\n\d+\n0026-7247238/ ) {
   my ($month_str,$day,$year) = (lc($1), $2, $3);
   my %month_dict = (
      jan =>  1, feb =>  2, mar =>  3,
      apr =>  4, may =>  5, jun =>  6,
      jul =>  7, aug =>  8, sep =>  9,
      oct => 10, nov => 11, dec => 12
   );
   my $month_num = $month_dict{$month_str};
   # e.g. "2020-11-30 Acme Bank business statement.pdf"
   my $fn = sprintf("20%d-%02d-%02d Acme Bank business statement.pdf", $year, $month_num, $day);
   # rebuild the full path with the new file name
   my @f = split('/',$filename);
   splice @f, -1;
   push @f, $fn;
   my $ff = join "/",@f;
   print $ff;
   #  rename($filename, $ff);    # rename the original file
}
else { print "No match\n"; }
exit 0;

If you implement this script as part of an actual Hazel rule, you’ll want to take the file path from the argument that Hazel passes to the script rather than hard-coding it, uncomment the rename line, and remove the print $ff statement and the final else condition. Of course, you’ll also need to adjust the regex and so forth, since it’s specific to my use case.

Importing to DEVONthink

Now that we’ve dived into the PDF text, extracted the information needed to rename the file, and renamed it, we can tag the file and import it into the desired DEVONthink group. This we’ll do via AppleScript:

tell application id "DNtp"
   -- whatever your db name is, mine is leviathan
   set dbs to first database whose name is "leviathan" 
   set myGroup to get record at "/path/to/your/group" in dbs
   -- theFile is the file that Hazel hands to an embedded AppleScript action
   set myRecord to import (POSIX path of theFile) to myGroup
   set tags of myRecord to {"main", "topic_financial", "topic_financial_banking", "topic_financial_content", "topic_financial_content_statement", "vendor", "vendor_acmebank"}
end tell


Undoing the Anki new card custom study limit

Recently I hit an extra digit when setting up a custom new card session and was stuck with hundreds of new cards to review. Desperate to fix this, I started poking around the Anki collection SQLite database and found the collection data responsible for the extra cards. In the col table, find the newToday key and you’ll see the extra card count expressed as a negative integer. Just change that to zero and you’ll be good.
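As a sketch of what that poking around looks like (and it is only a sketch: it assumes the older collection format, in which each deck’s newToday counter lives in the JSON decks column of the col table, and you should work on a backup of collection.anki2 with Anki closed), something like this prints the counters and resets any negative one to zero:

#!/usr/bin/env python3
# Sketch: inspect and reset each deck's newToday counter in an Anki collection.
# Assumes the older schema where the `col` table stores deck JSON in its `decks` column.
import json
import sqlite3

path = "collection.anki2"   # a backup copy of your collection
conn = sqlite3.connect(path)
(decks_json,) = conn.execute("SELECT decks FROM col").fetchone()
decks = json.loads(decks_json)

for deck in decks.values():
    day, count = deck["newToday"]
    print(deck["name"], day, count)
    if count < 0:                      # the runaway custom-study count
        deck["newToday"] = [day, 0]

conn.execute("UPDATE col SET decks = ?", (json.dumps(decks),))
conn.commit()
conn.close()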

Copy Zettel as link in DEVONthink

Following up on my recent article on cleaning up Zettelkasten WikiLinks in DEVONthink, here’s another script to solve the problem of linking notes. Backing up to the problem: in the Zettelkasten (or archive), Zettel (or notes) are stored as a list of Markdown files. But what happens when I want to add a link to another note into one that I’m writing? Since DEVONthink recognizes WikiLinks, I can just start typing, but then I have to remember the exact date so that I can pick the item out of the contextual list of links that DEVONthink offers.

Cleaning up Zettelkasten WikiLinks in DEVONthink Pro

Organizing and reorganizing knowledge is one of my seemingly endless tasks. For years, I’ve used DEVONthink as my primary knowledge repository. Recently, though, I began to lament the fact that while I seemed to be collecting and storing knowledge in a raw form in DEVONthink, I wasn’t really processing and engaging with it intellectually.1 In other words, I found myself collecting content but not really synthesizing, personalizing, and using it. While researching note-taking systems in search of a better way to process and absorb the information I had been collecting, I discovered the Zettelkasten method.

Regex to match a cloze

Anki and some other platforms use a particular format to signify cloze deletions in flashcard text. It looks like either of the following:

{{c1::dog::}}
{{c2::dog::domestic canine}}

Here’s a regular expression that matches the content of cloze deletions in an arbitrary string, keeping only the main clozed word (in this case dog):

{{c\d::(.*?)(::[^:]+)?}}

To see it in action, here it is in a Python script:
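As a minimal sketch (not necessarily the script from the original post), the regex can be used with Python’s re module and re.sub to keep just the clozed word; the sample sentence is only for illustration:

#!/usr/bin/env python3
import re

# matches {{c1::word}} / {{c2::word::hint}}-style cloze deletions,
# capturing the clozed word in group 1
CLOZE = re.compile(r"{{c\d::(.*?)(::[^:]+)?}}")

text = "The {{c2::dog::domestic canine}} barked at the {{c1::mailman}}."
print(CLOZE.sub(r"\1", text))   # -> The dog barked at the mailman.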

Removing stress marks from Russian text

Previously, I wrote about adding syllabic stress marks to Russian text. Here’s a method for doing the opposite, that is, removing such marks (ударение) from Russian text. Although there may well be a more sophisticated approach, regex is well-suited to this task. The idea is just a table of accented-to-unaccented replacements applied one by one:

def string_replace(dict, text):
    sorted_dict = {k: dict[k] for k in sorted(dict)}
    for n in sorted_dict.keys():
        text = text.replace(n, dict[n])
    return text

dict = {
    "а́": "а", "е́": "е", "о́": "о", "у́": "у", "я́": "я",
    "ю́": "ю", "ы́": "ы", "и́": "и", "ё́": "ё",
    "А́": "А", "Е́": "Е", "О́": "О", "У́": "У", "Я́": "Я",
    "Ю́": "Ю", "Ы́": "Ы", "И́": "И", "Э́": "Э", "э́": "э"
}

print(string_replace(dict, "Существи́тельные в шве́дском обычно де́лятся на пять склоне́ний."))
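Running this on the sample sentence strips the stress marks, printing «Существительные в шведском обычно делятся на пять склонений.»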

"Delete any app that makes money off your attention."

Listening to Cal Newport interviewed on a recent podcast, something he said resonated. I’m probably paraphrasing, but a key piece of advice was: “Delete any app that makes money off your attention.” It seems like really good advice. A smartphone is a collection of tools embedded in a tool. Use it like a tool and not an entertainment device and you’ll be fine. For a while, in an effort to pry myself loose from the psychic hold of the smartphone, I went back to using some kind of old flip phone.

URL-encoding URLs in AppleScript

The AppleScript Safari API is apparently quite finicky and rejects Russian Cyrillic characters when loading URLs. For example, the URL https://en.wiktionary.org/wiki/стоять#Russian throws an error in AppleScript. Instead, Safari requires URLs of the form https://en.wiktionary.org/wiki/%D1%81%D1%82%D0%BE%D1%8F%D1%82%D1%8C#Russian, whereas Chrome happily consumes whatever comes along. So we just need to encode the URL thusly:

use framework "Foundation"

-- encode Cyrillic text as "%D0"-type strings
on urlEncode(input)
   tell current application's NSString to set rawUrl to stringWithString_(input)
   -- 4 is NSUTF8StringEncoding
   set theEncodedURL to rawUrl's stringByAddingPercentEscapesUsingEncoding:4
   return theEncodedURL as Unicode text
end urlEncode

When researching Russian words for vocabulary study, I use this URL-encoding handler to load the appropriate words into several reference sites in sequential Safari tabs.

Consume media outside one's bubble?

That “reality bubbles” contribute heavily to increasing political polarization is well-known. Customized media diets at scale and social media feeds that are tailored to individual proclivities progressively narrow our understanding of perspectives other than our own. Yet, the cures are difficult and uncertain. Often, though, we’re advised to consume media from the other side of the political divide. A sentence from a recent piece in The Atlantic encapsulates why I think this is such a fraught idea: