Searching the Russian National Corpus

The Russian language has a vast and nuanced vocabulary. One way to learn it is to work through the words in frequency order. The Nicholas Brown book seems dated, and its frequency-ordering methodology is not clear to me. Some words seem to be clustered by initial letter, which seems statistically unlikely. However, it’s a convenient list, and I’m slowly building a table that cross-references the Nicholas Brown list with the methodologically superior Russian National Corpus. To do that, I harvested the data from the Corpus and built a Python application to search the database and report a word’s rank and frequency.

Creating a sqlite3 version of the Russian National Corpus

There is a CSV version of the Corpus, but the data is not useful for ordering in a meaningful way. Instead, I took the rank-ordered tabular data from the page Частотный список лемм (Frequency list of lemmas) and simply pasted it into a Numbers spreadsheet. Numbers is extremely slow even on a fast machine; it beachballed for nearly a minute during the paste operation. After that, I exported it as CSV. To get the CSV file into a sqlite3 database, I created a new table with the following schema:

After mapping the column names to those in the CSV, the import was simple.
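As a rough illustration of that import step, here is a minimal Python 3 sketch. It is not the schema I actually used: the table name corpus and the column names rank, word and frequency are taken from the query script below, but the column types, the CSV header row and the function name build_corpus_db are assumptions made for the example.

```python
import csv
import sqlite3


def build_corpus_db(csv_path, db_path):
    """Load a rank,word,frequency CSV into a 'corpus' table (assumed layout)."""
    conn = sqlite3.connect(db_path)
    # Column names match the query script; the types are an assumption.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS corpus ("
        "rank INTEGER, word TEXT, frequency REAL)"
    )
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Assumes the exported CSV has a header row: rank,word,frequency
        reader = csv.DictReader(f)
        rows = [(int(r["rank"]), r["word"], float(r["frequency"])) for r in reader]
    conn.executemany(
        "INSERT INTO corpus (rank, word, frequency) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    conn.close()
    return len(rows)
```

If the exported file uses different column headings, the three dictionary keys are the only thing that should need changing.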

Accessing the sqlite3 version of the corpus using Python

Next I wrote a little Python application to access the data and return the rank or frequency of any Russian word. The only trick is that the Russian letter ё is rendered as е in the database, so any word containing ё must be altered before the search. Regular expressions to the rescue! To use the application, just launch it with the -h help flag and you’ll see the calling format.

#!/usr/bin/python
# encoding=utf8

import sqlite3
import sys
import re
import argparse

# '/Users/alan/torrential/russian/vocabulary/RussianNationalCorpus'

# instantiate argument parser
parser = argparse.ArgumentParser(description='Search the Russian National Corpus')
# arguments
parser.add_argument('word', help='Russian word to search for')
parser.add_argument('db_path', help='Path to the sqlite db')
group = parser.add_mutually_exclusive_group()
group.add_argument('--r', action='store_true', help='Show rank order')
group.add_argument('--f', action='store_true', help='Show frequency in instances/million')
# parse
args = parser.parse_args()

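word = args.word
# the corpus stores ё as е, so normalize the search term first
replaced = re.sub('ё', 'е', word)
# Python 2 hands us bytes; sqlite3 wants text for non-ASCII parameters
if isinstance(replaced, bytes):
    replaced = replaced.decode('utf-8')

conn = sqlite3.connect(args.db_path)

# the column name comes from a fixed choice, never from user input
col = "rank" if args.r else "frequency"
# bind the word as a parameter rather than concatenating it into the SQL
sql = "SELECT " + col + " FROM corpus WHERE word LIKE ?"
curs = conn.cursor()
curs.execute(sql, (replaced,))
row = curs.fetchone()
if row is None:
    sys.exit("word not found in corpus")
print(row[0])
conn.close()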

And a little AppleScript

Finally, I wrote an AppleScript wrapper that I can launch with a Quicksilver keystroke trigger. The wrapper takes the word off the clipboard and calls the Python app above, replacing the contents of the clipboard with the rank order of the word. For a little fun, it speaks the rank order number in Russian! Here’s the code for the AppleScript wrapper:

--
--	Created by: Alan Duncan
--	Created on: 2018-09-22
--
--	Copyright (c) 2018 Ojisan Seiuchi
--	Use to your heart's content; just give me a little credit
--

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

set dbPath to "/Users/alan/torrential/russian/vocabulary/RussianNationalCorpus"

set errorFlag to 0
set w to the clipboard
set cmd to "python /Users/alan/Documents/dev/scripts+tools/getRussianRank.py " & quoted form of w & " " & quoted form of dbPath & " --r"

try
	set rank to do shell script cmd
	set the clipboard to rank
on error errMsg
	say "плохо"
	set errorFlag to 1
end try
if errorFlag is 0 then
	set saying to "Готово " & (rank as string)
	say saying
end if

If you want a pre-built sqlite3 version of the Russian National Corpus, here it is

Sunday, September 16, 2018

Regex 101 is a great online regex tester.


Speaking of regular expressions, for the past year, I’ve used an automated process for building Anki flash cards. One of the steps in the process is to download Russian word pronunciations from Wiktionary. When Wiktionary began publishing transcoded mp3 files rather than just ogg files, they broke the URL scheme that I relied on to download content. The new regex for this scheme is: (?:src=.*:)?src=\"(\/\/.*\.mp3)
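For illustration, here is a minimal sketch of how that pattern might be applied in Python. The function name find_mp3_url and the sample markup are made up for the example; real Wiktionary pages are more involved, and prepending https: to the protocol-relative match is an assumption about how the URL gets used afterwards.

```python
import re

# The pattern from the post, copied verbatim.
MP3_SRC = re.compile(r'(?:src=.*:)?src=\"(\/\/.*\.mp3)')


def find_mp3_url(html):
    """Return the first mp3 URL matched in html, or None if nothing matches."""
    match = MP3_SRC.search(html)
    if match is None:
        return None
    # The captured value is protocol-relative, so prepend a scheme.
    return "https:" + match.group(1)


# Hypothetical markup, not copied from a real Wiktionary page.
sample = '<source src="//upload.example.org/transcoded/Ru-слово.ogg.mp3" type="audio/mpeg">'
print(find_mp3_url(sample))
```

Because both .* groups are greedy, the pattern depends on the layout of the page it is run against, so it is worth re-checking in a tester whenever the markup changes.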

Saturday, September 15, 2018

Interestingly, Fox News rejects requests from the Tor Browser. The New York Times loads perfectly normally via Tor. I don’t often visit Fox News but an article title caught my attention.

Thursday, September 13, 2018

Politico has a piece today about Trump’s outrageous claims in the face of weather disasters. In almost every context, he reveals himself to be an abject fool; but lurking beneath that idiocy is another layer of loathsomeness: a complete lack of understanding of science. I want a reporter to ask him any of the following questions about hurricanes:

  • “Mr. Trump, can you describe for us your understanding of how hurricanes form?”
  • “What role do Coriolis forces play in the formation of tropical cyclones?”
  • “Given that hurricanes possess massive amounts of energy, what are the sources of that energy?”

An article from the Times on why yelling at children is comparable to physical punishment. Children who are subjected to yelling have lower self-esteem, and more depressive and anxiety symptoms.^[The article cites a study that shows a reciprocal amplifying effect of yelling and behavioural problems: “Mothers’ and fathers’ harsh verbal discipline at age 13 predicted an increase in adolescent conduct problems and depressive symptoms between ages 13 and 14. A child effect was also present, with adolescent misconduct at age 13 predicting increases in mothers’ and fathers’ harsh verbal discipline between ages 13 and 14.”]

How fascism works

A recent piece in The Atlantic by Peter Beinart filled in a cognitive gap in my understanding of how a large minority of U.S. citizens continue to support an abjectly incompetent, almost certainly criminal, willfully ignorant, and generally hateful man as president. The article, Why Trump supporters believe he is not corrupt, makes the argument that when Trump defenders concern themselves with the idea of corruption, they are thinking not so much of political corruption as of the corruption of purity. This is consistent with Jonathan Haidt’s research into the determinants of a person’s moral judgments as a function of political affiliation.^[This has been noted before by Thomas Edsall back in early 2016 writing for The New York Times.] Conservatives are likelier than liberals to concern themselves with tradition and purity. When Donald Trump uses the word disgusting, which he has done scores of times on Twitter, he’s invoking the conservative fear of taint. The Special Counsel’s inquiry into possible collusion and other crimes committed during the 2016 election is, in Trump’s view, not merely unlawful, biased, or unfavourable in some other objective way. It is, to Trump, disgusting (“this Rigged and Disgusting Witch Hunt.”)

Using lynx to bypass ad block detection

It turns out that command-line text web browsers like lynx can bypass AdBlock detection.

On macOS, I installed lynx using Homebrew. Then from the Terminal, it’s just lynx your-url. It’s actually quite pleasant to read text without all of the images and fluff.

Quarantining extremist ideas

This is an interesting essay in The Guardian on the idea of quarantining extremist ideas.

A non-trivial proportion of the population regards the media as having a responsibility to represent all ideas with equal validity. So the appearance of extremist ideas in the press, even if they are treated negatively, gives them more legitimacy than they are due. The authors of this essay make a case for quarantining these extreme ideas by refusing to cover them. Strategic silence, they call it.