Parsing Russian Wiktionary content using XPath

As readers of this blog know, I’m an avid user of Anki to learn Russian. I have a number of sources for reference content that go onto my Anki cards. Notably, I use Wiktionary to get word definitions and the word with the proper syllabic stress marked. (This is an aid to pronunciation for Russian language learners.)

Since I’m lazy to the core, I came up with a systematic way of grabbing the stress-marked word from the Wiktionary page using lxml and XPath.

Finding the XPath of the element

First, we need to find the XPath of the element we want. Right-click on the stress-marked word and select “Inspect Element” from the contextual menu in Safari. After confirming that the correct HTML is displayed, right-click again and select “Copy” > “XPath”.

Fortunately, the Wiktionary format is (relatively) consistent and stable enough that I’ll just tell you the XPath. It’s //*[@id="mw-content-text"]/div[1]/p[2]/strong

Prerequisites

You will need the HTML/XML parsing module lxml, so install it with pip3 install lxml. While you’re at it, you’ll need BeautifulSoup later, so install that too: pip3 install beautifulsoup4. The scripts also use requests, so install it with pip3 install requests if you don’t already have it.

Scraping the text of the element

#!/usr/bin/env python3

from lxml import html
import requests
import sys

# Fetch the Wiktionary page whose URL is the first script argument
page = requests.get(sys.argv[1])
tree = html.fromstring(page.content)

# The stress-marked headword is the <strong> element in the second
# paragraph of the content area
headword = tree.xpath('//*[@id="mw-content-text"]/div[1]/p[2]/strong')
try:
    print(headword[0].text)
except IndexError:
    # Some entries have only one introductory paragraph, so fall back
    # to a less specific path
    headword = tree.xpath('//*[@id="mw-content-text"]/div[1]/p/strong')
    print(headword[0].text)

And that’s it - the script should print the headword, the accented word on the page. (It assumes the Wiktionary URL is passed as the first script argument.)

Scraping the definition(s)

It becomes a little more complicated to scrape the word definitions because Wiktionary makes extensive use of markup in the middle of the definition. But BeautifulSoup seems to do an admirable job of wading through the fluff to return just the text of the definition.

#!/usr/bin/env python3

from lxml import html
from lxml import etree
import requests
import re
from bs4 import BeautifulSoup

# Fetch the Wiktionary page for the word
page = requests.get('https://en.wiktionary.org/wiki/перерасти#Russian')
tree = html.fromstring(page.content)

# Grab the stress-marked headword, falling back to a less specific
# path if the page layout differs
headword = tree.xpath('//*[@id="mw-content-text"]/div[1]/p[2]/strong')
try:
    print(headword[0].text)
except IndexError:
    headword = tree.xpath('//*[@id="mw-content-text"]/div[1]/p/strong')
    print(headword[0].text)

# The definitions are the items of the first ordered list in the
# content area; BeautifulSoup strips the inline markup from each one
def_list = tree.xpath('//*[@id="mw-content-text"]/div[1]/ol')
def_text = ''
for li in def_list[0].iterchildren():
    soup = BeautifulSoup(etree.tostring(li), 'html.parser')
    def_text = def_text + '\n' + soup.get_text()

# Collapse the blank lines left behind by nested markup
result = re.sub(r'\n\n', '\n', def_text)
print(result)

That’s the complete code to grab the headword and the definition list.


Being grateful for those who push our buttons


We need people to push our buttons, otherwise how are we to know what buttons we have?

Jetsunma Tenzin Palmo, Ten Percent Happier podcast, February 8, 2021

Jetsunma Tenzin Palmo is a Buddhist nun interviewed on the excellent Ten Percent Happier podcast. It’s always possible to reframe situations where someone “pushes our buttons” as opportunities to better understand that we have these buttons, these sensitivities that otherwise evade our awareness.

Directly setting an Anki card's interval in the sqlite3 database

It’s always best to let Anki set intervals according to its view of your performance on testing. That said, there are times when directly altering the interval makes sense. For example, to build out a complete representation of the entire Russian National Corpus, I’m forced to enter vocabulary terms that should be obvious to even elementary Russian learners but which aren’t yet in my nearly 24,000-card database. Therefore, I’m entering these cards gradually.
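The mechanics are simple but worth spelling out. As a rough, hypothetical sketch (not the post’s own code), the interval lives in the ivl column of the cards table inside collection.anki2; with Anki closed and the collection backed up, it could be changed along these lines:

#!/usr/bin/env python3
# Hypothetical sketch, not the post's code: set one card's interval (in days)
# directly. Close Anki and back up collection.anki2 before doing this.
import sqlite3
import time

COLLECTION = 'collection.anki2'  # path to your collection file (assumption)
CARD_ID = 1234567890123          # placeholder card id
NEW_INTERVAL = 180               # days

conn = sqlite3.connect(COLLECTION)
conn.execute('UPDATE cards SET ivl = ?, mod = ?, usn = -1 WHERE id = ?',
             (NEW_INTERVAL, int(time.time()), CARD_ID))
conn.commit()
conn.close()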

Where the power lies in 2021

From a recent article on the BBC Russian Service: Блокировка уходящего президента США в “Твиттере” и “Фейсбуке” привела к необычной ситуации: теоретически Трамп еще может начать ядерную войну, но не может написать твит. “Blocking the outgoing U.S. President from Twitter and Facebook has led to an unusual situation: theoretically Trump can still start a nuclear war, but cannot write a tweet.” In only a week, he won’t be able to do either.

More on integrating Hazel and DEVONthink

Since DEVONthink is my primary knowledge-management and repository tool on the macOS desktop, I constantly work with mechanisms for efficiently getting data into and out of it. I previously wrote about using Hazel and DEVONthink together. This post extends those ideas and looks into options for preprocessing documents in Hazel before importing them into DEVONthink, as a way of sidestepping some of the limitations of Smart Rules in the latter. I’m going to work from a particular use-case to illustrate some of the options.

Undoing the Anki new card custom study limit

Recently I hit an extra digit when setting up a custom new card session and was stuck with hundreds of new cards to review. Desperate to fix this, I started poking around the Anki collection SQLite database and found the collection data responsible for the extra cards. In the col table, find the newToday key and you’ll find the extra card count expressed as a negative integer. Just change that to zero and you’ll be good.
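As a minimal, hypothetical sketch of that fix, assuming an older Anki schema in which per-deck state is stored as JSON in the col table’s decks column (and, as always, a closed Anki and a backed-up collection):

#!/usr/bin/env python3
# Hypothetical sketch: zero out the newToday counts stored as JSON
# in the col table (older Anki schema). Back up collection.anki2 first.
import json
import sqlite3

conn = sqlite3.connect('collection.anki2')
decks = json.loads(conn.execute('SELECT decks FROM col').fetchone()[0])

# Each deck's newToday is a [day, count] pair; a negative count is what
# raises the new card limit, so reset it to zero
for deck in decks.values():
    if 'newToday' in deck:
        deck['newToday'][1] = 0

conn.execute('UPDATE col SET decks = ?', (json.dumps(decks),))
conn.commit()
conn.close()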

Copy Zettel as link in DEVONthink

Following up on my recent article on cleaning up Zettelkasten WikiLinks in DEVONthink, here’s another script to solve the problem of linking notes. Backing up to the problem: in the Zettelkasten (or archive), Zettel (or notes) are stored as a list of Markdown files. But what happens when I want to add a link to another note into one that I’m writing? Since DEVONthink recognizes WikiLinks, I can just start typing, but then I have to remember the exact date so that I can pick the item out of the contextual list that DEVONthink offers as links.
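The post’s actual solution is a DEVONthink script; purely as a hypothetical illustration of the underlying idea, the WikiLink text can be derived mechanically from a Zettel’s filename, something like this:

#!/usr/bin/env python3
# Hypothetical illustration only: build a [[WikiLink]] from a Zettel's
# filename so it can be pasted into the note being written.
import pathlib
import sys

def wikilink_for(path):
    # DEVONthink-style WikiLinks wrap the document name without its extension
    return '[[' + pathlib.Path(path).stem + ']]'

if __name__ == '__main__':
    # e.g. ./wikilink.py '202102091041 Parsing Wiktionary.md'
    print(wikilink_for(sys.argv[1]))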

Cleaning up Zettelkasten WikiLinks in DEVONthink Pro

Organizing and reorganizing knowledge is one of my seemingly endless tasks. For years, I’ve used DEVONthink as my primary knowledge repository. Recently, though, I began to lament the fact that while I seemed to be collecting and storing knowledge in a raw form in DEVONthink, I wasn’t really processing and engaging with it intellectually. In other words, I found myself collecting content but not really synthesizing, personalizing, and using it. While researching note-taking systems in search of a better way to process and absorb the information I had been collecting, I discovered the Zettelkasten method.

Regex to match a cloze

Anki and some other platforms use a particular format to signify cloze deletions in flashcard text. It looks like either of the following:

{{c1::dog::}}
{{c2::dog::domestic canine}}

Here’s a regular expression that matches the content of cloze deletions in an arbitrary string, keeping only the main clozed word (in this case dog):

{{c\d::(.*?)(::[^:]+)?}}

To see it in action, here it is in a Python script:
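As a minimal sketch of such a script, using the regex above to reduce each cloze deletion to its main word:

#!/usr/bin/env python3
# Minimal sketch: reduce cloze deletions to their main clozed word
import re

cloze = re.compile(r'{{c\d::(.*?)(::[^:]+)?}}')

text = 'The {{c2::dog::domestic canine}} barked all night.'
print(cloze.sub(r'\1', text))
# prints: The dog barked all night.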

Removing stress marks from Russian text

Previously, I wrote about adding syllabic stress marks to Russian text. Here’s a method for doing the opposite - that is, removing such marks (ударение) from Russian text. Although there may well be a more sophisticated approach, a simple table of string replacements is well-suited to this task:

def string_replace(dict, text):
    sorted_dict = {k: dict[k] for k in sorted(dict)}
    for n in sorted_dict.keys():
        text = text.replace(n, dict[n])
    return text

dict = {
    "а́": "а", "е́": "е", "о́": "о", "у́": "у", "я́": "я",
    "ю́": "ю", "ы́": "ы", "и́": "и", "ё́": "ё",
    "А́": "А", "Е́": "Е", "О́": "О", "У́": "У", "Я́": "Я",
    "Ю́": "Ю", "Ы́": "Ы", "И́": "И", "Э́": "Э", "э́": "э",
}

print(string_replace(dict, "Существи́тельные в шве́дском обычно де́лятся на пять склоне́ний."))