Week functions in Dataview plugin for Obsidian

There are a couple of features of the Dataview plugin for Obsidian that aren’t documented but are potentially useful.

For the start of the week, use `date(sow)`, and for the end of the week, `date(eow)`. Since there’s no documentation as of yet, I’ll venture a guess that they are locale-dependent. For me (in Canada), `sow` is Monday. Since I do my weekly notes on Saturday, I have to subtract a couple of days to point to them.

`="[[" + dateformat(date(sow) - dur(2 days), "yyyy-MM-dd") + " weekly" + "|Week]]"`

This inline Dataview function will provide a link to my weekly summary document.
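If you’re curious which day `sow` resolves to in your locale, an inline query along these lines should print the weekday name (`dateformat` takes Luxon-style tokens, where cccc is the full weekday):

`=dateformat(date(sow), "cccc")`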

Scraping Forvo pronunciations

Most language learners are familiar with Forvo, a site that allows users to download and contribute pronunciations for words and phrases. For my Russian studies, I make daily use of the site. In fact, to facilitate my Anki card-making workflow, I am a paid user of the Forvo API. But that’s where the trouble started.

When the Forvo API works, it works tolerably well, though it’s often extremely slow. But lately, it has been down more than up. In an effort to patch my workflow and continue downloading Russian word pronunciations, I wrote this little scraper. I’d prefer to use the API, but experience has shown that it’s slow and unreliable. I’ll keep paying for API access, because I support what the company does. As often as not, when a company offers a free service, it’s involved in surveillance capitalism; I’d rather companies offer a reliable product at a reasonable price.

There are other open-source projects that do something similar. This one incorporates an interesting feature: it attempts to rank pronunciations with a scoring system based on whether the contributing user is a favourite and on how many votes the pronunciation has gained.1

If you just want to get started with the scraper, it’s up on GitHub. I’m open to pull requests if you have something interesting to contribute, or honestly, you can do whatever you would like with it. I’d appreciate a little acknowledgement if you adopt the code in your own work.

If you want to stick around and see how I did things, feel free to follow along.

Approach

This scraper uses Selenium for Python, which drives a full (non-headless) browser in order to capture the HTML that we’re going to scrape. The idea is to future-proof the script against attempts by the company to detect script-based access. It was also a chance to learn how to integrate Selenium with Beautiful Soup 4, my go-to scraping library for Python. First, we’re going to need to import those dependencies:

import time  # used to pause briefly after dismissing pop-ups
from typing import Optional  # for the Optional return annotations below

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait

from bs4 import BeautifulSoup, element

We use command-line arguments in the script, so we will need the argparse infrastructure to get our download destination and the word we’re researching:

import argparse

if __name__ == "__main__":
   prog_desc = "Download pronunciation file from Forvo"
   parser = argparse.ArgumentParser(description=prog_desc)
   parser.add_argument('--dest',
                        help="The directory for download",
                        type=str)
   parser.add_argument('--word',
                        help="Word to research",
                        type=str)

   args = parser.parse_args()
   word = args.word
   dest = args.dest
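From here, the word becomes a URL. Forvo’s word pages follow the pattern https://forvo.com/word/<word>/ (at least as of this writing), so building it is a one-liner:

# Forvo word page URL pattern, current as of this writing
url = f"https://forvo.com/word/{word}/"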

Capturing the Forvo page

We’re using Selenium to capture the HTML content of the pronunciation list page so that we can analyze it:

def get_forvo_page(url: str) -> BeautifulSoup:
   """Get the bs4 object from Forvo page

   url: str - the Forvo pronunciation page

   Returns
   -------
   BeautifulSoup 4 object for the page
   """
   driver = webdriver.Safari()
   driver.get(url)
   driver.implicitly_wait(30)
   # dismiss the privacy consent dialog, if it appears
   agree_button = driver.execute_script("""return document.querySelector("button[mode='primary']");""")
   try:
      agree_button.click()
   except AttributeError:
      pass
   # dismiss the modal pop-up, if it appears
   try:
      close_button = driver.execute_script("""return document.querySelector("button.mfp-close");""")
      close_button.click()
   except AttributeError:
      pass
   # give the page a moment to settle before grabbing its source
   time.sleep(1)
   soup = BeautifulSoup(driver.page_source, 'html.parser')
   driver.quit()
   return soup

Note that this uses Safari; I’m on macOS, so that’s the natural choice. You can use a different driver (for instance, webdriver.Firefox() or webdriver.Chrome()) if you are on another platform.

Note that we have to try to respond to privacy notices and other pop-ups that try to ruin our day. We could probably do it without the JavaScript, but that’s what came to me in the moment.
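A sketch of the JavaScript-free route, using Selenium’s own lookup (assuming the same selector and the Selenium 4 locator API):

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

try:
   # with the implicit wait set, this blocks up to 30 seconds if no button exists
   driver.find_element(By.CSS_SELECTOR, "button[mode='primary']").click()
except NoSuchElementException:
   pass

With the page captured, here’s the shape of the markup we’re after: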

<ul class="pronunciations-list pronunciations-list-ru" id="pronunciations-list-ru">
   <li class="pronunciation li-active">
      <!-- detail for this pronunciation -->
   </li>
   <!-- more pronunciation items -->
</ul>

All of the Russian-language pronunciations are in the unordered list with the id pronunciations-list-ru, so our next task is to find that ul element and enumerate its items. Fortunately, Beautiful Soup makes that incredibly easy:

ru_pronunciation_list = soup.find("ul", {"id": "pronunciations-list-ru"})
if ru_pronunciation_list is None:
   exit('ERROR - this word may not exist on Forvo!')

Then we can loop over ru_pronunciation_list to find all of the pronunciation list items (li) and accumulate them as our custom Pronunciation objects:

pronunciations = []
for li in ru_pronunciation_list.find_all("li"):
   pronunciation = pronunciation_for_li(li)
   pronunciations.append(pronunciation)

Next, we’ll take a look at what pronunciation_for_li does with each of those <li> elements:

def pronunciation_for_li(element: element.Tag) -> Optional[Pronunciation]:
   """Pronunciation object from its <li> element

   Returns an optional Pronunciation object from a
   <li> element that contains the required info.

   Returns
   -------
   Pronunciation object, or None
   """
   # default to no user so we don't hit a NameError when the info span is missing
   user = None
   info_span = element.find("span", {"class": "info"})
   if info_span is not None:
      user = user_from_info_span(info_span)
   votes = num_votes_from_li(element)
   url = audio_link_for_li(element)
   if url is not None:
      return Pronunciation(user, votes, url)
   return None

Here we’re just extracting the vote count, the username, and the audio file link from the deeper levels of the hierarchy, which is left as an exercise for the reader. One detail that bears mentioning is how we extract the link to the .ogg file. Each pronunciation has a play button with an onclick attribute. The JavaScript call contains a base64-encoded value that, once decoded, forms a component of the audio file’s path.
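Since that’s the interesting part, here’s a rough sketch of what audio_link_for_li might look like. The exact onclick signature and the audio host are assumptions based on what the page served at the time of writing, so treat both as fragile:

import base64
import re

# assumption: pronunciation audio is served from this host
AUDIO_HOST = "https://audio00.forvo.com/ogg/"

def audio_link_for_li(li: element.Tag) -> Optional[str]:
   """Extract the .ogg URL from the play button's onclick handler."""
   play = li.find("div", {"class": "play"})
   if play is None or not play.has_attr("onclick"):
      return None
   # the handler looks like Play(id,'<b64>','<b64>',...); grab the quoted arguments
   match = re.search(r"Play\(\d+,'([^']+)','([^']+)'", play["onclick"])
   if match is None:
      return None
   # the second base64 argument decodes to the relative path of the .ogg file
   return AUDIO_HOST + base64.b64decode(match.group(2)).decode("utf-8")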

Ranking pronunciations

As mentioned, we rank pronunciations by two variables - username and the number of votes. But we need a method for ordering them in the list of pronunciations. We use functools.total_ordering for this.

from functools import total_ordering

@total_ordering
class Pronunciation(object):
   def __init__(self, uname: str, positive: int, path: str):
      self.user_name = uname
      self.positive: int = positive
      self.path: str = path

By decorating the Pronunciation class, we can later use max against our list of pronunciations to select the highest-rated item. But we do have to implement certain functions required by total_ordering:

@property
def score(self) -> int:
   subscore = 0
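   # FAVS is a set of favourite usernames, defined elsewhere in the script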
   if self.user_name in FAVS:
      subscore = 2
   return self.positive + subscore
   
def __eq__(self, other):
   if not isinstance(other, type(self)): return NotImplemented
   return self.score == other.score
   
def __lt__(self, other):
   if not isinstance(other, type(self)): return NotImplemented
   return self.score < other.score

The scoring algorithm is entirely arbitrary. If you want to give a higher weight to favourite users, that’s something you can certainly implement.
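For example, a multiplicative weighting (again, entirely arbitrary) might look like this:

@property
def score(self) -> int:
   # triple-weight favourites instead of adding a flat two-point bonus;
   # the +1 keeps zero-vote favourites ahead of zero-vote strangers
   multiplier = 3 if self.user_name in FAVS else 1
   return (self.positive + 1) * multiplier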

Selecting a pronunciation

Having implemented the comparison functions in the Pronunciation class, we can select the one with the highest score:

# max() handles a single-element list fine; we only need to guard against an empty one
if not pronunciations:
   exit('ERROR - no pronunciations found!')
use_p = max(pronunciations)
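The last step is saving the file to the destination directory. Something like this should work, assuming the requests library is available and that Forvo serves the file to a plain GET (a browser-like User-Agent header may be required):

import os
import requests

response = requests.get(use_p.path, headers={"User-Agent": "Mozilla/5.0"})
response.raise_for_status()
with open(os.path.join(dest, f"{word}.ogg"), "wb") as f:
   f.write(response.content)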

And that’s it! I hope this example is helpful to you. If you have the means, and if the Forvo API improves, using it would be the most ethical way to automate the process of grabbing pronunciations. But until then, here’s an alternative. If you have questions, I’m not on Twitter2 so please just use my contact page.


  1. Of course the latter variable is not entirely reliable, because the oldest pronunciations will have had the longest opportunity to garner votes; but the idea is that we can at least look to our favourites as a way of nudging the choice in the desired direction. ↩︎

  2. I refuse to be part of Elon Musk’s attempt to impose his authoritarian world-view through his acquisition of Twitter. I have no accounts on the platform. ↩︎

A regex to remove Anki's cloze markup

Recently, someone asked a question on r/Anki about changing an existing cloze-type note to a regular note. Part of the solution involves stripping the cloze markup from the existing cloze’d field. A cloze sentence has the form `Play {{c1::stupid}} games.` or `Play {{c1::stupid::pejorative adj}} games.` To handle both of these cases, the following regular expression will work; just substitute capture group `$1` for each match: `{{c\d+::([^:}]+)(?:::[^}]*)?}}` However, the Cloze Anything markup is different. It uses ( and ) instead of curly braces.
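In Python, stripping the markup is then a one-liner with re.sub; a quick sketch to verify the regex against both forms:

import re

CLOZE_RE = re.compile(r"{{c\d+::([^:}]+)(?:::[^}]*)?}}")

def strip_cloze(text: str) -> str:
   """Replace each cloze with its answer text (capture group 1)."""
   return CLOZE_RE.sub(r"\1", text)

assert strip_cloze("Play {{c1::stupid}} games.") == "Play stupid games."
assert strip_cloze("Play {{c1::stupid::pejorative adj}} games.") == "Play stupid games."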

Anki: Insert the most recent image

I make a lot of Anki cards, so I’m on a constant quest to make the process more efficient. Like a lot of language-learners, I use images on my cards where possible in order to make the word or sentence more memorable. Process: When I find an image online that I want to use on the card, I download it to ~/Documents/ankibound. A Hazel rule then grabs the image file and converts it to a .

Altering Anki's revlog table, or how to recover your streak

Anki users are protective of their streak - the number of consecutive days they’ve done their reviews. Right now, for example, my streak is 621 days. So if you miss a day for whatever reason, not only do you have to deal with double the number of reviews, but you also deal with the emotional toll of having lost your streak. You can lose your streak for one of several reasons.

A deep dive into my Anki language learning: Part III (Sentences)

Welcome to Part III of a deep dive into my Anki language learning decks. In Part I I covered the principles that guide how I set up my decks and the overall deck structure. In the lengthy Part II I delved into my vocabulary deck. In this installment, Part III, we’ll cover my sentence decks. Principles: First, sentences (and still larger units of language) should eventually take precedence in language study. What help is it to know the word for “tomato” in your L2, if you don’t know how to slice a tomato, how to eat a tomato, how to grow a tomato plant?

A deep dive into my Anki language learning: Part II (Vocabulary)

In Part I of my series on my Anki language-learning setup, I described the philosophy that informs my Anki setup and touched on the deck overview. Now I’ll tackle the largest and most complex deck(s), my vocabulary decks. First, some FAQs about my vocabulary deck: Do you organize it as L1 → L2 or as L2 → L1, or both? Actually, it’s both and more. Keep reading. Do you have separate subdecks by language level, or source, or some other characteristic?

A deep dive into my Anki language learning: Part I (Overview and philosophy)

Although I’ve been writing about Anki for years, it’s been in bits and pieces. Solving little problems. Creating efficiencies. But I realized that I’ve never taken a top-down approach to my Anki language learning system. So consider this post the launch of that overdue effort. Caveats: A few caveats at the outset: I’m not a professional language tutor or pedagogue of any sort really. Much of what I’ve developed, I’ve done through trial-and-error, some intuition, and some reading on relevant topics.

A tool for scraping definitions of Russian words from Wiktionary

In my perpetual attempt to make my language learning process using Anki more efficient, I’ve written a tool to extract English-language definitions of Russian words from Wiktionary. I wrote about the idea previously in Scraping Russian word definitions from Wiktionary: utility for Anki, but it relied on the WiktionaryParser module, which is good but misses some important edge cases. So I rolled up my sleeves and crafted my own solution. As with WiktionaryParser, the heavy lifting is done by the Beautiful Soup parser.

Getting plaintext into Anki fields on macOS: An update

A few years ago, I wrote about my problems with HTML in Anki fields. If you check out that previous post you’ll get the backstory about my objection. The gist is this: if you copy something from the web, Anki tries to maintain the formatting. Basically it just pastes the HTML off the clipboard. Supposedly, Anki offers to strip the formatting with Shift-paste, but I’ve pointed out to the developer specific examples where this fails.