Hugo static site upload woes and a way forward

As much as I love the static website concept in general and Hugo in particular, there is one part of the Hugo/S3 infrastructure that I despise: the lack of incremental uploads. No matter whether I compile my sites with the --noTimes=false flag, hugo seems to touch every single file, every single time. Therefore, whatever sync utility I choose sees every file as new and in need of upload. For this blog, that takes about 10 minutes.

Since I only sporadically see people complaining about this online, either I haven't figured out the magical incantation to stop hugo from touching unchanged files, or I just have really slow upload speeds (I do), or something else is going on. In any case, I've decided to take matters into my own hands and force the upload process to respect the MD5 hash of each file. We store these hashes in a database and then walk the /public directory comparing hashes. Only if the hashes differ, or if the file is new, do we add it to the upload list.

All we need is a sqlite database stored in the project root and a simple script. First the database has the following structure:

CREATE TABLE "checksums" (
   "id" INTEGER PRIMARY KEY AUTOINCREMENT,
   "path" TEXT,
   "fn" TEXT,
   "md5" TEXT
);
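If you're starting from scratch, here's a quick sketch for creating that database, assuming it lives at checksums.db in the project root:

```python
import sqlite3

# one-time setup: create the checksums table
# (path assumed to be checksums.db in the project root)
conn = sqlite3.connect("checksums.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS "checksums" (
       "id" INTEGER PRIMARY KEY AUTOINCREMENT,
       "path" TEXT,
       "fn" TEXT,
       "md5" TEXT
    )
""")
conn.commit()
conn.close()
```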

And the script:

#!/usr/bin/env python3

import os
import hashlib
import sqlite3
import subprocess
import re
from concurrent.futures import ThreadPoolExecutor


SRC_DIR = "/path/to/public/dir"
BUCKET_URL = "s3://your.bucket.com"
DB_PATH = '/path/to/checksums.db'
   
def aws_copy(source: str, dest: str):
   # shell out to the AWS CLI; check=True surfaces failed uploads
   cmd_list = ["aws", "s3", "cp", source, dest]
   subprocess.run(cmd_list, check=True)
   
def bucket_path_from_local(source_path: str) -> str:
   # map .../public/some/page.html -> s3://your.bucket.com/some/page.html
   m = re.search(r'.*/public/(.*)', source_path)
   bucket_path = f"{BUCKET_URL}/{m[1]}"
   return bucket_path

def change_md5(cursor, changed: tuple):
   # update the stored hash for a file whose contents have changed;
   # parameterized queries avoid breakage on paths containing quotes
   cursor.execute("UPDATE checksums SET md5 = ? WHERE path = ? AND fn = ?",
                  (changed[2], changed[0], changed[1]))

connection = sqlite3.connect(DB_PATH)
cursor = connection.cursor()

changed_files = []
for root, subdirs, files in os.walk(SRC_DIR):
   for file in files:
      with open(os.path.join(root, file), 'rb') as _file:
         file_md5 = hashlib.md5(_file.read()).hexdigest()
      cursor.execute("SELECT md5 FROM checksums WHERE path = ? AND fn = ? LIMIT 1",
                     (root, file))
      row = cursor.fetchone()
      if row is None:
         # no stored hash: this is a new file
         cursor.execute("INSERT INTO checksums (path, fn, md5) VALUES (?, ?, ?)",
                        (root, file, file_md5))
         connection.commit()
         # since this is a new file, we need to
         # add it to the upload list
         changed_files.append((root, file, file_md5))
      elif row[0] != file_md5:
         # stored hash differs from the file on disk
         print(f'changed db md5 = {row[0]} vs {file_md5}')
         changed_files.append((root, file, file_md5))

# process changed files
with ThreadPoolExecutor(max_workers=16) as executor:
   for changed in changed_files:
      change_md5(cursor, changed)
      local_path = os.path.join(changed[0], changed[1])
      bucket_path = bucket_path_from_local(local_path)
      executor.submit(aws_copy, local_path, bucket_path)
connection.commit()
connection.close()

For simple posts only two files might change: the root index page and the post page itself. But if new tags or categories are required, or pagination gets shuffled around, a larger number of files may be affected. That's why I've divided the work across a thread pool. In informal testing, that strategy seems to provide about a 10x improvement over serial execution on the main thread. The greatest efficiency gain comes from not re-uploading unchanged images.
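To see why the thread pool helps, here's a toy benchmark (not my actual upload code; time.sleep stands in for an I/O-bound aws s3 cp call):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_upload(path):
    # stand-in for an I/O-bound `aws s3 cp`; sleeping threads overlap
    time.sleep(0.05)

files = [f"file{i}" for i in range(32)]

# serial: one upload at a time on the main thread
start = time.perf_counter()
for f in files:
    fake_upload(f)
serial = time.perf_counter() - start

# threaded: up to 16 uploads in flight at once
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=16) as executor:
    list(executor.map(fake_upload, files))
threaded = time.perf_counter() - start

print(f"serial: {serial:.2f}s, threaded: {threaded:.2f}s")
```

Because the work is network-bound rather than CPU-bound, Python's GIL isn't a bottleneck here, and threads give close to linear speedup up to the worker count.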

Just build the site normally, run the script to upload, and you're good to go. Enjoy! If you have questions, you can reach me through my contact page.

Prerequisites

The AWS command-line interface tool is required. Install whatever release is compatible with your system.

Alternatives

I’ve explored many of these options for deploying Hugo static sites to AWS S3. Each has some limitations which I won’t go into in detail. But I will note that I had problems with each of them “over-syncing” files that had not actually changed.

Interlinear glossing dealing with punctuation

In a previous post I presented a CSS-only solution to interlinear glossing. It's a solution that may be preferable to others such as leipzig.js or interlinear.js because both of the latter assume a different annotation purpose than what I envision for my app. But whereas those libraries deal with punctuation gracefully, my CSS-only approach does not. So we end up with something like this:

where the PUNCT nodes end up standing alone. These extra punctuation nodes add nothing to the understanding of the text and look ragged.

What I would really like is for the punctuation marks to live with the previous element and for the markup to go away. A little jQuery helps here. The basic strategy is this:

  1. Find the p.pos nodes and select the ones containing PUNCT.
  2. Loop over the p.pos punctuation nodes and find their parent node, which we’re going to delete.
  3. Find the previous sibling of the punctuation div.
  4. Append the punctuation mark onto the p.ru of the previous sibling div.
  5. Remove the punctuation div from the DOM.

The result looks like this:

The visual appearance is much better now, I think.

The CSS and HTML example code are as presented previously. Here’s the jQuery code we use to move around the punctuation.

$(function() {
    /* document ready code here */
    $('p.pos').filter(function() {
        return $(this).text().trim().toLowerCase() === 'punct';
    }).each(function(index) {
        /* these are each punctuation <p> */
        let punctDiv = $(this).parent();
        // get the exact punctuation mark in use
        let punctMark = punctDiv.children().filter('.ru').first().text();
        /*  find the previous div because
        	that's where we need to add back the
            punctuation mark
        */
        let punctPrevDiv = punctDiv.prev();
        // the p.ru child
        var punctRuP = punctPrevDiv.children().filter('.ru').first();
        // glom the punctuation mark onto previous p.ru
        punctRuP.append(punctMark);
        // remove the PUNCT div from the DOM
        punctDiv.remove();
    });
});

There is a JSFiddle to play with if this is helpful. There’s still much more to do in my project, integrating various pieces, but it’s beginning to take shape.

Three-line (though non-standard) interlinear glossing

Still thinking about interlinear glossing for my language learning project. The leipzig.js library is great, but my use case isn't really what the author had in mind. I really just need to display a unit consisting of the word as it appears in the text, the lemma for that word form, and (possibly) the part of speech. For academic linguistics purposes, what I have in mind is completely non-standard. The other issue with leipzig.
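As a rough illustration of the three-line unit I have in mind (hypothetical data; the real app would pull the lemma and part of speech from a tagger), plain-text columns are enough to show the idea:

```python
# each token is (surface form, lemma, part of speech) - hypothetical data
tokens = [("собаки", "собака", "NOUN"), ("бегут", "бежать", "VERB")]

def gloss(tokens):
    # pad each column so the three rows line up token by token
    widths = [max(len(field) for field in tok) for tok in tokens]
    rows = []
    for i in range(3):
        rows.append("  ".join(tok[i].ljust(w) for tok, w in zip(tokens, widths)))
    return "\n".join(rows)

print(gloss(tokens))
```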

Splitting text into sentences: Russian edition

Splitting text into sentences is one of those tasks that looks simple but on closer inspection is more difficult than you think. A common approach is to use regular expressions to divide up the text on punctuation marks. But without adding layers of complexity, that method fails on some sentences. This is a method using spaCy.
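A minimal example of the regex approach falling over (hypothetical sentence; the abbreviation г. for год is a classic trap in Russian):

```python
import re

text = "Я родился в 1990 г. в Москве. Сейчас живу в Петербурге."

# naive split: break on whitespace that follows sentence-final punctuation
naive = re.split(r'(?<=[.!?])\s+', text)

# the abbreviation "г." fools the regex, yielding three "sentences"
# where a human reader sees only two
print(naive)
```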

My favourite Cyrillic font

I’ve tried a lot of fonts for Cyrillic. My favourite is Georgia. As a non-native Russian speaker, there’s something about serif fonts, either on-screen or in print, that makes the text so much more legible.

The cancellation of Russian music

Free speech in Russia has never been particularly favoured. The Romanov dynasty remained in power long past its expiration date by suppressing waves of free thought, from the ideals of the Enlightenment to the anti-capitalist ideals of Marx and Engels, at least until the 1917 Revolution. And even then, the Bolsheviks continued to suppress dissent for the entire seventy-something-year history of the Soviet Union. Perestroika and the collapse of the Soviet Union promised change.

Bash variable scope and pipelines

I alluded to this nuance involving variable scope in my post on automating pdf processing, but I wanted to expand on it a bit. Consider this little snippet:

i=0
printf "foo:bar:baz:quux" | grep -oE '[^:]+' | while read -r line ; do
   printf "Inner scope: %d - %s\n" $i $line
   ((i++))
   [ $i -eq 3 ] && break
done
printf "====\nOuter scope\ni = %d\n" $i

If you run this - not in interactive mode in the shell, but as a script - what will i be in the outer scope?