linguistics

Beginning to experiement with Stanza for natural language processing

After installing Stanza as dependency of UDAR which I recently described, I decided to play around with what is can do. Installation The installation is straightforward and is documented on the Stanza getting started page. First, sudo pip3 install stanza Then install a model. For this example, I installed the Russian model: #!/usr/local/bin/python3 import stanza stanza.download('ru') Usage Part-of-speech (POS) and morphological analysis Here’s a quick example of POS analysis for Russian.

Automated marking of Russian syllabic stress

One of the challenges that Russian learners face is the placement of syllabic stress, an essential determinate of pronunciation. Although most pedagogical texts for students have marks indicating stress, practically no tests intended for native speakers do. The placement of stress is inferred from memory and context. I was delighted to discover Dr. Robert Reynolds’ work on natural language processing of Russian text to mark stress based on grammatical analysis of the text.

Language word frequencies

Since one of the cornerstones of my approach to learning the Russian language has been to track how many words I’ve learned and their frequencies, I was intrigued by reading the following statistics today: The 15 most frequent words in the language account for 25% of all the words in typical texts. The first 100 words account for 60% of the words appearing in texts. 97% of the words one encounters in a ordinary text will be among the first 4000 most frequent words.