License: CC BY-SA 4.0

About

ParaLiv is a collection of Livonian nominal paradigms, in phonemic and orthographic notation. They are suited for both computational and manual analysis.

The data is encoded in CSV files, and the metadata follows the Frictionless standards. The dataset conforms to the Paralex standard.
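
As a rough illustration of working with such CSV tables, the sketch below groups forms into paradigms using only the Python standard library. The column names (form_id, lexeme, cell, phon_form) are an assumption based on the Paralex forms table, and the rows are made-up placeholders, not ParaLiv data.

```python
import csv
import io
from collections import defaultdict

# Made-up placeholder rows, NOT real ParaLiv data. The column names are
# assumed to follow the Paralex forms table; check the actual files.
SAMPLE_FORMS_CSV = """form_id,lexeme,cell,phon_form
f1,lex_a,nom.sg,aaa
f2,lex_a,gen.sg,aab
f3,lex_b,nom.sg,bbb
"""


def load_paradigms(handle):
    """Group forms by lexeme, then by cell: {lexeme: {cell: [phon_form, ...]}}."""
    paradigms = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(handle):
        paradigms[row["lexeme"]][row["cell"]].append(row["phon_form"])
    # Convert nested defaultdicts to plain dicts for predictable lookups.
    return {lexeme: dict(cells) for lexeme, cells in paradigms.items()}


paradigms = load_paradigms(io.StringIO(SAMPLE_FORMS_CSV))
print(paradigms["lex_a"]["gen.sg"])
```

To read a real table, pass an open file handle instead of the in-memory sample. Each cell maps to a list because a cell may hold more than one form.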

Please cite as:

  • Jules Bouton, Tuuli Tuisk, and Valts Ernštreits. ParaLiv: Livonian Paradigms in Phonemic Notation. 2024. doi:10.5281/zenodo.11391420.
  • Jules Bouton. Towards standardized inflected lexicons for the Finnic languages. In Proceedings of the Ninth International Workshop on Computational Linguistics for Uralic Languages (IWCLUL 2024). Helsinki, Finland, 2024. Association for Computational Linguistics.

The data can be downloaded from Zenodo or from the GitLab repository.

We thank the Livonian Institute for providing us access to their morphological data and plenty of support.

The creation of the data was supported by the State Research Programme (Latvia) "Latvian Studies for the Development of a Latvian and European Society", project "Multifunctional dictionary of Livonian" (VPP-LETONIKA-2021/2-0002).

How this lexicon was prepared

We selected all the verbs and nouns from the Livonian Institute's data, including inflectional information. We used custom Epitran rules to convert these into phonemic notation, and we manually verified the results extensively. The input to the Epitran rules is the set of annotated orthographic forms, where:

  • | a bar indicates boundaries for composita
  • ¦ a broken bar indicates boundaries for foreign composita
  • ' a straight apostrophe indicates a broken tone

Any other marks present in the original dataset, such as the "~" separating overabundant forms, were removed.
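
The marker conventions above can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical and are not taken from the ParaLiv sources, the example strings are made up rather than real Livonian words, and recovering a plain spelling by deleting the markers is an assumption.

```python
# Markers kept as input to the transcription rules (from the list above).
KEPT_MARKERS = {
    "|": "compound boundary",
    "¦": "foreign compound boundary",
    "'": "broken tone",
}


def split_variants(annotated, separator="~"):
    """Split a cell holding several overabundant forms into separate forms."""
    return [part.strip() for part in annotated.split(separator) if part.strip()]


def markers_in(form):
    """List the kept markers that occur in one annotated form, in order."""
    return [KEPT_MARKERS[ch] for ch in form if ch in KEPT_MARKERS]


def plain_orthography(form):
    """Drop the kept markers (assumed to recover an unannotated spelling)."""
    return "".join(ch for ch in form if ch not in KEPT_MARKERS)


# Made-up strings for illustration only, not real Livonian words.
variants = split_variants("ab|cd ~ a'b")
print(variants)
print([markers_in(v) for v in variants])
print([plain_orthography(v) for v in variants])
```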

Finally, we enriched the dataset with annotations for overabundance, defectivity, cells and features.
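
To make the notions of overabundance and defectivity concrete, here is a minimal sketch that flags both in a list of forms. It assumes that defective cells carry a sentinel value ("#DEF#" is used here as an assumption, not a confirmed detail of this dataset) and that overabundance means several forms for one lexeme and cell. The rows are invented placeholders, and this is not the annotation procedure used by the authors.

```python
from collections import Counter

# Assumption: defective cells carry a sentinel value instead of a form.
DEFECTIVE = "#DEF#"

# Made-up (lexeme, cell, phon_form) rows, not real ParaLiv data.
rows = [
    ("lex_a", "nom.sg", "aaa"),
    ("lex_a", "gen.sg", "aab"),
    ("lex_a", "gen.sg", "aac"),
    ("lex_b", "nom.sg", DEFECTIVE),
]


def overabundant_cells(rows):
    """Return (lexeme, cell) pairs that are filled by more than one form."""
    counts = Counter(
        (lexeme, cell) for lexeme, cell, form in rows if form != DEFECTIVE
    )
    return sorted(key for key, n in counts.items() if n > 1)


def defective_cells(rows):
    """Return (lexeme, cell) pairs marked as having no form."""
    return sorted((lexeme, cell) for lexeme, cell, form in rows if form == DEFECTIVE)


print(overabundant_cells(rows))
print(defective_cells(rows))
```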

Summary

flowchart LR
    A[(Livonian
        Institute)]:::start ==>|JSON| B(Orthographic
                                    paradigms)
        B ==> X
        E[["🖋 G2P rules"]]:::add -.-> X{{Epitran}}:::start
        X ==> C(Phonemic
                paradigms)
        C ==> D[(Paralex
                dataset)]:::aim
        Z[(Estonian
        dialects)] -...->|Token frequencies| F
        F[["🖋 Rich annotations"]]:::add --> D

classDef start stroke:#f00
classDef aim stroke:#090
classDef add stroke:#ffa10a

How to re-generate the data

To ensure replicability, we provide the possibility to rebuild the package from the sources by running the following commands:

$ git clone https://gitlab.com/finnic-morpho/paraliv.git
$ cd paraliv
$ make all

Please note that some tables (such as cells, features and tags) need to be created manually, because they are required to build the other tables. The different steps of the process are detailed below.

Getting the sources

You should first clone the git repository:

$ git clone https://gitlab.com/finnic-morpho/paraliv.git
$ cd paraliv

Preparing the Python environment:

$ make venv

Downloading data from the Livonian Institute API:

$ make data

Extracting the lexicon from JSON to CSV:

$ make parse

Transcriptions

Evaluating the transcription on dev forms:

$ make evaluate
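
What the evaluate target computes is not documented here. As a rough idea of what such a check can look like, the sketch below compares predicted transcriptions against gold ones and reports exact-match accuracy. The function name and the placeholder strings are hypothetical.

```python
def evaluate(pairs):
    """Return (accuracy, mismatches) for (predicted, gold) transcription pairs."""
    mismatches = [(pred, gold) for pred, gold in pairs if pred != gold]
    if not pairs:
        return 0.0, mismatches
    accuracy = (len(pairs) - len(mismatches)) / len(pairs)
    return accuracy, mismatches


# Made-up placeholder strings, not real transcriptions.
dev_pairs = [("aaa", "aaa"), ("aab", "aab"), ("abc", "abd"), ("bbb", "bbb")]
accuracy, mismatches = evaluate(dev_pairs)
print(f"accuracy: {accuracy:.2f}")
print("mismatches:", mismatches)
```

Listing the mismatches alongside the score makes it easier to see which rule needs adjusting.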

Phonological transcription:

$ make transcription

Frequencies

Frequencies can be manually extracted from the Estonian dialects archive. If you want new cell frequencies, follow this procedure:

  • Make a regex query on POS: ^S$
  • Select Livonian among the languages.

Export the results as .csv and save the file as frequencies-S.csv in the source folder. Then run:

$ make frequencies
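
As an illustration of turning a token-level export into cell frequencies, the sketch below counts rows per morphological tag. The export layout is an assumption: the column names "token" and "tag" are hypothetical, the rows are invented, and the real frequencies-S.csv may be structured differently.

```python
import csv
import io
from collections import Counter

# Hypothetical export layout: one token per row, with its morphological tag
# in a column named "tag". The real export may use other column names.
SAMPLE_EXPORT = """token,tag
aaa,nom.sg
aab,gen.sg
bbb,nom.sg
ccc,nom.sg
"""


def cell_frequencies(handle, tag_column="tag"):
    """Count how many tokens carry each morphological tag."""
    return Counter(row[tag_column] for row in csv.DictReader(handle))


freqs = cell_frequencies(io.StringIO(SAMPLE_EXPORT))
print(freqs.most_common())
```
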

Packaging & Validation

We produce Frictionless metadata:

$ make metadata
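
For readers unfamiliar with Frictionless metadata, the sketch below builds a minimal Data Package descriptor by hand. It only shows the general shape (a name plus resources with a path and a schema). The package name, file naming scheme and field list are hypothetical, and the descriptor produced by the metadata target is likely richer.

```python
import json


def make_descriptor(name, tables):
    """Build a minimal Data Package descriptor for a set of CSV tables."""
    return {
        "name": name,
        "resources": [
            {
                "name": table,
                # Hypothetical file naming scheme, for illustration only.
                "path": f"{name}_{table}.csv",
                "schema": {
                    "fields": [{"name": col, "type": "string"} for col in columns]
                },
            }
            for table, columns in tables.items()
        ],
    }


descriptor = make_descriptor(
    "example", {"forms": ["form_id", "lexeme", "cell", "phon_form"]}
)
print(json.dumps(descriptor, indent=2))
```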

Check conformity with the Paralex standard:

$ make validate

It is possible to export a random sample (with a fixed seed) for manual verification:

$ make sample
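
Fixing the seed means that the same rows are drawn on every run, so that reviewers check the same forms. A minimal sketch of the idea, with a hypothetical function name and placeholder rows (the seed and sample size used by the sample target are not stated here):

```python
import random


def reproducible_sample(rows, k, seed=0):
    """Draw up to k rows using a private, seeded random number generator."""
    rng = random.Random(seed)  # does not touch the global random state
    return rng.sample(rows, min(k, len(rows)))


# Placeholder identifiers standing in for rows of the forms table.
rows = [f"form_{i}" for i in range(100)]
first = reproducible_sample(rows, 5, seed=42)
second = reproducible_sample(rows, 5, seed=42)
print(first == second)
```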

References

This dataset is derived from the Livonian Institute's morphological data collection. See:

  • Valts Ernštreits, Tiit-Rein Viitso, and Milda Kurpniece. Livonian morphology database. 2024.
  • Valts Ernštreits. Electronical resources for Livonian. In Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages, 184–191. Tartu, Estonia, 2019. Association for Computational Linguistics. doi:10.18653/v1/W19-0314.

Source for frequencies is:

  • Liina Lindström, Triin Todesk, and Maarja-Liisa Pilvik. Corpus of Estonian Dialects. 2022. doi:10.23673/re-365.

Other sources used for the transcriptions are:

  • Tuuli Tuisk. Main features of the Livonian sound system and pronunciation. Eesti ja soome-ugri keeleteaduse ajakiri. Journal of Estonian and Finno-Ugric Linguistics, 7(1):121–143, 2016. doi:10.12697/jeful.2016.7.1.06.
  • Valts Ernštreits. Livonian Orthography. Linguistica Uralica, 43(1):11–22, 2007. doi:10.3176/lu.2007.1.02.
  • Tiit-Rein Viitso. Livonian Gradation: Types and Genesis. Linguistica Uralica, 43(1):45, 2007. doi:10.3176/lu.2007.1.05.
  • Marilyn May Vihman. Livonian phonology, with an appendix on Stød in Danish and Livonian. PhD thesis, University of California, Berkeley, CA, 1971.
  • Lauri Posti. Grundzüge der livischen Lautgeschichte. PhD thesis, University of Helsinki, Helsinki, 1942.
  • Lauri Kettunen. Livisches Wörterbuch mit grammatischer Einleitung. Number 5 in Lexica Societatis Fenno-Ugricae. Suomalais-Ugrilainen Seura, Helsinki, 1938.