Home
About
ParaLiv is a collection of Livonian nominal paradigms, in phonemic and orthographic notation. They are suited for both computational and manual analysis.
The data is encoded in CSV files, and the metadata follows the Frictionless standards. The dataset conforms to the Paralex standard.
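As a rough illustration of how such CSV files could be used, the sketch below reads a forms table with Python's standard library and groups it into paradigms. The column names (form_id, lexeme, cell, phon_form, orth_form) are assumed here to follow the Paralex forms table, and the inline rows are invented placeholders rather than actual ParaLiv data; a real script would open the downloaded file instead.

```python
import csv
import io
from collections import defaultdict

# Invented placeholder rows (assumption); a real script would open the forms CSV.
SAMPLE = """form_id,lexeme,cell,phon_form,orth_form
f1,lex1,nom.sg,a b a,aba
f2,lex1,gen.sg,a b a n,aban
f3,lex2,nom.sg,o d o,odo
"""


def load_paradigms(handle):
    """Group phonemic forms by lexeme, then by cell (each cell holds a list)."""
    paradigms = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(handle):
        paradigms[row["lexeme"]][row["cell"]].append(row["phon_form"])
    return paradigms


paradigms = load_paradigms(io.StringIO(SAMPLE))
for lexeme, cells in paradigms.items():
    print(lexeme, dict(cells))
```

Keeping a list per cell, rather than a single string, leaves room for cells that hold more than one form.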
Please cite as:
- Jules Bouton, Tuuli Tuisk, and Valts Ernštreits. ParaLiv: Livonian Paradigms in Phonemic Notation. 2024. doi:10.5281/zenodo.11391420.
- Jules Bouton. Towards standardized inflected lexicons for the Finnic languages. In Proceedings of the Ninth International Workshop on Computational Linguistics for Uralic Languages (IWCLUL 2024). Helsinki, Finland, 2024. Association for Computational Linguistics.
The data can be downloaded from Zenodo or from the GitLab repository.
We thank the Livonian Institute for providing us access to their morphological data and plenty of support.
The creation of the data was supported by the State Research Programme (Latvia) "Latvian Studies for the Development of a Latvian and European Society", project Multifunctional dictionary of Livonian (VPP-LETONIKA-2021/2-0002).
How this lexicon was prepared
We selected all the verbs and nouns from the Livonian Institute's data, including inflectional information. We used custom epitran rules to convert these into phonemic notation, and performed extensive manual verification. The epitran rules take the annotated orthographic forms as input, where:
- | (a bar) indicates boundaries for composita
- ¦ (a broken bar) indicates boundaries for foreign composita
- ' (a straight apostrophe) indicates a broken tone
Any other diacritics present in the original dataset, such as the "~" separating overabundant forms, were removed.
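The annotation marks described above can be handled with a few string operations. The following is a minimal sketch, not the project's preprocessing code: the helper names are hypothetical, the example strings are invented rather than real Livonian words, and splitting on "~" is only one plausible way of treating that separator.

```python
# Hypothetical helpers for illustration; not the project's preprocessing code.
BOUNDARY_MARKS = {
    "|": "compound boundary",
    "¦": "foreign compound boundary",
}
BROKEN_TONE = "'"


def split_variants(raw):
    """Split a string of '~'-separated variants into individual forms."""
    return [part.strip() for part in raw.split("~") if part.strip()]


def has_broken_tone(annotated):
    """True if the annotated form carries the broken-tone apostrophe."""
    return BROKEN_TONE in annotated


def plain_form(annotated):
    """Drop boundary and tone marks to recover a bare orthographic form."""
    for mark in (*BOUNDARY_MARKS, BROKEN_TONE):
        annotated = annotated.replace(mark, "")
    return annotated


# Invented strings, used only to exercise the helpers.
variants = split_variants("ab|cd'e ~ xy¦z")
print(variants)
print([plain_form(v) for v in variants])
print([has_broken_tone(v) for v in variants])
```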
Finally, we enriched the dataset with annotations for overabundance, defectivity, cells and features.
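To make the terms concrete: a cell is overabundant when a lexeme has more than one form in it, and a lexeme is defective when it has no form for some cell. The sketch below flags candidates for both from a list of (lexeme, cell, form) triples. The triples and cell names are invented, and the dataset's own annotations should be preferred over this kind of counting, since a missing row may simply be a gap in the data.

```python
from collections import Counter

# Invented (lexeme, cell, form) triples; not actual ParaLiv data.
forms = [
    ("lex1", "nom.sg", "aba"),
    ("lex1", "gen.sg", "aban"),
    ("lex1", "gen.sg", "abun"),  # second form in the same cell
    ("lex2", "nom.sg", "odo"),
]
cells = ["nom.sg", "gen.sg"]  # assumed full list of paradigm cells

# Number of forms attested for each (lexeme, cell) pair.
counts = Counter((lexeme, cell) for lexeme, cell, _ in forms)
lexemes = sorted({lexeme for lexeme, _, _ in forms})

# More than one form in a cell: overabundance candidate.
overabundant = sorted(pair for pair, n in counts.items() if n > 1)
# No form in a cell: defectivity candidate (or a gap in the data).
defective = [(lx, c) for lx in lexemes for c in cells if counts[(lx, c)] == 0]

print("overabundant:", overabundant)
print("defective candidates:", defective)
```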
Summary
flowchart LR
A[(Livonian Institute)]:::start ==>|JSON| B(Orthographic paradigms)
B ==> X
E[["🖋 G2P rules"]]:::add -.-> X{{Epitran}}:::start
X ==> C(Phonemic paradigms)
C ==> D[(Paralex dataset)]:::aim
Z[(Estonian dialects)] -...->|Token frequencies| F
F[["🖋 Rich annotations"]]:::add --> D
classDef start stroke:#f00
classDef aim stroke:#090
classDef add stroke:#ffa10a
How to re-generate the data
To ensure replicability, we provide the possibility to rebuild the package from the sources by running the following commands:
$ git clone https://gitlab.com/finnic-morpho/paraliv.git
$ cd paraliv
$ make all
Please note that some tables (such as cells, features and tags) need to be created manually and are required to build the other tables. The different steps of the process are detailed below.
Getting the sources
You should first clone the git repository:
$ git clone https://gitlab.com/finnic-morpho/paraliv.git
$ cd paraliv
Preparing the Python environment:
$ make venv
Downloading data from the Livonian Institute API:
$ make data
Extracting the lexicon from JSON to CSV:
$ make parse
Transcriptions
Evaluating the transcription on dev forms:
$ make evaluate
Phonological transcription:
$ make transcription
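For readers unfamiliar with rule-based grapheme-to-phoneme conversion, the sketch below shows the general idea in plain Python. It does not use epitran and does not reproduce the project's rules: the three mappings and the test string are chosen purely for illustration, and the function name is hypothetical.

```python
import unicodedata

# Toy rules for illustration only; NOT the project's actual rule set.
TOY_RULES = [("ā", "aː"), ("ī", "iː"), ("š", "ʃ")]


def transcribe(orth, rules=TOY_RULES):
    """Rewrite graphemes left to right, trying longer graphemes first."""
    # Normalise so precomposed and combining diacritics compare equal.
    text = unicodedata.normalize("NFC", orth)
    ordered = sorted(
        ((unicodedata.normalize("NFC", g), p) for g, p in rules),
        key=lambda rule: len(rule[0]),
        reverse=True,
    )
    output = []
    i = 0
    while i < len(text):
        for grapheme, phoneme in ordered:
            if text.startswith(grapheme, i):
                output.append(phoneme)
                i += len(grapheme)
                break
        else:
            # No rule matched: copy the character unchanged.
            output.append(text[i])
            i += 1
    return "".join(output)


print(transcribe("šāl"))  # invented test string
```

A real rule set would also need context-sensitive rules and would have to interpret the boundary and tone marks, which this sketch leaves untouched.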
Frequencies
Frequencies can be manually extracted from the Estonian dialects archive. If you want new cell frequencies, follow this procedure:
- Make a regex query on POS: ^S$
- Select Livonian among the languages.
Extract the results as .csv and put the resulting file, named frequencies-S.csv, in the source folder. Then run:
$ make frequencies
Packaging & Validation
We produce Frictionless metadata:
$ make metadata
Check conformity with the Paralex standard:
$ make validate
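The sketch below illustrates the kind of consistency that such metadata makes checkable: it compares the fields declared in a data package descriptor with the header of a CSV table. This is not what make validate runs. The descriptor is a hypothetical, heavily trimmed example, and the function name is an assumption.

```python
import csv
import io
import json

# Hypothetical, heavily trimmed descriptor; the real metadata file is richer.
DESCRIPTOR = json.loads("""
{"resources": [{"name": "forms", "path": "forms.csv",
  "schema": {"fields": [{"name": "form_id"}, {"name": "lexeme"},
                        {"name": "cell"}, {"name": "phon_form"}]}}]}
""")


def missing_columns(resource, handle):
    """Return declared field names that are absent from the CSV header."""
    header = next(csv.reader(handle), [])
    declared = [field["name"] for field in resource["schema"]["fields"]]
    return [name for name in declared if name not in header]


# Invented table that lacks one of the declared columns.
table = io.StringIO("form_id,lexeme,cell\nf1,lex1,nom.sg\n")
missing = missing_columns(DESCRIPTOR["resources"][0], table)
print(missing)
```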
It is possible to export a random sample (with fixed seed), for manual verifications:
$ make sample
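A fixed seed makes the sample reproducible, so that two people checking the data look at the same rows. The following is a minimal sketch of that idea; the function name, seed value and sample size are assumptions and need not match what make sample uses.

```python
import random


def sample_rows(rows, k, seed=0):
    """Draw up to k rows; the same seed always yields the same sample."""
    rng = random.Random(seed)  # private generator, global state untouched
    return rng.sample(rows, min(k, len(rows)))


# Placeholder row identifiers standing in for rows of the forms table.
rows = [f"form_{i}" for i in range(100)]
first = sample_rows(rows, 5, seed=42)
second = sample_rows(rows, 5, seed=42)
print(first == second)
```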
References
This dataset is derived from the Livonian Institute's morphological data collection. See:
- Valts Ernštreits, Tiit-Rein Viitso, and Milda Kurpniece. Livonian morphology database. 2024.
- Valts Ernštreits. Electronical resources for Livonian. In Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages, 184–191. Tartu, Estonia, 2019. Association for Computational Linguistics. doi:10.18653/v1/W19-0314.
Source for frequencies is:
- Liina Lindström, Triin Todesk, and Maarja-Liisa Pilvik. Corpus of Estonian Dialects. 2022. doi:10.23673/re-365.
Other sources used for the transcriptions are:
- Tuuli Tuisk. Main features of the Livonian sound system and pronunciation. Eesti ja soome-ugri keeleteaduse ajakiri. Journal of Estonian and Finno-Ugric Linguistics, 7(1):121–143, 2016. doi:10.12697/jeful.2016.7.1.06.
- Valts Ernštreits. Livonian Orthography. Linguistica Uralica, 43(1):11–22, 2007. doi:10.3176/lu.2007.1.02.
- Tiit-Rein Viitso. Livonian Gradation: Types and Genesis. Linguistica Uralica, 43(1):45, 2007. doi:10.3176/lu.2007.1.05.
- Marilyn May Vihman. Livonian phonology, with an appendix on Stød in Danish and Livonian. PhD thesis, University of California, Berkeley, CA, 1971.
- Lauri Posti. Grundzüge der livischen Lautgeschichte. PhD thesis, University of Helsinki, Helsinki, 1942.
- Lauri Kettunen. Livisches Wörterbuch mit grammatischer Einleitung. Number 5 in Lexica Societatis Fenno-Ugricae. Suomalais-Ugrilainen Seura, Helsinki, 1938.