Lemmatization is the basis for many experiments and analyses in computational philology and corpus linguistics. Although it is considered solved for major modern languages, producing lemmatized text remains challenging for languages for which few or no language resources currently exist. In particular, morphologically rich languages benefit greatly from the sparsity reduction achieved when automated pre-processing, annotation, or distributional analyses (be it collocation graphs or correlation studies) are conducted on lemmata rather than on the original word forms. We describe experiments on, and a tool for, lemmatizing languages with insufficient resources for state-of-the-art linguistic or philological work. We introduce LiOTrea, the Linked Open Text Reader and Annotator: given a text (corpus) and one or more dictionaries or lemma lists, LiOTrea uses the dictionaries to suggest lemmata (or, respectively, links to dictionary entries) and morphological features for words in the corpus. Several linking/lemmatization strategies are implemented. Their suggestions are ranked and can be used either as a pre-processing step for manual annotation or in place of it. Furthermore, novel dictionary entries can be created during annotation. The technological innovation is threefold: (1) the implementation of several lemmatization/linking strategies, (2) native support for the Ontolex-lemon vocabulary (McCrae et al., 2017) and lexical resources from the Linguistic Linked Open Data cloud, and (3) a language-independent technology. Ontolex-lemon is a standard of growing importance to the DH community (cf. Bellandi et al., 2018), and tools for creating and publishing such datasets are available, but, to the best of our knowledge, no tool currently uses this technology for analysing philological text.
The technology is applicable to any language. For illustration, we use real-world studies on languages from the Caucasus, Eurasia and the Near East, conducted in the context of philological research (Armenian and Sumerian, both languages with a long and extensive written history) and of the documentation of endangered languages and cultures (Batsbi, a minority language spoken in Georgia). The case studies thus cover two important strands of Digital Humanities.
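
The idea of dictionary-based lemma suggestion with ranked output, as described above, can be illustrated with a minimal sketch. It is not LiOTrea's implementation: the two strategies shown (exact form lookup and longest shared prefix), the scoring, the toy English data, and all names (`Suggestion`, `FORM_INDEX`, `LEMMAS`, `suggest`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    lemma: str      # proposed lemma (stands in for a link to a dictionary entry)
    strategy: str   # name of the strategy that produced the suggestion
    score: float    # higher is better; the scale is an illustrative assumption


# Toy resources (hypothetical): a form-to-lemma index and a plain lemma list.
FORM_INDEX = {"went": ["go"]}
LEMMAS = {"walk", "wall", "go"}


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest prefix shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def exact_match(form: str, form_index: dict) -> list:
    """Strategy 1: the word form is listed in the dictionary."""
    return [Suggestion(lemma, "exact", 1.0)
            for lemma in form_index.get(form.lower(), [])]


def prefix_match(form: str, lemmas: set, min_prefix: int = 3) -> list:
    """Strategy 2: lemmas sharing a sufficiently long prefix with the form."""
    f = form.lower()
    out = []
    for lemma in lemmas:
        n = common_prefix_len(f, lemma.lower())
        if n >= min_prefix:
            out.append(Suggestion(lemma, "prefix", n / max(len(f), len(lemma))))
    return out


def suggest(form: str, form_index: dict, lemmas: set) -> list:
    """Pool the suggestions of all strategies and rank them by score."""
    candidates = exact_match(form, form_index) + prefix_match(form, lemmas)
    best = {}
    for s in candidates:
        # Keep only the highest-scoring suggestion per lemma.
        if s.lemma not in best or s.score > best[s.lemma].score:
            best[s.lemma] = s
    # Sort by descending score, breaking ties alphabetically.
    return sorted(best.values(), key=lambda s: (-s.score, s.lemma))


if __name__ == "__main__":
    for word in ("walked", "went"):
        print(word, [(s.lemma, s.strategy) for s in suggest(word, FORM_INDEX, LEMMAS)])
```

In a workflow like the one described, such a ranked list could either be shown to an annotator for confirmation or its top entry could be accepted automatically; a form with no suggestion would be a candidate for a new dictionary entry.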