Morphologically Annotated Amharic Text Corpora

Tilahun Yeshambel; Josiane Mothe; Yaregal Assabie

doi:10.1145/3404835.3463237

Conference paper, Year: 2021

Tilahun Yeshambel

Role: Author
PersonId: 1078785

Addis Ababa University

Josiane Mothe

Role: Author
PersonId: 735149
IdHAL: josianemothe
ORCID: 0000-0001-9273-2193
IdRef: 087097222

Systèmes d’Informations Généralisées

Yaregal Assabie

Role: Author
PersonId: 1078786

Addis Ababa University

Abstract

In information retrieval (IR), documents that match the query are retrieved. Search engines usually conflate word variants into a common stem when indexing documents, because queries and documents do not need to use exactly the same word variant for the documents to be relevant. Stemmers are known to be effective for IR in many languages. However, there are still languages for which stemmers or morphological analyzers are missing; this is the case for Amharic, the working language of Ethiopia. Morphological analysis is key to deriving the stems, roots (primary lexical units), and grammatical markers of words, such as person, tense, and negation markers. This paper presents morphologically annotated Amharic lexicons as well as stem-based and root-based morphologically annotated corpora, which the research community could use as benchmark collections to evaluate either morphological analyzers or information retrieval for Amharic. Such resources are expected to foster research in Amharic IR.
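
As a rough illustration of the conflation step described in the abstract, the sketch below indexes documents under a crude stem so that a query using a different word variant still matches. The suffix list, the `toy_stem` function, and the English example documents are assumptions made for illustration only. This is not the paper's method, and Amharic morphology calls for the stem-level and root-level analysis that the annotated corpora are meant to support.

```python
from collections import defaultdict

# Hypothetical suffix list for a toy English stemmer (illustration only).
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids containing a variant of it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents sharing at least one stem with the query."""
    hits = set()
    for token in query.split():
        # .get avoids creating empty entries in the defaultdict.
        hits |= index.get(toy_stem(token), set())
    return hits


docs = {1: "farmers plant crops", 2: "the planted field", 3: "market prices"}
index = build_index(docs)
print(sorted(search(index, "planting")))  # → [1, 2]
```

Documents 1 and 2 are both retrieved for the query "planting" even though neither contains that exact form, because all three variants are reduced to the same stem. A suffix stripper this naive also shows why dedicated morphological resources matter: the query "price" would not match "prices" here, since the two forms reduce to different strings.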

Keywords

• Information systems ~ Information Retrieval
• Information systems ~ Document representation
• Information systems ~ Dictionaries

Information retrieval; Corpus; Morphological annotation; Underresourced language; Amharic

Domains

Computation and Language [cs.CL]; Artificial Intelligence [cs.AI]

Main file

Morphologically.pdf (2.11 MB)

Origin: Files produced by the author(s)

Contributor: Romain Meunier

https://univ-tlse2.hal.science/hal-03362977

Submitted on: Saturday, October 2, 2021, 16:29:25

Last modified on: Monday, November 20, 2023, 11:44:23

Long-term archiving on: Monday, January 3, 2022, 18:48:52

Dates and versions

hal-03362977, version 1 (2021-10-02)

Identifiers

HAL Id: hal-03362977, version 1
DOI: 10.1145/3404835.3463237

Cite

Tilahun Yeshambel, Josiane Mothe, Yaregal Assabie. Morphologically Annotated Amharic Text Corpora. SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Jul 2021, Virtual Event, Canada. pp. 2349-2355, ⟨10.1145/3404835.3463237⟩. ⟨hal-03362977⟩

Collections

UNIV-TLSE2 CNRS UT1-CAPITOLE IRIT IRIT-SIG IRIT-GD IRIT-UT2J TOULOUSE-INP UNIV-UT3 UT3-TOULOUSEINP

79 views

531 downloads
