Jochre, océrisation par apprentissage automatique : étude comparée sur le yiddish et l'occitan [Jochre, OCR through machine learning: a comparative study on Yiddish and Occitan]

Abstract : Building textual databases for less-resourced languages such as Yiddish and Occitan requires tools and resources that enable high-quality OCR (optical character recognition). One of the main difficulties for these two languages is their considerable spelling variation (and, for Occitan, dialectal variation). It is generally accepted that a lexicon can improve OCR quality, but it is not clear how to account for such variation within the lexicon. In this study, we use Jochre, a supervised machine learning OCR system, and compare several methods of generating and using lexicons. The best method attains an accuracy of 91.2% (words) and 97.4% (letters) on the Yiddish corpus, and 93.2% (words) and 97.9% (letters) on the Occitan corpus.

https://hal-univ-tlse2.archives-ouvertes.fr/hal-00979665
Contributor : Assaf Urieli
Submitted on : Wednesday, April 16, 2014 - 8:12:51 PM
Last modification on : Wednesday, July 10, 2019 - 1:34:54 AM
Long-term archiving on : Monday, April 10, 2017 - 2:32:04 PM

File

talare-2013-long-004.pdf
Publisher files allowed on an open archive

Identifiers

  • HAL Id : hal-00979665, version 1

Citation

Assaf Urieli, Marianne Vergez-Couret. Jochre, océrisation par apprentissage automatique : étude comparée sur le yiddish et l'occitan. TALARE 2013 : Traitement automatique des langues régionales de France et d'Europe, Jun 2013, Les Sables d'Olonne, France. pp.221. ⟨hal-00979665⟩
