Conference paper, 2022

Re-train or train from scratch? Comparing pre-training strategies of BERT in the medical domain

Abstract

BERT models used in specialized domains all seem to be the result of a simple strategy: initializing with the original BERT and then resuming pre-training on a specialized corpus. This method yields rather good performance (e.g. BioBERT (Lee et al., 2020), SciBERT (Beltagy et al., 2019), BlueBERT (Peng et al., 2019)). However, it seems reasonable to think that training directly on a specialized corpus, using a specialized vocabulary, could result in more tailored embeddings and thus help performance. To test this hypothesis, we train BERT models from scratch using many configurations involving general and medical corpora. Based on evaluations using four different tasks, we find that the initial corpus only has a weak influence on the performance of BERT models when these are further pre-trained on a medical corpus.
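The two strategies compared in the abstract can be illustrated with a short sketch. The snippet below is not the authors' code: it assumes the Hugging Face transformers and tokenizers libraries, a placeholder corpus file medical_corpus.txt, and illustrative hyperparameters, and it only sets up the models and vocabularies; the masked-language-model pre-training itself and the downstream evaluation are omitted.

```python
# Minimal sketch (not the paper's exact setup) contrasting the two strategies:
# (A) resume pre-training of the original BERT on a medical corpus, vs.
# (B) pre-train from scratch with a vocabulary learned on the medical corpus.
from tokenizers import BertWordPieceTokenizer
from transformers import BertConfig, BertForMaskedLM, BertTokenizerFast

# --- Strategy A: initialize from the general-domain checkpoint -------------
# Weights and WordPiece vocabulary come from bert-base-uncased; masked
# language modelling would then simply be resumed on the specialized corpus.
tokenizer_a = BertTokenizerFast.from_pretrained("bert-base-uncased")
model_a = BertForMaskedLM.from_pretrained("bert-base-uncased")

# --- Strategy B: train from scratch with a specialized vocabulary ----------
# A new WordPiece vocabulary is learned on the medical corpus, and the model
# weights are randomly initialized.
wp = BertWordPieceTokenizer(lowercase=True)
wp.train(files=["medical_corpus.txt"], vocab_size=30522)  # placeholder corpus
wp.save_model(".")  # writes vocab.txt to the current directory
tokenizer_b = BertTokenizerFast(vocab_file="vocab.txt", do_lower_case=True)
model_b = BertForMaskedLM(BertConfig(vocab_size=tokenizer_b.vocab_size))

# Either model would then be pre-trained with the usual MLM objective
# (e.g. transformers' Trainer with DataCollatorForLanguageModeling) before
# fine-tuning on downstream medical tasks.
```

The difference the paper evaluates lies in strategy B: both the WordPiece vocabulary and the randomly initialized weights are tailored to the medical corpus, whereas strategy A keeps the general-domain vocabulary and starting weights.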
Main file: 2022.lrec-1.281.pdf (791.22 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-03803880, version 1 (06-10-2022)

License

Attribution - NonCommercial (CC BY-NC)

Identifiers

  • HAL Id: hal-03803880, version 1

Cite

Hicham El Boukkouri, Olivier Ferret, Thomas Lavergne, Pierre Zweigenbaum. Re-train or train from scratch? Comparing pre-training strategies of BERT in the medical domain. LREC 2022 - Language Resources and Evaluation Conference, Jun 2022, Marseille, France. pp.2626-2633. ⟨hal-03803880⟩
379 Views
400 Downloads
