
Qualitative evaluation of word embeddings : investigating the instability in neural-based models

Abstract : Distributional semantics has been revolutionized by neural-based word embedding methods such as word2vec, which made semantic models more accessible by providing fast, efficient and easy-to-use training methods. These dense representations of lexical units, based on the unsupervised analysis of large corpora, are increasingly used in various types of applications. They are integrated as the input layer in deep learning models, or they are used to draw qualitative conclusions in corpus linguistics. However, despite their popularity, there is still no satisfactory evaluation method for word embeddings that provides a global yet precise view of the differences between models. In this PhD thesis, we propose a methodology to qualitatively evaluate word embeddings and provide a comprehensive study of models trained using word2vec. In the first part of this thesis, we give an overview of the evolution of distributional semantics and review the different methods that are currently used to evaluate word embeddings. We then identify the limits of the existing methods and propose to evaluate word embeddings using a different approach based on the variation of nearest neighbors. We experiment with the proposed method by evaluating models trained with different parameters or on different corpora. Because of the non-deterministic nature of neural-based methods, we acknowledge the limits of this approach and consider the problem of nearest-neighbor instability in word embedding models. Rather than avoiding this problem, we embrace it and use it as a means to better understand word embeddings. We show that the instability problem does not impact all words in the same way and that several linguistic features are correlated with it. This is a step towards a better understanding of vector-based semantic models.
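The idea of comparing models through the variation of nearest neighbors can be illustrated with a minimal sketch. It assumes two embedding matrices that share the same vocabulary in the same row order, and measures, for each word, the share of its k nearest neighbors (by cosine similarity) that both models agree on. The function names, the overlap measure and the default k are illustrative assumptions and are not taken from the thesis, which may define its variation score differently.

```python
import numpy as np


def top_k_neighbors(emb, k):
    """Return, for each row of emb, the indices of its k nearest rows by cosine similarity."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # a word is never its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]


def neighbor_overlap(emb_a, emb_b, k=10):
    """Per-word share of top-k neighbors common to both models.

    Assumption: row i of emb_a and row i of emb_b represent the same word.
    The default k is arbitrary and chosen only for illustration.
    """
    nn_a = top_k_neighbors(emb_a, k)
    nn_b = top_k_neighbors(emb_b, k)
    return np.array([len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)])


if __name__ == "__main__":
    # Toy 2-dimensional "models" over a three-word vocabulary (hypothetical data).
    model_a = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
    model_b = np.array([[1.0, 0.0], [0.1, 0.9], [0.0, 1.0]])
    print(neighbor_overlap(model_a, model_b, k=1))
```

A low overlap for a word indicates that its neighborhood changes between the two models. Applied to two models trained with identical settings but different random initializations, the same score would give a rough picture of the instability discussed in the abstract.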
Document type :
Theses

https://tel.archives-ouvertes.fr/tel-03148513
Contributor : Abes Star
Submitted on : Monday, February 22, 2021 - 12:09:13 PM
Last modification on : Tuesday, February 23, 2021 - 3:24:43 AM

File

Pierrejean_Benedicte.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-03148513, version 1

Citation

Bénédicte Pierrejean. Qualitative evaluation of word embeddings : investigating the instability in neural-based models. Linguistics. Université Toulouse le Mirail - Toulouse II, 2020. English. ⟨NNT : 2020TOU20001⟩. ⟨tel-03148513⟩
