Conference paper
International Conference on Language Resources and Evaluation, 2021
APA
Barry, J., Wagner, J., Cassidy, L., Cowap, A., Lynn, T., Walsh, A., … Foster, J. (2021). gaBERT — an Irish Language Model. International Conference on Language Resources and Evaluation.
Chicago/Turabian
Barry, James, Joachim Wagner, Lauren Cassidy, Alan Cowap, Teresa Lynn, Abigail Walsh, Mícheál J. Ó Meachair, and Jennifer Foster. “gaBERT — an Irish Language Model.” International Conference on Language Resources and Evaluation (2021).
MLA
Barry, James, et al. “gaBERT — an Irish Language Model.” International Conference on Language Resources and Evaluation, 2021.
BibTeX
@inproceedings{james2021a,
  title = {gaBERT — an Irish Language Model},
  year = {2021},
  booktitle = {International Conference on Language Resources and Evaluation},
  author = {Barry, James and Wagner, Joachim and Cassidy, Lauren and Cowap, Alan and Lynn, Teresa and Walsh, Abigail and Ó Meachair, Mícheál J. and Foster, Jennifer}
}
The BERT family of neural language models has become highly popular because these models provide rich, context-sensitive token encodings for sequences of text that generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare gaBERT to multilingual BERT (mBERT) and to the monolingual Irish WikiBERT, and show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary sizes, and choices of subword tokenisation model affect downstream performance. Finally, we compare a fine-tuned gaBERT model with a fine-tuned mBERT model on the task of identifying verbal multiword expressions, and show that gaBERT also performs better at this task. We release gaBERT and related code to the community.
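For readers who want to try the released model, below is a minimal sketch of obtaining context-sensitive token encodings from gaBERT via the Hugging Face transformers library. The model identifier DCU-NLP/bert-base-irish-cased-v1 is an assumption based on the public release and is not stated in the abstract itself.

# Minimal sketch: extracting contextual token encodings from gaBERT.
# Assumes the public Hugging Face release name
# "DCU-NLP/bert-base-irish-cased-v1" (not specified in the abstract).
from transformers import AutoModel, AutoTokenizer

model_name = "DCU-NLP/bert-base-irish-cased-v1"  # assumed gaBERT release id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenise an Irish sentence and run it through the encoder.
sentence = "Tá an aimsir go maith inniu."
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model(**inputs)

# One contextual vector per subword token: shape (1, num_tokens, hidden_size).
token_encodings = outputs.last_hidden_state
print(token_encodings.shape)

These per-token vectors are the kind of representations the abstract refers to; for a downstream task such as parsing or multiword-expression identification, they would typically be fed into a task-specific layer and fine-tuned end to end.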