TY - GEN
T1 - GSDMM Model Evaluation Techniques with Application to British Telecom Data
AU - Abdelmotaleb, Hesham
AU - Wojtyś, Małgorzata
AU - McNeile, Craig
N1 - Publisher Copyright: © 2023, Avestia Publishing. All rights reserved.
PY - 2023
Y1 - 2023
N2 - Statistical topic modelling has become one of the most important tools in text processing, as more applications use it to handle the growing volume of available text data, e.g. from social media platforms. The aim of topic modelling is to discover the main themes or topics in a collection of text documents. While several models have been developed, there is no consensus on how to evaluate them or how to determine the best hyper-parameters for a model. In this research, we develop a method for evaluating topic models for short text that employs word embedding and measures within-topic variability and separation between topics. We focus on the Dirichlet Mixture Model and on tuning its hyper-parameters. In empirical experiments, we present a case study on short text datasets related to the British telecommunication industry. In particular, we find that the optimal hyper-parameter values obtained from our evaluation method do not agree with the fixed values typically used in the literature.
AB - Statistical topic modelling has become one of the most important tools in text processing, as more applications use it to handle the growing volume of available text data, e.g. from social media platforms. The aim of topic modelling is to discover the main themes or topics in a collection of text documents. While several models have been developed, there is no consensus on how to evaluate them or how to determine the best hyper-parameters for a model. In this research, we develop a method for evaluating topic models for short text that employs word embedding and measures within-topic variability and separation between topics. We focus on the Dirichlet Mixture Model and on tuning its hyper-parameters. In empirical experiments, we present a case study on short text datasets related to the British telecommunication industry. In particular, we find that the optimal hyper-parameter values obtained from our evaluation method do not agree with the fixed values typically used in the literature.
KW - Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM)
KW - hyper-parameter tuning
KW - model evaluation
KW - telecommunication industry
KW - topic modelling
UR - http://www.scopus.com/inward/record.url?scp=85188890722&partnerID=8YFLogxK
U2 - 10.11159/icsta23.116
DO - 10.11159/icsta23.116
M3 - Conference proceedings published in a book
SN - 9781990800252
T3 - Proceedings of the International Conference on Statistics
BT - 5th International Conference on Statistics
A2 - Samia, Noelle
A2 - Husmeier, Dirk
PB - Avestia Publishing
T2 - 5th International Conference on Statistics: Theory and Applications, ICSTA 2023
Y2 - 3 August 2023 through 5 August 2023
ER -