Electronic Journal of Applied Statistical Analysis, Vol 17, No 3 (2024)

Beyond human labelling: an automatic topic identification framework for big web data

Roberto Ascari, Silvio Gerli, Sonia Migliorati, Teresa Cigna, Matteo Borrotti

Abstract


The global volume of written text is growing at an accelerating pace: since 2011, the number of posts per minute on Facebook has increased from 650K to 3M. These unstructured data are the source of an enormous amount of information that should be extracted by automatic engines. This is mainly accomplished by means of Natural Language Processing (NLP), a field of Artificial Intelligence devoted to analyzing and understanding human language as it is spoken and written. One common NLP task is topic identification, that is, the recognition of a text's topic(s). Two popular methods for modeling latent topics are latent Dirichlet allocation (LDA) and the correlated topic model (CTM). Both assume that each word in a document is associated with a latent topic, but they differ in the prior distribution assigned to topics and therefore have different strengths and weaknesses. In this work, LDA and CTM are tested and compared in a big data context by analyzing a large set of short documents automatically downloaded from the web by means of a modern crawler. In addition, under the assumption that each document is associated with a single topic, two new methods for the automatic classification of documents according to their real topic are proposed and tested, relying on LDA and CTM as (latent) topic model engines. Finally, under the more realistic hypothesis of multiple topics within a document, the two new methods, together with some combinations of the two, are tested as multi-class classification tools.
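
The difference in priors mentioned above can be illustrated with a minimal sketch. It assumes the standard formulations, in which LDA places a Dirichlet prior on a document's topic proportions and CTM places a logistic-normal prior on them, so that topics can be correlated. The function names, the number of topics, and all parameter values below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw topic proportions from a Dirichlet prior (the LDA assumption)."""
    # Normalizing independent Gamma(alpha_k, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_logistic_normal(mu, chol, rng):
    """Draw topic proportions from a logistic-normal prior (the CTM assumption).

    `chol` is a lower-triangular Cholesky factor of the covariance matrix,
    so off-diagonal entries induce correlation between topics.
    """
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    # eta = mu + chol @ z is a multivariate normal draw.
    eta = [m + sum(chol[i][j] * z[j] for j in range(i + 1))
           for i, m in enumerate(mu)]
    # Softmax maps the Gaussian draw onto the simplex (shifted for stability).
    peak = max(eta)
    weights = [math.exp(e - peak) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)

# Illustrative values only: three topics, symmetric Dirichlet parameter 0.5.
theta_lda = sample_dirichlet([0.5, 0.5, 0.5], rng)

# Illustrative Cholesky factor: topics 1 and 2 are strongly correlated.
chol = [[1.0, 0.0, 0.0],
        [0.9, 0.436, 0.0],
        [0.0, 0.0, 1.0]]
theta_ctm = sample_logistic_normal([0.0, 0.0, 0.0], chol, rng)
```

Both functions return a vector of non-negative topic proportions summing to one; only the logistic-normal draw can encode correlation between topics through the covariance structure.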