Rivindu Perera

A HMM POS Tagger for Micro-Blogging Type Texts

The high volume of communication via micro-blogging type messages has created an increased demand for text processing tools customised for the unstructured text genre. The performance of available text processing tools developed on structured texts has been shown to deteriorate significantly when they are used on unstructured, micro-blogging type texts. In this paper, we present the results of testing an HMM (Hidden Markov Model) based POS (Part-Of-Speech) tagging model customised for unstructured texts. We also evaluated the tagger against published CRF (Conditional Random Field) based state-of-the-art POS tagging models customised for Tweet messages, using three publicly available Tweet corpora. Finally, we ran cross-validation tests with both taggers by training them on one Tweet corpus and testing them on another. The results show that the CRF-based POS tagger from GATE performed approximately 8% better than the HMM model at the token level; at the sentence level, however, the performances were approximately the same. The cross-validation experiments showed that both taggers' results deteriorated by approximately 25% at the token level and by a massive 80% at the sentence level. A detailed analysis of this deterioration is presented, and the trained HMM model, including the data, has been made available for research purposes. Since HMM training is orders of magnitude faster than CRF training, we conclude that the HMM model, despite trailing by about 8% in token accuracy, is still a viable alternative for real-time applications which demand rapid as well as progressive learning.
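
The distinction between token-level and sentence-level accuracy drawn above can be illustrated with a minimal sketch. It assumes that a sentence counts as correct only when every token in it receives the correct tag; this definition is an assumption made for illustration and may differ from the one used in the experiments. The function names and the example tags are likewise hypothetical.

```python
from typing import List

# One list of POS tags per sentence (hypothetical representation).
TagSequences = List[List[str]]


def token_accuracy(gold: TagSequences, pred: TagSequences) -> float:
    """Fraction of individual tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for gold_tags, pred_tags in zip(gold, pred):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("gold and predicted sentences differ in length")
        correct += sum(g == p for g, p in zip(gold_tags, pred_tags))
        total += len(gold_tags)
    return correct / total if total else 0.0


def sentence_accuracy(gold: TagSequences, pred: TagSequences) -> float:
    """Fraction of sentences in which every token is tagged correctly.

    Assumption: a single wrong tag makes the whole sentence count as wrong.
    """
    if not gold:
        return 0.0
    exact = sum(g == p for g, p in zip(gold, pred))
    return exact / len(gold)


# Illustrative data only: two short sentences, one tagging error in the first.
gold = [["PRP", "VBP", "NN"], ["UH", "NNP"]]
pred = [["PRP", "VBP", "JJ"], ["UH", "NNP"]]

print(token_accuracy(gold, pred))     # 4 of 5 tokens correct
print(sentence_accuracy(gold, pred))  # 1 of 2 sentences fully correct
```

Under this definition a single tagging error invalidates an entire sentence, which is one way a moderate drop in token accuracy can coincide with a much larger drop in sentence accuracy.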