Multi-Value Classification of Very Short Texts

14 Jun

by Andreas Heß, Philipp Dopichaj and Christian Maaß

Multi-value text classification is an interesting and very practical topic. In many applications, a single label only is not enough to appropriately classify documents. This is especially true in many applications on the web. As opposed to traditional documents, some texts on the web, especially on Web 2.0 sites, are very short, for example pin-board entries, comments to blog posts or captions of pictures or videos. Sometimes these texts are mere snippets, being at most one or two sentences long. Yet, in some Web 2.0 Applications, labelling or tagging such short snippets does not only make sense but could be the key to success. Therefore we believe it is important to investigate how multi-value text classification algorithms perform when very short texts are classified. To test this, we classified news articles from the well known Reuters data set based only on the headlines and compared the results to older approaches in literature that used the full text. We applied the same algorithm to a dataset from Web 2.0 site Lycos iQ. An empirical evaluation shows that text classification algorithms perform well in both setups.

The remainder of this paper is organised as follows: First, we present a new stacking approach for multi-value classification. By comparing the performance of classifiers trained only on the short headlines of the well-known Reuters news articles benchmark to results achieved with similar classifiers using the full article text we show that classification of very short texts is possible and the loss in accuracy is acceptable. Second, we present an application of text classification for tagging short texts from a Web 2.0-site. We demonstrate that presenting suggestions to the user can greatly improve the quality of tagging.

The paper is accepted for the 31st Annual German Conference on Artifical Intelligence.

Download: Multi-Value Classification of Very Short Texts

Comments are closed.