
Clustering Narrow-Domain Short Texts by using the Kullback-Leibler Distance

(1,2) David Pinto, (1) José-Miguel Benedí, (1) Paolo Rosso

(1) Department of Information Systems and Computation, Polytechnic University of Valencia

(2) Faculty of Computer Science, B. Autonomous University of Puebla

Abstract:

Clustering short texts is a difficult task in itself, and the narrow-domain characteristic poses an additional challenge for current clustering methods. We address this problem with a new measure of distance between documents based on the symmetric Kullback-Leibler distance. Although this measure is commonly used to calculate a distance between two probability distributions, we have adapted it to obtain a distance value between two documents. We have carried out experiments over two different narrow-domain corpora, and our findings indicate that this measure can be used for the addressed problem, obtaining results comparable to those obtained with the Jaccard similarity measure.
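To make the idea concrete, the following is a minimal sketch of how a symmetric Kullback-Leibler distance between two documents might be computed and set beside a Jaccard similarity. It assumes one common symmetrization, D(P||Q) + D(Q||P), and a simple epsilon back-off for terms missing from a document. The function names, the smoothing constant and the tokenization are illustrative assumptions, not details taken from the abstract.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary, epsilon=1e-6):
    """Smoothed term distribution over a shared vocabulary.

    Assumption: terms absent from the document receive a small
    probability epsilon, and the result is renormalized to sum to 1.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    raw = {t: (counts[t] / total if counts[t] else epsilon) for t in vocabulary}
    norm = sum(raw.values())
    return {t: p / norm for t, p in raw.items()}


def symmetric_kl(p, q):
    """D(P||Q) + D(Q||P), which equals the sum of (p - q) * log(p / q)."""
    return sum((p[t] - q[t]) * math.log(p[t] / q[t]) for t in p)


def document_distance(tokens_a, tokens_b):
    """Symmetric KL distance between the term distributions of two documents."""
    vocabulary = set(tokens_a) | set(tokens_b)
    p = term_distribution(tokens_a, vocabulary)
    q = term_distribution(tokens_b, vocabulary)
    return symmetric_kl(p, q)


def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity between the term sets of two documents."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0
```

The symmetric form is used because the plain Kullback-Leibler divergence depends on the order of its arguments, whereas a clustering algorithm needs the same value for a pair of documents regardless of which one comes first. Smoothing is needed because short texts share few terms, and an unsmoothed zero probability would make the logarithm undefined.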





David Pinto 2007-05-08