Clustering Narrow-Domain Short Texts by using the Kullback-Leibler Distance
(1,2) David Pinto, (1) José-Miguel Benedí, (1) Paolo Rosso
(1) Department of Information Systems and Computation, Polytechnic University of Valencia
(2) Faculty of Computer Science, B. Autonomous University of Puebla
Abstract:
Clustering short texts is a difficult task in itself, and
the narrow-domain characteristic poses an additional challenge
for current clustering methods. We address this problem with
a new measure of distance between documents that is based on the
symmetric Kullback-Leibler distance. Although this measure is commonly
used to calculate a distance between two probability distributions, we
have adapted it in order to obtain a distance value between two
documents. We have carried out experiments over two different
narrow-domain corpora, and our findings indicate that it is possible to
use this measure for the addressed problem, obtaining results
comparable to those obtained with the Jaccard similarity measure.
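
To make the two measures named in the abstract concrete, the following is a minimal sketch, not the adaptation proposed in the paper. It assumes each document is represented by a term probability distribution over a shared vocabulary and that unseen terms receive a small additive smoothing mass; the function names, the `epsilon` parameter, and the toy documents are hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary, epsilon=1e-3):
    """Smoothed term probabilities over a shared vocabulary.

    Additive smoothing with `epsilon` is an assumption of this sketch;
    it keeps every probability non-zero so the logarithm is defined.
    """
    counts = Counter(tokens)
    total = len(tokens) + epsilon * len(vocabulary)
    return {term: (counts[term] + epsilon) / total for term in vocabulary}


def symmetric_kl(p, q):
    """Symmetric Kullback-Leibler distance: KL(P||Q) + KL(Q||P).

    Written in the equivalent form sum((p - q) * log(p / q)).
    """
    return sum((p[t] - q[t]) * math.log(p[t] / q[t]) for t in p)


def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity between the term sets of two documents."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


# Toy documents (hypothetical, for illustration only).
doc_a = "clustering short texts narrow domain".split()
doc_b = "clustering abstracts narrow domain corpora".split()

vocab = set(doc_a) | set(doc_b)
p = term_distribution(doc_a, vocab)
q = term_distribution(doc_b, vocab)

distance = symmetric_kl(p, q)
similarity = jaccard_similarity(doc_a, doc_b)
```

In this sketch the symmetric distance is zero for identical distributions and grows as the term distributions diverge, whereas the Jaccard measure is a similarity in [0, 1] computed on term sets alone; how the distance is turned into a clustering criterion is left open here.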
David Pinto
2007-05-08