The TP-based method outperforms the classical VSM while using few computational resources. The entropy-based method, on the other hand, performs very well but at a higher computational cost. The TP approach enriched with bigrams achieved a performance similar to that of entropy. Finally, the curve for the combination of entropy and TP may indicate that the weighting procedure (using both Equation (1) and Equation (7)) does not assign adequate importance to terms, since precision diminishes after the 0.6 recall level.
The vocabulary size for each method is shown in Table 1. Entropy achieved the highest reduction (it retains only 3.3% of the original term space). The bigram-enriched TP method produced the largest vocabulary apart from VSM, yet its results are competitive with those of the entropy method at a much lower computational cost.
| Method name | Vocabulary size | Percentage of reduction |
|-------------|-----------------|-------------------------|
| VSM         | 235,808         | 0.00                    |
| TP          | 28,111          | 88.08                   |
| H           | 7,870           | 96.70                   |
| TP'         | 36,442          | 84.55                   |
| H+TP        | 29,117          | 87.66                   |
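The reduction column follows directly from the vocabulary sizes, taking the full VSM term space as the baseline. A minimal sketch of that arithmetic (the dictionary below simply restates the sizes from Table 1; small differences in the last decimal may come from rounding in the original table):

```python
# Vocabulary sizes as reported in Table 1; VSM is the unreduced term space.
vocab_sizes = {
    "VSM": 235_808,
    "TP": 28_111,
    "H": 7_870,
    "TP'": 36_442,
    "H+TP": 29_117,
}

full = vocab_sizes["VSM"]
for name, size in vocab_sizes.items():
    # Percentage of the original term space that each method removes.
    reduction = (1 - size / full) * 100
    print(f"{name}: {size} terms, {reduction:.2f}% reduction")
```

For example, TP keeps 28,111 of 235,808 terms, i.e. it removes (1 - 28111/235808) x 100 = 88.08% of the vocabulary, matching the table.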