The TP-based method outperforms the classical VSM while using few computational resources. The entropy-based method, on the other hand, performs very well but at a higher computational cost. The TP approach enriched with bigrams achieved a performance similar to that of entropy. Finally, the curve for the combination of entropy and TP may indicate that the weighting procedure (using both Equation (1) and Equation (7)) does not assign adequate importance to terms, since precision diminishes after the 0.6 recall level.
The vocabulary size for each method is shown in Table 1. Entropy achieved the highest reduction (it retains only 3.3% of the original term space). The bigram-enriched TP method produced the largest vocabulary apart from VSM, yet its results are competitive with those of the entropy method at a much lower computational cost.
| Method name | Vocabulary size | Percentage of reduction |
|-------------|-----------------|-------------------------|
| VSM         | 235,808         | 0.00                    |
| TP          | 28,111          | 88.08                   |
| H           | 7,870           | 96.70                   |
| TP'         | 36,442          | 84.55                   |
| H+TP        | 29,117          | 87.66                   |
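The reduction column follows directly from the vocabulary sizes, taking the full VSM term space as the baseline. A minimal sketch of that arithmetic (the dictionary below simply restates the sizes from Table 1; small differences in the last decimal may come from rounding in the original table):

```python
# Vocabulary sizes as reported in Table 1; VSM is the unreduced term space.
vocab_sizes = {
    "VSM": 235_808,
    "TP": 28_111,
    "H": 7_870,
    "TP'": 36_442,
    "H+TP": 29_117,
}

full = vocab_sizes["VSM"]
for name, size in vocab_sizes.items():
    # Percentage of the original term space that each method removes.
    reduction = (1 - size / full) * 100
    print(f"{name}: {size} terms, {reduction:.2f}% reduction")
```

For example, TP keeps 28,111 of 235,808 terms, i.e. it removes (1 - 28111/235808) x 100 = 88.08% of the vocabulary, matching the table.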