Next: Conclusions Up: UPV-SI: Word Sense Induction Previous: The Self Term Expansion

Evaluation

The task organizers used two different measures to evaluate the runs submitted to the task. The first, called the unsupervised measure, is based on the Fscore. The second is called supervised recall. For further information on how these measures are calculated, refer to [Agirre 2006a,Agirre 2006b]. Since the two measures give conflicting information, both evaluation results are reported in this paper.

Table 2 shows our ranking and the Fscore obtained (UPV-SI). We also show the best and worst team Fscores, as well as the overall average and two baselines proposed by the task organizers. The first baseline (Baseline1) assumes that each ambiguous word has only one sense, whereas the second baseline (Baseline2) assigns senses at random. We are ranked in third place, and our score is higher than that of every other team except the best one. However, given how close its values are to those of Baseline1, we may assume that the best team, like Baseline1, presented one cluster per ambiguous word as its result, whereas we obtained 9.03 senses per ambiguous word on average.


Table 2: Unsupervised evaluation (Fscore performance).
Name         Rank   All    Nouns   Verbs
Baseline1    1      78.9   80.7    76.8
Best Team    2      78.7   80.8    76.3
UPV-SI       3      66.3   69.9    62.2
Average      -      63.6   66.5    60.3
Worst Team   7      56.1   65.8    45.1
Baseline2    8      37.8   38.0    37.6
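
The effect described above, where a single cluster per word scores well under an Fscore-based measure, can be illustrated with a small sketch. It assumes one common definition of a clustering Fscore (for each gold sense, take the best F-measure over all induced clusters, then weight by the size of the gold sense); the exact computation used by the organizers is given in the cited references and may differ. The function name `clustering_fscore` and the toy data are hypothetical.

```python
from collections import Counter


def clustering_fscore(gold, induced):
    """Size-weighted best-match F-measure between gold senses and clusters.

    Assumed definition (not taken from the paper): for every gold sense,
    find the induced cluster with the highest F-measure, then average
    these values weighted by the number of instances of each gold sense.
    """
    n = len(gold)
    class_sizes = Counter(gold)
    cluster_sizes = Counter(induced)
    overlap = Counter(zip(gold, induced))

    total = 0.0
    for sense, n_sense in class_sizes.items():
        best = 0.0
        for cluster, n_cluster in cluster_sizes.items():
            n_both = overlap[(sense, cluster)]
            if n_both == 0:
                continue
            precision = n_both / n_cluster
            recall = n_both / n_sense
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_sense / n) * best
    return total


# Hypothetical toy data: ten instances of one ambiguous word,
# eight with a dominant sense "a" and two with a minor sense "b".
gold = ["a"] * 8 + ["b"] * 2

one_cluster = [0] * 10            # Baseline1-style: a single sense per word
many_clusters = list(range(10))   # extreme case: one cluster per instance

score_one = clustering_fscore(gold, one_cluster)
score_many = clustering_fscore(gold, many_clusters)
```

On this toy data `score_one` is higher than `score_many`, which is consistent with the observation that a measure of this kind favours systems that induce few senses when one sense dominates.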


Table 3 shows our ranking and the supervised recall obtained (UPV-SI). We again show the best and worst team recalls. The overall average and one baseline are also presented (the other baseline obtained the same Fscore). In this case, the baseline tags each test instance with the most frequent sense obtained in a train split. We are again ranked in third place, and our score is slightly above the baseline.


Table 3: Supervised evaluation (Recall).
Name         Rank   All    Nouns   Verbs
Best Team    1      81.6   86.8    76.2
UPV-SI       3      79.1   82.5    75.3
Average      -      79.1   82.8    75.0
Baseline     4      78.7   80.9    76.2
Worst Team   6a     78.5   81.8    74.9
Worst Team   6b     78.5   81.4    75.2
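
The most-frequent-sense baseline used in the supervised evaluation can be sketched as follows. This is a minimal illustration under stated assumptions, not the organizers' scoring procedure: the helper names `most_frequent_sense` and `recall`, and the toy train/test split, are hypothetical, and recall is taken here simply as the fraction of test instances tagged with the correct sense.

```python
from collections import Counter


def most_frequent_sense(train_senses):
    """Return the sense label that occurs most often in the train split."""
    return Counter(train_senses).most_common(1)[0][0]


def recall(gold, predicted):
    """Fraction of gold instances tagged correctly.

    A prediction of None stands for an instance the system left untagged,
    which counts against recall.
    """
    correct = sum(1 for g, p in zip(gold, predicted) if p is not None and p == g)
    return correct / len(gold)


# Hypothetical toy split for a single ambiguous word.
train = ["a", "a", "b", "a"]
test_gold = ["a", "b", "a", "a", "b"]

mfs = most_frequent_sense(train)       # sense chosen from the train split
baseline_tags = [mfs] * len(test_gold)  # every test instance gets that sense
baseline_recall = recall(test_gold, baseline_tags)
```

Because such a baseline is correct on every test instance that carries the dominant sense, it can be hard to beat when the sense distribution is skewed, which may help explain why the scores in the table above are so close to one another.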


The results suggest that the technique employed has learned useful sense distinctions: our simple approach outperformed the random baseline (Baseline2) in the unsupervised evaluation and, more notably, the most-frequent-sense baseline in the supervised evaluation.


David Pinto 2007-05-08