Table 1 shows the results obtained with each of the three approaches submitted. The WithoutDiac run removes all diacritics from both the corpus and the topics, whereas the WithDiac run suppresses diacritics only in the corpus. We observe the expected reduction in Mean Reciprocal Rank (MRR), but it is not significantly lower than that of the first run; this follows directly from the small number of diacritics in the evaluation topic set. An analysis of queries in real situations would be interesting in order to determine whether the topic set is realistic. The last run (CDWithoutDiac) also removes diacritics from both the topics and the corpus, but additionally attempts charset detection for each document to be indexed. Unfortunately, the table shows that this attempt did not succeed.
| Team | Run | Avg. Success at 1 | at 5 | at 10 | at 20 | at 50 | MRR over 1939 topics |
|------|-----|------|------|------|------|------|------|
| rfia | WithoutDiac | 0.0665 | 0.1423 | 0.1769 | 0.2192 | 0.2625 | 0.1021 |
| rfia | WithDiac | 0.0665 | 0.1372 | 0.1717 | 0.2130 | 0.2568 | 0.1006 |
| rfia | CDWithoutDiac | 0.0665 | 0.1310 | 0.1681 | 0.1996 | 0.2470 | 0.0982 |
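Although implementation details are omitted here, the diacritics suppression and the per-document charset detection used in these runs can be sketched as follows. This is a minimal illustration in Python; the Unicode NFD decomposition and the chardet library are assumptions of the sketch, not necessarily the actual implementation:

```python
import unicodedata
import chardet  # hypothetical choice of charset detector for this sketch

def strip_diacritics(text: str) -> str:
    """Drop diacritical marks by decomposing each character (NFD)
    and discarding the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def decode_document(raw: bytes) -> str:
    """Guess the charset of a raw document before indexing, as attempted
    in the CDWithoutDiac run; fall back to UTF-8 if undetected."""
    guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}
    return raw.decode(guess["encoding"] or "utf-8", errors="replace")

print(strip_diacritics("información jurídica"))  # -> informacion juridica
```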
Table 2 shows a summary of the best runs submitted by each participant to the mixed monolingual task of WebCLEF 2006. Mean Reciprocal Rank (MRR) scores are reported for both the original and the new topic sets. The first column gives the name of each team in the competition, and the second column the name of its best run. The scores in this table place us third.
| Team Name | Run Name | MRR (original topic set) | MRR (new topic set) |
|-----------|----------|------|------|
| isla | CombPhrase | 0.2001 | 0.3464 |
| hummingbird | humWC06dpcD | 0.1380 | 0.2390 |
| rfia | WithoutDiac (ERFinal) | 0.1021 | 0.1768 |
| depok | UI2DTF | 0.0918 | 0.1589 |
| ucm | webclef-run-all-2006 | 0.0870 | 0.1505 |
| hildesheim | UHiBase | 0.0795 | 0.1376 |
| buap | allpt40bi | 0.0157 | 0.0272 |
| reina | USAL_mix_hp | 0.0139 | 0.0241 |
Table 3(a) shows the best overall results using only the new topic set; there we obtained fourth place, according to the average of the automatic and the manual topic scores. Table 3(b) shows the results using only the automatically generated new topics, where our second place indicates that the penalisation-based ranking works well for the task proposed in this competition. It is interesting that our approach behaves better on the new topics than on the original ones. As reported in [1], the new topics were mostly automatically generated, whereas the original topics were all manually generated. Further work will analyse this behaviour of the penalisation-based ranking approach presented in this paper.
Table 3: (a) best overall results on the new topic set; (b) results on the new automatically generated topics.
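For reference, the evaluation measures reported above, Mean Reciprocal Rank and Average Success at N, can be computed as in the following sketch. It assumes one relevant page per topic, with None marking topics whose relevant page was never retrieved:

```python
def mrr(ranks):
    """Mean Reciprocal Rank: mean of 1/rank of the relevant page
    per topic; unretrieved topics (None) contribute 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

def avg_success_at(ranks, n):
    """Fraction of topics whose relevant page is ranked within the top n."""
    return sum(1 for r in ranks if r is not None and r <= n) / len(ranks)

ranks = [1, 4, None, 12, 2]      # rank of the relevant page for five topics
print(mrr(ranks))                # approx. 0.3667
print(avg_success_at(ranks, 5))  # 0.6
```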