IEEE Access | Volume 9
Towards Meaningful Statements in IR Evaluation: Mapping Evaluation Measures to Interval Scales
Nicola Ferro [1], Marco Ferrante [2], Norbert Fuhr [3]
[1]
[2] Department of Mathematics "Tullio Levi-Civita"
[3]
Keywords: information retrieval; measurement; software metrics; statistical analysis; experimental evaluation; retrieval effectiveness
DOI: 10.1109/ACCESS.2021.3116857
Source: DOAJ
Abstract
Information Retrieval (IR) is a discipline deeply rooted in evaluation since its inception. Indeed, experimentally measuring and statistically validating the performance of IR systems is the only way to compare systems and understand which are better than others and, ultimately, more effective and useful for end-users. Since the seminal paper by Stevens (1946), it has been known that the properties of a measurement scale determine which operations should or should not be performed on values from that scale. For example, Stevens suggested that means and variances can be computed only when working with, at least, interval scales. It was recently shown that the most popular evaluation measures in IR are not interval-scaled. However, so far, there has been little or no investigation in IR into the impact and consequences of departing from scale assumptions. Taken to the extreme, this might even mean that decades of experimental IR research used potentially improper methods, which may have produced results needing further validation. However, it was unclear if and to what extent these findings apply to actual evaluations; this opened a debate in the community, with researchers taking opposite positions about whether this should be considered an issue (or not) and to what extent. In this paper, we first give an introduction to representational measurement theory, explaining why certain operations and significance tests are permissible only with scales of a certain level. For that, we introduce the notion of
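The scale issue raised in the abstract can be illustrated with a minimal sketch (using hypothetical scores, not data from the paper): on an ordinal scale, any strictly monotone rescaling is an equally valid representation, yet such a rescaling can reverse which of two systems has the higher mean, while the ordering by median is preserved.

```python
from statistics import mean, median

# Hypothetical ordinal effectiveness scores for two IR systems
# (illustrative values only, not taken from the paper).
system_a = [3, 3, 3]
system_b = [1, 1, 5]

# A strictly monotone (order-preserving) rescaling of the score values:
# on an ordinal scale this is an equally legitimate representation.
f = {1: 1, 3: 3, 5: 100}
rescaled_a = [f[x] for x in system_a]
rescaled_b = [f[x] for x in system_b]

print(mean(system_a) > mean(system_b))        # True: A looks better by mean
print(mean(rescaled_a) > mean(rescaled_b))    # False: the mean ordering flips
print(median(system_a) > median(system_b))    # True
print(median(rescaled_a) > median(rescaled_b))  # True: median ordering survives
```

Because the conclusion "system A has a higher mean than system B" depends on which admissible representation of the ordinal scale is chosen, it is not a meaningful statement below the interval level; order statistics such as the median remain invariant under such transformations.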
License
Unknown