Sequence analysis

Threshold Average Precision (TAP-k): A Measure of Retrieval Designed for Bioinformatics

Hyrum D. Carroll, Maricel G. Kann, Sergey L. Sheetlin, and John L. Spouge

National Center for Biotechnology Information, Bethesda, MD 20894, United States of America
University of Maryland, Baltimore County, Baltimore, MD 21250, United States of America

ABSTRACT

Motivation: Because database retrieval is a fundamental operation, the measurement of retrieval efficacy is critical to progress in bioinformatics. This paper points out some issues with current methods of measuring retrieval efficacy and suggests some improvements. In particular, many studies have used the pooled ROCn score, the area under the curve (AUC) of a "pooled" receiver operating characteristic (ROC) curve, truncated at n irrelevant records. Unfortunately, the pooled ROCn score does not faithfully reflect actual usage of retrieval algorithms. Additionally, a pooled ROCn score can be very sensitive to retrieval results from as few as a single query.

Methods: To replace the pooled ROCn score, we propose the Threshold Average Precision (TAP-k), a measure closely related to the well-known average precision in information retrieval, but reflecting the usage of E-values in bioinformatics. Furthermore, in addition to conditions previously given in the literature, we introduce three new criteria that an ideal measure of retrieval efficacy should satisfy.

Results: PSI-BLAST, GLOBAL, HMMER, and RPS-BLAST provided examples of using the TAP-k and pooled ROCn scores to evaluate sequence retrieval algorithms. In particular, compelling examples using real data highlight the drawbacks of the pooled ROCn score, showing that it can produce evaluations that skew far from intuitive expectations. In contrast, the TAP-k satisfies most of the criteria desired in an ideal measure of retrieval efficacy.

Availability and Implementation: The TAP-k web server and downloadable Perl script are freely available at http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html.ncbi/tap/.

Contact: spouge@ncbi.nlm.nih.gov

1 INTRODUCTION

In bioinformatics, retrieval from databases is a fundamental operation. Progress therefore depends on being able to recognize superior retrieval algorithms, so the measurement of retrieval efficacy is critical in bioinformatics. Swets (1967) stated that an ideal measure of retrieval efficacy (or more simply, a "retrieval measure") should satisfy four conditions:

(1) It should concern itself solely with the effectiveness of separating the relevant from the non-relevant [records] and not with the efficiency of resource use.
(2) It should not be dependent on a [user] threshold, but should measure the potential output of the method.
(3) It should be a single number.
(4) It should have absolute significance as a measure of a single method and should readily allow comparisons of different methods to decide which is best.

To fix our terms, when a user queries a database of R records, a retrieval algorithm typically lists up to R records, ranked by some score S indicating the probability that the corresponding record is relevant to the query. In text retrieval (e.g., Google or PubMed results), retrieval lists do not indicate the scores producing their list orders. In a significant break with the traditions of information retrieval, however, bioinformatics retrieval often explicitly presents an E-value with the score, so users are free to choose an E-value threshold E0 and then ignore the retrieval list beyond E0.
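To make this usage pattern concrete, the following is a minimal sketch of applying an E-value threshold E0 to a retrieval list; the list format and record names are hypothetical illustrations, not the output of any particular program.

    # Each hit is a (record_id, e_value) pair, already sorted by increasing E-value,
    # i.e., in retrieval-rank order (E-values and retrieval ranks increase together).
    hits = [("rec_017", 3e-42), ("rec_305", 1.2e-9), ("rec_122", 0.004), ("rec_078", 2.3)]

    E0 = 0.01  # user-chosen E-value threshold
    examined = [(rec, e) for rec, e in hits if e <= E0]  # the part of the list the user actually inspects
    # Records with E-values beyond E0 remain on the list but are effectively ignored.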
For concreteness, we discuss only E-values, but the methods in this paper apply to any score S. (Note that E-values and the retrieval ranks increase together.) In accord with the motivation behind E-values, Wilbur (1992) modified Swets' Condition (2):

(2') It should be characterized by a [user] threshold, but should reflect the quality of retrieval at every rank down to that threshold.

Wilbur's modification implicitly respects an overarching principle governing retrieval measures, which we call "the Principle of Fidelity": a retrieval measure should faithfully reflect the actual usage of the retrieval list. If not, the measure might be "ideal" in some abstract sense, but it would lack a practical meaning. The Principle of Fidelity supports Wilbur's Condition (2') in bioinformatics, because an E-value threshold E0 influences the actual usage of a retrieval algorithm. Because a user rarely examines a retrieval list far beyond the E-value threshold E0, any practical measure of database retrieval in bioinformatics should reflect the user's E0. The rest of this article therefore disregards Swets' Condition (2) in favor of Wilbur's Condition (2'), referring to the result as the "Swets-Wilbur Conditions".

In addition to the Swets-Wilbur Conditions, the Principle of Fidelity suggests that an ideal retrieval measure should satisfy additional conditions. Accordingly, we introduce the following:

(5) It should be robust against results representing a small proportion of possible user queries.
(6) When two disjoint sets of queries are considered, its value for the union of the two sets should lie between its values for the two sets of queries.
(7) It should reflect the choice of threshold; in particular, it should eventually decrease as the threshold increases to include the entire retrieval list.

Condition (5) reflects the fact that few users are likely to pose queries from such a proportionally small subset, so the subset should not greatly influence conclusions about retrieval efficacy. Condition (6) says that when combined, two disjoint sets of retrieval results should not suggest better (or worse) efficacy than either set individually (a simple empirical check of this property is sketched at the end of this section). Condition (7) reflects the fact that, presumably, an appropriate E-value threshold E0 has practical utility: if users prefer to examine the entire retrieval list, they have no use for an E-value threshold.

The ROCn score (Gribskov and Robinson, 1996), described in the Methods section, is often used as a retrieval measure in bioinformatics. In fact, the "pooled ROCn score" (Schaffer et al., 1999), also described in the Methods section, is probably the most popular summary retrieval measure over several different queries. The Principle of Fidelity casts immediate suspicion on the pooled ROCn score as a retrieval measure, however. Users do not examine the pooled retrieval list aggregated from lists for individual queries: users see the individual retrieval lists one at a time. The Results section shows that the pooled ROCn score does not always satisfy Condition (5) or (6). Moreover, it always fails to satisfy Condition (7).

To replace the pooled ROCn score, we therefore propose as a measure of retrieval the Threshold Average Precision at a median of k errors per query (abbreviated and pronounced "TAP-k"). The TAP-k summarizes features of the Precision-Recall curve (described in our Methods section).
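Returning to Condition (6): it is easy to check empirically for any candidate retrieval measure. The following is a minimal sketch of such a check, assuming a hypothetical function measure(queries) that returns a candidate measure's value for a set of queries; the function name and interface are purely illustrative.

    def satisfies_condition_6(measure, queries_a, queries_b):
        """Check Condition (6) for two disjoint sets of queries.

        measure: hypothetical callable mapping a set of queries to a single
                 retrieval-efficacy value (the two sets are assumed disjoint).
        Returns True if the value for the union of the two sets lies between
        the values for the two sets taken separately.
        """
        m_a, m_b = measure(queries_a), measure(queries_b)
        m_union = measure(queries_a | queries_b)
        return min(m_a, m_b) <= m_union <= max(m_a, m_b)

    # For example, if measure(A) = 0.9 and measure(B) = 0.5, Condition (6)
    # requires measure(A | B) to lie in the interval [0.5, 0.9].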
Precision-Recall curves are popular in general information retrieval and have already found some favor in bioinformatics (Chen, 2003; Jones et al., 2005; Krishnamurthy et al., 2007; Raychaudhuri et al., 2002; Wass and Sternberg, 2008). To exemplify the Precision-Recall curve and TAP-k, the Results section presents several examples of actual database retrieval, using the programs PSI-BLAST (Schaffer et al., 2001), GLOBAL (Kann et al., 2007), HMMER (Eddy, 1998) and RPS-BLAST (Schaffer et al., 1999). The section shows that unlike the TAP-k, the pooled ROCn score can produce evaluations so misleading as to be completely contrary to common sense (Chen, 2003; Hand, 2009; Sierk and Pearson, 2004). Finally, the Discussion section summarizes the implications of our results.

2 METHODS

2.1 Databases and Query Sets

We used two distinct databases in this work (see Supplementary Material for complete details). First, Gonzalez and Pearson (2010) constructed DB_344_Pfam, 344 protein families from the Pfam database (Finn et al., 2008). As sample queries for DB_344_Pfam, they provided 50 randomly selected families, each with a "query A" drawn from a sparsely populated part of the family's phylogenetic tree and a "query B" drawn from a heavily populated part. Gonzalez and Pearson considered as "relevant" only sequences in the same domain family or clan as the query. Second, Kann et al. (2007) provided DB_331_CDD, the position-specific scoring matrices (PSSMs) corresponding to 331 multiple sequence alignments from the NCBI Conserved Domain Database (CDD) (Marchler-Bauer et al., 2007). As sample queries for DB_331_CDD, they provided DB_8920_PDB (which Kann et al. (2007) named "DB_10185", for the 10,185 PDB sequences it contained before additional filtering). DB_8920_PDB contains 8,920 non-redundant sequences from the RCSB Protein Data Bank (PDB) (Berman et al., 2007). Kann et al. considered as "relevant" only those sequences in DB_8920_PDB that had at least 80% overlap with a representative in DB_331_CDD.

2.2 Retrieval Programs

Retrieval with PSI-BLAST (version 2.2.21) provided our anecdotal examples. We performed five PSI-BLAST iterations on NCBI's NR database with an E-value threshold of 0.005, using the final PSSM to retrieve sequences from DB_344_Pfam. Estimates of retrieval efficacy reflected solely the final retrieval from DB_344_Pfam, not the previous iterations on the NR database. Additionally, we calculated retrieval results for GLOBAL, HMMER and RPS-BLAST with the DB_8920_PDB queries searching DB_331_CDD. We utilized two variants of HMMER: HMMER_semi-global and HMMER_local (HMMER in "global" and "local" modes, respectively). The settings for HMMER, along with their rationale, have been specified elsewhere (Kann et al., 2007).

2.3 Retrieval Measures

2.3.1 The Receiver Operating Characteristic for n Irrelevant Records (ROCn) Curve and Score: Given a particular query, assume every database record is either relevant or irrelevant to the query. (The standard ROC terminology refers to "true positives" and "false positives", but in information retrieval, the terms "relevant" and "irrelevant" are pertinent. Unlike some authors (Hand, 2009), we view information retrieval as a problem in ranking, not a problem in classification.) Let the total number of irrelevant records be F. In response to a query, a retrieval algorithm produces a ranked retrieval list of all records in the database. Number each irrelevant record in the database 1, 2, ..., f, ..., F, according to its order in the retrieval ranking.
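With the irrelevant records numbered this way, the counts that drive the ROC and ROCn curves defined in the next paragraphs are straightforward to tally. The following is a minimal sketch, not part of the published TAP-k script: the function name is ours, and it assumes the retrieval list is supplied as relevance labels in rank order, covers the whole database, and contains at least n irrelevant records.

    def rocn_score(labels, n=50):
        """ROCn score for a single ranked retrieval list.

        labels: booleans in retrieval-rank order; True marks a relevant record,
                False an irrelevant one. Assumes the list covers the whole
                database, so all T relevant and F irrelevant records appear.
        n:      number of irrelevant records at which the ROC curve is truncated.

        Computes the ROCn score defined in the next paragraph: the area under
        the truncated ROC curve, normalized so that ideal retrieval scores 1.0,
        which reduces to summing, over the first n irrelevant records, the number
        of relevant records ranked before each, and dividing by n * T.
        """
        T = sum(labels)              # total number of relevant records
        if T == 0:
            return 0.0               # assumption: define the score as 0 when nothing is relevant
        relevant_so_far = 0          # relevant records seen so far in rank order
        area = 0                     # running sum of relevant-before-irrelevant counts
        irrelevant_seen = 0
        for is_relevant in labels:
            if is_relevant:
                relevant_so_far += 1
            else:
                irrelevant_seen += 1
                area += relevant_so_far
                if irrelevant_seen == n:
                    break
        return area / (n * T)

    # Toy example: ranks 1..6 labeled relevant, relevant, irrelevant, relevant,
    # irrelevant, irrelevant with n = 2 give (2 + 3) / (2 * 3) = 0.833...
    print(rocn_score([True, True, False, True, False, False], n=2))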
The "ROC curve" plots the fraction of relevant records preceding the f-th irrelevant record against the fraction f / F. The "ROC score" is the area under the ROC curve, abbreviated "AUC" (Swets, 1988). The ROC score is the probability that a random relevant record is ranked before a random irrelevant record (Bamber, 1975). By analogy to the ROC curve, the "ROCn curve" is the ROC curve truncated after the first n irrelevant records, with the ROCn score being the area under the ROCn curve divided by n / F. An "ideal retrieval" ranks all relevant records before any irrelevant record. The normalization by n / F ensures that ideal retrieval receives the maximum ROCn score of 1.0. For the ROCn, a threshold of n = 50 irrelevant records seems accepted practice (Gribskov and Robinson, 1996).

2.3.2 The Pooled Receiver Operating Characteristic for n Irrelevant Records (pooled ROCn) Score: To calculate the pooled ROCn score, merge the retrieval lists for all sample queries into a "pooled retrieval list", and sort the pooled list on the E-value. Then, calculate the ROCn score for the pooled list, as though it were a single retrieval list.

2.3.4 The Precision-Recall (PR) Curve and the Average Precision (AP): Precision-Recall (PR) curves and average precision (AP) often quantify retrieval efficacy in general information retrieval. To calculate the AP (see Supplementary Material for more details), fix the retrieval algorithm A, and consider a particular query q to a fixed database. Let the database contain T(q) records relevant to the query q, and let them be ranked t1<...