MultiDomainBenchmark

Introduction

MultiDomainBenchmark is a complete evaluation suite for genetic database search programs. It is derived from RefProtDom v1.2 (Gonzalez and Pearson, 2010), which in turn is derived from Pfam (Finn et al., 2015). MultiDomainBenchmark was explicitly designed to provide a robust evaluation of database searching with multi-domain query sequences. While other benchmarks include target databases with multi-domain sequences (Chandonia et al., 2004; Gonzalez and Pearson, 2010), MultiDomainBenchmark is, to the authors' knowledge, the only benchmark with hundreds of multi-domain query sequences. Included with MultiDomainBenchmark are 412 phylogenetically diverse query sequences, a target database of 227,512 sequences, relevancy relationships, domain location information, and the scripts to generate the benchmark and perform evaluations.

Domain Architectures (DAs)

Domains are the basic units of function and structure in proteins. Instead of the traditional relevancy criterion of curated families, we determine relevancy based on Domain Architectures (DAs). A DA captures a sequence's domains and their relative order. For example, consider the following two sequences:
pfam21|Q9ULD7|Q9ULD7_HUMAN
Domain     Start    Stop
CL159         15     289
PF07703      450     628
CL11         749     844
CL59        1181    1430
PF07677     1568    1658
CL5         1710    1755

up|P14046|A1I3_RAT
Domain     Start    Stop
CL159         14     284
PF07703      453     610
CL11         741     835
CL59        1013    1269
PF07677     1379    1466
Figure 1. Filtering steps applied to RefProtDom v1.2
The domains for pfam21|Q9ULD7|Q9ULD7_HUMAN, in this order, make up the domain architecture da00640. Additionally, the domains for up|P14046|A1I3_RAT, in this order, match da00003.
To determine relevance, we use the following rule: a hit, sequence B, is a relevant ("True Positive") match for a query, sequence A, if the DA of sequence A is a subset of the DA of sequence B. Using the example above, if you searched with up|P14046|A1I3_RAT and found pfam21|Q9ULD7|Q9ULD7_HUMAN as a hit, it would be a relevant match (assuming no additional criteria, such as a minimum overlap). Notice that relevancy using DAs is not symmetric: if you instead searched with pfam21|Q9ULD7|Q9ULD7_HUMAN and found up|P14046|A1I3_RAT as a hit, the match would not fully support the DA used as the query, so it could be classified as an irrelevant ("False Positive") match.
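For illustration only, the subset rule can be sketched in a few lines of Python; the classifyRelevance.pl script provided on the Downloads page implements the full logic (including minimum-overlap criteria). The sketch below treats a DA simply as the set of its domain identifiers, ignoring domain order:

    def is_relevant(query_da, hit_da):
        """Return True if a hit with DA hit_da is a relevant ("True Positive") match
        for a query with DA query_da: every domain in the query's DA must also
        appear in the hit's DA."""
        return set(query_da) <= set(hit_da)

    # The DAs from the example above (domain identifiers, in sequence order)
    da00003 = ["CL159", "PF07703", "CL11", "CL59", "PF07677"]          # up|P14046|A1I3_RAT
    da00640 = ["CL159", "PF07703", "CL11", "CL59", "PF07677", "CL5"]   # pfam21|Q9ULD7|Q9ULD7_HUMAN

    print(is_relevant(da00003, da00640))  # True:  the query's DA is contained in the hit's DA
    print(is_relevant(da00640, da00003))  # False: the relation is not symmetric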
Before constructing the set of DAs, we applied additional filtering steps to RefProtDom. First, we removed sequences that had multiple domain annotations for the same position ("overlapping" domains) (see Figure 1). The resulting 227,512 sequences comprise the target (or subject) sequence database, against which the multi-domain queries are searched. Additionally, we separated the sequences that have only a single domain from those with multiple domains, yielding 66,601 sequences with multiple (non-overlapping) domains. These multi-domain sequences have a total of 2,525 DAs. Collectively, by using DAs, MultiDomainBenchmark specifies 4,435,759,802 relevancy relationships between its sequences.
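As an illustration of the overlap filter (again only a sketch; the benchmark generation scripts on the Downloads page, e.g. scripts/domainLocsParser.pl, implement the actual pipeline), the following assumes a sequence's annotations are given as (domain, start, stop) tuples and flags the sequence if any two annotations share a position:

    def has_overlapping_domains(annotations):
        """annotations: list of (domain, start, stop) tuples for one sequence.
        Returns True if any two domain annotations cover a common position."""
        spans = sorted((start, stop) for _, start, stop in annotations)
        for (start1, stop1), (start2, stop2) in zip(spans, spans[1:]):
            if start2 <= stop1:   # the next domain starts before the previous one ends
                return True
        return False

    # The annotations of pfam21|Q9ULD7|Q9ULD7_HUMAN above do not overlap:
    example = [("CL159", 15, 289), ("PF07703", 450, 628), ("CL11", 749, 844),
               ("CL59", 1181, 1430), ("PF07677", 1568, 1658), ("CL5", 1710, 1755)]
    print(has_overlapping_domains(example))  # False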

Query Selection

Figure 2. Domain Architecture (DA) filtering: Venn diagram of the DAs with more than one member (1,320 DAs), the DAs in which at least one sequence has at most 1,800 residues (2,142 DAs), and their intersection (1,179 DAs)
To obtain a set of phylogenetically diverse query sequences, we first ignored DAs that have only a single member, leaving 1,320 DAs (see Figure 2). Second, for computational considerations, we required that at least one of the sequences in a DA have at most 1,800 residues. This left 1,179 DAs that passed both filtering steps.
We further constrained the set of query DAs by considering only those that represent a diversity in the kingdoms of life; 412 DAs have sequences in at least two kingdoms. From each of these DAs, we randomly chose one query sequence. We ordered these query sequences by their (arbitrarily assigned) DA number, and partitioned the even-ranked ones into a Training set and the odd-ranked ones into a Test set.
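The Training/Test partition amounts to a simple even/odd split of the ordered queries. A minimal sketch, assuming the 412 selected query labels are already ordered by DA number:

    def split_training_test(ordered_queries):
        """ordered_queries: query labels ordered by their DA number.
        Even-ranked queries (ranks 2, 4, ...) form the Training set and
        odd-ranked queries (ranks 1, 3, ...) form the Test set."""
        training = [q for rank, q in enumerate(ordered_queries, start=1) if rank % 2 == 0]
        test = [q for rank, q in enumerate(ordered_queries, start=1) if rank % 2 != 0]
        return training, test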

Summary Statistics

In an effort to fully describe MultiDomainBenchmark, we have compiled the following sets of statistics that characterize the benchmark.
Description                                            Min   Average    Max
Number of sequences per DA                               2     111.0   1315
Lengths of each of the query sequences                 170     759.7   1800
  • Lengths of each of the Training query sequences    170     750.9   1800
  • Lengths of each of the Test query sequences        198     768.5   1749
Number of domains per query                              2       2.9     16
For more details, please refer to our journal article (Carroll et al., 2019).

References

  • H.D. Carroll, J.L. Spouge, and M.W. Gonzalez. 2019. The MultiDomainBenchmark Suite: a multiple-domain query and subject database suite. BMC Bioinformatics 20:77. (link)
  • J.M. Chandonia, G. Hon, N.S. Walker, L. Lo Conte, P. Koehl, M. Levitt, and S.E. Brenner. 2004. The ASTRAL Compendium in 2004. Nucleic Acids Research 32:Database Issue, D189-D192.
  • R.D. Finn, P. Coggill, R.Y. Eberhardt, S.R. Eddy, J. Mistry, A.L. Mitchell, S.C. Potter, M. Punta, M. Qureshi, A. Sangrador-Vegas, and others. 2015. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research, gkv1344.
  • M.W. Gonzalez, and W.R. Pearson. 2010. RefProtDom: a protein database with improved domain boundaries and homology relationships. Bioinformatics 26:18, 2361-2362.

Downloads

Benchmark

The core of MultiDomainBenchmark comprises query sequences (used to initiate a search), a collection of target sequences (against which a query is searched), relationship (relevancy) information, and domain location information (necessary to calculate overlaps).

Query Sequences

  • List of all 160,911 single-domain queries
  • Table of all 66,601 multi-domain queries (and selected attributes)
    • Table of the 412 randomly selected multi-domain queries (one sequence per DA) (a subset of the multi-domain queries file)
      • List of Training sequences
      • List of Test sequences

Target Sequences

All 227,512 target sequences in FASTA format (.zip, .gz)

Relevancy Information

Relevance Information (DA for each sequence)
#Taxon	DA	Relevant_DAs
pfam21|O00197|O00197_HUMAN	da00353	da00352,da00698,da00923,da00972,da01542,da01962,da02225,da02278,da02288,da02397,da02425
pfam21|O00222|MGR8_HUMAN	da00069	da00192,da00679,da02044
pfam21|O00329|PK3CD_HUMAN	da00020	da00179,da00324,da00325,da00336,da00615,da00714,da00747,da01306,da02021,da02047,da02059
...
The first column is the label of the multi-domain sequence. The second column is its DA identifier. The third (last) column is a comma-separated list of the other DAs that would result in a relevant ("True Positive") match.
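As a sketch of how this file might be read (classifyRelevance.pl, provided with the Performance Analysis Scripts below, is the script that actually consumes it), the snippet assumes the tab-separated layout shown above and builds a dictionary mapping each sequence label to its DA and to the set of relevant DAs:

    def read_relevance_info(path="relevanceInfo.tab"):
        """Return {sequence label: (DA, set of relevant DAs)} from the relevance file."""
        relevance = {}
        with open(path) as handle:
            for line in handle:
                if line.startswith("#") or not line.strip():
                    continue                  # skip the header and any blank lines
                fields = line.rstrip("\n").split("\t")
                label, da = fields[0], fields[1]
                relevant = set(fields[2].split(",")) if len(fields) > 2 and fields[2] else set()
                relevance[label] = (da, relevant)
        return relevance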

Domain Location Information

Domain location information - Lists domain names and positions for each sequence.
>pfam21|Q40577|5EAS_TOBAC
PF01397	25	219
PF03936	224	492
>up|O70212|5HT3A_CAVPO
PF02931	39	248
PF02932	255	481
>up|P46098|5HT3A_HUMAN
...
The format of the domain location information file is a greater-than sign, ">", followed by the label of each of the 227,512 sequences. The lines that follow each label have three columns indicating the domain name and its starting and ending positions. This file is a subset of family_members.annot from RefProtDom v1.2 (Gonzalez and Pearson, 2010).
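A short parsing sketch for this layout (again, only an illustration; classifyRelevance.pl is what actually uses domainLocs.tab):

    def read_domain_locations(path="domainLocs.tab"):
        """Return {sequence label: [(domain, start, stop), ...]} from the file above."""
        locations = {}
        label = None
        with open(path) as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue
                if line.startswith(">"):
                    label = line[1:]          # the sequence label follows the ">"
                    locations[label] = []
                else:
                    domain, start, stop = line.split("\t")
                    locations[label].append((domain, int(start), int(stop)))
        return locations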

Scripts

Benchmark Generation Scripts

The queries, target sequences, relevance information and domain location information files provided above are derived from RefProtDom v1.2 (namely, the family_members.annot and library_all_domains.fa files). The scripts to generate the benchmark are available as either benchmarkGenerationScripts.zip or benchmarkGenerationScripts.tgz (a tarball). Both compressed files contain the following files:
Script / File / Directory                  Description
makeBenchmark.sh                           Makes the entire benchmark
perl/                                      Directory of supporting Perl modules
scripts/domainLocsParser.pl                Determines the number of domains, and the presence of a repeated and/or embedded domain, for each sequence
scripts/getDomainSizes.pl                  Retrieves a subset of the statistics about each sequence, based on a file listing the random queries
scripts/getDASizes.pl                      Determines the DA associated with each specified query and calculates the number of sequences with the same DAs
scripts/getDAs.pl                          Filters DAs and randomly chooses one query sequence per DA
scripts/lengths.sh                         Calculates the number of residues in each sequence
scripts/fileSplicer.pl                     Breaks up a file into sets
scripts/removeTaxa.pl                      Removes taxa from a specified file that match labels from a specified list
taxonomy/diverseDAs.py                     Determines whether a domain architecture (DA) has members (sequences) in multiple taxonomic domains
taxonomy/speclist-20060516.txt             Older controlled vocabulary of species (for deprecated species)
taxonomy/uniprot2taxonomy.pl               Determines the taxonomy for a UniProt accession
taxonomy/idmapping_selected-cached.tab     Simplified cached copy of accession-to-taxon-ID mappings
COPYING                                    GNU GENERAL PUBLIC LICENSE v3

Performance Analysis Scripts

Managing multiple tests for hundreds of queries can be a complicated and time-consuming process. Additionally, calculating relevancy is not a trivial task, especially when overlap calculations are desired. Consequently, we provide the scripts used to perform the case study to assist other researchers in performing their own evaluations. The performance analysis scripts are available as either benchmarkingScripts.zip or benchmarkingScripts.tgz (a tarball). Both compressed files contain the following files:
Script / File                        Description
calculateTap-iterations.sh           Calculates TAP scores for each of the iterations
calculateTap.sh                      Calculates TAP scores
calculateTapKs.sh                    Calculates TAP-k scores
calculateTiming.sh                   Gathers all timing information
classifyRelevance.pl                 Main relevancy script. Given an output file from BLAST, PSI-BLAST, or HMMER3, it determines whether each retrieved sequence is a relevant or an irrelevant match.
hmmerWrapper.sh                      Executes phmmer or jackhmmer, including setting up a directory, calling the database search program and determining relevancy
jackhmmerWrapper.sh                  Symbolic link to hmmerWrapper.sh
jobsManager.sh                       Creates submission (or simply execution) scripts and schedules jobs that are not pending or running on PBS-like systems. Alternatively, it can execute jobs serially (without a scheduler).
numRelevantRecords.pl                Calculates and displays the number of relevant records
phmmerWrapper.sh                     Symbolic link to hmmerWrapper.sh
psi-non_iterativeWrapper.sh          Executes a single iteration of PSI-BLAST, including setting up a directory, calling the database search program and determining relevancy
psiWrapper.sh                        Executes PSI-BLAST, including setting up a directory, calling the database search program and determining relevancy
psiblast-non_iterativeWrapper.sh     Simple wrapper for psi-non_iterativeWrapper.sh
psiblastWrapper.sh                   Simple wrapper for psiWrapper.sh
removeEmptyFiles.sh                  Utility script that removes all non-hidden, zero-length files
runAll.sh                            Highest-level script to manage execution of runs
spouge2spougeE.pl                    Truncates .spouge-formatted files at a threshold (e.g., E-value)
COPYING                              GNU GENERAL PUBLIC LICENSE v3

Results

Search Performance Results

Below are the Threshold Average Precision-k (TAP-k) (Carroll et al., 2010) scores for each of the query sequences in the Test set. As additional methods are evaluated, their results can be added to these files. Those wishing to contribute TAP-k scores should use the format in the files below and contact the MultiDomainBenchmark authors.
  • Summary and individual TAP-1 scores
  • Summary and individual TAP-3 scores
  • Summary and individual TAP-5 scores
  • Summary and individual TAP-20 scores
# Query	non-iterative::phmmer	non-iterative::psiblast	iterative::jackhmmer	iterative::psiblast
# Threshold for 0.5 Quantile	5.3e-137	4e-131	2e-190	0
# Unweighted Average TAP-1	0.2564	0.2807	0.1381	0.3081
pfam21|P91550|P91550_CAEEL	0.0573	0.1031	0.0019	0.0868
pfam21|Q1SJT6|Q1SJT6_MEDTR	0.4948	0.5686	0.6759	0.6836
pfam21|Q1TLH7|Q1TLH7_9MYCO	0.6418	0.6589	0.5253	0.6221
pfam21|Q1UW50|Q1UW50_9MYCO	0.0315	0.0394	0.0000	0.0000
...

References

  • H.D. Carroll, M.G. Kann, S.L. Sheetlin, and J.L. Spouge. 2010. Threshold Average Precision (TAP-k): A Measure of Retrieval Efficacy Designed for Bioinformatics. Bioinformatics 26:14, 1708-1713. (link)
  • M.W. Gonzalez, and W.R. Pearson. 2010. RefProtDom: a protein database with improved domain boundaries and homology relationships. Bioinformatics 26:18, 2361-2362.

Case Study

MultiDomainBenchmark provides all of the necessary components to perform a rigorous evaluation of different database search algorithms. To illustrate this point, we conducted the first such evaluation. We evaluated the accuracy of each database search program with the Threshold Average Precision-k (TAP-k) metric (Carroll et al., 2010). TAP-k measures the average precision, with an additional penalty for sequences retrieved just before the threshold (which lowers the utility of the retrieval). Here, k is the number of irrelevant ("False Positive") records used to set the threshold. TAP-k scores range from 0.0, for retrievals in which no relevant ("True Positive") matches are found, to 1.0, for retrievals in which all of the relevant sequences and none of the irrelevant sequences are retrieved. Additionally, we measured the elapsed time of each execution.

Experimental Set-up

For the inaugural use of MultiDomainBenchmark, we evaluated the following commonly used database search programs: BLAST/PSI-BLAST (Camacho et al., 2009) and HMMER (phmmer and jackhmmer) (Eddy, 2011). Out of the dozens of command-line options for each program, we specified only a few, for formatting and evaluation purposes. In the spirit of facilitating reproducible results, the command lines used for each program are provided below (with only directories abbreviated for readability).
  • BLAST (PSI-BLAST set to a single iteration):
    (time -p psiblast -db finalDatabase -query queries-multiDomain/pfam21_Q1UW50_Q1UW50_9MYCO.fa -num_threads 1 -evalue 1000 -out pfam21_Q1UW50_Q1UW50_9MYCO.psiblast-non_iterative.hits.final.txt -num_descriptions 9999 -num_alignments 9999) &> pfam21_Q1UW50_Q1UW50_9MYCO.psiblast-non_iterative.final.out
    classifyRelevance.pl -v -psiblast=pfam21_Q1UW50_Q1UW50_9MYCO.psiblast-non_iterative.hits.final.txt -rel=relevanceInfo.tab --spouge=pfam21_Q1UW50_Q1UW50_9MYCO.psiblast-non_iterative.spouge.1000 --spougeext -taxon=pfam21_Q1UW50_Q1UW50_9MYCO --domainLocs=domainLocs.tab --overlap=50 --norandomsAsIrrelevants --combineHSPs --useFold
    
    (Note: the --useFold flag indicates to use the primary and superset DAs for classification.)
  • PSI-BLAST (PSI-BLAST for up to 5 iterations and saving the last PSSM. Then, a final search using the resulting PSSM on MultiDomainBenchmark):
    (time -p psiblast -db nrdb90/nr90 -query queries-multiDomain/pfam21_Q1UW50_Q1UW50_9MYCO.fa -num_iterations 5 -num_threads 1 -out_pssm pfam21_Q1UW50_Q1UW50_9MYCO.psiblast.nr.pssm -out_ascii_pssm pfam21_Q1UW50_Q1UW50_9MYCO.psiblast.nr.pssm.txt) &> pfam21_Q1UW50_Q1UW50_9MYCO.psiblast.iterationsDb.out
    (time -p psiblast -db finalDatabase -in_pssm pfam21_Q1UW50_Q1UW50_9MYCO.psiblast.nr.pssm -num_iterations 1 -evalue 1000 -num_threads 1 -out pfam21_Q1UW50_Q1UW50_9MYCO.psiblast.hits.final.out -num_descriptions 9999 -num_alignments 9999) &> pfam21_Q1UW50_Q1UW50_9MYCO.psiblast.final.out
    classifyRelevance.pl -v -psiblast=pfam21_Q1UW50_Q1UW50_9MYCO.psiblast.hits.final.out -rel=relevanceInfo.tab --spouge=pfam21_Q1UW50_Q1UW50_9MYCO.psiblast.spouge.1000 --spougeext -taxon=pfam21_Q1UW50_Q1UW50_9MYCO --domainLocs=domainLocs.tab --overlap=50 --norandomsAsIrrelevants --combineHSPs --useFold
    
  • phmmer (hmmbuild and hmmsearch):
    hmmbuild pfam21_Q1UW50_Q1UW50_9MYCO.phmmer.finalIn.hmm queries-multiDomain/pfam21_Q1UW50_Q1UW50_9MYCO.fa
    (time -p hmmsearch  -E 1000 --cpu 1 -o pfam21_Q1UW50_Q1UW50_9MYCO.phmmer.hits.final.out  pfam21_Q1UW50_Q1UW50_9MYCO.phmmer.finalIn.hmm  finalDatabase.fa) &> pfam21_Q1UW50_Q1UW50_9MYCO.phmmer.final.out
    classifyRelevance.pl -v -phmmer=pfam21_Q1UW50_Q1UW50_9MYCO.phmmer.hits.final.out -rel=relevanceInfo.tab --spouge=pfam21_Q1UW50_Q1UW50_9MYCO.phmmer.spouge.1000 --spougeext -taxon=pfam21_Q1UW50_Q1UW50_9MYCO --domainLocs=domainLocs.tab --overlap=50 --norandomsAsIrrelevants --combineHSPs --useFold
    
    (Note: the --combineHSPs flag is ignored, but supplied to classifyRelevance.pl for uniformity.)
  • jackhmmer (jackhmmer for the iterations, then a final search using the resulting HMM on MultiDomainBenchmark):
    (time -p jackhmmer   --noali -N 5  --cpu 1  --chkhmm pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.iterationsDb  queries-multiDomain/pfam21_Q1UW50_Q1UW50_9MYCO.fa  nrdb90/nr90.fa | grep '^\[ok\]$') &> pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.iterationsDb.out
    ln -s "pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.iterationsDb-5.hmm" "pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.finalIn.hmm"
    (time -p hmmsearch  -E 1000 --cpu 1 -o pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.hits.final.out  pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.finalIn.hmm  finalDatabase.fa) &> pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.final.out
    classifyRelevance.pl -v -jackhmmer=pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.hits.final.out -rel=relevanceInfo.tab --spouge=pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.spouge.1000 --spougeext -taxon=pfam21_Q1UW50_Q1UW50_9MYCO --domainLocs=domainLocs.tab --overlap=50 --norandomsAsIrrelevants --combineHSPs --useFold
    
    (Note: the --combineHSPs flag is ignored, but supplied to classifyRelevance.pl for uniformity.)
The scripts used to generate each of these runs and the supporting scripts (e.g., classifyRelevance.pl) are provided on the Downloads page.
MultiDomainBenchmark is a complete evaluation benchmark. Included in it are a Training and a Test set of query sequences. While the Training set is intended to be used during development and for parameter selection, the Test set is used for performance comparisons. Consequently, in this case study, we used the 206 Test query sequences. Timing results were measured with the Unix command time for just the execution of the program (as illustrated above in the command lines). For non-iterative runs, the timing measures only the search against all of MultiDomainBenchmark (with its 227,512 sequences). For iterative runs, it is the sum of the iterative rounds (searching the NR90 database, with its 13,219,523 sequences) and the final round searching against all of MultiDomainBenchmark. We gathered these timing results on a shared, multi-processor system, so they should be considered a first approximation of the actual execution times.

Summary Search Performance

The following box-and-whisker plots summarize the TAP-k and timing results for each database search program. The whiskers mark the minimum and maximum values. Because the queries encompass both "easy" and "hard" cases, all but one of the TAP-k graphs have scores of 0.0 and 1.0, putting the whiskers at the very bottom and top of the graph. The bottoms and tops of the boxes indicate the first and third quartiles, with the black bar indicating the second quartile (median).

TAP-k Non-iterative Performance Results


TAP-k Iterative Performance Results


Timing Non-iterative Results


Timing Iterative Results


Individual Search Performance

TAP-k scores and timing results for each of the query sequences in the Test set are available on the Downloads page.

References

  • C. Camacho, G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer, and T.L. Madden. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10:421.
  • H.D. Carroll, M.G. Kann, S.L. Sheetlin, and J.L. Spouge. 2010. Threshold Average Precision (TAP-k): A Measure of Retrieval Efficacy Designed for Bioinformatics. Bioinformatics 26:14, 1708-1713. (link)
  • S.R. Eddy. 2011. Accelerated profile HMM searches. PLoS Computational Biology 7:10, e1002195.
Copyright © Hyrum D. Carroll