MultiDomainBenchmark

Introduction

MultiDomainBenchmark is a complete evaluation suite for genetic database search programs. It is derived from RefProtDom v1.2 (Gonzalez and Pearson, 2010), which in turn is derived from Pfam (Finn et al., 2015). MultiDomainBenchmark was explicitly designed to provide a robust evaluation of database searching with multi-domain query sequences. While other benchmarks include target databases with multi-domain sequences (Chandonia et al., 2004; Gonzalez and Pearson, 2010), MultiDomainBenchmark is, to the authors' knowledge, the only benchmark with hundreds of multi-domain query sequences. Included with MultiDomainBenchmark are 412 phylogenetically diverse query sequences, a target database of 227,512 sequences, relevancy relationships, domain location information, and the scripts to generate the benchmark and perform evaluations.

Domain Architectures (DAs)

Domains are the basic units of function and structure in proteins. Instead of the traditional relevancy criterion of curated families, we determine relevancy based on Domain Architectures (DAs). A DA captures a sequence's domains and their relative order. For example, consider the following two sequences:
pfam21|Q9ULD7|Q9ULD7_HUMAN
Domain     Start    Stop
CL159         15     289
PF07703      450     628
CL11         749     844
CL59        1181    1430
PF07677     1568    1658
CL5         1710    1755

up|P14046|A1I3_RAT
Domain     Start    Stop
CL159         14     284
PF07703      453     610
CL11         741     835
CL59        1013    1269
PF07677     1379    1466
Figure 1. Filtering steps applied to RefProtDom v1.2
The domains for pfam21|Q9ULD7|Q9ULD7_HUMAN, in this order, make up the domain architecture da00640. Additionally, the domains for up|P14046|A1I3_RAT, in this order, match da00003.
To determine relevance, we use the following rule: a hit, sequence B, is a relevant ("True Positive") match for a query, sequence A, if the DA of sequence A is a subset of the DA of sequence B. Using the example above, if you searched with up|P14046|A1I3_RAT and found pfam21|Q9ULD7|Q9ULD7_HUMAN as a hit, it would be a relevant match (assuming no additional criteria, such as a minimum overlap). Notice that relevancy using DAs is not symmetric: if you instead searched with pfam21|Q9ULD7|Q9ULD7_HUMAN and found up|P14046|A1I3_RAT as a hit, the match would not fully support the DA used as the query, so it could be classified as an irrelevant ("False Positive") match.
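For illustration only, the subset rule can be sketched in a few lines of Python; the classifyRelevance.pl script provided on the Downloads page implements the full logic (including minimum-overlap criteria). The sketch below treats a DA simply as the set of its domain identifiers, ignoring domain order:

    def is_relevant(query_da, hit_da):
        """Return True if a hit with DA hit_da is a relevant ("True Positive") match
        for a query with DA query_da: every domain in the query's DA must also
        appear in the hit's DA."""
        return set(query_da) <= set(hit_da)

    # The DAs from the example above (domain identifiers, in sequence order)
    da00003 = ["CL159", "PF07703", "CL11", "CL59", "PF07677"]          # up|P14046|A1I3_RAT
    da00640 = ["CL159", "PF07703", "CL11", "CL59", "PF07677", "CL5"]   # pfam21|Q9ULD7|Q9ULD7_HUMAN

    print(is_relevant(da00003, da00640))  # True:  the query's DA is contained in the hit's DA
    print(is_relevant(da00640, da00003))  # False: the relation is not symmetric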
Before constructing the set of DAs, we applied additional filtering steps to RefProtDom. First, we removed sequences that had multiple domain annotations for the same position ("overlapping" domains) (see Figure 1). The resulting 227,512 sequences comprise the target (or subject) sequence database, against which the multi-domain queries are searched. Additionally, we separated the sequences that have only a single domain from those with multiple domains, yielding 66,601 sequences with multiple (non-overlapping) domains. These multi-domain sequences have a total of 2,525 DAs. Collectively, by using DAs, MultiDomainBenchmark specifies 4,435,759,802 relevancy relationships between its sequences.
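As an illustration of the overlap filter (again only a sketch; the benchmark generation scripts on the Downloads page, e.g. scripts/domainLocsParser.pl, implement the actual pipeline), the following assumes a sequence's annotations are given as (domain, start, stop) tuples and flags the sequence if any two annotations share a position:

    def has_overlapping_domains(annotations):
        """annotations: list of (domain, start, stop) tuples for one sequence.
        Returns True if any two domain annotations cover a common position."""
        spans = sorted((start, stop) for _, start, stop in annotations)
        for (start1, stop1), (start2, stop2) in zip(spans, spans[1:]):
            if start2 <= stop1:   # the next domain starts before the previous one ends
                return True
        return False

    # The annotations of pfam21|Q9ULD7|Q9ULD7_HUMAN above do not overlap:
    example = [("CL159", 15, 289), ("PF07703", 450, 628), ("CL11", 749, 844),
               ("CL59", 1181, 1430), ("PF07677", 1568, 1658), ("CL5", 1710, 1755)]
    print(has_overlapping_domains(example))  # False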

Query Selection

Figure 2. Domain Architecture (DA) filtering: Venn diagram of the DAs with more than one member (1,320 DAs), the DAs in which at least one sequence has at most 1,800 residues (2,142 DAs), and their intersection (1,179 DAs)
To obtain a set of phylogenetically diverse query sequences, we first ignored DAs that have only a single member, leaving 1,320 DAs (see Figure 2). Second, for computational considerations, we required that at least one of the sequences in a DA have at most 1,800 residues. This left 1,179 DAs that passed both filtering steps.
We further constrained the set of query DAs by considering only those that represent a diversity in the kingdoms of life; 412 DAs have sequences in at least two kingdoms. From each of these DAs, we randomly chose one query sequence. We ordered these query sequences by their (arbitrarily assigned) DA number, and partitioned the even-ranked ones into a Training set and the odd-ranked ones into a Test set.
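The Training/Test partition amounts to a simple even/odd split of the ordered queries. A minimal sketch, assuming the 412 selected query labels are already ordered by DA number:

    def split_training_test(ordered_queries):
        """ordered_queries: query labels ordered by their DA number.
        Even-ranked queries (ranks 2, 4, ...) form the Training set and
        odd-ranked queries (ranks 1, 3, ...) form the Test set."""
        training = [q for rank, q in enumerate(ordered_queries, start=1) if rank % 2 == 0]
        test = [q for rank, q in enumerate(ordered_queries, start=1) if rank % 2 != 0]
        return training, test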

Summary Statistics

In an effort to fully describe MultiDomainBenchmark, we have compiled the following sets of statistics that characterize the benchmark.
Description                                            Min   Average    Max
Number of sequences per DA                               2     111.0   1315
Lengths of each of the query sequences                 170     759.7   1800
  • Lengths of each of the Training query sequences    170     750.9   1800
  • Lengths of each of the Test query sequences        198     768.5   1749
Number of domains per query                              2       2.9     16
For more details, please refer to our journal article (Carroll et al., 2019).

References

  • H.D. Carroll, J.L. Spouge, and M.W. Gonzalez. 2019. The MultiDomainBenchmark Suite: a multiple-domain query and subject database suite. BMC Bioinformatics 20:77. (link)
  • J.M. Chandonia, G. Hon, N.S. Walker, L. Lo Conte, P. Koehl, M. Levitt, and S.E. Brenner. 2004. The ASTRAL Compendium in 2004. Nucleic Acids Research 32:Database Issue, D189-D192.
  • R.D. Finn, P. Coggill, R.Y. Eberhardt, S.R. Eddy, J. Mistry, A.L. Mitchell, S.C. Potter, M. Punta, M. Qureshi, A. Sangrador-Vegas, and others. 2015. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research, gkv1344.
  • M.W. Gonzalez, and W.R. Pearson. 2010. RefProtDom: a protein database with improved domain boundaries and homology relationships. Bioinformatics 26:18, 2361-2362.

Downloads

Benchmark

The core of MultiDomainBenchmark comprises query sequences (used to initiate a search), a collection of target sequences (against which a query is searched), relationship (relevancy) information, and domain location information (necessary to calculate overlaps).

Query Sequences

  • List of all 160,911 single-domain queries
  • Table of all 66,601 multi-domain queries (and selected attributes)
    • Table of the 412 randomly selected multi-domain queries (one sequence per DA) (a subset of the multi-domain queries file)
      • List of Training sequences
      • List of Test sequences

Target Sequences

All 227,512 target sequences in FASTA format (.zip, .gz)

Relevancy Information

Relevance Information (DA for each sequence)
#Taxon	DA	Relevant_DAs
pfam21|O00197|O00197_HUMAN	da00353	da00352,da00698,da00923,da00972,da01542,da01962,da02225,da02278,da02288,da02397,da02425
pfam21|O00222|MGR8_HUMAN	da00069	da00192,da00679,da02044
pfam21|O00329|PK3CD_HUMAN	da00020	da00179,da00324,da00325,da00336,da00615,da00714,da00747,da01306,da02021,da02047,da02059
...
The first column is the label of the multi-domain sequence. The second column is its DA identifier. The third (last) column is a comma-separated list of the other DAs that would result in a relevant ("True Positive") match.
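As a sketch of how this file might be read (classifyRelevance.pl, provided with the Performance Analysis Scripts below, is the script that actually consumes it), the snippet assumes the tab-separated layout shown above and builds a dictionary mapping each sequence label to its DA and to the set of relevant DAs:

    def read_relevance_info(path="relevanceInfo.tab"):
        """Return {sequence label: (DA, set of relevant DAs)} from the relevance file."""
        relevance = {}
        with open(path) as handle:
            for line in handle:
                if line.startswith("#") or not line.strip():
                    continue                  # skip the header and any blank lines
                fields = line.rstrip("\n").split("\t")
                label, da = fields[0], fields[1]
                relevant = set(fields[2].split(",")) if len(fields) > 2 and fields[2] else set()
                relevance[label] = (da, relevant)
        return relevance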

Domain Location Information

Domain location information - Lists domain names and positions for each sequence.
>pfam21|Q40577|5EAS_TOBAC
PF01397	25	219
PF03936	224	492
>up|O70212|5HT3A_CAVPO
PF02931	39	248
PF02932	255	481
>up|P46098|5HT3A_HUMAN
...
The format of the domain location information file is a greater-than sign, ">", followed by the label of each of the 227,512 sequences. The lines that follow each label have three columns indicating the domain name and its starting and ending positions. This file is a subset of family_members.annot from RefProtDom v1.2 (Gonzalez and Pearson, 2010).
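A short parsing sketch for this layout (again, only an illustration; classifyRelevance.pl is what actually uses domainLocs.tab):

    def read_domain_locations(path="domainLocs.tab"):
        """Return {sequence label: [(domain, start, stop), ...]} from the file above."""
        locations = {}
        label = None
        with open(path) as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue
                if line.startswith(">"):
                    label = line[1:]          # the sequence label follows the ">"
                    locations[label] = []
                else:
                    domain, start, stop = line.split("\t")
                    locations[label].append((domain, int(start), int(stop)))
        return locations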

Scripts

Benchmark Generation Scripts

The queries, target sequences, relevance information and domain location information files provided above are derived from RefProtDom v1.2 (namely, the family_members.annot and library_all_domains.fa files). The scripts to generate the benchmark are available as either benchmarkGenerationScripts.zip or benchmarkGenerationScripts.tgz (a tarball). Both compressed files contain the following files:
Script / File / Directory                  Description
makeBenchmark.sh                           Makes the entire benchmark
perl/                                      Directory of supporting Perl modules
scripts/domainLocsParser.pl                Determines the number of domains, and the presence of a repeated and/or embedded domain, for each sequence
scripts/getDomainSizes.pl                  Retrieves a subset of the statistics about each sequence, based on a file listing the random queries
scripts/getDASizes.pl                      Determines the DA associated with each specified query and calculates the number of sequences with the same DAs
scripts/getDAs.pl                          Filters DAs and randomly chooses one query sequence per DA
scripts/lengths.sh                         Calculates the number of residues in each sequence
scripts/fileSplicer.pl                     Breaks up a file into sets
scripts/removeTaxa.pl                      Removes taxa from a specified file that match labels from a specified list
taxonomy/diverseDAs.py                     Determines whether a domain architecture (DA) has members (sequences) in multiple taxonomic domains
taxonomy/speclist-20060516.txt             Older controlled vocabulary of species (for deprecated species)
taxonomy/uniprot2taxonomy.pl               Determines the taxonomy for a UniProt accession
taxonomy/idmapping_selected-cached.tab     Simplified cached copy of accession-to-taxon-ID mappings
COPYING                                    GNU GENERAL PUBLIC LICENSE v3

Performance Analysis Scripts

Managing multiple tests for hundreds of queries can be a complicated and time-consuming process. Additionally, calculating relevancy is not a trivial task, especially when overlap calculations are desired. Consequently, we provide the scripts used to perform the case study to assist other researchers in performing their own evaluations. The performance analysis scripts are available as either benchmarkingScripts.zip or benchmarkingScripts.tgz (a tarball). Both compressed files contain the following files:
Script / File                        Description
calculateTap-iterations.sh           Calculates TAP scores for each of the iterations
calculateTap.sh                      Calculates TAP scores
calculateTapKs.sh                    Calculates TAP-k scores
calculateTiming.sh                   Gathers all timing information
classifyRelevance.pl                 Main relevancy script. Given an output file from BLAST, PSI-BLAST, or HMMER3, it determines whether each retrieved sequence is a relevant or an irrelevant match.
hmmerWrapper.sh                      Executes phmmer or jackhmmer, including setting up a directory, calling the database search program and determining relevancy
jackhmmerWrapper.sh                  Symbolic link to hmmerWrapper.sh
jobsManager.sh                       Creates submission (or simply execution) scripts and schedules jobs that are not pending or running on PBS-like systems. Alternatively, it can execute jobs serially (without a scheduler).
numRelevantRecords.pl                Calculates and displays the number of relevant records
phmmerWrapper.sh                     Symbolic link to hmmerWrapper.sh
psi-non_iterativeWrapper.sh          Executes a single iteration of PSI-BLAST, including setting up a directory, calling the database search program and determining relevancy
psiWrapper.sh                        Executes PSI-BLAST, including setting up a directory, calling the database search program and determining relevancy
psiblast-non_iterativeWrapper.sh     Simple wrapper for psi-non_iterativeWrapper.sh
psiblastWrapper.sh                   Simple wrapper for psiWrapper.sh
removeEmptyFiles.sh                  Utility script that removes all non-hidden, zero-length files
runAll.sh                            Highest-level script to manage execution of runs
spouge2spougeE.pl                    Truncates .spouge-formatted files at a threshold (e.g., E-value)
COPYING                              GNU GENERAL PUBLIC LICENSE v3

Results

Search Performance Results

Below are the Threshold Average Precision-k (TAP-k) (Carroll et al., 2010) scores for each of the query sequences in the Test set. As additional methods are evaluated, their results can be added to these files. Those wishing to contribute TAP-k scores should use the format in the files below and contact the MultiDomainBenchmark authors.
  • Summary and individual TAP-1 scores
  • Summary and individual TAP-3 scores
  • Summary and individual TAP-5 scores
  • Summary and individual TAP-20 scores
# Query	non-iterative::phmmer	non-iterative::psiblast	iterative::jackhmmer	iterative::psiblast
# Threshold for 0.5 Quantile	5.3e-137	4e-131	2e-190	0
# Unweighted Average TAP-1	0.2564	0.2807	0.1381	0.3081
pfam21|P91550|P91550_CAEEL	0.0573	0.1031	0.0019	0.0868
pfam21|Q1SJT6|Q1SJT6_MEDTR	0.4948	0.5686	0.6759	0.6836
pfam21|Q1TLH7|Q1TLH7_9MYCO	0.6418	0.6589	0.5253	0.6221
pfam21|Q1UW50|Q1UW50_9MYCO	0.0315	0.0394	0.0000	0.0000
...

References

  • H.D. Carroll, M.G. Kann, S.L. Sheetlin, and J.L. Spouge. 2010. Threshold Average Precision (TAP-k): A Measure of Retrieval Efficacy Designed for Bioinformatics. Bioinformatics 26:14, 1708-1713. (link)
  • M.W. Gonzalez, and W.R. Pearson. 2010. RefProtDom: a protein database with improved domain boundaries and homology relationships. Bioinformatics 26:18, 2361-2362.

Case Study

MultiDomainBenchmark provides all of the necessary components to perform a rigorous evaluation of different database search algorithms. To illustrate this point, we conducted the first such evaluation. We evaluated the accuracy of each database search program with the Threshold Average Precision-k (TAP-k) metric (Carroll et al., 2010). TAP-k measures the average precision, with an additional penalty for sequences retrieved just before the threshold (which lowers the utility of the retrieval). Here, k is the number of irrelevant ("False Positive") records used to set the threshold. TAP-k scores range from 0.0, for retrievals in which no relevant ("True Positive") matches are found, to 1.0, for retrievals in which all of the relevant sequences and none of the irrelevant sequences are retrieved. Additionally, we measured the elapsed time of each execution.

Experimental Set-up

For the inaugural use of MultiDomainBenchmark, we evaluated the following commonly used database search programs: BLAST/PSI-BLAST (Camacho et al., 2009) and HMMER (phmmer and jackhmmer) (Eddy, 2011). Out of the dozens of command-line options for each program, we specified only a few, for formatting and evaluation purposes. In the spirit of facilitating reproducible results, the command lines used for each program are provided below (with only directories abbreviated for readability).
  • BLAST (PSI-BLAST set to a single iteration):
    (time -p psiblast -db finalDatabase -query queries-multiDomain/pfam21_Q1UW50_Q1UW50_9MYCO.fa -num_threads 1 -evalue 1000 -out pfam21_Q1UW50_Q1UW50_9MYCO.psiblast-non_iterative.hits.final.txt -num_descriptions 9999 -num_alignments 9999) &> pfam21_Q1UW50_Q1UW50_9MYCO.psiblast-non_iterative.final.out
    classifyRelevance.pl -v -psiblast=pfam21_Q1UW50_Q1UW50_9MYCO.psiblast-non_iterative.hits.final.txt -rel=relevanceInfo.tab --spouge=pfam21_Q1UW50_Q1UW50_9MYCO.psiblast-non_iterative.spouge.1000 --spougeext -taxon=pfam21_Q1UW50_Q1UW50_9MYCO --domainLocs=domainLocs.tab --overlap=50 --norandomsAsIrrelevants --combineHSPs --useFold
    
    (Note: the --useFold flag indicates to use the primary and superset DAs for classification.)
  • PSI-BLAST (PSI-BLAST for up to 5 iterations and saving the last PSSM. Then, a final search using the resulting PSSM on MultiDomainBenchmark):
    (time -p psiblast -db nrdb90/nr90 -query queries-multiDomain/pfam21_Q1UW50_Q1UW50_9MYCO.fa -num_iterations 5 -num_threads 1 -out_pssm pfam21_Q1UW50_Q1UW50_9MYCO.psiblast.nr.pssm -out_ascii_pssm pfam21_Q1UW50_Q1UW50_9MYCO.psiblast.nr.pssm.txt) &> pfam21_Q1UW50_Q1UW50_9MYCO.psiblast.iterationsDb.out
    (time -p psiblast -db finalDatabase -in_pssm pfam21_Q1UW50_Q1UW50_9MYCO.psiblast.nr.pssm -num_iterations 1 -evalue 1000 -num_threads 1 -out pfam21_Q1UW50_Q1UW50_9MYCO.psiblast.hits.final.out -num_descriptions 9999 -num_alignments 9999) &> pfam21_Q1UW50_Q1UW50_9MYCO.psiblast.final.out
    classifyRelevance.pl -v -psiblast=pfam21_Q1UW50_Q1UW50_9MYCO.psiblast.hits.final.out -rel=relevanceInfo.tab --spouge=pfam21_Q1UW50_Q1UW50_9MYCO.psiblast.spouge.1000 --spougeext -taxon=pfam21_Q1UW50_Q1UW50_9MYCO --domainLocs=domainLocs.tab --overlap=50 --norandomsAsIrrelevants --combineHSPs --useFold
    
  • phmmer (hmmbuild and hmmsearch):
    hmmbuild pfam21_Q1UW50_Q1UW50_9MYCO.phmmer.finalIn.hmm queries-multiDomain/pfam21_Q1UW50_Q1UW50_9MYCO.fa
    (time -p hmmsearch  -E 1000 --cpu 1 -o pfam21_Q1UW50_Q1UW50_9MYCO.phmmer.hits.final.out  pfam21_Q1UW50_Q1UW50_9MYCO.phmmer.finalIn.hmm  finalDatabase.fa) &> pfam21_Q1UW50_Q1UW50_9MYCO.phmmer.final.out
    classifyRelevance.pl -v -phmmer=pfam21_Q1UW50_Q1UW50_9MYCO.phmmer.hits.final.out -rel=relevanceInfo.tab --spouge=pfam21_Q1UW50_Q1UW50_9MYCO.phmmer.spouge.1000 --spougeext -taxon=pfam21_Q1UW50_Q1UW50_9MYCO --domainLocs=domainLocs.tab --overlap=50 --norandomsAsIrrelevants --combineHSPs --useFold
    
    (Note: the --combineHSPs flag is ignored, but supplied to classifyRelevance.pl for uniformity.)
  • jackhmmer (jackhmmer for the iterations, then a final search using the resulting HMM on MultiDomainBenchmark):
    (time -p jackhmmer   --noali -N 5  --cpu 1  --chkhmm pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.iterationsDb  queries-multiDomain/pfam21_Q1UW50_Q1UW50_9MYCO.fa  nrdb90/nr90.fa | grep '^\[ok\]$') &> pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.iterationsDb.out
    ln -s "pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.iterationsDb-5.hmm" "pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.finalIn.hmm"
    (time -p hmmsearch  -E 1000 --cpu 1 -o pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.hits.final.out  pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.finalIn.hmm  finalDatabase.fa) &> pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.final.out
    classifyRelevance.pl -v -jackhmmer=pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.hits.final.out -rel=relevanceInfo.tab --spouge=pfam21_Q1UW50_Q1UW50_9MYCO.jackhmmer.spouge.1000 --spougeext -taxon=pfam21_Q1UW50_Q1UW50_9MYCO --domainLocs=domainLocs.tab --overlap=50 --norandomsAsIrrelevants --combineHSPs --useFold
    
    (Note: the --combineHSPs flag is ignored, but supplied to classifyRelevance.pl for uniformity.)
The scripts used to generate each of these runs and the supporting scripts (e.g., classifyRelevance.pl) are provided on the Downloads page.
MultiDomainBenchmark is a complete evaluation benchmark. Included in it are a Training and a Test set of query sequences. While the Training set is intended to be used during development and for parameter selection, the Test set is used for performance comparisons. Consequently, in this case study, we used the 206 Test query sequences. Timing results were measured with the Unix command time for just the execution of the program (as illustrated above in the command lines). For non-iterative runs, the timing measures only the search against all of MultiDomainBenchmark (with its 227,512 sequences). For iterative runs, it is the sum of the iterative rounds (searching the NR90 database, with its 13,219,523 sequences) and the final round searching against all of MultiDomainBenchmark. We gathered these timing results on a shared, multi-processor system, so they should be considered a first approximation of the actual execution times.

Summary Search Performance

The following box-and-whisker plots summarize the TAP-k and timing results for each database search program. The whiskers mark the minimum and maximum values. Because the queries encompass both "easy" and "hard" cases, all but one of the TAP-k graphs have scores of 0.0 and 1.0, putting the whiskers at the very bottom and top of the graph. The bottoms and tops of the boxes indicate the first and third quartiles, with the black bar indicating the second quartile (median).

TAP-k Non-iterative Performance Results


TAP-k Iterative Performance Results


Timing Non-iterative Results


Timing Iterative Results


Individual Search Performance

TAP-k scores and timing results for each of the query sequences in the Test set are available on the Downloads page.

References

  • C. Camacho, G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer, and T.L. Madden. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10:421.
  • H.D. Carroll, M.G. Kann, S.L. Sheetlin, and J.L. Spouge. 2010. Threshold Average Precision (TAP-k): A Measure of Retrieval Efficacy Designed for Bioinformatics. Bioinformatics 26:14, 1708-1713. (link)
  • S.R. Eddy. 2011. Accelerated profile HMM searches. PLoS Computational Biology 7:10, e1002195.
Copyright © Hyrum D. Carroll