BLASTFDR

  • Intro
  • Example
  • Results
  • Supp. Materials
  • Home
  • Research
    • Overview
    • BLASTFDR
    • MultiDomainBenchmark
    • PSI-BLASTFDR
    • PSI-SemiGLOBAL
    • TAP
  • Courses
  • CV
  • Publications

Introduction

BLASTFDR and PSI-BLASTFDR are variants of BLAST and PSI-BLAST, the ubiquitous genetic database search algorithms. Instead of using BLAST and PSI-BLAST's uniform E-value threshold criterion to determine which sequences to include, BLASTFDR and PSI-BLASTFDR utilize a false discovery rate (FDR) method. The default FDR method in both BLASTFDR and PSI-BLASTFDR is the Benjamini-Hochberg method (Benjamini & Hochberg, 1995). BLASTFDR and PSI-BLASTFDR include the first k sequences that meet the following condition: Ek ≤ kα with Ek as the E-value of the kth sequence and α as a user-specified level. Employing a FDR threshold criterion allows BLASTFDR and PSI-BLASTFDR to retrieve fewer irrelevant sequences ("false positives") and to achieve better average Threshold Average Precision (TAP) (Carroll et al., 2010) values than BLAST and PSI-BLAST.

We evaluated BLASTFDR and PSI-BLASTFDR on an augmented version of the Astral40 database (Chandonia et al., 2004). BLASTFDR received an average TAP value of 0.226 while BLAST received a 0.203 on a test subset. Furthermore, BLASTFDR performs better on datasets that belong to smaller superfamilies. For example, BLASTFDR achieves a TAP value of 0.089 better than BLAST for superfamilies with a size of twelve or fewer members. These superfamilies represent over 38% of the test datasets. BLASTFDR performs better or equal to BLAST for over 61% of these datasets. PSI-BLASTFDR also performs similarly when compared against PSI-BLAST.

Download and Installation Instructions

BLASTFDR and PSI-BLASTFDR are compiled just as BLAST is and can be used on any platform that BLAST is designed for. To compile BLASTFDR and PSI-BLASTFDR, the following BLAST version 2.2.27 files require modifications:

	  include/algo/blast/api/blast_options.hpp
	  include/algo/blast/api/blast_options_handle.hpp
	  include/algo/blast/api/psiblast_options.hpp
	  include/algo/blast/blastinput/cmdline_flags.hpp
	  include/algo/blast/core/blast_hits.h
	  include/algo/blast/core/blast_options.h
	  include/algo/blast/core/blast_program.h
	  include/algo/blast/format/data4xmlformat.hpp
	  include/objects/blast/names.hpp
	  include/objtools/align_format/align_format_util.hpp
	  src/algo/blast/api/blast_aux.cpp
	  src/algo/blast/api/blast_options_builder.cpp
	  src/algo/blast/api/blast_options_cxx.cpp
	  src/algo/blast/api/blast_options_local_priv.hpp
	  src/algo/blast/api/psi_pssm_input.cpp
	  src/algo/blast/api/psiblast.cpp
	  src/algo/blast/api/psiblast_aux_priv.cpp
	  src/algo/blast/api/psiblast_aux_priv.hpp
	  src/algo/blast/api/psiblast_iteration.cpp
	  src/algo/blast/api/psiblast_options.cpp
	  src/algo/blast/blastinput/blast_args.cpp
	  src/algo/blast/blastinput/cmdline_flags.cpp
	  src/algo/blast/core/blast_engine.c
	  src/algo/blast/core/blast_hits.c
	  src/algo/blast/core/blast_options.c
	  src/objects/blast/names.cpp
	  src/objtools/align_format/align_format_util.cpp
	
All of these files with the appropriate modifications can be downloaded in the following tarball: blast_fdr.tar.gz (also available via sourceforge.net/p/blastfdr/). To compile an executable, untar the files over the original version 2.2.27 source code files (tar -xzf blast_fdr.tar.gz). Alternatively, you can unzip the files into an existing copy of the source code (unzip blast_fdr.zip). If you're using a version after 2.2.27, apply the differences of each of the above files found in blast_fdr.diff.
After applying the modifications above, compile in the same way that you would compile BLAST.

Execution Instructions

After successfully compiling the source code, BLASTFDR can be executed instead of BLAST by adding "-fdrDisplayBH α" to the command-line. Similarly for PSI-BLASTFDR, add the same parameters to the command-line when executing PSI-BLAST. Empirically, we've found that α = 0.05 works best (at least on the augmented Astral40 database). Note, the -evalue parameter can continue to be used to influence the initial heuristic search algorithms.

License

BLASTFDR and PSI-BLASTFDR are extensions of BLAST and PSI-BLAST and therefore are subject to their licensing terms:

PUBLIC DOMAIN NOTICE
National Center for Biotechnology Information

This software/database is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the author's official duties as a United States Government employee and thus cannot be copyrighted. This software/database is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.

Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.

Please cite the author in any work or product based on this material.

Citations

To cite BLASTFDR, please use the following:
  • Carroll,H.D., Williams,A.C. Davis,A.G., and Spouge,J.L. (2013) False Discovery Rate for Homology Searches. Proceedings of the 8th Brazilian Symposium on Bioinformatics, pp 194-201.

To cite PSI-BLASTFDR, please use the following:
  • Carroll,H.D., Williams,A.C. Davis,A.G., and Spouge,J.L. (2014) Improving Retrieval Efficacy of Homology Searches using the False Discovery Rate. Submitted.

References

Benjamini,Y. and Hochberg,Y. (1995). Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society, Series B, 57, 289-300.

Carroll,H.D., Kann,M.G., Sheetlin,S.L. and Spouge,J.L. (2010) Threshold Average Precision (TAP-k): A Measure of Retrieval Efficacy Designed for Bioinformatics. Bioinformatics, 26, pp 1708-1713. (web, pdf)

Chandonia,J., Hon,G., Walker,N., Lo Conte,L., Koehl,P., Levitt,M., and Brenner,S. (2004). The ASTRAL Compendium in 2004. Nucleic Acids Research, 32(Database Issue), D189-D192.

Example

As an example, let's search the augmented database referenced above with a sequence from the Test database:
>d1awcb_ d.211.1.1 (B:) GA bindinig protein (GABP) beta 1 {Mouse (Mus musculus) [TaxId: 10090]}
DLGKKLLEAARAGQDDEVRILMANGAPFTTDWLGTSPLHLAAQYGHFSTTEVLLRAGVSRDARTKVDRTPLHMAASEGHA
NIVEVLLKHGADVNAKDMLKMTALHWATEHNHQEVVELLIKYGADVHTQSKFCKTAFDISIDNGNEDLAEILQ
This sequence can be downloaded as d1awcb_.fa. Before proceeding, you'll need to have compiled the source code and uncompressed the augmented database. Additionally, you'll need to make a BLAST database out of the augmented database. This can be accomplished with a command like:
formatdb  -i astral-scop-1.75A-withRandomSeqs100x.fa  -p T  -o T  -n astral-scop-1.75A-withRandomSeqs100x  -V
This should result in five files with names that start with astral-scop-1.75A-withRandomSeqs100x, with an extension of: .phr, .pin, .psd, .psi, and .psq. Now we're ready to search the database using the query sequence. To do so using the Benjamini-Hochberg method for thresholding, you would use a command like the following:
blastp  -db astral-scop-1.75A-withRandomSeqs100x  -query d1awcb_.fa  -comp_based_stats 0  -fdrDisplayBH 0.05  -out d1awcb_-blastp_fdr.txt
For verification purposes, the output can be downloaded here: d1awcb_-blastp_fdr.txt. The above command runs BLASTFDR. To run PSI-BLASTFDR, add the maximum number of iterations to perform with the -num_iterations parameter.
For evaluation purposes, the self-hit (the first hit) is ignored. Additionally, those sequences in the same superfamily as d1awcb_ are considered relevant ("true positives"). Random sequences are considered irrelevant ("false positives"). Lastly, to avoid erroneously classification, the remaining sequences are ignore (e.g., d2ae0x1).
                                                                      Score    E
Sequences producing significant alignments:                          (Bits)  Value   FDR   Classification
										     	    
lcl|d1awcb_  d.211.1.1 (B:) GA bindinig protein (GABP) beta 1 {Mo...   309   3e-107  0.05  ignored: self hit
lcl|d3twra_  d.211.1.0 (A:) automated matches {Human (Homo sapien...  94.4   4e-23   0.10  relevant
lcl|d1n11a_  d.211.1.1 (A:) Ankyrin-R {Human (Homo sapiens) [TaxI...  90.9   1e-20   0.15  relevant
lcl|d1uoha_  d.211.1.1 (A:) 26S proteasome non-ATPase regulatory ...  75.9   6e-16   0.35  relevant
lcl|d1bd8a_  d.211.1.1 (A:) Cell cycle inhibitor p19ink4D {Human ...  65.5   1e-12   0.65  relevant
lcl|d2f8ya_  d.211.1.1 (A:) automated matches {Human (Homo sapien...  65.9   2e-12   0.70  relevant
lcl|d1iknd_  d.211.1.1 (D:) I-kappa-B-alpha {Human (Homo sapiens)...  63.5   1e-11   0.85  relevant
lcl|d1wdya_  d.211.1.1 (A:) RNase L, 2-5a-dependent ribonuclease ...  59.7   5e-10   0.95  relevant
lcl|d1k1aa_  d.211.1.1 (A:) bcl-3 {Human (Homo sapiens) [TaxId: 9...  55.5   9e-09   1.05  relevant
lcl|d3aaac_  d.211.1.1 (C:) automated matches {Human (Homo sapien...  52.8   2e-08   1.10  relevant
lcl|d2fo1e1  d.211.1.1 (E:1021-1297) Lin-12 {Nematode (Caenorhabd...  52.8   1e-07   1.30  relevant
lcl|d1s70b_  d.211.1.1 (B:) Myosin phosphatase targeting subunit ...  52.0   3e-07   1.35  relevant
lcl|d2dzna_  d.211.1.1 (A:) automated matches {Baker's yeast (Sac...  49.7   1e-06   1.45  relevant
lcl|d1ycsb1  d.211.1.1 (B:327-456) 53BP2 {Human (Homo sapiens) [T...  45.1   2e-05   1.55  relevant
lcl|d1oy3d_  d.211.1.1 (D:) Transcription factor inhibitor I-kapp...  43.5   2e-04   1.80  relevant
lcl|random1096763  unnamed protein product                            35.4   0.14    2.65  irrelevant
lcl|random336037  unnamed protein product                             34.3   0.26    2.70  irrelevant
lcl|d1dcqa1  d.211.1.1 (A:369-522) Pyk2-associated protein beta {...  33.1   0.57    2.75  relevant
lcl|random627502  unnamed protein product                             33.1   0.81    2.80  irrelevant
lcl|d2ae0x1  b.52.1.4 (X:3-337) Membrane-bound lytic murein trans...  32.7   1.1     2.85  ignored: different superfamily
Above, the values in the FDR column are the FDR multiplied by the number of sequences in the database. Because of BLAST and PSI-BLAST's focus on E-values, BLASTFDR and PSI-BLASTFDR uses these modified FDR values. Additionally, for classification purposes, only the most significant hit for each sequence in the database is considered. Multiple hits to a sequence in the database are not shown. Consequently, the values in the FDR column do not increase strictly by the α set on the command-line (i.e., 0.05). For example, d1n11a_ has four HSPs (High Scoring Pairs), each with an E-value greater of equal to 2e-17. For comparison purposes, BLAST (using the default values) returns an additional five irrelevant sequences:
lcl|random927625  unnamed protein product                             30.4   4.2   
lcl|random831070  unnamed protein product                             30.8   4.7
lcl|random78264  unnamed protein product                              30.0   6.8
lcl|random548372  unnamed protein product                             28.9   8.4
lcl|random899104  unnamed protein product                             29.6   8.8

Databases

As detailed in Improving Retrieval Efficacy of Homology Searches using the False Discovery Rate, we augmented the Astral40 1.75A database by 100 fold with random sequences. This augmented database is available as astral-scop-1.75A-withRandomSeqs100x.fa.gz (119 MB). Additionally, the following files have the identifiers for the sequences of the respective databases: Training-subset (103 sequences), Training (5,162 sequences) , and Test (5,161 sequences).

Results

We executed BLAST, BLASTFDR, PSI-BLAST, and PSI-BLASTFDR on these three databases. The results of these runs are below, organized by database. In addition to the TAP (Carroll et al., 2010) method, we also calculated the mean precision and mean recall. The mean precision is the sum of the precision for each retrieval divided by the total number of retrievals. The mean recall is calculated in the same manner.

Training-sub Database:

BLAST
  Method     TAP     Mean Precision     Mean Recall  
  defaults     0.171     0.238     0.205  

BLASTFDR
  Method     α     TAP     Mean Precision     Mean Recall  
  Benjamini-Hochberg     0.0005     0.168     0.549     0.148  
  Benjamini-Hochberg     0.005     0.171     0.565     0.151  
  Benjamini-Hochberg     0.05     0.203     0.557     0.184  
  Benjamini-Hochberg     0.5     0.184     0.379     0.193  
  Bonferroni     0.0005     0.163     0.539     0.143  
  Bonferroni     0.005     0.170     0.569     0.141  
  Bonferroni     0.05     0.198     0.580     0.177  
  Bonferroni     0.5     0.199     0.526     0.185  
  Hochberg     0.0005     0.081     0.307     0.067  
  Hochberg     0.005     0.088     0.353     0.072  
  Hochberg     0.05     0.102     0.402     0.086  
  Hochberg     0.5     0.141     0.474     0.132  
  Holm     0.0005     0.163     0.539     0.143  
  Holm     0.005     0.170     0.569     0.141  
  Holm     0.05     0.198     0.580     0.177  
  Holm     0.5     0.199     0.526     0.185  
  Hommel     0.0005     0.163     0.539     0.143  
  Hommel     0.005     0.170     0.569     0.141  
  Hommel     0.05     0.198     0.580     0.177  
  Hommel     0.5     0.199     0.526     0.185  

PSI-BLAST
  Method     TAP     Mean Precision     Mean Recall  
  defaults     0.296     0.282     0.349  

PSI-BLASTFDR
  Method     α     TAP     Mean Precision     Mean Recall  
  Benjamini-Hochberg     0.0005     0.309     0.745     0.287  
  Benjamini-Hochberg     0.005     0.329     0.752     0.307  
  Benjamini-Hochberg     0.05     0.332     0.662     0.323  
  Benjamini-Hochberg     0.5     0.303     0.326     0.346  
  Bonferroni     0.0005     0.302     0.745     0.279  
  Bonferroni     0.005     0.319     0.765     0.294  
  Bonferroni     0.05     0.327     0.740     0.304  
  Bonferroni     0.5     0.323     0.626     0.323  
  Hochberg     0.0005     0.215     0.637     0.190  
  Hochberg     0.005     0.225     0.647     0.202  
  Hochberg     0.05     0.257     0.675     0.232  
  Hochberg     0.5     0.318     0.683     0.303  
  Holm     0.0005     0.302     0.745     0.279  
  Holm     0.005     0.319     0.765     0.294  
  Holm     0.05     0.327     0.740     0.304  
  Holm     0.5     0.323     0.626     0.323  
  Hommel     0.0005     0.302     0.745     0.279  
  Hommel     0.005     0.319     0.765     0.294  
  Hommel     0.05     0.327     0.740     0.304  
  Hommel     0.5     0.323     0.626     0.323  

Training Database:

BLAST
  Method     TAP     Mean Precision     Mean Recall  
  defaults     0.203     0.266     0.238  

BLASTFDR
  Method     α     TAP     Mean Precision     Mean Recall  
  Benjamini-Hochberg     0.0005     0.199     0.615     0.174  
  Benjamini-Hochberg     0.005     0.215     0.642     0.191  
  Benjamini-Hochberg     0.05     0.229     0.640     0.207  
  Benjamini-Hochberg     0.5     0.220     0.426     0.229  

PSI-BLAST
  Method     TAP     Mean Precision     Mean Recall  
  defaults     0.346     0.320     0.401  

PSI-BLASTFDR
  Method     α     TAP     Mean Precision     Mean Recall  
  Benjamini-Hochberg     0.05     0.385     0.763     0.371  

Test Database:

BLAST
  Method     TAP     Mean Precision     Mean Recall  
  defaults     0.198     0.266     0.235  

BLASTFDR
  Method     α     TAP     Mean Precision     Mean Recall  
  Benjamini-Hochberg     0.05     0.226     0.645     0.206  

PSI-BLAST
  Method     TAP     Mean Precision     Mean Recall  
  defaults     0.338     0.318     0.393  

PSI-BLASTFDR
  Method     α     TAP     Mean Precision     Mean Recall  
  Benjamini-Hochberg     0.05     0.378     0.761     0.366  
  Bonferroni     0.05     0.366     0.827     0.345  
  Hochberg     0.5     0.337     0.740     0.321  

References

Carroll,H.D., Kann,M.G., Sheetlin,S.L. and Spouge,J.L. (2010) Threshold Average Precision (TAP-k): A Measure of Retrieval Efficacy Designed for Bioinformatics. Bioinformatics, 26, pp 1708-1713. (web, pdf)

Supplementary Material

In Improving Retrieval Efficacy of Homology Searches using the False Discovery Rate, Figures 2 and 5 present histograms of the E-values of sequences in the Test database declared significant by BLAST (Figure 2) and PSI-BLAST (Figure 5) but not by BLASTFDR or PSI-BLASTFDR. To allow for reproducibility, these sequences are in the following files:

  • Figure 2: blastp-defaults_blastp_fdr-fdrDisplayBH_0.05-allHits.tab (BLAST sequences) (2.4 MB)
  • Figure 5: psiblast-defaults_psiblast_fdr-fdrDisplayBH_0.05-allHits.tab (PSI-BLAST sequences) (4.0 MB)
Copyright © Hyrum D. Carroll
Last Modified: