BLAST_FDR and PSI-BLAST

Introduction

BLAST_FDR and PSI-BLAST_FDR are variants of BLAST and PSI-BLAST, the ubiquitous genetic database search algorithms. Instead of using BLAST and PSI-BLAST's uniform E-value threshold criterion to determine which sequences to include, BLAST_FDR and PSI-BLAST_FDR utilize a false discovery rate (FDR) method. The default FDR method in both BLAST_FDR and PSI-BLAST_FDR is the Benjamini-Hochberg method (Benjamini & Hochberg, 1995). BLAST_FDR and PSI-BLAST_FDR include the first k sequences that meet the following condition: E_k ≤ kα with E_k as the E-value of the k^th sequence and α as a user-specified level. Employing a FDR threshold criterion allows BLAST_FDR and PSI-BLAST_FDR to retrieve fewer irrelevant sequences ("false positives") and to achieve better average Threshold Average Precision (TAP) (Carroll et al., 2010) values than BLAST and PSI-BLAST.

We evaluated BLAST_FDR and PSI-BLAST_FDR on an augmented version of the Astral40 database (Chandonia et al., 2004). BLAST_FDR received an average TAP value of 0.226 while BLAST received a 0.203 on a test subset. Furthermore, BLAST_FDR performs better on datasets that belong to smaller superfamilies. For example, BLAST_FDR achieves a TAP value of 0.089 better than BLAST for superfamilies with a size of twelve or fewer members. These superfamilies represent over 38% of the test datasets. BLAST_FDR performs better or equal to BLAST for over 61% of these datasets. PSI-BLAST_FDR also performs similarly when compared against PSI-BLAST.

Download and Installation Instructions

BLAST_FDR and PSI-BLAST_FDR are compiled just as BLAST is and can be used on any platform that BLAST is designed for. To compile BLAST_FDR and PSI-BLAST_FDR, the following BLAST version 2.2.27 files require modifications:

	  include/algo/blast/api/blast_options.hpp
	  include/algo/blast/api/blast_options_handle.hpp
	  include/algo/blast/api/psiblast_options.hpp
	  include/algo/blast/blastinput/cmdline_flags.hpp
	  include/algo/blast/core/blast_hits.h
	  include/algo/blast/core/blast_options.h
	  include/algo/blast/core/blast_program.h
	  include/algo/blast/format/data4xmlformat.hpp
	  include/objects/blast/names.hpp
	  include/objtools/align_format/align_format_util.hpp
	  src/algo/blast/api/blast_aux.cpp
	  src/algo/blast/api/blast_options_builder.cpp
	  src/algo/blast/api/blast_options_cxx.cpp
	  src/algo/blast/api/blast_options_local_priv.hpp
	  src/algo/blast/api/psi_pssm_input.cpp
	  src/algo/blast/api/psiblast.cpp
	  src/algo/blast/api/psiblast_aux_priv.cpp
	  src/algo/blast/api/psiblast_aux_priv.hpp
	  src/algo/blast/api/psiblast_iteration.cpp
	  src/algo/blast/api/psiblast_options.cpp
	  src/algo/blast/blastinput/blast_args.cpp
	  src/algo/blast/blastinput/cmdline_flags.cpp
	  src/algo/blast/core/blast_engine.c
	  src/algo/blast/core/blast_hits.c
	  src/algo/blast/core/blast_options.c
	  src/objects/blast/names.cpp
	  src/objtools/align_format/align_format_util.cpp

All of these files with the appropriate modifications can be downloaded in the following tarball: blast_fdr.tar.gz (also available via sourceforge.net/p/blastfdr/). To compile an executable, untar the files over the original version 2.2.27 source code files (tar -xzf blast_fdr.tar.gz). Alternatively, you can unzip the files into an existing copy of the source code (unzip blast_fdr.zip). If you're using a version after 2.2.27, apply the differences of each of the above files found in blast_fdr.diff.
After applying the modifications above, compile in the same way that you would compile BLAST.

Execution Instructions

After successfully compiling the source code, BLAST_FDR can be executed instead of BLAST by adding "-fdrDisplayBH α" to the command-line. Similarly for PSI-BLAST_FDR, add the same parameters to the command-line when executing PSI-BLAST. Empirically, we've found that α = 0.05 works best (at least on the augmented Astral40 database). Note, the -evalue parameter can continue to be used to influence the initial heuristic search algorithms.

License

BLAST_FDR and PSI-BLAST_FDR are extensions of BLAST and PSI-BLAST and therefore are subject to their licensing terms:

PUBLIC DOMAIN NOTICE
National Center for Biotechnology Information

This software/database is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the author's official duties as a United States Government employee and thus cannot be copyrighted. This software/database is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.

Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.

Please cite the author in any work or product based on this material.

Citations

To cite BLAST_FDR, please use the following:

Carroll,H.D., Williams,A.C. Davis,A.G., and Spouge,J.L. (2013) False Discovery Rate for Homology Searches. Proceedings of the 8^th Brazilian Symposium on Bioinformatics, pp 194-201.

To cite PSI-BLAST_FDR, please use the following:

Carroll,H.D., Williams,A.C. Davis,A.G., and Spouge,J.L. (2014) Improving Retrieval Efficacy of Homology Searches using the False Discovery Rate. Submitted.

References

Benjamini,Y. and Hochberg,Y. (1995). Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society, Series B, 57, 289-300.

Carroll,H.D., Kann,M.G., Sheetlin,S.L. and Spouge,J.L. (2010) Threshold Average Precision (TAP-k): A Measure of Retrieval Efficacy Designed for Bioinformatics. Bioinformatics, 26, pp 1708-1713. (web, pdf)

Chandonia,J., Hon,G., Walker,N., Lo Conte,L., Koehl,P., Levitt,M., and Brenner,S. (2004). The ASTRAL Compendium in 2004. Nucleic Acids Research, 32(Database Issue), D189-D192.

Example

As an example, let's search the augmented database referenced above with a sequence from the Test database:

>d1awcb_ d.211.1.1 (B:) GA bindinig protein (GABP) beta 1 {Mouse (Mus musculus) [TaxId: 10090]}
DLGKKLLEAARAGQDDEVRILMANGAPFTTDWLGTSPLHLAAQYGHFSTTEVLLRAGVSRDARTKVDRTPLHMAASEGHA
NIVEVLLKHGADVNAKDMLKMTALHWATEHNHQEVVELLIKYGADVHTQSKFCKTAFDISIDNGNEDLAEILQ

This sequence can be downloaded as d1awcb_.fa. Before proceeding, you'll need to have compiled the source code and uncompressed the augmented database. Additionally, you'll need to make a BLAST database out of the augmented database. This can be accomplished with a command like:

formatdb  -i astral-scop-1.75A-withRandomSeqs100x.fa  -p T  -o T  -n astral-scop-1.75A-withRandomSeqs100x  -V

This should result in five files with names that start with astral-scop-1.75A-withRandomSeqs100x, with an extension of: .phr, .pin, .psd, .psi, and .psq. Now we're ready to search the database using the query sequence. To do so using the Benjamini-Hochberg method for thresholding, you would use a command like the following:

blastp  -db astral-scop-1.75A-withRandomSeqs100x  -query d1awcb_.fa  -comp_based_stats 0  -fdrDisplayBH 0.05  -out d1awcb_-blastp_fdr.txt

For verification purposes, the output can be downloaded here: d1awcb_-blastp_fdr.txt. The above command runs BLAST_FDR. To run PSI-BLAST_FDR, add the maximum number of iterations to perform with the -num_iterations parameter.
For evaluation purposes, the self-hit (the first hit) is ignored. Additionally, those sequences in the same superfamily as d1awcb_ are considered relevant ("true positives"). Random sequences are considered irrelevant ("false positives"). Lastly, to avoid erroneously classification, the remaining sequences are ignore (e.g., d2ae0x1).

                                                                      Score    E
Sequences producing significant alignments:                          (Bits)  Value   FDR   Classification
										     	    
lcl|d1awcb_  d.211.1.1 (B:) GA bindinig protein (GABP) beta 1 {Mo...   309   3e-107  0.05  ignored: self hit
lcl|d3twra_  d.211.1.0 (A:) automated matches {Human (Homo sapien...  94.4   4e-23   0.10  relevant
lcl|d1n11a_  d.211.1.1 (A:) Ankyrin-R {Human (Homo sapiens) [TaxI...  90.9   1e-20   0.15  relevant
lcl|d1uoha_  d.211.1.1 (A:) 26S proteasome non-ATPase regulatory ...  75.9   6e-16   0.35  relevant
lcl|d1bd8a_  d.211.1.1 (A:) Cell cycle inhibitor p19ink4D {Human ...  65.5   1e-12   0.65  relevant
lcl|d2f8ya_  d.211.1.1 (A:) automated matches {Human (Homo sapien...  65.9   2e-12   0.70  relevant
lcl|d1iknd_  d.211.1.1 (D:) I-kappa-B-alpha {Human (Homo sapiens)...  63.5   1e-11   0.85  relevant
lcl|d1wdya_  d.211.1.1 (A:) RNase L, 2-5a-dependent ribonuclease ...  59.7   5e-10   0.95  relevant
lcl|d1k1aa_  d.211.1.1 (A:) bcl-3 {Human (Homo sapiens) [TaxId: 9...  55.5   9e-09   1.05  relevant
lcl|d3aaac_  d.211.1.1 (C:) automated matches {Human (Homo sapien...  52.8   2e-08   1.10  relevant
lcl|d2fo1e1  d.211.1.1 (E:1021-1297) Lin-12 {Nematode (Caenorhabd...  52.8   1e-07   1.30  relevant
lcl|d1s70b_  d.211.1.1 (B:) Myosin phosphatase targeting subunit ...  52.0   3e-07   1.35  relevant
lcl|d2dzna_  d.211.1.1 (A:) automated matches {Baker's yeast (Sac...  49.7   1e-06   1.45  relevant
lcl|d1ycsb1  d.211.1.1 (B:327-456) 53BP2 {Human (Homo sapiens) [T...  45.1   2e-05   1.55  relevant
lcl|d1oy3d_  d.211.1.1 (D:) Transcription factor inhibitor I-kapp...  43.5   2e-04   1.80  relevant
lcl|random1096763  unnamed protein product                            35.4   0.14    2.65  irrelevant
lcl|random336037  unnamed protein product                             34.3   0.26    2.70  irrelevant
lcl|d1dcqa1  d.211.1.1 (A:369-522) Pyk2-associated protein beta {...  33.1   0.57    2.75  relevant
lcl|random627502  unnamed protein product                             33.1   0.81    2.80  irrelevant
lcl|d2ae0x1  b.52.1.4 (X:3-337) Membrane-bound lytic murein trans...  32.7   1.1     2.85  ignored: different superfamily

Above, the values in the FDR column are the FDR multiplied by the number of sequences in the database. Because of BLAST and PSI-BLAST's focus on E-values, BLAST_FDR and PSI-BLAST_FDR uses these modified FDR values. Additionally, for classification purposes, only the most significant hit for each sequence in the database is considered. Multiple hits to a sequence in the database are not shown. Consequently, the values in the FDR column do not increase strictly by the α set on the command-line (i.e., 0.05). For example, d1n11a_ has four HSPs (High Scoring Pairs), each with an E-value greater of equal to 2e-17. For comparison purposes, BLAST (using the default values) returns an additional five irrelevant sequences:

lcl|random927625  unnamed protein product                             30.4   4.2   
lcl|random831070  unnamed protein product                             30.8   4.7
lcl|random78264  unnamed protein product                              30.0   6.8
lcl|random548372  unnamed protein product                             28.9   8.4
lcl|random899104  unnamed protein product                             29.6   8.8

Databases

As detailed in Improving Retrieval Efficacy of Homology Searches using the False Discovery Rate, we augmented the Astral40 1.75A database by 100 fold with random sequences. This augmented database is available as astral-scop-1.75A-withRandomSeqs100x.fa.gz (119 MB). Additionally, the following files have the identifiers for the sequences of the respective databases: Training-subset (103 sequences), Training (5,162 sequences) , and Test (5,161 sequences).

Results

We executed BLAST, BLAST_FDR, PSI-BLAST, and PSI-BLAST_FDR on these three databases. The results of these runs are below, organized by database. In addition to the TAP (Carroll et al., 2010) method, we also calculated the mean precision and mean recall. The mean precision is the sum of the precision for each retrieval divided by the total number of retrievals. The mean recall is calculated in the same manner.

Training-sub Database:

BLAST

Method	TAP	Mean Precision	Mean Recall
defaults	0.171	0.238	0.205

BLAST_FDR

Method	α	TAP	Mean Precision	Mean Recall
Benjamini-Hochberg	0.0005	0.168	0.549	0.148
Benjamini-Hochberg	0.005	0.171	0.565	0.151
Benjamini-Hochberg	0.05	0.203	0.557	0.184
Benjamini-Hochberg	0.5	0.184	0.379	0.193
Bonferroni	0.0005	0.163	0.539	0.143
Bonferroni	0.005	0.170	0.569	0.141
Bonferroni	0.05	0.198	0.580	0.177
Bonferroni	0.5	0.199	0.526	0.185
Hochberg	0.0005	0.081	0.307	0.067
Hochberg	0.005	0.088	0.353	0.072
Hochberg	0.05	0.102	0.402	0.086
Hochberg	0.5	0.141	0.474	0.132
Holm	0.0005	0.163	0.539	0.143
Holm	0.005	0.170	0.569	0.141
Holm	0.05	0.198	0.580	0.177
Holm	0.5	0.199	0.526	0.185
Hommel	0.0005	0.163	0.539	0.143
Hommel	0.005	0.170	0.569	0.141
Hommel	0.05	0.198	0.580	0.177
Hommel	0.5	0.199	0.526	0.185

PSI-BLAST

Method	TAP	Mean Precision	Mean Recall
defaults	0.296	0.282	0.349

PSI-BLAST_FDR

Method	α	TAP	Mean Precision	Mean Recall
Benjamini-Hochberg	0.0005	0.309	0.745	0.287
Benjamini-Hochberg	0.005	0.329	0.752	0.307
Benjamini-Hochberg	0.05	0.332	0.662	0.323
Benjamini-Hochberg	0.5	0.303	0.326	0.346
Bonferroni	0.0005	0.302	0.745	0.279
Bonferroni	0.005	0.319	0.765	0.294
Bonferroni	0.05	0.327	0.740	0.304
Bonferroni	0.5	0.323	0.626	0.323
Hochberg	0.0005	0.215	0.637	0.190
Hochberg	0.005	0.225	0.647	0.202
Hochberg	0.05	0.257	0.675	0.232
Hochberg	0.5	0.318	0.683	0.303
Holm	0.0005	0.302	0.745	0.279
Holm	0.005	0.319	0.765	0.294
Holm	0.05	0.327	0.740	0.304
Holm	0.5	0.323	0.626	0.323
Hommel	0.0005	0.302	0.745	0.279
Hommel	0.005	0.319	0.765	0.294
Hommel	0.05	0.327	0.740	0.304
Hommel	0.5	0.323	0.626	0.323

Training Database:

BLAST

Method	TAP	Mean Precision	Mean Recall
defaults	0.203	0.266	0.238

BLAST_FDR

Method	α	TAP	Mean Precision	Mean Recall
Benjamini-Hochberg	0.0005	0.199	0.615	0.174
Benjamini-Hochberg	0.005	0.215	0.642	0.191
Benjamini-Hochberg	0.05	0.229	0.640	0.207
Benjamini-Hochberg	0.5	0.220	0.426	0.229

PSI-BLAST

Method	TAP	Mean Precision	Mean Recall
defaults	0.346	0.320	0.401

PSI-BLAST_FDR

Method	α	TAP	Mean Precision	Mean Recall
Benjamini-Hochberg	0.05	0.385	0.763	0.371

Test Database:

BLAST

Method	TAP	Mean Precision	Mean Recall
defaults	0.198	0.266	0.235

BLAST_FDR

Method	α	TAP	Mean Precision	Mean Recall
Benjamini-Hochberg	0.05	0.226	0.645	0.206

PSI-BLAST

Method	TAP	Mean Precision	Mean Recall
defaults	0.338	0.318	0.393

PSI-BLAST_FDR

Method	α	TAP	Mean Precision	Mean Recall
Benjamini-Hochberg	0.05	0.378	0.761	0.366
Bonferroni	0.05	0.366	0.827	0.345
Hochberg	0.5	0.337	0.740	0.321

References

Carroll,H.D., Kann,M.G., Sheetlin,S.L. and Spouge,J.L. (2010) Threshold Average Precision (TAP-k): A Measure of Retrieval Efficacy Designed for Bioinformatics. Bioinformatics, 26, pp 1708-1713. (web, pdf)

Supplementary Material

In Improving Retrieval Efficacy of Homology Searches using the False Discovery Rate, Figures 2 and 5 present histograms of the E-values of sequences in the Test database declared significant by BLAST (Figure 2) and PSI-BLAST (Figure 5) but not by BLAST_FDR or PSI-BLAST_FDR. To allow for reproducibility, these sequences are in the following files:

Figure 2: blastp-defaults_blastp_fdr-fdrDisplayBH_0.05-allHits.tab (BLAST sequences) (2.4 MB)
Figure 5: psiblast-defaults_psiblast_fdr-fdrDisplayBH_0.05-allHits.tab (PSI-BLAST sequences) (4.0 MB)

BLASTFDR

Introduction

Download and Installation Instructions

Execution Instructions

License

Citations

References

Example

Databases

Results

Training-sub Database:

Training Database:

Test Database:

References

Supplementary Material

BLAST_FDR