BLASTFDR and PSI-BLASTFDR are variants of BLAST and PSI-BLAST, the ubiquitous genetic database search algorithms. Instead of using BLAST and PSI-BLAST's uniform E-value threshold criterion to determine which sequences to include, BLASTFDR and PSI-BLASTFDR utilize a false discovery rate (FDR) method. The default FDR method in both BLASTFDR and PSI-BLASTFDR is the Benjamini-Hochberg method (Benjamini & Hochberg, 1995). BLASTFDR and PSI-BLASTFDR include the first k sequences that meet the following condition: Ek ≤ kα with Ek as the E-value of the kth sequence and α as a user-specified level. Employing a FDR threshold criterion allows BLASTFDR and PSI-BLASTFDR to retrieve fewer irrelevant sequences ("false positives") and to achieve better average Threshold Average Precision (TAP) (Carroll et al., 2010) values than BLAST and PSI-BLAST.
We evaluated BLASTFDR and PSI-BLASTFDR on an augmented version of the Astral40 database (Chandonia et al., 2004). BLASTFDR received an average TAP value of 0.226 while BLAST received a 0.203 on a test subset. Furthermore, BLASTFDR performs better on datasets that belong to smaller superfamilies. For example, BLASTFDR achieves a TAP value of 0.089 better than BLAST for superfamilies with a size of twelve or fewer members. These superfamilies represent over 38% of the test datasets. BLASTFDR performs better or equal to BLAST for over 61% of these datasets. PSI-BLASTFDR also performs similarly when compared against PSI-BLAST.
BLASTFDR and PSI-BLASTFDR are compiled just as BLAST is and can be used on any platform that BLAST is designed for. To compile BLASTFDR and PSI-BLASTFDR, the following BLAST version 2.2.27 files require modifications:
include/algo/blast/api/blast_options.hpp include/algo/blast/api/blast_options_handle.hpp include/algo/blast/api/psiblast_options.hpp include/algo/blast/blastinput/cmdline_flags.hpp include/algo/blast/core/blast_hits.h include/algo/blast/core/blast_options.h include/algo/blast/core/blast_program.h include/algo/blast/format/data4xmlformat.hpp include/objects/blast/names.hpp include/objtools/align_format/align_format_util.hpp src/algo/blast/api/blast_aux.cpp src/algo/blast/api/blast_options_builder.cpp src/algo/blast/api/blast_options_cxx.cpp src/algo/blast/api/blast_options_local_priv.hpp src/algo/blast/api/psi_pssm_input.cpp src/algo/blast/api/psiblast.cpp src/algo/blast/api/psiblast_aux_priv.cpp src/algo/blast/api/psiblast_aux_priv.hpp src/algo/blast/api/psiblast_iteration.cpp src/algo/blast/api/psiblast_options.cpp src/algo/blast/blastinput/blast_args.cpp src/algo/blast/blastinput/cmdline_flags.cpp src/algo/blast/core/blast_engine.c src/algo/blast/core/blast_hits.c src/algo/blast/core/blast_options.c src/objects/blast/names.cpp src/objtools/align_format/align_format_util.cppAll of these files with the appropriate modifications can be downloaded in the following tarball: blast_fdr.tar.gz (also available via sourceforge.net/p/blastfdr/). To compile an executable, untar the files over the original version 2.2.27 source code files (tar -xzf blast_fdr.tar.gz). Alternatively, you can unzip the files into an existing copy of the source code (unzip blast_fdr.zip). If you're using a version after 2.2.27, apply the differences of each of the above files found in blast_fdr.diff.
>d1awcb_ d.211.1.1 (B:) GA bindinig protein (GABP) beta 1 {Mouse (Mus musculus) [TaxId: 10090]} DLGKKLLEAARAGQDDEVRILMANGAPFTTDWLGTSPLHLAAQYGHFSTTEVLLRAGVSRDARTKVDRTPLHMAASEGHA NIVEVLLKHGADVNAKDMLKMTALHWATEHNHQEVVELLIKYGADVHTQSKFCKTAFDISIDNGNEDLAEILQ
formatdb -i astral-scop-1.75A-withRandomSeqs100x.fa -p T -o T -n astral-scop-1.75A-withRandomSeqs100x -V
blastp -db astral-scop-1.75A-withRandomSeqs100x -query d1awcb_.fa -comp_based_stats 0 -fdrDisplayBH 0.05 -out d1awcb_-blastp_fdr.txt
Score E Sequences producing significant alignments: (Bits) Value FDR Classification lcl|d1awcb_ d.211.1.1 (B:) GA bindinig protein (GABP) beta 1 {Mo... 309 3e-107 0.05 ignored: self hit lcl|d3twra_ d.211.1.0 (A:) automated matches {Human (Homo sapien... 94.4 4e-23 0.10 relevant lcl|d1n11a_ d.211.1.1 (A:) Ankyrin-R {Human (Homo sapiens) [TaxI... 90.9 1e-20 0.15 relevant lcl|d1uoha_ d.211.1.1 (A:) 26S proteasome non-ATPase regulatory ... 75.9 6e-16 0.35 relevant lcl|d1bd8a_ d.211.1.1 (A:) Cell cycle inhibitor p19ink4D {Human ... 65.5 1e-12 0.65 relevant lcl|d2f8ya_ d.211.1.1 (A:) automated matches {Human (Homo sapien... 65.9 2e-12 0.70 relevant lcl|d1iknd_ d.211.1.1 (D:) I-kappa-B-alpha {Human (Homo sapiens)... 63.5 1e-11 0.85 relevant lcl|d1wdya_ d.211.1.1 (A:) RNase L, 2-5a-dependent ribonuclease ... 59.7 5e-10 0.95 relevant lcl|d1k1aa_ d.211.1.1 (A:) bcl-3 {Human (Homo sapiens) [TaxId: 9... 55.5 9e-09 1.05 relevant lcl|d3aaac_ d.211.1.1 (C:) automated matches {Human (Homo sapien... 52.8 2e-08 1.10 relevant lcl|d2fo1e1 d.211.1.1 (E:1021-1297) Lin-12 {Nematode (Caenorhabd... 52.8 1e-07 1.30 relevant lcl|d1s70b_ d.211.1.1 (B:) Myosin phosphatase targeting subunit ... 52.0 3e-07 1.35 relevant lcl|d2dzna_ d.211.1.1 (A:) automated matches {Baker's yeast (Sac... 49.7 1e-06 1.45 relevant lcl|d1ycsb1 d.211.1.1 (B:327-456) 53BP2 {Human (Homo sapiens) [T... 45.1 2e-05 1.55 relevant lcl|d1oy3d_ d.211.1.1 (D:) Transcription factor inhibitor I-kapp... 43.5 2e-04 1.80 relevant lcl|random1096763 unnamed protein product 35.4 0.14 2.65 irrelevant lcl|random336037 unnamed protein product 34.3 0.26 2.70 irrelevant lcl|d1dcqa1 d.211.1.1 (A:369-522) Pyk2-associated protein beta {... 33.1 0.57 2.75 relevant lcl|random627502 unnamed protein product 33.1 0.81 2.80 irrelevant lcl|d2ae0x1 b.52.1.4 (X:3-337) Membrane-bound lytic murein trans... 32.7 1.1 2.85 ignored: different superfamily
lcl|random927625 unnamed protein product 30.4 4.2 lcl|random831070 unnamed protein product 30.8 4.7 lcl|random78264 unnamed protein product 30.0 6.8 lcl|random548372 unnamed protein product 28.9 8.4 lcl|random899104 unnamed protein product 29.6 8.8
As detailed in Improving Retrieval Efficacy of Homology Searches using the False Discovery Rate, we augmented the Astral40 1.75A database by 100 fold with random sequences. This augmented database is available as astral-scop-1.75A-withRandomSeqs100x.fa.gz (119 MB). Additionally, the following files have the identifiers for the sequences of the respective databases: Training-subset (103 sequences), Training (5,162 sequences) , and Test (5,161 sequences).
We executed BLAST, BLASTFDR, PSI-BLAST, and PSI-BLASTFDR on these three databases. The results of these runs are below, organized by database. In addition to the TAP (Carroll et al., 2010) method, we also calculated the mean precision and mean recall. The mean precision is the sum of the precision for each retrieval divided by the total number of retrievals. The mean recall is calculated in the same manner.
Method | TAP | Mean Precision | Mean Recall |
---|---|---|---|
defaults | 0.171 | 0.238 | 0.205 |
Method | α | TAP | Mean Precision | Mean Recall |
---|---|---|---|---|
Benjamini-Hochberg | 0.0005 | 0.168 | 0.549 | 0.148 |
Benjamini-Hochberg | 0.005 | 0.171 | 0.565 | 0.151 |
Benjamini-Hochberg | 0.05 | 0.203 | 0.557 | 0.184 |
Benjamini-Hochberg | 0.5 | 0.184 | 0.379 | 0.193 |
Bonferroni | 0.0005 | 0.163 | 0.539 | 0.143 |
Bonferroni | 0.005 | 0.170 | 0.569 | 0.141 |
Bonferroni | 0.05 | 0.198 | 0.580 | 0.177 |
Bonferroni | 0.5 | 0.199 | 0.526 | 0.185 |
Hochberg | 0.0005 | 0.081 | 0.307 | 0.067 |
Hochberg | 0.005 | 0.088 | 0.353 | 0.072 |
Hochberg | 0.05 | 0.102 | 0.402 | 0.086 |
Hochberg | 0.5 | 0.141 | 0.474 | 0.132 |
Holm | 0.0005 | 0.163 | 0.539 | 0.143 |
Holm | 0.005 | 0.170 | 0.569 | 0.141 |
Holm | 0.05 | 0.198 | 0.580 | 0.177 |
Holm | 0.5 | 0.199 | 0.526 | 0.185 |
Hommel | 0.0005 | 0.163 | 0.539 | 0.143 |
Hommel | 0.005 | 0.170 | 0.569 | 0.141 |
Hommel | 0.05 | 0.198 | 0.580 | 0.177 |
Hommel | 0.5 | 0.199 | 0.526 | 0.185 |
Method | TAP | Mean Precision | Mean Recall |
---|---|---|---|
defaults | 0.296 | 0.282 | 0.349 |
Method | α | TAP | Mean Precision | Mean Recall |
---|---|---|---|---|
Benjamini-Hochberg | 0.0005 | 0.309 | 0.745 | 0.287 |
Benjamini-Hochberg | 0.005 | 0.329 | 0.752 | 0.307 |
Benjamini-Hochberg | 0.05 | 0.332 | 0.662 | 0.323 |
Benjamini-Hochberg | 0.5 | 0.303 | 0.326 | 0.346 |
Bonferroni | 0.0005 | 0.302 | 0.745 | 0.279 |
Bonferroni | 0.005 | 0.319 | 0.765 | 0.294 |
Bonferroni | 0.05 | 0.327 | 0.740 | 0.304 |
Bonferroni | 0.5 | 0.323 | 0.626 | 0.323 |
Hochberg | 0.0005 | 0.215 | 0.637 | 0.190 |
Hochberg | 0.005 | 0.225 | 0.647 | 0.202 |
Hochberg | 0.05 | 0.257 | 0.675 | 0.232 |
Hochberg | 0.5 | 0.318 | 0.683 | 0.303 |
Holm | 0.0005 | 0.302 | 0.745 | 0.279 |
Holm | 0.005 | 0.319 | 0.765 | 0.294 |
Holm | 0.05 | 0.327 | 0.740 | 0.304 |
Holm | 0.5 | 0.323 | 0.626 | 0.323 |
Hommel | 0.0005 | 0.302 | 0.745 | 0.279 |
Hommel | 0.005 | 0.319 | 0.765 | 0.294 |
Hommel | 0.05 | 0.327 | 0.740 | 0.304 |
Hommel | 0.5 | 0.323 | 0.626 | 0.323 |
Method | TAP | Mean Precision | Mean Recall |
---|---|---|---|
defaults | 0.203 | 0.266 | 0.238 |
Method | α | TAP | Mean Precision | Mean Recall |
---|---|---|---|---|
Benjamini-Hochberg | 0.0005 | 0.199 | 0.615 | 0.174 |
Benjamini-Hochberg | 0.005 | 0.215 | 0.642 | 0.191 |
Benjamini-Hochberg | 0.05 | 0.229 | 0.640 | 0.207 |
Benjamini-Hochberg | 0.5 | 0.220 | 0.426 | 0.229 |
Method | TAP | Mean Precision | Mean Recall |
---|---|---|---|
defaults | 0.346 | 0.320 | 0.401 |
Method | α | TAP | Mean Precision | Mean Recall |
---|---|---|---|---|
Benjamini-Hochberg | 0.05 | 0.385 | 0.763 | 0.371 |
Method | TAP | Mean Precision | Mean Recall |
---|---|---|---|
defaults | 0.198 | 0.266 | 0.235 |
Method | α | TAP | Mean Precision | Mean Recall |
---|---|---|---|---|
Benjamini-Hochberg | 0.05 | 0.226 | 0.645 | 0.206 |
Method | TAP | Mean Precision | Mean Recall |
---|---|---|---|
defaults | 0.338 | 0.318 | 0.393 |
Method | α | TAP | Mean Precision | Mean Recall |
---|---|---|---|---|
Benjamini-Hochberg | 0.05 | 0.378 | 0.761 | 0.366 |
Bonferroni | 0.05 | 0.366 | 0.827 | 0.345 |
Hochberg | 0.5 | 0.337 | 0.740 | 0.321 |
In Improving Retrieval Efficacy of Homology Searches using the False Discovery Rate, Figures 2 and 5 present histograms of the E-values of sequences in the Test database declared significant by BLAST (Figure 2) and PSI-BLAST (Figure 5) but not by BLASTFDR or PSI-BLASTFDR. To allow for reproducibility, these sequences are in the following files: