
Curr Protoc Bioinformatics. Author manuscript; available in PMC 2014 Oct 15.

Published in final edited form as:

PMCID: PMC3848038

NIHMSID: NIHMS533669

Selecting the Right Similarity-Scoring Matrix

William R. Pearson

¹University of Virginia School of Medicine, Charlottesville, VA

Abstract

Protein sequence similarity searching programs like BLASTP, SSEARCH (UNIT 3.10), and FASTA use scoring matrices that are designed to identify distant evolutionary relationships (BLOSUM62 for BLAST, BLOSUM50 for SSEARCH and FASTA). Different similarity scoring matrices are most effective at different evolutionary distances. "Deep" scoring matrices like BLOSUM62 and BLOSUM50 target alignments with 20 – 30% identity, while "shallow" scoring matrices (e.g. VTML10 – VTML80) target alignments that share 90 – 50% identity, reflecting much less evolutionary change. While "deep" matrices provide very sensitive similarity searches, they also require longer sequence alignments and can sometimes produce alignment overextension into non-homologous regions. Shallower scoring matrices are more effective when searching for short protein domains, or when the goal is to limit the scope of the search to sequences that are likely to be orthologous between recently diverged organisms. Likewise, in DNA searches, the match and mismatch parameters set evolutionary look-back times and domain boundaries. In this unit, we will discuss the theoretical foundations that drive practical choices of protein and DNA similarity scoring matrices and gap penalties. Deep scoring matrices (BLOSUM62 and BLOSUM50) should be used for sensitive searches with full-length protein sequences, but short domains or restricted evolutionary look-back require shallower scoring matrices.

Keywords: similarity scoring matrices, PAM matrices, BLOSUM matrices, sequence alignment

SIMILARITY SEARCHING, HOMOLOGY, AND STATISTICAL SIGNIFICANCE

Protein similarity scoring matrices dramatically improve evolutionary look-back time, because they capture amino-acid substitution preferences that have emerged over evolutionary time. Amino-acid changes can range from biochemically conservative, e.g., leucine to valine or arginine to lysine, to dramatically different, e.g., tryptophan to glycine. Amino-acid scoring matrices capture this evolutionary information; conservative changes receive positive scores, while non-conservative changes receive the largest negative scores. As a result, statistical expectation values (E-values) based on amino-acid similarity scores are far more sensitive than percent identity for finding homologs (UNIT 3.1).

In this Unit, we provide a brief overview of the history of scoring matrices, the algebra used to calculate scoring matrices, and the important concepts of matrix information content and matrix target evolutionary distance. Because finding distantly related protein sequences is more challenging than finding closely related sequences, the BLOSUM62 matrix used by the BLAST programs and the BLOSUM50 matrix used by the FASTA programs are designed to identify distant homologs using long (typically full-length) sequences. Understanding the explicit or implicit evolutionary models used in similarity scoring matrices makes it much easier to choose the right scoring matrix. Generally, searches for short domains (or with shorter query sequences) require shallower scoring matrices. Also, shallow scoring matrices can be more effective at highlighting common orthologs when comparing proteins that have diverged in the past 100 - 500 million years. While deep scoring matrices are more effective in identifying distant relationships, they can also contribute to homologous overextension when two closely related domains are embedded in non-homologous protein contexts. Using the appropriate scoring matrix can improve both search sensitivity and alignment accuracy.

AMINO-ACID SUBSTITUTION MATRICES - HISTORY AND CLASSIFICATION

The earliest amino-acid scoring matrices were based on amino-acid properties or genetic code differences, but modern amino-acid scoring matrices are based on empirical measurements of amino-acid replacement frequencies from large sets of homologous sequences (Schwartz and Dayhoff, 1978). Empirical replacement frequency scoring matrices can be divided into two types: those with an explicit evolutionary model and the BLOSUM scoring matrices. Model-based scoring matrices include Dayhoff's original PAM series of matrices (Schwartz and Dayhoff, 1978), which were updated by Jones, Taylor and Thornton (Jones et al., 1992). More recently, Gonnet (Gonnet et al., 1992) and Vingron and Mueller (VT and VTML; Mueller et al., 2002) developed model-based parameters using alignments between more distantly related proteins.

Model-based scoring matrices are appealing considering they can be calculated for alignments at any evolutionary distance. Dayhoff's original PAM250 matrix was calculated based on 1572 observed mutations in 71 families of proteins with alignments that were more than 85% identical. The frequency of mutations was normalized for 1% change (99% identity), or PAM1, and then extrapolated to much longer evolutionary distances simply by multiplying the replacement frequency matrix. Thus, PAM10 corresponds to about 90% identity, PAM30 75% identity, PAM70 55% identity, PAM120 37% identity, and PAM250 about 20% identity. Table 1 presents a more comprehensive set of scoring matrices and target percent identities. More recently, Vingron and Mueller described strategies for estimating replacement frequencies that use measurements from a broader range of evolutionary distances. However, evolutionary models assume that the model accurately describes replacement frequencies over long evolutionary times (Mueller et al., 2002).
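The extrapolation step can be illustrated with a deliberately simplified sketch. It assumes a toy PAM1 matrix in which every amino acid is replaced at the same rate and all 19 replacements are equally likely; this is an assumption made for illustration, not Dayhoff's measured replacement frequencies, so the identities it prints only roughly track the values quoted above. The function names are hypothetical.

```python
# Toy illustration of PAM extrapolation: repeatedly apply a 1%-change
# replacement matrix. ASSUMPTION: a uniform replacement model, not
# Dayhoff's empirical PAM1 frequencies.
N = 20  # number of amino acids


def pam1_uniform(change=0.01):
    """Toy PAM1 matrix: 99% chance of no change, 1% spread over 19 others."""
    off_diagonal = change / (N - 1)
    return [[(1.0 - change) if i == j else off_diagonal for j in range(N)]
            for i in range(N)]


def step(dist, matrix):
    """Advance a residue probability distribution by one PAM unit."""
    return [sum(dist[i] * matrix[i][j] for i in range(N)) for j in range(N)]


def expected_identity(pam_distance):
    """Probability that a position still holds its original residue."""
    matrix = pam1_uniform()
    dist = [1.0] + [0.0] * (N - 1)  # start as one specific residue
    for _ in range(pam_distance):
        dist = step(dist, matrix)
    return dist[0]


if __name__ == "__main__":
    for n in (10, 30, 70, 120, 250):
        print(f"PAM{n}: {100 * expected_identity(n):.1f}% identity (toy model)")
```

Because real amino acids differ in abundance and mutability, an empirical PAM1 matrix retains more identity at long distances than this uniform model does; the sketch only shows how multiplying the matrix extends a short-distance estimate to longer ones.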

Table 1

Scoring matrix target identity, information content, and alignment length.

Matrix	gap¹ penalty	% ident.	bits / pos.	random aln. len.	50-bit length

SSEARCH version 36.3.6

BLOSUM50²	10/2	25.3	0.21	160	238
BLOSUM62	11/1	28.9	0.40	86	125
VTML 160²,³	12/2	23.9	0.25	139	200
VTML 140	10/1	28.4	0.44	82	114
VTML 120	11/1	32.1	0.54	62	93
VTML 80	10/1	40.5	0.74	47	68
VTML 40	13/1	64.7	1.92	18	26
VTML 20	15/2	86.1	3.30	11	15
VTML 10	16/2	90.9	3.87	9	13

BLAST version 2.2.27+

BLOSUM50²	13/2	29.4	0.39	85	128
BLOSUM62	11/1	29.6	0.41	82	122
BLOSUM80	10/1	32.0	0.48	69	104
PAM70	10/1	33.9	0.58	56	86
PAM30	9/1	45.9	0.90	34	56

In 1992, Steve and Jorja Henikoff described a direct approach to counting replacement frequencies at long evolutionary distances (Henikoff and Henikoff, 1992). The BLOSUM series of scoring matrices avoided the problem of extrapolating from PAM1 replacement frequencies by counting replacement frequencies directly. Rather than relying on alignments of relatively closely related proteins, they identified conserved BLOCKS, or ungapped patches of conserved sequences, in sets of proteins that were potentially very distantly related. They then counted the amino-acid replacements within these blocks, using a percent identity threshold to exclude closely and more moderately related sequences. In their description of the BLOSUM matrices, they showed that BLOSUM62 performed much more effectively than either the PAM120 (BLOSUM62 equivalent information content) or the PAM250 matrix (BLOSUM45 equivalent) for identifying distant homologs. BLOSUM62 was then incorporated as the default for the BLASTP (UNIT 3.4) program, while FASTA (UNIT 3.9) and SSEARCH (UNIT 3.10) switched to the BLOSUM50 matrix, which is more sensitive than BLOSUM62, but requires longer alignments.

THE ALGEBRA OF SIMILARITY SCORING (LOG-ODDS) MATRICES

Scoring matrices as odds ratios

Similarity scoring matrices for local sequence alignments, which are rigorously calculated by the Smith-Waterman algorithm (Smith and Waterman, 1981), and heuristically by BLASTP (Altschul et al., 1990; Altschul et al., 1997) or FASTA (Pearson and Lipman, 1988), require scoring matrices that produce negative values on average between random sequences. (If the average or expected matrix score is positive, the alignment will extend to the ends of the sequences, and be global, rather than local.) Dayhoff's initial PAM matrices were calculated as log odds-ratios: the logarithm of the ratio of the alignment frequency observed after a given evolutionary distance divided by the alignment frequency expected by chance, $\log\left(\frac{\text{frequency in homologs}}{\text{frequency by chance}}\right)$. The Henikoffs used the same odds-ratio algebra when developing the BLOSUM matrices, but calculated their transition frequencies by counting the number of weighted changes in different blocks.

In 1991, Altschul published a seminal paper (Altschul, 1991) that showed that any scoring matrix appropriate for local alignments (one with a negative expected score) could be treated as a "log-odds" matrix of the form $\lambda s_{i,j} = \log(q_{i,j}/p_i p_j)$, where $s_{i,j}$ is the score given to the i,j alignment, $q_{i,j}$ is the replacement frequency for amino-acid i to j, and the $p_i p_j$ term gives the expected frequency of two amino-acids aligning by chance. The $\lambda$ term is used to scale the matrix so that individual scores can be accurately represented with integers. Widely used scoring matrix values typically range from −10 to +20, reflecting $\lambda$ scale factors of ln(2)/2 (half-bit units, used by BLOSUM62 and PAM120) or ln(2)/3 (third-bit units, used by BLOSUM50 and PAM250). For instance, the BLOSUM62 score for aligning aspartic acid ('D') with itself is +6 and BLOSUM62 is scaled in 1/2-bit units, so for a D:D alignment in related proteins $6 = 2.0 \log_2(q_{D,D}/p_D p_D)$; the alignment is $2^3 = 8$ times more likely to occur because of homology than by chance. Likewise, the BLOSUM62 matrix assigns a D:L alignment a score of −4, which means it is $2^2 = 4$ times more likely to occur by chance than in the homologous blocks aligned for BLOSUM62.
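The arithmetic in the D:D and D:L examples can be checked with a few lines of code. This is a minimal sketch that only inverts the log-odds relation for a matrix scaled in fractions of a bit; the function name and its `units_per_bit` parameter are illustrative, not part of any alignment program.

```python
def odds_ratio(score, units_per_bit=2):
    """Convert a log-odds matrix score back to an odds ratio.

    units_per_bit=2 means half-bit units (e.g. BLOSUM62);
    units_per_bit=3 means third-bit units (e.g. BLOSUM50).
    """
    return 2.0 ** (score / units_per_bit)


if __name__ == "__main__":
    # BLOSUM62 D:D score of +6 half-bits: homology favored over chance.
    print("D:D", odds_ratio(6))
    # BLOSUM62 D:L score of -4 half-bits: a ratio below 1 favors chance.
    print("D:L", odds_ratio(-4))
```

A ratio of 0.25 for D:L is the same statement as "4 times more likely by chance": the odds ratio is simply inverted when the score is negative.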

This ratio of homologous replacement frequency to chance alignment frequency explains why modern scoring matrices can give very different scores to identical residues. In the denominator, amino acids are not uniformly abundant (common amino acids like 'L', 'A', 'S', and 'G' are found more than 4-times more frequently than rare amino acids like 'W', 'C', 'H', and 'M'), so common amino acids often have lower identity scores than rare ones. Likewise, amino acids are not uniformly mutable; 'A', 'S', and 'T' change frequently over evolutionary time, while 'W' and 'C' change rarely. Thus, the highest identity score in the BLOSUM62 matrix (Fig. 1) is 11, corresponding to a W:W alignment, while 'A', 'I', 'L', 'S', and 'V' get identity alignment scores of 4. Differences in identity scores, together with positive scores for non-identity alignments between conserved amino acids, explain why sequence similarity scores are dramatically more sensitive than percent identity for inferring homology (see UNIT 3.1).

Figure 1. The BLOSUM62 matrix

The BLOSUM62 matrix used by BLASTP, BLASTX, and TBLASTN is actually 23 × 23: 20 amino acids plus 'X' (any amino acid), 'B' ('D' or 'N') and 'Z' ('E' or 'Q'). Only the lower half of the symmetric matrix is shown to highlight the identity scores on the diagonal. The most positive value is 11 ('W:W' alignment); the most negative is −4 (found for many hydrophobic/hydrophilic and small/large replacements). The BLOSUM62 matrix is scaled in 1/2-bit units, so the W:W alignment of 11 is 2^5.5 = 45 times more common in homologous proteins than by chance. Weighted by amino acid abundance, the average similarity score is about −1 half-bits.

Matrix information content, target identity, and alignment length

In addition to generalizing scoring matrices as log-odds matrices, Altschul (1991) also showed that log-odds scoring matrices have an associated information content (relative entropy), or score per aligned position ("bits-per-position"). "Bits-per-position" can be used to estimate the number of aligned residues required to produce a statistically significant score. Shallow scoring matrices (e.g., PAM/VTML 10, PAM/VTML 20, or PAM/VTML 40) have higher information content than deep matrices (BLOSUM62, PAM250), which means that a shorter alignment (10 - 50 residues) can produce a more statistically significant score. At the same time, shallower matrices tend to produce higher identity alignments, because they give higher positive scores to identities and more negative scores to replacements (Table 1, Fig. 2). For example, if an alignment needs a 50-bit score to be significant in a database search (UNIT 3.1), and the average bit score for BLOSUM62 is about 0.4 bits per aligned position (Table 1), then about 50/0.4 = 125 residues must be included in the alignment. In contrast, the VTML20 matrix provides about 3.3 bits per aligned position, so even a 15 residue alignment can be significant. Thus, in a large-scale similarity search that needs a 50 bit score for statistical significance, domains shorter than 125 amino acids, or DNA exons shorter than 375 nucleotides, often would not produce statistically significant scores with BLOSUM62, the default matrix used by BLAST, while exons shorter than 50 nucleotides can easily be detected with VTML20.
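The length estimate in this paragraph is a single division, sketched here for a few rows of Table 1. The bits-per-position values are copied from the SSEARCH half of the table; the dictionary and function names are illustrative, and the 50-bit target is the threshold assumed in the text, not a universal constant.

```python
# Bits per aligned position, copied from Table 1 (SSEARCH rows).
BITS_PER_POSITION = {
    "BLOSUM50": 0.21,
    "BLOSUM62": 0.40,
    "VTML 80": 0.74,
    "VTML 20": 3.30,
}


def residues_needed(bits_per_position, target_bits=50.0):
    """Approximate alignment length needed to reach target_bits."""
    return round(target_bits / bits_per_position)


if __name__ == "__main__":
    for matrix, bits in BITS_PER_POSITION.items():
        residues = residues_needed(bits)
        # Three nucleotides encode one residue in a translated search.
        print(f"{matrix}: ~{residues} residues (~{3 * residues} nt)")
```

Dividing the same 50-bit target by each matrix's information content reproduces the "50-bit length" column of Table 1 for these rows.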

Figure 2. Comparison of a "shallow" (VTML 20) and "deep" (BLOSUM62) scoring matrix

Both matrices are scaled in 1/2-bits. For the small part of the matrices shown here, the VTML20 matrix produces an average 2.80 half-bit identity score, and an average −0.59 non-identical score (weighted by amino-acid abundance). In contrast, BLOSUM62 produces 1.86 for identities but only −0.06 for non-identities. Thus, VTML20 targets shorter, higher-identity alignments, because it penalizes non-identities much more strongly.

"Shallow" scoring matrices have more information content because they give more positive scores to identities and more negative scores to non-identical replacements by varying the $q_{i,j}$ term in the log-odds matrices (the $p_i p_j$ values do not depend on evolutionary distance). From the evolutionary perspective, sequences that have diverged for less time, e.g., 10 – 20% change, will have more identical residues and fewer replacements simply because there has been less time for the sequences to change. Alternatively, sequences that have less than 25% identity because of a large amount of change will have many fewer identities and many more conservative replacements (PAM200 sequences will be less than 25% identical, on average). The numerical basis for this difference can be seen in Fig. 2, which compares parts of a "shallow" (VTML 20) and "deep" (BLOSUM62) matrix. Thus, in addition to differing in information content, scoring matrices have a range of target percent identities and alignment lengths (Table 1). Shallower scoring matrices produce shorter, more identical alignments, because they give more negative scores to non-identical aligned residues. "Deeper" scoring matrices produce longer alignments with lower percent identities because the penalty for a mismatch is much lower and more conservative non-identities get positive scores.

In practice, the relationship between scoring matrix evolutionary distance, information content, percent identity, and alignment length suggests two reasons for changing from the BLOSUM62 and BLOSUM50 matrices used by BLASTP and SSEARCH/FASTA. First, one should change to a shallower matrix when looking for short alignments. We need a shallower scoring matrix for short domains, short exons, or short DNA reads because deep scoring matrices like BLOSUM62 do not have enough information content to produce significant scores. Short alignments require shallow scoring matrices.

One should also use a shallower scoring matrix when looking for orthologs (sequences that differ because of speciation events and are likely to share similar functions) between "relatively" closely related organisms (100 – 500 My). Protein sequence comparison algorithms are very sensitive; BLASTP and SSEARCH routinely find significant alignments between human and yeast (1.2 billion year divergence) or human and E. coli (>2.4 billion years). Because of this sensitivity, a mouse-human comparison often reports not only the orthologs (sequences that diverged at the primate/rodent split 80 million years ago) but also dozens of more distantly related paralogs that may have diverged 200 – 2,000 million years ago. Mouse and human orthologs share about 83% amino-acid identity; thus for mammals, the VTML 20 matrix is expected to find all orthologs and paralogs that have diverged over the past 200 million years, but the matrix is much less likely to identify paralogs that share less than 40% sequence identity (divergence time > 1,000 million years).

SCORING MATRICES AND GAP PENALTIES

While there is an intuitive mathematical explanation of pairwise similarity scores from the log-odds perspective, sensitive sequence alignments require both aligned residues and insertion or deletion gaps. Unfortunately, we do not have an analytical model for gap penalties and evolutionary distances. The default gap-penalties provided for BLASTP, SSEARCH, and FASTA were determined empirically (e.g. Pearson, 1991) with a focus on identifying distant homologs. In general, default gap penalties for BLASTP and SSEARCH/FASTA are set as low as possible; lower gap penalties would convert alignments from local to global, which would invalidate the statistical estimates. Thus, when considering whether to alter gap penalties to improve search selectivity for a particular protein family, gap penalties should be increased (made more stringent), not decreased. Just as "shallower" scoring matrices target less divergence by giving higher scores to identities and more negative scores to non-identities, gap penalties should increase with shallower scoring matrices (Reese and Pearson, 2002). Simulations to maximize the significance of short alignments suggest that for 1/2-bit scoring matrices, gap open penalties of 16.7 − 0.067 × PAM-distance, e.g. 16.7 − 0.067 × 20 = 15 for VTML 20, and gap extend penalties of 2, are most effective (Reese and Pearson, 2002).
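The empirical gap-open rule quoted above is easy to tabulate. The sketch below simply evaluates that linear formula for half-bit matrices; the function name is hypothetical, and rounding to the nearest integer is an assumption made here because alignment programs take integer penalties.

```python
GAP_EXTEND = 2  # gap-extend penalty suggested in the text for 1/2-bit matrices


def gap_open_penalty(pam_distance):
    """Gap-open penalty from the empirical rule 16.7 - 0.067 * PAM distance."""
    return 16.7 - 0.067 * pam_distance


if __name__ == "__main__":
    for pam in (10, 20, 40, 80, 120):
        penalty = gap_open_penalty(pam)
        # Rounding is an illustrative choice, not part of the published rule.
        print(f"PAM/VTML {pam}: open {penalty:.2f} (~{round(penalty)}), "
              f"extend {GAP_EXTEND}")
```

The rule is a fit to simulations, so the gap penalties actually listed in Table 1 need not match it for every matrix.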

Low gap-penalties can dramatically reduce the information content and average percent identity associated with a scoring matrix, and can dramatically increase the lengths of alignments produced by the matrix. The target percent identity, information content, and alignment lengths presented in Table 1 reflect the observed median values of the highest scoring alignment produced by random queries against real protein sequences with the specified matrix and gap penalties. If gaps are not allowed, the average percent identity and information content increase and alignment length gets shorter. For example, if gaps are not allowed with BLOSUM62, the median percent identity increases from 28.9 (Table 1) to 33, information content almost doubles from 0.40 to 0.74, and median random alignment length drops from 86 to 45 residues. A similar effect is seen with VTML 80, where information content increases and alignment lengths decrease about 2-fold when gaps are not allowed. Gap effects are less dramatic with shallower matrices like VTML 20 (from 86 to 89% identity, 3.3 to 3.5 bits per position, and from 11 to 10 residue median alignment lengths) because short evolutionary distances should permit many fewer insertions and deletions.

BLASTP gap penalties with shallow scoring matrices

While the BLAST programs offer a set of scoring matrices with different evolutionary horizons (BLOSUM50 and BLOSUM62 are "deep", PAM30 is relatively "shallow"), the small gap penalties provided with their shallow matrices dramatically change their effective evolutionary distance (Table 1). The "shallowest" combination of scoring matrix (PAM30) and gap penalties (9/1) requires an average of 56 aligned amino acids, or more than 160 nucleotides, to produce a 50 bit alignment score. Because these gap penalties are too low (Reese and Pearson, 2002), the BLAST protein matrices are less effective for short alignments or short evolutionary distances than they would be with higher penalties.

LONG ALIGNMENTS AND OVEREXTENSION

In addition to differing in information content (score or "bits" per aligned position) and optimal evolutionary distances (percent identity), different scoring matrices have different preferred alignment lengths (Table 1). Shallow scoring matrices have large negative values for amino-acid replacements (Fig. 2), so alignments to non-homologous (random) sequences will be short. Deep scoring matrices have less negative average replacement scores (VTML20's average non-identity score is −5.8 half-bits, while BLOSUM62's is −1.2 half-bits), so their alignments tend to be longer. Table 1 (random aln. len.) summarizes the median alignment length between random queries and real protein sequences. BLAST and SSEARCH/FASTA statistics are very accurate (UNIT 3.1), so sequences that share statistically significant scores will always share a homologous domain. But BLAST and SSEARCH/FASTA calculate local sequence alignments (the alignments begin and end at positions that maximize the alignment score), so the boundaries of the alignment depend both on the location of the homologous domain and the scoring matrix used to produce the alignment. When a deep scoring matrix like BLOSUM62 is used to align more closely related sequences, the alignment can extend (over-extend) into non-homologous neighboring sequence. Gonzalez and Pearson (2010) termed this artifact "homologous over-extension," and showed that it is a major source of errors in PSI-BLAST searches.

Homologous over-extension often occurs from short repeated domains. For example, Fig. 3A shows a blastp alignment of vav_human (p15498) with skap2_xentr (q5fvw6), a protein that contains an SH3 domain that is homologous over 58 amino acids. However, the alignment is 198 residues long; the additional 140 residues in the alignment include a 100 residue Pleckstrin domain in skap2_xentr that is not homologous (vav_human contains an SH2 domain in the region that aligns to the Pleckstrin domain in skap2_xentr). The 58 residue homologous SH3 domain contributes 85% of the bit score, with the additional 140 residues contributing less than 15% of the score. Using the slightly more stringent (shallower) BLOSUM80 matrix does not change the alignment overextension.

Figure 3. Overextension of an alignment of homologous SH3 domains

(A) BLASTP alignment of vav_human with skap2_xentr. The two proteins share a homologous SH3 domain (highlighted in red) over about 58 amino acids that contributes more than 85% of the similarity score. The remaining 140 amino acid alignment juxtaposes an SH2 domain from vav_human (brown) with a Pleckstrin domain from skap2_xentr (green). These two domains are not homologous; they are classified as having different folds in SCOP. (B) Sub-alignment scores produced by the SSEARCH36 program using the same scoring matrix as BLASTP (BLOSUM62, 11/1) for the vav_human / skap2_xentr alignment. Boundaries for annotated domains in the two proteins were taken from InterPro using the query vav_human (qRegion) or the subject skap2_xentr (sRegion). Thus, 103-206 for the Pleckstrin domain comes from InterPro annotations for skap2_xentr, as does 671-765 for the SH2 domain in vav_human. The raw score, bit score, and percent identity are shown for the sub-regions. The Q-score is −10 log10(p-value) based on the bit score; thus Q=30 corresponds to a probability (uncorrected for database size) of 0.001.
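The Q-score defined in the caption is a plain log transform of a probability, which the following sketch makes explicit. It assumes a base-10 logarithm (the only reading under which Q=30 gives 0.001) and starts from a p-value supplied by the caller; it does not reproduce how SSEARCH36 derives that p-value from a bit score. Function names are illustrative.

```python
import math


def q_score(p_value):
    """Q-score as described in the caption: -10 * log10(p-value)."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    return -10.0 * math.log10(p_value)


def p_value_from_q(q):
    """Invert the Q-score back to a probability."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    for p in (0.1, 0.001, 1e-10):
        print(f"p = {p:g}  ->  Q = {q_score(p):.1f}")
```

On this scale every 10 Q units is one order of magnitude in probability, so sub-regions can be compared by simple subtraction.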

The FASTA programs offer a new option for identifying homologous over-extension: sub-domain scoring (Fig. 3B). By using the domain annotations available for one of the sequences to sub-divide the alignment, it becomes apparent that the 58-residue SH3 domain is responsible for almost all of the significant similarity found. It is often very difficult to judge the quality of a distant alignment visually; sub-domain scoring provides a quantitative strategy for identifying over-extension.

SCORING MATRICES FOR DNA

DNA scoring matrices, which are usually implemented as match/mismatch scores, can also be treated as log-odds matrices with target evolutionary distances (States et al., 1991). For instance, blastn in its most sensitive mode (-task blastn) uses by default a score of +2 for a match and −3 for a mismatch, which targets sequences at PAM10, or 90% identity (States et al. 1991). By default, searches on the NCBI nucleotide BLAST web site use megablast (-task megablast), with match/mismatch scores of +1/−3 that target sequences that are 99% identical. By default, the FASTA program uses +5/−4 (also available with blastn -task blastn), which corresponds to about PAM 40, or 70% identity. Because DNA sequence comparison is much less sensitive than protein sequence comparison, it is very difficult to detect statistically significant DNA:DNA sequence similarity at distances greater than PAM 40 (PAM 40 is a short distance for protein comparisons).
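The link between a match/mismatch ratio and a target identity can be sketched with the same log-odds algebra used for proteins. The code below assumes the simplest possible model, with all four bases equally frequent and all three mismatches equally likely; that is an assumption made for illustration and may differ in detail from the derivation in States et al. (1991). The function names are hypothetical.

```python
import math

BASE_FREQ = 0.25  # ASSUMPTION: uniform base composition


def log_odds_scores(identity):
    """Unscaled log-odds match and mismatch scores for a target identity."""
    if not BASE_FREQ < identity < 1.0:
        raise ValueError("identity must be between 0.25 and 1.0")
    # Aligned-pair frequencies in homologs: identity shared by 4 match
    # pairs, (1 - identity) shared by 12 mismatch pairs.
    q_match = identity / 4.0
    q_mismatch = (1.0 - identity) / 12.0
    chance = BASE_FREQ * BASE_FREQ
    return math.log(q_match / chance), math.log(q_mismatch / chance)


def mismatch_to_match_ratio(identity):
    """Ratio |mismatch| / match implied by a target identity."""
    match, mismatch = log_odds_scores(identity)
    return abs(mismatch) / match


if __name__ == "__main__":
    for identity in (0.99, 0.95, 0.90, 0.70):
        ratio = mismatch_to_match_ratio(identity)
        print(f"{100 * identity:.0f}% identity: mismatch/match ratio {ratio:.2f}")
```

Under this model a 99% target identity implies a penalty roughly three times the match reward, in line with the +1/−3 scores quoted for megablast; lower target identities imply smaller ratios.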

In practice, the effective target identity for heuristic methods like blat, blastn, megablast and other genome alignment programs that do use scoring matrices may be difficult to estimate from the reported match/mismatch scores. Heuristic programs typically employ a hierarchy of filters to accelerate the similarity search, and each of those filters will affect the percent identity and evolutionary distance of the alignments that are displayed. As a result, it is possible that the displayed alignments may have a lower percent identity than other possible alignments that were excluded during the early stages of the filtering process.

Ideally, the match/mismatch penalties used in genome alignment would match the evolutionary distances of the sequences being aligned; human DNA to itself is expected to be more than 99.9% identical, but human-mouse alignments in protein coding regions will be less than 80% identical (outside of protein coding regions, identity will typically be undetectable at <50%). Likewise, match/mismatch parameters should reflect potential alignment length; searches with short sequences will need higher match/mismatch ratios with higher information content (States et al., 1991).

SUMMARY

The BLAST and FASTA/SSEARCH protein alignment programs use "deep" similarity scoring matrices like BLOSUM62 or BLOSUM50 to identify homologs that share less than 25% sequence identity. Deep scoring matrices require long sequence alignments to achieve statistically significant similarity scores and are more likely to extend alignments outside the homologous region. Shallower scoring matrices are more effective when searching for short homologous domains, short (< 150 nt) exons, or over shorter evolutionary distances. Scoring matrices that are matched to the evolutionary distance of the homologous sequences are also less likely to produce homologous overextension.

The match/mismatch ratios used in DNA similarity searches also have target evolutionary distances. The stringent match/mismatch ratios used by MEGABLAST are most effective at matching sequences that are essentially 100% identical, e.g. mRNA sequences to genomic exons. Deeper, more sensitive DNA scoring parameters are more effective for longer DNA evolutionary distances, e.g. mouse-human.

While scoring matrices and gap penalties can dramatically affect search sensitivity and alignment regions, modern sequence comparison programs provide accurate similarity statistics, so it is unlikely that the "wrong" scoring matrix will produce a significant match to a non-homologous protein. But the "wrong" matrix can prevent short homologous regions from being found, or allow an over-extension into a non-homologous region from a homologous domain. The rapidly increasing volume of protein sequence means that close homologs will frequently be available, and shallower scoring matrices can produce more reliable, functionally informative alignments when closer homologs (>50% identical) are found.

ACKNOWLEDGEMENT

This work was funded by an NIH grant - NIH R01 LM04969.

LITERATURE CITED

Altschul SF. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 1991;219:555–65.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
Gonnet GH, Cohen MA, Benner SA. Exhaustive matching of the entire protein sequence database. Science. 1992;256:1443–1445.
Gonzalez MW, Pearson WR. Homologous over-extension: a challenge for iterative similarity searches. Nucleic Acids Res. 2010;38:2177–2189.
Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA. 1992;89:10915–10919.
Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences. Comp. Appl. Biosci. 1992;8:275–282.
Mueller T, Spang R, Vingron M. Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method. Mol. Biol. Evol. 2002;19:8–13.
Pearson WR. Searching protein sequence libraries: Comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics. 1991;11:635–650.
Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA. 1988;85:2444–2448.
Reese JT, Pearson WR. Empirical determination of effective gap penalties for sequence comparison. Bioinformatics. 2002;18:1500–1507.
Schwartz RM, Dayhoff M. Matrices for detecting distant relationships. In: Dayhoff M, editor. Atlas of Protein Sequence and Structure. supplement 3. volume 5. National Biomedical Research Foundation; Silver Spring, MD: 1978. pp. 353–358.
Smith TF, Waterman MS. Identification of common molecular subsequences. J. Mol. Biol. 1981;147:195–197.
States DJ, Gish W, Altschul SF. Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. METHODS: A companion to Methods in Enzymology. 1991;3:66–70.