PSIBLAST (2024)

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

OUTPUT

INTERPRETING OUTPUT

INPUT FILES

RELATED PROGRAMS

RESTRICTIONS

ALGORITHM

Building PSSMS

Composition-based Statistics

CONSIDERATIONS

SUGGESTIONS

FILTERING OUT LOW COMPLEXITY SEQUENCES

AMINO ACID SCORING

COMMAND-LINE SUMMARY

CITING BLAST

ACKNOWLEDGEMENT

LOCAL DATA FILES

PARAMETER REFERENCE

FUNCTION

PSIBLAST iteratively searches one or more protein databases for sequences similar to one or more protein query sequences. PSIBLAST is similar to BLAST except that it uses position-specific scoring matrices derived during the search.

DESCRIPTION

PSIBLAST, or Position-Specific Iterated BLAST, uses the methods described in Altschul, et al. Nucleic Acids Res. 25(17): 3389-3402 (1997) and Schaffer, et al. Nucleic Acids Res. 29(14): 2994-3005 (2001) to search for similarities between protein query sequences and all the sequences in one or more protein databases.

PSIBLAST uses position-specific scoring matrices (PSSMs) to score matches between query and database sequences, in contrast to BLAST, which uses pre-defined scoring matrices such as BLOSUM62. PSIBLAST may be more sensitive than BLAST, meaning that it might be able to find distantly related sequences that are missed in a BLAST search.

PSIBLAST can repeatedly search the target databases, using a multiple alignment of high scoring sequences found in each search round to generate a new PSSM for use in the next round of searching. PSIBLAST will iterate until no new sequences are found, or the user-specified maximum number of iterations is reached, whichever comes first. Normally, the first round of searching uses a standard scoring matrix, effectively performing a blastp search.

PSIBLAST is a statistically driven search method that finds regions of similarity between your query sequence and database sequences and produces gapped alignments of those regions. Within these aligned regions, the calculated score is higher than some level that you would expect to occur by chance alone.

You are prompted to set a maximum expectation level for each search round. The expectation of a sequence is the number of times the current search would be expected to find a sequence with as good a score by chance alone. Therefore setting the maximum expectation level to 10.0, the default, limits the reported sequences to those with scores high enough to have been found by chance only ten or fewer times.

You are also prompted to specify the maximum expectation that a sequence can have and still be used to build PSSMs. Typically, this threshold is a smaller value than the maximum expectation level; the default is 0.005.
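The way the two thresholds interact can be sketched in a few lines of Python. This is only an illustration, not PSIBLAST code: the function name classify_hit is hypothetical, and whether the program treats a value exactly equal to a threshold as passing is an assumption made here.

```python
REPORT_MAX_E = 10.0       # default maximum expectation for reporting a hit
INCLUSION_MAX_E = 0.005   # default maximum expectation for PSSM inclusion


def classify_hit(e_value, report_max=REPORT_MAX_E, inclusion_max=INCLUSION_MAX_E):
    """Describe how a hit with the given E-value would be treated.

    Assumption: values equal to a threshold are treated as passing it.
    """
    if e_value > report_max:
        return "ignored"
    if e_value <= inclusion_max:
        return "reported and used for PSSM"
    return "reported only"


# E-values 3e-72 and 0.004 appear in the example output; 2.5 and 50.0 are invented.
for e in (3e-72, 0.004, 2.5, 50.0):
    print(e, classify_hit(e))
```

With the defaults, a hit with an expectation of 0.004 is both reported and used to build the next PSSM, while a hit at 2.5 is reported but does not influence the PSSM.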

It is possible to bypass the initial blastp step either by providing a PSSM saved from a previous search or by specifying a set of aligned sequences which are then used to generate the initial PSSM. It is also possible to save a PSSM for use with BLAST in order to search a nucleotide database with a protein query using the PSSM as the scoring matrix.

You can specify any number of protein databases to PSIBLAST. In the current release, if you want to specify multiple protein databases you must do so on the command line. In other words, you cannot specify more than one database from the interactive menu. For example:

% psiblast -INfile2=PIR,SWPLUS

You can also specify multiple protein queries using any valid multiple sequence specification. For example:

% psiblast -INfile1=hsp70.msf{*}

EXAMPLE

Here is a session using PSIBLAST to find the sequences in PIR with similarities to a myoglobin sequence:

% psiblast

 PSIBLAST with what query sequence(s) ?  mywhp.pep

                  Begin (*   1 *) ?
                    End (* 153 *) ?

 Search for query in what sequence database:

   1) pir      p  Protein Information Resource
   2) swplus   p  SWISS-PROT + SP-TREMBL
   3) genpept  p  GenPept (Translated GenBank)

 Please choose one (* 1 *):

 Ignore hits expected to occur by chance more than (* 10.0 *) times?
 Maximum expectation for inclusion in PSSMs (* 0.005 *) ?
 Maximum number of interations (* 2 *) ?
 Limit the number of sequences in my output to (* 500 *) ?
 What should I call the output file (* mywhp.blastpgp *) ?

 1 Searching database "pir" with query "pir1:mywhp"
     CPU time (sec): 116.2
     Output file: mywhp.blastpgp

 Number of query sequences searched: 1
 CPU time (sec): 116.4

%

OUTPUT

Below is part of the output from the search in the example session:

The output has four parts: 1) an introduction that tells where the search occurred and what database(s) and query were compared; 2) a list of the sequences in the database(s) containing HSPs (high-scoring segment pairs) whose scores were least likely to have occurred by chance (the entries in this list have begin and end ranges on them unless -NOFRAGments is specified); 3) a display of the alignments of the HSPs showing identical and similar residues; and 4) a complete list of the parameter settings used for the search.

The list and the alignments of high scoring sequences are sorted first showing the matches from the first round of iteration, followed by the matches found in each successive round and sequences not found in previous rounds.

Immediately before the display of the results of the final search round, there is a separator line which reads:

 Final Round ..

Only the sequences listed below this line are treated as list file members. If you wish to include sequences from earlier rounds in the list file, or to exclude some of the existing members, you must manually edit the PSIBLAST output.

By default, PSIBLAST looks for alignments that contain gaps. If you only look for alignments that do not contain gaps, there will often be more than one segment pair associated with each database sequence.

///////////////////////////////////////////////////////////////////////////////

BLASTP 2.2.1 [Aug-1-2001]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= PIR1:MYWHP
         (153 letters)

Database: pir
           219,241 sequences; 76,174,552 total letters

Searching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .done

Results from round 1

                                                                 Score     E
Sequences producing significant alignments:                      (bits)  Value ..

PIR1:MYWHP Begin: 1 End: 153
!myoglobin [validated] - sperm whale                               268   3e-72
PIR1:MYWHW Begin: 1 End: 153
!myoglobin - dwarf sperm whale                                     258   4e-69
///////////////////////////////////////////////////////////////////////////////
PIR2:S20270 Begin: 3 End: 145
!hemoglobin alpha chain - Antarctic dragonfish (Gymno...            39   0.004
PIR1:HAKOAW Begin: 7 End: 146
!hemoglobin alpha-A chain - white stork                             39   0.004

!Searching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .done

!Results from round 2
!                                                                Score     E
!Sequences producing significant alignments:                     (bits)  Value
!Sequences used in model and found again:

PIR2:A29392
!hemoglobin alpha chain - Indian short-nosed fruit bat             235   4e-62
PIR2:A29702
!hemoglobin alpha chain - pallid bat                               235   4e-62
PIR2:A29391
///////////////////////////////////////////////////////////////////////////////
PIR1:MYTTM
!myoglobin - map turtle                                            202   2e-52
PIR1:MYOY
!myoglobin - aardvark                                              202   2e-52

!Sequences not found previously or not previously below threshold:

PIR1:HAEMA
!hemoglobin alpha chain - Amazon manatee                           227   7e-60
PIR1:HAMQP
!hemoglobin alpha chain - hanuman langur                           227   8e-60
///////////////////////////////////////////////////////////////////////////////
PIR1:HBLRS
!hemoglobin beta chain - slow loris                                203   1e-52
PIR1:HBHO
!hemoglobin beta chain [validated] - horse                         203   2e-52

\\End of List

Results from round 1

>PIR1:MYWHP myoglobin [validated] - sperm whale
          Length = 153

 Score = 268 bits (686), Expect = 3e-72
 Identities = 153/153 (100%), Positives = 153/153 (100%)

Query: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60
           VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
Sbjct: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60

Query: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120
           LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
Sbjct: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120

Query: 121 GDFGADAQGAMNKALELFRKDIAAKYKELGYQG 153
           GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Sbjct: 121 GDFGADAQGAMNKALELFRKDIAAKYKELGYQG 153

>PIR1:MYWHW myoglobin - dwarf sperm whale
          Length = 153

 Score = 258 bits (660), Expect = 4e-69
 Identities = 148/153 (96%), Positives = 151/153 (97%)

Query: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60
           VLSEGEWQLVLHVWAKVEAD+AGHGQDILIRLFK HPETLEKFDRFKHLK+EAEMKASED
Sbjct: 1   VLSEGEWQLVLHVWAKVEADIAGHGQDILIRLFKHHPETLEKFDRFKHLKSEAEMKASED 60

Query: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120
           LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
Sbjct: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120

Query: 121 GDFGADAQGAMNKALELFRKDIAAKYKELGYQG 153
            DFGADAQGAM+KALELFRKDIAAKYKELGYQG
Sbjct: 121 ADFGADAQGAMSKALELFRKDIAAKYKELGYQG 153

///////////////////////////////////////////////////////////////////////////////

Results from round 2

>PIR2:A29392 hemoglobin alpha chain - Indian short-nosed fruit bat
          Length = 141

 Score = 235 bits (601), Expect = 4e-62
 Identities = 37/147 (25%), Positives = 58/147 (39%), Gaps = 6/147 (4%)

Query: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60
           VLS + V W KV + +G + L R+F S P T F F S
Sbjct: 1   VLSPADKTNVKAAWDKVGGNAGEYGAEALERMFLSFPTTKTYFPHFDLAH------GSPQ 54

Query: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120
           +K HG V AL + L L+ HA K ++ + +S ++ L + P
Sbjct: 55  VKGHGKKVGDALTNAVSHIDDLPGALSALSDLHAYKLRVDPVNFKLLSHCLLVTLANHLP 114

Query: 121 GDFGADAQGAMNKALELFRKDIAAKYK 147
           DF +++K L + +KY+
Sbjct: 115 SDFTPAVHASLDKFLASVSTVLTSKYR 141

>PIR2:A29702 hemoglobin alpha chain - pallid bat
          Length = 141

 Score = 235 bits (601), Expect = 4e-62
 Identities = 40/147 (27%), Positives = 60/147 (40%), Gaps = 6/147 (4%)

Query: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60
           VLS + V W KV +G + L R+F S P T F F A++K
Sbjct: 1   VLSPADKTNVKAAWDKVGGHAGDYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKG--- 57

Query: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120
           HG V ALG + L L+ HA K ++ + +S ++ L HP
Sbjct: 58  ---HGKKVGDALGNAVAHMDDLPGALSALSDLHAYKLRVDPVNFKLLSHCLLVTLACHHP 114

Query: 121 GDFGADAQGAMNKALELFRKDIAAKYK 147
           GDF +++K L + +KY+
Sbjct: 115 GDFTPAVHASLDKFLASVSTVLVSKYR 141

///////////////////////////////////////////////////////////////////////////////

>PIR1:HAEMA hemoglobin alpha chain - Amazon manatee
          Length = 141

 Score = 227 bits (581), Expect = 7e-60
 Identities = 35/147 (23%), Positives = 57/147 (37%), Gaps = 6/147 (4%)

Query: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60
           VLS+ + V W K+ +G + L R+F S P T F F S
Sbjct: 1   VLSDEDKTNVKTFWGKIGTHTGEYGGEALERMFLSFPTTKTYFPHFDLSH------GSGQ 54

Query: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120
           +K HG V AL + L L+ HA + ++ + +S ++ L S
Sbjct: 55  IKAHGKKVADALTRAVGHLEDLPGTLSELSDLHAHRLRVDPVNFKLLSHCLLVTLSSHLR 114

Query: 121 GDFGADAQGAMNKALELFRKDIAAKYK 147
           DF +++K L + +KY+
Sbjct: 115 EDFTPSVHASLDKFLSSVSTVLTSKYR 141

///////////////////////////////////////////////////////////////////////////////

>PIR1:HAMQP hemoglobin alpha chain - hanuman langur
          Length = 141

 Score = 227 bits (580), Expect = 8e-60
 Identities = 36/147 (24%), Positives = 58/147 (38%), Gaps = 6/147 (4%)

Query: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60
           VLS + V W KV +G + L R+F S P T F F A++K
Sbjct: 1   VLSPADKTNVKAAWGKVGGHGGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKG--- 57

Query: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120
           HG V AL + L L+ HA K ++ + +S ++ L + P
Sbjct: 58  ---HGKKVADALTNAVAHVDDMPHALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP 114

Query: 121 GDFGADAQGAMNKALELFRKDIAAKYK 147
           +F +++K L + +KY+
Sbjct: 115 AEFTPAVHASLDKFLASVSTVLTSKYR 141

///////////////////////////////////////////////////////////////////////////////

  Database: pir
    Posted date: Aug 27, 2001 6:21 PM
  Number of letters in database: 76,174,552
  Number of sequences in database: 219,241

Lambda     K        H
   0.316    0.196    0.662

Lambda     K        H
   0.267    0.0601   0.140

Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 66,397,381
Number of Sequences: 219241
Number of extensions: 3896070
Number of successful extensions: 10016
Number of sequences better than 10.0: 1540
Number of HSP's better than 10.0 without gapping: 1363
Number of HSP's successfully gapped in prelim test: 192
Number of HSP's that attempted gapping in prelim test: 7585
Number of HSP's gapped (non-prelim): 1593
length of query: 153
length of database: 76,174,552
effective HSP length: 102
effective length of query: 51
effective length of database: 53,811,970
effective search space: 2744410470
effective search space used: 2744410470
T: 11
A: 40
X1: 16 ( 7.3 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 41 (21.0 bits)
S2: 63 (28.3 bits)

The PSIBLAST output is a list file that is suitable for input to any GCG program that allows indirect file specifications. For information about indirect file specification, see Chapter 2 of the User's Guide, Using Sequence Files and Databases.

INTERPRETING OUTPUT

Bit Score

Each aligned segment pair has a normalized score expressed in bits that lets you estimate the magnitude of the search space you would have to look through before you would expect to find an HSP score as good as or better than this one by chance. If the bit score is 30, you would have to score, on average, about 1 billion independent segment pairs (2^30) to find a score this good by chance. Each additional bit doubles the size of the search space. This bit score represents a probability; one over two raised to this power is the probability of finding such a segment by chance. Bit scores represent a probability level for sequence comparisons that is independent of the size of the search.

The size of the search space is proportional to the product of the query sequence length times the sum of the lengths of the sequences in the database. This product, referred to as N in Altschul's publications, is multiplied by a coefficient K to get the size of the search space. When searching protein databases with protein queries, K is about 0.13. PSIBLAST uses estimates of K produced before it runs by random simulation (Altschul & Gish, Methods in Enzymology 266: 460-480 (1996)).
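As a rough illustration of how a bit score relates to the raw score shown in parentheses in the output, the sketch below applies the standard Karlin-Altschul normalization, (lambda * S - ln K) / ln 2. It assumes that the second Lambda/K pair printed at the end of the example output (0.267 and 0.0601) is the one that applies to gapped alignments; the function names are hypothetical, and because the printed values are rounded the results are only approximate.

```python
import math

# Second Lambda/K pair from the end of the example output
# (assumed here to be the values used for gapped alignments).
LAMBDA = 0.267
K = 0.0601


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalize a raw alignment score to bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def pairs_per_chance_hit(bits):
    """Independent segment pairs examined, on average, per chance hit."""
    return 2 ** bits


print(round(bit_score(686)))      # raw score of the sperm whale self-match
print(round(bit_score(660)))      # raw score of the dwarf sperm whale match
print(pairs_per_chance_hit(30))   # about one billion
```

Under these assumptions the raw scores 686 and 660 come out at roughly 268 and 258 bits, in line with the example output.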

E Value

There is a probability associated with each pairwise comparison in the list and with each segment pair alignment. The number shown in the list is the probability that you would observe a score or group of scores as high as the observed score purely by chance when you do a search against a database of this size.

An ideal search would find hits that go from extremely unlikely to ones whose best scores should have occurred by chance alone (that is, with probabilities approaching 1.0).

PSIBLAST Parameters

At the end of the output is a listing of parameter settings along with some trace information about the search. Some of these parameters are described in this document, but to get more complete documentation on these parameters, look at the BLAST release notes on the World Wide Web at

http://www.ncbi.nlm.nih.gov/BLAST/newblast.html

INPUT FILES

PSIBLAST accepts any number of protein sequences as input. The search set is a specially formatted database. See the GCGToBLAST entry in the Program Manual for information on how to create a local database that PSIBLAST can search from a set of sequences in GCG format.

RELATED PROGRAMS

BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds.

NetBLAST searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.

GCGToBLAST combines any set of GCG sequences into a database that you can search with BLAST.

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST.

TFastA does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

FastX does a Pearson and Lipman search for similarity between a nucleotide query sequence and a group of protein sequences, taking frameshifts into account. FastX translates both strands of the nucleic sequence before performing the comparison. It is designed to answer the question, "What implied protein sequences in my nucleic acid sequence are similar to sequences in a protein database?"

TFastX does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences, taking frameshifts into account. It is designed to be a replacement for TFastA, and like TFastA, it is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

SSearch does a rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). This may be the most sensitive method available for similarity searches. Compared to BLAST and FastA, it can be very slow.

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.

WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments.

RESTRICTIONS

You can only use protein queries and protein databases.

You cannot specify more than one query sequence if you are using the -REStorecheckpoint option.

Checkpoint files created using the -SAVecheckpoint option are platform-specific binary files. For this reason, checkpoint files created on one operating system will not work correctly if specified using -REStorecheckpoint when running PSIBLAST on a different type of system.

When restoring a checkpoint file you must use the exact same query sequence as was used for the search that produced the checkpoint file.

The query sequence must be present in a multiple alignment used to jumpstart a search. Both copies of the query must have the same number of sequence characters; however, they may differ in the numbers and positions of gaps.

A jumpstart alignment may not have more than 500 sequences and the total length of the alignment (including gaps) multiplied by the number of sequences may not exceed 1,000,000.
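The jumpstart size limits can be checked before a run with a short script. The following is a minimal sketch, assuming the alignment has already been read into a list of equal-length strings; the function names and the set of gap characters (".", "-" and "~") are assumptions made for illustration and are not part of PSIBLAST.

```python
MAX_SEQUENCES = 500      # limit on sequences in a jumpstart alignment
MAX_CELLS = 1_000_000    # limit on alignment length times number of sequences
GAP_CHARS = ".-~"        # assumed gap characters


def residue_count(row):
    """Number of sequence characters in an aligned row, ignoring gaps."""
    return sum(1 for ch in row if ch not in GAP_CHARS)


def jumpstart_alignment_ok(rows):
    """Check the size limits for a list of equal-length aligned rows."""
    if not rows:
        return False
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        return False
    return len(rows) <= MAX_SEQUENCES and len(rows) * length <= MAX_CELLS


def query_matches_row(query, row):
    """The aligned copy of the query must have the same number of residues."""
    return residue_count(row) == len(query)


print(jumpstart_alignment_ok(["AC.GT", "ACTGT"]))   # small alignment passes
print(query_matches_row("ACGT", "AC.GT"))           # gaps are not counted
```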

You can only restore a single checkpoint file with a single run of PSIBLAST. You can only specify a single jumpstart multiple alignment with a single round of PSIBLAST.

Because of the way PSIBLAST must estimate certain statistical parameters (see the ALGORITHM topic in the BLAST chapter), the number of scoring matrices available for use with PSIBLAST is limited. Currently, valid choices for the -MATRix parameter are BLOSUM62 (the default), BLOSUM45, BLOSUM80, PAM30, and PAM70.

Gap creation and gap extension penalties are supported in limited combinations depending upon which scoring matrix is in use. The following table shows the allowed combinations for amino acids. The first values listed are the defaults for each scoring matrix.



 Scoring Matrix    Gap Opening Penalty    Gap Extension Penalty

 BLOSUM62                  11                       1
                            7                       2
                            8                       2
                            9                       2
                           10                       1
                           12                       1

 BLOSUM80                  10                       1
                            6                       2
                            7                       2
                            8                       2
                            9                       1
                           11                       1

 BLOSUM45                  14                       2
                           10                       3
                           11                       3
                           12                       3
                           13                       3
                           12                       2
                           13                       2
                           15                       2
                           16                       1
                           17                       1
                           18                       1
                           19                       1

 PAM30                      9                       1
                            5                       2
                            6                       2
                            7                       2
                            8                       1
                           10                       1

 PAM70                     10                       1
                            6                       2
                            7                       2
                            8                       2
                            9                       1
                           11                       1
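If you script PSIBLAST runs, the table above can be encoded as a lookup so that unsupported combinations are caught before a search is submitted. This is a sketch only: the dictionary simply transcribes the table, and the helper names are hypothetical.

```python
# (gap opening, gap extension) pairs transcribed from the table above.
# The first pair listed for each matrix is its default.
ALLOWED_GAP_PENALTIES = {
    "BLOSUM62": [(11, 1), (7, 2), (8, 2), (9, 2), (10, 1), (12, 1)],
    "BLOSUM80": [(10, 1), (6, 2), (7, 2), (8, 2), (9, 1), (11, 1)],
    "BLOSUM45": [(14, 2), (10, 3), (11, 3), (12, 3), (13, 3), (12, 2),
                 (13, 2), (15, 2), (16, 1), (17, 1), (18, 1), (19, 1)],
    "PAM30": [(9, 1), (5, 2), (6, 2), (7, 2), (8, 1), (10, 1)],
    "PAM70": [(10, 1), (6, 2), (7, 2), (8, 2), (9, 1), (11, 1)],
}


def default_penalties(matrix):
    """Return the default (opening, extension) pair for a scoring matrix."""
    return ALLOWED_GAP_PENALTIES[matrix.upper()][0]


def is_allowed(matrix, gap_open, gap_extend):
    """True if the penalty pair is listed for the given scoring matrix."""
    return (gap_open, gap_extend) in ALLOWED_GAP_PENALTIES.get(matrix.upper(), [])


print(default_penalties("BLOSUM62"))
print(is_allowed("PAM30", 11, 1))
```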

ALGORITHM

For the most part, the description of the BLAST search algorithm given in the BLAST chapter is applicable to PSIBLAST. There are three main characteristics that are unique to a PSIBLAST search: the use of PSSMs, iterative searching and composition-based statistics.

PSSM-based searches use the PSSM as both the query sequence and the scoring matrix. For a given register of comparison between a PSSM and a sequence, the scores for the residues at each position in the target sequence come from the value corresponding to that residue at that position in the PSSM.
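The idea of reading scores out of a PSSM can be sketched with a toy example. The three-column matrix below is invented for illustration and covers only three residue types; a real PSSM has one column per query position and a score for every amino acid. The function name score_window is hypothetical, and the sketch scores an ungapped register only.

```python
def score_window(pssm, target, offset=0):
    """Sum the PSSM scores for `target` placed, without gaps, so that its
    first residue is compared with PSSM column `offset`."""
    total = 0
    for i, residue in enumerate(target):
        total += pssm[offset + i][residue]
    return total


# Invented toy PSSM: one dictionary of residue scores per position.
toy_pssm = [
    {"A": 4, "G": 0, "K": -1},
    {"A": -2, "G": 6, "K": -2},
    {"A": -1, "G": -2, "K": 5},
]

print(score_window(toy_pssm, "AGK"))            # 4 + 6 + 5
print(score_window(toy_pssm, "GK", offset=1))   # 6 + 5
```

Note that the same residue scores differently depending on the column it is compared with, which is what distinguishes a PSSM from a fixed matrix such as BLOSUM62.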

Building PSSMS

After each search round, high-scoring sequences are used to create a multiple alignment that is then used to calculate match scores for the PSSM. When building the PSSMs, part of each score is based upon observed amino acid frequencies in the multiple alignment, and part is based on prior knowledge of amino acid substitutability. The prior information, represented as "pseudocounts", is derived from a standard scoring matrix, such as BLOSUM62. Pseudocounts are particularly useful when the sequences included in the multiple alignments do not constitute an adequate sample of the protein family that they represent.

You can control the relative contribution of the alignments and pseudocounts with the pseudocount constant.
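One way to picture this weighting is as a weighted average of the observed and pseudocount frequencies for a column, along the lines of the form used in the Altschul et al. (1997) paper cited above. The sketch below is an illustration only: the names alpha (weight of the observed data), beta (standing in for the pseudocount constant) and mix_frequencies are assumptions, the two-residue frequencies are invented, and the way PSIBLAST derives the weights is not shown.

```python
def mix_frequencies(observed, pseudo, alpha, beta):
    """Weighted average of observed and pseudocount frequencies.

    alpha weights the observed column frequencies; beta weights the
    pseudocount frequencies derived from the prior (both assumed names).
    """
    total = alpha + beta
    return {
        aa: (alpha * observed.get(aa, 0.0) + beta * pseudo[aa]) / total
        for aa in pseudo
    }


# Invented two-residue example: the column contains only alanine.
observed = {"A": 1.0, "G": 0.0}
pseudo = {"A": 0.5, "G": 0.5}

print(mix_frequencies(observed, pseudo, alpha=1, beta=1))
print(mix_frequencies(observed, pseudo, alpha=9, beta=1))
```

A larger beta pulls the estimate toward the prior; a larger alpha, as you would want for a column supported by many diverse sequences, pulls it toward the observed frequencies.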

Composition-based Statistics

PSIBLAST differs from ordinary blastp by taking the amino acid compositions of the query and database sequences into account when computing E-values. This is done because, for gapped alignments, the precomputed lambda and K values used by blastp are based upon comparisons of a large number of "random protein sequences" generated using standard amino acid frequencies. With this approach, it is possible for the lambda values to be greater than is warranted for a pair of sequences under consideration, especially when the sequences have a similar, slightly biased amino acid composition. This can lead to a calculated E-value that is significantly smaller (i.e. better) than is justified. The same problem can arise when using PSSM-based comparisons.

The specific method used to take composition into account is detailed in Schaffer, et al. Nucleic Acids Res. 29(14): 2994-3005 (2001).
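The sensitivity of the E-value to lambda can be seen from the standard Karlin-Altschul relation, E = K * N * exp(-lambda * S), where N is the search space and S the raw score. The sketch below is not the composition-based method of Schaffer et al.; it only shows why an overestimated lambda makes a match look better than it should. The raw score of 100 and the alternative lambda of 0.25 are invented; K, the other lambda, and the search space are taken from the example output.

```python
import math


def expect_value(raw_score, lam, k, search_space):
    """Karlin-Altschul expectation: K * N * exp(-lambda * S)."""
    return k * search_space * math.exp(-lam * raw_score)


SEARCH_SPACE = 2744410470   # effective search space from the example output
K = 0.0601                  # second K value from the example output

e_precomputed = expect_value(100, 0.267, K, SEARCH_SPACE)
e_lower_lambda = expect_value(100, 0.25, K, SEARCH_SPACE)

# A lambda that is too large gives the smaller, more optimistic E-value.
print(e_precomputed < e_lower_lambda)
print(e_lower_lambda / e_precomputed)   # equals exp((0.267 - 0.25) * 100)
```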

CONSIDERATIONS

Specifying the Number of Rounds to Iterate

When run with the default settings, PSIBLAST will perform two search rounds, the first of which is a blastp-style search. You may specify up to ten iterations, which will cause PSIBLAST to perform up to ten rounds. However, if after any round of searching no new matches were found, no more iterations are performed (a condition known as "convergence"). If you specify -ROUNDs=0 then PSIBLAST will iterate until convergence occurs. Usually, there is little to be gained by specifying more than five iterations, because the chance of finding false positive matches increases with the number of search rounds. Considering that each search round takes as long as a single equivalent run of BLAST, you should consider breaking the job into a series of low-round searches, saving the PSSM in a checkpoint file at each step. Then, upon examination of the output you can decide whether to restore the PSSM and continue searching.
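The stopping rule described above can be sketched as a simple loop. This is an illustration of the control flow only: iterate_search and toy_round are hypothetical names, and toy_round is an invented stand-in for a real database search that returns the hits found when searching with the current model.

```python
def iterate_search(search_round, max_rounds):
    """Run search rounds until convergence or until max_rounds is reached.

    A max_rounds of 0 means iterate until convergence.
    Returns the number of rounds run and the set of sequences found.
    """
    found = set()
    rounds_run = 0
    while max_rounds == 0 or rounds_run < max_rounds:
        hits = set(search_round(found))
        rounds_run += 1
        new_hits = hits - found
        if not new_hits:          # convergence: nothing new this round
            break
        found |= new_hits
    return rounds_run, found


def toy_round(model):
    """Invented stand-in for a search: a richer model finds more hits."""
    layers = [{"a"}, {"a", "b"}, {"a", "b", "c"}]
    return layers[min(len(model), len(layers) - 1)]


print(iterate_search(toy_round, 2))   # stops at the round limit
print(iterate_search(toy_round, 0))   # runs until no new hits appear
```

In the toy example the unlimited run needs a fourth round to discover that nothing new is found, which mirrors the cost of confirming convergence in a real search.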

E-values change when PSSMs are used

Do not expect E-values for a given database sequence to remain constant between search rounds. This is particularly the case between the first and second search rounds. In the first search round, matches between query and database sequences are scored using a static scoring matrix. In contrast, successive search rounds determine scores by comparing database sequences to the PSSM. In addition, if more sequences are added to the set used to make PSSMs as the search iterates, the scores for matching sequences may change. It is not possible to predict whether the score for an individual sequence will increase or decrease between search rounds.

Save PSSMS in checkpoint files

The -SAVecheckpoint parameter allows you to save the PSSM in a file and specify it in a later search using the -REStorecheckpoint parameter. This is particularly useful when you wish to change the search conditions. For example, you could search a database using a PSSM that was based on sequences found by searching a different database. The main restrictions to observe are: 1) the exact same query sequence used when the PSSM was created must be used whenever the checkpoint file is restored; and 2) the same operating system must be used for all searches involving a given checkpoint file.

Using Multiple Alignments to Jumpstart PSIBLAST

The composition of the first PSSM that is built tends to guide the direction of the search, yet the multiple alignment scheme used by PSIBLAST has some drawbacks compared to dedicated multiple alignment approaches such as the one used by PILEUP. For this reason, you may wish to create a multiple alignment and then specify it to PSIBLAST using the -JUMPstart parameter. This has the additional benefit of allowing you to use a "seed" PSSM that is not based on the content of the target database, which might be useful when searching different databases.

Bit Scores and the Size of the Search

Altschul has shown that for sequences that have diverged by a certain amount, there is an informativeness (or ability to discriminate between chance scores and significant scores) associated with each residue pair in the segment pair. This informativeness is the amount of information obtainable from each residue pair in a real alignment that can be used to distinguish the real alignment from a random one. This informativeness can be expressed in bits. The sum of the information available from each residue pair in a segment is the segment pair's score in bits. Such scores are intuitively understandable as the significance of a segment pair score. To express such scores as a fraction you would divide 1 by 2 to the number of bits in the score. For example, if a segment pair has a bit-score of 16, then the appropriate fraction (1/2^16 = 1/65,536) would suggest that you should see a score this high by chance about once for every 65,000 independent segment pairs you examine.

For nucleotide sequences that have not diverged, there should be an informativeness of about 2 bits per nucleotide pair. For protein sequences that have not diverged, the informativeness should be slightly over 4 bits per amino acid pair. (The informativeness per pair goes down as the sequences diverge and a segment pair score is maximally informative only when a scoring matrix appropriate to the extent of divergence between the sequences is used to calculate the score.)

The bit scores are absolute, but the expectation of finding any particular score depends on the size of the search space. The number of places where a segment pair might originate is proportional to the product of the length of the query times the sum of the lengths of all the sequences searched. This product is multiplied by a coefficient K to get the size of the search space. When searching protein databases with protein queries, K is approximately 0.13.

For a query sequence of length 300 aa searching a database of 12 million residues, the size of the search space would be 300 x 12,000,000 x 0.13 or 468,000,000. For a search this size, a score that only occurs once in every 65,000 potential segment pairs (that is, with a bit score of 16) would be expected to occur about 7,200 times by chance alone.
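The arithmetic in this worked example can be reproduced directly. The sketch below uses hypothetical function names and the figures quoted in the text; using the exact value 2^16 = 65,536 rather than the rounded 65,000 gives about 7,141 expected chance occurrences instead of about 7,200.

```python
def search_space(query_length, database_residues, k=0.13):
    """Size of the search space: query length x database size x K."""
    return query_length * database_residues * k


def expected_chance_hits(space, bit_score):
    """Number of times a score of `bit_score` bits is expected by chance."""
    return space / 2 ** bit_score


space = search_space(300, 12_000_000)
print(round(space))                             # 468,000,000
print(round(expected_chance_hits(space, 16)))   # about 7,141 with 2^16 = 65,536
print(round(space / 65_000))                    # 7,200 with the rounded figure
```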

If the database being searched is highly redundant (as it might be if it contained several hundred homologous cytochromes), then the size of the search space calculated by these methods will overestimate the size of the real search space.

Increasing Program Speed Using Multithreading

This program is multithreaded. It has the potential to run faster on a machine equipped with multiple processors because different parts of the analysis can be run in parallel on different processors. By default, the program assumes you have one processor, so the analysis is performed using one thread. You can use -PROCessors to increase the number of threads up to the number of physical processors on the computer.

Under ideal conditions, the increase in speed is roughly linear with the number of processors used. But conditions are rarely ideal. If your computer is heavily used, competition for the processors can reduce the program's performance. In such an environment, try to run multithreaded programs during times when the load on the system is light.

As the number of threads increases, the amount of memory required increases substantially. You may need to ask your system administrator to increase the memory quota for your account if you want to use more than two threads.

Never use -PROCessors to set the number of threads higher than the number of physical processors that the machine has -- it does not increase program performance, but instead uses up a lot of memory needlessly and makes it harder for other users on the system to get processor time. Ask your system administrator how many processors your computer has if you aren't sure.

When Blastpgp Produces No Output

You may see an error indicating that blastpgp produced no output (blastpgp is the name of the PSIBLAST executable provided by NCBI). One of the possible causes of this condition is the presence of a file in your home directory called ".ncbirc" which contains an invalid path to the NCBI data directory. The NCBI data directory should contain seqcode.val, gc.code, BLOSUM62, and perhaps some other data files. If your home directory does indeed contain such a file, we recommend that you either rename it (the safest option), edit it to update the path to the NCBI data directory (this takes some effort, but that path is contained in the logical name "NCBI"), or delete it (the simplest option). Your system administrator should be able to help you do this if you have trouble, or you may contact support at [email protected]

SUGGESTIONS

Using Checkpoint files with BLAST

Checkpoint files created with PSIBLAST can be specified to BLAST using -REStorecheckpoint in order to perform single-round PSSM-based searches of nucleotide databases. The same query and filter settings must be used for both the PSIBLAST and BLAST searches.

Jumpstarting

If the alignment used to jumpstart a search is in an MSF or RSF file, then you should consider specifying the query sequence from the same file. For example:

% psiblast -in1=calm.msf{calmhuman} -jump=calm.msf{*}

List Size Limit

A list size that is too small to display all the significant hits is a common problem. To see the unlisted hits you must run the search again with the list size limit set high enough to include everything significant.

Segment Pair Alignment Limit

For each round, PSIBLAST displays alignments of segment pairs from the top 250 sequences in the list. You can adjust this limit with -ALIgnments. PSIBLAST will not show alignments for sequences not present in the list.

Sensitivity

PSIBLAST uses a word size of three for proteins, which is appropriate for a wide range of searches, but you can adjust the synonym threshold T downwards to two in order to increase sensitivity at the price of speed. Read the PARAMETER REFERENCE topic for more information on -HITEXTTHRESHold and -EXPect.

Batch Queue

Using BLAST to search a large local database can take a long time. You may want to run searches in the batch queue. You can specify that this program run at a later time in the batch queue by using -BATch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Chapter 3, Using Programs in the User's Guide.

E-Values compared to BLAST

If a single round of searching is specified, then PSIBLAST just performs a blastp search. However, the reported scores and E-values will probably differ from those generated by performing a blastp search with BLAST. This is because PSIBLAST computes the statistical significance of a match by taking into account the composition of the query and database sequences, whereas BLAST does not. Please refer to Schaffer, et al. (Nucleic Acids Res. 29(14): 2994-3005 (2001)) for a detailed discussion of composition-based statistics.

FILTERING OUT LOW COMPLEXITY SEQUENCES [ Previous |Top |Next ]

PSIBLAST always filters out regions of low complexity from database sequences using the SEG filter program (Wootton and Federhen, Computers in Chemistry 17: 149-163 (1993); Wootton and Federhen, Methods in Enzymology 266: 554-571 (1996)). For a general discussion of the role of filtering in search strategies, see Altschul et al., Nature Genetics 6: 119-129 (1994).

Short repeats and low complexity sequences, such as glutamine-rich regions, confound most database searching methods. For PSIBLAST, the random model against which the significance of segment pair scores is evaluated assumes that at each position, each residue has a probability of occurring that is proportional to its composition in the database as a whole. Low complexity or highly repetitive sequences are inconsistent with this assumption.

Amino acid characters in regions of low complexity sequence are substituted with the letter X. Here is an example of a sequence aligned to a filtered copy of itself to show which parts are filtered out:

  1 MAAKIFCLIMLLGLSASAATASIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ 60
  1 MAAKIFCLIMXXXXXXXXXXXXIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ 60

 61 AIAAGILPLSPLFLQQSSALLQQLPLVHLLAQNIRAQQLQQLVLANLAAYSQQQQFLPFN 120
 61 AIAAGIXXXXXXXXXXXXXXXXXXXXXXXXXXNIRXXXXXXXXXXXXXXYSQQQQFLPFN 120

121 QLAALNSAAYLQQQQLLPFSQLAAAYPRQFLPFNQLAALNSHAYVQQQQLLPFSQLAAVS 180
121 QXXXXXXXXXXXXXXXXPFSQLAAAYPRQFLPFNQLAALNSHAYVXXXXXXPFSQLAAVS 180

181 PAAFLTQQQLLPFYLHTAPNVGTLLQLQQLLPFDQLALTNPAAFYQQPIIGGALF 235
181 PAAFLTQQQLLPFYLHTAPNVGTXXXXXXXXXXXXXXXTNPAAFYQQPIIGGALF 235
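
To give a feel for why compositionally biased stretches are singled out, here is a rough Python sketch that masks windows with low Shannon entropy. It is not the SEG algorithm and will not reproduce the masking shown above. The window length of 12 and the cutoff of 2.2 bits are illustrative choices, and the function names are hypothetical.

```python
import math
from collections import Counter


def window_entropy(segment):
    """Shannon entropy, in bits per residue, of the residues in a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq, window=12, cutoff=2.2):
    """Replace every residue of each low-entropy window with the letter X.

    Simplified illustration only; this is NOT the SEG algorithm.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < cutoff:
            masked[start:start + window] = "X" * window
    return "".join(masked)
```

A run of a single residue has an entropy of zero and is always masked, whereas a window of twelve different residues has an entropy of about 3.6 bits and is left alone. Because windows overlap, a few residues flanking a biased run can be masked as well.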

By default PSIBLAST does not filter query sequences, in contrast to BLAST, which does. You can turn query sequence filtering on using -FILter, but this should be done only when you plan to use a PSSM from PSIBLAST with BLAST to perform a PSI-TBLASTN search. An alternative approach in such cases is to use -NOFILter when running the PSI-TBLASTN search.

You can also mask selected positions in the query by using -LOWercasemask, which replaces lowercase letters in the query sequence with the letter X.
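
The effect of lowercase masking on a query can be pictured with a one-line Python function. This is a sketch of the substitution described above, not the program's own code, and the function name is hypothetical.

```python
def mask_lowercase(query):
    """Return the query with every lowercase letter replaced by X."""
    return "".join("X" if ch.islower() else ch for ch in query)
```

For example, mask_lowercase("MAAKifclIML") returns "MAAKXXXXIML", so you can preview which positions will be ignored before running the search.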

AMINO ACID SCORING [ Previous |Top |Next ]

For the first search round, PSIBLAST normally uses the BLOSUM62 scoring matrix from Henikoff and Henikoff (Proc. Natl. Acad. Sci. USA 89: 10915-10919 (1992)) whenever the sequences being compared are proteins (including cases where nucleotide databases or query sequences are translated into protein sequences before comparison). You can use the other BLOSUM matrices, BLOSUM45 and BLOSUM80, or the more traditional PAM70 and PAM30 scoring matrices with -MATrix, for example -MATrix=PAM70. Each matrix is most sensitive for finding homologs at the corresponding PAM distance. The seminal paper on this subject is Stephen Altschul's "Amino acid substitution matrices from an information theoretic perspective" (J. Mol. Biol. 219: 555-565 (1991)). If you are new to this literature, an easier place to start reading might be Altschul et al., "Issues in searching molecular sequence databases" (Nature Genetics 6: 119-129 (1994)).

COMMAND-LINE SUMMARY [ Previous |Top |Next ]

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

Minimal Syntax: % psiblast [-INfile1=]pir:mywhp -Default

Prompted Parameters:

-BEGin=1 -END=153              sets the ranges of interest in query sequences
[-INfile2=]pir                 specifies database(s) to search
-EXPect=10.0                   ignores scores that would occur by chance more than 10 times
-THRESHold=0.005               sets e-value threshold for inclusion in a PSSM
-ROUNDs=2                      sets the number of iterations (0 for no imposed limit)
-LIStsize=500                  sets maximum number of sequences listed in the output
[-OUTfile=]mywhp.blastpgp      names the output file

Local Data Files:

[-DATa2=blast.ldbs]            names the list of available local databases
[-DATa3=blast.sdbs]            names the list of available site-specific databases

Optional Parameters:

-ALIgnments=250                sets number of sequences for which to show alignments
-PROCessors=1                  sets the number of processors to use
-GAPweight=0                   sets gap creation penalty
-LENgthweight=0                sets gap extension penalty
-REStorecheckpoint[=mywp.chk]  read in checkpoint file
-SAVecheckpoint[=mywhp.chk]    save checkpoint file
-JUMPstart=hsp70.msf{*}        jumpstart with specified alignment
-TABle[=mywhp.psitable]        write PSSM to a file as an ASCII table
-NOFRAgments                   suppresses showing list file entries as fragments
-VIEW=0                        selects alignment view type (0-8 allowed)
-NATive                        produces unmodified BLAST2 output
-HTML                          uses HTML for output format
-FILter                        filters low complexity segments out of query sequences using SEG
-LOWercasemask                 masks lowercase characters in query sequence
-MATRix=blosum62               assigns the substitution matrix for proteins
-PSEudoconst=9                 set relative emphasis given to pseudocounts
-SWAlign                       compute locally optimal Smith-Waterman alignments
-WORdsize=0                    sets word size (0 selects program default)
-HITEXTTHRESHold=0             sets minimum score to extend hits [T]
-HITWindow=40                  sets multiple hits window size [A]
-TRIGger=22.0                  sets number of bits to trigger gapping
-XDRopoff=0                    sets X dropoff value for gapped alignments [X2]
-BESthits=0                    sets number of best hits from a region to keep [K]
-OLDSTATistics                 don't use composition-based statistics
-EFFdbsize=0                   sets effective database size (0 for real size)
-APPend="string"               appends "string" to pass-through command line
-BATch                         submits program to batch queue
-DBReport                      lists valid databases then exits
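
The capitalization convention used in the summary above can be expressed as a small Python check. This is a sketch under two assumptions: matching is case-insensitive, and any spelling that extends the capitalized letters and remains a prefix of the full name is accepted. The function names are hypothetical, and numbered shorthand such as -in1 for -INfile1 is not handled.

```python
def min_abbreviation(name):
    """Return the leading capitalized letters of a parameter name."""
    required = ""
    for ch in name.lstrip("-"):
        if not ch.isupper():
            break
        required += ch
    return required


def matches(name, typed):
    """True if `typed` spells at least the capitalized part of `name`."""
    body = name.lstrip("-").lower()
    given = typed.lstrip("-").lower()
    required = min_abbreviation(name).lower()
    # Assumption: longer spellings are fine if they are a prefix of the name.
    return given.startswith(required) and body.startswith(given)
```

Under these assumptions, matches("-REStorecheckpoint", "-res") and matches("-REStorecheckpoint", "-REStore") are both true, while matches("-REStorecheckpoint", "-re") is false because it omits one of the required letters.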

CITING BLAST [ Previous |Top |Next ]

The original paper describing BLAST is Altschul, Stephen F., Gish, Warren, Miller, Webb, Myers, Eugene W., and Lipman, David J. (1990). Basic local alignment search tool. J. Mol. Biol. 215: 403-410. PSIBLAST was first described in Altschul, Stephen F., Madden, Thomas L., Schaffer, Alejandro A., Zhang, Jinghui, Zhang, Zheng, Miller, Webb, and Lipman, David J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17): 3389-3402.

ACKNOWLEDGEMENT [ Previous |Top |Next ]

BLAST was written by Warren Gish, formerly of the National Center for Biotechnology Information (NCBI), in collaboration with Stephen Altschul, Webb Miller, Eugene Myers, David Lipman, and David States. The document you are now reading was written by John Devereux, with modifications by Ted Slater and Eric Cabot.

Blastpgp (NCBI's implementation of PSIBLAST) was written for NCBI by Tom Madden and Alejandro Schaffer. Eric Cabot developed the PSIBLAST client by extensively modifying the BLAST client written by Ted Slater for Version 10.0 of the Wisconsin Package. Some portions were taken from the original GCG Wisconsin Package BLAST client written by Scott Rose. The output post-processor for release 10.0 was written by Ron Stewart.

We are extremely grateful to Stephen Altschul, Tom Madden, Alejandro Schaffer and Warren Gish for their careful and original work on BLAST and PSIBLAST, and for their critical comments on the documentation that you are now reading. We are also very grateful to NCBI for making these programs and services available to the molecular biology community.

LOCAL DATA FILES [ Previous |Top |Next ]

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

PSIBLAST reads two files, blast.ldbs (local databases), and blast.sdbs (site-specific databases). These together list the search sets in the menu. We update blast.ldbs when we send database updates to your institution. If you have sequences of local interest that you would like to search with PSIBLAST, read the documentation for GCGToBLAST to see how to create local BLAST-searchable databases, then fetch the file blast.sdbs, and add the name of the local search set so that it appears in the menu.
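
The lookup order for data files described at the start of this topic can be sketched in Python as follows. The function name and arguments are hypothetical, and the sketch assumes that a file named on the command line takes precedence over a same-named file in the working directory, which the text does not state explicitly.

```python
from pathlib import Path


def resolve_data_file(name, public_dir, cwd=".", override=None):
    """Pick the data file the program would read, under the stated assumptions."""
    # Assumption: a file named on the command line (e.g. -DATa2=...) wins.
    if override is not None:
        return Path(override)
    # Next, a file with exactly the same name in the working directory.
    local = Path(cwd) / name
    if local.is_file():
        return local
    # Otherwise fall back to the public data directory.
    return Path(public_dir) / name
```

For example, after you fetch blast.sdbs into your working directory, resolve_data_file("blast.sdbs", public_dir) returns the local copy rather than the public one.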

PARAMETER REFERENCE [ Previous |Top |Next ]

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

Following some of the optional parameters described below is a letter or short expression in parentheses. These are the names of the corresponding parameters at the bottom of your PSIBLAST output.

-EXPect=10.0

This parameter, for which there is a prompt if you don't set it on the command line, lets you influence the number of hits in your output having scores that would be expected to have occurred by chance alone. There is nothing to prevent many biologically significant but statistically insignificant segment pairs from being screened out, so you may sometimes want to increase this parameter in order to have an opportunity to see them.

-THRESHold=0.005

After each round of searching, matches whose expectation scores are less than or equal to the specified value are used to construct the PSSM for the next round. Sequences with scores that exceed the threshold but not the setting of -EXPect will still be reported. You are prompted to set the threshold value if you do not set it on the command line.
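
The interplay between -THRESHold and -EXPect can be summarized in a few lines of Python. This is a sketch of the rule stated above rather than the program's code; the function name and the three labels are hypothetical.

```python
def classify_hit(evalue, expect=10.0, threshold=0.005):
    """Say what happens to a hit with the given expectation value."""
    if evalue <= threshold:
        return "pssm"       # reported and used to build the next PSSM
    if evalue <= expect:
        return "reported"   # listed in the output but kept out of the PSSM
    return "discarded"      # worse than -EXPect, so not reported
```

With the default settings, a hit with an expectation of 0.001 contributes to the next PSSM, a hit with an expectation of 1.0 is only reported, and a hit with an expectation of 50 is dropped.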

-ROUNDs=2

This parameter governs the maximum number of times that the search is iterated. The cycle of searching and PSSM building will repeat until the specified number of rounds have occurred or until the search "converges" (i.e. until no more new sequences can be added to the PSSM). Setting -ROUNDs=0 causes the iterations to stop only upon convergence. Failure of a search to converge by 10 rounds suggests that the PSSM may have become "corrupted", meaning that too many unrelated sequences (i.e. false positives) have been included. You can minimize the risk of corruption by using checkpoint files with a series of search runs and low settings of the -ROUNDs parameter. See the descriptions of the -REStorecheckpoint and -SAVecheckpoint parameters for additional details.

Since the first round of searching uses a standard scoring matrix (e.g. BLOSUM62), specifying only a single round is the equivalent of using BLAST to perform a blastp search. It is, however, possible to perform a single-round, PSSM-dependent search by using either -REStorecheckpoint or -JUMPstart.
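
The stopping rule for iteration can be sketched as a short Python loop. The search itself is represented by a hypothetical callback, search_round, which receives the names already included in the PSSM and returns a dictionary mapping each hit to its expectation value. Nothing here reflects how blastpgp is actually implemented.

```python
def iterate_search(search_round, max_rounds=2, threshold=0.005):
    """Run rounds until convergence or until max_rounds (0 = no limit).

    search_round is a hypothetical callback: it takes the frozenset of
    sequence names already in the PSSM and returns {name: evalue}.
    """
    included = set()
    rounds_run = 0
    converged = False
    while max_rounds == 0 or rounds_run < max_rounds:
        hits = search_round(frozenset(included))
        rounds_run += 1
        # Sequences at or below the threshold that were not included before.
        new = {name for name, e in hits.items() if e <= threshold} - included
        if not new:
            converged = True  # nothing new to add, so the search has converged
            break
        included |= new
    return included, rounds_run, converged
```

Note that with max_rounds=0 the loop ends only on convergence, so a search that keeps admitting new sequences never stops. That is the situation the text above warns about when it mentions corrupted PSSMs.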

-LIStsize=500

By default, the PSIBLAST output list file will contain up to 500 sequences (or fragments thereof, depending upon the state of -FRAgments), even if more than 500 sequences had scores above the cutoff score. The list is sorted in order of increasing probability, that is, with the most significant sequences first. Use -LIStsize to change the number of sequences in your output to any value between 0 (for blastpgp's program defaults) and 1000.

-ALIgnments=250

By default, PSIBLAST displays the alignments of HSPs from the best 250 sequences in the list. Use -ALIgnments to change the number of sequences for which alignments are shown in your output to any value between 0 and 1000.

-PROCessors=2

tells the program to use 2 threads for the database search on a multiprocessor computer. Check with your system manager for the number of processors available at your site. Never set the number of processors greater than what you have available.

-GAPweight=11

sets the penalty for adding a gap to the alignment. See the RESTRICTIONS topic for more information about setting the gap opening penalty.

-LENgthweight=1

sets the penalty for lengthening an existing gap in the alignment. See the RESTRICTIONS topic for more information about setting the gap extension penalty.

-REStorecheckpoint[=mywp.chk]

Read a checkpoint file from an earlier search and use the stored PSSM as the scoring matrix for the first search round. After the first round of searching, PSSMs are built using the normal rules. It is essential to use the exact same query sequence as was used to construct the checkpoint file, although you do not have to search against the same database.

If you are running PSIBLAST interactively and do not specify the name of a checkpoint file with the -REStorecheckpoint parameter, you are prompted for one. You cannot specify multiple queries when using -REStorecheckpoint.

Checkpoint files have a hardware-dependent binary format. Therefore it is unlikely that you will be able to restore a checkpoint on a platform that is different from the one used when it was created.

-SAVecheckpoint[=mywhp.chk]

Save a representation of the PSSM and other details of the last search round into a file that can be used to initiate another search at a later time, possibly using different databases and parameter settings.

You can specify a filename with -SAVecheckpoint only in the case of a single query sequence. With multiple queries, or if no name is specified, checkpoint file names are based on the names of the query sequences. For example, with a query named "sw:calm_human" the checkpoint file would be named "calm_human.chk".

A checkpoint file generated from an earlier search can be used to provide the PSSM for the first round of the current search. It is essential to use the exact same query sequence as was used to construct the checkpoint file, although you do not have to search against the same database. With a restored checkpoint file, the first search round uses the PSSM from the file as the scoring matrix. After the first search round, PSSMs are built using the normal rules.

Checkpoint files also provide a mechanism for performing successive single-round, PSSM-dependent searches, giving you an opportunity to examine the results of one search before initiating another.

-JUMPstart=hsp70.msf{*}

This option allows you to specify a group of aligned sequences that will be used to build a PSSM that is then used with the first search round. After the first round, PSSMs are built using the normal rules.

You can use any valid multiple sequence specification (e.g. MSF, RSF, list files, and database and filename wildcards) as long as it represents a set of multiply aligned sequences. If the sequences are not aligned, then PSIBLAST will probably yield incorrect results. Currently, alignments are limited to a minimum of 2 and a maximum of 500 sequences, and the product of the number of sequences times the alignment length, after endgapping, may not exceed 1 million.

The alignment must contain a sequence with the exact same content and length as the query sequence, although the two copies of the query sequence may have different names. The alignment may contain gaps, since positions corresponding to gaps in the alignment copy of the query sequence are simply ignored when building the PSSM.

The -JUMPstart parameter is ignored if -REStore is also specified. However, in contrast to -REStore, you can use the same jumpstart alignment for searches that use multiple queries. For example, the following command lines all use valid syntax:

% psiblast -INfile1=hsp70.msf{*} -JUMPstart=hsp70.msf{*}
% psiblast -INfile1=hsp70.msf{s*} -JUMPstart=hsp70.msf{*}
% psiblast -INfile1=hsp70.msf{s*} -JUMPstart=hsp70.msf{s*}

The characters within a given column of the jumpstart alignment must all be the same case. Positions corresponding to columns that are represented with upper case characters will be scored using the standard scoring matrix (e.g. BLOSUM62) instead of the PSSM. Note: this is one of the few examples where the case of a sequence character is significant in the Wisconsin Package.
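
Before using an alignment to jumpstart a search, you may want to check it against the rules given for this parameter. The Python sketch below is one way to do that. It assumes the alignment has already been read into a dictionary of equal-length rows, treats ".", "-" and "~" as gap characters, and uses the raw row length rather than the length after endgapping. All of these are assumptions, and the function name is hypothetical.

```python
GAP_CHARS = ".-~"  # assumption: characters treated as gaps


def check_jumpstart(alignment, query):
    """Return a list of rule violations for {name: aligned_row} and a query."""
    problems = []
    rows = list(alignment.values())
    n = len(rows)
    if not 2 <= n <= 500:
        problems.append(f"{n} sequence(s); must be between 2 and 500")
    if not rows:
        return problems
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        return problems + ["rows differ in length; sequences are not aligned"]
    if n * length > 1_000_000:
        problems.append("sequences x alignment length exceeds 1 million")
    # The query must appear in the alignment once gaps are removed.
    ungapped = {"".join(c for c in r if c not in GAP_CHARS).upper()
                for r in rows}
    if query.upper() not in ungapped:
        problems.append("no row matches the query sequence exactly")
    # Every column must be entirely upper case or entirely lower case.
    for col in range(length):
        letters = [r[col] for r in rows if r[col] not in GAP_CHARS]
        if (any(c.isupper() for c in letters)
                and any(c.islower() for c in letters)):
            problems.append(f"column {col + 1} mixes upper and lower case")
    return problems
```

An empty list means that none of the checked rules is violated. It does not guarantee that the sequences are sensibly aligned.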

-TABle[=mywhp.psitable]

writes a text file containing a representation of the PSSM used with the final round of searching. If no filename is specified, then filenames are based on the names of the query sequence and have the extension ".psitable".

You cannot read the table into any programs in the Wisconsin Package, but it may be of use since, after examination, you might want to mask regions of the query sequence using -LOWercasemask and then re-run the search.

-NOFRAgments

suppresses the appearance of begin and end ranges on each output list file entry based on the alignment between the entry and the query sequence.

-VIEW=0

sets the alignment view type. Acceptable values are 0 through 8, which correspond to the following:

0 = pairwise (the default)
1 = showing identities as dots
2 = showing insertions
3 = showing identities as dots and gapping for insertions
4 = gapping for insertions
5 = with endgaps and showing insertions
6 = with endgaps flat master-slave and gapping for insertions
7 = XML output
8 = tab-delimited summary table

The specification of the XML output is available from NCBI at:

ftp://ftp.ncbi.nlm.nih.gov/toolbox/xml/ncbixml.txt

Here are descriptions of the columns in the tab-delimited format:

1 = Query sequence name
2 = Database sequence name
3 = Percent of positions that are identical
4 = Alignment length
5 = Number of mismatches (alignment length - identities - gapped positions)
6 = Number of gaps of any length
7 = Start of alignment for query sequence
8 = End of alignment for query sequence
9 = Start of alignment for database sequence
10 = End of alignment for database sequence
11 = Expectation
12 = Score (bits)
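
If you want to post-process the tab-delimited output, the twelve columns listed above map naturally onto a named tuple. The Python sketch below assumes one hit per line with exactly twelve tab-separated fields and skips blank lines and lines starting with "#". The field names are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    length: int
    mismatches: int
    gaps: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float


# One converter per column, in the order listed above.
CONVERTERS = (str, str, float, int, int, int, int, int, int, int, float, float)


def parse_hit(line):
    """Convert one tab-delimited line into a Hit."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(CONVERTERS):
        raise ValueError(f"expected 12 fields, found {len(fields)}")
    return Hit(*(conv(f.strip()) for conv, f in zip(CONVERTERS, fields)))


def parse_table(text):
    """Parse every non-blank, non-comment line of a -VIEW=8 style table."""
    return [parse_hit(line) for line in text.splitlines()
            if line.strip() and not line.startswith("#")]
```

With the parsed hits in hand, you can filter on any column, for example [h for h in parse_table(text) if h.evalue <= 0.005].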

-NATive

produces unmodified BLAST2 output.

-HTML

uses HTML format for output. This parameter has no effect if you use -VIEW=7 (XML output) or -VIEW=8 (tab-delimited output).

-FILter

filters low complexity segments out of sequences using the SEG algorithm. Filtered residues are replaced with the letter X and are ignored when calculating scores. Normally it is not necessary to filter query sequences, since database sequences are always filtered. Use this parameter only when you plan on saving a PSSM in order to run a PSI-TBLASTN search with the program BLAST.

-LOWercasemask

masks lowercase characters in the query sequence by replacing them with the letter X during the search. Masked residues are ignored when calculating scores. This is one of the few cases in the Wisconsin Package where the uppercase and lowercase characters in input sequences can produce different results.

-MATrix=BLOSUM62

sets the amino acid substitution matrix to use for the first round and for pseudocounts. PSIBLAST normally uses the BLOSUM62 amino acid substitution matrix from Henikoff and Henikoff. Other valid options are BLOSUM45, BLOSUM80, PAM30, and PAM70.

-PSEudoconst=9

sets the relative emphasis given to pseudocounts derived from a scoring matrix such as BLOSUM62 versus the observed amino acid frequencies of the multiple alignment when constructing PSSMs. The relative emphasis on pseudocounts increases with a value known as the "pseudocount constant" that is set with this parameter.
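
The trade-off controlled by the pseudocount constant can be illustrated with the weighted average described in Altschul et al. (1997), in which the estimated frequency of a residue in a column is (alpha * f + beta * g) / (alpha + beta). Here f is the observed frequency, g is the pseudocount frequency derived from the scoring matrix, alpha reflects the diversity of the aligned column, and beta is the pseudocount constant. The Python sketch below shows only this blending step. The function name is hypothetical, and whether blastpgp computes it in exactly this way is not stated in this document.

```python
def blend_frequencies(observed, pseudo, alpha, beta=9.0):
    """Blend observed and pseudocount frequencies for one alignment column.

    observed and pseudo map residue letters to frequencies; alpha + beta
    must be greater than zero.
    """
    residues = set(observed) | set(pseudo)
    return {res: (alpha * observed.get(res, 0.0)
                  + beta * pseudo.get(res, 0.0)) / (alpha + beta)
            for res in residues}
```

A larger beta pulls the result towards the matrix-derived frequencies, while a larger alpha (a more diverse, more informative column) pulls it towards what was actually observed in the alignment.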

-SWAlign

Use the Smith-Waterman algorithm for displayed alignments and for calculating bit scores and expectation values. The default heuristic algorithm is quicker; however, it may completely miss some significant alignments and may even produce non-optimal alignments for some of the sequence similarities that are found. For purposes of speed, the full Smith-Waterman alignment is used only for matching sequences identified using the heuristic algorithm.

-WORdsize=0

sets the size of the short regions of similarity between sequences that PSIBLAST initially searches for. If -WORdsize=0, PSIBLAST uses the default value of 3. Lowering the word size to 2 results in a more sensitive search at the expense of a longer search time.

-HITEXTTHRESHold=0

sets the threshold for extending hits using the two-hit method. Words with scores at least this high can be extended as ungapped alignments.

-HITWindow=40

sets the maximum distance allowed for two non-overlapping sequence segments on the same diagonal, when looking for matches between the query and a database sequence.
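
The role of the window can be pictured with a simplified Python sketch of the two-hit idea: word hits are grouped by diagonal (query position minus database position), and neighbouring hits on a diagonal are paired when they do not overlap and lie no further apart than the window. This is an illustration only. The real program tracks hits incrementally as it scans the database, and the function name is hypothetical.

```python
from collections import defaultdict


def two_hit_pairs(word_hits, word_size=3, window=40):
    """Pair neighbouring word hits on the same diagonal.

    word_hits is an iterable of (query_pos, subject_pos) tuples. The result
    is a list of ((q1, s1), (q2, s2)) pairs that could trigger an extension.
    """
    by_diagonal = defaultdict(list)
    for q, s in word_hits:
        by_diagonal[q - s].append(q)
    pairs = []
    for diag, q_positions in by_diagonal.items():
        q_positions.sort()
        for prev, cur in zip(q_positions, q_positions[1:]):
            distance = cur - prev
            # Non-overlapping (at least one word apart) and within the window.
            if word_size <= distance <= window:
                pairs.append(((prev, prev - diag), (cur, cur - diag)))
    return pairs
```

A smaller window makes it harder for two hits to pair up, which reduces the number of extensions attempted.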

-TRIGger=22.0

sets the number of bits that an initial ungapped alignment must score in order for it to be extended as a gapped alignment.

-XDRopoff=0 [X2]

sets the X2 dropoff value for gapped alignments (in bits). Gapped alignments are extended until the score drops below this value. This limits the (computationally expensive) extension of hits. Use -XDRopoff=0 for default behavior.

-BESthits=0

sets the maximum number of hits from a given region of the query sequence. Only the highest scoring hits from the region are kept. With -BESthits=0, the maximum number is set internally. This parameter can be used to counter the tendency of highly abundant, conserved regions to be so prevalent in the output that the detection of other domains would be precluded.

-OLDSTATistics

suppresses composition-based statistics. Note, however, that other subtle differences exist between blastp and PSIBLAST, so it is unlikely that E-values from the first round of a PSIBLAST search will agree with those from a blastp search run using the program BLAST.

-EFFdbsize=0

sets the effective database size. A value of 0 selects the program default.
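
To see why the effective database size matters, recall the standard relationship between a bit score S' and its expectation in a search space of size m x n: E = m * n * 2^(-S'), where m is the query length and n is the database length. The Python sketch below applies that formula directly. It ignores the length corrections that real programs apply and assumes that the effective database size simply takes the place of n. The function name is hypothetical.

```python
def expect_value(bit_score, query_length, db_size):
    """Expected number of chance hits scoring at least bit_score."""
    return query_length * db_size * 2.0 ** (-bit_score)
```

Under this formula, doubling the database size doubles the expectation of a given bit score. Fixing the effective size is therefore one way to keep E-values comparable between searches of databases that differ in size.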

-APPend="string"

The GCG Wisconsin Package implementation of PSIBLAST is what is known as a "wrapper" program. After collecting your input parameters, the wrapper calls the locally-built implementation of PSIBLAST from NCBI called blastpgp. If you are familiar with the interface to the blastpgp program as it was originally written, you can pass parameters to it directly using this parameter. Please call us if there are additional parameters you want to use with PSIBLAST that you would like to look more like GCG parameters.

-BATch

submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

-DBReport

lists valid databases then exits without searching.

The release notes for PSIBLAST and BLAST can be found at

http://www.ncbi.nlm.nih.gov/BLAST/newblast.html.

Printed: January 9, 2002 13:45 (1162)


Technical Support: [email protected]
or [email protected]

Licenses and Trademarks: Wisconsin Package is a trademark and GCG and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.