Why DRAGON? The acronym allegedly stands for "Distance Regularisation Algorithm for Geometry OptimisatioN", but in fact any other silly acronym would have done, the main purpose of its silliness being that people would remember it.
The Sun executables were compiled under Solaris 2.X using the GNU GCC/G++ compiler (version 2.8.1). Please note that OpenGL graphics are not supported by the Sun version. Executables are provided for SPARC and X86 architectures.
DRAGON can run under PVM, a simulated parallel-processing environment. You can obtain PVM from Netlib but DRAGON will run without it anyway if you don't want to install it.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
Once you have selected the sequences, perform the alignment. Unfortunately DRAGON will not do this for you; you have to use a separate sequence alignment program. With due respect, do not take the results for granted, no matter how famous the author of the alignment program might be: silly things can (and will) happen. It is best to use an alignment program with a good graphical interface that lets you edit the raw results afterwards. Experiment with various similarity matrices and gap penalties. If in doubt, prepare a few different alignments.
Several decades of research into biological sequences provided mankind with a plethora of sequence alignment formats. DRAGON can read a relaxed version of the GCG format (MSF), a PIR alignment format or a horrible vertical format, which is similar to the output of MULTAL. I regret that currently no other formats are acceptable.
secmap alignment_file target_no DSSP_file [ DSSP_file ...]
where alignment_file is the same alignment you plan to use for modelling, target_no is the number of the target sequence in the alignment, the DSSP_files (of which at least one is mandatory) contain the secondary structure assignments for the scaffold structures as generated by the DSSP program of Kabsch and Sander (not supplied). secmap then works out the secondary structure assignment for each residue in the target sequence and prints an output like this:-
# >> Align_: Trying MULTAL format...
# >> Align_: MULTAL parsing successful, seqno=8
# Alignment file: lact/la.aln
# Target sequence number = 2
# Scaffold: "lact/1ALC.dssp", sequence number = 1
# Scaffold: "lact/1HEL.dssp", sequence number = 5
- [---] - - -
...
R [ 1]
I [ 2]
...
I [ 5] h h h
...
L [ 12] h h
K [ 13] non-cons! 3 h
...
L [ 43] 52aB---- 52aB---- 52aC----
N [ 44] 51aB---- 51aB---- 51aC----
Y [ 45] 50aB---- 50aB---- 50aC----
Y [ 46]
- [---] - - -
- [---] - -
N [ 47]
G [ 48]
S [ 49]
S [ 50] 45aB---- 45aB---- 45aC----
S [ 51] 44aB 58a 44aB 58a 44aC 58a
H [ 52] 43aB 57a 43aB 57a 43aC 57a
G [ 53]
L [ 54]
F [ 55]
Q [ 56]
I [ 57] 52aB---- 52aB---- 52aC----
N [ 58] 51aB---- 51aB---- 51aC----
Q [ 59]
...
The first two columns list the amino acids in the target sequence, with the positions in square brackets. The character '-' indicates that the target is gapped at that position. The next column is the mapped secondary structure assignment, and the following columns contain the secondary structure assignments from the given scaffold DSSP files. The characters '3', 'h', 'p' in the middle of the assignment columns indicate 3/10-, alpha- and pi-helices, respectively; a capital letter stands for a beta-sheet ID (as defined in the DSSP files). For beta assignments, the numbers before and after the sheet ID show the partner amino acids in the neighbouring strands, and 'a' and 'p' indicate antiparallel and parallel orientation, respectively. Note that in some cases the warning "non-cons!" is printed in the mapped assignment, indicating that the mapping is inconsistent, i.e. not all scaffolds have the same secondary structure at that particular position. This is usually an indication of an alignment problem.
It would be nice if secmap could print the mapped assignment in the format required by the Sstrfnm parameter. However, there are two reasons why this is not so. First, it would be damn complex to teach secmap how to handle bifurcating sheets. Second, you are forced to think about the secondary structure assignment while you are constructing the Sstrfnm file. Getting it right is crucial.
If you know the whereabouts of disulfide bridges in your protein, then these can easily be encoded into distance restraints. Also, if someone has determined the distances between lysines on the surface of the molecule by crosslinking with bifunctional diimidates, his data could be used immediately. Surface residue distances can also be deduced from fluorescence energy transfer experiments. For all distance restraints, a lower and upper bound may be specified, together with a weight between 0 and 1 which tells the program how reliable that restraint is (restraints with weight 0 are ignored completely, with weight 1 are enforced very strictly).
Homology-derived restraints will be selected by specifying two additional parameters, Maxdist and Minsepar. The former is a C-alpha distance cutoff: only residue pairs closer than this threshold will provide restraints. The latter chooses a minimal sequential separation between the residue pairs.
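The effect of the two cutoffs can be illustrated with a small Python sketch. It is not DRAGON's code: the function name and the toy coordinates are made up, the mapping of scaffold residues onto the target through the alignment is left out, and treating Minsepar as an inclusive limit is my reading of "minimal separation".

```python
import math

def select_homology_pairs(ca_coords, maxdist, minsepar=2):
    """Return (i, j, distance) for scaffold residue pairs that qualify.

    ca_coords: list of (x, y, z) C-alpha coordinates in Angstroms.
    A pair qualifies if its C-alpha distance is below maxdist and the
    residues are at least minsepar positions apart in the sequence.
    """
    pairs = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + minsepar, n):
            d = math.dist(ca_coords[i], ca_coords[j])
            if d < maxdist:
                pairs.append((i, j, d))
    return pairs

# Hypothetical straight chain with 3.8 A spacing along the x axis.
chain = [(3.8 * k, 0.0, 0.0) for k in range(5)]
for i, j, d in select_homology_pairs(chain, maxdist=8.0, minsepar=2):
    print(i, j, round(d, 2))
```

Raising maxdist lets more distant pairs through, which is why the number of restraints grows quickly with this parameter.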
In the first stage of the iteration, the restraints are applied to the matrix and it is then embedded into a D>3-dimensional Euclidean space. Various refinements are performed here, then the distance matrix is reconstructed and the process is repeated until a 3-dimensional embedding is achieved. In the second part of the simulation, the 3D structure is refined and the best structure found so far is saved. The embedding is repeated once more if no good structures were found in the last 5 iterations. The 3D iterations are performed Maxiter times, then the final result is saved to disk. You get a one-line report on the reason for termination: normally this says
"EXIT: no further improvement on 3D reprojection".
As with all Distance Geometry-based methods, a good exploration of the conformational space is essential. Once you have made sure your parameters make sense and have obtained some promising conformations, repeat the runs, starting from a different random distance matrix each time by setting the Randseed parameter to 0. (I tend to do at least 20 parallel simulations.) Go through the results carefully, cluster them into families, and perform more runs if necessary.
dragon -c command_file [other options...]
where the file command_file may contain comments, commands and parameter name-value pairs, one per line. If the file cannot be found or opened, then the program starts up in interactive mode. Command files may be nested using the c nested_file command up to 16 levels deep. This is especially useful for performing a large number of simulations when a few parameters have to be changed systematically.
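As an illustration of such a systematic parameter change, the Python sketch below generates one nested command file per parameter value plus a top-level file that calls them. Only the p, r and c commands and the "Param value" line format come from this manual; the file names (base.par, sweep.cmd, and so on) and the choice of Maxiter are made-up examples.

```python
def build_sweep(param_file, name, values, runs):
    """Return {filename: contents} for a one-parameter sweep."""
    files = {}
    driver = []
    for value in values:
        nested = f"{name}_{value}.cmd"
        files[nested] = (
            f"p {param_file}\n"   # read the base parameter file
            f"{name} {value}\n"   # override a single parameter
            f"r {runs}\n"         # perform the simulations
        )
        driver.append(f"c {nested}")   # nested command file call
    files["sweep.cmd"] = "\n".join(driver) + "\n"
    return files

# Print the files instead of writing them, to keep the sketch side-effect free.
for filename, text in build_sweep("base.par", "Maxiter", [20, 40, 80], 5).items():
    print(f"--- {filename} ---")
    print(text, end="")
```

After writing the generated files to disk, the whole series could be started with dragon -c sweep.cmd. Re-reading the base parameter file in every nested file guards against values left over from the previous step.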
The contents of command files can also be piped or redirected to DRAGON using the standard UNIX mechanisms. This facility is provided for those who are willing to create a graphical user interface for the program.
dragon -p parameter_file [-r run_no]
If the -r option is omitted then just one simulation is performed which is mainly useful for testing that the parameter file is OK before attempting a long simulation session with it.
dragon -m procno [other options...]
DRAGON spawns procno child processes when the r[un] command is issued. procno should be larger than 1. These child processes then execute in parallel in the background and perform the simulations. The output is sent to logfiles, one file per simulation. The original parent process handles the communication with you in exactly the same way as in a serial run, and kills the child processes once they have finished the calculations. This option, although available on single processors as well, is best used on a multiprocessor machine. There is practically no extra I/O overhead since the internal data structures are fully set up when the parent process spawns the children, and they inherit the data automatically. Note that the code was not optimised for multiprocessor machines, and there is no load balancing.
george wd=/usr/people/joebloggs/dragonhome
otherwise the default data files will not be found. See also the PVM manuals.
To enable PVM support, invoke DRAGON with the -M flag:-
dragon -M [other options...]
This starts the program as the master on the local computer, and then spawns slaves as PVM tasks on all the other nodes in the virtual machine. On multiprocessor nodes, DRAGON checks the number of processors and the average load, and launches as many slaves as necessary to fill the machine up completely :-). E.g. on an 8-processor machine with an average load of 5.0, 3 slaves will be started. The slaves will automatically re-nice themselves to priority 10 to get out of the way as much as possible. The master writes the following information to stdout after each slave was launched:-
Slave (1/2) [4022e] started on host machine.domain.edu
This means that slave 1 out of 2 possible slaves, with task ID 4022e, was launched on "machine.domain.edu". The number of possible slaves may vary for a given machine, depending on the overall load, but will never exceed the number of available CPUs.
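The slave count rule can be written down in a few lines of Python. This is only a sketch of the arithmetic described above: the function name is invented, and rounding a fractional load upwards is an assumption, since the text gives just the one example (8 processors, load 5.0, 3 slaves).

```python
import math

def slaves_to_launch(n_cpus, avg_load):
    """Number of slaves that would fill up the idle processors."""
    idle = n_cpus - math.ceil(avg_load)   # assumed rounding rule
    return max(0, min(n_cpus, idle))      # never negative, never above n_cpus

print(slaves_to_launch(8, 5.0))
```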
The master task will communicate with you and, when you request a simulation with the r[un] command, it will inspect the status of the virtual machine, re-spawn slaves if necessary and then broadcast the parameters to the slaves. Once they have reported back that all data have been received, the master starts assigning the runs to the slaves, one by one. Each run is sent to the first available slave: this way, if your virtual machine is composed of faster and slower computers, the faster machines will do most of the job. Simulation output is redirected to logfiles; however, there is one logfile per slave (as opposed to the multiprocess option where one logfile per run is generated). Logfile names have the form xxxxx@hostname, where xxxxx is a hexadecimal number (the PVM task ID for the slave) and hostname is the name of the actual computer that slave was running on. The master periodically writes to the standard output to inform you about the status of the slaves. If a slave crashes (this should not happen), its job will be re-sent to another slave. DRAGON is smart enough to take notice of any changes you make to the virtual machine. If nodes are added, new DRAGON slaves will soon be spawned on them; upon the removal of nodes, the half-finished tasks will be re-started on the remaining nodes as soon as possible.
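Why "first available slave" favours the faster machines can be seen from a toy simulation in Python. It has nothing to do with PVM itself: the run times are invented, and crashes or re-spawning are not modelled.

```python
import heapq

def assign_runs(n_runs, secs_per_run):
    """Simulate first-available-slave scheduling; return runs done per slave.

    secs_per_run[s] is the (hypothetical) time slave s needs for one run.
    """
    # Heap of (time at which the slave becomes free, slave index).
    free_at = [(0.0, s) for s in range(len(secs_per_run))]
    heapq.heapify(free_at)
    done = [0] * len(secs_per_run)
    for _ in range(n_runs):
        t, s = heapq.heappop(free_at)        # the slave that is free first
        done[s] += 1
        heapq.heappush(free_at, (t + secs_per_run[s], s))
    return done

# One slave twice as fast as the other, nine runs in total.
print(assign_runs(9, [10.0, 20.0]))
```

The faster slave ends up with about twice as many runs, without the master having to know anything about machine speeds in advance.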
Note that if PVM is not running on your computer, then the following uninformative error message may be printed several times to the standard error:-
libpvm [pid4138]: /tmp/pvmd.761: No such file or directory
which may be safely ignored. DRAGON automatically detects the situation and will run in single-processor mode.
A few tips to avoid disappointment with the PVM support option:-
Good morning!
Welcome to DRAGON 4.17.8-n32 [May 15 1998, 11:39:20]
Algorithms by William R. Taylor & András Aszódi
Implementation by András Aszódi
(C) 1993-1997. All rights reserved.
PVM: supported
OpenGL graphics: supported
The program greets you with a version and copyright information listing. Note that the ABI is indicated right after the version number.
>>Align_: Trying MULTAL format...
>>Align_: MULTAL parsing successful, seqno=8
=== THE MODEL CHAIN ===
# No. of sequences = 8, model = Seq. #1, no. of residues = 75
# Aa Cons Phob Brad Acdist
1 K 0.0714 0.03 2.67 4.06
2 S 0.288 0.49 1.94 2.01
3 P 0.398 0.18 2.21 1.97
...
=== THE MODEL CHAIN ===
# No. of sequences = 8, model = Seq. #1, no. of residues = 75
--KSPEE--- -LKGIFEKYA AKEGDPNQLS KEELKLLLQT EFPSLLK--- GPSTLDELFE
----SEEMIA EFKAAFDMFD --ADGGGDIS TKELGTVMR- MLG---QNPT KEEL-DAIIE
---LAKKSNE ELEAIFKILD --QDKSGFIE DEELELFLQ- NFSAGARTLT KTET-ETFLK
---MKETDSE MIREAFRVFD --KDGNGVIT AQEFRYFMV- HMG---MQFS EEEV-DEMIK
---LSSKSAD DVKNVFAILD --QDRSGFIE EEELKLFLQ- NFSASARALT DAET-KAFLA
-PSQMEHAME TMMLTFHRFA ---GEKNYLT KEDLRVLMER EFPGFLENQK DPLAVDKIMK
-----EAMQE ELREAFRLYD --KQGQGFIN VSDLRDILR- ALD---DKLT EDEL-DEMIA
MCSSLEQALA VLVTTFHKYS CQEGDKFKLS KGEMKELLHK ELPSFVGEKV DEEGLKKLMG
ELDKNGDGEV SFEEFQVLVK KISQ------ --------
EVDEDGSGTI DFEEFLVMMV RQMKEDA--- --------
AGDSDGDGKI GVDEFQKLVK A--------- --------
EVDVDGDGEI DYEEFVKMMS NQ-------- --------
AGDSDGDGKI GVEEFQSLVK P--------- --------
DLDQCRDGKV GFQSFLSLVA GLIIACNDYF VVHMKQKK
EIDTDGSGTV DFDEFMEMMT G--------- --------
NLDENSDQQV DFQEYAVFLA LITVMCNDFF QGCPDRP-
# Target Cons Phob Brad Acdist Alignment
1 K 0.0714 0.03 2.67 4.06 -------M
2 S 0.288 0.49 1.94 2.01 -----P-C
3 P 0.398 0.18 2.21 1.97 K----S-S
4 E 0.78 0.01 2.47 3.4 S-LMLQ-S
5 E 0.612 0.01 2.47 3.4 PSAKSM-L
6 L 0.647 2.56 2.6 2.82 EEKESEEE
First the multiple alignment is read and parsed, then the model chain parameters are listed. Target is the one-letter amino acid code, Cons is the conservation at the given position, Phob is the average hydrophobicity of the position (which is not the same as the hydrophobicity of the amino acid in the master sequence), Brad is the radius of the fake C-beta atom representing the side chain, Acdist is the distance between the C-alpha atom and the centroid of the side chain. Alignment lists all residues in the same position. The program then lists the restraints, accessibilities and secondary structure assignments in the same format as their corresponding input files (see below). Look for error messages here as they indicate input file formatting problems.
RUN 1 STARTED: Fri 04-Apr-1997 11:55:41
# Randseed=117
nonlin11_reg():.......................................Done
Q=2.359e-02, Stepno=27, t-stat=2.600e-02
D=-9.793e+00 * H^6.619e-01 + 2.809e+01
...
The Randseed value is the actual long number used for initialising the random number generator. When in parallel mode, this number gets "spiced" with a combination of the process ID and system time to avoid identical parallel runs. The nonlinear regression is used to calculate the distance distribution for residue pairs with unknown distances; the data are shown for decorative purposes only.
SMUP: 4
SMLOW: 3, triangle violations=0
These two lines show how many cycles were used for upper- and lower-bound restraint smoothing. If the number of violations is larger than zero, you might consider checking your restraint file for mistakes.
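For readers unfamiliar with bound smoothing, here is a textbook-style Python sketch of the triangle inequality applied to upper and lower distance bounds. It is a generic illustration, not DRAGON's implementation: DRAGON reports cycle counts, so it presumably iterates in its own way, and the function names and the tiny three-point example are made up.

```python
def smooth_upper(upper):
    """Tighten upper bounds in place: u(i,j) <= u(i,k) + u(k,j)."""
    n = len(upper)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = upper[i][k] + upper[k][j]
                if via_k < upper[i][j]:
                    upper[i][j] = upper[j][i] = via_k

def smooth_lower(lower, upper):
    """Raise lower bounds in place: l(i,j) >= l(i,k) - u(k,j)."""
    n = len(lower)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                via_k = lower[i][k] - upper[k][j]
                if via_k > lower[i][j]:
                    lower[i][j] = lower[j][i] = via_k

def count_violations(lower, upper):
    """Pairs whose lower bound exceeds the upper bound after smoothing."""
    n = len(lower)
    return sum(lower[i][j] > upper[i][j]
               for i in range(n) for j in range(i + 1, n))

# Points 0-1 and 1-2 are at most 1 apart, yet 0-2 is asked to be at least 5.
upper = [[0, 1, 10], [1, 0, 1], [10, 1, 0]]
lower = [[0, 0, 5], [0, 0, 0], [5, 0, 0]]
smooth_upper(upper)
smooth_lower(lower, upper)
print(upper[0][2], count_violations(lower, upper))
```

A non-zero violation count means that some restraints contradict each other, which is exactly the situation in which the restraint file deserves a second look.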
CYCLE: 5 (61%, 42 secs)
DIST: BD=1.965e-02, NB=1.640e-06, RS=7.094e-03, SC=1.569e+00, AC=1.500e+00
PROJ: Dim=6, Df=1.003e+00, STR=7.927e-04
TNGL: 0 (cyc=3)
EUCL: IN=1.955e-05 ALL=1.408e-03
EUCL: BD=7.263e-04, NB=2.156e-05, RS=1.137e-01, SC=4.703e+00, AC=1.792e+00
This is what you see during high-dimensional iteration. The DIST and the second EUCL lines list the scores during distance matrix and Euclidean space adjustments, respectively. BD is the virtual bond score (between first and second neighbours), NB is the non-bond score (bumps between anyone else), RS is the external restraint score, SC is the secondary structure score, AC is the accessibility score. The first EUCL line lists intermediary adjustment scores (to be ignored). The PROJ line shows the new embedding dimension, the isotropic density adjustment factor (which is usually very close to 1.0 except in the first embedding) and the Spectral Gradient "stress" value between the actual distance matrix of the projected structure and the initial distance matrix (the lower, the better). The TNGL line shows the number of remaining tangles after some detangling cycles. You would like to see 0 here.
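If you want to follow these scores over many cycles, the lines are easy to pick apart with a script. The Python sketch below is a convenience of my own, not part of the DRAGON suite; it assumes only the "TAG: NAME=value" layout visible in the sample output above.

```python
import re

# NAME=number, where the number may be in exponent notation.
SCORE = re.compile(r"([A-Za-z0-9]+)=([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")

def parse_scores(line):
    """Split a log line such as 'DIST: BD=..., NB=...' into (tag, scores)."""
    tag, _, rest = line.partition(":")
    scores = {name: float(value) for name, value in SCORE.findall(rest)}
    return tag.strip("* "), scores

tag, scores = parse_scores(
    "DIST: BD=1.965e-02, NB=1.640e-06, RS=7.094e-03, SC=1.569e+00, AC=1.500e+00")
print(tag, scores["BD"], scores["AC"])
```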
HAND: (secstr) Good:Bad=1:4 (2:32)
DIST: BD=2.211e-01, NB=7.904e-04, RS=5.051e-02, SC=2.517e+00, AC=7.493e-01
PROJ: Dim=3, Df=1.019e+00, STR=2.103e-03 , flip
TNGL: 0 (cyc=0)
EUCL: IN=2.124e-05 2oSTR=1.061e+00 ALL=2.584e-03
** BEST: BD=5.237e-02, NB=0.000e+00, RS=3.433e-02, SC=2.185e+00,
AC=7.400e-01
This is the 3D iteration output. HAND shows the results of the overall handedness checks which are done by inspecting the chirality of the secondary structure elements. If more "bad" than "good" chiralities are found, then the structure is "flipped" after projection (reflected through its centroid) to get the chiralities right. Note that the HAND line looks slightly different during homology modelling because in that case the correct overall chirality is obtained from a comparison to a scaffold structure. The rest of the output is similar to the high-dimensional lines. However, you want to see a ** BEST line here, which indicates that a good conformation has been found and saved.
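Reflection through the centroid is a one-liner in coordinates: every point x is replaced by 2c - x, where c is the centroid. This leaves all interatomic distances unchanged and inverts every chirality. The Python sketch below illustrates the operation with invented function names and toy coordinates; it is not taken from the DRAGON sources.

```python
def needs_flip(good, bad):
    """Flip when more 'bad' than 'good' chiralities were found."""
    return bad > good

def reflect_through_centroid(coords):
    """Return the mirror image of a list of (x, y, z) points."""
    n = len(coords)
    centroid = [sum(p[a] for p in coords) / n for a in range(3)]
    return [tuple(2.0 * centroid[a] - p[a] for a in range(3)) for p in coords]

points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 3.0, 0.0)]
if needs_flip(good=1, bad=4):        # counts as in the HAND line above
    points = reflect_through_centroid(points)
print(points)
```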
EXIT: no further improvement on 3D reprojection
TIME: 58 secs
END: BD=2.875e-04, NB=3.286e-04, RS=6.409e-04, SC=1.354e+00, AC=8.160e-01,
Itno:20=6+14
SAVE: 3icb_test_1.pdb
VIOLS: 3icb_test_1.viol
When the simulation finishes, the EXIT line gives you the reason for termination. The TIME line prints the total time used for this run. The END line lists the scores of the best conformation once again, and the Itno field gives a summary of the cycles used in the high-dimensional and 3D iterations. SAVE and VIOLS list the names of the result and violation files.
Occasional warning and error messages, indicating the class and method where the problem occurred, plus a very brief and often uninformative description, may also be printed to the standard error during the run. Warnings are preceded by a question mark `?' and are quite likely to occur at the beginning, if one of the input files is incorrectly formatted. Sometimes you get one or more warnings like this before a projection step:-
? centre_dist(): Cdist2[32]=-1.164e+01
These indicate non-metricity in the distance matrix and can safely be ignored unless there are too many of them. Another quite innocent warning is sometimes printed by the Spectral Gradient optimisation:-
? Specgrad_::iterate(Maxiter=30, Eps=2.000e-02): No convergence
? Steric_::adjust_xyz(SPECGRAD): no convergence
Do not worry, DRAGON switches to another, more robust optimisation when Spectral Gradient does not converge.
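To see how a negative squared distance such as the one in the centre_dist() warning can arise, consider the standard distance-geometry identity that gives the squared distance of each point from the centroid directly from the distance matrix. The Python sketch below implements that identity; that DRAGON's centre_dist() uses exactly this formula is my assumption, based only on the name of the routine.

```python
def centroid_sq_dists(dist):
    """Squared distances of the points from their centroid.

    Uses d(i,c)^2 = (1/N) * sum_j d(i,j)^2 - (1/N^2) * sum_{j<k} d(j,k)^2.
    A negative value means that no Euclidean point set has these distances.
    """
    n = len(dist)
    pair_term = sum(dist[j][k] ** 2
                    for j in range(n) for k in range(j + 1, n)) / n ** 2
    return [sum(d * d for d in dist[i]) / n - pair_term for i in range(n)]

# Three points on a line at 0, 1 and 2: a perfectly metric matrix.
print(centroid_sq_dists([[0, 1, 2], [1, 0, 1], [2, 1, 0]]))
# The same neighbours, but the end-to-end distance inflated to 10.
print(centroid_sq_dists([[0, 1, 10], [1, 0, 1], [10, 1, 0]]))
```

The second matrix violates the triangle inequality, and the middle point duly gets a negative squared centroid distance.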
What you definitely do not want to see is a fatal error message,
preceded by an exclamation mark `!'. Theoretically, they should
never occur. If they do, then a coredump usually follows, and even if you
get a model in the end, it is best to throw it away.
# Atom pair Type Actual Ideal (Strict) Rel.viol Error
CA[ 42]: CA[ 41] BOND 3.40 < 3.80 (2.97) 0.31 10.4 %
where the relative violation column is the error multiplied by the strictness and therefore could look frightening for C-alpha:C-alpha violations which have a high weighting. To calm your nerves, read the last column only. However, you should not see any BOND violations and only a few NBOND ones. If they do occur, then probably some of your external restraints were inconsistent or you have found a horrible new bug. RESTR and SECSTR violations are more common. Check these carefully, too: the deviations from ideal secondary structures may be OK if the model otherwise looks reasonable. Innocent violations below 5 % are not listed at all.
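The arithmetic behind the last two columns can be checked with a few lines of Python. The sketch assumes that the error is the absolute deviation divided by the ideal value, which the text does not state explicitly; the function name is made up. With the printed (rounded) distances from the BOND line above it gives an error of about 10.5 % rather than the listed 10.4 %, presumably because the program works with unrounded values.

```python
def relative_violation(actual, ideal, strictness):
    """Return (fractional error, relative violation) for one restraint."""
    error = abs(actual - ideal) / ideal   # assumed definition of "Error"
    return error, error * strictness      # Rel.viol = error * strictness

error, rel_viol = relative_violation(3.40, 3.80, 2.97)
print(f"error = {100 * error:.1f} %, relative violation = {rel_viol:.2f}")
print("listed" if error >= 0.05 else "not listed (below 5 %)")
```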
rank [-bnr] DRAGON_PDB_file(s)
The flags -b, -n, -r specify that the structures are to be sorted according to their bond, non-bond or restraint scores, respectively. The score flags may be combined, in which case all specified scores will be used in the ranking process.
clumsy [-as] [-w window_len] [-c smooth_cycno] [-o output] PDB_files...
The options have the following meaning: -a causes all atoms to be used in the comparison (the default is C-alphas only), -s performs smoothing on the C-alpha trace with a window length and smooth cycle number specified by -w and -c, respectively. -o saves the average structure of the top cluster to the specified output file. The argument PDB files must have identical sequences and only the first chain from each file is used in the comparison. The program prints a dendrogram to the standard output with the coordinate RMS deviations between the clusters.
Clustering can detect outliers or the presence of fold families which satisfy the restraints equally well. Generate average structures for the clusters you like, or just pick one structure with the best scores from each cluster and use these as representative conformations.
Start with building the main-chain from the C-alpha trace. We used the catomain program in Willie Taylor's lab, kindly provided by M. Levitt: this software is not included in the DRAGON suite. Once you have the main-chain complete with N, C and O, then you can use my sidech program to build partial sidechains using the original multiple alignment and the known structures if you did homology modelling. Usage:-
sidech alignment mainchain homstruct outfile
where alignment is the original multiple alignment you used
for the model construction, mainchain is the PDB file with the
main-chain atoms of the model, homstruct contains the scaffold
structures used for homology modelling (see Homfnm).
The almost-all-atom model with the partial sidechains will be written to
outfile. CHARMm can complete these sidechains, other
modelling programs can perhaps do the whole story so that you might not
need sidech at all.
c[ommand] command_file
Executes the commands in command_file. Command file calls can
be nested up to the maximal depth of 16. If command_file is omitted
then the program enters interactive mode. The number of ">" characters
following the "DRAGON" prompt indicates the
current call depth.
d[efault]
Resets all parameters to their default values.
h[elp]
Prints a short help on all available commands. This command works in
interactive mode only and is ignored when issued from within a command
file.
l[ist] Param
Lists a short description and the value of parameter Param to
the standard output. If Param is omitted, then all parameters are
listed.
o[s]
Invokes an OS shell (your default). You can return to DRAGON by typing
"exit" on the shell command line.
p[aram] parameter_file
Reads the parameter specifications in parameter_file. It is
an error if the file cannot be opened. For the parameter description format,
refer to the Parameters section below. A word of warning:
parameters not specified in parameter_file will
retain their previous values, sometimes causing confusion. You
could either specify all parameters in your files, or you could issue a
d[efault] command prior to reading in new parameters.
q[uit]
Quits DRAGON. If invoked in a nested command file, then execution of
the file is terminated and control will be returned to the caller. DRAGON
exits only if q[uit] was issued at the topmost level. Since execution
automatically terminates at the end of command files anyway, q[uit]
is mainly useful in interactive mode. The program politely asks for confirmation
before exiting.
r[un] repetition
Performs the simulation repetition times using the current parameters
but starting with a different random distance matrix each time. If repetition
is omitted, then one simulation is carried out. Simulations can be interrupted
by typing Ctrl-C. (Note: this feature is not supported when compiled
with GCC.)
s[ave] parameter_file
Saves the parameters to parameter_file or to the standard output
if parameter_file is omitted.
Param value
where Param, the parameter's name, and its value are separated by whitespace. Invalid parameter names and malformed parameter specifications are ignored silently.
There are two kinds of parameters: numeric parameters and filename parameters. The latter specify various ASCII files which either describe your modelling problem (such as the multiple alignment file or the secondary structure assignment) or they hold generic data necessary for the operation of DRAGON. These data files live in the subdirectory pointed to by the $DRAGON_DATA environment variable (usually the dragon4/data subdirectory).
For your convenience, all parameters have default values which will be used if value is missing or does not make any sense to the program. The values contained in the default data files are also hardwired into the program so it is possible to perform a run even if the files are missing or inaccessible. In addition to their default values, numeric parameters have a permitted range as well. If the value specified is outside the range, it will be adjusted silently to the closest upper (or lower) limit. All distance measurements are given in Å units.
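The default-and-range rule for numeric parameters amounts to the following logic, shown here as a Python sketch. The function name and the example default and limits are invented for illustration; they are not the real settings of any DRAGON parameter.

```python
def resolve_value(text, default, lo, hi):
    """Return the value a numeric parameter would end up with.

    text: the value as typed (None if missing); lo and hi: permitted range.
    """
    try:
        value = float(text)
    except (TypeError, ValueError):
        return default                 # missing or nonsensical: use default
    return min(hi, max(lo, value))     # silently adjust to the nearest limit

# Hypothetical parameter with default 40 and permitted range 1..100.
for typed in ("25", "250", "-3", "abc", None):
    print(typed, "->", resolve_value(typed, 40, 1, 100))
```

Because the adjustment is silent, it is worth listing the parameters with the l[ist] command after reading a new parameter file to see what the program really uses.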
Residues which are known to be either on the surface or buried inside may be specified in this file. For these residues, the normal accessibility checks are suspended and DRAGON forces them either to the surface or to the interior.
The accessibility file consists of lines of the following format:-
access_code resno [ resno... ]
where access_code is the letter s or S for surface residues, b or B for buried residues, followed by a whitespace-separated list of residue numbers (resno). More than one line of either kind may be specified in arbitrary order. DRAGON filters out those residues which do not fit into the target molecule or which were specified both as surface and buried and prints appropriate warnings.
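The format and the filtering described above are simple enough to mimic in a short Python sketch, which may be handy for checking a file before a long run. It is not DRAGON's own reader: the function name is invented, residue numbers are assumed to start at 1, and the warnings are left out.

```python
def parse_accessibility(lines, n_residues):
    """Return (surface, buried) residue number sets for the target chain."""
    surface, buried = set(), set()
    for line in lines:
        fields = line.split()
        if not fields or fields[0] not in ("s", "S", "b", "B"):
            continue                           # not an accessibility line
        target = surface if fields[0] in ("s", "S") else buried
        for token in fields[1:]:
            # Keep only residue numbers that fit into the target molecule.
            if token.isdecimal() and 1 <= int(token) <= n_residues:
                target.add(int(token))
    conflict = surface & buried                # specified both ways: drop
    return surface - conflict, buried - conflict

example = ["s 1 5 9", "B 5 7 200", "S 12"]
print(parse_accessibility(example, n_residues=75))
```

In the example, residue 5 is dropped because it is listed both as surface and as buried, and residue 200 because the chain has only 75 residues.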
Specifies the average distances of side-chain atoms from the C-alpha atoms and from the centroid of the side chain. The default file contains data derived from the Ponder/Richards rotamer library. The data in this file are used to convert interatomic distance restraints into restraints between C-alpha atoms and/or side-chain centroids to match the reduced representation of residues inside DRAGON. Since it is quite painful to construct this file, I do not give the format here. For all practical purposes the default values should be adequate.
This is perhaps the most important parameter because you specify the sequence of your protein to be modelled as one of the sequences contained in the multiple alignment (see the Masterno parameter for details). The default alignment file is provided only as an example. Alignments may be specified in the GCG format (also known as multiple sequence format or MSF), or in MULTAL vertical format (which actually has a few variants), or in PIR format.
The GCG format acceptable to DRAGON is more relaxed than the original. Here is the specification:-
<...any number of lines containing anything...>
Name: first_seq_name Len: XXX
Name: second_seq_name Len: YYY
...
Name: last_seq_name Len: ZZZ
<...any number of lines containing anything...>
first_seq_name ..ALIG nM---eNT ...
second_seq_name ..ALIG nM...DNT ...
...
last_seq_name ..ALIG nM-X-eNT...
<... any number of lines containing anything ...>
first_seq_name ..ALIG nM---eNT ...
second_seq_name ..ALIG nM...DNT ...
...
last_seq_name ..ALIG nM-X-eNT...
<... any number of lines containing anything ...>
As you can see, both "." and "-" are acceptable as gap characters, whitespaces are ignored and the amino acid codes may be lower- or uppercase. Each line can be at least as long as the maximal alignment length (2048 chars) if I counted the bytes correctly. The length specifications (XXX, YYY, ... ,ZZZ) need not be equal: DRAGON will use the largest as the alignment length.
The only snag with the MULTAL format is that there is no such thing as the MULTAL format. There are subtle differences in the first few lines where the number and names of sequences are specified. Currently the following variants are recognised:-
Seqno number_of_sequences
and the sequence names are in general unspecified.
Block 0
number_of_sequences seqs
USER>BS_HYDRO = Bean soup hydrolase
USER>NAC_DX = Nicotine deoxygenase
USER>ANOTH_SEQ = another sequence
where the sequence abbreviations after the USER> keyword can be anything. This is what comes out from Willie Taylor's multiple sequence/structure alignment program MSAP. Note that currently there is a bug in some versions of MSAP which sometimes causes the loss of the last few amino acids from the aligned sequences.
block 1 = number_of_sequences seqs
-----USER>BS_HYDRO = Bean soup hydrolase
-----USER>NAC_DX = Nicotine deoxygenase
-----USER>ANOTH_SEQ = another sequence
This is the "MULTAL output" of CAMELEON, the commercial implementation of MULTAL (by Oxford Molecular).
Once we got past this mess, the rest of the non-comment lines are alignment positions containing a string of 1-letter amino acid codes (upper- or lowercase) or the gap character "-". Warnings are printed if invalid characters are encountered and they will be replaced by "X" (meaning any amino acid). Here is a sample alignment file:-
Seqno 6
-AAa-G
LLLIIL
RRE--K
...
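Since every line of the vertical format is one alignment position, recovering the individual sequences is a matter of transposing the file. The Python sketch below does this for the simplest "Seqno" variant only. The set of characters treated as valid (the 20 standard amino acids, X and the gap) is my assumption, and comment lines are not handled because their syntax is not specified here.

```python
VALID = set("ACDEFGHIKLMNPQRSTVWYX-")   # assumed set of acceptable characters

def read_vertical(lines):
    """Turn 'Seqno N' plus one line per position into N sequence strings."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    nseq = int(rows[0].split()[1])       # "Seqno number_of_sequences"
    seqs = [[] for _ in range(nseq)]
    for row in rows[1:]:
        for s, ch in enumerate(row[:nseq].upper()):
            seqs[s].append(ch if ch in VALID else "X")   # invalid -> "X"
    return ["".join(s) for s in seqs]

sample = ["Seqno 6", "-AAa-G", "LLLIIL", "RRE--K"]
print(read_vertical(sample))
```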
The PIR format is relatively simple. All aligned sequences are listed after each other in PIR format, with gaps inserted in the appropriate places. The first line should contain the ">P1;" thing and the sequence name, the second line is a description which is ignored (but must be present), then follow the sequence lines, terminated with an asterisk. Again, lowercase letters are allowed, gaps can be '-' or '.' characters. Comment lines beginning with "#" are also allowed. If some of the sequences happen to have different aligned lengths, then you get a warning and the ends of the offending sequences will be padded up by gaps. Here is an example:-
>P1;BS_HYDRO
Bean soup hydrolase
LFSR--GtHrS--QWETPY
THRSRLLK--*
>P1;NAC_DX
Nicotine deoxygenase
TTLPTR-VVMFhASLK
LLYKHLDNNLaLA---WQD*
.....
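A reader for this PIR variant might look like the Python sketch below. It follows the rules given above (">P1;" header, one ignored description line, sequence lines up to the asterisk, "#" comments, '.' or '-' as gaps, padding of short sequences), but it is my own illustration with an invented function name, and it prints no warnings.

```python
def read_pir(lines):
    """Return {name: aligned sequence}, padded with gaps to equal length."""
    names, seqs = [], []
    state = "header"
    for raw in lines:
        line = raw.strip()
        if state == "description":       # ignored, but must be present
            state = "sequence"
            continue
        if not line or line.startswith("#"):
            continue
        if line.startswith(">P1;"):
            names.append(line[4:].strip())
            seqs.append("")
            state = "description"
        elif state == "sequence":
            body, star, _ = line.partition("*")
            seqs[-1] += "".join(body.split()).upper().replace(".", "-")
            if star:                     # asterisk terminates the sequence
                state = "header"
    width = max(map(len, seqs), default=0)
    return {n: s.ljust(width, "-") for n, s in zip(names, seqs)}

sample = """>P1;BS_HYDRO
Bean soup hydrolase
LFSR--GtHrS--QWETPY
THRSRLLK--*
>P1;NAC_DX
Nicotine deoxygenase
TTLPTR-VVMFhASLK
LLYKHLDNNLaLA---WQD*
""".splitlines()
for name, seq in read_pir(sample).items():
    print(name, seq)
```

In the sample the two sequences have different aligned lengths, so the shorter one is padded with gaps at its end.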
I despise hard-coded limits. However, there is an upper limit of 256 sequences and 2048 positions built into the alignment module. In practice you should refrain from modelling proteins larger than about 300 residues, mainly because DRAGON cannot yet handle multidomain structures.
This parameter specifies the number of C-alpha atoms per cubic Å. The default value is an average calculated from a non-homologous set of well-resolved cytosolic proteins and is surprisingly constant: you may use the default value with confidence if no better guess is available.
This parameter specifies the fraction of eigenvalues to be retained in each stage of the gradual projection. A low value means larger jumps in dimensionality towards 3D but embedding accuracy is reduced.
This parameter affects the run time needed for the first part of the simulations when DRAGON wanders around in high-dimensional spaces. See Maxiter to get an idea how to change the speed and precision in the second stage of the simulations.
This option is ignored on architectures not supporting OpenGL and in non-interactive mode. When set to 1, the actual distance matrices before ("Dist") and after ("Eucl") the embedding are displayed in fancy graphics windows, and the 3D iterations can be monitored in a little molecular movie. This option slows down the calculations slightly and therefore should be switched off when not needed (but it is very nice to watch if you are not in a hurry).
This file, if specified, contains the 3D structure of one or more of the sequences in the alignment in PDB format. Only monomeric structures are considered: they may be separated by TER cards or enclosed between MODEL/ENDMDL cards. Chain identifier characters are ignored for the ATOM cards. It is sufficient to provide the C-alpha coordinates only since all other PDB information will be ignored.
The sequences belonging to the structures are automatically deduced from the ATOM cards (the SEQRES cards are ignored!) and then the structures are used as scaffolds for homology modelling. Structures whose sequences cannot be found in the alignment will be ignored. A common problem is to submit slightly different sequences to the multiple alignment program but this results in disaster since DRAGON demands an exact string match. If you wanted to do homology modelling and DRAGON tells you that no homology-derived restraints were generated, then check the sequences carefully.
Specifies which sequence in the multiple alignment should serve as the "master sequence", i.e. the model chain's sequence. If set to 0 (the default), then the consensus sequence of the alignment will be the model sequence.
This parameter specifies the maximal C-alpha distance between two residues in the known structure(s) which are used as homology-derived restraints. The default value roughly corresponds to the radius of the first coordination sphere in protein interiors. Increased Maxdist values give better accuracy but the larger number of restraints might result in slightly longer simulation times. This parameter is ignored if Homfnm is not specified (no homology modelling).
DRAGON handles 3D iterations in a special way because you are interested in 3D models only. Untangled 3D structures are saved during the 3D iterations if their scores are better than those of the previously saved "best" structure. If no acceptable structures are found, then DRAGON repeats the 3D embedding step every five or so cycles to get into a new local minimum. If no acceptable structures are found within Maxiter iterations, then DRAGON repeats the whole simulation, starting afresh from high dimensions.
The choice of Maxiter affects the CPU time requirements to a large extent. The default value is a good starting point but you should experiment with different values to get a good tradeoff between model quality and simulation time. In general, larger structures would need higher Maxiter values. Note that in my experience it is probably a better idea to run more rough simulations rather than refining a few to the extreme.
The minimal relative change of the steric violation and distance scores between two iterations. Serves as an exit criterion.
The minimal value of the steric violation and distance scores. The simulation exits when the scores fall below this value.
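The two exit criteria described above can be pictured with the following sketch. It is an assumption-laden illustration for a single score: the function and argument names are hypothetical, and how DRAGON combines the steric violation and distance scores internally is not shown here.

```python
def should_stop(prev_score, curr_score, min_change, min_score):
    """Return True when either exit criterion is met for one score."""
    # Criterion 1: the score itself has fallen below the minimal value.
    if curr_score < min_score:
        return True
    # Criterion 2: the relative change between two iterations is too small.
    if prev_score > 0.0:
        rel_change = abs(prev_score - curr_score) / prev_score
        return rel_change < min_change
    return False
```

For example, with a minimal relative change of 0.01, an iteration that improves a score from 2.000 to 1.995 would count as converged, whereas an improvement from 2.0 to 1.5 would not.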
The minimal sequential separation between two residues for which a homology restraint will be generated. This parameter has to be larger than or equal to 2 (the default value) and will be ignored when no homology file is specified. It does not make much sense to vary this parameter, and it will probably not be supported in the next release.
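Together with Maxdist, the minimal separation acts as a simple filter on residue pairs in the known structure. The sketch below shows only this selection step under stated assumptions: the function name and the argument name min_separation are hypothetical, and whatever DRAGON does afterwards to turn the selected pairs into restraints is not reproduced.

```python
import math

def homology_pairs(ca_coords, maxdist, min_separation=2):
    """Select residue pairs (1-based numbering) whose sequential separation
    is at least min_separation and whose C-alpha distance is at most maxdist."""
    if min_separation < 2:
        raise ValueError("min_separation must be >= 2")
    pairs = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            d = math.dist(ca_coords[i], ca_coords[j])
            if d <= maxdist:
                pairs.append((i + 1, j + 1, d))
    return pairs

# Four C-alpha atoms on a straight line, 3.8 Å apart.
chain = [(3.8 * k, 0.0, 0.0) for k in range(4)]
for res1, res2, dist in homology_pairs(chain, maxdist=8.0):
    print(res1, res2, round(dist, 2))
```

Raising maxdist in the usage example to 12.0 would also pick up the pair (1, 4), which illustrates why larger Maxdist values produce more restraints.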
Specifies the name from which the result filenames and various logfile names are derived. If the optional directory path dir_path is given, then the program attempts to create the necessary subdirectories in the path if they do not exist already. Should the directory creation fail for whatever reason, then the output files are created in the current working directory. Note that environment variables like "$HOME" and other shell-dependent things like "~" will NOT be expanded.
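Since nothing is expanded for you, one workaround is to expand the name yourself before handing it to DRAGON. The helper below is a minimal sketch using the Python standard library. The function name is made up, and how you then pass the expanded string on to DRAGON is left to you.

```python
import os

def expand_output_name(name):
    """Expand environment variables and a leading "~" in an output name."""
    return os.path.expanduser(os.path.expandvars(name))

print(expand_output_name("$HOME/dragon_runs/model"))
```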
The best simulation result is saved in PDB format, listing the C-alpha atoms and the fake sidechain centroids as C-beta atoms, as well as the sequence and secondary structure assignment. The result of the k-th run will be saved as "filename_k.pdb". If a valid 3D embedding was found, then a restraint violation file will also be generated with the name "filename_k.viol". In rare circumstances it might happen that no untangled models could be found: in this case the last horrible structure is saved anyway under the name "filename_TEMPORARY_k.pdb". A desperate attempt is also made to untangle the structure and the result will be saved as "filename_UNTANGLED_k.pdb". These files are saved only to frighten you and should be discarded.
In parallel mode (-m option) the child processes generate log files for each run called "filename_k.log".
Specifies the amino acid hydrophobicity values. There is normally no need to change this file. Every non-comment line in the file lists an amino acid (by its one-letter code) and its hydrophobicity value, separated by whitespace, like this:-
# Membrane hydrophobicity data
A 1.73
B 0.02
...
Z 0.02
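Should you decide to write your own data file after all, a quick sanity check like the one below may save you some grief. It is a sketch of a reader for this "one-letter code plus value" layout, assuming that comment lines start with "#". The function name is invented and DRAGON's own parser may be stricter or more lenient.

```python
def read_property_file(lines):
    """Parse lines of the form '<one-letter code> <value>' into a dict.
    Blank lines and lines starting with '#' are skipped."""
    values = {}
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 2 or len(fields[0]) != 1 or not fields[0].isalpha():
            raise ValueError("line %d: expected '<letter> <value>'" % lineno)
        values[fields[0].upper()] = float(fields[1])
    return values

demo = ["# Membrane hydrophobicity data", "A 1.73", "B 0.02"]
print(read_property_file(demo))
```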
This number serves as the seed for the random number generator used to fill up the initial distance matrix. If it is 0 (the default), then the random number generator will be seeded with the system time, otherwise with the specified integer. If multiple runs are specified (with the -r command-line option or via the r[un] command) then the program assumes that Randseed=0.
Contains the list of external distance restraints. Restraints may be specified between C-alpha, side-chain atoms or a pseudo-atom called "SCC" (side chain centroid) in the form of lower/upper-limit pairs with a "strictness" value. Atom names should follow the PDB conventions. No file is specified as the default, meaning that no external distance restraints are available. The format of a line in the file is:-
res1 res2 lowlim uplim strict atom1 atom2
where res1, res2 are the residue numbers (>=1), lowlim and uplim are the lower and upper distance limits in Å units, 0.0<=strict<=1.0 is the strictness value reflecting the reliability of the restraint (0.0 means totally unreliable, 1.0 is absolutely certain), and atom1, atom2 are the atoms linked by the restraint. Restraints within residues can be specified if res1=res2. Here is an example:-
# Example restraint file
6 9 4.89 5.69 0.986 CA CA
12 15 4.89 7.11 0.627 SCC SCC
15 17 3.83 4.15 0.635 CB SG
...
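The line format above is easy to check mechanically before a run. The following sketch parses and validates restraint lines according to the rules just given. The class and function names are hypothetical, and the requirement that lowlim must not exceed uplim is my assumption rather than something stated above.

```python
from typing import NamedTuple

class Restraint(NamedTuple):
    res1: int
    res2: int
    lowlim: float
    uplim: float
    strict: float
    atom1: str
    atom2: str

def parse_restraint(line):
    """Parse one 'res1 res2 lowlim uplim strict atom1 atom2' line."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    r = Restraint(int(fields[0]), int(fields[1]), float(fields[2]),
                  float(fields[3]), float(fields[4]), fields[5], fields[6])
    if r.res1 < 1 or r.res2 < 1:
        raise ValueError("residue numbers must be >= 1")
    if not 0.0 <= r.lowlim <= r.uplim:      # assumed consistency check
        raise ValueError("need 0 <= lowlim <= uplim")
    if not 0.0 <= r.strict <= 1.0:
        raise ValueError("strictness must be within [0.0, 1.0]")
    return r

def read_restraints(lines):
    """Parse all non-blank, non-comment lines of a restraint file."""
    return [parse_restraint(l) for l in lines
            if l.strip() and not l.lstrip().startswith("#")]
```

Fed with the three restraint lines of the example file, read_restraints() returns three Restraint records, the last one linking the CB atom of residue 15 to the SG atom of residue 17.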
Specifies the amino acid similarity matrix. The default file contains Dayhoff's PAM250 matrix. A variety of other similarity matrices are also available in $DRAGON_DATA*.sim files. You can also specify your own; here is the format:-
# Mutation Data Matrix (250 PAMs) DRAGON 4.x default
ARNDCQEGHILKMFPSTWYVBZX
2 -2 0 0 -2 0 0 1 -1 -1 -2 -1 -1 -4 1 1 1 -6 -3 0 0 0 0
-2 6 0 -1 -4 1 -1 -3 2 -2 -3 3 0 -4 0 0 -1 2 -4 -2 -1 0 0
...
The first non-comment line should be a string that specifies the order of the amino acids in the columns and the rows. There must be exactly as many rows and columns as there are characters in the order string. The matrix elements are floating-point values separated by whitespace.
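If you roll your own matrix, the row and column counts are the things most likely to go wrong. The sketch below reads a matrix in this layout and checks its dimensions. The function name is made up and comment lines are assumed to start with "#".

```python
def read_similarity_matrix(lines):
    """Return (order, matrix), where matrix[a][b] is the similarity of a and b.
    The first non-comment line is the order string; one row per character follows."""
    rows = [l.strip() for l in lines
            if l.strip() and not l.lstrip().startswith("#")]
    if not rows:
        raise ValueError("no order string found")
    order = rows[0]
    n = len(order)
    if len(rows) - 1 != n:
        raise ValueError("expected %d rows, found %d" % (n, len(rows) - 1))
    matrix = {}
    for aa, row in zip(order, rows[1:]):
        values = [float(x) for x in row.split()]
        if len(values) != n:
            raise ValueError("row %s: expected %d columns" % (aa, n))
        matrix[aa] = dict(zip(order, values))
    return order, matrix
```

The function can be applied to the lines of any file in this format, for instance via read_similarity_matrix(open(path)) with a path of your own.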
Spectral Gradient is an iterative optimisation method used to move a set of points in Euclidean space so that their distances correspond to a prescribed distance matrix (see Wells et al, J. Mol. Struct. 308: 263-271 (1994) for a detailed description). This parameter sets the precision for the iteration: when the relative "stress" change is less than Speceps, then the iteration is terminated. Lower values mean more iterations. See also Speciter below.
This parameter controls the maximal number of Spectral Gradient iterations used in Euclidean adjustments. Sometimes the method does not converge; in these cases DRAGON performs a less elegant but more robust steepest descent-like optimisation.
This file holds the secondary structure assignments. Currently 3/10-, alpha- and pi-helices and beta-sheets are implemented. You must supply the alignment information for strands in a beta-sheet. Bifurcated sheets may be specified as overlapping "normal" sheets following the PDB convention. A warning is issued when overlapping sheets are encountered: all other overlapping secondary structure elements are ignored. An optional "strictness" value between 0.0 and 1.0 may be specified for each secondary structure element in the file which regulates the extent to which ideal secondary structure is enforced on the model. 1.0 corresponds to full adjustment, 0.0 means that the ideal geometry is not enforced at all. In some cases it is worthwhile to specify a medium strictness, especially for long helices (which are sometimes bent, as opposed to the ideal straight helices generated by DRAGON) and for curved sheets.
DRAGON lists the accepted secondary structure specification to standard output prior to the runs; you can use this listing to verify that you supplied a correct assignment. No file is specified as default but if you fail to provide one, then you are on your own: DRAGON cannot predict secondary structure yet and since the detangling relies on the assignment, the results will be of dubious value.
A few words about the file format. Secondary structure elements may be specified in any order, with comment lines in between. Helix specifications have the format:-
helixtype beg end [strict]
where helixtype=ALPHA or HELIX for alpha-helices, HX310 for 3/10-helices or HXPI for pi-helices, beginning at residue beg and ending at end, with the optional strict value between 0.0 and 1.0.
The sheet description spans multiple lines. It starts with a line that contains the keyword SHEET, optionally followed by a strictness value for the whole sheet. Then comes the first strand with the format:-
STRAND beg end
The rest of the strands in the sheet are described like this:-
STRAND beg end sense this_pos prev_pos
where sense=PAR or ANTI indicates whether the current strand is parallel or anti-parallel with respect to the previous strand. The last two numbers describe the phasing of the strand: the residue indicated by this_pos is hydrogen-bonded to the residue prev_pos on the previous strand. The sheet description ends with a line containing the keyword END. Easy, isn't it? Here is a full example:-
# Example secstr file
# an alpha-helix
ALPHA 12 25
# another alpha-helix
HELIX 39 45
# a 3/10 helix
HX310 69 76
# a pi-helix
HXPI 104 116
# helix we're not sure about (strictness 0.5)
ALPHA 156 172 0.5
# the main sheet has a bulge at 30 and the last strand is bifurcated
SHEET
STRAND 27 29
STRAND 1 7 PAR 1 27
STRAND 47 54 PAR 47 1
STRAND 86 92 PAR 86 48
STRAND 119 121 PAR 119 86
END
# note that most strand descriptions are just repeated
SHEET
STRAND 27 29
STRAND 1 7 PAR 1 27
STRAND 47 54 PAR 47 1
STRAND 86 92 PAR 86 48
STRAND 145 147 PAR 145 90
END
# little extra antiparallel sheet at strictness=0.7
SHEET 0.7
STRAND 137 138
STRAND 141 142 ANTI 142 137
STRAND 124 125 ANTI 124 142
END
You may also consult the PDB Format guide because the sheet representation in this file closely follows the PDB conventions. Be careful when specifying beta-barrels, though: I haven't tried that yet. The PDB convention of specifying the first strand as the last would probably not work.
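For those who prefer to have their assignment files checked by a machine, here is a small reader for the format described above. It is a sketch and not DRAGON's parser: the function name is invented, keywords are matched case-insensitively, and a missing strictness is taken to be 1.0, both of which are my assumptions.

```python
HELIX_KEYWORDS = {"ALPHA": "alpha", "HELIX": "alpha",
                  "HX310": "3/10", "HXPI": "pi"}

def parse_secstr(lines):
    """Return (helices, sheets) parsed from secondary structure file lines."""
    helices, sheets = [], []
    sheet = None                       # the sheet currently being read, if any
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split()
        key = f[0].upper()
        if key in HELIX_KEYWORDS:
            strict = float(f[3]) if len(f) > 3 else 1.0
            helices.append((HELIX_KEYWORDS[key], int(f[1]), int(f[2]), strict))
        elif key == "SHEET" and sheet is None:
            strict = float(f[1]) if len(f) > 1 else 1.0
            sheet = {"strict": strict, "strands": []}
        elif key == "STRAND" and sheet is not None:
            if not sheet["strands"] and len(f) == 3:
                # First strand: no sense or phasing information.
                sheet["strands"].append((int(f[1]), int(f[2]), None, None, None))
            elif sheet["strands"] and len(f) == 6 and f[3].upper() in ("PAR", "ANTI"):
                sheet["strands"].append((int(f[1]), int(f[2]), f[3].upper(),
                                         int(f[4]), int(f[5])))
            else:
                raise ValueError("line %d: malformed STRAND" % lineno)
        elif key == "END" and sheet is not None:
            sheets.append(sheet)
            sheet = None
        else:
            raise ValueError("line %d: unexpected '%s'" % (lineno, f[0]))
    if sheet is not None:
        raise ValueError("SHEET without closing END")
    return helices, sheets
```

Run on the full example above, the sketch returns five helices and three sheets, the last sheet carrying the strictness 0.7.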
Let me give you some tactical advice about secondary structure assignment. It is relatively straightforward to write the assignment file if you perform homology modelling: all you have to do is to map the secondary structure elements in the template structure(s) onto the target sequence. If you attempt ab initio modelling, then the assignments usually come from secondary structure predictions. Since it is not possible to assign a secondary structure strictness value to every residue based on the probabilities generated by most prediction programs, the workaround is to use your judgment and assign an average strictness to the secondary structure elements. The conformation adjustment routine does not like very short elements, i.e. 3-residue "helices" or 2-residue strands: if you need these, then you are probably better off supplying some distance restraints in the Restrfnm restraint file.
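The mapping of template elements onto the target sequence in the homology modelling case is mostly bookkeeping over the alignment columns. The sketch below shows one way of doing it for a single element. The function name is hypothetical, "-" is assumed to be the gap character, and the result should still be inspected by eye, since insertions or deletions inside an element usually mean that it needs trimming.

```python
def map_segment(template_aln, target_aln, beg, end):
    """Map the template residue range beg..end (1-based, gaps not counted)
    onto target residue numbers via two aligned, equally long strings.
    Returns (first, last) in target numbering, or None if nothing maps."""
    if len(template_aln) != len(target_aln):
        raise ValueError("aligned sequences must have the same length")
    t_pos = g_pos = 0                  # running residue numbers
    mapped = []
    for t_char, g_char in zip(template_aln, target_aln):
        if t_char != "-":
            t_pos += 1
        if g_char != "-":
            g_pos += 1
        # Only columns where both sequences have a residue can be mapped.
        if t_char != "-" and g_char != "-" and beg <= t_pos <= end:
            mapped.append(g_pos)
    if not mapped:
        return None
    return mapped[0], mapped[-1]

print(map_segment("AC-DEFG", "ACWDE-G", 2, 5))
```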
Beta-sheets pose another problem. Prediction programs usually generate the strands only but DRAGON needs the sheet topology as well. In most cases you have to generate a few plausible topologies by hand and then compare the results obtained from runs done with each assignment. This approach is feasible for small sheets only.
Maximal number of detangling iterations. The detangling tries to get rid of the tangled conformations which are an annoying artefact of Distance Geometry projections. The default iteration number is probably a safe compromise between speed and efficiency. Note that detangling cannot be carried out if no secondary structure was specified.
Specifies the average amino acid side-chain volumes. There is normally no need to change this file. The default file looks like this:-
# Amino acid volume data for DRAGON 4.x (default)
A 22.7
B 50.2
...