MAP Help

A MULTIPLE ALIGNMENT PROGRAM (MAP):

copyright (c) 1992 Xiaoqiu Huang
The distribution of the program is granted provided no charge is made
and the copyright notice is included.
E-mail: huang@cs.mtu.edu

Proper attribution of the author as the source of the software would
be appreciated:
	 Huang, Xiaoqiu
	 On Global Sequence Alignment,
	 Computer Applications in the Biosciences, 10(3), 227-235, 1994.

	      Xiaoqiu Huang
	      Department of Computer Science
	      Michigan Technological University
	      Houghton, MI 49931

The MAP program computes a multiple global alignment of sequences using
an iterative pairwise method. The underlying algorithm for aligning
two sequences computes a best overlapping alignment between
two sequences without penalizing terminal gaps. In addition,
long internal gaps in short sequences are not heavily penalized.
So MAP is good at producing an alignment where there are long
terminal or internal gaps in some sequences. The MAP program is
designed in a space-efficient manner, so long sequences can be aligned. 

Users supply scoring parameters. In the simplest form, users
provide 3 integers: ms, q and r, where ms is the score of a mismatch
and the score of an i-symbol indel is -(q + r * i). Each match
automatically receives score 10. In addition, an integer gs is
provided so that any gap of length > gs in a short sequence is
given a score of -(q + r * gs), the linear score for a gap of
length gs. In other words, long gaps in the short sequence are
given a constant penalty. This simple scoring scheme may be used
for DNA sequences.  NOTE: all scores are integers.
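
The simple scoring scheme above can be sketched in C as follows. This is
only an illustration of the formulas in the text, not code taken from MAP.
The function names pair_score and gap_score and the flag in_short_seq are
hypothetical; how MAP itself decides which sequence counts as "short" is
not described here, so the caller is assumed to supply that information.

```c
/* Score of aligning two symbols under the simple scheme:
   a match always scores 10, a mismatch scores ms. */
int pair_score(char a, char b, int ms)
{
    return (a == b) ? 10 : ms;
}

/* Score of a gap of len symbols, following -(q + r * i) from the text.
   in_short_seq (hypothetical flag) is nonzero when the gap lies in a
   short sequence; such gaps longer than gs score the same as a gap of
   length gs, i.e. they receive a constant penalty. */
int gap_score(int len, int q, int r, int gs, int in_short_seq)
{
    if (in_short_seq && len > gs)
        len = gs;
    return -(q + r * len);
}
```

For example, with q = 10, r = 2 and gs = 20, a 50-symbol gap in a short
sequence scores -(10 + 2 * 20) = -50, while the same gap without the cap
would score -(10 + 2 * 50) = -110.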

In general, users can define an alphabet of characters to appear
in the sequences and a matrix that gives the substitution score
for each pair of symbols in the alphabet. The 127 ASCII characters
are eligible. The alphabet and matrix are given in a file, where
the first line lists the characters in the alphabet and the lower
triangle of the matrix comes next. An example file looks as follows:

ARNDC	       
 13
-15  19
-10 -22  11
-20 -10 -20  18
-10 -20 -10 -20  12

Here the -22 at position (3,2) is the score of replacing N by R.
This general scoring scheme is useful for protein sequences, where the
set of protein characters and a Dayhoff matrix are specified in the file.
Note that the characters in the alphabet must match those appearing in
the sequences exactly, including upper or lower case.
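
As an illustration of the file layout described above, the following C
sketch parses the alphabet line and the lower triangle from a text buffer.
It is not MAP's own reader. The name parse_matrix is hypothetical, and two
assumptions are made: blanks and tabs on the alphabet line are ignored
(the sample line ends in whitespace), and the matrix is treated as
symmetric, so each entry is stored for both orders of the symbol pair.

```c
#include <stdlib.h>

#define ASCII_SIZE 128

/* Parse matrix text: the first line lists the alphabet, then row i of
   the lower triangle holds i + 1 integers. Fills score[a][b] for every
   pair of alphabet symbols. Returns the alphabet size, or -1 if the
   input is malformed. */
int parse_matrix(const char *text, int score[ASCII_SIZE][ASCII_SIZE])
{
    unsigned char alphabet[ASCII_SIZE];
    int n = 0;
    const char *p = text;

    /* Collect the alphabet symbols up to the end of the first line. */
    while (*p != '\0' && *p != '\n') {
        unsigned char c = (unsigned char)*p++;
        if (c == ' ' || c == '\t' || c == '\r')
            continue;               /* assumption: whitespace is padding */
        if (c >= ASCII_SIZE || n >= ASCII_SIZE)
            return -1;
        alphabet[n++] = c;
    }
    if (n == 0)
        return -1;

    /* Read the lower triangle; strtol skips blanks and newlines. */
    for (int i = 0; i < n; i++) {
        for (int j = 0; j <= i; j++) {
            char *end;
            long v = strtol(p, &end, 10);
            if (end == p)
                return -1;          /* fewer integers than expected */
            p = end;
            score[alphabet[i]][alphabet[j]] = (int)v;
            score[alphabet[j]][alphabet[i]] = (int)v;  /* assumed symmetric */
        }
    }
    return n;
}
```

Applied to the example file, the entry for the pair (N, R) comes out as
-22, matching the position (3,2) mentioned in the text.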

The MAP program is written in C and runs under Unix systems on
Sun workstations and under DOS systems on PCs.
We think that the program is portable to many machines.

Sequences to be aligned are stored in one file.
A sample file of sequences looks like:
>Human-beta
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>Horse-beta
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
>Human-alpha
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>Horse-alpha
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR
>Sea-lamprey
PIVDTGSVAPLSAAEKTKIRSAWAPVYSDYETSGVDILVKFFTSTPAAEEFFPKFKGLTT
ADELKKSADVRWHAERIIDAVDDAVASMDDTEKMSSMKDLSGKHAKSFEVDPEYFKVLAA
VIADTVAAGDAGFEKLLRMICIL
LRSAY
>Sperm-whale
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
GDFGADAQGAMNKALELFRKDIAAKYKELG
YQG
>Yellow-lupin
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSSFLKGGTSEVPQNNPE
LQAHAGKVFKLVYEAAIQLEVTGVVASDATLKNLGSVHVSKGVVADAHFPVVKEAILKTIK
EVVGAKWSEELNSAWTIAYDELAIVIKKEMDDAA

The string after ">" is the name of the sequence that follows. As the
sample shows, a sequence may be split over several lines.
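
A reader for this sequence format might look like the C sketch below. It
is an illustration only, not MAP's input routine. The type Sequence, the
constant MAX_NAME and the function parse_sequences are hypothetical names,
and the input is assumed to be already loaded into a text buffer.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_NAME 64

typedef struct {
    char name[MAX_NAME];  /* text after ">" on the header line */
    char *residues;       /* all sequence lines joined, NUL-terminated */
    size_t length;        /* number of residues */
} Sequence;

/* Parse up to max_seqs records from text into out. Returns the number
   of sequences read, or -1 if memory runs out. The caller frees each
   residues buffer. */
int parse_sequences(const char *text, Sequence *out, int max_seqs)
{
    int count = 0;
    const char *p = text;

    while (*p != '\0') {
        size_t len = strcspn(p, "\n");   /* length of the current line */
        if (*p == '>') {
            if (count == max_seqs)
                break;
            Sequence *s = &out[count++];
            size_t n = len - 1;
            while (n > 0 && (p[n] == '\r' || p[n] == ' '))
                n--;                     /* drop trailing CR or blanks */
            if (n >= MAX_NAME)
                n = MAX_NAME - 1;        /* truncate very long names */
            memcpy(s->name, p + 1, n);
            s->name[n] = '\0';
            /* The rest of the input is an upper bound on this sequence. */
            s->residues = malloc(strlen(p) + 1);
            if (s->residues == NULL)
                return -1;
            s->residues[0] = '\0';
            s->length = 0;
        } else if (count > 0) {
            /* Append this line to the current sequence. */
            Sequence *s = &out[count - 1];
            for (size_t i = 0; i < len; i++) {
                char c = p[i];
                if (c != ' ' && c != '\t' && c != '\r')
                    s->residues[s->length++] = c;
            }
            s->residues[s->length] = '\0';
        }
        p += len;
        if (*p == '\n')
            p++;
    }
    return count;
}
```

Lines before the first ">" are ignored in this sketch, and consecutive
lines of one record are joined, which is how the short trailing lines in
the sample (such as "LRSAY") would be handled.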

To find the best alignment of sequences in file A,
use a command of the form

	   map  A  gs  ms  q  r > result

where map is the name of the compiled program, gs is the minimum length
of any gap in a short sequence charged with a constant gap penalty,
ms is a negative integer specifying the mismatch score, and q and r are
non-negative integers specifying the gap-open and gap-extend penalties,
respectively. Output alignments are saved in the file "result".

To use a scoring matrix defined in file S, use a command of the form

	   map  A  gs  S  q  r > result

Note that ms is replaced by the file S.
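
Since the same argument position holds either the integer ms or the file
name S, a program has to tell the two apart. The C sketch below shows one
possible way: treat the argument as ms only if it parses completely as an
integer. Whether MAP itself decides this way is an assumption, and the
function name parse_mismatch_arg is hypothetical.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Returns 1 and stores the value in *ms if arg is a whole integer.
   Returns 0 otherwise, meaning arg should be treated as the name of
   a scoring matrix file. */
int parse_mismatch_arg(const char *arg, int *ms)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(arg, &end, 10);
    if (end == arg || *end != '\0')
        return 0;                        /* not purely numeric */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return 0;                        /* does not fit in an int */
    *ms = (int)v;
    return 1;
}
```

With this check, an argument such as "-5" is taken as a mismatch score,
while any argument containing non-digit characters is taken as a file
name.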

Acknowledgments
The function diff() from Gene Myers is modified and used here.
The author thanks Chunwei Wang for pointing out the problem
with existing multiple alignment software.
The author also thanks Dave Gordon and John Hunt for suggesting
that the alignment be produced in flat and interleaved formats
so that it can be read by some phylogenetic analysis programs.