A MULTIPLE ALIGNMENT PROGRAM (MAP):
copyright (c) 1992 Xiaoqiu Huang
The distribution of the program is granted provided no charge is made
and the copyright notice is included.
Proper attribution of the author as the source of the software would
be appreciated: Xiaoqiu Huang, On Global Sequence Alignment,
Computer Applications in the Biosciences, 10(3), 227-235, 1994.
Department of Computer Science
Michigan Technological University
Houghton, MI 49931
The MAP program computes a multiple global alignment of sequences using
an iterative pairwise method. The underlying algorithm for aligning
two sequences computes a best overlapping alignment between
the two sequences without penalizing terminal gaps. In addition,
long internal gaps in short sequences are not heavily penalized.
MAP is therefore well suited to producing alignments in which some
sequences have long terminal or internal gaps. The MAP program is
designed in a space-efficient manner, so long sequences can be aligned.
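
As a rough illustration of what "overlapping alignment without penalizing
terminal gaps" means, the sketch below computes only the score of a best
overlap between two sequences. It is not MAP's code: it is a textbook
affine-gap dynamic program, it runs in quadratic time, it does not recover
the alignment itself, and it omits the constant penalty for long gaps in
short sequences. The names overlap_score, max2, MATCH_SCORE and NEG_INF are
invented for this example; ms, q and r are the mismatch score and gap
parameters described in the scoring paragraphs that follow.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define MATCH_SCORE 10          /* fixed match score stated in the text */
#define NEG_INF (INT_MIN / 2)   /* stands in for minus infinity, leaves room to subtract */

static int max2(int x, int y) { return x > y ? x : y; }

/* Score of a best overlapping alignment of a and b.
   An i-symbol internal gap scores -(q + r * i); terminal gaps are free.
   Returns NEG_INF if memory cannot be allocated. */
int overlap_score(const char *a, const char *b, int ms, int q, int r)
{
    size_t m, n, i, j;
    int *H, *F, best;

    assert(a != NULL && b != NULL);
    m = strlen(a);
    n = strlen(b);
    H = malloc((n + 1) * sizeof *H);  /* H[j]: best score of a path ending at cell (i, j) */
    F = malloc((n + 1) * sizeof *F);  /* F[j]: same, but ending with a[i] against a gap */
    if (H == NULL || F == NULL) { free(H); free(F); return NEG_INF; }

    /* Row 0: a leading gap costs nothing. */
    for (j = 0; j <= n; j++) { H[j] = 0; F[j] = NEG_INF; }
    best = H[n];

    for (i = 1; i <= m; i++) {
        int diag = H[0];   /* value of H at (i-1, j-1) */
        int e = NEG_INF;   /* best score ending with b[j] against a gap */
        H[0] = 0;          /* column 0: a leading gap costs nothing */
        for (j = 1; j <= n; j++) {
            int h;
            e = max2(e - r, H[j - 1] - q - r);     /* extend or open a gap */
            F[j] = max2(F[j] - r, H[j] - q - r);   /* H[j] still holds row i-1 */
            h = diag + (a[i - 1] == b[j - 1] ? MATCH_SCORE : ms);
            h = max2(h, max2(e, F[j]));
            diag = H[j];
            H[j] = h;
        }
        best = max2(best, H[n]);   /* ending in the last column: trailing gap is free */
    }
    for (j = 0; j <= n; j++)       /* ending in the last row: trailing gap is free */
        best = max2(best, H[j]);

    free(H);
    free(F);
    return best;
}
```

With ms = -5, q = 10 and r = 2, the sequences AAACGT and CGTTTT score 30 in
this sketch: the suffix CGT of the first is matched against the prefix CGT
of the second, and the unaligned ends cost nothing.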
Users supply the scoring parameters. In the simplest form, users
provide 3 integers: ms, q and r, where ms is the score of a mismatch
and -(q + r * i) is the score of an i-symbol indel. Each match
automatically receives a score of 10. In addition, an integer gs is
provided so that any gap of length > gs in a short sequence
receives the score -(q + r * gs), the linear score for a gap of
length gs. In other words, long gaps in the short sequence are
given a constant penalty. This simple scoring scheme may be used
for DNA sequences. NOTE: all scores are integers.
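
The simple scoring scheme above can be written down as two small functions.
This is only a sketch of the rules as stated in the text, not code taken
from MAP; the function names and the in_short_seq flag (nonzero when the
gap lies in the shorter sequence) are assumptions made for the example.

```c
#include <assert.h>

/* Score of aligning two symbols: 10 for a match, ms for a mismatch. */
int pair_score(char a, char b, int ms)
{
    return a == b ? 10 : ms;
}

/* Score of a gap of len symbols.
   Normally -(q + r * len); a gap longer than gs in the short sequence
   is capped at -(q + r * gs). */
int gap_score(int len, int q, int r, int gs, int in_short_seq)
{
    assert(len > 0 && q >= 0 && r >= 0 && gs >= 0);
    if (in_short_seq && len > gs)
        return -(q + r * gs);   /* constant score for long gaps */
    return -(q + r * len);      /* ordinary affine gap score */
}
```

For example, with q = 10, r = 2 and gs = 20, a 3-symbol gap scores -16,
a 50-symbol gap in the short sequence scores -50, and a 50-symbol gap in
the long sequence scores -110.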
In general, users can define an alphabet of characters to appear
in the sequences and a matrix that gives the substitution score
for each pair of symbols in the alphabet. The 127 ASCII characters
are eligible. The alphabet and matrix are given in a file, where
the first line lists the characters in the alphabet and the lower
triangle of the matrix comes next. An example file looks as follows:
-10 -22 11
-20 -10 -20 18
-10 -20 -10 -20 12
Here the -22 at position (3,2) is the score of replacing N by R.
This general scoring scheme is useful for protein sequences where the
set of protein characters and Dayhoff matrix are specified in the file.
Note that the characters in the alphabet must match the characters
appearing in the sequences exactly, including upper or lower case.
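
One way to turn the lower triangle into a full lookup table is sketched
below. It assumes the alphabet line and the triangle values have already
been read from the file, that row i of the triangle holds i values ending
with the diagonal entry (as in the example rows above), and that the matrix
is symmetric. The names fill_matrix and NSYM are invented for this sketch
and are not taken from MAP.

```c
#include <assert.h>
#include <string.h>

#define NSYM 128   /* table indexed directly by 7-bit ASCII code */

/* Expand a row-major lower triangle (diagonal included) into a symmetric
   table. Returns 0 on success, or -1 if the number of values does not
   match the alphabet size or a symbol is outside 7-bit ASCII. */
int fill_matrix(const char *alphabet, const int *tri, size_t ntri,
                int score[NSYM][NSYM])
{
    size_t k, i, j, t = 0;

    assert(alphabet != NULL && tri != NULL && score != NULL);
    k = strlen(alphabet);
    if (ntri != k * (k + 1) / 2)
        return -1;
    for (i = 0; i < k; i++) {
        for (j = 0; j <= i; j++) {
            unsigned char a = (unsigned char)alphabet[i];
            unsigned char b = (unsigned char)alphabet[j];
            if (a >= NSYM || b >= NSYM)
                return -1;
            score[a][b] = score[b][a] = tri[t++];   /* mirror across the diagonal */
        }
    }
    return 0;
}
```

Because the table is indexed by character code, 'a' and 'A' are different
entries, which is why the case of the alphabet must agree with the case
used in the sequences.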
The MAP program is written in C and runs under Unix systems on
Sun workstations and under DOS systems on PCs.
The program should be portable to many other machines.
Sequences to be aligned are stored in one file.
A sample file of sequences looks like:
The string after ">" is the name of the following sequence.
To find the best alignment of the sequences in file A,
use a command of the form
map A gs ms q r > result
where map is the name of the object code, gs is the gap length above
which a gap in a short sequence is charged a constant gap penalty,
ms is a negative integer specifying the mismatch weight, and q and r are
non-negative integers specifying the gap-open and gap-extend penalties,
respectively. Output alignments are saved in the file "result".
To use a scoring matrix defined in file S, use a command of the form
map A gs S q r > result
Note that ms is replaced by the file S.
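
Since the same command-line position holds either the integer ms or the
file name S, a program has to tell the two apart. The text does not say how
MAP does this; the sketch below shows one possible approach, which treats
the argument as ms only if the whole string parses as a decimal integer and
otherwise as a file name. The function name parse_int_arg is invented for
this example.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Returns 1 and stores the value in *out if arg is entirely a decimal
   integer that fits in an int; returns 0 otherwise. */
int parse_int_arg(const char *arg, int *out)
{
    char *end;
    long v;

    assert(arg != NULL && out != NULL);
    errno = 0;
    v = strtol(arg, &end, 10);
    if (end == arg || *end != '\0')        /* empty, or trailing non-digits */
        return 0;
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return 0;
    *out = (int)v;
    return 1;
}
```

Under this rule an argument such as -5 is read as ms, while a name such as
pam250.mat is taken to be a matrix file. A matrix file whose name consists
only of digits would be misread as ms, so such names would have to be
avoided.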
The function diff() from Gene Myers is modified and used here.
The author thanks Chunwei Wang for pointing out the problem
with existing multiple alignment software.
The author also thanks Dave Gordon and John Hunt for suggesting
that the alignment be produced in flat and interleaved formats
so that it can be read by some phylogenetic analysis programs.