CS Department
A presentation for CSE 549
09/10/02
Outline:
Why Perl?
Perl stands for Practical Extraction and Reporting Language. The first version was released in 1987 by Larry Wall. A lot of programming languages existed by that time so it was really something special that allowed Perl to become one of the most popular and widely used languages as it is now. Probably the most important feature of Perl is its ability to do various string operations such as matching and substitution easily.
To demonstrate how powerful Perl might be here are solutions for a simple text processing problem in two languages: C++ and Perl. Suppose we want to remove identical words going one by one in a string regardless of their case. That is, given
This is is a a text with repeating words words.
we want to get
This is a text with repeating words.
The C++ program might look as follows:
char text[1024];
char
cur_word[1024], cur_word_upr[1024], next_word[1024], next_word_upr[1024];
// get a line
cin >> text;
istrstream string(text);
// get the 1st word
string.getline(cur_word, sizeof(cur_word), ' ');
// convert the 1st word to upper case to do non-case sensitive comparision
later
strcpy(cur_word_upr, cur_word);
strupr(cur_word_upr);
// get the 2nd word
string.getline(next_word, sizeof(next_word), ' ');
while (strcmp(next_word, "") != 0)
{
strcpy(next_word_upr, next_word);
strupr(next_word_upr);
// compare the word and the next word
if (strcmp(cur_word_upr, next_word_upr) == 0)
{
// same word detected
// do nothing
}
else
{
cout << cur_word << " "; cout.flush();
strcpy(cur_word, next_word);
strcpy(cur_word_upr, next_word_upr);
}
// get the next word
string.getline(next_word, sizeof(next_word), ' ');
}
cout << cur_word << endl;
}
while the solution in Perl is only 2 lines long:
$text = <STDIN>;
$text =~ s/\b(\w+)\s+\1\b/\1/gi;
Since different operations on strings are common in computational biology Perl became very popular among people in this field.
Our first Perl program: Goodbye, world!
The name of the program is "Goodbye, world!".
# Goodbye world program
print "Goodbye, world!\n";
die;
If you run it you will get something like this:
Goodbye, world!
Died at goodbye.pl line 4.
Data types and interpolation
We will consider the following data types:
scalar;
array;
hash.
A scalar is a single number or a string:
$a = 1;
$b = "text";
Variables in Perl should not be declared before they are used.
Most arithmetic operations (such as addition, multiplication etc.) work in Perl as they do in other programming languages. However, concatenation of the strings is performed with "." operator:
$b = "bb";
$c = "cc";
$a = $b . $c; # now $a is "bbcc";
You can print the value of the variable in a number of different ways:
print $a;
print "$a";
print '$a';
In the first two cases you will see "bbcc" as the output. This example demonstrates an important feature of Perl -- its ability to interpolate the strings in "". This means that the values of variables are substituted into the string before it is printed out. Remember that this works only with double quotes, not with single. In the last case you will see "$a" as the output. The ability to interpolate allows us to make the output easier. Instead of typing
print "the value of a is " . $a . "\n";
you type
print "the value of a is $a\n";
An array is a list of scalars:
@bases = ("A", "C", "G", "T");
Remember that one element of a list is a scalar. You should write $bases[1] instead of @bases[1] to get access to the second element of a list.
Multiple assignments are possible:
($head, @tail) = @list;
A hash is basically an array which elements are addressed by keywords. For example, the following hash represents a part of bases to acids translation table:
%translate = ("AAT", "Asn", "TAT", "Tyr", "TTT", "Phe");
If we want to know the acid that the sequence "AAT" is translated to we just write
$translate{"AAT"};
and it will return "Asn".
Regular expressions and string operations
This is probably the most exciting thing in Perl. It has a powerful mechanism that allows us to do a variety of text operation with almost no effort.
A regular expression (regexp) is a pattern representing a possibly infinite set of words. Examples:
"abc" -- single word;
"a|b" -- two words "a" and "b";
"(a|b)c" -- either ac or bc;
"ab*" -- a, ab, abb, abbb, ...
"(a|b)*" -- any string consisting of a and b plus an empty string.
Searching for a substring in a string can be done with "=~" operator:
if ($text =~ /AAT/)
{
print "The string has AAT substring";
}
Anything matched in parenthesis is remembered in variables $1,...,$9 and can be referred to later:
if ($text =~ /([A-Z]{3})/)
{
print "The string has $1 substring";
}
Perl has some predefined symbols for regexps:
. # Any single character
^ # Beginning of a line
$ # End of line
\b # Word boundary
* # Zero or more matches
+ # One or more matches
[abc] # Either a, b or c
[a-zA-Z] # Any letter
\w # Any word character (the same as [a-zA-Z0-9_]
\s # Space character
String substitution can be done by using s function:
$text =~ s/([A-Z])/:\1:/g;
It turns "TexT" into ":T:ex:T:". Note the "g" at the end of expression. This option makes Perl find all occurrences of the substring. Another possible option is "i" that makes Perl ignore the letter case.
Now we can look back at the first example of Perl program (the one that eliminates duplicate words) and understand how it works.
The tr function performs symbol-by-symbol translation:
$text =~ tr/ACGT/TGCA/;
This expression builds a complementary bases chain.
File handling
It is easy. Functions open and close are used to work with files. The following example shows how they should be used:
$file = "input.txt";
open(INFO, $file);
while ($string = <INFO>)
{
# do something
}
close(INFO);
Our second program: converting DNA sequence into amino acids sequence.
Now we are ready to write a really useful program. It will be converting a DNA string into a string of corresponding amino acids. The program works as follows. First it reads the conversion table from a file called trans.dat. This file contains a number of lines (namely 64 lines), each of them representing one DNA to amino acid translation. It looks as follows:
...
GCG
Ala
TAT Tyr
TAC Tyr
TAA TERM
...
And here is the program:
# convert DNA into amino acids
# read convertion table
$file = "trans.dat";
open(INFO, $file);
while ($string = <INFO>)
{
$string =~ /^\s*([ACGT]{3})\s+(\w+)/ || die "Bad trans.dat line $string\n";
$trans{$1} = $2;
}
close(INFO);
# read DNA sequence
print "DNA: ";
$dna_string = <STDIN>;
# convert it to uppercase
$dna_string =~ tr/acgt/ACGT/;
# convert DNA to amino acids
print "Amino acids: ";
while ($dna_string =~ s/(^[ACGT]{3})//i)
{
print "$trans{$1} ";
}
print "\n";
Why Bioperl?
A lot of problems in computational biology involve text-processing problems such as string matching and file parsing. This is why Perl is so popular among developers in bioinformatics. Since there were no standard modules in Perl for computational biology, different people developed their own programs and their own file formats. This resulted in many difficulties. For example, it was very difficult to interchange the data among different labs. Although most of labs used the same equipment and tried to solve similar problems they were unable to do any joint computational experiments primarily because of the differences in their file formats and software.
Bioperl is a project that coordinates efforts of different developers in bioinformatics whose goal is to build a standard tools for genomic analysis. Bioperl is a set of well-documented and freely available Perl modules. Each module represents one or more objects. The core objects of Bioperl are:
Sequence -- an object that represents a single DNA sequence;
Sequence Alignment;
BLAST -- can execute commands to generate BLAST reports and helps analyze them
Alignment Factories -- implements one of the alignment algorithms
3D Structure -- encapsulates coordinate data for 3D structures.
Below we discuss the Sequence Object since it is the most frequently used object in Bioperl.
Sequence Object
There is a number of sequence objects in Bioperl, but the most common of them is Seq. This object represents a single nucleotide sequence. It is implemented in file Seq.pm. This is the most widely used object in Bioperl because every program that creates, modifies and does any operations with DNA sequences uses Sequence object. It can be created either automatically when you read a sequence from a file or a database or explicitly. The object contains methods for reading and writing sequences from and to files in different formats. The supported file formats are Fasta, GenBank, Raw and many others. Sequence object also supports some other basic operations such as sequence translation, DNA to RNA conversion, subsequence extraction and so on.
The following example shows how a sequence object can be created explicitly:
$seq = Bio::Seq->new('-seq'=>'actgtggcgtcaact',
'-desc'=>'Sample Bio::Seq object',
'-display_id' => 'something',
'-accession_number' => 'accnum',
'-alphabet' => 'dna' );
However, you can read a sequence from a file as well. The following example transforms sequence files from Fasta format into text (raw) format:
$in = Bio::SeqIO->new('-file' => "inputfilename",
'-format' => 'Fasta');
$out = Bio::SeqIO->new('-file' => ">outputfilename",
'-format' => 'raw');
while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }
Sequences can have a number of attached features such as notes and categories, positions within the sequence.
$input_seqs = Bio::SeqIO->new ( '-format' => 'GenBank', '-file' => 't7.gb' );
while ( $s = $input_seqs->next_seq() )
{
print( "sequence length is ", $s->length(), "\n" );
print( "ac number is ", $s->accession(), "\n" );
foreach $f ( $s->all_SeqFeatures() )
{
print "Feature from ", $f->start, " to ", $f->end, " Primary tag ",
$f->primary_tag, " From", $f->source_tag(), "\n";
print "Feature description is ", $f->gff_string(), "\n";
}
}
There are some other useful methods in sequence objects:
$seqobj->revcom # reverse complements sequence
$seqobj->translate # translation of the sequence
The last command does the same as our last Perl program -- it translates a DNA string into amino acids string.