Genome informatics
Dot-matrix methods for pairwise alignment
Branimir
Lazarević 3156/2017
Pairwise sequence alignment
Sequence alignment is the procedure of comparing two (pair‐wise alignment) or more
multiple sequences by searching for a series of individual characters or
patterns that are in the same order in the sequences.
How similar are two sequences? This
simple question drives much of bioinformatics, from assembly of overlapping
sequence fragments into contigs, alignment of new sequences against reference
genomes, BLAST searches of sequence databases, molecular phylogeny, and
homology modeling of protein structures.
Answering this question requires
finding the optimal alignment between two different sequences, scoring their
similarity based on the optimal alignment, and then assessing the significance
of this score. The optimal alignment, of course, depends on the scoring scheme.
There are three the most famous methods
for pairwise sequence alignment:
1) global alignment (Needleman-Wunsch)
2) local alignment (Smith Waterman)
3) dot plot.
Global alignment
The Needleman–Wunsch algorithm is
an algorithm used in bioinformatics to
align protein or nucleotide sequences. It
was one of the first applications of dynamic programming to compare biological sequences.
The algorithm was developed by Saul B. Needleman and Christian D. Wunsch and
published in 1970. The algorithm essentially divides a large problem (e.g.
the full sequence) into a series of smaller problems and uses the solutions to
the smaller problems to reconstruct a solution to the larger problem. It
is also sometimes referred to as the optimal matching algorithm and
the global alignment technique. The Needleman–Wunsch
algorithm is still widely used for optimal global alignment, particularly when
the quality of the global alignment is of the utmost importance.
Local alignment
The Smith–Waterman algorithm performs
local sequence
alignment for
determining similar regions between two strings of nucleic
acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman
algorithm compares segments of all possible lengths and optimizes the similarity
measure.
The
algorithm was first proposed by Temple F. Smith and Michael
S. Waterman in
1981. Like the Needleman–Wunsch
algorithm, of which
it is a variation, Smith–Waterman is a dynamic
programming algorithm.
As such, it has the desirable property that it is guaranteed to find the
optimal local alignment with respect to the scoring system being used (which
includes the substitution
matrix and
the gap-scoring scheme). The main difference to
the Needleman–Wunsch
algorithm is
that negative scoring matrix cells are set to zero, which renders the (thus
positively scoring) local alignments visible. Traceback procedure starts at the
highest scoring matrix cell and proceeds until a cell with score zero is
encountered, yielding the highest scoring local alignment. Because of its
cubic computational complexity in time and quadratic complexity in space, it often cannot be
practically applied to large-scale problems and is replaced in favor of less
general but computationally more efficient alternatives such as (Gotoh, 1982),
(Altschul and Erickson, 1986), and (Myers and Miller 1988).
Dot plot matrix method
In bioinformatics a dot plot is a graphical method
that allows the comparison of two biological sequences and identify regions of close
similarity between them. It is a type of recurrence plot. The aim, usually, is to find or
visualize similarity between two nucleic
acid sequences or two protein.
A dot plot matrix are
also called similarity matrix. These were introduced by Gibbs and McIntyre in
1970 and are two-dimensional matrices that have the sequences of the
proteins being compared along the vertical and horizontal axes. For a simple
visual representation of the similarity between two sequences, individual cells
in the matrix can be shaded black if residues are identical, so that matching
sequence segments appear as runs of diagonal lines across the matrix.
Some idea
of the similarity of the two sequences can be gleaned from the number and
length of matching segments shown in the matrix. Identical proteins will
obviously have a diagonal line in the center of the matrix. Insertions and
deletions between sequences give rise to disruptions in this diagonal. Regions
of local similarity or repetitive sequences give rise to further diagonal
matches in addition to the central diagonal. One way of reducing this noise is
to only shade runs or 'tuples' of residues,
e.g. a tuple of 3 corresponds to three residues in a row. This is effective
because the probability of matching three residues in a row by chance is much
lower than single-residue matches.
Dot plots
compare two sequences by organizing one sequence on the x-axis, and another on
the y-axis, of a plot. When the residues of both sequences match at the same
location on the plot, a dot is drawn at the corresponding position. Note, that
the sequences can be written backwards or forwards, however the sequences on
both axes must be written in the same direction. Also note, that the direction
of the sequences on the axes will determine the direction of the line on the
dot plot. Once the dots have been plotted, they will combine to form lines. The
closeness of the sequences in similarity will determine how close the diagonal line
is to what a graph showing a curve demonstrating a direct
relationship is.
This relationship is affected by certain sequence features such as frame
shifts, direct repeats, and inverted repeats. Frame shifts include insertions,
deletions, and mutations. The presence of one of these features, or the
presence of multiple features, will cause for multiple lines to be plotted in a
various possibility of configurations, depending on the features present in the
sequences. A feature that will cause a very different result on the dot plot is
the presence of low-complexity region/regions. Low-complexity regions are regions in the sequence with only a few amino acids, which in
turn, causes redundancy within that small or limited region. These regions are
typically found around the diagonal, and may or may not have a square in the
middle of the dot plot.
Alghoritm in steps
1 1. One
sequence (A) is listed across the top of the matrix and the other (B) is listed
down the left side.
2 2. Starting
from the first character in B, one moves across the page keeping in the first
row and placing a dot in many column where the character in A is the same.
3 3. The
process is continued until all possible comparisons between A and B are made.
4 4. Any
region of similarity is revealed by a diagonal row of dots.
5 5. Isolated
dots not on diagonal represent random matches.
Detection of matching regions can be improved by
filtering out random matches and this can be achieved by using a sliding window.
It
means that instead of comparing a single sequence position more positions is
compared at the same time and dot is printed only if a certain minimal number
of matches occur. Dot matrix analysis can also be used to
find direct and inverted repeats within the sequences.
Interpretation of dot matrix
Regions of similarity appear as
diagonal runs of dots. If the main diagonal is filled with dots, the match is
complete, we compare two similar sequences. Reverse diagonals (perpendicular to
diagonal) indicate inversions. Reverse diagonals crossing diagonal indicate
palindromes. If we have a
secondary diagonal filled with dots, we compare two palindromes.
Figure 2: Palindrome
"abcdeedcba" (left) and alignment with gaps(right)
Uses for dot plot matrix
·
We
can use dot plot matrix to align two proteins or two nucleid acid sequences.
·
Also,
we can use to find amino acid repeats within a protein by comparing a protein
sequence to itself. Repeats appear as a set of diagonal runs stacked
vertically.
·
Can
use to find self base-pairing of an RNA by comparing a sequence to itself
complemented and reversed.
·
Excellent
approach for finding sequence transpositions.
Filtering to remove "noise"
A problem with dot matrix for long
sequences is that they can be very noisy due to lots of insignificant matches.
Solution use a window and a treshold:
·
compare
character by character within a window (have to choose window size).
·
require
certain fraction of matches within window in order to display it with a
"dot".
Window size changes with goal of
analysis:
·
size
of average exon (A segment
of a DNA or RNA molecule containing information coding for a protein or peptide
sequence.)
·
size
of average protein structural element
·
size
of gene promoter
·
size
of enzyme active site.
Tips for DNA (nucleic acid sequence)
dot plots:
·
Use
a nucleic acid scoring matrix: ednafull in EMBOSS (A
high-quality package of free, Open Source software for molecular biology)
·
Because
there are only 4 nucleotides, increase window size and threshold score until
the background disappears and you are left with clear signal.
·
Using
too large a window, such as 100, with a low threshold will cause the diagonals
to overlap and lose resolution to see small repeats or inversions. Use a
smaller window (less than 30) and raise the threshold score to favor almost
exact matches
Adventages
This algorithm is fairly easy to
implement. Also, it is easy to understand visually. It is so intuitive for programmers. In figures, on the previous
pages we can see that algorithm has good overview of places for good alignment.
The output of algorithm are all
possible alignment of pairs. It can be used in combination with other methods. Readily reveals the presence
of insertions/deletions and direct and inverted repeats that are more difficult
to find by the others, more automated methods.
Disadventages
It is necessary to find the best
window size and treshold by trial-and-error. And when you find the best window
size and treshold you can compare only 2 sequences. Also, it doesn't tell you
what mutation occurred in the region of similarity(if there is one) since the
two sequences shared a common ancestor.
Most dot matrix computer programs do
not show an actual alignment. Does not return a score to indicate how ‘optimal’
a given alignment is (no statistical significance that could be tested).
Software for making dotplots
R library: Function dotPlot() allows you to specify a window size and
treshold.
EMBOSS: dottup allows you to specify
a windowsize but not a treshold.
EMBOSS: dotmatcher allows you to
specify a window size and treshold. Instead of using the number of identities in a window as the
window score, it calculates a more complex score based on the similarities of the bases/amino
acids.