Dot-matrix methods for pairwise alignment: 2018

Genome informatics

Dot-matrix methods for pairwise alignment

Branimir Lazarević 3156/2017

Pairwise sequence alignment

Sequence alignment is the procedure of comparing two (pair‐wise alignment) or more multiple sequences by searching for a series of individual characters or patterns that are in the same order in the sequences.

How similar are two sequences? This simple question drives much of bioinformatics, from assembly of overlapping sequence fragments into contigs, alignment of new sequences against reference genomes, BLAST searches of sequence databases, molecular phylogeny, and homology modeling of protein structures.

Answering this question requires finding the optimal alignment between two different sequences, scoring their similarity based on the optimal alignment, and then assessing the significance of this score. The optimal alignment, of course, depends on the scoring scheme.

There are three the most famous methods for pairwise sequence alignment:

1) global alignment (Needleman-Wunsch)

2) local alignment (Smith Waterman)

3) dot plot.

Global alignment

The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul B. Needleman and Christian D. Wunsch and published in 1970. The algorithm essentially divides a large problem (e.g. the full sequence) into a series of smaller problems and uses the solutions to the smaller problems to reconstruct a solution to the larger problem. It is also sometimes referred to as the optimal matching algorithm and the global alignment technique. The Needleman–Wunsch algorithm is still widely used for optimal global alignment, particularly when the quality of the global alignment is of the utmost importance.

Local alignment

The Smith–Waterman algorithm performs local sequence alignment for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

The algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981. Like the Needleman–Wunsch algorithm, of which it is a variation, Smith–Waterman is a dynamic programming algorithm. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme). The main difference to the Needleman–Wunsch algorithm is that negative scoring matrix cells are set to zero, which renders the (thus positively scoring) local alignments visible. Traceback procedure starts at the highest scoring matrix cell and proceeds until a cell with score zero is encountered, yielding the highest scoring local alignment. Because of its cubic computational complexity in time and quadratic complexity in space, it often cannot be practically applied to large-scale problems and is replaced in favor of less general but computationally more efficient alternatives such as (Gotoh, 1982), (Altschul and Erickson, 1986), and (Myers and Miller 1988).

Dot plot matrix method

In bioinformatics a dot plot is a graphical method that allows the comparison of two biological sequences and identify regions of close similarity between them. It is a type of recurrence plot. The aim, usually, is to find or visualize similarity between two nucleic acid sequences or two protein.

A dot plot matrix are also called similarity matrix. These were introduced by Gibbs and McIntyre in 1970and are two-dimensional matrices that have the sequences of the proteins being compared along the vertical and horizontal axes. For a simple visual representation of the similarity between two sequences, individual cells in the matrix can be shaded black if residues are identical, so that matching sequence segments appear as runs of diagonal lines across the matrix.

Some idea of the similarity of the two sequences can be gleaned from the number and length of matching segments shown in the matrix. Identical proteins will obviously have a diagonal line in the center of the matrix. Insertions and deletions between sequences give rise to disruptions in this diagonal. Regions of local similarity or repetitive sequences give rise to further diagonal matches in addition to the central diagonal. One way of reducing this noise is to only shade runs or 'tuples' of residues, e.g. a tuple of 3 corresponds to three residues in a row. This is effective because the probability of matching three residues in a row by chance is much lower than single-residue matches.

Dot plots compare two sequences by organizing one sequence on the x-axis, and another on the y-axis, of a plot. When the residues of both sequences match at the same location on the plot, a dot is drawn at the corresponding position. Note, that the sequences can be written backwards or forwards, however the sequences on both axes must be written in the same direction. Also note, that the direction of the sequences on the axes will determine the direction of the line on the dot plot. Once the dots have been plotted, they will combine to form lines. The closeness of the sequences in similarity will determine how close the diagonal line is to what a graph showing a curve demonstrating a direct relationship is. This relationship is affected by certain sequence features such as frame shifts, direct repeats, and inverted repeats. Frame shifts include insertions, deletions, and mutations. The presence of one of these features, or the presence of multiple features, will cause for multiple lines to be plotted in a various possibility of configurations, depending on the features present in the sequences. A feature that will cause a very different result on the dot plot is the presence of low-complexity region/regions. Low-complexity regions are regions in the sequence with only a few amino acids, which in turn, causes redundancy within that small or limited region. These regions are typically found around the diagonal, and may or may not have a square in the middle of the dot plot.

Alghoritm in steps

1 1. One sequence (A) is listed across the top of the matrix and the other (B) is listed down the left side.

2 2. Starting from the first character in B, one moves across the page keeping in the first row and placing a dot in many column where the character in A is the same.

3 3. The process is continued until all possible comparisons between A and B are made.

4 4. Any region of similarity is revealed by a diagonal row of dots.

5 5. Isolated dots not on diagonal represent random matches.

Detection of matching regions can be improved by filtering out random matches and this can be achieved by using a sliding window. It means that instead of comparing a single sequence position more positions is compared at the same time and dot is printed only if a certain minimal number of matches occur. Dot matrix analysis can also be used to find direct and inverted repeats within the sequences.

Figure 1: Demonstration dot plot matrix method

Interpretation of dot matrix

Regions of similarity appear as diagonal runs of dots. If the main diagonal is filled with dots, the match is complete, we compare two similar sequences. Reverse diagonals (perpendicular to diagonal) indicate inversions. Reverse diagonals crossing diagonal indicate palindromes. If we have a secondary diagonal filled with dots, we compare two palindromes.

Figure 2: Palindrome "abcdeedcba" (left) and alignment with gaps(right)

Uses for dot plot matrix

· We can use dot plot matrix to align two proteins or two nucleid acid sequences.

· Also, we can use to find amino acid repeats within a protein by comparing a protein sequence to itself. Repeats appear as a set of diagonal runs stacked vertically.

· Can use to find self base-pairing of an RNA by comparing a sequence to itself complemented and reversed.

· Excellent approach for finding sequence transpositions.

Filtering to remove "noise"

A problem with dot matrix for long sequences is that they can be very noisy due to lots of insignificant matches. Solution use a window and a treshold:

· compare character by character within a window (have to choose window size).

· require certain fraction of matches within window in order to display it with a "dot".

Window size changes with goal of analysis:

· size of average exon (A segment of a DNA or RNA molecule containing information coding for a protein or peptide sequence.)

· size of average protein structural element

· size of gene promoter

· size of enzyme active site.

Tips for DNA (nucleic acid sequence) dot plots:

· Use a nucleic acid scoring matrix: ednafull in EMBOSS (A high-quality package of free, Open Source software for molecular biology)

· Because there are only 4 nucleotides, increase window size and threshold score until the background disappears and you are left with clear signal.

· Using too large a window, such as 100, with a low threshold will cause the diagonals to overlap and lose resolution to see small repeats or inversions. Use a smaller window (less than 30) and raise the threshold score to favor almost exact matches

Adventages

This algorithm is fairly easy to implement. Also, it is easy to understand visually. It is so intuitive for programmers. In figures, on the previous pages we can see that algorithm has good overview of places for good alignment.

The output of algorithm are all possible alignment of pairs. It can be used in combination with other methods. Readily reveals the presence of insertions/deletions and direct and inverted repeats that are more difficult to find by the others, more automated methods.

Disadventages

It is necessary to find the best window size and treshold by trial-and-error. And when you find the best window size and treshold you can compare only 2 sequences. Also, it doesn't tell you what mutation occurred in the region of similarity(if there is one) since the two sequences shared a common ancestor.

Most dot matrix computer programs do not show an actual alignment. Does not return a score to indicate how ‘optimal’ a given alignment is (no statistical significance that could be tested).

Software for making dotplots

R library: Function dotPlot() allows you to specify a window size and treshold.

EMBOSS: dottup allows you to specify a windowsize but not a treshold.

EMBOSS: dotmatcher allows you to specify a window size and treshold. Instead of using the number of identities in a window as the window score, it calculates a more complex score based on the similarities of the bases/amino acids.

Tuesday, June 12, 2018