DNA Sequence Alignment Algorithms

Overview

This application implements two fundamental algorithms for DNA sequence alignment:

  • Needleman-Wunsch Algorithm - For global alignment
  • Smith-Waterman Algorithm - For local alignment

These algorithms enable biologists to score the similarity between DNA sequences and view the aligned portions.

Needleman-Wunsch Algorithm (Global Alignment)

The Needleman-Wunsch algorithm aligns two DNA sequences over their entire length. Here's how it works:

  1. Create a matrix where nucleotides of sequence 1 are along the vertical axis and sequence 2 along the horizontal axis
  2. Initialize the first row and column with gap penalties (default: -2)
  3. Fill each cell with the maximum score from:
    • Cell to the left + gap penalty (horizontal movement)
    • Cell above + gap penalty (vertical movement)
    • Diagonal cell + match reward (+1) or mismatch penalty (-1)
  4. Traceback from the bottom-right cell to top-left, following the path that generated each score
  5. The alignment score is the value in the bottom-right cell

Use case: When you need to align entire sequences to compare overall similarity.

Smith-Waterman Algorithm (Local Alignment)

The Smith-Waterman algorithm finds the most similar subsequences within two DNA sequences:

  1. Create a matrix similar to Needleman-Wunsch
  2. Fill each cell with the maximum score from:
    • Cell to the left + gap penalty
    • Cell above + gap penalty
    • Diagonal cell + match reward or mismatch penalty
    • Zero (0) - allows alignment to start fresh
  3. Traceback from the highest scoring cell until reaching zero or a cell with no direction
  4. The alignment score is the highest value in the entire matrix

Use case: When you need to find conserved regions or similar subsequences within larger DNA sequences.

Scoring Parameters

Parameter Value Description
Match Score +1 Reward for matching nucleotides
Mismatch Penalty -1 Penalty for non-matching nucleotides
Gap Penalty -2 Penalty for inserting a gap in alignment

Input Format

DNA sequences should be provided as strings containing only the nucleotide letters:

  • A - Adenine
  • T - Thymine
  • C - Cytosine
  • G - Guanine

The input is case-insensitive. Sequences can be entered manually (separated by newlines) or uploaded as a CSV file.

Output Explanation

For each pair of sequences, the application provides:

  • Global Alignment 1 & 2: The two sequences aligned over their entire length, with gaps (-) inserted for optimal alignment
  • Global Score: Integer score representing overall similarity (can be negative)
  • Local Alignment 1 & 2: The most similar subsequences from each sequence, aligned with gaps
  • Local Score: Positive integer representing the similarity of the aligned subsequences
  • Timestamp: When the alignment was computed

Resources