DNA Sequencing, Proteomics, and Microarray Methods
Chromosome and Genome Sequencing Approaches
Understanding the sequence information of entire chromosomes and bacterial genomes involves several methods, including random shotgun, ordered shotgun, and directed sequencing, followed by assembly.
Sequencing Methods Overview
- Random Shotgun Sequencing
- Ordered Shotgun Sequencing
- Directed Sequencing (Primer Walking; Direct Genomic Sequencing)
Random Shotgun Sequencing
This method determines the specific sequence of nucleotides within a genome:
- Long DNA strands are randomly sheared or cut into small fragments (typically 2-4 kb).
- These fragments are cloned into sequencing vectors like M13 or pUC.
- The cloned fragments are sequenced from both ends.
- Computer algorithms assemble the overlapping sequences into contiguous blocks (contigs), aiming for one complete sequence or as few contigs as possible.
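The assembly step can be illustrated with a toy greedy overlap merger. This is only a minimal sketch: the function names `overlap` and `greedy_assemble` and the `min_len` threshold are illustrative assumptions, reads are assumed to be error-free and from one strand, and production assemblers handle sequencing errors, repeats, and both strands in far more elaborate ways.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two sequences with the largest overlap.

    Returns the list of contigs left when no overlap >= min_len remains.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are wholly contained in another read.
    contigs = [r for r in contigs
               if not any(r != other and r in other for other in contigs)]
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical 28-nt "genome" cut into three overlapping reads.
genome = "ATGCGTACGTTAGCCGATAGGCTTAACG"
reads = [genome[0:12], genome[8:20], genome[16:28]]
contigs = greedy_assemble(reads)
```

If the reads cover the whole sequence with sufficient overlap, the result is a single contig. Uncovered regions leave several contigs, which is the situation gap closure is meant to resolve.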
Directed Sequencing (Primer Walking)
This process is often used for finishing after the initial shotgun phase, specifically for gap closure:
- Specific sequencing primers are designed to extend appropriate cloned sequences into regions where gaps exist in the assembly.
- Specific sequencing primers can also be used to sequence directly from the genomic DNA template to resolve ambiguities or close gaps.
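As a rough illustration of how a walking primer might be chosen near the end of a contig, the sketch below scans backwards from the contig's 3' end for a window with moderate GC content. All parameters are assumptions made for the example and do not come from the text: a 20-nt primer, a 50-nt setback from the contig end, a 40-60% GC window, and the simple Wallace rule (2 °C per A/T, 4 °C per G/C) as a melting temperature estimate. Real primer design also screens for secondary structure, repeats, and uniqueness in the genome.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in `seq`."""
    return sum(base in "GC" for base in seq) / len(seq)


def wallace_tm(seq):
    """Approximate melting temperature (deg C) by the Wallace rule."""
    at = sum(base in "AT" for base in seq)
    gc = sum(base in "GC" for base in seq)
    return 2 * at + 4 * gc


def pick_walking_primer(contig, primer_len=20, setback=50,
                        gc_min=0.4, gc_max=0.6):
    """Return (start, primer) for a primer reading off the contig's 3' end.

    The primer is taken from the contig strand itself, at least `setback`
    bases before the end, so that the new read overlaps known sequence
    before extending into the gap. Returns None if no window qualifies.
    """
    last_start = len(contig) - setback - primer_len
    for start in range(last_start, -1, -1):
        candidate = contig[start:start + primer_len]
        if gc_min <= gc_fraction(candidate) <= gc_max:
            return start, candidate
    return None


# Hypothetical 120-nt contig used only to exercise the function.
contig = "ACGT" * 30
choice = pick_walking_primer(contig)
```

To extend from the opposite end of a contig, the same search would be run on the reverse complement of the contig.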
Ordered Shotgun Sequencing
This approach involves mapping before sequencing:
- A library of large-insert clones (e.g., in lambda phage vectors) is generated. If necessary, these can be subcloned from larger constructs such as YACs (Yeast Artificial Chromosomes) or BACs (Bacterial Artificial Chromosomes).
- The ends of these large clones are sequenced to determine their order and create a physical map of the genome.
- A minimal tiling path is selected, consisting of the minimum set of overlapping clones required to cover the entire genome.
- The lambda inserts corresponding to the minimal tiling path are sheared and subcloned into sequencing vectors.
- Each of these subcloned lambda inserts is individually sequenced using the shotgun method and assembled.
- Finally, the sequences from all the individual lambda inserts are assembled to reconstruct the complete, contiguous genome sequence.
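Selecting a minimal tiling path is essentially an interval-covering problem once the clones have been placed on the physical map. The sketch below uses the standard greedy strategy: among the clones that start at or before the currently covered position, pick the one that extends furthest. The clone names, the coordinates, and the `(name, start, end)` tuple layout are hypothetical, and a real pipeline would usually also require a minimum overlap between neighbouring clones.

```python
def minimal_tiling_path(clones, genome_length):
    """Greedy minimum set of clones covering positions [0, genome_length).

    `clones` is an iterable of (name, start, end) tuples with half-open
    coordinates on the physical map. Raises ValueError if there is a gap.
    """
    ordered = sorted(clones, key=lambda clone: clone[1])
    path = []
    covered_to = 0
    i = 0
    while covered_to < genome_length:
        best = None
        # Consider every clone that starts inside the covered region.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best)
        covered_to = best[2]
    return path


# Hypothetical clone map (coordinates in kb) for a 120-kb region.
clones = [("A", 0, 40), ("B", 10, 50), ("C", 35, 90),
          ("D", 45, 85), ("E", 80, 120)]
path = minimal_tiling_path(clones, 120)
path_names = [name for name, _, _ in path]
```

In this example clones B and D are redundant, so only A, C, and E would go on to shotgun subcloning.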
Genome Annotation via Signal Searches
Signal searches within sequence data identify functional elements like intron/exon boundaries and transcriptional regulation sites, which are crucial for annotating the genome database.
ORF (Open Reading Frame): By definition, an ORF is any continuous stretch of codons in a single reading frame that begins with a start codon (usually AUG; ATG in the DNA sequence) and ends with a stop codon (UAA, UAG, or UGA; TAA, TAG, or TGA in DNA). ORF searches are less informative on raw eukaryotic genomic DNA, where introns interrupt the coding sequence, but they are highly relevant when analyzing mRNA, cDNA, or prokaryotic DNA.
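The ORF definition translates directly into a simple scan. The following is a minimal sketch that checks only the three forward reading frames of a DNA string. The function name `find_orfs`, the `min_codons` cutoff, and the test sequence are illustrative assumptions, and a fuller search would also scan the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Find ORFs (ATG ... stop) in the three forward reading frames.

    Returns a list of (start, end, sequence) tuples sorted by start, with
    0-based half-open coordinates; the stop codon is included and counts
    toward `min_codons`.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, dna[start:end]))
                start = None                     # look for the next ORF
    return sorted(orfs)


# Hypothetical sequence containing two short ORFs in different frames.
orfs = find_orfs("CCATGGCTTAAGGATGAAATGA")
```

A start codon with no in-frame stop codon before the end of the sequence is not reported, in keeping with the definition above.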
Sanger Dideoxy Sequencing Principle
The classical chain-termination method, developed by Frederick Sanger, relies on the following components:
- A single-stranded DNA template
- A DNA primer (complementary to the template)
- A DNA polymerase enzyme
- Normal deoxynucleoside triphosphates (dNTPs: dATP, dGTP, dCTP, dTTP)
- Modified dideoxynucleoside triphosphates (ddNTPs: ddATP, ddGTP, ddCTP, ddTTP)
These chain-terminating ddNTPs lack the 3′-OH group necessary for forming a phosphodiester bond between nucleotides. When a ddNTP is incorporated by the DNA polymerase, DNA strand elongation ceases. The ddNTPs can be radioactively or fluorescently labelled for detection.
The process involves setting up four separate sequencing reactions. Each reaction contains the DNA template, primer, polymerase, and all four standard dNTPs. Crucially, each reaction also contains a small amount of one of the four ddNTPs (ddATP in the ‘A’ reaction, ddTTP in the ‘T’ reaction, etc.).
During the reaction, the polymerase extends the primer. Elongation stops whenever a ddNTP is incorporated instead of the corresponding dNTP. This results in a collection of DNA fragments of different lengths, each ending with a specific ddNTP.
After the reactions, the resulting DNA fragments are heat-denatured (made single-stranded) and separated by size using gel electrophoresis, typically on a denaturing polyacrylamide-urea gel. The four reactions are loaded into separate lanes (A, T, G, C).
The DNA bands are visualized (e.g., by autoradiography for radioactive labels or UV light for fluorescent labels), and the DNA sequence can be read directly from the gel image by noting the order of bands from bottom (shortest fragment) to top (longest fragment).
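The logic of reading a four-lane gel can be mimicked in a few lines. This is an idealized sketch, not a model of real chemistry: it assumes every possible termination product is present and perfectly resolved, and the names `sanger_lanes`, `read_gel`, and the 20-nt primer length are illustrative assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Reverse complement of a DNA string written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def sanger_lanes(extension, primer_len):
    """Fragment lengths expected in each of the four ddNTP reactions.

    `extension` is the newly synthesized strand downstream of the primer.
    A fragment of length primer_len + i appears in lane X when the i-th
    incorporated base is X (a ddX terminated the chain at that position).
    """
    lanes = {base: [] for base in "ACGT"}
    for i, base in enumerate(extension, start=1):
        lanes[base].append(primer_len + i)
    return lanes


def read_gel(lanes):
    """Read the sequence from shortest band (bottom) to longest (top)."""
    bands = sorted((length, base)
                   for base, lengths in lanes.items()
                   for length in lengths)
    return "".join(base for _, base in bands)


# Hypothetical template region downstream of the primer binding site.
template = "TTGACGCATGC"
extension = reverse_complement(template)   # what the polymerase synthesizes
lanes = sanger_lanes(extension, primer_len=20)
sequence_read = read_gel(lanes)
```

Note that the sequence read from the gel is that of the synthesized strand, which is complementary to the template.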
Distinguishing Motifs, Domains, and Repeats
- Motifs: Short, conserved sequence patterns that often indicate a specific function or binding site and appear in various molecules.
- Domains: Conserved parts of a protein sequence and structure that can evolve, function, and exist independently. Domains often have a distinct three-dimensional fold, carry a unique function (e.g., DNA binding, enzymatic activity), and can appear as modules in different, otherwise unrelated proteins.
- Repeats: Sequence segments that occur multiple times within a protein or nucleic acid, often forming structurally or functionally interdependent modules.
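Because motifs are short patterns, they can often be located with a regular expression. The sketch below searches for the widely used N-glycosylation consensus, written N-{P}-[ST]-{P} in PROSITE-style notation (asparagine, any residue except proline, serine or threonine, any residue except proline). The choice of this motif and the test sequence are illustrative assumptions, and a pattern match only suggests a candidate site, not a confirmed function.

```python
import re

# N-{P}-[ST]-{P} as a regular expression. The lookahead lets overlapping
# matches be reported, and group 1 captures the matched residues.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein, pattern=N_GLYCOSYLATION):
    """Return (0-based position, matched residues) for every motif hit."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(protein)]


# Hypothetical protein fragment; the NPT at position 7 is excluded
# because proline follows the asparagine.
hits = find_motif("MKNGSAPNPTLNWTQ")
```

Domains generally cannot be found this way, since they are defined by longer, more divergent sequences and are usually detected with profile-based methods instead.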
Strategies for Protein Identification in Proteomics
Proteomics refers to the large-scale experimental analysis of proteins. In practice, the term often refers specifically to protein purification followed by identification using mass spectrometry.
Protein Isolation via TAP Tagging
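Tandem Affinity Purification (TAP tagging) is a technique that enhances the specificity of protein complex isolation using two distinct affinity tags and sequential purification steps.
Specificity is a major challenge in pull-down assays, because proteins that bind nonspecifically to the affinity matrix co-purify with the true complex. The TAP method reduces this background by requiring the complex to survive two independent purifications: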
- In the first purification round, the tagged protein and its associated complex are captured via the first affinity tag (e.g., Protein A binding to IgG Sepharose beads).
- The bound complexes are then eluted, often gently by cleaving a specific protease site engineered between the two tags.
- The eluted complexes are then subjected to a second round of affinity purification using the second tag (e.g., a calmodulin-binding peptide binding to calmodulin resin, or a chitin-binding domain (CBD) binding to chitin).
- The final, highly purified complex is specifically eluted (e.g., by adding calcium chelators like EGTA to disrupt calmodulin binding, or by specific conditions disrupting the CBD/Chitin interaction).
Protein Identification using MS-MS
Tandem Mass Spectrometry (MS-MS) is a powerful technique used to identify proteins:
- Proteins are first isolated (e.g., cut from a gel band or collected from HPLC fractions) and subjected to enzymatic digestion, typically using trypsin. This breaks the protein into smaller peptides.
- The resulting peptide mixture is introduced into the mass spectrometer, ionized, and sent into a first mass analyzer. Ions of a specific mass-to-charge ratio (parent ions, often doubly charged) are selected.
- These selected parent ions are directed into a collision cell where they are fragmented, usually by Collision-Induced Dissociation (CID) with an inert gas.
- The resulting fragment ions (daughter ions, often singly charged) are analyzed by a second mass analyzer. The pattern of daughter ions (the MS-MS spectrum) provides information about the amino acid sequence of the parent peptide.
- This sequence information (or the fragmentation pattern itself) is used to search sequence databases and identify the protein from which the peptide originated.
- CID fragmentation is generally most reliable for peptides smaller than about 2-3 kDa. Because trypsin cleaves C-terminal to arginine (R) and lysine (K) residues, most of the peptides it produces fall within this size range.
- Tryptic cleavage also leaves a basic residue (R or K) at the C-terminus of each peptide, which causes peptides to fragment more predictably along their backbone during CID, aiding sequence determination.
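The digestion and fragmentation steps can be approximated in silico, which is roughly what database search tools do when they compare theoretical and observed spectra. The sketch below is a simplified illustration under stated assumptions: it uses standard monoisotopic residue masses, ignores missed cleavages and modifications, applies the commonly used rule that trypsin does not cleave before proline (a rule not stated in the text), and computes only singly charged b and y fragment ions. The function names and the test protein are hypothetical.

```python
# Monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_digest(protein):
    """Cleave C-terminal to K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[residue] for residue in peptide) + WATER


def mz(peptide, charge):
    """Mass-to-charge ratio of the peptide carrying `charge` protons."""
    return (peptide_mass(peptide) + charge * PROTON) / charge


def fragment_ions(peptide):
    """Singly charged b ions (N-terminal) and y ions (C-terminal)."""
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(MONO[r] for r in peptide[:i]) + PROTON)
        y_ions.append(sum(MONO[r] for r in peptide[-i:]) + WATER + PROTON)
    return b_ions, y_ions


# Hypothetical protein; the K-P bond is left uncleaved.
peptides = tryptic_digest("AGKPLRSTVKMR")
parent_mz = mz("STVK", charge=2)          # doubly charged parent ion
b_ions, y_ions = fragment_ions("STVK")    # singly charged daughter ions
```

Differences between consecutive ions in the same series correspond to residue masses, which is how a sequence can be read from an MS-MS spectrum.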
Understanding DNA Microarrays
A DNA microarray (also known as a DNA chip or biochip) is a tool used in molecular biology consisting of a collection of microscopic DNA spots attached to a solid surface (like glass or silicon).
Scientists use DNA microarrays for large-scale measurements, such as:
- Measuring the expression levels of thousands of genes simultaneously.
- Genotyping multiple regions of a genome.
Each DNA spot contains picomoles (10⁻¹² moles) of a specific DNA sequence known as a probe (or reporter or oligo). Probes are short sections of genes or other DNA elements designed to hybridize (bind specifically) to complementary sequences in a sample, known as the target. The target is typically labelled cDNA or cRNA derived from mRNA.
Hybridization is carried out under high-stringency conditions (e.g., specific temperature and salt concentration) to ensure specificity. The amount of target bound to each probe is detected and quantified, usually by measuring the signal from fluorescent, silver, or chemiluminescent labels attached to the target. This signal intensity reflects the relative abundance of the corresponding nucleic acid sequences in the target sample, providing insights into gene expression levels or genetic variations.
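One common way to turn signal intensities into a relative expression measure is a background-corrected log2 ratio between two samples. The sketch below assumes a comparison of a test sample against a reference sample, which is an assumption for illustration rather than something the description above prescribes. The probe names, intensities, background value, and `floor` parameter are hypothetical, and real analyses add normalization and replicate handling.

```python
import math


def log2_ratios(sample, reference, background=0.0, floor=1.0):
    """Background-corrected log2(sample / reference) for each shared probe.

    `sample` and `reference` map probe names to raw signal intensities.
    Corrected intensities are clamped to `floor` so that the ratio stays
    defined when a spot is at or below background.
    """
    ratios = {}
    for probe in sample.keys() & reference.keys():
        s = max(sample[probe] - background, floor)
        r = max(reference[probe] - background, floor)
        ratios[probe] = math.log2(s / r)
    return ratios


# Hypothetical raw intensities (arbitrary units) for three probes.
sample = {"geneA": 900.0, "geneB": 200.0, "geneC": 300.0}
reference = {"geneA": 300.0, "geneB": 500.0, "geneC": 300.0}
ratios = log2_ratios(sample, reference, background=100.0)
```

On this scale a value of 2 means a fourfold higher signal in the sample than in the reference, -2 means fourfold lower, and 0 means no change.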