Reading your Blueprint: Genome Sequencing
In this post I’ll go over the third and most recent method of identifying people through DNA: looking at their genome. A genome is the sum total of all of the genetic information of an individual organism. If using electrophoresis and PCR is DNA fingerprinting, determining someone’s genome is writing their DNA biography. It took the Human Genome Project more than a decade to produce the genome of a human, a testament to just how much information the genome contains. Nowadays, of course, scientists and technicians can determine someone’s genome in much less time (about 4 weeks with one machine in 2009), thanks to the incredible advances in technology we’ve enjoyed. But how did we even get started? DNA fingerprinting is great if you already know what you’re looking for, or if you want to compare one sample to another, but how do you get from that to figuring out every piece of genetic information about an organism?
Well, it starts with DNA sequencing. DNA is made up of nucleotides, and the order of those nucleotides codes for various proteins. About the best thing we could hope for, then, is to know the sequence of the nucleotides in a sample of DNA so we can know which proteins it will code for. DNA sequencing is the process of determining this sequence.
You might recall that with DNA fingerprinting, restriction enzymes cut up DNA into fragments at specific nucleotide patterns. Those fragments are copied again and again through PCR, then a researcher pushes them through a gel by electrophoresis, which forms bands on the gel at different levels. The farther a band has traveled toward the far end of the gel, the smaller the fragments of DNA inside that band, and you can compare different samples by seeing where the bands show up when you subject them to the same restriction enzymes.
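To make that recap concrete, here is a minimal sketch of a restriction digest in Python. The function names and the two short sample sequences are made up for illustration. The only real-world detail is the enzyme EcoRI, which recognizes GAATTC and cuts just after the G. The sketch looks at one strand only and treats a "band" as nothing more than a fragment length.

```python
def digest(dna: str, site: str, cut_offset: int) -> list[str]:
    """Cut dna at every occurrence of site.

    cut_offset is how many letters into the site the cut lands
    (1 for EcoRI, which cuts G^AATTC).
    """
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])
    # Drop empty pieces that appear if a cut falls at the very start or end.
    return [f for f in fragments if f]


def band_sizes(dna: str, site: str = "GAATTC", cut_offset: int = 1) -> list[int]:
    """Fragment lengths, smallest first (the smallest travels farthest)."""
    return sorted(len(f) for f in digest(dna, site, cut_offset))


# Hypothetical samples: sample_b has one letter changed inside the second site.
sample_a = "ATGAATTCGGCTAGAATTCTT"
sample_b = "ATGAATTCGGCTAGACTTCTT"

print(band_sizes(sample_a))  # [3, 7, 11]
print(band_sizes(sample_b))  # [3, 18]
```

A single changed letter removes a cut site from the second sample, so the two samples produce different band patterns. That difference is what a fingerprinting comparison looks for.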
DNA sequencing takes this a step further. The nucleotides used to build DNA can come in two forms: deoxynucleotide triphosphate (dNTP), which is the usual version, and dideoxynucleotide triphosphate (ddNTP), which has a plain hydrogen where the usual version has a hydroxyl (OH) group on its 3' carbon. That hydroxyl group is where the next nucleotide would attach, so without it no further phosphodiester bond can form. This means that if a DNA strand is elongating and a ddNTP attaches instead of a dNTP, the DNA can’t elongate any more. It’s done, terminated, its story is over.
Okay, so what? How does this get us to DNA sequencing? The way we get the sequence of a piece of DNA is by labeling the DNA fragments, attaching a fluorescent or radioactive marker to either the ddNTP, the primer (which starts the elongation of a piece of DNA) or the dNTPs that make up the rest of the DNA. Then you split the sample into four separate containers, and to each container you add a small amount of a different kind of ddNTP alongside the normal dNTPs. Remember there are four nucleotides found in DNA: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). Each of these has a terminating, ddNTP form that we can add to a sample. So now you have four samples and each one has a different terminating nucleotide added to it. What this means is that a growing strand can stop elongating at any spot where that nucleotide normally appears in the DNA sequence. So for example, let’s say we have a piece of DNA with the sequence
CACGATTCGA (10 NTPs)
In the first sample we add the adenine ddNTP so in that sample we’ll have DNA fragments that look like this:
CA* (2 NTPs)
CACGA* (5 NTPs)
CACGATTCGA* (10 NTPs)
Because a ddNTP can attach at any of these points as the DNA elongates, and each possible fragment size gets amplified during PCR, all of these fragment sizes will be present in the amplified sample that is put through electrophoresis. These different fragment sizes will then form different bands based on how large they are.
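The four reactions can be sketched in a few lines of Python. This is a simplified illustration under the same assumptions as the example above: the fragment is written as having the same letters as the original sequence, every possible stopping point shows up at least once, and the function names are made up for this post.

```python
BASES = "ACGT"


def terminated_lengths(sequence: str, dd_base: str) -> list[int]:
    """Lengths of every fragment that can end in the given ddNTP.

    A fragment stops at position i (counting from 1) only if the
    nucleotide at that position matches the terminating base.
    """
    return [i + 1 for i, base in enumerate(sequence) if base == dd_base]


def run_four_reactions(sequence: str) -> dict[str, list[int]]:
    """One container per ddNTP, each yielding its own set of fragment lengths."""
    return {base: terminated_lengths(sequence, base) for base in BASES}


lanes = run_four_reactions("CACGATTCGA")
for base, lengths in lanes.items():
    print(base, lengths)
# A [2, 5, 10]
# C [1, 3, 8]
# G [4, 9]
# T [6, 7]
```

The A container gives fragments of 2, 5 and 10 NTPs, matching the three starred fragments listed above. Notice that every length from 1 to 10 shows up in exactly one container.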
If you run all four ddNTP reactions side by side on a gel, the example above would look something like this:
NTPs |  A   |  T   |  C   |  G   |
10   | BAND |  0   |  0   |  0   |
9    |  0   |  0   |  0   | BAND |
8    |  0   |  0   | BAND |  0   |
7    |  0   | BAND |  0   |  0   |
6    |  0   | BAND |  0   |  0   |
5    | BAND |  0   |  0   |  0   |
4    |  0   |  0   |  0   | BAND |
3    |  0   |  0   | BAND |  0   |
2    | BAND |  0   |  0   |  0   |
1    |  0   |  0   | BAND |  0   |
Looking at this result, a technician can read off the exact sequence of the DNA by starting at the smallest fragment and noting which lane has a band at each length: 1:C, 2:A, 3:C… and so on.
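Reading the gel is mechanical enough to sketch in Python. The band positions below are copied from the gel in this example. The function name read_gel and the error handling are my own additions for illustration, and a real gel would be far messier than a tidy dictionary.

```python
def read_gel(lanes: dict[str, list[int]]) -> str:
    """Rebuild a sequence from the fragment lengths seen in each lane.

    lanes maps a base letter to the lengths at which that lane shows a band.
    """
    position_to_base: dict[int, str] = {}
    for base, lengths in lanes.items():
        for length in lengths:
            if length in position_to_base:
                raise ValueError(f"two lanes both show a band at length {length}")
            position_to_base[length] = base

    longest = max(position_to_base)
    missing = [n for n in range(1, longest + 1) if n not in position_to_base]
    if missing:
        raise ValueError(f"no band at lengths {missing}")

    # Walk from the smallest fragment to the largest, one letter per band.
    return "".join(position_to_base[n] for n in range(1, longest + 1))


gel = {
    "A": [2, 5, 10],
    "T": [6, 7],
    "C": [1, 3, 8],
    "G": [4, 9],
}
print(read_gel(gel))  # CACGATTCGA
```

The two checks matter because each length should light up in exactly one lane. A gap or a double band means the gel cannot be read cleanly at that position.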
This is pretty nifty, but this example only deals with a DNA fragment ten NTPs in length. A human genome has DNA that is billions of NTPs long. How in the world can a genome get sequenced in any reasonable amount of time?
The answer comes in three parts. The first trick is to automate the process, so that a researcher doesn’t have to guide each step along by hand. The next trick, related to the first, is to conduct sequencing experiments in parallel. In other words, you want to have several sequencing experiments going on at the same time. The two are related because the way to do this is to have an array of wells holding samples and the required chemicals, then have a machine deposit a controlled amount of the required enzymes or other materials into every well at the same time. The machine can then heat each well and allow it to cool as needed for PCR. The final trick is to break a long strand of DNA into much smaller fragments and then sequence those fragments in random order, rather than trying to work through them in the order they appear naturally. This technique is called shotgun sequencing.
You might wonder how, after all these random fragments are sequenced, researchers can put them back together in the proper order. The way this works is similar to a jigsaw puzzle. In a puzzle you may have several pieces that are obviously part of the sky, say, and other pieces that are part of other separate areas. Likewise, if you have several sequences, some of which overlap each other, you can join them by recognizing that the end of one matches the start of another. You can put the sequences from fragments with the same pattern together and proceed through the whole genome that way. Obviously, if you were to try to do this by eye, it would be very time consuming, but with the help of computers the situation becomes manageable. So much so that the newest forms of genome sequencing use much smaller fragments that are only a few dozen base pairs long, allowing them to sequence huge numbers of fragments in parallel very quickly. A computer can then use statistics to predict where each fragment belongs. This method produces more errors, but makes up for it by the amount of usable information it gives, similar to how Wikipedia makes more errors than an encyclopedia, but makes up for it by being so convenient and covering such a vast array of topics.
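The jigsaw idea can be sketched with a simple greedy approach: keep joining the two fragments that overlap the most until nothing overlaps any more. To be clear, this is a toy stand-in I chose for illustration, not what real sequencing software does. Real assemblers have to cope with read errors, repeated regions and both strands of the DNA, and they use more sophisticated methods. The function names, the three reads and the minimum overlap of 3 are all assumptions.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest end of a that matches the start of b (0 if too short)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that sit entirely inside another read.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]

    while len(reads) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left that overlaps; return the separate pieces
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for n, r in enumerate(reads) if n not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping reads from the example sequence, in scrambled order.
shotgun_reads = ["TTCGA", "CACGAT", "GATTCG"]
print(greedy_assemble(shotgun_reads))  # ['CACGATTCGA']
```

With only three clean reads the greedy choice works out. With billions of short, error-prone reads it can easily join the wrong pieces, which is part of why the statistics mentioned above come into play.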