Sunday, August 8, 2010

5. The Adleman’s experiment :

5. The Adleman’s experiment :


There is no better way to understand how something works than by going through an example step by step. So let’s solve our own directed Hamiltonian Path problem, using the DNA methods demonstrated by Adleman. The concepts are the same but the example has been simplified to make it easier to follow and present.






Suppose that I live in Boston, and need to visit four cities: Atlanta, San Diego , St.Louis, and NY, with NY being my final destination. The airline I’m taking has a specific set of connecting flights that restrict which routes I can take (i.e. there is a flight from Boston. to San Diego, but no flight from St.Louis to San Diego). What should my itinerary be if I want to visit each city only once?


Figure 1. A sample traveling salesman problem involving the shortest path connecting all cities. Arrows indicate the direction that someone can travel. For example, a voyager can leave Atlanta and arrive in St. Louis, and vice versa

It should take you only a moment to see that there is only one route. Starting from Boston you need to fly to San Diego , Atlanta, St.Louis and then to N.Y. Any other choice of cities will force you to miss a destination, visit a city twice, or not make it to N.Y. For this example you obviously don’t need the help of a computer to find a solution. For six, seven, or even eight cities, the problem is still manageable. However, as the number of cities increases, the problem quickly gets out of hand. Assuming a random distribution of connecting routes, the number of itineraries you need to check increases exponentially.

Pretty soon you will run out of pen and paper listing all the possible routes, and it becomes a problem for a computer.....or perhaps DNA. The method Adleman used to solve this problem is basically the shotgun approach mentioned previously. He first generated all the possible itineraries and then selected the correct itinerary. This is the advantage of DNA. It’s small and there are combinatorial techniques that can quickly generate many different data strings. Since the enzymes work on many DNA molecules at once, the selection process is massively parallel.

Specifically, the method based on Adleman’s experiment would be as follows:

1) Generate all possible routes.
2) Select itineraries that start with the proper city and end with the final city.
3) Select itineraries with the correct number of cities.
4) Select itineraries that contain each city only once.

All of the above steps can be accomplished with standard molecular biology techniques.


Part I: Generate all possible routes

Strategy : Encode city names in short DNA sequences. Encode itineraries by connecting the city sequences for which routes exist.

DNA can simply be treated as a string of data. For example, each city can be represented by a "word" of six bases:

Boston GCTACG
San Diego CTAGTA
Atlanta TCGTAC
St.Louis CTACGG
New York ATGCCG

The entire itinerary can be encoded by simply stringing together these DNA sequences that represent specific cities. For example, the route from Boston -> San Diego -> Atlanta -> St.Louis -> New York would simply be GCTACGCTAGTATCGTACCTACGGATGCCG, or equivalently it could be represented in double stranded form with its complement sequence.

So how do we generate this? Synthesizing short single stranded DNA is now a routine process, so encoding the city names is straightforward. The molecules can be made by a machine called a DNA synthesizer or even custom ordered from a third party. Itineraries can then be produced from the city encodings by linking them together in proper order. To accomplish this you can take advantage of the fact that DNA hybridizes with its complimentary sequence.

For example, you can encode the routes between cities by encoding the compliment of the second half (last three letters) of the departure city and the first half (first three letters) of the arrival city. For example the route between St.Louis (CTACGG) and NY (ATGCCG) can be made by taking the second half of the coding for St.Louis (CGG) and the first half of the coding for NY (ATG). This gives CGGATG. By taking the complement of this you get, GCCTAC, which not only uniquely represents the route from St.Louis to NY, but will connect the DNA representing St.Louis and NY by hybridizing itself to the second half of the code representing St.Louis (...CGG) and the first half of the code representing NY (ATG...). For example:

Random itineraries can be made by mixing city encodings with the route encodings. Finally, the DNA strands can be connected together by an enzyme called ligase. What we are left with are strands of DNA representing itineraries with a random number of cities and random set of routes. For example:

We can be confident that we have all possible combinations including the correct one by using an excess of DNA encodings, say 10^13 copies of each city and each route between cities. Remember DNA is a highly compact data format, so numbers are on our side.


Part II: Select itineraries that start and end with the correct cities:


Strategy: Selectively copy and amplify only the section of the DNA that starts with LA and ends with NY by using the Polymerase Chain Reaction.

After Part I, we now have a test tube full of various lengths of DNA that encode possible routes between cities. What we want are routes that start with Boston and end with NY. To accomplish this we can use a technique called Polymerase Chain Reaction (PCR), which allows you to produce many copies of a specific sequence of DNA. PCR is an iterative process that cycles through a series of copying events using an enzyme called polymerase. Polymerase will copy a section of single stranded DNA starting at the position of a primer, a short piece of DNA complimentary to one end of a section of the DNA that you're interested in.

By selecting primers that flank the section of DNA you want to amplify, the polymerase preferentially amplifies the DNA between these primers, doubling the amount of DNA containing this sequence. After many iterations of PCR, the DNA you're working on is amplified exponentially. So to selectively amplify the itineraries that start and stop with our cities of interest, we use primers that are complimentary to Boston and NY. What we end up with after PCR is a test tube full of double stranded DNA of various lengths, encoding itineraries that start with Boston and end with NY.

Part III: Select itineraries that contain the correct number of cities.

Strategy: Sort the DNA by length and select the DNA whose length corresponds to 5 cities.

Our test tube is now filled with DNA encoded itineraries that start with Boston and end with NY, where the number of cities in between Boston and NY varies. We now want to select those itineraries that are five cities long. To accomplish this we can use a technique called Gel Electrophoresis, which is a common procedure used to resolve the size of DNA. The basic principle behind Gel Electrophoresis is to force DNA through a gel matrix by using an electric field. DNA is a negatively charged molecule under most conditions, so if placed in an electric field it will be attracted to the positive potential.

However since the charge density of DNA is constant (charge per length) long pieces of DNA move as fast as short pieces when suspended in a fluid. This is why you use a gel matrix. The gel is made up of a polymer that forms a meshwork of linked strands. The DNA now is forced to thread its way through the tiny spaces between these strands, which slows down the DNA at different rates depending on its length. What we typically end up with after running a gel is a series of DNA bands, with each band corresponding to a certain length. We can then simply cut out the band of interest to isolate DNA of a specific length. Since we known that each city is encoded with 6 base pairs of DNA, knowing the length of the itinerary gives number of cities. In this case we would isolate the DNA that was 30 base pairs long (5 cities times 6 base pairs).


Part IV: Select itineraries that have a complete set of cities:


Strategy: Successively filter the DNA molecules by city, one city at a time. Since the DNA we start with contains five cities, we will be left with strands that encode each city once.

DNA containing a specific sequence can be purified from a sample of mixed DNA by a technique called affinity purification. This is accomplished by attaching the compliment of the sequence in question to a substrate like a magnetic bead. The beads are then mixed with the DNA. DNA, which contains the sequence you're after then hybridizes with the complement sequence on the beads. These beads can then be retrieved and the DNA isolated.

So we now affinity purify fives times, using a different city complement for each run. For example, for the first run we use Boston.'-beads (where the ' indicates compliment strand) to fish out DNA sequences which contain the encoding for Boston (which should be all the DNA because of step 3), the next run we use Atlanta '-beads, and then San Diego '-beads, St.Louis '-beads, and finally NY'-beads.











\














The order isn’t important. If an itinerary is missing a city, then it will not be "fished out" during one of the runs and will be removed from the candidate pool. What we are left with are the are itineraries that start in Boston, visit each city once, and end in NY. This is exactly what we are looking for. If the answer exists we would retrieve it at this step.


Reading out the answer:

One possible way to find the result would be to simply sequence the DNA strands. However, since we already have the sequence of the city encodings we can use an alternate method called graduated PCR. Here we do a series of PCR amplifications using the primer corresponding to Boston, with a different primer for each city in succession. By measuring the various lengths of DNA for each PCR product we can piece together the final sequence of cities in our itinerary. For example, we know that the DNA itinerary starts with Boston and is 30 base pairs long, so if the PCR product for the LA and Atlanta primers was 24 base pairs long, you know Atlanta is the fourth city in the itinerary (24 divided by 6). Finally, if we were careful in our DNA manipulations the only DNA left in our test tube should be DNA itinerary encoding Boston, San Diego, St.Louis, Atlanta, and NY. So if the succession of primers used is Boston & San Diego, Boston & St.Louis, Boston & Atlanta, and Boston & NY, then we would get PCR products with lengths 12, 18, 24, and 30 base pairs.




Caveats:

Adleman's experiment solved a seven city problem, but there are two major shortcomings preventing a large scaling up of his computation. The complexity of the traveling salesman problem simply doesn’t disappear when applying a different method of solution - it still increases exponentially. For Adleman’s method, what scales exponentially is not the computing time, but rather the amount of DNA. Unfortunately this places some hard restrictions on the number of cities that can be solved; after the Adleman article was published, more than a few people have pointed out that using his method to solve a 200 city HP problem would take an amount of DNA that weighed more than the earth. Another factor that places limits on his method is the error rate for each operation. Since these operations are not deterministic but stochastically driven (we are doing chemistry here), each step contains statistical errors, limiting the number of iterations you can do successively before the probability of producing an error becomes greater than producing the correct result. For example an error rate of 1% is fine for 10 iterations, giving less than 10% error, but after 100 iterations this error grows to 63%.

Conclusions :


So will DNA ever be used to solve a traveling salesman problem with a higher number of cities than can be done with traditional computers? Well, considering that the record is a whopping 13,509 cities, it certainly will not be done with the procedure described above. It took this group only three months, using three Digital AlphaServer 4100s (a total of 12 processors) and a cluster of 32 Pentium-II PCs. The solution was possible not because of brute force computing power, but because they used some very efficient branching rules. This first demonstration of DNA computing used a rather unsophisticated algorithm, but as the formalism of DNA computing becomes refined, new algorithms perhaps will one day allow DNA to overtake conventional computation and set a new record.

On the side of the "hardware" (or should I say "wetware"), improvements in biotechnology are happening at a rate similar to the advances made in the semiconductor industry. For instance, look at sequencing; what once took a graduate student 5 years to do for a Ph.D thesis takes Celera just one day. With the amount of government funded research dollars flowing into genetic-related R&D and with the large potential payoffs from the lucrative pharmaceutical and medical-related markets, this isn't surprising. Just look at the number of advances in DNA-related technology that happened in the last five years. Today we have not one but several companies making "DNA chips," where DNA strands are attached to a silicon substrate in large arrays (for example Affymetrix's genechip). Production technology of MEMS is advancing rapidly, allowing for novel integrated small scale DNA processing devices. The Human Genome Project is producing rapid innovations in sequencing technology. The future of DNA manipulation is speed, automation, and miniaturization.

And of course we are talking about DNA here, the genetic code of life itself. It certainly has been the molecule of this century and most likely the next one. Considering all the attention that DNA has garnered, it isn’t too hard to imagine that one day we might have the tools and talent to produce a small integrated desktop machine that uses DNA, or a DNA-like biopolymer, as a computing substrate along with set of designer enzymes. Perhaps it won’t be used to play Quake IV or surf the web -- things that traditional computers are good at -- but it certainly might be used in the study of logic, encryption, genetic programming and algorithms, automata, language systems, and lots of other interesting things that haven't even been invented yet.

No comments:

Post a Comment