The Human Genome Project (as of 2002)

Todd Edwards

(Note: I wrote this article in March 2002, so I guess it is time to write the follow-up article to show where we are now.)

Sequencing the entire human genome is an undertaking on par with launching a person into space. It is a massive project that began in 1990 and is anticipated to be complete in 2003, after thirteen years and an estimated $3 billion. Many new technologies had to be developed to complete the sequencing in a reasonable amount of time. The project inspired science fiction along with science. But just what exactly is it?

As mentioned before, proteins are the major component of the “bricks and mortar” that make up an organism. The designs for proteins are contained in genes in the DNA, and the Human Genome Project’s goal is to determine the makeup of these genes. One of the primary motivations for the project is to advance health care. Many disorders, such as Parkinson’s disease, result from mutations to specific genes, while other diseases, such as hypertension and coronary heart disease, are caused by both genetic and environmental factors. Knowing the code for each gene should make it easier for us to find the mutations and then to develop a treatment.

The first thing we need to know is how the genetic information is stored. DNA is a double-stranded polymer that is twisted into a double helix. Each individual unit in the polymer carries a base, and only four different types of bases are found in DNA: adenine (A), thymine (T), guanine (G), and cytosine (C). The two DNA strands are connected together by the bases, so we often refer to single units as “base pairs”. Bases will only form pairs of A and T or C and G, so we only need to know the sequence of bases on one strand. Since we know how they pair, we can always fill in the information for the second strand. Scientists usually stretch DNA out when they draw it, since the main concern is usually the sequence of base pairs.

Human Genome Project figure 1
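
The pairing rule described above (A with T, C with G) is simple enough to write down as a short program. The following is only a minimal Python sketch; the function names are made up for illustration. It fills in the second strand from the first. One fact not covered in the text is that the two strands run in opposite directions, so the partner strand is usually read backwards; the second function shows that convention.

```python
# Pairing rule from the text: A pairs with T, and C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the partner base for each position, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own direction (reversed).

    Assumption beyond the article: the two strands are antiparallel,
    so the second strand is conventionally written backwards.
    """
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("GATTACA"))          # CTAATGT
    print(reverse_complement("GATTACA"))  # TGTAATC
```

Because the second strand can always be rebuilt this way, a sequence database only has to store one strand.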

A single piece of DNA is quite long and contains sequences of base pairs that correspond to genes, sequences that promote and regulate gene expression, and sequences whose purpose is unknown. In a cell, the gene sequences are copied into RNA and then translated into proteins, while the rest of the sequences remain untranslated. Each individual piece of DNA in a human cell would be too long and unwieldy if it were stretched out, so special proteins called histones help tightly coil the DNA into a structure called a chromosome.

The Human Genome Project’s goal is to determine the sequence of all the base pairs in the DNA of a human cell, known as the genome. Fortunately, all the cells in a person, except sperm and egg cells, contain identical copies of that person’s genome. However, finding out the sequence of all the approximately three billion base pairs is only part of the project. Researchers will also have to figure out which bases are parts of genes and which are untranslated sequence. Furthermore, they will have to link the genes to proteins and genetic disorders. To give you an idea of the scope of the project, estimates of the total number of genes range from 30,000 to 40,000.

The technology of 2002 could determine the sequence of about 700 bases at a time. That means that the genome has to be broken up into small pieces, sequenced, and then reassembled. Celera, the biotech company that sequenced the genome in parallel with the publicly funded effort, used a “shotgun” approach. They broke the genome into random fragments and designed a computer program to assemble the sequence like a puzzle. First they put the pieces together to form larger blocks, and then they put the blocks together to make the final sequence. The publicly funded groups, such as the Department of Energy’s Human Genome Project and the National Institutes of Health’s National Human Genome Research Institute, used a similar approach, but instead of making random fragments of the entire genome, they made fragments of defined regions. That method simplifies the final step of putting the blocks together.
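
To make the puzzle analogy concrete, here is a toy Python sketch of one simple way to reassemble overlapping fragments: repeatedly glue together the two pieces that share the longest overlap. This greedy strategy is an assumption chosen for illustration, not a description of Celera’s actual program, and the function names and the tiny example sequence are invented. Real assemblers also have to cope with sequencing errors, repeated sequences, and fragments read from either strand.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlaps remain.

    Returns the remaining blocks; a single block means everything joined up.
    """
    # Drop exact duplicates and fragments wholly contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]

    while len(frags) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more: leave the blocks separate
        merged = frags[best_i] + frags[best_j][best_size:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


if __name__ == "__main__":
    # Invented fragments of the made-up sequence ATGCGTACGTTAGC, out of order.
    pieces = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
    print(greedy_assemble(pieces))  # ['ATGCGTACGTTAGC']
```

Restricting the input to fragments from one defined region, as the publicly funded groups did, keeps each puzzle small, so there are fewer misleading overlaps to sort out when the blocks are joined.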

Everyone, with the exception of identical twins, has a slightly different genome, so you might be wondering just whose DNA is being sequenced. The simple answer is no one person’s. Before the project was conceived, scientists had isolated human cells from various sources to use as references. They keep the cells alive in dishes, and everyone works with daughter cells of the originals. These daughter cells all have the same genome, so work will be consistent across all sequencing laboratories. Different labs may use different cells, so the final database will merge information from the various sources to create a reference sequence. Most of the DNA is the same between two people, so the reference database will be adequate in most cases. Genes that are involved in diseases will be sequenced from many different sources, so we’ll eventually have more information about variation in those hot spots.

In June 2000, researchers announced that they had a rough draft of the human genome. That means that they had determined the sequence of the base pairs, but that they still had to figure out which parts of the sequence were genes, regulatory sequences, and so on. As of March 2002, the working draft of the human genome is over 90% complete, and scientists anticipate completing this historic project in 2003.

When the project began back in 1990, researchers knew that the technology at the time was insufficient to handle the task. They planned to spend a significant portion of the first five-year budget on improving the technology, in order to reduce the time and cost of DNA sequencing. The project led to a revolution in high-throughput molecular biology techniques, where multiple experiments are performed in parallel and in small volumes to reduce time and cost. These high-throughput experimental techniques were then automated so the work could be done faster and more cheaply by a robot than by a human technician.

High-throughput, automated sequencing rapidly generates large amounts of data. The project spawned a whole new branch of science, called bioinformatics, which studies the best ways to store and access the data. Not only must we store the sequence information, but we also need to store information about where genes are located, how they are regulated, what physical proteins, traits, or diseases they correspond to, and so on. Bioinformaticists created databases to store the vast amount of information and developed tools to sift through the databases and to visualize the information. They also created search algorithms that find patterns in the sequence that reveal similarities between proteins with many different functions. Molecular biologists can then study and understand the similar proteins better. Often, bioinformaticists find connections between proteins that would never have been found using previous techniques.
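
As a rough illustration of what searching for similar sequences can mean, the Python sketch below scores two sequences by how many short “words” of three letters they share, and then ranks a small database against a query. This is a deliberately simplified stand-in, not the method used by any particular search tool; the function names and the example sequences are invented.

```python
def kmers(sequence: str, k: int = 3) -> set[str]:
    """All overlapping words of length k found in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_score(a: str, b: str, k: int = 3) -> float:
    """Fraction of distinct words the two sequences share (0.0 to 1.0)."""
    words_a, words_b = kmers(a, k), kmers(b, k)
    if not words_a or not words_b:
        return 0.0  # a sequence shorter than k has no words to compare
    return len(words_a & words_b) / len(words_a | words_b)


def rank_matches(query: str, database: dict[str, str], k: int = 3) -> list[tuple[str, float]]:
    """Score every database entry against the query, best match first."""
    scores = [(name, shared_kmer_score(query, seq, k)) for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented protein-like sequences, written with one-letter amino acid codes.
    database = {
        "protein_1": "MKTAYIAKQRQISFVK",
        "protein_2": "GGSGGSGGSGGSGGSG",
        "protein_3": "MKTAYLAKQRGISFVR",
    }
    for name, score in rank_matches("MKTAYIAKQRQISFVK", database):
        print(name, round(score, 2))
```

A high score between two proteins with different known functions is the kind of hint that can point molecular biologists to a connection worth studying in the lab.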

The Human Genome Project deals with ethical, legal, and social issues as well as scientific issues. Knowledge of the genetic basis of disease could lead to ethical problems involving health care and insurance. New legal issues arose as companies applied for patents on DNA sequences. Some patents have already been issued, so courts will have to decide how to apply the law to natural sequences that may or may not contain genes. Worries about discrimination based on genetic makeup are being addressed by the project as well. These social fears appear in science fiction works, such as the movie GATTACA, where genetic testing determines everyone’s role in society.

In the end, backers of the Human Genome Project expected that knowing the mutations that cause genetic diseases would let them create novel drugs and therapies. However, as the project neared completion, the research community and biotech companies discovered that knowing the sequence of the gene, and hence the sequence of the protein, is not enough. To effectively design drugs and alternative therapies, they must know the precise structure and function of the proteins as well as their sequence. Some companies, like Celera, are making money by selling subscriptions to their sequence database, but at the same time, they are transitioning from DNA to protein studies. Publicly funded research is also shifting to protein studies. The Department of Energy began its Genomes to Life project, which focuses on determining protein function and understanding how proteins interact to form biochemical pathways that are like assembly lines in the cell. Finally, efforts are underway to determine the three-dimensional structure of all the new proteins. The techniques for doing so are not new, but now that the Human Genome Project nears completion, the techniques are being adapted for high-throughput approaches and automation. Studying the genome is referred to as “genomics” and the next step of studying the proteins is called “proteomics”. Will the dream that began with genomics and the Human Genome Project be fully realized by the recent surge in proteomics? Only time will tell.