The results of a gigantic biology project called ENCODE were released today. The project represents ten years of effort by more than 400 scientists and has culminated in 30 scientific papers.
The estimated price of the project, mostly funded by the National Institutes of Health, is $185 million. It's produced hundreds of terabytes of raw data.
The point of funding all this work? To make sense of the jumbled mess of chemicals that makes us human, known as the human genome. Our genome is the sequence of letters of a chemical called DNA that lives in all of our cells. This DNA is what makes a human a human and not a fish: each species has a different string of DNA that makes it unique.
In 2001 a group of scientists published the DNA code of a person: the first fully sequenced human genome, with every letter of the string figured out. The problem? We had no idea what it meant. Researchers have spent the last decade trying to figure it out.
What's especially confusing about our DNA is that only a tiny fraction (somewhere around 1.5 percent) actually holds instructions for the proteins that make up our cells and perform their functions. So researchers set out to discover what the rest was doing.
A huge study
They did so by studying 147 different kinds of human cells (of the few thousand present in the human body) in the lab to see what parts of the genome were active in each. Because we use the same set of DNA instructions to do all the different things our bodies do every day, one part of the genome is likely to be active in one cell type, but not in others.
"Almost every nucleotide is associated with a function of some sort or another, and we now know where they are, what binds to them, what their associations are, and more," study researcher Tom Gingeras told Ed Yong of Not Exactly Rocket Science in his comprehensive blog post on the new data (I definitely recommend it if you are interested in learning more).
Here's how Yong describes the new data:
Think of the human genome as a city. The basic layout, tallest buildings and most famous sights are visible from a distance. That’s where we got to in 2001. Now, we’ve zoomed in. We can see the players that make the city tick: the cleaners and security guards who maintain the buildings, the sewers and power lines connecting distant parts, the police and politicians who oversee the rest. That’s where we are now: a comprehensive 3-D portrait of a dynamic, changing entity, rather than a static, 2-D map.
Endless findings
Their data set is so gigantic it's mind-boggling, so here are some highlights of what they've found in it so far:
- Though most of the genome doesn't actually hold the blueprint of a protein, about 80 percent of it has a function of some kind. The areas of the genome that are active are different for every type of cell — and are what make that cell type unique.
- Many of these areas, previously written off as "junk," actually influence the genes that code for proteins: they turn them on or off, or control when they are copied.
- Some other areas are copied into RNA, DNA's sister chemical, and act in the cell as RNA molecules instead of being turned into proteins.
- Other areas act as handles that other proteins use to fold or unfold the genome from its compact shape when it's being stored or copied.
- There's still 20 percent of a human's DNA that we haven't figured out. This is likely because even the ENCODE project, as big as it is, has analyzed only a fraction of human cell types.
- The "genes" we thought we knew have been obliterated. The researchers found that they are all butting up against each other, sharing sequences, and what we used to think of as a gene can be made into many different things and used in different ways by different cells.
- All of this information could redefine our idea of a gene. It's no longer a single stretch of DNA that codes for a protein. Instead of thinking of a stretch of DNA as the single unit, we should think of the final RNA transcript that's turned into protein as the essential "unit" of our genome.
- Many sites in the genome that have been linked to the genetic basis of a disease are actually in parts of the genome that aren't made into proteins. One of the ENCODE studies, to be published tomorrow in the journal Science, set out to link the data from thousands of disease studies with what the researchers saw happening in these 147 different cell types. Many of the elements found in those studies are in the "controlling" areas of the genome, not the genes themselves.
(Some good, more in-depth articles about the findings: Not Exactly Rocket Science and Brendan Maher's lengthy analysis at Nature. All of the ENCODE data is freely available to the public at a portal site and at Nature.com, for those who are scientifically inclined.)
While this is an amazing feat, the project is only in its infancy. Researchers will be combing through the data from the ENCODE project for a century, making new connections and gaining new understanding. The project is just headed into its third stage, and much more will come from this gigantic data set, Maher's article says.