An explosive growth in knowledge and steep drop in computational costs speed Bucknell researchers' advances
In the 270 years between the first printing of Carl Linnaeus' Systema Naturae, the ur-text of biological taxonomy, and the 2005 publication of the third edition of Mammal Species of the World, the current authoritative text on the classification of mammals, biologists had catalogued some 5,416 members of class Mammalia. Bucknell biologist DeeAnn Reeder, the co-editor of Mammal Species of the World, is now compiling a new edition, which she estimates will include more than 7,500 distinct mammal species — a jump of nearly 40 percent in a little over a decade.
This same explosive growth in knowledge is happening throughout biology, a field that appears to have entered a new age of exploration. This time, however, researchers aren't, like Darwin in the Galápagos, traveling far afield in search of new life, but rather are peering in more closely than ever before, thanks to the increased availability of genomic data.
Though initially driven by medical research, genomics — the study of the structure, evolution and mapping of genes — has allowed biologists such as Reeder to frame their studies in ways that were previously impossible. Professor Ken Field, one of the first Bucknell biologists to incorporate genomics in his research, began using genetic sequencing to examine questions surrounding white-nose syndrome, the fungal infection that has devastated populations of many species of bats around the United States. It has advanced his quest to learn why some bats are more susceptible than others.
Although genomic sequencing has enhanced biologists' ability to find answers quickly and inexpensively, it comes with challenges, not the least of which is how to interpret the immense load of data it generates.
"What happens when you have hundreds of millions of pieces of data — you can't graph it the same way as if you had a dozen. You'll break Excel if nothing else," Field says. "What big data means is that you have too much data to analyze or to visualize using ordinary tools."
This ever-rising growth curve in data collection has been enabled by two factors: the development in the mid-2000s of so-called "next-generation" sequencing technologies, which can scan genomic data much more quickly than older methods, and an equally dramatic decline in the cost of DNA sequencing. One of science's first and most important forays into whole-genome sequencing, the Human Genome Project, which sought to sequence the roughly 3 billion nucleotide pairs in a model human genome, concluded in 2003 at a cost of approximately $2.7 billion. In 2015, the project's sponsor, the National Institutes of Health, estimated that sequencing the same amount of data typically costs between $1,500 and $4,000, with much of that reduction coming in just the last five years.
Reeder's own experience echoes that trend: Just five years ago, she sought grant funding for a small-scale genomic study of 10 animals, requesting a budget of $400,000. Today, she says she could do the same study for $5,000 or less.
Cost Reductions Speed Advances
The dramatic drop in price comes not only from technological leaps but also from old-fashioned market forces. Around the country, large universities and research hospitals have built labs dedicated to genomic sequencing, and the increased speed of next-gen sequencing has allowed those facilities to do contract work for researchers at other institutions, with competition driving prices ever lower. While Bucknell doesn't have such a facility, its professors are able to access the same technology at the same rates as institutions that do. Students reap the benefits as well, through close collaboration with their instructors on research projects.
The impact of the price drop has been felt throughout biology, including at Bucknell, where many professors have rushed to incorporate genomics in their work. It has enhanced scientific understanding of everything from the tallest redwood down to the smallest virus. Virologist and Biology Professor Marie Pizzorno says that until recently, her field was interested only in the viruses that make humans and other animals sick. The advent of genomics has shed light on other viruses all around us, including some embedded in our DNA.
"Some of these viruses are ancient," Pizzorno says. "We picked them up millions of years ago when we were still evolving as humans. They left a little footprint of DNA, so we can tell where they inserted themselves into our genome. They are the remnants of retroviruses, which are distantly related to HIV."
The challenge of processing so much data is at the heart of another field that has arisen alongside genomics: bioinformatics, which deals with the analysis of the complex data biologists now have the ability to collect. Brian King, a computer science professor, has consulted at Geisinger Health System on several big-data projects, including the MyCode Community Health Initiative, a large-scale effort to correlate patient genomic data with electronic health records to improve individualized, proactive health care. He says that "the amount of genomic data available today is daunting. Identifying statistically significant, medically relevant variants in people is akin to finding a needle in a very large haystack."
The White Whale of Genetics
One method of doing so, which King has used in his own work on genetic data, is to break a strand of DNA into segments of a fixed number of nucleotide base pairs, like words in a book, and search for patterns of repetition. Just as a literary scholar might extract meaning by examining each time Melville mentions the white whale in Moby Dick, biology researchers can learn something by examining repeated sequences in the code and the genetic information that surrounds them.
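This word-counting approach can be sketched in a few lines of code. The snippet below is an illustrative sketch, not drawn from King's actual work: it tallies every overlapping fixed-length segment — what bioinformaticians call a k-mer — in a short, made-up DNA string, exactly the way one might tally words in a book.

```python
from collections import Counter

def count_kmers(dna, k):
    """Count every overlapping length-k subsequence ("k-mer") in a DNA string,
    the genomic analogue of counting each word in a book."""
    return Counter(dna[i:i + k] for i in range(len(dna) - k + 1))

# A hypothetical fragment; real genomes run to millions or billions of bases.
sequence = "ATGCGATGCGATTA"
counts = count_kmers(sequence, k=4)
print(counts.most_common(3))  # the most frequently repeated 4-letter "words"
```

On real data sets the idea is the same, but the counting must be distributed across many machines — which is precisely the shift from ordinary tools to big-data tools that Field describes.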
Examining longer sequences, however, requires computing power beyond what the average PC or MacBook can handle. Biology's Steve Jordan, for example, is preparing a study of Hawaiian damselflies that will examine 2 million base pairs for each of 200 individual insects, a data set he estimates will be larger than 80 gigabytes. Jordan and others have turned to the computing resources within the College of Engineering to help, but as more members of the biology department incorporate genomics in their work, the demand for computing power will only increase.
Jordan also contemplates an even more basic question: "Given the ever-increasing amount of data being generated, how do we help Bucknell students develop the skills they will need to use it?"
"We need to teach our students bioinformatics," Jordan continues. "My own daughters are biology majors in college right now, and I've suggested that they study computer science and learn to write and run scripts that can prepare for and manage analyses. It's a skill that is well within the ability of our students to learn if they choose to."
It's a need many students have recognized, says King, noting that enrollment in Computer Science 203, an introductory course, has more than doubled in the last decade.
Field, who is currently on a yearlong sabbatical, is developing such a course, Advanced Data Analysis and Bioinformatics. The course will teach biology and cell biology/biochemistry majors how to work with big data.
"An essential aspect of the course is going to be applying statistics to biology," Field says. "Then we're going to make the jump to using the same tools, but working with big data."
On the Prowl for Phages
One course that already allows Bucknell students to engage directly with genomic techniques is the Phage Hunters class co-taught by Pizzorno and Professor Emily Stowe, biology. Since 2011, Bucknell has been involved in a national program supported and administered by the Howard Hughes Medical Institute (HHMI) and based on research led by Graham Hatfull, a University of Pittsburgh professor who studies bacteriophages — viruses that infect bacteria. In Phage Hunters, a two-semester course, students isolate and analyze phages for the HHMI program.
During the first half of Pizzorno and Stowe's class, students collect soil samples on and around campus, then "hunt" for bacteriophages in those samples by adding a host bacterium to the mix. "You look for these things called plaques, which are basically holes where the virus has landed and killed the bacteria," Pizzorno explains. "Then you pick that out and you purify it so you have a pure population of the phage."
The isolated samples are packed off to Pittsburgh for sequencing — next-generation techniques have made the process fast enough that last year's class was able to have five phages sequenced — then returned to Bucknell for the students to work with. Sequencing produces broken fragments of code, which a computer reassembles to form a complete genome. The students then compare the new phage DNA sequence to the known DNA and protein sequences stored in GenBank, a genetic sequence databank operated by the National Institutes of Health for use in scientific research.
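The reassembly step works by finding fragments whose ends overlap and stitching them together. The toy sketch below shows the core idea with invented sequences; real assemblers must also cope with sequencing errors, repeated regions and enormous numbers of reads.

```python
def overlap(a, b):
    """Length of the longest suffix of a that matches a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap
    until a single reconstructed sequence remains."""
    frags = list(fragments)
    while len(frags) > 1:
        n, i, j = max((overlap(a, b), i, j)
                      for i, a in enumerate(frags)
                      for j, b in enumerate(frags) if i != j)
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags[0]

# Hypothetical sequencer reads drawn from the toy "genome" ATGCGTACGTT
reads = ["ATGCGTA", "GTACGTT", "CGTACG"]
print(greedy_assemble(reads))  # reconstructs ATGCGTACGTT
```

The comparison step that follows — matching the assembled phage genome against known sequences — is typically done with search tools such as BLAST against databases like GenBank.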
In a typical year, the class helps classify several previously unknown viruses, which it adds to the NIH database. It's an outcome that feels empowering for students such as Alyssa Benjamin '17, who took the course her sophomore year and represented her class at an annual conference hosted by HHMI.
"We were all excited by the possibility [of discovering] something that nobody had seen before," Benjamin says. "This excitement was contagious, and I looked forward to getting behind the lab bench every Tuesday and Thursday."
Bucknell biologists see the complementary fields of genomics and bioinformatics becoming increasingly critical for student education, so much so that the department now is seeking to hire a new tenure-track faculty member specializing in genomics research to replace a retiring professor.
"This is one place where a single technique is probably going to impact every area of biology from medicine to ecology to everything else," Pizzorno says. "It will be exciting to see what new ways of studying biology this new professor will teach our students."