A Catalog of All Life on Earth

The Earth BioGenome Project will revolutionize biology.

Some people dream of exploring other planets or the ocean depths. Some want to understand how the universe began or the fundamental forces inside an atom. 

Harris Lewin has a different goal: to read and understand the genetic code of every plant, animal, fungus and single-celled organism with a nucleus on Earth. 

Lewin, distinguished professor of evolution and ecology and former vice chancellor for research at UC Davis, chairs the executive working group of the Earth BioGenome Project. With its secretariat at the UC Davis Genome Center, the project’s goal is to sequence, catalog and characterize the DNA of all of Earth’s 1.8 million named eukaryotic species within 10 years. 

The work is also urgent: While perhaps only 10 percent of species on our planet have been identified, many scientists fear we are entering a new mass extinction, as human exploitation of land, oceans and resources, introduced species and climate change upend ecosystems. As species disappear, we lose the opportunity to learn more about the systems that sustain life, as well as possibilities of new crops or medicines.

A eukaryote is an organism whose cells hold most of their genetic material in a separate compartment, the nucleus. Eukaryotes include humans and all other animals, from a sea sponge to a blue whale, as well as all plants, fungi and yeasts, and single-celled organisms such as an amoeba or a malaria parasite.

Lewin likens the impact of this biology moonshot to that of the Hubble or James Webb space telescopes on astronomy. It will revolutionize our view of biology, empower conservation and biodiversity efforts, and create human benefits from new medicines to improved crop yields. 

“This is basic infrastructure for the future of biology,” Lewin said. 

Mapping the tree of life

In 1753, Swedish botanist Carl Linnaeus published Species Plantarum, a book describing thousands of plants each with its own two-part Latin name, genus and species. A few years later he published a similar catalog of animals. This binomial naming system has been used by scientists ever since, enabling us to organize animals and plants into related groups based on morphological similarity. It is the bedrock of any understanding of biology. 

Just over a hundred years later, Charles Darwin laid out the foundations of evolutionary biology in On the Origin of Species. The Tree of Life was no longer a static collection of species: It constantly puts out new shoots and branches, while others fail and die off. But all trace back through heredity to a common root.

By the middle of the 20th century, scientists were beginning to understand how traits and characteristics are inherited from one generation to the next, written in chemical code in the double helix of DNA.

The structure of DNA includes four bases: adenine and thymine, cytosine and guanine, or A, T, C and G, strung along a backbone of sugars and phosphates. The order of these bases can be read by the cell to form messenger RNAs, which are then translated to form proteins. These proteins are the basic machinery for all cells, acting as structural components or as enzymes, which catalyze biochemical activities such as cell metabolism.
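The path from DNA to messenger RNA to protein can be sketched in a few lines of Python. This is a toy illustration, not a bioinformatics tool: the function names `transcribe` and `translate` and the example sequence are made up for this sketch, and the codon table holds only the five codons the example needs (their assignments follow the standard genetic code).

```python
# Partial codon table from the standard genetic code.
# Assumption: only the codons used in the example below are listed.
CODON_TABLE = {
    "AUG": "Met",   # methionine, also the usual start signal
    "GCU": "Ala",   # alanine
    "UGG": "Trp",   # tryptophan
    "AAA": "Lys",   # lysine
    "UAA": "Stop",  # stop signal
}

def transcribe(coding_strand):
    """Return the messenger RNA for a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """Read the mRNA three bases at a time until a stop codon appears."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein

dna = "ATGGCTTGGAAATAA"      # hypothetical 15-base sequence
mrna = transcribe(dna)       # "AUGGCUUGGAAAUAA"
protein = translate(mrna)    # ["Met", "Ala", "Trp", "Lys"]
```

Each group of three bases selects one amino acid, so this 15-base stretch yields a chain of four amino acids followed by a stop signal.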

British biochemist Frederick Sanger developed a method to read the sequence of DNA and in 1977 published the first complete DNA sequence of a virus called phi X174, with just over 5,000 bases. The first bacterium was sequenced in 1995; Saccharomyces cerevisiae, or brewer’s yeast, in 1996; the laboratory fruit fly Drosophila melanogaster in 2000; and the laboratory mouse in 2002. Lewin’s lab, then at the University of Illinois, played a major role in sequencing the first livestock species, domestic cattle, in 2009. 

Arabidopsis thaliana was the first flowering plant to have its complete genome sequenced, in 2000. Linnaeus would have recognized it; he included it in Species Plantarum as Arabis thaliana. Small and weedy, Arabidopsis has become an important organism for plant biologists. 

In 1990, the U.S. government had launched the Human Genome Project with a goal of identifying all 20,000 to 25,000 genes in humans and sequencing the approximately 3 billion letters of human DNA. The project announced a “first draft” of the human genome in 2001, with a more complete version in 2003 when the project concluded. A complete, end-to-end human genome sequence was published just last year. 

Genome 10K

The forerunner of the Earth BioGenome Project is Genome 10K, founded at UC Santa Cruz in 2009 by David Haussler, Stephen O’Brien and Oliver Ryder. 

“G10K was really the first project with the goal to sequence thousands of species with large genomes, before we even had the technology to do that,” Lewin said. At the time, Lewin was a professor at the University of Illinois, working on cattle genetics and comparative genomics, and was a founding member of the project.

“I was one of the people who was using genomes at scale to answer biological questions, and also was generating high-quality genomes,” Lewin said. 

Lewin is interested in questions of evolution and chromosome structure. The problems he wanted to address required high-quality genomes. 

“The great limitation in mammalian comparative genomics was having good genomes,” he said.

The human genome sequence completed in 2003 cost $3 billion; the sequence of domestic cattle, published in 2009, cost $50 million. The cost of DNA sequencing was falling rapidly, but the number of organisms sequenced remained low, a tiny fraction of Earth’s biodiversity. By the mid-2010s, only about 15,000 species had been even partially sequenced, mostly bacteria, lab or domestic animals and crop plants.

“I was flying back from Europe, and I thought, we’re starting to get all these genomes, how much would it cost to get all reference-quality genomes for all the mammals? So I did the calculation and said, ‘wow, we could do a lot of genomes,’” Lewin said.  

Lewin’s mid-Atlantic calculation showed that it would be possible to sequence not just the known mammals, but all known eukaryotes, at least in draft form, for the cost of the original Human Genome Project. 

Lewin started talking to a University of Illinois colleague, Gene Robinson, a prominent entomologist who was interested in bee genomes. Robinson was also in conversation with ecologist John Kress, then undersecretary for science at the Smithsonian Institution, who was thinking about sequencing plant genomes. Those conversations led to a 2015 biogenomics workshop at the Smithsonian in Washington, D.C., to discuss the feasibility of sequencing all eukaryotes.

“We were asking, can we do this, should we do it, is this the right time — everybody was very excited about it,” Lewin said. Over the next two years, the planning group expanded to an international group of experts in genomics, drawing on the expertise acquired by the G10K Consortium. 

In January 2018, the Earth BioGenome Project was announced at the World Economic Forum in Davos, Switzerland, followed by a manifesto published in Proceedings of the National Academy of Sciences in April that year. A pilot phase was launched in November 2020, and sequencing began in earnest in January 2022. 

A network of networks

The Earth BioGenome Project has grown into a globe-spanning effort that relies on breakthroughs in technology and at the same time brings dozens of smaller efforts under one umbrella. Some are national or regional efforts, such as the Darwin Tree of Life project for the British Isles and similar projects in Norway and Catalonia; others focus on specific groups of animals, such as birds (B10K), bees (the Beenome100 project), crabs or fish.

While building on smaller, existing projects, the EBP and its forerunners, G10K and the Vertebrate Genomes Project, have created momentum for other big biodiversity genomics projects, Lewin said. In the past year, the African BioGenome Project was launched with a goal of sequencing 105,000 species endemic to the continent. The European Union, in collaboration with the British and Swiss governments, launched Biodiversity Genomics Europe, which aims to use DNA data to characterize and conserve European species.

One of the main challenges for these efforts is getting high-quality specimens. 

“The sequencing technology is there; the bottleneck is in getting high-quality samples,” Lewin said. 

If sequence data is going to be useful, it has to come with information about the specimen, such as where and when it was collected, what stage of its life cycle it was in and what it looks like, along with thorough documentation that it represents a known species. This metadata is the key to getting useful knowledge from sequence data, Lewin said. The project is aligning with organizations that collect specimens from the wild, such as natural history museums.

So far, no comprehensive national biodiversity genomics effort exists in the U.S. Several U.S.-based efforts are affiliated with the Earth BioGenome Project, including the California Conservation Genomics Project, funded by the state of California and including all 10 UC campuses as well as scientists from state and federal agencies. Genome sequencing for that project is being done at the UC Davis Genome Center. Efforts supported by federal agencies include the Ag100 Pest project (U.S. Department of Agriculture) and the Open Green Genomes project of the U.S. Department of Energy’s Joint Genome Institute, working on land plants.

In September 2022, the White House issued an executive order on the bioeconomy, including establishing a Bioeconomy Data Initiative to ensure that high-quality biological datasets are secure and widely available. In addition, the 2022 CHIPS and Science Act has programs that will support the national collections and prevent biodiversity loss, so the outlook for a broader national program is very promising, Lewin said.

Sharing the wealth

The data from the Earth BioGenome Project will be publicly accessible, just as the data from the Human Genome Project are available to any scientist who wants to access and make use of them. Open access to the human genome sequence has been an overwhelming success: A 2011 study found that for every dollar the U.S. government invested in sequencing the human genome, $141 came back to the U.S. economy. Related industries generate billions in economic activity a year. 

The Earth BioGenome Project will gather data from all over the world, often from places that are poor, remote or disadvantaged. How to ensure that any benefits flow back to those places? 

An international agreement known as the Nagoya Protocol, part of the 1992 Convention on Biological Diversity, calls for the benefits of genetic resources to be shared in a fair and equitable way. The Nagoya Protocol puts forward the idea of a multilateral system for sharing DNA sequence data in digital form, but no such system exists yet, and countries are setting up their own rules.

“The mechanism isn’t there yet,” Lewin said. “We really need a multilateral agreement.” 

Brazil, which has set up a national registry for its rich genetic resources, could be a good model for other nations to follow, Lewin said.

“We’re trying to do this as the project is developing, not as an afterthought,” Lewin said. “There will be no benefits without access to the data, and there will be no access without assurance of shared benefits.” 

Another benefit of these projects is in developing capacity and expertise in genomics and biological sciences in developing countries. For example, the African BioGenome Project specifically aims to build these capabilities on the continent alongside its scientific goals, according to project founder ThankGod Ebenezer. 

Phase 1 launched

The 10-year clock started running a year ago, with the launch of phase 1. Lewin expects projects affiliated with the EBP to release as many as 2,000 genomes this year. The project will have to complete up to nine new genomes a day for the three years of phase 1 and scale to 125 a day for four years in phase 2 to reach its goals. Phase 3, sequencing of all remaining species, will require another 10-fold increase in throughput.  
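As a rough check on those throughput figures, the arithmetic can be sketched in Python. The phase 1 and phase 2 rates and durations come from the paragraph above; the phase 3 rate of 1,250 a day (10 times the phase 2 rate) and its three-year duration (the remainder of the 10-year plan) are assumptions made for this sketch, as are the 365-day year and all variable names.

```python
DAYS_PER_YEAR = 365  # assumption: sequencing runs every day of the year

# (phase, genomes per day, duration in years)
phases = [
    ("phase 1", 9, 3),      # "up to nine new genomes a day" for three years
    ("phase 2", 125, 4),    # 125 a day for four years
    ("phase 3", 1250, 3),   # assumption: 10x phase 2, remaining 3 of 10 years
]

def genomes_in_phase(per_day, years):
    """Total genomes completed at a constant daily rate."""
    return per_day * DAYS_PER_YEAR * years

totals = {name: genomes_in_phase(rate, years) for name, rate, years in phases}
grand_total = sum(totals.values())

for name, count in totals.items():
    print(f"{name}: {count:,} genomes")
print(f"all phases: {grand_total:,} genomes")
```

Under these assumptions, phase 1 yields about 9,900 genomes, phase 2 about 182,500 and phase 3 about 1.37 million, for a total near 1.56 million. That is the same order of magnitude as the 1.8 million named species the project targets; the quoted rates are rounded, so this is only a ballpark comparison.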

The challenges — scientific, logistical, ethical, legal and social — remain formidable. But just as a hazy image resolves into a distant galaxy in more and more powerful telescopes, it’s possible to see where the project is going and what it could reveal.  

“It started as a crazy idea, and then it wasn’t so crazy,” he said.