We scored sequences with the model and compared those likelihoods against experimental tests of protein function. We found that if a base pair has high likelihood under Evo, it is likely to preserve or improve the protein’s function. But if it has low likelihood, putting it into a protein sequence will likely destroy function.
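The scoring idea above can be sketched with a toy autoregressive model. This is an illustrative stand-in, not Evo’s actual architecture or API: the bigram probabilities below are invented for the example, and a real genomic model conditions on a much longer context.

```python
import math

# Toy stand-in for an autoregressive DNA language model: invented
# probabilities P(next base | previous base). These numbers are made up
# purely to illustrate likelihood scoring, not taken from Evo.
BIGRAM_PROBS = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
}

def log_likelihood(seq: str) -> float:
    """Sum of log P(base_i | base_{i-1}) over the sequence."""
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        total += math.log(BIGRAM_PROBS[prev][cur])
    return total

def score_mutation(seq: str, pos: int, new_base: str) -> float:
    """Log-likelihood change from mutating position `pos` to `new_base`.
    A negative score means the model assigns the mutant lower
    likelihood, which would predict the change is deleterious."""
    mutant = seq[:pos] + new_base + seq[pos + 1:]
    return log_likelihood(mutant) - log_likelihood(seq)
```

Under this framing, a variant that drops the sequence’s likelihood sharply is predicted to destroy function, while a likelihood-neutral or likelihood-raising variant is predicted to be tolerated, which is the correlation the experiments tested.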
We also compared the model’s results to those of state-of-the-art protein language models. We found that Evo matched the performance of the protein models, despite never having seen a protein sequence. That was the first indication that, OK, maybe we were on to something.
We used it to generate DNA sequences, just as ChatGPT can generate text. One of my students, Brian Kang, helped me fine-tune the Evo model on DNA that coded for a protein as well as at least one RNA molecule; they link together to create a complex called CRISPR-Cas. CRISPR-Cas breaks DNA in specific spots, which helps bacteria defend against viruses. Scientists use these complexes for genome editing.
After training Evo on more than 70,000 natural DNA sequences for the CRISPR-Cas complex, we asked it to generate the complete system in the DNA code. For 11 of its suggestions, we ordered the DNA sequences from a company, used them to create the CRISPR-Cas complexes in the lab, and tested their function.
One of them worked. We consider that a very successful pilot. With typical protein design workflows, you’d be lucky to find one working protein for every 100 sequences tested.
It does as well as the state-of-the-art Cas system. If you squint a little bit, maybe it has slightly faster cleavage [cutting of the DNA strand].
This is a very complicated task. The Cas enzyme is too long for current protein language models to process. In addition, a protein model could not generate the RNA.
The model generated a million tokens freely from scratch — essentially, an entire bacterial genome. If you asked ChatGPT to generate a million tokens of text, at some point it would go off the rails. There would be some grammatical structure, but it would not produce Wuthering Heights.
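Mechanically, generating tokens “freely from scratch” is repeated sampling from the model’s next-token distribution. A minimal sketch of that loop, again using an invented one-base-of-context distribution rather than Evo’s real sampler:

```python
import random

# Invented next-base distribution conditioned only on the previous
# base; a real genomic language model conditions on thousands of
# preceding tokens. Values are illustrative, not from Evo.
NEXT_BASE = {
    "A": (("A", "C", "G", "T"), (0.4, 0.2, 0.2, 0.2)),
    "C": (("A", "C", "G", "T"), (0.1, 0.4, 0.4, 0.1)),
    "G": (("A", "C", "G", "T"), (0.1, 0.4, 0.4, 0.1)),
    "T": (("A", "C", "G", "T"), (0.2, 0.2, 0.2, 0.4)),
}

def generate(n_tokens: int, start: str = "A", seed: int = 0) -> str:
    """Autoregressive sampling: each new base is drawn conditioned on
    the sequence so far (here, just the previous base)."""
    rng = random.Random(seed)
    seq = [start]
    for _ in range(n_tokens - 1):
        bases, probs = NEXT_BASE[seq[-1]]
        seq.append(rng.choices(bases, weights=probs)[0])
    return "".join(seq)
```

With such a short context, errors never get corrected: nothing in the loop looks back to revise what was already emitted, which is why long free-running generations drift, as described above.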
Evo’s genome also had structure. It had a similar density of genes to natural genomes, and proteins that folded like natural proteins. But it fell short of something that could drive an organism because it lacked many genes that we know to be critical to an organism’s survival. To generate a coherent genome, the model needs the ability to edit its product — to correct errors, just as a human writer would do for a longer passage of text.
It’s only the beginning. Evo is trained only on genomes from the simplest organisms, prokaryotes. We want to expand it to eukaryotes — organisms such as animals, plants and fungi whose cells have a nucleus. Their genomes are much more complicated.
Evo also only reads the language of DNA, and DNA is only part of what determines the characteristics of an organism, its phenotype. The environment also plays a role. So, in addition to having a good model of genotype, we would like to build a really good model of the environment and its connection to phenotype.
With ChatGPT, you want it to get the facts right, so hallucinations are a problem. In biology, hallucinations can almost be a feature and not a bug. If some crazy new sequence works in the cell, then biologists think it’s novel.
But Evo does make mistakes. It may, for example, predict a protein structure from a sequence that turns out to be wrong when we make the protein in the lab. Still, a human would be almost completely worthless on a task like this. No human could write, from scratch, a DNA sequence that would fold into a CRISPR-Cas complex.
We are going to push the boundaries of biological design way beyond individual protein molecules to more complex systems involving many proteins, or to proteins bound to RNA or DNA. That’s the message of the Evo paper. We might engineer a synthetic pathway that produces a small-molecule drug with therapeutic value or that degrades discarded plastic or oil from spills.
I also expect the models to aid biological discovery. When you sequence a new organism from nature, you just get DNA. It’s very hard to identify what parts of the genome correspond to different functions. If the models can learn the concept of, say, a phage defense system or a biosynthetic pathway, they will help us annotate and discover new biological systems in sequencing data. The algorithm is fluent in the language, whereas humans are very much not.
If the model were used to design viruses, maybe those viruses could be used for nefarious purposes. We should have some way of ensuring that these models are used for good. But the level of biotechnology is already sufficient to create dangerous things. What biotechnology can’t do yet is protect us from dangerous things.
Nature is creating deadly viruses all the time. I think that if we raise our level of technological capability, it will have a larger impact on our ability to defend ourselves against biological threats than it does on creating new ones.