We scored sequences with the model and compared those likelihoods against experimental tests of protein function. We found that if a base pair has high likelihood under Evo, it is likely to preserve or improve the protein’s function. But if it has low likelihood, putting it into a protein sequence will likely destroy function.
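The scoring idea above can be sketched with a toy autoregressive model. This is an illustrative stand-in, not Evo’s actual architecture or API: the bigram probabilities below are invented for the example, and a real genomic model conditions on a much longer context.

```python
import math

# Toy stand-in for an autoregressive DNA language model: invented
# probabilities P(next base | previous base). These numbers are made up
# purely to illustrate likelihood scoring, not taken from Evo.
BIGRAM_PROBS = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
}

def log_likelihood(seq: str) -> float:
    """Sum of log P(base_i | base_{i-1}) over the sequence."""
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        total += math.log(BIGRAM_PROBS[prev][cur])
    return total

def score_mutation(seq: str, pos: int, new_base: str) -> float:
    """Log-likelihood change from mutating position `pos` to `new_base`.
    A negative score means the model assigns the mutant lower
    likelihood, which would predict the change is deleterious."""
    mutant = seq[:pos] + new_base + seq[pos + 1:]
    return log_likelihood(mutant) - log_likelihood(seq)
```

Under this framing, a variant that drops the sequence’s likelihood sharply is predicted to destroy function, while a likelihood-neutral or likelihood-raising variant is predicted to be tolerated, which is the correlation the experiments tested.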
We also compared the model’s results to those of state-of-the-art protein language models. We found that Evo matched the performance of the protein models, despite never having seen a protein sequence. That was the first indication that, OK, maybe we were on to something.
We used it to generate DNA sequences, just as ChatGPT can generate text. One of my students, Brian Kang, helped me fine-tune the Evo model on DNA that coded for a protein as well as at least one RNA molecule; they link together to create a complex called CRISPR-Cas. CRISPR-Cas breaks DNA in specific spots, which helps bacteria defend against viruses. Scientists use these complexes for genome editing.
After training Evo on more than 70,000 natural DNA sequences for the CRISPR-Cas complex, we asked it to generate the complete system in the DNA code. For 11 of its suggestions, we ordered the DNA sequences from a company, used them to create the CRISPR-Cas complexes in the lab, and tested their function.
One of them worked. We consider that a very successful pilot. With typical protein design workflows, you’d be lucky to find one working protein for every 100 sequences tested.
It does as well as the state-of-the-art Cas system. If you squint a little bit, maybe it has slightly faster cleavage [cutting of the DNA strand].
This is a very complicated task. The Cas enzyme is too long for current protein language models to process. In addition, a protein model could not generate the RNA.
The model generated a million tokens freely from scratch — essentially, an entire bacterial genome. If you asked ChatGPT to generate a million tokens of text, at some point it would go off the rails. There would be some grammatical structure, but it would not produce Wuthering Heights.
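Mechanically, generating tokens “freely from scratch” is repeated sampling from the model’s next-token distribution. A minimal sketch of that loop, again using an invented one-base-of-context distribution rather than Evo’s real sampler:

```python
import random

# Invented next-base distribution conditioned only on the previous
# base; a real genomic language model conditions on thousands of
# preceding tokens. Values are illustrative, not from Evo.
NEXT_BASE = {
    "A": (("A", "C", "G", "T"), (0.4, 0.2, 0.2, 0.2)),
    "C": (("A", "C", "G", "T"), (0.1, 0.4, 0.4, 0.1)),
    "G": (("A", "C", "G", "T"), (0.1, 0.4, 0.4, 0.1)),
    "T": (("A", "C", "G", "T"), (0.2, 0.2, 0.2, 0.4)),
}

def generate(n_tokens: int, start: str = "A", seed: int = 0) -> str:
    """Autoregressive sampling: each new base is drawn conditioned on
    the sequence so far (here, just the previous base)."""
    rng = random.Random(seed)
    seq = [start]
    for _ in range(n_tokens - 1):
        bases, probs = NEXT_BASE[seq[-1]]
        seq.append(rng.choices(bases, weights=probs)[0])
    return "".join(seq)
```

With such a short context, errors never get corrected: nothing in the loop looks back to revise what was already emitted, which is why long free-running generations drift, as described above.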
Evo’s genome also had structure. It had a similar density of genes to natural genomes, and proteins that folded like natural proteins. But it fell short of something that could drive an organism because it lacked many genes that we know to be critical to an organism’s survival. To generate a coherent genome, the model needs the ability to edit its product — to correct errors, just as a human writer would do for a longer passage of text.
It’s only the beginning. Evo is trained only on genomes from the simplest organisms, prokaryotes. We want to expand it to eukaryotes — organisms such as animals, plants and fungi whose cells have a nucleus. Their genomes are much more complicated.
Evo also only reads the language of DNA, and DNA is only part of what determines the characteristics of an organism, its phenotype. The environment also plays a role. So, in addition to having a good model of genotype, we would like to build a really good model of the environment and its connection to phenotype.
With ChatGPT, you want it to get the facts right, so hallucinations are a problem. In biology, hallucinations can almost be a feature and not a bug. If some crazy new sequence works in the cell, then biologists think it’s novel.
But Evo does make mistakes. It may, for example, predict a protein structure from a sequence that turns out to be wrong when we make the protein in the lab. Still, a human would be almost completely worthless on a task like this. No human could write, from scratch, a DNA sequence that would fold into a CRISPR-Cas complex.
We are going to push the boundaries of biological design way beyond individual protein molecules to more complex systems involving many proteins, or to proteins bound to RNA or DNA. That’s the message of the Evo paper. We might engineer a synthetic pathway that produces a small-molecule drug with therapeutic value or that degrades discarded plastic or oil from spills.
I also expect the models to aid biological discovery. When you sequence a new organism from nature, you just get DNA. It’s very hard to identify what parts of the genome correspond to different functions. If the models can learn the concept of, say, a phage defense system or a biosynthetic pathway, they will help us annotate and discover new biological systems in sequencing data. The algorithm is fluent in the language, whereas humans are very much not.
If the model were used to design viruses, maybe those viruses could be used for nefarious purposes. We should have some way of ensuring that these models are used for good. But the level of biotechnology is already sufficient to create dangerous things. What biotechnology can’t do yet is protect us from dangerous things.
Nature is creating deadly viruses all the time. I think that if we raise our level of technological capability, it will have a larger impact on our ability to defend ourselves against biological threats than it does on creating new ones.