Proteins are essential to cellular life: they perform complex tasks and catalyze chemical reactions. Scientists and engineers have long sought to harness this power by designing artificial proteins that can perform new tasks, but existing methods for making such proteins are slow and complex. In a breakthrough that could have implications for the health, agriculture and energy sectors, a team of scientists has developed an AI-driven process that uses big data to design new proteins. The research is published in the journal Science.

By developing machine learning models that can analyze protein information gathered from genome databases, the scientists found relatively simple design rules for building artificial proteins. When the team constructed these artificial proteins in the laboratory, they found that the proteins rivaled those found in nature.

We’ve all wondered how a simple process like evolution could lead to a material as high-performing as a protein. We found that genome data contains a wealth of information about the basic rules of protein structure and function, and now we have been able to use nature’s rules to create proteins ourselves.

Rama Ranganathan, Professor at the Department of Biochemistry and Molecular Biology, Pritzker Molecular Engineering

Proteins are composed of hundreds or thousands of amino acids, and the sequence of those amino acids determines a protein’s structure and function. But working out how to design sequences that yield new proteins has not been easy. Past work has produced methods that can specify structure, but function has been more elusive.

Over the past 15 years, Ranganathan and his collaborators have realized that the exponentially growing genome databases contain a wealth of information about the basic rules of protein structure and function. His group developed mathematical models based on this data and then began using machine learning techniques to uncover new information about the basic rules for protein design.

For this study, the team examined the chorismate mutase family of metabolic enzymes, proteins that are essential to the life of many bacteria, fungi and plants. Using machine learning models, the researchers were able to identify simple rules for designing these proteins.

The model shows that two ingredients alone, conservation at individual amino acid positions and correlations in the evolution of pairs of amino acids, are sufficient to predict new artificial sequences that will have the properties of the protein family.
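To make these two ingredients concrete, here is a minimal Python sketch of how conservation and pairwise correlation can be measured from a set of aligned sequences. It is an illustration only, not the study's model: the four-letter toy alignment and the function names are hypothetical, and the article does not describe how the researchers computed these quantities.

```python
from collections import Counter
from itertools import combinations

# Toy alignment: hypothetical sequences, NOT real chorismate mutase data.
alignment = [
    "AKLE",
    "AKLE",
    "ARVE",
    "ARVE",
]

def site_frequencies(seqs, i):
    """Fraction of sequences carrying each amino acid at column i.

    A column dominated by one amino acid is highly conserved.
    """
    counts = Counter(s[i] for s in seqs)
    return {aa: n / len(seqs) for aa, n in counts.items()}

def pair_frequencies(seqs, i, j):
    """Fraction of sequences carrying each amino acid pair at columns i, j."""
    counts = Counter((s[i], s[j]) for s in seqs)
    return {pair: n / len(seqs) for pair, n in counts.items()}

def covariance(seqs, i, j):
    """C_ij(a, b) = f_ij(a, b) - f_i(a) * f_j(b), for observed pairs only.

    A value of zero means the pair occurs as often as independent
    columns would predict; a nonzero value signals co-variation.
    """
    fi = site_frequencies(seqs, i)
    fj = site_frequencies(seqs, j)
    fij = pair_frequencies(seqs, i, j)
    return {(a, b): f - fi[a] * fj[b] for (a, b), f in fij.items()}

def most_coupled_pair(seqs):
    """Column pair with the largest summed absolute covariance."""
    def strength(ij):
        return sum(abs(c) for c in covariance(seqs, *ij).values())
    return max(combinations(range(len(seqs[0])), 2), key=strength)

print(site_frequencies(alignment, 0))  # column 0 is fully conserved
print(covariance(alignment, 1, 2))     # columns 1 and 2 vary together
print(most_coupled_pair(alignment))
```

In the toy data, columns 0 and 3 never change (conservation), while columns 1 and 2 always change together (correlation). A generative model built on such statistics would favor new sequences that respect both patterns.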

We usually assume that in order to build something, we must first have a deep understanding of how it works. But if you have enough sample data, you can use deep learning techniques to learn the design rules, even if you don’t understand how it works or why it’s built that way.

Rama Ranganathan, Professor at the Department of Biochemistry and Molecular Biology, Pritzker Molecular Engineering

He and his co-workers then created synthetic genes encoding the designed proteins, cloned them into bacteria, and watched as the bacteria produced the synthetic proteins using their normal cellular machinery. They found that the artificial proteins had the same catalytic function as natural chorismate mutase proteins.

Because the design rules are so simple, the number of artificial proteins that researchers could potentially create is enormous.

Although artificial intelligence has revealed the design rules, Ranganathan and his collaborators still don’t fully understand why the models work. Next, the scientists will work to understand how the models arrived at these results.

At the same time, they hope to use this platform to develop proteins that can address pressing societal problems such as climate change. Ranganathan and Associate Professor Andrew Ferguson have founded Evozyne, a company that will commercialize this technology for applications in energy, environment, catalysis and agriculture.