Shanghai Jiao Tong University releases "Venus," a protein design model that uses AI to train "hexagonal warriors" | MatwingsVenus™（晓鹜™）

With last year's Nobel Prize in Chemistry recognizing the developers of AlphaFold, protein design has become one of the hottest interdisciplinary research areas in AI for Science. However, predicting structure is only the starting point for creating a functional protein product. Getting from a predicted structure to a working product typically requires extensive expert experience combined with tens of thousands of experimental trials. The time-consuming, costly, trial-and-error nature of protein design and modification has long been a major challenge for the industry.

On March 22nd, Professor Hong Liang's team at Shanghai Jiao Tong University released their latest findings, offering a solution to these problems. The team combined AI with protein design and modification and established the world's largest protein dataset. Models trained on this dataset can accurately and efficiently predict and design protein functions, shifting protein production from "slow trial and error" to "highly efficient and precise design." Coupled with industry-leading automated equipment, this work has already been put into industrial use, turning protein design from a "complex science" into a "simple engineering" process.

Establishing the World's Largest Protein Sequence Dataset

Proteins are chains of amino acids; written as strings of amino acid letters, their sequences range from hundreds to thousands of characters long. In the AI era, data is a core resource driving technological progress. Massive protein sequence datasets provide AI models with abundant "learning material," helping them better understand the relationships among protein sequence, structure, and function. The Venus-Pod (Venus-Protein Outsize Dataset), built by Hong Liang's team, contains nearly 9 billion protein sequences and hundreds of millions of functional labels. It is the world's largest dataset in terms of both sequence count and the number of functional annotation labels, and is roughly four times the size of the 2.1 billion protein sequences used to train another well-known model in the field, the US-developed ESM-C.

This dataset includes 3.62 billion terrestrial microbial protein sequences, 2.64 billion marine microbial protein sequences, 2.43 billion antibody protein sequences, and 60 million viral protein sequences, covering protein sequence information from common surface-dwelling organisms to microorganisms in extreme environments. In particular, it carries hundreds of millions of functional labels (such as temperature, pH, and pressure) that characterize protein function.
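The figures above can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in this article; the variable names are illustrative and not part of any Venus tooling.

```python
# Sequence counts in billions, as quoted in the article.
venus_pod_billion = {
    "terrestrial microbial": 3.62,
    "marine microbial": 2.64,
    "antibody": 2.43,
    "viral": 0.06,
}
ESM_C_BILLION = 2.1  # training-set size the article quotes for ESM-C

# Total size of Venus-Pod and its size relative to the ESM-C training set.
total_billion = sum(venus_pod_billion.values())
ratio_to_esm_c = total_billion / ESM_C_BILLION

# Share of the dataset contributed by each source category.
shares = {name: count / total_billion for name, count in venus_pod_billion.items()}

print(f"Total: {total_billion:.2f} billion sequences")
print(f"Ratio to ESM-C training set: {ratio_to_esm_c:.2f}x")
for name, share in shares.items():
    print(f"  {name}: {share:.1%}")
```

The four categories sum to 8.75 billion sequences, which matches the "nearly 9 billion" figure, and the ratio to 2.1 billion comes out slightly above four.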

What does this data mean? Hong Liang stated that, firstly, this dataset constitutes a vast "protein mine," making it possible to discover new proteins or biocatalysts and thus contributing to the rapid development of biomedicine and synthetic biology. Secondly, with this much data, large AI models can be expected to learn the evolutionary patterns of proteins in nature, giving AI a valuable basis for designing superior protein products.

Aiming at Protein "Function Prediction," Using AI to Train "Hexagonal Warriors"

Proteins are polymer chains built from 20 types of amino acids. Each chain twists and folds into a unique three-dimensional structure, and it is this structure that endows a specific protein with its biological function. In 2024, the Nobel Prize in Chemistry recognized the Google DeepMind team, which used AI to accurately map the relationship between protein sequences and three-dimensional structures, solving a fundamental problem that had puzzled biologists for 50 years. A practical problem remains, however: if we modify the amino acid sequence of a protein even slightly, by as little as 1%, the overall structure may not appear to change significantly, yet the function is likely to deteriorate or be lost completely. In other words, to design a successful protein product we cannot focus only on three-dimensional structure; we must be able to predict and design function. Professor Hong Liang's team therefore took a different approach. Rather than focusing on protein structure, they directly targeted the ultimate goal of "function prediction," turning complex protein design into a simple, demand-driven process that yields results with a small number of experiments.
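To give a sense of how small a 1% sequence change is, the following minimal sketch substitutes 1% of the positions in a random 300-residue sequence and reports the resulting sequence identity. The helper names `mutate_fraction` and `sequence_identity` are hypothetical, and the sequence is randomly generated rather than a real protein. The sketch only counts changed residues; it says nothing about structure or function.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids


def mutate_fraction(seq, fraction, rng):
    """Return a copy of seq with about `fraction` of its positions substituted."""
    n_mut = max(1, round(len(seq) * fraction))
    positions = rng.sample(range(len(seq)), n_mut)  # distinct positions
    residues = list(seq)
    for pos in positions:
        # Always pick a residue that differs from the current one.
        choices = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues)


def sequence_identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed so the example is repeatable
wild_type = "".join(rng.choice(AMINO_ACIDS) for _ in range(300))  # toy sequence
variant = mutate_fraction(wild_type, 0.01, rng)

n_changed = sum(x != y for x, y in zip(wild_type, variant))
print(f"{n_changed} of {len(wild_type)} residues changed")
print(f"sequence identity: {sequence_identity(wild_type, variant):.2%}")
```

For a 300-residue protein, 1% amounts to only three substitutions, which is why a function-level prediction is needed to tell a harmless variant from a damaging one.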

"We trained the Venus series of models. Unlike DeepMind's AlphaFold, which predicts protein structure, this model learns the organization rules of protein sequences in nature and their relationship with function. Its accuracy in predicting the functional effects of protein mutations ranks first in the industry," said Hong Liang. He explained that the Venus series of models has two core functions: "AI-directed evolution" and "AI enzyme discovery." "AI-directed evolution" refers to the models' ability to optimize multiple properties of a less-than-ideal protein product at once, making it a "hexagonal warrior" that meets application requirements on every front. "AI enzyme discovery," on the other hand, refers to the models' ability to "select super-powered warriors" from the massive dataset of proteins with unknown functions, precisely identifying proteins with unconventional functions that meet stringent application requirements, such as extreme heat resistance, extreme acid resistance, extreme alkali resistance, and extreme resistance to gastrointestinal digestion. Proteins with such extraordinary functions have enormous application potential in biotechnology, pharmaceutical research and development, and industrial production, and could bring innovation and breakthroughs to those fields.
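The "hexagonal warrior" idea, a variant that is strong on every axis rather than excellent on one, can be illustrated with a simple Pareto filter over per-property scores. This is a generic multi-objective selection sketch, not the Venus method; the variant names, the three properties, and all scores below are made-up toy values.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scored):
    """Names of candidates that no other candidate dominates, sorted alphabetically."""
    return sorted(
        name
        for name, scores in scored.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in scored.items()
            if other != name
        )
    )


# Hypothetical predicted scores: (activity, thermostability, alkali resistance).
# Higher is better on every axis. These numbers are illustrative only.
toy_scores = {
    "variant_A": (0.9, 0.8, 0.7),
    "variant_B": (0.6, 0.9, 0.8),
    "variant_C": (0.5, 0.7, 0.6),  # worse than variant_A on every axis
    "variant_D": (0.9, 0.8, 0.6),  # ties variant_A on two axes, loses on the third
}

front = pareto_front(toy_scores)
print("non-dominated variants:", front)
```

Only the non-dominated variants would be worth carrying into wet-lab testing under this scheme; a real pipeline would also apply minimum thresholds for each property.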

Meanwhile, the world's first automated integrated machine for low-throughput, high-volume protein expression, purification, and functional detection is used in conjunction with the Venus series models. It can continuously complete the expression, purification, and detection of over 100 proteins within 24 hours, a nearly tenfold efficiency gain over manual work. This significantly reduces the labor, material, and time costs of research and development, and dramatically improves the efficiency of protein engineering and synthetic biology research. Its core principle is "AI-driven design, automated experimentation," which frees researchers from tedious design and experimentation. Researchers only need to formulate questions, and AI and automation will solve them, ultimately turning complex protein science discovery into a simple, "point-and-shoot" process.

Several Products Already Launched, Aiding the Diagnosis of Diseases Such as Alzheimer's

Going from 0 to 100, from basic research to industrial application, is a challenge in any scientific field, and AI protein design is no exception. AI-designed proteins may perform well in the laboratory, but this does not guarantee that they will maintain that performance in pilot-scale or large-scale production. Industrial production involves larger-scale raw material processing, more complex process control, and stricter quality standards, all of which can affect protein performance. Even so, several proteins designed using the Venus series models have already been industrialized.

Take, for example, the alkali resistance modification of single-domain antibodies for Jinsai Pharmaceutical, a leading domestic growth hormone manufacturer. Improving the alkali resistance of proteins has always been a highly challenging task. Using the model together with closed-loop iterative verification through small-scale wet-lab experiments, Hong Liang's team improved the alkali resistance of ordinary single-domain antibodies fourfold in less than a year, saving Jinsai Pharmaceutical tens of millions of yuan annually. Production has been scaled up to 5,000 liters across multiple batches, making this the world's first protein product designed with a large model and mass-produced.

Another innovative application of the Venus series models is a project to modify alkaline phosphatase (ALP) for a leading in vitro diagnostics company. ALP is widely used as a labeling enzyme because of its high stability and sensitivity. Higher enzyme activity translates to higher detection sensitivity, enabling the detection of even very low levels of biomarkers. However, improving ALP activity has always been a challenge. The Venus series models successfully optimized ALP, achieving molecular activity three times higher than that of products from leading international companies and bringing significant value to ultrasensitive detection and diagnosis of conditions such as myocardial infarction and Alzheimer's disease. The modified ALP has now entered the 200 L scale-up production stage, marking the successful industrial translation of the Venus series models.