
Large Protein Models: From Understanding Protein Language to Building Intelligent R&D Agents

Published on May 26, 2026

If large natural language models have enabled machines to 'understand' human text, then large protein models are attempting something of more fundamental significance—allowing AI to read the 'language' of life.


Proteins are chains built from 20 standard amino acids. The order of these amino acids determines how a protein folds into a three-dimensional structure, what functions it performs, and which molecules it interacts with. In a sense, proteins are a highly refined 'language system' shaped by billions of years of evolution. The core task of large protein models is to learn the 'grammar' and 'semantics' of this system from vast amounts of protein sequence data, so that proteins can be understood, predicted, and even designed from scratch.

Protein Language From Sequence to Structure


1. What Is a Large Protein Model?

A large protein model is not a single technical concept but a general term for a class of deep learning models pretrained at scale on protein sequence or structure data. The underlying logic closely mirrors that of large language models in natural language processing: amino acids are treated as "words" and protein sequences as "sentences," and the model captures contextual dependencies between amino acids through self-supervised tasks such as masked language modeling and sequence generation.
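The "words and sentences" analogy can be made concrete with a small sketch of how masked language modeling inputs might be prepared. It is an illustration under stated assumptions, not the preprocessing of any particular model: the `<mask>` symbol, the 15% masking rate, and the function names are all hypothetical choices.

```python
import random

# The 20 standard amino acids, written as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "<mask>"  # hypothetical mask symbol, chosen for this sketch only


def tokenize(sequence):
    """Split a protein "sentence" into per-residue "words"."""
    tokens = list(sequence.upper())
    unknown = [t for t in tokens if t not in AMINO_ACIDS]
    if unknown:
        raise ValueError(f"non-standard residues: {unknown}")
    return tokens


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of residues.

    Returns the masked input and a dict mapping each masked position to the
    residue a model would be trained to recover from the surrounding context.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, targets


if __name__ == "__main__":
    tokens = tokenize("MKTAYIAKQRQISFVKSHFSRQ")  # arbitrary example sequence
    masked, targets = mask_tokens(tokens)
    print(" ".join(masked))
    print(targets)
```

During pretraining, a model would see the masked input and be scored on how well it predicts the hidden residues, which is what pushes it to learn dependencies between positions.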

This approach has been validated repeatedly in recent years. Protein language models such as the ESM series, trained only with self-supervision, can predict protein structure and function accurately, with performance approaching or even surpassing traditional methods that rely on multiple sequence alignment. In 2025, xTrimoPGLM pushed the parameter count to the hundred-billion scale, further demonstrating that scaling deepens a model's understanding of proteins.

However, understanding is only the first step. The truly exciting direction for large protein models is moving from "reading" to "writing"—that is, generating entirely new functional protein sequences that do not exist in nature.

Protein Language Models From Reading to Writing

2. Breakthroughs in Generative Capability: The Key Transition for Large Protein Models

From 2025 to early 2026, the generative capabilities of large protein models went through several iterations. Large-scale generative models such as ProGen3, trained on 1.5 trillion amino acid tokens, learned to generate complete protein sequences from scratch. More importantly, they can "fill in the blanks" given a context, producing fragments at specified positions in a sequence that satisfy local structural and functional requirements. In the same period, ESM3 set out to unify three modalities (sequence, structure, and function) in a single generative framework, so that the model can generate proteins with specific functions under complex combinations of conditions.


At the same time, diffusion models have provided another technical pathway for protein design. The RFdiffusion series generates entirely new protein backbones by gradually denoising random noise, and RFdiffusion2 and RFdiffusion3 progressively achieved atomic-level accuracy in active site design and full-atom environment modeling. Notably, these diffusion models are usually used together with large protein models: the diffusion model generates the three-dimensional backbone, the protein model infers a sequence for it, and structure prediction tools then verify the result in a closed loop.


In short, the mainstream protein design process now forms a "generate, predict, screen" closed loop, with large protein models serving as the bridge between sequence space and function space.
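The shape of such a loop can be sketched in a few lines. Everything here is a stand-in: `generate` uses random point mutations where a real pipeline would call a generative model, and `predict` returns a toy score (the fraction of hydrophobic residues) where a real pipeline would call structure or function predictors. The function names and parameters are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWY")


def generate(parent, n_candidates, n_mutations, rng):
    """Stand-in for a generative model: random point mutants of a parent."""
    candidates = []
    for _ in range(n_candidates):
        seq = list(parent)
        for pos in rng.sample(range(len(seq)), n_mutations):
            seq[pos] = rng.choice(AMINO_ACIDS)
        candidates.append("".join(seq))
    return candidates


def predict(sequence):
    """Stand-in for a predictor: a toy score in [0, 1], NOT a real property."""
    return sum(aa in HYDROPHOBIC for aa in sequence) / len(sequence)


def screen(candidates, top_k):
    """Rank unique candidates by predicted score and keep the best top_k."""
    scored = [(predict(seq), seq) for seq in set(candidates)]
    scored.sort(reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    rng = random.Random(0)
    parent = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
    shortlist = screen(generate(parent, 50, 2, rng), top_k=5)
    for score, seq in shortlist:
        print(f"{score:.2f}  {seq}")
```

Keeping the three stages as separate functions mirrors how a real pipeline lets each stage be swapped for a stronger model without touching the others.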


3. Industrialization Challenges of Large Protein Models: From Models to Workflows

Although the capabilities of models continue to improve, in real R&D scenarios, the value of large protein models does not exist in isolation. A common misunderstanding is that with a powerful generative model, usable proteins can be obtained directly. In reality, protein design tasks often involve complex requirement decomposition—the user needs not just a sequence, but a solution that "remains stable at specific pH and temperature, has nanomolar affinity to a given target, and avoids non-specific binding."
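One way to picture this decomposition is as an explicit specification that each candidate is checked against. The sketch below is illustrative only: the field names, the use of melting temperature as a stability proxy, the off-target score, and every threshold and value are assumptions made for the example, not measurements or a real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignSpec:
    """Hypothetical requirements distilled from a user's request."""
    min_melting_temp_c: float   # stability proxy under the working conditions
    max_kd_nm: float            # affinity: dissociation constant in nanomolar
    max_offtarget_score: float  # non-specific binding proxy, lower is better


@dataclass(frozen=True)
class Candidate:
    """Predicted properties of one designed sequence (made-up values)."""
    name: str
    melting_temp_c: float
    kd_nm: float
    offtarget_score: float


def failed_requirements(candidate, spec):
    """Return the names of the requirements this candidate does not meet."""
    failures = []
    if candidate.melting_temp_c < spec.min_melting_temp_c:
        failures.append("stability")
    if candidate.kd_nm > spec.max_kd_nm:
        failures.append("affinity")
    if candidate.offtarget_score > spec.max_offtarget_score:
        failures.append("specificity")
    return failures


if __name__ == "__main__":
    spec = DesignSpec(min_melting_temp_c=60.0, max_kd_nm=10.0,
                      max_offtarget_score=0.2)
    candidates = [
        Candidate("design_a", 68.0, 3.5, 0.10),
        Candidate("design_b", 55.0, 250.0, 0.10),
    ]
    for c in candidates:
        print(c.name, failed_requirements(c, spec) or "meets spec")
```

Reporting which requirement failed, rather than a single pass or fail flag, tells the next design round what to fix.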


This requires the model to not only generate sequences but also understand task boundaries, call appropriate prediction tools, screen candidate sequences, and form a closed loop with experimental validation. In other words, large protein models need to evolve from a "generator" into a "workflow engine."


It is against this backdrop that the concept of agents has begun to enter protein design. Agents use large language models as the "brain," responsible for understanding user intent, decomposing tasks, orchestrating underlying tools, and handling feedback. When large protein models are integrated as a core component of an agent, both the way researchers interact with the R&D process and the efficiency of that process change.
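The orchestration pattern described here can be sketched as a tool registry plus a planner. This is a minimal sketch under strong assumptions: `plan_for` returns a fixed plan where a real agent would ask a large language model to decompose the goal, and the three tools return placeholder data instead of calling real design or prediction models. All names are hypothetical.

```python
from typing import Callable, Dict, List

Tool = Callable[[dict], dict]
TOOLS: Dict[str, Tool] = {}


def tool(name: str):
    """Decorator that registers a function as a named tool."""
    def register(fn: Tool) -> Tool:
        TOOLS[name] = fn
        return fn
    return register


@tool("design")
def design(state: dict) -> dict:
    # Placeholder: a real tool would call a generative protein model.
    return {**state, "candidates": ["MKTAYIAK", "MKTAYIAR", "MKKAYIAK"]}


@tool("predict")
def predict(state: dict) -> dict:
    # Placeholder score (count of lysines), not a real property prediction.
    scores = {seq: seq.count("K") for seq in state["candidates"]}
    return {**state, "scores": scores}


@tool("screen")
def screen(state: dict) -> dict:
    best = max(state["scores"], key=state["scores"].get)
    return {**state, "selected": best}


def plan_for(goal: str) -> List[str]:
    """Stand-in for LLM task decomposition: the same fixed plan for any goal."""
    return ["design", "predict", "screen"]


def run(goal: str) -> dict:
    """Execute the plan step by step, passing shared state between tools."""
    state = {"goal": goal}
    for step in plan_for(goal):
        if step not in TOOLS:
            raise KeyError(f"unknown tool: {step}")
        state = TOOLS[step](state)
    return state


if __name__ == "__main__":
    result = run("find a stable binder for target X")
    print(result["selected"])
```

Because tools share one state dictionary, adding a capability means registering one more function and letting the planner include its name.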

Workflow for Protein Design

4. The Xiaowu Intelligent Agent: Integrating Large Protein Model Capabilities

In April 2026, Matwings Technology released MatwingsVenus™ (Xiaowu™), a conversational intelligent agent for protein design. Seen in the context of how large protein models have developed, MatwingsVenus™ (Xiaowu™) can be understood as an integration of their generative capabilities with broader workflow capabilities.


MatwingsVenus™ (Xiaowu™) integrates over 200 protein design tools, more than 50 platform-certified experts, and over 30 expert-tuned skills, backed by a database of tens of billions of proteins with real labels. Users state their task goals in natural language; the system then decomposes the task automatically and invokes the appropriate design, prediction, analysis, and screening capabilities to cover the full process, from deep research, enzyme mining, and directed evolution to de novo design.

MatwingsVenus Model of protein

Even more noteworthy, MatwingsVenus™ (Xiaowu™) has opened a collaborative channel between cloud-based design and physical experiments. Once a design is complete, the platform uses a self-built communication mechanism to import the results automatically into plasmid ordering and experiment orchestration workflows. Downstream experimental tasks follow on directly: robots carry out sample preparation, protein purification, and functional testing, and the experimental results are fed back into the next round of AI design. The result is a closed-loop dry-lab and wet-lab system in which "computation drives wet experiments, and wet experiments feed back into computation." Protein design is therefore no longer a one-off linear process of outputting a sequence and waiting for validation, but a system that can iterate continuously and optimize itself.
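The difference between a one-off linear process and an iterated loop can be shown with a toy simulation. It does not describe how MatwingsVenus™ works internally: `assay` is a stand-in for a wet-lab measurement (it counts matches to a hidden toy optimum), `propose` is a stand-in for AI design (single point mutants), and all names are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HIDDEN_OPTIMUM = "MKTAYIAKQR"  # toy ground truth, unknown to the designer


def assay(sequence):
    """Stand-in for a wet-lab measurement: matches to the hidden optimum."""
    return sum(a == b for a, b in zip(sequence, HIDDEN_OPTIMUM))


def propose(parent, n_variants, rng):
    """Stand-in for AI design: single point mutants of the current best."""
    variants = []
    for _ in range(n_variants):
        pos = rng.randrange(len(parent))
        variants.append(parent[:pos] + rng.choice(AMINO_ACIDS) + parent[pos + 1:])
    return variants


def closed_loop(start, rounds, n_variants, seed=0):
    """Run design and measurement rounds, feeding results into the next round."""
    rng = random.Random(seed)
    best, best_value = start, assay(start)
    history = [best_value]
    for _ in range(rounds):
        measured = [(assay(v), v) for v in propose(best, n_variants, rng)]
        top_value, top_seq = max(measured)
        if top_value > best_value:  # feedback: measurements pick the next parent
            best, best_value = top_seq, top_value
        history.append(best_value)
    return best, history


if __name__ == "__main__":
    best, history = closed_loop("AAAAAAAAAA", rounds=10, n_variants=20)
    print(best, history)
```

The `history` list never decreases, because a round only replaces the parent when a measured variant beats it; a linear process would stop after the first entry.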

According to Hong Liang, founder of Matwings Technology, MatwingsVenus™ is positioned not as a single large protein model but as a one-stop platform: users can search literature, patents, and market information to find a general research direction and then implement it with design technology. In this sense, the large protein model is one of the foundations of MatwingsVenus™, which turns that foundational capability into a conversational, closed-loop R&D tool.

5. Integration of Large Protein Models with Agents

The evolution of large protein models from "understanding" to "generation" is opening up new possibilities for protein research and development. However, improving model capabilities is only the first step. How to use these capabilities efficiently and conveniently in real, complex R&D scenarios is the next question the industry needs to answer.

Hong Liang, Chief Scientist of Matwings Technology, has shared his thinking on this question on multiple occasions. In his view, artificial intelligence is transforming protein engineering from a complex "discipline" that depends heavily on experience and luck into a predictable, highly efficient form of "engineering." The core of this transformation lies in combining underlying large protein model capabilities with higher-level workflow orchestration, so that AI can not only "design" but also "understand user intent," "schedule appropriate tools," and "form a closed loop with experimental data."

Based on this idea, Matwings Technology launched the conversational protein design agent MatwingsVenus™ (Xiaowu™). Hong Liang describes the role of AI in three stages: in the past, an auxiliary tool for experts; at present, an inclusive platform; and ultimately, an "AI Co-scientist" capable of independently proposing scientific hypotheses. From this perspective, MatwingsVenus™ integrates the generation, analysis, and automated experiment capabilities of large protein models into a coherent, easy-to-use workflow.

When model capabilities and workflow capabilities truly work together, the barrier to entry for protein design is expected to fall significantly, and the efficiency of the closed loop between research and design should improve systematically. This is a field worth watching closely.