Protein Engineering and Protein Design with a Protein Large-Model Agent | MatwingsVenus™ (晓鹜™)

The usual story of protein engineering is a triumphant march from 'random trial and error' to 'AI-powered design.' The actual history is messier: an adventure full of failures, misunderstandings, and unexpected discoveries. Studying those failures may tell us more than a list of success stories about two questions: when AI begins to design proteins, what chasm does it actually cross, and what legacy, hidden deep in the logic of proteins, does it inherit?

1. The Complexity of Protein Engineering Research Is Far Higher Than Expected

Proteins are chains built from 20 standard amino acids, and a chain can range from a few dozen residues to tens of thousands. Take a medium-length protein of 300 residues: it has 20^300, or roughly 10^390, possible sequences, far more than the estimated 10^80 atoms in the observable universe. In a sequence space that dwarfs even 'astronomical' numbers, the sequences that fold stably and perform a specific function are only a few scattered 'islands.'
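The scale of this sequence space is easy to check with a few lines of arithmetic. The sketch below works in log10 to keep the numbers readable; the figure of 10^80 atoms is a common order-of-magnitude estimate, and the function names are illustrative only.

```python
import math

AMINO_ACIDS = 20               # size of the standard amino acid alphabet
ATOMS_IN_UNIVERSE_LOG10 = 80   # assumption: common order-of-magnitude estimate


def log10_sequence_count(length: int, alphabet: int = AMINO_ACIDS) -> float:
    """Return log10 of the number of distinct sequences of a given length."""
    return length * math.log10(alphabet)


def min_length_exceeding(log10_limit: float, alphabet: int = AMINO_ACIDS) -> int:
    """Smallest chain length whose sequence count exceeds 10**log10_limit."""
    return math.floor(log10_limit / math.log10(alphabet)) + 1


if __name__ == "__main__":
    for n in (50, 100, 300):
        print(f"length {n}: about 10^{log10_sequence_count(n):.0f} sequences")
    print("shortest chain with more sequences than atoms:",
          min_length_exceeding(ATOMS_IN_UNIVERSE_LOG10))
```

Under these assumptions, the crossover already happens at a chain of only a few dozen residues, well below the length of a typical enzyme.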

Thus, the fundamental question of protein engineering becomes exceptionally clear: Can we find those 'islands' within this sequence space that are stable, efficient, and precise under specific conditions? Going even further—can we design entirely new protein sequences from scratch that have not yet evolved in nature?

The early directed evolution strategy was simple: imitate Darwinian cycles of mutation and selection. Introduce random mutations into a gene, express thousands of variants in bacteria or yeast, and screen for the lucky variant that performs even slightly better. Repeating this cycle in the laboratory, sometimes dozens of times, is like firing a shotgun at a target in the dark: eventually, you hit it. Frances Arnold shared the 2018 Nobel Prize in Chemistry for the directed evolution of enzymes, work that made the method a fundamental tool of protein engineering. What is mentioned less often is that such projects can require screening millions or even hundreds of millions of clones, take a long time, and cost a great deal, which makes them a contest of probability and endurance.
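The mutate, screen, and select cycle described above can be sketched as a short simulation. This is a toy under loud assumptions: `toy_fitness` is a hypothetical stand-in for a laboratory assay (it simply counts matches to a hidden optimum), and the mutation rate, library size, and number of rounds are arbitrary illustrative values.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def mutate(seq: str, rng: random.Random, rate: float = 0.05) -> str:
    """Return a copy of seq in which each position is re-drawn with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else aa for aa in seq)


def toy_fitness(seq: str, target: str) -> int:
    """Hypothetical assay: number of positions that match a hidden optimum."""
    return sum(a == b for a, b in zip(seq, target))


def directed_evolution(parent, target, rounds=10, library_size=500, seed=0):
    """Run rounds of random mutagenesis and screening; return the final parent and fitness history."""
    rng = random.Random(seed)
    history = [toy_fitness(parent, target)]
    for _ in range(rounds):
        # Build a variant library from the current parent and screen every member.
        library = [mutate(parent, rng) for _ in range(library_size)]
        best = max(library, key=lambda s: toy_fitness(s, target))
        # Keep the old parent if no variant in this round improves on it.
        if toy_fitness(best, target) > toy_fitness(parent, target):
            parent = best
        history.append(toy_fitness(parent, target))
    return parent, history


if __name__ == "__main__":
    rng = random.Random(1)
    target = "".join(rng.choice(ALPHABET) for _ in range(30))
    start = "".join(rng.choice(ALPHABET) for _ in range(30))
    final, history = directed_evolution(start, target)
    print("fitness per round:", history)
```

The additive fitness used here is far smoother than a real protein landscape, so the toy climbs steadily. Real campaigns face interactions between positions, which is part of why so many clones must be screened.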

Figure: Directed Evolution

A subtler problem is that even in successful directed evolution experiments, the 'paths to success' are often perplexing. Reviewing their experiments afterward, researchers found that many of the key mutations that substantially improved enzyme performance lay far from the active site, or even at seemingly inconsequential positions on the protein surface. How do they work? Sometimes through subtle long-range allosteric effects that reshape the active-site pocket; sometimes simply by making the protein more thermostable or less prone to aggregation. Either way, these mechanisms are rarely something human intuition can predict. The amino acids in a protein form an extremely complex network of interactions, in which a small change can affect the entire system. The lesson is a profound one: the engineer's usual 'disassemble, optimize, reassemble' mindset often fails when applied to proteins.

2. The Hubris of Rational Design and the Dark Matter of Folding

With the development of X-ray crystallography and cryo-electron microscopy, scientists have obtained three-dimensional atomic-level structures of more and more proteins. An enticing idea arose: since we can see what proteins look like, why not 'photoshop' them directly? If the active site is too small, make it bigger; if the hydrophobic region is not stable enough, add some strongly hydrophobic amino acids; if new catalytic activity is desired, mimic and transplant known enzyme active sites.

This is called 'rational design.' For a few proteins with clearly defined structures and mechanisms, it has achieved remarkable success. More often, though, rational design ends in repeated, silent failure. A typical case: calculations show that mutating a serine at a certain position to aspartic acid should stabilize the transition state, and the crystal structure appears to support the predicted binding mode. Yet the mutant protein either does not express at all or, if it does, precipitates as inclusion bodies. It has failed to fold correctly.

Figure: The Trap of Rational Design

The harsh lesson of this common failure is that sequence determines structure, but we do not truly understand how. Protein folding is an extremely complex physicochemical process involving backbone dihedral angles, side-chain packing, solvation effects, and entropy-enthalpy compensation, and a disturbance in any of them can cause the entire 'folding funnel' to collapse. The energy functions and force-field parameters that rational design relies on are simplified approximations of these interactions. They are barely adequate for local tweaks, and once changes span the whole sequence, their errors can accumulate to an unacceptable level.

Even more alarming is the existence of 'folding dark matter.' Computational analyses suggest that the sequences capable of stable folding may be separated from one another by vast 'unfoldable regions.' So even if we know the sequences of an excellent protein A and an excellent protein B, the two are not necessarily 'connected' in sequence space: an attempt to walk from A to B by point mutations is very likely to fall into the abyss of folding failure along the way. What protein engineering faces is not a smooth, continuous optimization surface but a wild terrain scattered with traps, faults, and cliffs.
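Whether two proteins are 'connected' can be phrased as a graph search: is there a chain of single substitutions from A to B in which every intermediate still folds? The sketch below is a toy, not a folding predictor. The two-letter alphabet (H and P), the three-residue sequences, and the hand-written sets of 'foldable' sequences are all assumptions made purely for illustration.

```python
from collections import deque


def neighbors(seq, alphabet):
    """Yield every sequence that differs from seq by exactly one substitution."""
    for i, old in enumerate(seq):
        for new in alphabet:
            if new != old:
                yield seq[:i] + new + seq[i + 1:]


def mutational_path(start, goal, is_foldable, alphabet):
    """Breadth-first search for the shortest point-mutation path through foldable sequences.

    Returns the path as a list of sequences, or None if start and goal are not connected.
    """
    if not (is_foldable(start) and is_foldable(goal)):
        return None
    queue = deque([start])
    parent = {start: None}
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk back through the recorded parents to rebuild the path.
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for nxt in neighbors(current, alphabet):
            if nxt not in parent and is_foldable(nxt):
                parent[nxt] = current
                queue.append(nxt)
    return None


if __name__ == "__main__":
    connected = {"HHH", "HHP", "HPP", "PPP"}   # hypothetical foldable set with a bridge
    isolated = {"HHH", "HHP", "PPP"}           # same islands, one stepping stone removed
    print(mutational_path("HHH", "PPP", connected.__contains__, "HP"))
    print(mutational_path("HHH", "PPP", isolated.__contains__, "HP"))
```

Removing a single stepping stone is enough to cut the two islands apart in this toy. In a real sequence space the search is intractable by enumeration, which is why step-by-step exploration so often ends in folding failure.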

3. What AI Brings Is Not Just Faster Calculations

Only by understanding the dilemmas mentioned above can one truly grasp the profound impact AI—especially protein language models—has brought to this field. Traditional methods, whether directed evolution or rational design, all start from "known proteins": taking a naturally occurring sequence and exploring a limited sequence space around it. Directed evolution makes random jumps around it, while rational design fine-tunes within its structural framework. Both are locked in the "vicinity" of evolution, making it difficult to leap into genuinely distant, truly novel regions of sequence space.

The breakthrough of protein language models is that, after self-supervised learning on billions of natural protein sequences, they internalize a deep knowledge of "what a protein should look like." They do not memorize structural rules by rote; they learn high-dimensional co-evolution patterns between amino acids, constraints on folding compatibility, and the sequence preferences of functional sites. More importantly, this knowledge is no longer anchored to any specific natural sequence but is distilled into a generative capability: the model can "write" an entirely new sequence from scratch, residue by residue, one that appears nowhere in natural protein databases yet is highly "protein-like" in the physicochemical sense.

For the first time, researchers can sample directly from the vast sequence space instead of starting from a known natural protein and creeping through its immediate neighborhood step by step. Given a functional constraint, such as "a loop that can bind a specific target" or "a hydrolase scaffold that remains active at 60 degrees Celsius," the model can generate thousands of candidate sequences that differ greatly from one another yet are all likely to fold stably and meet the condition. Researchers then need to synthesize only dozens to hundreds of them for experimental validation to zero in on lead molecules efficiently. Protein engineering has moved from "modifying nature" to "parallel creation."
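The generate, score, and shortlist workflow can be sketched with a deliberately tiny stand-in for a language model. Everything here is an assumption for illustration: a first-order (bigram) Markov model replaces the large neural networks used in practice, the two training strings are made-up examples rather than real proteins, and log-likelihood under the model stands in for the computational evaluation step. It is not the method of any particular product or library.

```python
import math
import random
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def train_bigram(seqs, pseudocount=1.0):
    """Estimate P(next residue | current residue) with additive smoothing; '^' marks the start."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        prev = "^"
        for aa in s:
            counts[prev][aa] += 1
            prev = aa
    model = {}
    for prev in "^" + ALPHABET:
        total = sum(counts[prev].values()) + pseudocount * len(ALPHABET)
        model[prev] = {aa: (counts[prev][aa] + pseudocount) / total for aa in ALPHABET}
    return model


def sample(model, length, rng):
    """Write a new sequence residue by residue from the learned transition probabilities."""
    prev, out = "^", []
    for _ in range(length):
        probs = model[prev]
        aa = rng.choices(ALPHABET, weights=[probs[a] for a in ALPHABET])[0]
        out.append(aa)
        prev = aa
    return "".join(out)


def log_likelihood(model, seq):
    """Score how 'protein-like' a sequence is under the toy model."""
    prev, total = "^", 0.0
    for aa in seq:
        total += math.log(model[prev][aa])
        prev = aa
    return total


def generate_and_rank(model, n_candidates, length, keep, seed=0):
    """Sample many candidates, then shortlist the highest-scoring ones for 'synthesis'."""
    rng = random.Random(seed)
    candidates = {sample(model, length, rng) for _ in range(n_candidates)}
    ranked = sorted(candidates, key=lambda s: log_likelihood(model, s), reverse=True)
    return ranked[:keep]


if __name__ == "__main__":
    toy_training_set = ["MKTAYIAKQR", "MKVLAAGIVG"]  # made-up sequences, not real proteins
    model = train_bigram(toy_training_set)
    for seq in generate_and_rank(model, n_candidates=1000, length=12, keep=5):
        print(seq, round(log_likelihood(model, seq), 2))
```

The design point is the funnel shape: sampling is cheap, so thousands of candidates can be generated and filtered computationally before anything is synthesized. A real pipeline would also condition generation on the functional constraint, which this toy does not attempt.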

Figure: Protein Language Models

In this new field full of possibilities, MatwingsVenus™ (XiaoWu™) agents from Matwings Technology transform the complex capabilities of protein language models into an intuitive, interactive design experience, allowing researchers to conveniently input functional requirements, explore sequence space, and perform comprehensive computational evaluations of generated candidate proteins. This shifts the starting point of protein engineering from "what we have" to "what we need."

4. What It Takes to Truly Achieve 'De Novo Design'

If generating sequences from scratch is the first level of qualitative change in protein engineering, the next level may lie in the genuine understanding and programming of protein dynamic properties.

Proteins are not static sculptures. Seen through methods such as nuclear magnetic resonance, real proteins are more like finely crafted machines in constant motion: their structures are not fixed in a single state but keep switching among a series of similar conformations. Many functions, including enzyme catalysis, receptor signal transduction, and the directional movement of molecular motors, depend on precisely this flexibility rather than on the static structure alone. Current protein language models learn mainly from static sequence information, and their implicit representations of dynamic properties remain very limited.

Figure: Protein Dynamics

Another huge unknown lies in conditional dependence. Whether a sequence can fold depends not only on the sequence itself but also on its environment — pH, temperature, ionic strength, the presence of molecular chaperones, and the crowded cytoplasmic environment. And what protein engineering ultimately aims to deliver is not a beautifully folded work of art in a test tube, but an industrial product or drug that functions stably in cellular factories or human blood. This requires incorporating an understanding of cellular environmental constraints into AI models, which is only just beginning.

But the possibilities are already opening up. As of 2026, the union of AI and protein engineering is entering a phase of dizzying acceleration. Mutational scanning and functional characterization that once took an entire doctorate can now be completed in a few rounds of wet-lab experiments and AI iteration. The once unattainable goal of 'designing entirely new enzymes to catalyze unnatural reactions' is moving from demonstration cases toward industrial-scale application. More importantly, a new scientific paradigm is taking shape: AI is not just a tool but a 'co-thinker' in exploring one of the ultimate scientific questions, the relationship between protein sequence, structure, and function.

If directed evolution taught us humility — to acknowledge our ignorance and let natural selection find the answers for us — then protein engineering in the AI era adds a layer of ambition on top of humility: we may not need to fully understand every detail of folding; as long as we learn the grammar and semantics of the 'language' of proteins, we can write unprecedented poems with it. And as we illuminate more and more functional islands in this vast sequence space, humans will no longer be mere readers of the code of life — we will become its co-authors.

Protein Engineering: On the Ruins of Failure, We Learned to Design Life