AI Enzyme Mining: The Technological Leap from Traditional Screening to Intelligent Design
Published on May 19, 2026
Whether it's innovative drug development, high-end food ingredients, or green chemicals and bio-based materials, all of these rely on a key component—enzymes. If proteins are the "workers" in biological manufacturing, enzymes are the core "technical workers," determining the efficiency of every step from raw materials to products.
However, finding a natural enzyme that fully meets industrial production needs has long been highly challenging. Over the past two years, the term "AI enzyme mining" has appeared with increasing frequency in the industry. What problem does it actually solve? What is the technical logic behind it? How far has it progressed? This article attempts a systematic review.
01 What exactly makes traditional "enzyme finding" difficult?
Historically, enzymes for industrial applications came from two sources: "finding" them in the natural environment, or "modifying" enzymes that were already known.
Let's first look at "finding." Although nature holds vast enzyme resources, natural enzymes often have inherent problems such as low activity, poor stability, and a narrow substrate spectrum. Screening out a qualified candidate from thousands of unknown enzymes requires building a metagenomic library and running high-throughput functional screening. Even with tens of thousands of samples screened, the hit rate is often below 1%, which the industry describes as "looking for a needle in a haystack."
Now let's look at "modifying." Approaches represented by directed evolution approximate the optimal solution through random mutation and iterative selection. Since the 1990s, this method has produced a number of successful industrial enzyme improvements, but in essence it remains "casting a net in local waters": the search stays close to the starting sequence. Screening thousands of variants is only the beginning, and each round of evolution requires rebuilding the library, re-expressing, and re-testing, at very high cost in time and money.
Liu Hao, CTO of Matwings Technology, once gave a vivid estimate: for a protein of 361 amino acids, there are nearly 7,000 ways to replace a single amino acid; replacing two raises the number of variants to over 23 million; and replacing three brings the possibilities to about 53.3 billion. Under traditional directed evolution or high-throughput screening, researchers build mutant libraries, then express, purify, and assay in the lab round after round, spending months or even years, yet the variants they screen are only a drop in the ocean of the full sequence space.
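The figures in this estimate can be reproduced with a short calculation. The sketch below assumes the standard 20-letter amino acid alphabet, so each chosen position has 19 possible replacements; the function name `variant_count` is purely illustrative.

```python
from math import comb

def variant_count(length: int, k: int, alphabet: int = 20) -> int:
    """Number of sequences that differ from the wild type at exactly k positions.

    Choose which k positions to change, then pick one of the (alphabet - 1)
    other residues at each chosen position.
    """
    return comb(length, k) * (alphabet - 1) ** k

for k in (1, 2, 3):
    print(f"{k} substitution(s): {variant_count(361, k):,}")
# 1 substitution(s): 6,859
# 2 substitution(s): 23,457,780
# 3 substitution(s): 53,335,172,460
```

The count grows roughly as the k-th power of the sequence length, which is why even a screen of tens of thousands of variants covers a vanishing fraction of the triple-mutant space.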
A deeper issue lies in data silos. China's enzyme protein data is scattered across different research institutions and enterprises, with inconsistent formats and incomplete sharing mechanisms, making it difficult to assemble the large-scale standardized datasets that high-precision AI design requires. This "data silo" situation further increases the difficulty of mining enzymes from scratch.
02 What is "AI enzyme mining"? How is it fundamentally different from traditional methods?
[Figure: Enzyme protein design]
So-called "AI enzyme mining" is not simply computer-assisted screening of known enzyme databases; traditional sequence comparison tools such as BLAST have long offered that. True "AI enzyme mining" uses large-scale protein language models or structure prediction models to directly "compute" new enzyme sequences that meet specific catalytic needs, based on the underlying statistical patterns of sequence-function relationships.
Specifically, a typical "AI enzyme mining" process includes the following steps:
Data preparation: Collect protein sequences at massive scale (usually hundreds of millions or even tens of billions) and, wherever possible, attach experimentally determined functional labels (such as substrate specificity, thermal stability, and pH activity profile). These data form the "textbook."
Model pre-training: Using unsupervised or self-supervised learning, deep neural networks learn the "grammar" and "semantics" of protein sequences, that is, which amino acid arrangement patterns are more likely to occur and which patterns are strongly associated with specific functions.
Fine-tuning and prediction: For the target reaction type (such as ester hydrolysis, glycosylation, or C-H bond oxidation), fine-tune the pre-trained model on a small amount of known functional data so that it can assign a "functional score" to candidate sequences.
Candidate ranking and validation: The model outputs a batch of high-confidence candidate enzyme sequences, which are then tested experimentally to verify their actual activity and stability.
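To make the pre-train, score, and rank steps concrete, here is a deliberately tiny sketch. It is not how ESM, ProGen, or any production system works: a smoothed bigram count model stands in for the protein language model, the fine-tuning step is omitted, and all sequences and function names are invented for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_bigram_model(sequences):
    """'Pre-training' stand-in: learn log P(next residue | current residue)
    from unlabeled sequences, with add-one smoothing."""
    pair_counts, first_counts = Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    n = len(AMINO_ACIDS)
    return {
        (a, b): math.log((pair_counts[(a, b)] + 1) / (first_counts[a] + n))
        for a in AMINO_ACIDS
        for b in AMINO_ACIDS
    }

def score(model, seq):
    """Mean log-probability per residue transition; higher means the
    sequence looks more like the training corpus."""
    pairs = list(zip(seq, seq[1:]))
    return sum(model[p] for p in pairs) / len(pairs)

def rank_candidates(model, candidates, top_k):
    """Keep the top_k highest-scoring candidates for wet-lab validation."""
    return sorted(candidates, key=lambda s: score(model, s), reverse=True)[:top_k]

# Toy data, invented for this example only.
corpus = ["MKTAYIAKQR", "MKTAYLAKQR", "MKSAYIAKQK"]
candidates = ["MKTAYIAKQK", "WWWWWWWWWW", "MKTWYIAKQR"]

model = train_bigram_model(corpus)
shortlist = rank_candidates(model, candidates, top_k=2)
print(shortlist)
```

In a real pipeline the scoring function would come from a fine-tuned model with a task-specific head, but the shape of the workflow is the same: learn from unlabeled sequences, score candidates in silico, and send only a shortlist to the bench.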
Compared with traditional methods, the core difference of "AI enzyme mining" is the shift from "experiment-driven" to "data + model-driven." Traditional methods rely on large-scale wet-lab screening and a degree of luck to find good candidates; AI enzyme mining tries to narrow the candidate space substantially at the in silico ("dry") stage, or even to "design" entirely new enzyme sequences directly, and then confirms them with a small number of targeted experiments.
This means that enzyme discovery projects that used to take years and cost millions may be compressed to a few months or even weeks. This is not just "acceleration," but a shift in the R&D paradigm.
03 Why is "AI enzyme mining" now becoming feasible?
The concept of "AI enzyme mining" is not new, but only recently has it truly moved from academic demonstration to industrial use. Behind this is the maturing of several technical conditions:
First is the maturity of protein language models, represented by ESM, ProGen, ProtGPT2, and others. After pre-training on hundreds of millions of protein sequences, these models implicitly learn structural constraints and functional tendencies. They go beyond sequence comparison and can, in a sense, "understand" the evolutionary rules of proteins.
Next is the accumulation of high-throughput data: although data standardization is still imperfect, in recent years, the volume of public databases (such as UniProt, BRENDA, PDB) and enterprise-owned internal data has increased by 2–3 orders of magnitude compared to ten years ago. More labeled data makes model fine-tuning more reliable.
The most critical factor is the "dry-wet closed-loop" infrastructure. Early "AI enzyme mining" often got stuck at the "compute but never verify" stage because automated experimental capacity was insufficient. In recent years, a number of industry, university, and research institutions have begun building robotic workstations that integrate AI design, automated plasmid construction, automated protein expression and purification, and functional testing, making high-frequency "design, experiment, redesign" cycles possible.
Take the MatwingsVenus™ agent platform, a currently representative example:
After the AI agent completes protein design, the platform uses a self-developed communication mechanism to pass the results into plasmid ordering and experiment scheduling workflows. It automatically links the downstream experimental tasks, drives robots to complete sample preparation, protein purification, and functional testing, and finally feeds the experimental results back into the next round of AI design. It is precisely this closed loop, from design to verification and from verification to iteration, that gives "AI enzyme mining" real industrial competitiveness.
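The control flow of such a "design, experiment, redesign" loop can be sketched in a few lines. This is a generic illustration, not the MatwingsVenus™ implementation: the "designer" here just proposes random single-point mutants, and `mock_assay` is a simulated stand-in for robotic functional testing that scores a sequence against a hidden optimum. All names and sequences are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def propose_variants(parent, n, rng):
    """'Design' step: n random single-point mutants of the parent
    (a placeholder for an AI design model)."""
    variants = []
    for _ in range(n):
        pos = rng.randrange(len(parent))
        new_residue = rng.choice(AMINO_ACIDS.replace(parent[pos], ""))
        variants.append(parent[:pos] + new_residue + parent[pos + 1:])
    return variants

def mock_assay(seq, hidden_optimum):
    """'Experiment' step: simulated activity, counted as the number of
    positions that match a hidden optimum the designer cannot see."""
    return sum(a == b for a, b in zip(seq, hidden_optimum))

def closed_loop(start, hidden_optimum, rounds, batch_size, seed=0):
    """Run design -> assay -> feedback for a fixed number of rounds."""
    rng = random.Random(seed)
    best, best_score = start, mock_assay(start, hidden_optimum)
    history = [best_score]
    for _ in range(rounds):
        batch = propose_variants(best, batch_size, rng)
        measured = [(mock_assay(s, hidden_optimum), s) for s in batch]
        top_score, top_seq = max(measured)
        if top_score > best_score:  # feed the result back into the next round
            best, best_score = top_seq, top_score
        history.append(best_score)
    return best, history

best, history = closed_loop("AAAAAAAA", "MKTAYIAK", rounds=20, batch_size=8)
print(best, history)
```

The point of the sketch is the feedback edge: each round's measured results decide the starting point of the next design round, which is what separates a closed loop from one-shot prediction.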
04 Representative "AI enzyme mining" systems have already emerged
Thanks to the maturity of these technologies, a number of representative "AI enzyme mining" systems have emerged worldwide in recent years.
Each has its own focus, but a common trend is the move from "toolkit" to "agent": users no longer need to learn complex code or command lines. They describe the task objective in natural language, and the system automatically breaks it down, schedules tools, completes the analysis, and proposes candidate solutions.
Take MatwingsVenus™ (Xiaowu™), released by Matwings Technology in April 2026, as an example. It is a conversational protein R&D agent that integrates tens of billions of protein data entries with real labels, over 200 protein design tools, more than 50 platform-certified experts, and more than 30 fine-tuning skills across different fields. Users do not need to master complex bioinformatics command lines; they only need to describe the target in natural language, for example, "I need an enzyme that can catalyze XX reaction and withstand 70°C," and the system automatically schedules the toolchain and sends the final candidate sequences to the automated shared laboratory for validation.
From single models to agent platforms, and from predictive tools to dry-wet closed-loop systems, the infrastructure for "AI enzyme mining" is rapidly being built out layer by layer.
[Figure: Enzyme design and experimentation]
05 What does "AI enzyme mining" mean for the industry?
For companies and teams engaged in synthetic biology, enzyme engineering, and biopharmaceutical R&D, the core change brought by "AI enzyme mining" is a substantial improvement in R&D efficiency and a lower barrier to entry.
In the past, it was difficult for even a three-person team to run an industrial enzyme development project independently, because it required several experimental capabilities at once, such as molecular biology, protein purification, and high-throughput screening. With AI enzyme mining plus automated experiments, a single person or a micro-team may be able to complete work that previously required handoffs across multiple departments.
If you work in a biomanufacturing-related industry, "AI enzyme mining" is no longer a question of "whether to try it," but a competitiveness question of "how much faster you can be than others." The marginal cost of the traditional path is rising, while data-driven intelligent approaches are maturing at an accelerating pace. Teams that integrate "AI enzyme mining" into their R&D processes early are likely to open up a significant efficiency gap within the next two to three years.
For individual researchers, it is also worth exploring some of the "AI enzyme mining" platforms that are already open for trial and personally experiencing the process of "input requirements → obtain candidate sequences." After all, some shifts in thinking can only be truly understood by trying them firsthand.