Protein database: a ten-billion-scale protein database for research | MatwingsVenus™ (Xiaowu™)

If the 21st century is called the 'Century of Biology,' then proteins are the most mysterious and crucial protagonists of this era. They are the executors of life functions—catalyzing metabolism, transmitting signals, defending against pathogens, and building tissues. Almost every program of life has the intricate operation of proteins behind it. Because of this, understanding proteins, modifying proteins, and even designing proteins from scratch have become core issues in cutting-edge fields such as biomedicine, synthetic biology, new materials, and green chemistry. However, a thought-provoking 'data iceberg' phenomenon stands in the way of this vision: the globally authoritative database UniProt has recorded about 240 million protein sequences, covering more than 1.32 million species, but experimentally determined three-dimensional structures account for less than 1%, and functional annotations are only 0.1%. In other words, for the vast majority of known proteins, we may know 'what amino acids they consist of,' but we cannot answer 'what they look like' and 'what they are actually used for.'

Protein database

It's like having a library with an astonishing number of books, but being able to understand less than 1% of the titles: massive data does not equal usable knowledge, and information that merely exists has not yet released its value. It is at this critical juncture that protein databases are undergoing a paradigm shift of their own.

1. From Sequence to Structure: AI Rewriting the Rules

In 2020, a breakthrough in protein structure prediction shook the global scientific community: the 50-year-old 'protein folding problem' that had confounded biologists was solved. Since then, the emergence of large-scale AI-predicted structure databases has expanded the coverage of known protein structures to an unprecedented breadth. Statistics show that more than 2 million researchers worldwide have used AI-predicted protein structure databases, which include predictions for over 200 million protein structures, covering almost all known protein sequences.

This is not just a breakthrough in the academic field. In malaria vaccine development, a university team used AI-predicted antigen protein structures to compress the early discovery phase from 3 years to 8 months; in anti-infective drug research, researchers have also used it to design new compounds capable of overcoming bacterial resistance.

From 'experimentally resolving a structure takes months or even years' to 'high-precision prediction completed in seconds,' the leap in protein database capabilities is profoundly rewriting the underlying logic of life sciences. Researchers point out that a 'double helix' symbiotic relationship is forming between structure databases and AI models: high-quality experimental structures provide benchmarks for AI training, while large-scale AI predictions, in turn, greatly enrich the databases themselves, revealing new sequence-structure-function relationships.

2. From Structure to Function: The Next Broader Territory

As the accuracy of structural prediction gradually approaches experimental levels, the attention of academia and industry is turning to a more fundamental question: how to move from structure to function, from "understanding what proteins look like" to "being able to modify them"?

This is the next step in the evolution of protein databases. The global market for AI-driven protein prediction tools is expected to grow from $425 million in 2025 to $2.405 billion in 2032, a compound annual growth rate of 28.5%. Meanwhile, the overall proteomics market is valued at approximately $29.6 billion in 2025 and is expected to grow to $96.3 billion by 2035. Behind these numbers lies the tide of the era shifting from "data accumulation" to "data-driven design."
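The growth rates implied by these market figures can be checked with the standard compound annual growth rate formula. The sketch below is only an illustration: the function name `cagr` is made up for this example, and the window lengths (7 years for 2025 to 2032, 10 years for 2025 to 2035) are assumptions. A report that counts from a different base year will quote a slightly different rate.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Market sizes from the text, in billions of USD.
# The window lengths are assumptions, not stated in the source.
ai_tools = cagr(0.425, 2.405, years=7)    # 2025 -> 2032
proteomics = cagr(29.6, 96.3, years=10)   # 2025 -> 2035

print(f"AI protein prediction tools: {ai_tools:.1%} per year")
print(f"Overall proteomics market:   {proteomics:.1%} per year")
```

Under these assumptions the first rate comes out at roughly 28%, close to the cited 28.5%, and the proteomics figures imply growth of roughly 12 to 13% per year.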

A more fundamental change is that protein science is moving from static structure prediction to dynamic conformational modeling and generative design. The new generation of AI models is no longer limited to predicting the "most stable folded conformation" of a protein, but instead attempts to characterize its dynamic conformational landscape in real cellular environments: conformations at different pH values, allosteric changes after ligand binding, and functional regulation through post-translational modifications. This means protein databases are evolving from "snapshot collections" to "dynamic life atlases."

Against this backdrop, the very form of databases is also being redefined. From the early PDB (Protein Data Bank), centered on archiving experimental structures, to sequence-centric classification systems like Pfam and InterPro, and now to AI-native structural databases, each leap corresponds to a profound shift in life science research models. It is foreseeable that future protein databases will not only be retrieval tools but also knowledge infrastructure integrating prediction, design, and validation.

3. From Database to 'Intelligent Agent': Exploring the Practices of Shanghai Matwings Technology

In this protein data revolution, Shanghai-based Matwings Technology and its latest platform, MatwingsVenus™ (Xiaowu™), are offering a brand-new solution.

In April 2026, Matwings Technology launched the conversational protein R&D agent MatwingsVenus™ (Xiaowu™). This is not a traditional database query tool: its foundation is a protein dataset with tens of billions of real labels, which powers a "super brain," and its interactive approach removes the technical barriers of professional software. Users only need to describe their R&D needs in natural language, and the system can automatically complete the entire process, from literature review and patent search through protein sequence design to automated experimental validation.

Conversational Protein R&D Agent.

Three Core Competencies Building a Moat:

First, a tens-of-billions-scale labeled database. Unlike traditional public databases, Matwings Technology has constructed its own proprietary ultra-large-scale protein sequence dataset and label library and used it as the foundation for a general-purpose large model for protein design, enabling precise design of functional proteins that directly targets industrial needs. This means that the data driving the platform is not only 'massive,' but also 'labeled, accessible, and design-oriented,' a critical leap from a 'data warehouse' to an 'intelligent engine' in protein engineering.

Second, orchestration of 200 tools and intelligent agents. MatwingsVenus™ (Xiaowu™) integrates over 200 professional protein design tools, more than 50 carefully tuned skills, and over 30 skills fine-tuned by experts from various fields. The intelligent agents do not merely 'stack functions together,' but automatically decompose, schedule, and orchestrate tasks around their goals, completing a series of complex tasks such as deep research, enzyme mining, directed evolution, and de novo design.
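The idea of decomposing a goal into steps and scheduling them in dependency order can be illustrated with a small sketch. Everything in it is hypothetical: the step names, the `PIPELINE` table, and the `schedule` function are invented for illustration and do not describe the platform's actual tools or internals. It uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical registry: each step lists the steps whose output it needs.
# These names are illustrative only, not the platform's real tools.
PIPELINE = {
    "literature_review": set(),
    "enzyme_mining": {"literature_review"},
    "sequence_design": {"enzyme_mining"},
    "structure_check": {"sequence_design"},
    "experiment_plan": {"sequence_design", "structure_check"},
}


def schedule(goal, graph):
    """Return the steps needed to reach `goal`, dependencies first."""
    # Decompose: walk backwards from the goal to collect required steps.
    needed, stack = set(), [goal]
    while stack:
        step = stack.pop()
        if step not in needed:
            needed.add(step)
            stack.extend(graph[step])
    # Schedule: topologically sort the required sub-graph.
    subgraph = {step: graph[step] & needed for step in needed}
    return list(TopologicalSorter(subgraph).static_order())


plan = schedule("sequence_design", PIPELINE)
print(plan)
```

A real agent would pick steps dynamically and handle failures and retries. The sketch only shows that steps a goal does not need are left out, and that the remaining ones run in an order that respects their dependencies.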

Third, a closed-loop iteration of AI design and automated experiments. The platform achieves deep coordination between the digital and physical worlds—once design tasks are completed, results can directly connect to automated shared laboratories, driving robots to carry out sample preparation, protein purification, and functional testing. Experimental data is then fed back into the AI model, forming a 'dialogue-oriented dry-wet loop' driven by computation and reinforced by experiments. In a de novo design project targeting an immune-regulatory receptor, Matwings Technology successfully obtained dozens of novel binding molecules with in vitro cell-blocking activity based on this platform, completing the entire verification process for de novo design.
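The design, test, and feedback cycle described above can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not the platform's method: `mock_assay` is a made-up stand-in for a laboratory measurement (it simply counts hydrophobic residues), `propose` is a stand-in for a design model (random single-point mutants), and the starting sequence is arbitrary.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mock_assay(seq):
    """Toy stand-in for a wet-lab readout: number of hydrophobic residues."""
    return sum(seq.count(aa) for aa in "AILMFVW")


def propose(parent, rng, n=8):
    """Toy stand-in for a design model: n random single-point mutants."""
    variants = []
    for _ in range(n):
        pos = rng.randrange(len(parent))
        variants.append(parent[:pos] + rng.choice(AMINO_ACIDS) + parent[pos + 1:])
    return variants


def closed_loop(start, rounds=5, seed=0):
    """Alternate design and measurement, keeping only measured improvements."""
    rng = random.Random(seed)
    best, best_score = start, mock_assay(start)
    history = [best_score]
    for _ in range(rounds):
        for candidate in propose(best, rng):   # "dry" step: design
            score = mock_assay(candidate)      # "wet" step: measure
            if score > best_score:             # feedback: keep what works
                best, best_score = candidate, score
        history.append(best_score)
    return best, history


best, history = closed_loop("MKTAYIAKQR")
print(best, history)
```

Because a candidate is kept only when its measured score improves, the recorded best score never decreases from round to round. In a real dry-wet loop the measurements would also be used to retrain the design model, which this sketch omits.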

4. The Secret to Industry Implementation

The best way to understand the value of a technology platform is to see what problems it solves.

Matwings Technology's core technology route has its unique features—its self-developed general artificial intelligence for protein design can directly predict protein functions, breaking through the limitations of traditional methods that "can only predict structure but cannot design function." In the field of biomedicine, this capability can enhance the activity, specificity, and stability of key tool enzymes and proteins; in the field of synthetic biology, it can enable enzyme modification, discovery, and catalytic process development, providing efficient and green AI enzyme catalysis solutions for pharmaceutical intermediates, fragrances, cosmetic ingredients, and more.

To date, Matwings Technology has delivered over thirty protein design projects, serving leading enterprises in innovative drugs, in vitro diagnostics, nutrition and health, food and beverages, detergents and textiles, and bio-agriculture. By the end of 2025, its first overseas collaboration on a biologic drug pipeline had been successfully implemented, marking international recognition of its technological capabilities. In March 2026, Matwings Technology completed a Series A financing round of over RMB 200 million, jointly led by China National Petroleum Kunlun Capital and Shanghai Future Industry Fund, among others.

Behind these milestones lies a steady leap of a Chinese AI protein company from technology to industry, and from domestic to international markets.

5. Redefining the Future of Protein Databases

If early protein databases were like "libraries," then in the AI prediction era, databases are "digital archives." The new-generation platforms represented by MatwingsVenus™ (Xiaowu™) are evolving into "intelligent laboratories that can interact, design, and validate."

The industrial logic reflected here is clear: the core value of protein databases is shifting from "storing the known" to "creating the unknown." Just as a map app not only shows you where the roads are but also plans the optimal route and provides real-time navigation, a new-generation protein database platform no longer simply "tells you what a protein looks like," but helps scientists and engineers "design the proteins they want."

Shanghai has included "accelerating the construction of basic biological information databases such as microbial genetic information and protein structure-function information" in its synthetic biology industry development action plan. The practice of Matwings Technology and MatwingsVenus™ (Xiaowu™) is a vivid industrial annotation of this strategy.

In its report, Dragon Satellite TV referred to Matwings Technology as 'a benchmark case in Shanghai for promoting the transformation of basic research into real productive forces,' and praised it for 'giving AI protein design ideas from around the world the opportunity to be realized through a shared laboratory.' From data to knowledge, from knowledge to capability, from capability to products—the evolution of protein databases is building momentum for the next 'singularity moment' in the bioeconomy. And when databases are no longer just 'dictionaries' to be consulted, but become 'intelligent agents' capable of dialogue, reasoning, and experimentation, the boundaries of bio-manufacturing may be even broader than we imagine.

Protein database, hiding the next 'singularity moment' of bio-manufacturing