
SaProt

Zhejiang University (ZJU), Westlake University
Protein or nucleotide language model (pLM/nLM)

SaProt is a protein language model that takes into account not only amino acid sequences but also their three-dimensional structure. This AI tool opens new possibilities in bioinformatics, enabling researchers to predict protein function with high accuracy.

Large-scale protein language models (PLMs), such as the ESM family, have achieved remarkable performance in various downstream tasks related to protein structure and function by undergoing unsupervised training on residue sequences. They have become essential tools for researchers and practitioners in biology. However, a limitation of vanilla PLMs is their lack of explicit consideration for protein structure information, which suggests the potential for further improvement. Motivated by this, we introduce the concept of a “structure-aware vocabulary” that integrates residue tokens with structure tokens. The structure tokens are derived by encoding the 3D structure of proteins using Foldseek. We then propose SaProt, a large-scale general-purpose PLM trained on an extensive dataset comprising approximately 40 million protein sequences and structures. Through extensive evaluation, our SaProt model surpasses well-established and renowned baselines across 10 significant downstream tasks, demonstrating its exceptional capacity and broad applicability. We have made the code, pre-trained model, and all relevant materials available at https://github.com/westlake-repl/SaProt.
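To make the "structure-aware vocabulary" idea concrete, here is a minimal sketch, assuming each residue is paired with one Foldseek 3Di structure token so the combined alphabet has 20 × 20 = 400 residue-structure tokens (the real model also includes special tokens for masked or unknown states; the function name and sample strings below are illustrative, not from the SaProt codebase):

```python
# Hypothetical sketch of a structure-aware vocabulary:
# pair each amino acid letter with a Foldseek 3Di structure letter.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"    # 20 standard residues (uppercase)
FOLDSEEK_3DI = "acdefghiklmnpqrstvwy"   # 20 structure states (lowercase)

# Combined vocabulary: one token per (residue, structure) pair -> 400 tokens.
SA_VOCAB = [aa + s for aa in AMINO_ACIDS for s in FOLDSEEK_3DI]

def to_sa_tokens(sequence: str, structure: str) -> list[str]:
    """Merge an amino-acid sequence with its per-residue 3Di string."""
    assert len(sequence) == len(structure), "strings must align per residue"
    return [aa + s for aa, s in zip(sequence, structure)]

# Toy example: a 3-residue fragment and an arbitrary 3Di string.
print(len(SA_VOCAB))              # 400
print(to_sa_tokens("MKV", "dva")) # ['Md', 'Kv', 'Va']
```

In this scheme the model sees one fused token per position, so sequence and structure information enter through a single embedding table rather than two separate input streams.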
