Biology and AI

Prot2Token

University of Missouri, Politecnico di Milano
Protein or nucleotide language model (pLM/nLM)

Prot2Token proposes a universal tokenization method for adapting language models to complex biological tasks. The model predicts protein properties at multiple levels, from individual residues to protein-protein interactions. This AI tool substantially simplifies scientists' work with proteomic data.

This paper proposes a versatile tokenization method and introduces Prot2Token, a model that combines autoregressive language modeling with protein language models (PLMs) to tackle various protein prediction tasks using protein sequences. Leveraging our tokenization method, Prot2Token adapts existing PLMs for multiple tasks such as protein-level prediction, residue-level prediction, and protein-protein interaction prediction through next-token prediction of tokenized target label sequences. By incorporating prompt tokens into the decoder, Prot2Token enables multi-task training in a single end-to-end session. Our results demonstrate that Prot2Token not only matches the performance of specialized models across various tasks but also paves the way for integrating protein tasks with large language models (LLMs), representing an important step towards creating general-purpose PLMs for advanced protein language processing (PLP). Additionally, we use Prot2Token to develop S-ESM, a structure-aware version of the ESM model, which achieves competitive performance with state-of-the-art methods in 3D structure-related tasks using only protein sequences. Code is available at: https://github.com/mahdip72/prot2token.
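To make the core idea concrete, here is a minimal sketch of how target labels might be serialized for next-token prediction with a task-specific prompt token, as the abstract describes. The function name, token formats, and example labels are illustrative assumptions, not the actual API from the Prot2Token repository.

```python
# Hypothetical sketch of Prot2Token-style label tokenization (names and
# token formats are illustrative, not the repo's actual API). A prompt
# token tells the decoder which task to perform; the target labels are
# then serialized into a flat token sequence that an autoregressive
# decoder can learn via next-token prediction.

def tokenize_labels(task, labels):
    """Serialize labels as: <task_...> label_1 ... label_n <eos>."""
    prompt = f"<task_{task}>"          # task prompt token for multi-task training
    body = [str(label) for label in labels]
    return [prompt] + body + ["<eos>"]

# Protein-level task: one class label per sequence
print(tokenize_labels("localization", ["nucleus"]))
# Residue-level task: one label per residue (e.g. 3-state secondary structure)
print(tokenize_labels("ss3", list("HHHEEC")))
```

Because every task reduces to the same "prompt token + label tokens" format, a single decoder can be trained end-to-end on many tasks at once, which is the multi-task setup the abstract refers to.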
