pKALM is a novel method for predicting protein pKa values that uses deep learning and protein language models (PLMs). The model effectively relates amino acid sequence to structure, enabling fast computation for computational biology.
Protein pKa prediction is a key challenge in computational biology. In this study, we present pKALM, a novel deep learning-based method for high-throughput protein pKa prediction. pKALM uses a protein language model (PLM) to capture the complex sequence-structure relationship of proteins. Although pKa prediction has traditionally been treated as a structure-based problem, our results show that a PLM pre-trained on large-scale protein sequence databases can effectively learn this relationship and achieve state-of-the-art performance. pKALM predicts the pKa values of six residue types (Asp, Glu, His, Lys, Cys, and Tyr) and the two termini with high accuracy and efficiency. It performs well on both exposed and buried residues, whose pKa values often deviate from the standard values measured in solvent. We also report the novel finding that predicted protein isoelectric points (pI) can be used to improve the accuracy of pKa prediction. In a high-throughput run on the human proteome, pKALM achieves 4,965 pKa predictions per second, several orders of magnitude faster than existing state-of-the-art methods. Case studies illustrate both the efficacy of pKALM in estimating pKa values and the limitations of the method. pKALM will thus be a valuable tool for researchers in biochemistry, biophysics, and drug design.
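As background on why pKa values and the isoelectric point are related quantities, the following is a minimal sketch of the standard Henderson-Hasselbalch picture, in which each titratable site contributes a fractional charge and the pI is the pH at which the net charge is zero. It is not the pKALM model and does not reflect how pKALM uses pI internally; the function names `net_charge` and `isoelectric_point`, the site representation, and the example pKa values are assumptions made for illustration only.

```python
from typing import Iterable, Tuple

# Hypothetical site representation (an assumption for this sketch):
# (pKa, kind), where kind is "acid" for groups that go 0 -> -1 on
# deprotonation (e.g. Asp, Glu, Cys, Tyr, C-terminus) and "base" for
# groups that go +1 -> 0 (e.g. His, Lys, N-terminus).
Site = Tuple[float, str]


def net_charge(sites: Iterable[Site], ph: float) -> float:
    """Net charge at a given pH from independent Henderson-Hasselbalch sites."""
    total = 0.0
    for pka, kind in sites:
        if kind == "acid":
            # Fraction deprotonated, each carrying charge -1.
            total -= 1.0 / (1.0 + 10.0 ** (pka - ph))
        elif kind == "base":
            # Fraction protonated, each carrying charge +1.
            total += 1.0 / (1.0 + 10.0 ** (ph - pka))
        else:
            raise ValueError(f"unknown site kind: {kind!r}")
    return total


def isoelectric_point(sites: Iterable[Site], lo: float = 0.0,
                      hi: float = 14.0, tol: float = 1e-6) -> float:
    """Find the pH where the net charge is zero by bisection.

    Net charge decreases monotonically with pH in this model, so a sign
    change between `lo` and `hi` brackets a single root.
    """
    sites = list(sites)
    if net_charge(sites, lo) < 0 or net_charge(sites, hi) > 0:
        raise ValueError("net charge does not change sign in [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if net_charge(sites, mid) > 0:
            lo = mid  # still positive: pI lies at higher pH
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative values only: one acidic site at pKa 4 and one basic site at pKa 10.
example_sites = [(4.0, "acid"), (10.0, "base")]
example_pi = isoelectric_point(example_sites)
```

In this simplified model, shifting any single site's pKa shifts the pI, which is one way to see that the two quantities carry related information. The sketch treats sites as independent, which real proteins generally do not satisfy.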