EvoMIL is a neural network that can identify a virus's host by analyzing only the virus's genetic sequence. This deep learning method helps virologists find treatments for new diseases faster and understand how those diseases spread.
Predicting virus-host association is essential for understanding how viruses interact with host species and for discovering new therapeutics for viral diseases in humans and animals. Currently, the hosts of most viruses are unknown. Here, we introduce EvoMIL, a deep learning method that predicts virus-host association at the species level from viral sequence alone. The method combines a pre-trained large protein language model with attention-based multiple instance learning to allow protein-oriented predictions. Our results show that protein embeddings capture stronger predictive signals than traditional handcrafted features, including amino acid and DNA k-mers and physicochemical properties. EvoMIL binary classifiers achieve AUC values of over 0.95 for all prokaryotic hosts and nearly 0.8 for almost all eukaryotic hosts. In multi-host prediction tasks, EvoMIL achieved median performance improvements of 8.6% for prokaryotic hosts and 1.8% for eukaryotic hosts. Furthermore, EvoMIL estimates the importance of individual proteins in the prediction and maps them to an embedding landscape of all viral proteins, in which proteins with similar functions form distinct clusters.

Author summary

Being able to predict which viruses can infect which hosts, and to identify the specific proteins involved in these interactions, is crucial for understanding viral diseases and developing more effective treatments. Traditional methods for predicting these interactions rely on handcrafted features shared across proteins and overlook the importance of individual proteins. We have developed a new method that combines a protein language model with multiple instance learning to allow host prediction directly from protein sequences, without the need to extract handcrafted features. This method significantly improved the accuracy of multi-host association prediction and revealed the key proteins involved in virus-host interactions.
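
To make the idea of attention-based multiple instance learning concrete, the following is a minimal sketch, not the EvoMIL implementation. It assumes each virus is treated as a "bag" of per-protein embedding vectors and uses a common attention-pooling formulation (a tanh scoring layer followed by a softmax); whether EvoMIL uses exactly this form is an assumption. The embeddings here are random stand-ins for the output of a protein language model, and the function name `attention_mil_forward`, the parameters `V`, `w`, `u`, `b`, and all dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


def attention_mil_forward(bag, V, w, u, b=0.0):
    """Score one virus (a bag of protein embeddings) for one host.

    bag: (n_proteins, d) array, one embedding per viral protein.
    V:   (hidden, d) attention projection (hypothetical parameter).
    w:   (hidden,) attention scoring vector (hypothetical parameter).
    u:   (d,) classifier weights, b: classifier bias (hypothetical).
    Returns (host probability, per-protein attention weights).
    """
    # One unnormalized attention score per protein.
    scores = np.tanh(bag @ V.T) @ w
    # Softmax over proteins, shifted by the max for numerical stability.
    exp_scores = np.exp(scores - scores.max())
    attn = exp_scores / exp_scores.sum()
    # Virus-level representation: attention-weighted sum of protein embeddings.
    virus_vec = attn @ bag
    # Binary classifier on the pooled vector (sigmoid of a linear score).
    prob = 1.0 / (1.0 + np.exp(-(virus_vec @ u + b)))
    return prob, attn


# Toy data: random numbers standing in for real protein embeddings.
rng = np.random.default_rng(0)
d, hidden, n_proteins = 8, 4, 5
bag = rng.normal(size=(n_proteins, d))
V = rng.normal(size=(hidden, d))
w = rng.normal(size=hidden)
u = rng.normal(size=d)

prob, attn = attention_mil_forward(bag, V, w, u)
# Proteins ordered from most to least influential on this prediction.
ranking = np.argsort(-attn)
print("host probability:", prob)
print("protein ranking by attention:", ranking)
```

In this sketch the attention weights sum to one across the proteins of a virus, so they can be read as a rough estimate of how much each protein contributes to the host prediction. In a trained model the parameters would be learned from labeled virus-host pairs rather than drawn at random.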