LLaVA-OV-72B: a multimodal AI model from ByteDance

Q: Who developed LLaVA-OV-72B?

LLaVA-OV-72B was developed by ByteDance (China), Nanyang Technological University (Singapore), the Chinese University of Hong Kong (CUHK, Hong Kong), and the Hong Kong University of Science and Technology (HKUST, Hong Kong).

Q: What tasks does LLaVA-OV-72B handle?

Image captioning, visual question answering, video captioning, object recognition, action recognition, text generation

// description

LLaVA-OV-72B is an open-source large multimodal model (LMM) that combines computer vision and text processing. It performs well at video analysis, object recognition, and answering complex visual questions, and it sets a new bar for open LMMs.

// abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
