InternViT-6B: A Vision-Language Model for AGI

Q: Who developed InternViT-6B?

InternViT-6B was developed by Shanghai AI Lab (China), Nanjing University (China), The University of Hong Kong (Hong Kong), Tsinghua University (China), SenseTime (Hong Kong), and the University of Science and Technology of China (USTC) (China).

// tasks

Visual question answering

// description

A large-scale vision-language model that has become an important building block for multimodal general AI (AGI) systems. InternViT-6B combines computer vision with natural language processing, allowing the network to give accurate answers to questions about images.

// abstract

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks including visual perception tasks such as image-level or pixel-level recognition, vision-language tasks such as zero-shot image/video classification, zero-shot image/video-text retrieval, and link with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to the ViT-22B. We hope that our research could contribute to the development of multi-modal large models. Code and models are available at this https URL.
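The abstract mentions zero-shot image classification as one of the tasks the aligned vision and language encoders support. The sketch below illustrates the general idea behind contrastive zero-shot classification: compare an image embedding against text embeddings of candidate labels and turn the cosine similarities into probabilities. It is not the InternVL code. The function names, the toy 3-dimensional embeddings, and the `temperature` value are assumptions made for illustration; a real model would produce the embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """Return one probability per candidate label.

    image_emb: embedding of the image (hypothetical encoder output).
    text_embs: one embedding per label prompt (hypothetical encoder output).
    temperature: assumed softmax temperature; smaller values sharpen the result.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(cosine / temperature)

    # Numerically stable softmax over the scaled similarities.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data standing in for real encoder outputs.
labels = ["cat", "dog", "car"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.1, 0.0]

probs = zero_shot_classify(image_emb, text_embs)
best = labels[max(range(len(probs)), key=probs.__getitem__)]
```

In this toy setup the image embedding lies closest to the first label vector, so `best` is `"cat"`. No labeled training images are needed for the candidate classes, which is what makes the procedure zero-shot.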

// faq

What is InternViT-6B?

Who developed InternViT-6B?

What tasks does InternViT-6B handle?

// similar models