InternVL2-40B is an enhanced member of the flagship MLLM lineup, providing deep understanding of multimodal content. The model blurs the line between open-source and commercial solutions, offering professional-grade capabilities for analyzing complex diagrams, documents, and video.
We introduce InternVL2, currently the most powerful open-source Multimodal Large Language Model (MLLM). The InternVL2 family includes models ranging from a 1B model, suitable for edge devices, to a significantly more powerful 108B model. Backed by larger-scale language models, InternVL2-Pro demonstrates outstanding multimodal understanding, matching the performance of commercial closed-source models across various benchmarks.

The InternVL2 family is built upon the following designs:

- Progressive alignment with larger language models: We introduce a progressive alignment training strategy, resulting in the first vision foundation model natively aligned with large language models. By scaling the model from small to large while refining the data from coarse to fine, we completed the training of large models at relatively low cost, achieving excellent performance with limited resources.
- Multimodal input: With one set of parameters, our model supports multiple input modalities, including text, images, video, and medical data.
- Multitask output: Powered by our recent work VisionLLMv2, our model supports various output formats, such as images, bounding boxes, and masks, demonstrating extensive versatility. By connecting the MLLM to multiple downstream task decoders, InternVL2 can be generalized to hundreds of vision-language tasks while achieving performance comparable to expert models.
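The multimodal-input design glosses over how high-resolution images actually reach the vision encoder. InternVL2's released preprocessors tile large images into 448×448 patches on a grid chosen to match the image's aspect ratio. The sketch below is a simplified, hypothetical illustration of that grid selection, not the released preprocessing code: the `TILE` constant matches InternVL's ViT input size, but the helper names, the `max_tiles` default, and the tie-breaking behavior are illustrative assumptions.

```python
# Simplified sketch of InternVL-style dynamic high-resolution tiling:
# pick a (cols, rows) grid whose aspect ratio best matches the input
# image, capped at max_tiles, then emit 448x448 crop boxes. Helper
# names and tie-breaking are illustrative, not the released code.

TILE = 448  # ViT input resolution used by InternVL2's vision encoder


def choose_grid(width, height, max_tiles=12):
    """Pick (cols, rows) with cols*rows <= max_tiles whose aspect
    ratio is closest to the image's; the first best match wins."""
    aspect = width / height
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            err = abs(aspect - cols / rows)
            if err < best_err:
                best, best_err = (cols, rows), err
    return best


def tile_boxes(width, height, max_tiles=12):
    """Return (left, top, right, bottom) crop boxes, assuming the
    image is first resized to (cols*TILE, rows*TILE)."""
    cols, rows = choose_grid(width, height, max_tiles)
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


# A wide 1344x448 image maps to a 3x1 grid of 448px tiles.
print(choose_grid(1344, 448))      # (3, 1)
print(len(tile_boxes(1344, 448)))  # 3
```

Each tile is then encoded separately (typically alongside a downscaled thumbnail of the whole image), which is how a fixed-resolution vision encoder can serve documents and diagrams at native detail.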