Fuyu-8B is a compact multimodal model from Adept, built specifically for digital agents. Thanks to its simplified architecture, it scales easily and handles image analysis well, including answering complex visual questions.
We’re releasing Fuyu-8B, a small version of the multimodal model that powers our product. The model is available on HuggingFace. We think Fuyu-8B is exciting because:
1. It has a much simpler architecture and training procedure than other multimodal models, which makes it easier to understand, scale, and deploy.
2. It’s designed from the ground up for digital agents, so it can support arbitrary image resolutions, answer questions about graphs and diagrams, answer UI-based questions, and do fine-grained localization on screen images.
3. It’s fast: we can get responses for large images in less than 100 milliseconds.
4. Despite being optimized for our use case, it performs well on standard image understanding benchmarks such as visual question answering and natural image captioning.