RFM-1 from Covariant is an 8-billion-parameter multimodal AI that literally gives robots a "brain". The model was trained on text, video, and sensor data, which lets it understand the physical world and carry out complex object-manipulation tasks.
What is RFM-1

Set up as a multimodal any-to-any sequence model, RFM-1 is an 8 billion parameter transformer trained on text, images, videos, robot actions, and a range of numerical sensor readings. By tokenizing all modalities into a common space and performing autoregressive next-token prediction, RFM-1 uses its broad range of input and output modalities to enable diverse applications. For example, it can perform image-to-image learning for scene analysis tasks like segmentation and identification. It can combine text instructions with image observations to generate desired grasp actions or motion sequences. It can pair a scene image with a targeted grasp image to predict outcomes as videos or simulate the numerical sensor readings that would occur along the way.
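To make "tokenizing all modalities into a common space" and "autoregressive next-token prediction" more concrete, here is a toy sketch in Python. It is not Covariant's implementation: the vocabulary layout, the bin counts, and every function and class name below are hypothetical, and a simple bigram counter stands in for the 8 billion parameter transformer. The only point is to show how text, sensor readings, and robot actions can share one token ID space, so that a single next-token predictor can take one modality as input and emit another.

```python
from collections import Counter, defaultdict

# Hypothetical layout: each modality owns a disjoint ID range
# inside one shared vocabulary.
TEXT_VOCAB = ["<bos>", "pick", "the", "red", "box"]
NUM_SENSOR_BINS = 8   # quantization bins for a reading in [0, 1)
NUM_ACTIONS = 4       # number of discrete (made-up) action IDs

TEXT_OFFSET = 0
SENSOR_OFFSET = TEXT_OFFSET + len(TEXT_VOCAB)
ACTION_OFFSET = SENSOR_OFFSET + NUM_SENSOR_BINS
VOCAB_SIZE = ACTION_OFFSET + NUM_ACTIONS


def encode_text(words):
    """Map words to token IDs in the text range."""
    return [TEXT_OFFSET + TEXT_VOCAB.index(w) for w in words]


def encode_sensor(values):
    """Uniformly quantize readings in [0, 1) into the sensor range."""
    return [
        SENSOR_OFFSET + min(int(v * NUM_SENSOR_BINS), NUM_SENSOR_BINS - 1)
        for v in values
    ]


def encode_action(action_ids):
    """Map discrete action IDs into the action range."""
    return [ACTION_OFFSET + a for a in action_ids]


def modality_of(token):
    """Recover which modality a shared-vocabulary token belongs to."""
    if token >= ACTION_OFFSET:
        return "action"
    if token >= SENSOR_OFFSET:
        return "sensor"
    return "text"


class BigramModel:
    """Stand-in for a transformer: predicts the most frequent follower
    of the last token seen during training."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def fit(self, sequences):
        for seq in sequences:
            for prev, nxt in zip(seq, seq[1:]):
                self.counts[prev][nxt] += 1

    def predict_next(self, context):
        followers = self.counts.get(context[-1])
        if not followers:
            return None
        return followers.most_common(1)[0][0]


def generate(model, prompt, max_new_tokens):
    """Autoregressive loop: append one predicted token at a time."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        nxt = model.predict_next(seq)
        if nxt is None:
            break
        seq.append(nxt)
    return seq


# One mixed-modality training sequence: instruction, sensor readings, action.
instruction = encode_text(["<bos>", "pick", "the", "red", "box"])
readings = encode_sensor([0.1, 0.9])
example = instruction + readings + encode_action([2])

model = BigramModel()
model.fit([example])

# Prompt with text and sensor tokens; the model continues with an action token.
output = generate(model, instruction + readings, max_new_tokens=1)
predicted = output[-1]
print(predicted, modality_of(predicted))
```

Because all modalities live in one flat sequence, the same loop could in principle be prompted with any prefix and decoded into any modality, which is the "any-to-any" property the description refers to. A real system would need learned tokenizers for images and video rather than the fixed lookup tables assumed here.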