primus
★ 3
updated 1y ago
A multimodal foundation model for humanoid robotics that takes text, speech, and vision (images and video) as input and produces actions and speech simultaneously, in the manner of a transformer.