Datasets
Each dataset we release is shaped by a research thesis about where model capabilities are heading — not assembled opportunistically.
One of the largest available first-person industrial datasets. Recorded across real manufacturing, logistics, and field service environments — the kind of data that teaches models how people actually move through and manipulate the world.
Annotated with task boundaries, hand-object interaction labels, environment metadata, and gaze proxies. Designed for training embodied agents, world models, and long-horizon planning systems.
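To make the annotation layers concrete, here is a minimal sketch of what one annotated clip record could look like. Every field name here is an illustrative assumption, not the dataset's actual schema.

```python
# Hypothetical record for one annotated first-person clip.
# All field names are illustrative assumptions, not the real schema.
clip = {
    "clip_id": "example-000",
    "environment": {                 # environment metadata (assumed fields)
        "site_type": "manufacturing",
        "lighting": "indoor-artificial",
    },
    "tasks": [                       # task boundaries as [start_s, end_s] spans
        {"label": "pick_part", "span_s": [0.0, 12.4]},
        {"label": "fasten_bolt", "span_s": [12.4, 31.0]},
    ],
    "hand_object_interactions": [    # per-event hand-object labels
        {"t_s": 3.2, "hand": "right", "object": "bracket", "contact": "grasp"},
    ],
    "gaze_proxy": [                  # coarse gaze estimate per timestamp
        {"t_s": 3.2, "xy_norm": [0.51, 0.44]},
    ],
}

# Sanity check a long-horizon planner's loader might run:
# task spans should tile the clip in order, without gaps.
spans = [t["span_s"] for t in clip["tasks"]]
assert all(a[1] == b[0] for a, b in zip(spans, spans[1:]))
```

The span-tiling check matters for long-horizon training: a gap between task boundaries would leave unlabeled frames inside an episode.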
A large-scale collection of natural, multi-speaker conversational audio spanning languages that are chronically underrepresented in frontier model training. Every recording is paired with dialect metadata, speaker diarization, and transcription.
Languages include Hindi, Arabic, Finnish, and others. Structured for conversational AI training, ASR fine-tuning, TTS voice modeling, and dialogue system evaluation.
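As a sketch of how the pairing of audio, dialect metadata, diarization, and transcription might fit together in a single manifest entry (field names are assumptions for illustration, not the shipped schema):

```python
# Hypothetical manifest entry for one conversational recording.
# Field names are illustrative assumptions, not the actual schema.
recording = {
    "audio_path": "recordings/example.wav",
    "language": "hi",                # language code (BCP 47 style, assumed)
    "dialect": "Hindi (Delhi)",      # dialect metadata
    "speakers": ["spk0", "spk1"],
    "diarization": [                 # who spoke when, in seconds
        {"speaker": "spk0", "start_s": 0.00, "end_s": 4.15},
        {"speaker": "spk1", "start_s": 4.15, "end_s": 9.80},
    ],
    "transcript": [                  # transcription aligned to speakers
        {"speaker": "spk0", "text": "namaste"},
        {"speaker": "spk1", "text": "namaste, kaise hain?"},
    ],
}

# Check an ASR fine-tuning loader might run: diarization turns
# should be time-ordered and non-overlapping.
turns = recording["diarization"]
assert all(a["end_s"] <= b["start_s"] for a, b in zip(turns, turns[1:]))
```

Keeping diarization and transcription as parallel, speaker-keyed lists is one common layout; it lets the same manifest drive ASR fine-tuning, TTS voice modeling, and dialogue evaluation without reprocessing the audio.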
Licensed in partnership with several leading AAA game studios: physically consistent 3D worlds with complete ground truth, at a scale real-world capture cannot match. Designed for training spatial reasoning, physics intuition, and interaction priors for embodied agents and world models.

A structured collection of dexterous manipulation demonstrations across tabletop, assembly, and unstructured environments. Paired teleoperation and autonomous rollouts with proprioceptive, visual, and force-torque streams. Built for imitation learning, sim-to-real transfer, and manipulation policy evaluation.
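One way to picture the paired rollouts and multi-stream sensor data is a per-episode index entry like the sketch below. Names, rates, and dimensions are illustrative assumptions only.

```python
# Hypothetical index entry for one manipulation episode.
# Names, rates, and dimensions are illustrative assumptions.
episode = {
    "episode_id": "tabletop-0001",
    "setting": "tabletop",           # tabletop | assembly | unstructured
    "source": "teleoperation",       # teleoperation | autonomous_rollout
    "streams": {
        "proprioception": {"hz": 100, "dims": 14},        # e.g. joint pos + vel
        "vision": {"hz": 30, "cameras": ["wrist", "overhead"]},
        "force_torque": {"hz": 500, "dims": 6},           # 3-axis force + torque
    },
    "duration_s": 22.5,
}

# Sanity check an imitation-learning loader might run:
# every episode must carry all three sensor modalities.
required = {"proprioception", "vision", "force_torque"}
assert required <= set(episode["streams"])
```

Pairing teleoperated and autonomous episodes under one schema, distinguished only by the `source` field, is what makes policy evaluation straightforward: the same loader replays demonstrations and scores rollouts.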
Currently in collection. If you have specific requirements, share them now: we are gathering input from prospective partners.
If none of the above fits what you are building, we design and collect custom datasets. Tell us what capability you are trying to develop.
Get in touch