Building Spatial Intelligence - Part 1

23/09/2024 · Technology

Movement is essential to intelligence. The cycle of perceiving and acting is key to our understanding of the physical world around us. Navigating indoors, moving household items, or operating a vehicle in traffic: these seemingly trivial tasks hide a complex choreography of our sensory-motor abilities in plain sight. We accomplish them with an internal model of the world, which is constantly updated with new information and with which we make predictions about the future as we navigate safely through our environment and perform tasks.

Pioneers of AI refer to this as spatial intelligence [1].

Contemporary spatial intelligence likewise aims at building world models [2, 12] and directly inferring the laws of physics from multi-modal sensor observations, without any need for hand-crafted rules.

Fig 1: Kuka iiwa (left), TurtleBot 2 (middle), WidowX (right). Source: Open-X Embodiment

Fig 2: Multi-modal data visualization with Nutron (Source: Yaak)

With recent advances in large multi-modal models, AI is on the cusp of moving beyond the browser and into the physical world. This shift brings its own set of challenges around safety requirements for spatial intelligence.

Unlike vision-language multi-modal models (VLMMs [3]), which are trained on content from the web, robotics models depend on multi-modal datasets whose interoperability is undercut by a plethora of platforms, sensors, and vendors (Fig 1 & 2). This requires rethinking the tools and workflows for building and validating spatial intelligence.

Multi-modal datasets

Robotics datasets are deeply coupled to the embodiment they are collected from and display large variation in sensor configuration and modalities [5]. They capture rich information about the environment as well as expert policies for completing tasks. Table 1 shows a few publicly available robotics datasets alongside Yaak's (Niro100-HQ), with the size of their observation and action spaces and their sensor configurations.


| #cams (RGB) | #cams (depth) | #Actions | Calibrated | Proprioception | Policy | Control (Hz) |
|---|---|---|---|---|---|---|
| 6 | - | 7 | yes | yes | scripted | 30 |
| 2 | 1 | 7 | no | yes | human (SpaceMouse) | 5 |
| 2 | 1 | 5 | no | no | scripted | 3 |
| 8 | - | 5 | yes | yes | expert | 50 |
| 1 | 1 | 13 | no | yes | human (VR) | 3 |
| 1 | - | 13 | no | yes | expert | 20 |

Table 1: Variation in observation (cameras + proprioception) and action (controls) spaces
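
As Table 1 suggests, every embodiment comes with its own observation and action space. One lightweight way to keep track of this heterogeneity is to declare a per-embodiment spec that data-loaders and model configs can share. The sketch below is our own illustration: the EmbodimentSpec class and its example entries are hypothetical, loosely modelled on rows of Table 1, and not an API from any of the datasets above.

```python
from dataclasses import dataclass


@dataclass
class EmbodimentSpec:
    """Hypothetical per-embodiment description of observation and action spaces."""
    name: str
    rgb_cameras: int
    depth_cameras: int = 0
    action_dim: int = 0
    calibrated: bool = False
    proprioception: bool = False
    policy_source: str = "scripted"  # e.g. "scripted", "human", "expert"
    control_hz: float = 10.0


# Two illustrative entries loosely modelled on rows of Table 1 (names are made up).
SPECS = {
    "manipulator_7dof": EmbodimentSpec(
        name="manipulator_7dof", rgb_cameras=2, depth_cameras=1,
        action_dim=7, proprioception=True, policy_source="human", control_hz=5),
    "passenger_vehicle": EmbodimentSpec(
        name="passenger_vehicle", rgb_cameras=8, action_dim=5,
        calibrated=True, proprioception=True, policy_source="expert", control_hz=50),
}

if __name__ == "__main__":
    for spec in SPECS.values():
        # A data-loader or model config could be derived from these fields.
        print(f"{spec.name}: obs = {spec.rgb_cameras} RGB + {spec.depth_cameras} depth, "
              f"actions = {spec.action_dim} @ {spec.control_hz} Hz")
```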

This presents an open question: do we develop a unique model for each embodiment and environment, or train a joint model [9] in which different environments and tasks help the model learn representations that transfer between embodiments? Recent research [5, 6, 7] hints that the latter strategy yields better AI models.

Applying this paradigm beyond research [9, 10] to enterprise-scale datasets like Yaak's (Fig 3) presented us with new challenges in dataset visualization, search, curation, and dev-tools for end-to-end AI for robotics. Below we highlight some of the challenges we faced while working with large-scale, multi-modal datasets in the automotive domain.

Recent open-source efforts from the 🤗 team (LeRobot [4]) have lowered the barrier to entry into end-to-end AI for robotics research and education for early adopters of end-to-end spatial intelligence.
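
For example, LeRobot exposes its datasets as standard PyTorch datasets, so a few lines are enough to start iterating over episodes. The sketch below follows the usage documented in the LeRobot repository at the time of writing; the exact import path and the lerobot/pusht repo id may differ between versions, so treat it as an assumption to check against the current docs.

```python
import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Load a public dataset from the Hugging Face hub (repo id is illustrative;
# check the LeRobot hub page for currently available datasets).
dataset = LeRobotDataset("lerobot/pusht")

# LeRobot datasets behave like regular PyTorch datasets, so the standard DataLoader applies.
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# Inspect the tensor shapes of one batch (observations, actions, timestamps, ...).
batch = next(iter(loader))
for key, value in batch.items():
    if isinstance(value, torch.Tensor):
        print(key, tuple(value.shape))
```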

Developer tools gap

Over the last 3+ years of our partnership with driving schools, we've collected real-world, multi-modal datasets in 30 cities. Our dataset contains both expert (driving instructor) and student (learner driver) policies.

Fig 3: 100K+ hours of multi-modal sensor data and expert / student policies

As our dataset grew, we quickly ran into multiple bottlenecks before we could work on spatial intelligence. We found that the landscape of developer tools for end-to-end AI in robotics was fragmented, closed-source, or nonexistent. Below, we outline a few of the issues we ran into.

  • Visualization

    • Monitoring: Sensor modality visualization and failure inspection

    • Enrichment: Additional modalities (e.g. audio, commentary)

    • Tasks: Annotations for spatial intelligence tasks (e.g. parking)

  • Search

    • Discovery: Sifting through petabyte-scale datasets

    • Metrics: Understanding task distribution

    • Curation: Balancing task distribution in training sets

  • Multi-modal datasets

    • Support: Proprietary vs. open robotics data formats (ros2/mcap)

    • Alignment: Different sampling rates across modalities (see the sketch after this list)

    • Data-loaders: From raw, multi-modal data to PyTorch tensors

  • Spatial intelligence

    • Data is IP: Closed-source end-to-end AI for robotics

    • Open source: No hackable, open codebases for training spatial intelligence
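
To make the alignment and data-loader issues concrete, the sketch below pairs each camera frame with the nearest vehicle-control sample by timestamp and converts the result into PyTorch tensors. It is our own minimal illustration, assuming both streams carry timestamps in seconds; the 30 Hz camera and 100 Hz control rates are illustrative, not Yaak's actual configuration.

```python
import numpy as np
import torch

# Illustrative streams: 30 Hz camera timestamps and 100 Hz control samples (seconds).
cam_ts = np.arange(0.0, 10.0, 1 / 30)
ctrl_ts = np.arange(0.0, 10.0, 1 / 100)
ctrl_vals = np.stack([np.sin(ctrl_ts), np.cos(ctrl_ts)], axis=1)  # e.g. steering, speed


def align_nearest(ref_ts: np.ndarray, src_ts: np.ndarray, src_vals: np.ndarray) -> np.ndarray:
    """For each reference timestamp, pick the source sample with the closest timestamp."""
    idx = np.searchsorted(src_ts, ref_ts)
    idx = np.clip(idx, 1, len(src_ts) - 1)
    # Choose between the neighbour before and after the insertion point.
    prev_closer = (ref_ts - src_ts[idx - 1]) < (src_ts[idx] - ref_ts)
    idx = np.where(prev_closer, idx - 1, idx)
    return src_vals[idx]


# One control vector per camera frame, ready to be stacked into training tensors.
aligned_controls = align_nearest(cam_ts, ctrl_ts, ctrl_vals)
controls_tensor = torch.from_numpy(aligned_controls).float()
print(controls_tensor.shape)  # (num_frames, num_control_channels)
```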

As end-to-end AI becomes the dominant paradigm in spatial intelligence [7, 11], including Yaak's first use case in the automotive industry, there is an unmet need for enterprise tools and workflows to build it: unified, intuitive, and scalable, with safety validation at their core. In Part 2 of this blog series, we'll share how we built our spatial intelligence platform to address these challenges (Fig 4).

Fig 4: Attention heat map from Yaak's spatial intelligence model, trained entirely through self-supervision from observations (RGB cameras) and actions (vehicle controls)

References

  1. Spatial intelligence TED Talk: Fei-Fei Li

  2. World models

  3. VLMM: Vision-Language multimodal models

  4. LeRobot: Making AI for Robotics more accessible with end-to-end learning

  5. Open-X dataset: Open-X Robotics Datasets

  6. Open-X Embodiment: Robotic Learning Datasets and RT-X Models

  7. Multi embodiment: Scaling up learning across many different robot types

  8. rerun.io: Visualize multimodal data over time

  9. Gato: A Generalist Agent

  10. Trajectory transformers: Offline RL as One Big Sequence Modeling Problem

  11. SMART: A Generalized Pre-training Framework for Control Tasks

  12. VISTA: A Generalizable Driving World Model with High Fidelity and Versatile Controllability