Building Spatial Intelligence - Part 1

23/09/2024 · Technology

Movement is essential to intelligence. The cycle of perceiving and acting is key to our understanding of the physical world around us. Navigating indoors, moving household items, or operating a vehicle in traffic: these seemingly trivial tasks hide a complex choreography of our sensory-motor abilities in plain sight. We accomplish them with an internal model of the world, which is constantly updated with new information and with which we make predictions about the future as we navigate safely through our environment and perform tasks.

Pioneers of AI refer to this as spatial intelligence [1].

Contemporary spatial intelligence likewise aims at building world models [2, 12] and directly inferring the laws of physics from multi-modal sensor observations, without any need for hand-crafted rules.

Fig 1: Kuka iiwa (left), TurtleBot 2 (middle), WidowX (right). Source: Open-X Embodiment

Fig 2: Multi-modal data visualization with Nutron (Source: Yaak)

With recent advances in large multi-modal models, AI is on the cusp of moving beyond the browser and into the physical world. This shift brings its own set of challenges around safety requirements for spatial intelligence.

Unlike vision-language multi-modal models (VLMMs [3]), which are trained on content from the web, robotics models depend on multi-modal datasets whose interoperability is undercut by a plethora of platforms, sensors, and vendors (Fig 1 & 2). This requires rethinking the tools and workflows for building and validating spatial intelligence.

Multi-modal datasets

Robotics datasets are deeply coupled to the embodiment they are collected from and display large variation in sensor configuration and modalities [5]. They capture rich information about the environment as well as expert policies for completing tasks. Table 1 shows a few publicly available robotics datasets alongside Yaak's (Niro100-HQ), with the size of their observation and action spaces and their sensor configurations.


| #cams (RGB) | #cams (depth) | #Actions | Calibrated | Proprioception | Policy | Control (Hz) |
|---|---|---|---|---|---|---|
| 6 | - | 7 | yes | yes | scripted | 30 |
| 2 | 1 | 7 | no | yes | human (SpaceMouse) | 5 |
| 2 | 1 | 5 | no | no | scripted | 3 |
| 8 | - | 5 | yes | yes | expert | 50 |
| 1 | 1 | 13 | no | yes | human (VR) | 3 |
| 1 | - | 13 | no | yes | expert | 20 |

Table 1: Variation in observation (cameras + proprioception) and action (controls) spaces
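
As Table 1 suggests, every embodiment comes with its own observation and action space. One lightweight way to keep track of this heterogeneity is to declare a per-embodiment spec that data-loaders and model configs can share. The sketch below is our own illustration: the EmbodimentSpec class and its example entries are hypothetical, loosely modelled on rows of Table 1, and not an API from any of the datasets above.

```python
from dataclasses import dataclass


@dataclass
class EmbodimentSpec:
    """Hypothetical per-embodiment description of observation and action spaces."""
    name: str
    rgb_cameras: int
    depth_cameras: int = 0
    action_dim: int = 0
    calibrated: bool = False
    proprioception: bool = False
    policy_source: str = "scripted"  # e.g. "scripted", "human", "expert"
    control_hz: float = 10.0


# Two illustrative entries loosely modelled on rows of Table 1 (names are made up).
SPECS = {
    "manipulator_7dof": EmbodimentSpec(
        name="manipulator_7dof", rgb_cameras=2, depth_cameras=1,
        action_dim=7, proprioception=True, policy_source="human", control_hz=5),
    "passenger_vehicle": EmbodimentSpec(
        name="passenger_vehicle", rgb_cameras=8, action_dim=5,
        calibrated=True, proprioception=True, policy_source="expert", control_hz=50),
}

if __name__ == "__main__":
    for spec in SPECS.values():
        # A data-loader or model config could be derived from these fields.
        print(f"{spec.name}: obs = {spec.rgb_cameras} RGB + {spec.depth_cameras} depth, "
              f"actions = {spec.action_dim} @ {spec.control_hz} Hz")
```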

This presents an open question: do we develop a unique model for each embodiment and environment, or train a joint model [9] in which different environments and tasks help the model learn representations that transfer between embodiments? Recent research [5, 6, 7] hints that the latter strategy yields better AI models.

Applying this paradigm beyond research [9, 10] to enterprise-scale datasets like Yaak's (Fig 3) presented us with new challenges in dataset visualization, search, curation, and dev-tools for end-to-end AI for robotics. Below we highlight some of the challenges we faced while working with large-scale, multi-modal datasets in the automotive domain.

Recent open-source efforts from the 🤗 team (LeRobot [4]) have lowered the barrier to entry into end-to-end AI for robotics research and education for early adopters of end-to-end spatial intelligence.
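
For example, LeRobot exposes its datasets as standard PyTorch datasets, so a few lines are enough to start iterating over episodes. The sketch below follows the usage documented in the LeRobot repository at the time of writing; the exact import path and the lerobot/pusht repo id may differ between versions, so treat it as an assumption to check against the current docs.

```python
import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Load a public dataset from the Hugging Face hub (repo id is illustrative;
# check the LeRobot hub page for currently available datasets).
dataset = LeRobotDataset("lerobot/pusht")

# LeRobot datasets behave like regular PyTorch datasets, so the standard DataLoader applies.
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# Inspect the tensor shapes of one batch (observations, actions, timestamps, ...).
batch = next(iter(loader))
for key, value in batch.items():
    if isinstance(value, torch.Tensor):
        print(key, tuple(value.shape))
```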

Developer tools gap

Over the last 3+ years of our partnership with driving schools, we've collected real-world, multi-modal datasets in 30 cities. Our dataset contains both expert (driving instructor) and student (learner driver) policies.

Fig 3: 100K+ hours of multi-modal sensor data and expert / student policies

As our dataset grew, we quickly ran into multiple bottlenecks before we could work on spatial intelligence. We found that the landscape of developer tools for end-to-end AI in robotics was fragmented, closed-source, or nonexistent. Below, we outline a few of the issues we ran into.

  • Visualization

    • Monitoring: Sensor modality visualization and failure inspection

    • Enrichment: Additional modalities (e.g. audio, commentary)

    • Tasks: Annotations for spatial intelligence tasks (e.g. parking)

  • Search

    • Discovery: Sifting through petabyte-scale datasets

    • Metrics: Understanding task distribution

    • Curation: Balancing task distribution in training sets

  • Multi-modal datasets

    • Support: Proprietary vs. open robotics data formats (ros2/mcap)

    • Alignment: Different sampling rates across modalities (see the sketch after this list)

    • Data-loaders: From raw, multi-modal data to PyTorch tensors

  • Spatial intelligence

    • Data is IP: Closed-source end-to-end AI for robotics

    • Open source: No hackable, open codebases for training spatial intelligence
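
To make the alignment and data-loader issues concrete, the sketch below pairs each camera frame with the nearest vehicle-control sample by timestamp and converts the result into PyTorch tensors. It is our own minimal illustration, assuming both streams carry timestamps in seconds; the 30 Hz camera and 100 Hz control rates are illustrative, not Yaak's actual configuration.

```python
import numpy as np
import torch

# Illustrative streams: 30 Hz camera timestamps and 100 Hz control samples (seconds).
cam_ts = np.arange(0.0, 10.0, 1 / 30)
ctrl_ts = np.arange(0.0, 10.0, 1 / 100)
ctrl_vals = np.stack([np.sin(ctrl_ts), np.cos(ctrl_ts)], axis=1)  # e.g. steering, speed


def align_nearest(ref_ts: np.ndarray, src_ts: np.ndarray, src_vals: np.ndarray) -> np.ndarray:
    """For each reference timestamp, pick the source sample with the closest timestamp."""
    idx = np.searchsorted(src_ts, ref_ts)
    idx = np.clip(idx, 1, len(src_ts) - 1)
    # Choose between the neighbour before and after the insertion point.
    prev_closer = (ref_ts - src_ts[idx - 1]) < (src_ts[idx] - ref_ts)
    idx = np.where(prev_closer, idx - 1, idx)
    return src_vals[idx]


# One control vector per camera frame, ready to be stacked into training tensors.
aligned_controls = align_nearest(cam_ts, ctrl_ts, ctrl_vals)
controls_tensor = torch.from_numpy(aligned_controls).float()
print(controls_tensor.shape)  # (num_frames, num_control_channels)
```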

As end-to-end AI becomes the dominant paradigm in spatial intelligence [7, 11], including Yaak's first use case in the automotive industry, there is an unmet need for enterprise tools and workflows to build it: unified, intuitive, and scalable, with safety validation at their core. In Part 2 of this blog series, we'll share how we built our spatial intelligence platform to address these challenges (Fig 4).

Fig 4: Attention heat map from Yaak's spatial intelligence model, trained entirely through self-supervision from observations (RGB cameras) and actions (vehicle controls)

References

  1. Spatial intelligence TED Talk: Fei-Fei Li

  2. World models

  3. VLMM: Vision-Language multimodal models

  4. LeRobot: Making AI for Robotics more accessible with end-to-end learning

  5. Open-X dataset: Open-X Robotics Datasets

  6. Open-X Embodiment: Robotic Learning Datasets and RT-X Models

  7. Multi embodiment: Scaling up learning across many different robot types

  8. rerun.io: Visualize multimodal data over time

  9. Gato: A Generalist Agent

  10. Trajectory transformers: Offline RL as One Big Sequence Modeling Problem

  11. SMART: A Generalized Pre-training Framework for Control Tasks

  12. VISTA: A Generalizable Driving World Model with High Fidelity and Versatile Controllability