Building Spatial Intelligence - Part 2

23/10/2024 · Technology

Fig 1: Attention heat map of Yaak's spatial intelligence trained entirely through self-supervision from observations (RGB cameras) and actions (vehicle controls)

While working with petabyte-scale datasets from the automotive domain, we identified four key areas essential to building end-to-end (e2e) AI models for robotics [1].

Logs

  • Monitoring: Sensor log visualization and debugging

  • Enrichment: Adding new modalities (e.g. audio, commentary)

  • Tasks: Defining spatial intelligence tasks (e.g. parking)

Discovery

  • Triage: Auto-tagging policy anomalies and novel scenarios

  • Metrics: Computing task, anomaly, scenario distribution

  • Curation: Building balanced datasets through search

Multimodal data

  • Support: Proprietary and open robotics data [2] formats

  • Alignment: Different sampling rates of modalities

  • Data loaders: From sensor logs to training samples

Spatial intelligence

  • Scale: End-to-end AI for robotics is resource-intensive (data and compute)

  • Open source: Hackable, open-source codebases for training spatial intelligence models are largely non-existent

Logs

Unlike curated academic research datasets, real-world robotics data is riddled with faulty sensors, calibration drift, policy failures, sub-optimal trajectories, and missing modalities (audio/narration). A critical first step towards building e2e AI for robotics is replaying the multimodal logs and enriching them with additional modalities, e.g. audio or a natural-language description of the task.

To address this initial step, we've built Nutron (Fig 2) for all things multimodal data.

Fig 2: Log replay, task annotation (map), and auto-triage flags (playback bar)
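Nutron itself is SaaS, but the replay step is easy to picture with open tooling. Below is a minimal sketch using the open-source mcap Python package (not Nutron's actual stack) that replays a log and flags missing modalities; the file path and topic names are hypothetical.

```python
# Minimal log inspection with the open-source `mcap` package.
# The file path and topic names are hypothetical.
from collections import Counter

from mcap.reader import make_reader

EXPECTED_TOPICS = {"/camera/front", "/vehicle/controls", "/audio"}

with open("drive_2024_10_23.mcap", "rb") as f:
    reader = make_reader(f)
    counts = Counter()
    for schema, channel, message in reader.iter_messages():
        counts[channel.topic] += 1

# Flag modalities that are missing from the log entirely.
for topic in sorted(EXPECTED_TOPICS):
    status = f"{counts[topic]} messages" if counts[topic] else "MISSING"
    print(f"{topic}: {status}")
```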

Discovery

Not all data is created equal. Sensor and policy failures, as well as rare and novel scenarios, skew the distribution of tasks within the dataset. Since e2e AI for robotics has an appetite for high-quality, balanced datasets, the second key step is to auto-triage the logs and discover anomalies and trends within enterprise-scale datasets.
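A skewed distribution can also be counteracted at training time. Below is a minimal sketch of inverse-frequency sampling with PyTorch's WeightedRandomSampler; the scenario tags are hypothetical, and this illustrates the general technique rather than Nutron's actual mechanism.

```python
# Rebalancing a skewed scenario distribution with inverse-frequency
# sampling. The tags are hypothetical examples of auto-triage output.
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

# One scenario tag per training sample, e.g. produced by auto-triage.
tags = ["lane_follow", "lane_follow", "lane_follow", "parking", "yield_sign"]

freq = Counter(tags)
weights = torch.tensor([1.0 / freq[t] for t in tags], dtype=torch.double)

# Rare scenarios (parking, yield_sign) are now drawn as often as common ones.
sampler = WeightedRandomSampler(weights, num_samples=len(tags), replacement=True)
```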

Fig 3: Embedding similarity for scenarios with yield/stop signs. The foundation model was not provided with any annotations of the traffic signage.

Scenario embeddings from foundation models truly shine at clustering similar scenarios thanks to their inherent world model (Fig 3). They power an important paradigm in Nutron: dataset curation through natural-language search or scenario similarity. With Nutron, robotics teams can sift through petabyte-scale datasets, cherry-pick scenarios, and add them to dataset configs.
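Our foundation model and search index are proprietary, but the underlying idea is straightforward to sketch with an open-source CLIP model as a stand-in: embed scenario frames and the text query into the same space, then rank by cosine similarity. The file names and query below are hypothetical.

```python
# Natural-language scenario search with an open-source CLIP model
# (a stand-in for Nutron's proprietary foundation model).
import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

frames = ["scenario_001.jpg", "scenario_002.jpg"]  # hypothetical key frames
images = torch.stack([preprocess(Image.open(p)) for p in frames])

with torch.no_grad():
    image_emb = model.encode_image(images)
    text_emb = model.encode_text(tokenizer(["a yield sign at an intersection"]))

# Cosine similarity ranks scenarios against the text query.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(1)

for path, score in sorted(zip(frames, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```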

Multimodal data

Data logs from robotics (often called rosbags or mcap files) are optimized containers for logging multimodal sensor data captured as protobuf or flatbuffer messages. In a classic robotics pipeline, the rosbags are unpacked and the sensor data is extracted to train perception, planning, and control modules (or models). This leads to massive data duplication across teams.

As e2e AI for robotics goes mainstream, we are revisiting how to access random samples directly from rosbags (mcap) without unpacking them. To kick off this effort, we've open-sourced rbyte (Fig 5), which provides out-of-the-box support for hdf5 and mcap (the default for ros2) by leveraging mcap's built-in index and file summary. We are continuously developing rbyte, adding support for public and enterprise robotics datasets and data formats, e.g. rrd from rerun.io.
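rbyte's own API is still evolving, so rather than guess at it, here is a sketch of the underlying idea using the mcap package directly: the file's index lets the reader seek a time window without unpacking the log, and nearest-timestamp matching aligns modalities sampled at different rates. The topic names and file path are hypothetical.

```python
# Index-backed random access into an mcap log, plus nearest-timestamp
# alignment of two modalities with different sampling rates. This
# illustrates the idea behind rbyte, not its actual API.
import bisect

from mcap.reader import make_reader

with open("drive.mcap", "rb") as f:
    reader = make_reader(f)

    def log_times(topic, start_ns, end_ns):
        # iter_messages uses the file's index to seek the requested window.
        return [
            msg.log_time
            for _, _, msg in reader.iter_messages(
                topics=[topic], start_time=start_ns, end_time=end_ns
            )
        ]

    start, end = 0, 10 * 10**9  # first ten seconds of the log (ns)
    camera_ts = log_times("/camera/front", start, end)       # e.g. ~30 Hz
    control_ts = log_times("/vehicle/controls", start, end)  # e.g. ~100 Hz

# For each camera frame, pick the control message nearest in time.
pairs = []
for t in camera_ts:
    i = bisect.bisect_left(control_ts, t)
    candidates = control_ts[max(i - 1, 0) : i + 1]
    if candidates:
        pairs.append((t, min(candidates, key=lambda c: abs(c - t))))
```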

Fig 4: Yaak's unified workflow for building e2e AI in robotics (Purple: SaaS, Blue: Open-sourced)

A key aspect of our workflow is that the dataset config built through Nutron works seamlessly with rbyte for training e2e AI for robotics.

Fig 5: rbyte: Multimodal datasets for spatial intelligence (mcap/hdf5/rrd)

Spatial intelligence

For the last 3+ years, we've been collecting multimodal automotive datasets with our partners in 30 German cities. Nutron and our open-source libraries were born out of the necessity of working with datasets at this scale to build e2e AI models. The important final step for building an e2e AI model is hackable, open-source software (like [3]) that leverages a modern deep-learning framework, is customizable, and works out-of-the-box with public robotics datasets.

In part three of this blog series, we'll introduce tron, our open-source software for building e2e AI for robotics, alongside pre-trained models for popular robotics datasets, including Yaak's automotive dataset. Below (Fig 6) is a small preview of what tron learns entirely through self-supervision about image patch location, temporal order, and the different modalities.

Fig 6: Embedding similarity, from left: image patch column position, global time-step (temporal order), steering angle (-1, 1), gas pedal (0, 1), brake pedal (0, 1)
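tron isn't released yet, so the sketch below only illustrates the flavor of pretext task behind Fig 6 rather than its actual architecture: a small PyTorch head that must predict each image patch's column position from its embedding alone, with no human labels involved.

```python
# A generic patch-position pretext task. The encoder is a placeholder
# (a real model would use a ViT); shapes are illustrative only.
import torch
import torch.nn as nn

B, H, W, D = 8, 6, 10, 256  # batch, patch rows, patch cols, embedding dim

encoder = nn.Linear(3 * 16 * 16, D)  # placeholder for a real patch encoder
column_head = nn.Linear(D, W)        # predicts each patch's column index

patches = torch.randn(B, H * W, 3 * 16 * 16)  # flattened 16x16 RGB patches
embeddings = encoder(patches)                 # (B, H*W, D)

# The target is each patch's column position, known from how the image
# was cut into patches (row-major order), so no human labels are needed.
columns = torch.arange(W).repeat(H).expand(B, -1)  # (B, H*W)

logits = column_head(embeddings)  # (B, H*W, W)
loss = nn.functional.cross_entropy(logits.reshape(-1, W), columns.reshape(-1))
loss.backward()
```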

References

  1. Spatial intelligence TED Talk: Fei-Fei Li

  2. Open-X Robotics Datasets

  3. LeRobot: Making AI for Robotics more accessible with end-to-end learning