Yaak & Lerobot team · March 11, 2025
State-of-the-art Vision Language Models (VLMs) and Large Language Models (LLMs) are trained on open-source image-text corpora sourced from the internet, which has spearheaded the recent acceleration of open-source AI. Despite these breakthroughs, the adoption of end-to-end AI within the robotics and automotive community remains low, primarily due to a lack of high-quality, large-scale multimodal datasets like OXE. To unlock the potential of robotics AI, Yaak teamed up with the LeRobot team at 🤗 and is excited to announce Learning to Drive (L2D) to the robotics AI community. L2D is the world’s largest multimodal dataset aimed at building open-source spatial intelligence for the automotive domain, with first-class support for 🤗’s LeRobot training pipeline and models. Drawing inspiration from the best practices of source version control, Yaak also invites the AI community to search and discover novel episodes in our entire dataset (> 1 petabyte) and queue their collection for review, to be merged into future releases (R5+).
| Dataset | Observation | State | Actions | Task/Instructions | Episodes | Duration (hr) | Size (TB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| L2D (R4) | RGB (6x) | GPS/IMU/CAN | ☑️ | ☑️ | 1M | 5,000+ | 90+ |

Table 1: Open source self-driving datasets. (Source)
L2D was collected with identical sensor suites installed on 60 EVs operated by driving schools in 30 German cities over a span of 3 years. The policies in L2D are divided into two groups: expert policies executed by driving instructors, and student policies executed by learner drivers. Both policy groups include a natural language instruction of the task as context, for example, “When you have the right of way, take the third exit from the roundabout, carefully driving over the pedestrian crossing”.
Expert policy—Driving instructor
Fig 1: 3 of 6 cameras in L2D shown here for clarity, Visualization: Nutron.
Task: “When you have the right of way, drive through the roundabout and take the third exit”.
Expert policies contain zero driving mistakes and are considered optimal, whereas student policies have known sub-optimalities (Fig 2).
Fig 2: Student policy with jerky steering to avoid drifting into the lane of the oncoming truck
Both groups cover all driving scenarios that must be completed to obtain a driving license within the EU (German version), for example overtaking, roundabouts and train tracks. In releases R3 and later (see below), sub-optimal student policies will include a natural language explanation of their sub-optimality.
Expert: Driving Instructor
Expert policies are collected when driving instructors are operating the vehicle. The driving instructors have at least 10,000 hours of experience teaching learner drivers. The expert policy group covers the same driving tasks as the student policy group.
Student: Learner driver
Student policies are collected when learner drivers are operating the vehicle. Learner drivers have varying degrees of experience (10–50 hours). By design, learner drivers cover all EU-mandated driving tasks, from high-speed lane changes on highways to navigating narrow pedestrian zones.
L2D: Learning to Drive
L2D (R2+) aims to be the largest open-source self-driving dataset that empowers the AI community with unique and diverse episodes for training end-to-end spatial intelligence. With the inclusion of the full spectrum of driving policies (student and expert), L2D captures the intricacies of safely operating a vehicle. To fully represent an operational self-driving fleet, we include episodes with diverse environmental conditions, sensor failures, construction zones and non-functioning traffic signals.
Both the expert and student policy groups are captured with the identical sensor setup detailed in the table below. Six RGB cameras capture the vehicle’s context in 360°, and on-board GPS captures the vehicle location and heading. An IMU collects the vehicle dynamics, and we read speed, gas/brake pedal, steering angle, turn signal and gear from the vehicle’s CAN interface. We synchronized all modality types with the front left camera (observation.images.front_left) using their respective Unix epoch timestamps, interpolated data points where feasible to enhance precision (see Table 2), and finally reduced the sampling rate to 10 Hz.
Fig 3: Multimodal data visualization with Nutron (only 3 of 6 cameras shown for clarity)
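As a rough illustration of this synchronization step, the sketch below aligns another modality (e.g. CAN speed) to the front-left camera clock with pandas and downsamples the result to 10 Hz. The column names, millisecond units and tolerance handling are assumptions for illustration, not the actual Yaak pipeline.

```python
import pandas as pd

def align_to_camera(camera: pd.DataFrame, modality: pd.DataFrame, tol_ms: int) -> pd.DataFrame:
    """Attach the nearest modality sample to each camera frame.

    Both frames are assumed to carry a "timestamp" column in Unix epoch milliseconds.
    """
    camera = camera.sort_values("timestamp")
    modality = modality.sort_values("timestamp")
    return pd.merge_asof(camera, modality, on="timestamp",
                         direction="nearest", tolerance=tol_ms)

def downsample_10hz(frames: pd.DataFrame) -> pd.DataFrame:
    """Keep one synchronized row per 100 ms bucket (10 Hz)."""
    buckets = frames["timestamp"] // 100
    return frames.groupby(buckets).first().reset_index(drop=True)
```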
| Modality | LeRobotDataset v2.1 key | Shape | Alignment [tol][strategy] |
| --- | --- | --- | --- |
Table 2: Modality types, LeRobot v2.1 key, shape and interpolation strategy.
L2D follows the official German driving task catalog (detailed version) for its definition of driving tasks, driving sub-tasks and task definitions. We assign a unique Task ID and a natural language description of the context to every episode. The task (LeRobot:task) for all episodes is set to “Follow the waypoints adhering to driving rules and regulations”. The table below shows a few sample episodes, their natural language instruction (context), driving tasks and sub-tasks. Both expert and student policies have an identical Task ID for similar scenarios, whereas the context varies with the episode.
| Episode | Instructions | Driving task | Driving sub-task | Task definition | Task ID |
| --- | --- | --- | --- | --- | --- |
| — | Drive straight through going around the parked delivery truck and yield to the incoming traffic | 3 Passing, overtaking | 3.1 Passing obstacles and narrow spots | This sub-task involves passing obstacles or navigating narrow roads while following priority rules. | 3.1.1.3a Priority regulation without traffic signs (standard) |
| — | Drive through the unprotected left turn yielding to through traffic | 4 Intersections, junctions, entering moving traffic | 4.1 Crossing intersections & junctions | This sub-task involves crossing intersections and junctions while following priority rules and observing other traffic. | 4.3.1.3c With traffic lights |
| — | Drive up to the yield sign and make the unprotected left turn while yielding to bike/pedestrian and vehicle | 4 Intersections, junctions, entering moving traffic | 4.3 Left turns at intersections & junctions | This sub-task involves making left turns at intersections while observing traffic signals and yielding appropriately. | 4.3.1.3c With traffic lights |

Table 3: Sample episodes in L2D, their context and Task ID derived from the EU driving task catalog
We automate the construction of the instructions and waypoints using the vehicle position (GPS), the Open-Source Routing Machine (OSRM), OpenStreetMap (OSM) and a Large Language Model (LLM) (see below). The natural language queries are constructed to closely follow the turn-by-turn navigation available in most GPS navigation devices. The waypoints (Fig 4) are computed by map-matching the raw GPS trace to the OSM graph and sampling 10 equidistant points (orange) spanning 100 meters from the vehicle’s current location (green); these serve as drive-by-waypoints.
Fig 4: L2D 6x RGB cameras, waypoints (orange) and vehicle location (green)
Context: “drive straight up to the stop sign and then, when you have right of way, merge with the moving traffic from the left”
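A minimal sketch of this waypoint computation is shown below, assuming the map-matched route and vehicle position have already been projected into a metric CRS (e.g. UTM); the function and parameter names are illustrative, not Yaak’s implementation.

```python
from shapely.geometry import LineString, Point

def sample_waypoints(route: LineString, vehicle: Point,
                     horizon_m: float = 100.0, n: int = 10) -> list[tuple[float, float]]:
    """Return n equidistant (x, y) waypoints over the next horizon_m meters of the route."""
    start = route.project(vehicle)            # arc length of the closest point on the route
    step = horizon_m / n
    ahead = [route.interpolate(start + step * (i + 1)) for i in range(n)]
    return [(p.x, p.y) for p in ahead]
```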
Search & Curation
Expert policies
GPS traces from the expert policies collected with the driving school fleet. Click here to see the full extent of expert policies in L2D.
Student policies
Student policies cover the same geographical locations as expert policies. Click here to see the full extent of student policies in L2D.
We collected the expert and student policies with a fleet of 60 KIA e-Niro driving school vehicles operating in 30 German cities, all with an identical sensor suite. The multimodal logs collected with the fleet are unstructured and devoid of any task or context information. To search for and curate episodes, we enrich the raw multimodal logs with information extracted by map-matching the GPS traces with OSRM and assigning node and way tags from OSM (see next section). Coupled with an LLM, this enrichment step enables searching for episodes through a natural language description of the task.
OpenStreetMap
To efficiently search for relevant episodes, we enrich the GPS traces with turn information obtained by map-matching the traces using OSRM. We additionally use the map-matched route to assign route features, route restrictions and route maneuvers, collectively referred to as route tasks, to the trajectory using OSM (see sample Map). Appendices A.1 and A.2 provide more details on the route tasks we assign to GPS traces.
Fig 5: Tasks assigned to raw GPS trace (map view)
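The sketch below shows, roughly, how such a trace can be snapped to the OSM road graph with OSRM’s HTTP match service; the self-hosted endpoint and request options are assumptions for illustration, not Yaak’s production pipeline.

```python
import requests

OSRM_URL = "http://localhost:5000"  # assumed self-hosted OSRM with a Germany extract

def match_trace(coords: list[tuple[float, float]], timestamps: list[int]) -> dict:
    """Map-match a raw GPS trace of (lon, lat) points with Unix epoch timestamps."""
    path = ";".join(f"{lon},{lat}" for lon, lat in coords)
    params = {
        "timestamps": ";".join(str(t) for t in timestamps),
        "geometries": "geojson",   # matched route geometry as GeoJSON
        "annotations": "nodes",    # matched OSM node ids, used later for tag lookup
        "overview": "full",
    }
    resp = requests.get(f"{OSRM_URL}/match/v1/driving/{path}", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()             # contains "matchings" and per-point "tracepoints"
```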
Each route task assigned to the map-matched route carries a beginning and end timestamp (Unix epoch), corresponding to the times when the vehicle enters and exits the geospatial linestring or point defined by the task (Fig 6).
Fig 6: Pink: GNSS trace, Blue: Matched route, tasks: Yield, Train crossing and Roundabout (map view)
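A minimal sketch of this timestamp assignment is shown below, assuming the matched trace and the task geometry share a metric CRS; the buffer width and data layout are assumptions.

```python
from shapely.geometry import LineString, Point

def task_time_span(task_geom: LineString,
                   trace: list[tuple[float, float, int]],
                   buffer_m: float = 10.0) -> tuple[int, int] | None:
    """Return (t_enter, t_exit) Unix epoch timestamps of the vehicle inside the task geometry."""
    zone = task_geom.buffer(buffer_m)                 # tolerance around the linestring/point
    inside = [t for x, y, t in trace if zone.contains(Point(x, y))]
    if not inside:
        return None                                   # the trace never touches this task
    return min(inside), max(inside)
```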
Multimodal Search
We perform semantic spatiotemporal indexing of our multimodal data with the route tasks described in Fig 5. This step provides a rich semantic overview of our multimodal data. To retrieve representative episodes from this semantic space by context, for example “drive up to the yield sign, and then over the train crossing and through the roundabout”, we built an LLM-powered multimodal natural language search over all our drive data (> 1 PB).
We structured the natural language queries (context) to closely resemble the turn-by-turn navigation available in GPS navigation devices. To translate context to route tasks, we prompt the LLM with the query, steer its output to a list of route features, route restrictions and route maneuvers, and retrieve the episodes assigned to these route tasks. We perform strict validation of the LLM output with a pydantic model to minimize hallucinations. Specifically, we use llama-3.3-70b and steer the output to the schema defined by the pydantic model. To further improve the quality of the structured output, we used approximately 30 pairs of known natural language queries and route tasks for in-context learning. Appendix A.2 provides details on the in-context learning pairs we used.
LeRobot
L2D on 🤗 is converted to LeRobotDataset v2.1 format to fully leverage the current and future models supported within LeRobot. The AI community can now build end-to-end self-driving models leveraging the state-of-the-art imitation learning and reinforcement learning models for real world robotics like ACT, Diffusion Policy, and Pi0.
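As a quick illustration, the sketch below loads a few L2D episodes with LeRobot’s dataset API; the repo id and episode indices are assumptions for illustration.

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Pull a handful of episodes from the hub (repo id assumed for illustration).
dataset = LeRobotDataset("yaak-ai/L2D", episodes=[0, 1, 2])

frame = dataset[0]                                    # one synchronized 10 Hz step as a dict of tensors
print(sorted(frame.keys()))                           # e.g. observation.images.front_left, ...
print(frame["observation.images.front_left"].shape)
```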
Existing self-driving datasets (table below) focus on intermediate perception and planning tasks like 2D/3D object detection, tracking, segmentation and motion planning, which require high-quality annotations and are therefore difficult to scale. Instead, L2D focuses on the development of end-to-end AI that learns to predict actions (policy) directly from sensor input (Table 4). These models leverage internet pre-trained VLMs and VLAMs.
| Dataset | Observation | State | Actions | Task/Instructions | Episodes | Duration (hr) | Size (TB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| L2D (R4) | RGB (6x) | GPS/IMU/CAN | ☑️ | ☑️ | 1M | 5,000+ | 90+ |

Table 4: Open source self-driving datasets. (Source)
Releases
The performance of robotics AI models is bounded by the quality of the episodes in the training set. To ensure the highest quality episodes, we plan a phased release for L2D. Each release R1+ is a superset of the previous releases, ensuring a clean episode history. With each new release, we add additional information about the episodes:
1. instructions: Natural language instruction of the driving task
2. task_id: Mapping of episodes to EU mandated driving tasks (Task ID)
3. observation.state.route : Information about lane count, turn lanes from OSM
4. suboptimal: Natural language description for the sub-optimal policies
| HF | Nutron | Date | Episodes | Duration (hr) | Size | instructions | task_id | observation.state.route | suboptimal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R1 | R1 | April 2025 | 1K | 5+ | 95 GB | ☑️ | — | — | — |
| R2 | R2 | May 2025 | 10K | 50+ | 1 TB | ☑️ | ☑️ | ☑️ | ☑️ |
| R3 | R3 | June 2025 | 100K | 500+ | 10 TB | ☑️ | ☑️ | ☑️ | ☑️ |
| R4 | R4 | July 2025 | 1M | 5,000+ | 90 TB | ☑️ | ☑️ | ☑️ | ☑️ |

Table 5: L2D release dates (Source)
The entire multimodal dataset collected by Yaak with the driving school fleet is 5x larger than the planned release. To grow L2D beyond R4, we invite the AI community to search and uncover scenarios within our entire data collection and build a community-powered open-source L2D. The AI community can now search for episodes through our natural language search and queue their collections for community review, to be merged into upcoming releases. With L2D, we hope to unlock an ImageNet moment for spatial intelligence.
Closed Loop Testing
LeRobot Driver
For real-world testing of AI models trained with L2D and LeRobot, we invite the AI community to submit models for closed loop testing with a safety driver, starting in the summer of 2025. The AI community will be able to queue their models for closed loop testing on our fleet and choose the tasks they’d like the model to be evaluated on, for example navigating roundabouts or parking. The model would run in inference mode (Jetson AGX or similar) on-board the vehicle. The models will drive the vehicle with the LeRobot driver in two modes:
1. drive-by-waypoints: “Follow the waypoints adhering to driving rules and regulations” given observation.state.vehicle.waypoints (see the sketch after this list)
2. drive-by-language: “Drive straight and turn right at the pedestrian crossing”
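The sketch below illustrates what a drive-by-waypoints control step could look like on-board, using a LeRobot policy; the checkpoint id, batching details and vehicle interface are assumptions, not the actual LeRobot driver.

```python
import torch
from lerobot.common.policies.act.modeling_act import ACTPolicy

# Hypothetical checkpoint trained on L2D; any LeRobot policy exposing select_action works similarly.
policy = ACTPolicy.from_pretrained("your-org/l2d-act-checkpoint")
policy.eval()

def control_step(cameras: dict[str, torch.Tensor], waypoints: torch.Tensor) -> torch.Tensor:
    """One 10 Hz step: cameras maps observation.images.* keys to CHW tensors, waypoints is (10, 2)."""
    batch = {key: img.unsqueeze(0) for key, img in cameras.items()}        # add batch dimension
    batch["observation.state.vehicle.waypoints"] = waypoints.unsqueeze(0)
    with torch.no_grad():
        action = policy.select_action(batch)                               # e.g. steering, gas/brake
    return action.squeeze(0)
```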
Additional Resources
Driving task catalog (Fahraufgabenkatalog)
Appendix
A.1 Route Tasks
List of route tasks. We consider an OSM route tag a restriction if it imposes constraints on the policy, for example a speed limit, a yield sign or construction. Route features are physical structures along the route, for example inclines, tunnels and pedestrian crossings. Route maneuvers are the scenarios a driver encounters during normal operation of the vehicle in an urban environment, for example multi-lane left turns and roundabouts.
| Type | Name | Assignment | Task ID | Release |
| --- | --- | --- | --- | --- |
| Route restriction | CONSTRUCTION | VLM | — | R1 |
| Route restriction | CROSS_TRAFFIC | VLM | 4.3.1.3a, 4.3.1.3b, 4.3.1.3d, 4.2.1.3a, 4.2.1.3b, 4.2.1.3d | R2 |
| Route restriction | INCOMING_TRAFFIC | VLM | — | R2 |
| Route restriction | PEDESTRIANS | VLM | 7.2.1.3b | R1 |
| Route feature | CURVED_ROAD | OSM (derived) | 2.1.1.3a, 2.1.1.3b | R0 |
| Route feature | NARROW_ROAD | VLM | — | — |
| Route feature | PARKING | OSM | — | R0 |
| Route maneuver | ENTERING_MOVING_TRAFFIC | OSM (derived) | 4.4.1.3a | R0 |
| Route maneuver | CUTIN | VLM | — | R3 |
| Route maneuver | LANE_CHANGE | VLM | 1.3.1.3a, 1.3.1.3b | R3 |
| Route maneuver | PROTECTED_LEFT | OSM (derived) | 4.3.1.3c, 4.3.1.3d | R0 |
| Route maneuver | PROTECTED_RIGHT_WITH_BIKE | OSM (derived) | 4.2.1.3c, 4.2.1.3d | R0 |
| Route maneuver | RIGHT_BEFORE_LEFT | OSM (derived) | 4.1.1.3a, 4.2.1.3a, 4.3.1.3a | R0 |
| Route maneuver | STRAIGHT | OSM (derived) | 8.1.1.3a | R0 |
| Route maneuver | OVER_TAKE | VLM | 3.2.1.3a, 3.2.1.3b | R4 |
| Route maneuver | UNPROTECTED_LEFT | OSM (derived) | 4.3.1.3a, 4.3.1.3b | R0 |

OSM = OpenStreetMap, VLM = Vision Language Model, derived = hand-crafted rules with OSM data
A.2 LLM Prompts
Prompt template and pseudocode for configuring the LLM (via Groq) to parse natural language queries into structured predictions of route features, restrictions and maneuvers with a pydantic model. The natural language queries are constructed to closely follow the turn-by-turn navigation available in most GPS navigation devices.
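Below is a minimal sketch of such a configuration, assuming illustrative field names for ParsedInstructionModel and a generic system prompt; it is not Yaak’s exact prompt or schema.

```python
from groq import Groq
from pydantic import BaseModel

class ParsedInstructionModel(BaseModel):
    # Field names are assumptions; allowed values follow the route tasks in Appendix A.1.
    route_features: list[str] = []
    route_restrictions: list[str] = []
    route_maneuvers: list[str] = []

SYSTEM_PROMPT = (
    "Convert the driving instruction into route features, route restrictions and "
    "route maneuvers from the allowed vocabulary. Respond with JSON only."
)

client = Groq()  # reads GROQ_API_KEY from the environment

def parse_instruction(query: str, few_shot: list[dict] | None = None) -> ParsedInstructionModel:
    """Steer llama-3.3-70b into the pydantic schema and validate the result strictly."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages += few_shot or []                                  # optional in-context pairs (A.2)
    messages.append({"role": "user", "content": query})
    completion = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        response_format={"type": "json_object"},                # force JSON mode
        messages=messages,
    )
    # Strict validation rejects hallucinated fields or malformed output.
    return ParsedInstructionModel.model_validate_json(completion.choices[0].message.content)
```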
Example pairs (3 of ~30 shown) for in-context learning to steer the structured prediction of the LLM, where ParsedInstructionModel is a pydantic model.
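The pairs below are illustrative stand-ins (the route task names follow Appendix A.1 and Fig 6, and the actual 30 pairs are not reproduced); they are prepended to the chat as alternating user/assistant turns.

```python
import json

# Hypothetical (query, expected JSON) pairs in the ParsedInstructionModel schema sketched above.
EXAMPLE_PAIRS = [
    ("drive up to the yield sign and then over the train crossing",
     {"route_features": ["TRAIN_CROSSING"], "route_restrictions": ["YIELD"], "route_maneuvers": []}),
    ("when you have the right of way, take the third exit from the roundabout",
     {"route_features": [], "route_restrictions": [], "route_maneuvers": ["ROUNDABOUT"]}),
    ("overtake the parked delivery truck and merge back into your lane",
     {"route_features": [], "route_restrictions": [], "route_maneuvers": ["OVER_TAKE"]}),
]

def as_few_shot_messages(pairs=EXAMPLE_PAIRS) -> list[dict]:
    """Render the pairs as alternating user/assistant chat turns for in-context learning."""
    messages = []
    for query, parsed in pairs:
        messages.append({"role": "user", "content": query})
        messages.append({"role": "assistant", "content": json.dumps(parsed)})
    return messages
```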
A.3 Data Collection Hardware
Onboard compute: NVIDIA Jetson AGX Xavier
- 8 cores @ 2/2.2 GHz, 16/64 GB DDR5
- 100 TOPS, 8 lanes MIPI CSI-2 D-PHY 2.1 (up to 20 Gbps)
- 8x 1080p30 video encoders (H.265)
- Power: 10-15 V DC input, ~90 W power consumption
- Storage: M.2 SSD (PCIe Gen 4 x4)

Video input (8 cameras):
- 2x Fakra MATE-AX, each with 4x GMSL2 and Power-over-Coax support

Connectivity:
- Multi-band RTK module with centimeter-level accuracy
- 5G: M.2 USB3 module with maximum downlink rate of 3.5 Gbps and uplink rate of 900 Mbps, dual SIM
| Component | # | Vendor | Specs |
| --- | --- | --- | --- |