Preparing Datasets for Training

Download a dataset from Azure Blob Storage or the HuggingFace Hub, inspect its structure, validate that it follows the LeRobot dataset layout, and connect it to a LeRobot training workflow. By the end of this recipe, you will have a training-ready dataset on your local machine or in a cloud-accessible location.

[!NOTE] This recipe covers dataset preparation. For training with the prepared dataset, continue to Your First LeRobot Training Job.

📋 Prerequisites

Requirement	Details
Python	3.12+ with `uv` or `pip`
Azure CLI	Authenticated (`az login`) — for Azure Blob datasets
Azure Storage	Storage account with dataset container — for Azure Blob datasets
HuggingFace CLI	`pip install huggingface-hub` — for HuggingFace datasets

🚀 Steps

Step 1: Choose a dataset source

LeRobot datasets come from two sources:

Source	When to use	Example
HuggingFace Hub	Public community datasets, quick experimentation	`lerobot/aloha_sim_insertion_human`
Azure Blob Storage	Private datasets, recorded edge data uploaded to Azure	Custom organization datasets

Step 2a: Download from HuggingFace

For public datasets, use the HuggingFace CLI:

pip install huggingface-hub
huggingface-cli download \
  lerobot/aloha_sim_insertion_human \
  --repo-type dataset \
  --local-dir ./datasets/lerobot/aloha_sim_insertion_human
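
The same download can be scripted from Python, which may be convenient in notebooks or setup scripts. The sketch below uses `snapshot_download` from the `huggingface_hub` package. The helper names `local_dir_for` and `download_dataset` and the `./datasets` default root are illustrative assumptions, not part of this repository.

```python
from pathlib import Path


def local_dir_for(repo_id: str, root: str = "./datasets") -> Path:
    """Mirror the CLI example: store the dataset under <root>/<org>/<name>."""
    org, sep, name = repo_id.partition("/")
    if not (org and sep and name) or "/" in name:
        raise ValueError(f"expected '<org>/<name>', got {repo_id!r}")
    return Path(root) / org / name


def download_dataset(repo_id: str, root: str = "./datasets") -> Path:
    """Download a dataset repository snapshot from the HuggingFace Hub."""
    # Imported lazily so the path helper works without the package installed.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=str(target))
    return target
```

Calling `download_dataset("lerobot/aloha_sim_insertion_human")` should place the files in the same directory as the CLI command above.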

Step 2b: Download from Azure Blob Storage

[!NOTE] Steps 2b and 3 are only required for local usage. When training via OSMO, the submit-osmo-lerobot-training.sh script automatically downloads the dataset when --blob-url is used.

For datasets stored in Azure, use the download utility. Create a .env file in training/il/scripts/lerobot/ with the required environment variables:

BLOB_URLS='["https://<your-storage-account>.blob.core.windows.net/datasets/my-dataset/v1"]'
DATASET_ROOT=./datasets
DATASET_REPO_ID=my-org/my-dataset

Run the download script:

cd training/il/scripts/lerobot
set -a && source .env && set +a
python download_dataset.py

The script uses DefaultAzureCredential for authentication. It downloads all dataset files, skipping cache and lock files, and preserves the directory structure.
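
To make the behavior described above concrete, here is a minimal sketch of how such a download could work. It is not the actual `download_dataset.py` implementation. The function names are hypothetical, and the skip rule (files ending in `.lock` and anything under a `.cache` directory) is an assumption about what "cache and lock files" means. It relies on the `azure-identity` and `azure-storage-blob` packages.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def split_blob_url(url: str) -> tuple[str, str, str]:
    """Split a blob URL into (account_url, container, prefix)."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"not an https blob URL: {url!r}")
    container, _, prefix = parsed.path.lstrip("/").partition("/")
    if not container:
        raise ValueError(f"URL has no container segment: {url!r}")
    return f"https://{parsed.netloc}", container, prefix.strip("/")


def should_skip(blob_name: str) -> bool:
    """Assumed rule: skip *.lock files and anything under a .cache directory."""
    path = PurePosixPath(blob_name)
    return path.suffix == ".lock" or ".cache" in path.parts


def download_prefix(url: str, dest: str) -> int:
    """Download every non-skipped blob under the URL prefix into dest."""
    # Imported lazily so the pure helpers above work without the Azure SDK.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import ContainerClient

    account_url, container, prefix = split_blob_url(url)
    client = ContainerClient(account_url, container, credential=DefaultAzureCredential())
    # A trailing slash keeps "v1" from also matching "v10/...".
    listing_prefix = f"{prefix}/" if prefix else None
    count = 0
    for blob in client.list_blobs(name_starts_with=listing_prefix):
        if should_skip(blob.name):
            continue
        # Keep the path relative to the prefix so the layout is preserved.
        relative = blob.name[len(prefix):].lstrip("/")
        if not relative:
            continue
        target = Path(dest) / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        with open(target, "wb") as fh:
            client.download_blob(blob.name).readinto(fh)
        count += 1
    return count
```

`DefaultAzureCredential` picks up the `az login` session listed in the prerequisites, so no keys need to appear in the `.env` file.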

Step 3: Inspect the dataset structure

A valid LeRobot dataset follows this directory layout:

datasets/lerobot/aloha_sim_insertion_human/
├── meta/
│   └── info.json              # Dataset metadata (features, shapes, fps)
├── data/
│   ├── chunk-000/
│   │   ├── episode_000000.parquet
│   │   ├── episode_000001.parquet
│   │   └── ...
├── videos/                    # Optional video observations
│   └── chunk-000/
│       ├── episode_000000.mp4
│       └── ...
└── stats.json                 # Feature statistics for normalization
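
The layout above can also be checked programmatically. The following sketch is an illustration rather than an official validator: `check_layout` is a hypothetical helper that only looks for `meta/info.json` and the episode files, and compares the file count against the `total_episodes` field when that field is present.

```python
import json
from pathlib import Path


def check_layout(root: str) -> list[str]:
    """Return a list of problems found; an empty list means no issues detected."""
    base = Path(root)
    problems: list[str] = []

    info_path = base / "meta" / "info.json"
    if not info_path.is_file():
        return [f"missing {info_path}"]
    try:
        info = json.loads(info_path.read_text())
    except json.JSONDecodeError as exc:
        return [f"{info_path} is not valid JSON: {exc}"]

    # Episode files live in chunk directories, as in the tree above.
    episodes = sorted(base.glob("data/chunk-*/episode_*.parquet"))
    if not episodes:
        problems.append("no episode_*.parquet files under data/chunk-*/")

    # Cross-check the metadata against what is actually on disk.
    expected = info.get("total_episodes")
    if isinstance(expected, int) and episodes and expected != len(episodes):
        problems.append(
            f"info.json lists {expected} episodes, found {len(episodes)} files"
        )
    return problems
```

For example, `check_layout("datasets/lerobot/aloha_sim_insertion_human")` returns an empty list when nothing is missing.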

Verify the structure:

# Check info.json exists and has expected fields
python -c "
import json
from pathlib import Path

info = json.loads(Path('datasets/lerobot/aloha_sim_insertion_human/meta/info.json').read_text())
print(f'Dataset: {info.get(\"repo_id\", \"unknown\")}')
print(f'Episodes: {info.get(\"total_episodes\", \"unknown\")}')
print(f'Frames: {info.get(\"total_frames\", \"unknown\")}')
print(f'FPS: {info.get(\"fps\", \"unknown\")}')
"

Step 4: Validate episode files

Check that the Parquet episode files are readable and contain the expected columns. This check requires `pyarrow` (`pip install pyarrow`):

python -c "
import pyarrow.parquet as pq
from pathlib import Path

data_dir = Path('datasets/lerobot/aloha_sim_insertion_human/data/chunk-000')
episodes = sorted(data_dir.glob('episode_*.parquet'))
print(f'Found {len(episodes)} episode files')
if not episodes:
    raise SystemExit(f'No episode files found in {data_dir}')

# Inspect the first episode
table = pq.read_table(episodes[0])
print(f'Columns: {table.column_names}')
print(f'Rows: {table.num_rows}')
"

Step 5: Browse with the Dataset Viewer (optional)

Launch the Dataset Analysis Tool for visual episode inspection:

cd data-management/viewer
./start.sh

Open http://localhost:5173 in a browser. The viewer provides episode browsing, frame-level annotation, trajectory visualization, and data quality metrics.

Step 6: Connect to training

With the dataset validated, submit a training job using either the HuggingFace repository ID or a direct Azure Blob URL:

# From HuggingFace (dataset downloaded on-the-fly by the training container)
cd training/il/scripts
./submit-osmo-lerobot-training.sh -d lerobot/aloha_sim_insertion_human

# From Azure Blob (dataset downloaded at job start)
./submit-osmo-lerobot-training.sh \
  --blob-url https://<your-storage-account>.blob.core.windows.net/datasets/my-dataset/v1

See Your First LeRobot Training Job for the full training recipe.
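
If the submission is driven from Python automation rather than an interactive shell, a thin wrapper can assemble the command. This is a sketch under assumptions: the helper names are hypothetical, only the `-d` and `--blob-url` flags shown above are used, and treating the two flags as mutually exclusive is an assumption rather than documented behavior of the script.

```python
import subprocess
from typing import Optional


def build_submit_command(
    dataset: Optional[str] = None, blob_url: Optional[str] = None
) -> list[str]:
    """Build the argument list for submit-osmo-lerobot-training.sh."""
    # Assumption: exactly one dataset source is given per job.
    if (dataset is None) == (blob_url is None):
        raise ValueError("pass exactly one of dataset or blob_url")
    cmd = ["./submit-osmo-lerobot-training.sh"]
    if dataset is not None:
        cmd += ["-d", dataset]
    else:
        cmd += ["--blob-url", blob_url]
    return cmd


def submit(cmd: list[str], scripts_dir: str = "training/il/scripts") -> None:
    """Run the command from the scripts directory, raising on a non-zero exit."""
    subprocess.run(cmd, cwd=scripts_dir, check=True)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain special characters.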

✅ Verify

The recipe succeeded when:

Dataset directory contains meta/info.json with valid metadata
Episode parquet files are readable with expected columns
info.json reports expected episode and frame counts
(Optional) Dataset Viewer displays episodes without errors
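
The episode-related checks above can be automated across every episode file instead of only the first one. The sketch below is one possible approach: `summarize_episodes` and `read_parquet_meta` are hypothetical helpers, and comparing the summed row count with `total_frames` assumes each Parquet row is one frame. Only Parquet footers are read, so the check stays fast on large datasets.

```python
from pathlib import Path
from typing import Callable, Iterable


def read_parquet_meta(path: Path) -> tuple[list[str], int]:
    """Read column names and row count from the Parquet footer only."""
    import pyarrow.parquet as pq  # lazy import; requires `pip install pyarrow`

    return pq.read_schema(path).names, pq.read_metadata(path).num_rows


def summarize_episodes(
    files: Iterable[Path],
    read_meta: Callable[[Path], tuple[list[str], int]] = read_parquet_meta,
) -> dict:
    """Count files and rows, and flag files whose columns differ from the first."""
    reference = None
    total_rows = 0
    count = 0
    mismatched: list[str] = []
    for path in files:
        columns, rows = read_meta(path)
        if reference is None:
            reference = list(columns)  # the first file defines the expected schema
        elif list(columns) != reference:
            mismatched.append(str(path))
        total_rows += rows
        count += 1
    return {
        "files": count,
        "frames": total_rows,
        "columns": reference or [],
        "mismatched": mismatched,
    }
```

A typical call passes `sorted(Path(root).glob("data/chunk-*/episode_*.parquet"))` and then compares `files` and `frames` with `total_episodes` and `total_frames` from `meta/info.json`.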

⚙️ Configuration Reference

download_dataset.py environment variables:

Variable	Required	Default	Description
`BLOB_URLS`	yes	—	Non-empty JSON array of direct Azure Blob URLs
`DATASET_ROOT`	no	`/workspace/data`	Local root directory for datasets
`DATASET_REPO_ID`	yes	—	Dataset identifier relative to `DATASET_ROOT`
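
The table above can be read as a small validation routine. The following sketch shows one way to load and check these variables. It is not the script's actual code, and `DownloadConfig` and `load_config` are hypothetical names.

```python
import json
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DownloadConfig:
    blob_urls: list[str]
    dataset_root: str
    dataset_repo_id: str


def load_config(env: Mapping[str, str] = os.environ) -> DownloadConfig:
    """Validate the variables from the table and apply the documented default."""
    raw = env.get("BLOB_URLS")
    if not raw:
        raise ValueError("BLOB_URLS is required")
    try:
        urls = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"BLOB_URLS is not valid JSON: {exc}") from exc
    if (
        not isinstance(urls, list)
        or not urls
        or not all(isinstance(u, str) and u for u in urls)
    ):
        raise ValueError("BLOB_URLS must be a non-empty JSON array of strings")

    repo_id = env.get("DATASET_REPO_ID")
    if not repo_id:
        raise ValueError("DATASET_REPO_ID is required")

    # DATASET_ROOT is optional and falls back to the default from the table.
    root = env.get("DATASET_ROOT", "/workspace/data")
    return DownloadConfig(urls, root, repo_id)
```

Running a check like this before starting a download surfaces a malformed `BLOB_URLS` value early, for example one written with single quotes inside the JSON array.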

🔗 Related Recipes

Configuring Edge Data Recording: capture your own training data
Your First LeRobot Training Job: train with the prepared dataset
End-to-End LeRobot Pipeline: automated train → evaluate → register

🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.
