Preparing Datasets for Training

Download a dataset from Azure Blob Storage or HuggingFace, inspect its structure, validate format compliance, and connect it to a LeRobot training workflow. By the end of this recipe, you will have a training-ready dataset on your local machine or in a cloud-accessible location.

[!NOTE] This recipe covers dataset preparation. For training with the prepared dataset, continue to Your First LeRobot Training Job.

📋 Prerequisites

| Requirement | Details |
|---|---|
| Python | 3.12+ with uv or pip |
| Azure CLI | Authenticated (az login); needed for Azure Blob datasets |
| Azure Storage | Storage account with dataset container; needed for Azure Blob datasets |
| HuggingFace CLI | pip install huggingface-hub; needed for HuggingFace datasets |

🚀 Steps

Step 1: Choose a dataset source

LeRobot datasets come from two sources:

| Source | When to use | Example |
|---|---|---|
| HuggingFace Hub | Public community datasets, quick experimentation | lerobot/aloha_sim_insertion_human |
| Azure Blob Storage | Private datasets, recorded edge data uploaded to Azure | Custom organization datasets |

Step 2a: Download from HuggingFace

For public datasets, use the HuggingFace CLI:

pip install huggingface-hub
huggingface-cli download \
lerobot/aloha_sim_insertion_human \
--repo-type dataset \
--local-dir ./datasets/lerobot/aloha_sim_insertion_human
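The same download can be scripted from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed. `snapshot_download` is that library's function; `local_dir_for` and `download_dataset` are hypothetical helper names introduced here to mirror the `--local-dir` layout used in the CLI command.

```python
from pathlib import Path


def local_dir_for(repo_id: str, root: str = "./datasets") -> Path:
    """Return <root>/<org>/<name> for a repository ID of the form '<org>/<name>'."""
    org, _, name = repo_id.partition("/")
    if not org or not name:
        raise ValueError(f"expected '<org>/<name>', got {repo_id!r}")
    return Path(root) / org / name


def download_dataset(repo_id: str, root: str = "./datasets") -> Path:
    """Download all files of a Hub dataset repository (hypothetical helper)."""
    # Imported lazily so the path helper works without the package installed.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=str(target))
    return target


# Example call (requires network access):
# download_dataset("lerobot/aloha_sim_insertion_human")
```

Keeping the path logic separate from the network call makes it easy to confirm where files will land before starting a large download.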

Step 2b: Download from Azure Blob Storage

[!NOTE] Steps 2b and 3 are only required for local usage. When training via OSMO, the submit-osmo-lerobot-training.sh script automatically downloads the dataset when --blob-url is used.

For datasets stored in Azure, use the download utility. Create a .env file in training/il/scripts/lerobot/ with the required environment variables:

BLOB_URLS='["https://<your-storage-account>.blob.core.windows.net/datasets/my-dataset/v1"]'
DATASET_ROOT=./datasets
DATASET_REPO_ID=my-org/my-dataset
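Each entry in BLOB_URLS follows the usual Azure Blob layout of storage account, container, and path prefix. The sketch below splits a URL into those parts so a typo can be caught before the download starts. It assumes the public-cloud `blob.core.windows.net` endpoint shown in the example; `BlobLocation` and `parse_blob_url` are hypothetical names and are not part of download_dataset.py.

```python
from typing import NamedTuple
from urllib.parse import urlparse

# Assumption: public Azure cloud endpoint, as in the example URL.
BLOB_HOST_SUFFIX = ".blob.core.windows.net"


class BlobLocation(NamedTuple):
    account: str
    container: str
    prefix: str


def parse_blob_url(url: str) -> BlobLocation:
    """Split an https Blob URL into storage account, container, and path prefix."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if parsed.scheme != "https" or not host.endswith(BLOB_HOST_SUFFIX):
        raise ValueError(f"not an Azure Blob URL: {url!r}")
    account = host[: -len(BLOB_HOST_SUFFIX)]
    container, _, prefix = parsed.path.lstrip("/").partition("/")
    if not account or not container:
        raise ValueError(f"missing account or container in {url!r}")
    return BlobLocation(account, container, prefix.rstrip("/"))
```

For a URL ending in `/datasets/my-dataset/v1`, the container is `datasets` and the prefix is `my-dataset/v1`.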

Run the download script:

cd training/il/scripts/lerobot
set -a && source .env && set +a
python download_dataset.py

The script uses DefaultAzureCredential for authentication. It downloads all dataset files, skipping cache and lock files, and preserves the directory structure.
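To illustrate what "skipping cache and lock files" and "preserving the directory structure" can look like, here is a rough sketch of the filtering and path-mapping step only. It is not the logic of download_dataset.py: the skip rules (a `.cache` path component or a `.lock` suffix) and the function names are assumptions made for illustration.

```python
from pathlib import Path, PurePosixPath


def should_skip(blob_name: str) -> bool:
    """Assumed rule: skip *.lock files and anything under a .cache directory."""
    path = PurePosixPath(blob_name)
    return path.suffix == ".lock" or ".cache" in path.parts


def local_path_for(blob_name: str, prefix: str, dest: Path) -> Path:
    """Map a blob name to a local path, keeping the layout below the prefix."""
    relative = PurePosixPath(blob_name).relative_to(prefix)
    return dest.joinpath(*relative.parts)


def plan_downloads(blob_names: list[str], prefix: str, dest: Path) -> dict[str, Path]:
    """Return {blob name: local target path} for every blob that is kept."""
    return {
        name: local_path_for(name, prefix, dest)
        for name in blob_names
        if not should_skip(name)
    }
```

Computing the plan before transferring any bytes allows a dry run: print the mapping, confirm the targets, then download.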

Step 3: Inspect the dataset structure

A valid LeRobot dataset follows this directory layout:

datasets/lerobot/aloha_sim_insertion_human/
├── meta/
│   └── info.json              # Dataset metadata (features, shapes, fps)
├── data/
│   ├── chunk-000/
│   │   ├── episode_000000.parquet
│   │   ├── episode_000001.parquet
│   │   └── ...
├── videos/                    # Optional video observations
│   └── chunk-000/
│       ├── episode_000000.mp4
│       └── ...
└── stats.json                 # Feature statistics for normalization

Verify the structure:

# Check info.json exists and has expected fields
python -c "
import json
from pathlib import Path

info = json.loads(Path('datasets/lerobot/aloha_sim_insertion_human/meta/info.json').read_text())
print(f'Dataset: {info.get(\"repo_id\", \"unknown\")}')
print(f'Episodes: {info.get(\"total_episodes\", \"unknown\")}')
print(f'Frames: {info.get(\"total_frames\", \"unknown\")}')
print(f'FPS: {info.get(\"fps\", \"unknown\")}')
"
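The one-liner prints "unknown" for missing fields but never fails, which is easy to overlook in a script. A stricter variant could return a list of problems instead. This sketch assumes that `total_episodes`, `total_frames`, and `fps` should be present and positive; that requirement, and the name `check_info`, are assumptions based on the fields printed above rather than a formal LeRobot schema.

```python
import json
from pathlib import Path

# Assumed required fields and their accepted types.
EXPECTED_FIELDS = {
    "total_episodes": int,
    "total_frames": int,
    "fps": (int, float),
}


def check_info(dataset_dir) -> list[str]:
    """Return a list of problems found in meta/info.json (empty means OK)."""
    info_path = Path(dataset_dir) / "meta" / "info.json"
    if not info_path.is_file():
        return [f"missing {info_path}"]
    try:
        info = json.loads(info_path.read_text())
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(info, dict):
        return ["info.json must contain a JSON object"]

    problems = []
    for field, expected in EXPECTED_FIELDS.items():
        value = info.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{field} has unexpected type {type(value).__name__}")
        elif value <= 0:
            problems.append(f"{field} must be positive")
    return problems
```

An empty list means the metadata passed; anything else can be printed and used to exit with a non-zero status in CI.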

Step 4: Validate episode files

Check that the Parquet episode files are readable and contain the expected columns. The snippet requires the pyarrow package:

python -c "
import pyarrow.parquet as pq
from pathlib import Path

data_dir = Path('datasets/lerobot/aloha_sim_insertion_human/data/chunk-000')
episodes = sorted(data_dir.glob('episode_*.parquet'))
print(f'Found {len(episodes)} episode files')

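# Inspect the first episode; stop with a clear message if none were found
if not episodes:
    raise SystemExit('No episode files found in ' + str(data_dir))
table = pq.read_table(episodes[0])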
print(f'Columns: {table.column_names}')
print(f'Rows: {table.num_rows}')
"
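Reading one file confirms the format but not that the download is complete. A complementary check, using only the standard library, is to look for gaps in the episode numbering. The sketch assumes episodes are numbered from 0 and named `episode_<index>.parquet` inside `chunk-*` directories, as in the layout above; `episode_indices` and `missing_episodes` are hypothetical helper names.

```python
import re
from pathlib import Path

EPISODE_RE = re.compile(r"episode_(\d+)\.parquet$")


def episode_indices(data_dir) -> list[int]:
    """Collect the sorted episode indices found under data_dir/chunk-*/."""
    indices = []
    for path in Path(data_dir).glob("chunk-*/episode_*.parquet"):
        match = EPISODE_RE.search(path.name)
        if match:
            indices.append(int(match.group(1)))
    return sorted(indices)


def missing_episodes(indices: list[int]) -> list[int]:
    """Return indices absent from 0..max(indices); assumes numbering starts at 0."""
    if not indices:
        return []
    present = set(indices)
    return [i for i in range(max(indices) + 1) if i not in present]
```

A non-empty result from `missing_episodes` usually points to an interrupted download or an upload that skipped files.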

Step 5: Browse with the Dataset Viewer (optional)

Launch the Dataset Analysis Tool for visual episode inspection:

cd data-management/viewer
./start.sh

Open http://localhost:5173 in a browser. The viewer provides episode browsing, frame-level annotation, trajectory visualization, and data quality metrics.

Step 6: Connect to training

With the dataset validated, submit a training job using either the HuggingFace repository ID or a direct Azure Blob URL:

# From HuggingFace (dataset downloaded on-the-fly by the training container)
cd training/il/scripts
./submit-osmo-lerobot-training.sh -d lerobot/aloha_sim_insertion_human

# From Azure Blob (dataset downloaded at job start)
./submit-osmo-lerobot-training.sh \
--blob-url https://<your-storage-account>.blob.core.windows.net/datasets/my-dataset/v1

See Your First LeRobot Training Job for the full training recipe.

✅ Verify

The recipe succeeded when:

  • Dataset directory contains meta/info.json with valid metadata
  • Episode parquet files are readable with expected columns
  • info.json reports expected episode and frame counts
  • (Optional) Dataset Viewer displays episodes without errors
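Part of this checklist can be automated. The sketch below covers the metadata and count checks only: it does not open the Parquet files, so Step 4 is still needed for readability. It assumes the directory layout from Step 3 and that `total_episodes` should equal the number of episode files on disk; `verify_dataset` is a hypothetical name.

```python
import json
from pathlib import Path


def verify_dataset(dataset_dir) -> dict[str, bool]:
    """Run the file-level checks from the list above and report each result."""
    root = Path(dataset_dir)
    try:
        info = json.loads((root / "meta" / "info.json").read_text())
    except (OSError, json.JSONDecodeError):
        info = None  # Missing or unparsable metadata
    info_ok = isinstance(info, dict)

    # Count episode files across all chunk directories.
    episodes = list((root / "data").glob("chunk-*/episode_*.parquet"))

    return {
        "info_json_valid": info_ok,
        "episodes_found": bool(episodes),
        "episode_count_matches": info_ok and info.get("total_episodes") == len(episodes),
    }
```

Calling `all(verify_dataset(path).values())` gives a single pass/fail result, while the dictionary shows which check failed.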

⚙️ Configuration Reference

download_dataset.py environment variables:

| Variable | Required | Default | Description |
|---|---|---|---|
| BLOB_URLS | yes | | Non-empty JSON array of direct Azure Blob URLs |
| DATASET_ROOT | no | /workspace/data | Local root directory for datasets |
| DATASET_REPO_ID | yes | | Dataset identifier relative to DATASET_ROOT |
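The rules in this table can be expressed as a small loader that fails early on a malformed .env file. This is a sketch of how such validation might look, not the code in download_dataset.py; `DownloadConfig` and `load_config` are hypothetical names, and only the variable names, the required flags, and the `/workspace/data` default come from the table.

```python
import json
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DownloadConfig:
    blob_urls: list[str]
    dataset_root: str
    dataset_repo_id: str


def load_config(env: Mapping[str, str] = os.environ) -> DownloadConfig:
    """Read and validate the three variables; raise ValueError on bad input."""
    raw_urls = env.get("BLOB_URLS")
    if not raw_urls:
        raise ValueError("BLOB_URLS is required")
    try:
        urls = json.loads(raw_urls)
    except json.JSONDecodeError as exc:
        raise ValueError("BLOB_URLS must be valid JSON") from exc
    if not isinstance(urls, list) or not urls or not all(isinstance(u, str) for u in urls):
        raise ValueError("BLOB_URLS must be a non-empty JSON array of strings")

    repo_id = env.get("DATASET_REPO_ID")
    if not repo_id:
        raise ValueError("DATASET_REPO_ID is required")

    # DATASET_ROOT is optional and falls back to the documented default.
    return DownloadConfig(urls, env.get("DATASET_ROOT", "/workspace/data"), repo_id)
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.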

🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.