# Preparing Datasets for Training

Download a dataset from Azure Blob Storage or HuggingFace, inspect its structure, validate format compliance, and connect it to a LeRobot training workflow. By the end of this recipe, you will have a training-ready dataset on your local machine or in a cloud-accessible location.

> [!NOTE]
> This recipe covers dataset preparation. For training with the prepared dataset, continue to Your First LeRobot Training Job.

## 📋 Prerequisites

| Requirement | Details |
| --- | --- |
| Python | 3.11+ with `uv` or `pip` |
| Azure CLI | Authenticated (`az login`) — for Azure Blob datasets |
| Azure Storage | Storage account with dataset container — for Azure Blob datasets |
| HuggingFace CLI | `pip install huggingface-hub` — for HuggingFace datasets |

## 🚀 Steps

### Step 1: Choose a dataset source

LeRobot datasets come from two sources:

| Source | When to use | Example |
| --- | --- | --- |
| HuggingFace Hub | Public community datasets, quick experimentation | `lerobot/aloha_sim_insertion_human` |
| Azure Blob Storage | Private datasets, recorded edge data uploaded to Azure | Custom organization datasets |

### Step 2a: Download from HuggingFace

For public datasets, use the HuggingFace CLI:

```bash
pip install huggingface-hub
huggingface-cli download \
  lerobot/aloha_sim_insertion_human \
  --repo-type dataset \
  --local-dir ./datasets/lerobot/aloha_sim_insertion_human
```
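If you prefer to script the download in Python, the `huggingface_hub` library offers `snapshot_download` with equivalent options. A minimal sketch (the `fetch_dataset` wrapper is a hypothetical helper, not part of LeRobot):

```python
from huggingface_hub import snapshot_download

def fetch_dataset(repo_id: str, local_dir: str) -> str:
    """Download a full dataset snapshot from the HuggingFace Hub.

    Returns the local directory containing the downloaded files.
    """
    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        local_dir=local_dir,
    )

# Example (downloads several hundred MB of data):
# fetch_dataset("lerobot/aloha_sim_insertion_human",
#               "./datasets/lerobot/aloha_sim_insertion_human")
```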

### Step 2b: Download from Azure Blob Storage

> [!NOTE]
> Steps 2b and 3 are only required for local usage. When training via OSMO, the submit-osmo-lerobot-training.sh script automatically downloads the dataset when the --from-blob argument is used.

For datasets stored in Azure, use the download utility. Create a .env file in training/il/scripts/lerobot/ with the required environment variables:

```bash
STORAGE_ACCOUNT=<your-storage-account>
STORAGE_CONTAINER=datasets
BLOB_PREFIX=my-dataset/v1
DATASET_ROOT=./datasets
DATASET_REPO_ID=my-org/my-dataset
```

Run the download script:

```bash
cd training/il/scripts/lerobot
set -a && source .env && set +a
python download_dataset.py
```

The script uses DefaultAzureCredential for authentication. It downloads all dataset files, skipping cache and lock files, and preserves the directory structure.

### Step 3: Inspect the dataset structure

A valid LeRobot dataset follows this directory layout:

```
datasets/lerobot/aloha_sim_insertion_human/
├── meta/
│   └── info.json                    # Dataset metadata (features, shapes, fps)
├── data/
│   └── chunk-000/
│       ├── episode_000000.parquet
│       ├── episode_000001.parquet
│       └── ...
├── videos/                          # Optional video observations
│   └── chunk-000/
│       ├── episode_000000.mp4
│       └── ...
└── stats.json                       # Feature statistics for normalization
```

Verify the structure:

```bash
# Check info.json exists and has expected fields
python -c "
import json
from pathlib import Path

info = json.loads(Path('datasets/lerobot/aloha_sim_insertion_human/meta/info.json').read_text())
print(f'Dataset: {info.get(\"repo_id\", \"unknown\")}')
print(f'Episodes: {info.get(\"total_episodes\", \"unknown\")}')
print(f'Frames: {info.get(\"total_frames\", \"unknown\")}')
print(f'FPS: {info.get(\"fps\", \"unknown\")}')
"
```

### Step 4: Validate episode files

Check that parquet episode files are readable and contain expected columns:

```bash
python -c "
import pyarrow.parquet as pq
from pathlib import Path

data_dir = Path('datasets/lerobot/aloha_sim_insertion_human/data/chunk-000')
episodes = sorted(data_dir.glob('episode_*.parquet'))
print(f'Found {len(episodes)} episode files')

# Inspect the first episode
table = pq.read_table(episodes[0])
print(f'Columns: {table.column_names}')
print(f'Rows: {table.num_rows}')
"
```

### Step 5: Browse with the Dataset Viewer (optional)

Launch the Dataset Analysis Tool for visual episode inspection:

```bash
cd data-management/viewer
./start.sh
```

Open http://localhost:5173 in a browser. The viewer provides episode browsing, frame-level annotation, trajectory visualization, and data quality metrics.

### Step 6: Connect to training

With the dataset validated, submit a training job using the dataset path or repository ID:

```bash
# From HuggingFace (dataset downloaded on-the-fly by the training container)
cd training/il/scripts
./submit-osmo-lerobot-training.sh -d lerobot/aloha_sim_insertion_human

# From Azure Blob (dataset downloaded at job start)
./submit-osmo-lerobot-training.sh \
  -d my-org/my-dataset \
  --from-blob \
  --storage-account <your-storage-account> \
  --blob-prefix my-dataset/v1
```

See Your First LeRobot Training Job for the full training recipe.

## ✅ Verify

The recipe succeeded when:

- Dataset directory contains `meta/info.json` with valid metadata
- Episode parquet files are readable with expected columns
- `info.json` reports expected episode and frame counts
- (Optional) Dataset Viewer displays episodes without errors

## ⚙️ Configuration Reference

`download_dataset.py` environment variables:

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `STORAGE_ACCOUNT` | yes | | Azure Storage account name |
| `STORAGE_CONTAINER` | no | `datasets` | Blob container name |
| `BLOB_PREFIX` | yes | | Blob path prefix for dataset files |
| `DATASET_ROOT` | no | `/workspace/data` | Local root directory for datasets |
| `DATASET_REPO_ID` | yes | | Dataset identifier (e.g., `user/dataset`) |
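For scripting around the download, the variables above can be collected with their documented defaults applied. A sketch under stated assumptions (the helper name and error handling are illustrative; `download_dataset.py` may parse its environment differently):

```python
import os

REQUIRED = ("STORAGE_ACCOUNT", "BLOB_PREFIX", "DATASET_REPO_ID")

def load_download_config(env=None) -> dict:
    """Collect download settings, applying the documented defaults."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"Missing required variables: {', '.join(missing)}")
    return {
        "storage_account": env["STORAGE_ACCOUNT"],
        "storage_container": env.get("STORAGE_CONTAINER", "datasets"),
        "blob_prefix": env["BLOB_PREFIX"],
        "dataset_root": env.get("DATASET_ROOT", "/workspace/data"),
        "dataset_repo_id": env["DATASET_REPO_ID"],
    }
```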

🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.