Datasets
BenchmarkQED offers two datasets to facilitate the development and evaluation of Retrieval-Augmented Generation (RAG) systems:
- Podcast Transcripts: Contains transcripts from 70 episodes of the Behind the Tech podcast series. This is an updated version of the dataset featured in the GraphRAG paper.
- AP News: Includes 1,397 health-related news articles from the Associated Press.
Downloading to Local Filesystem
To download these datasets programmatically, use the following commands:
- Podcast Transcripts:
- AP News:
Replace OUTPUT_DIR with the path to the directory where you want the dataset to be saved.
Downloading to Azure Blob Storage
You can download datasets directly into Azure Blob Storage by providing storage options:
- Using managed identity:
- Using a connection string:
The OUTPUT_DIR argument (e.g., input) becomes the prefix path within the blob container. The dataset zip is downloaded from GitHub, extracted in memory, and each file is uploaded directly to the storage backend.
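The extract-and-upload flow described above can be sketched roughly as follows. This is an illustration only, not BenchmarkQED's actual implementation: the function name `upload_zip_contents` and the `upload` callback are hypothetical, and the archive is assumed to be already fetched into memory as bytes.

```python
import io
import posixpath
import zipfile
from typing import Callable, List


def upload_zip_contents(
    zip_bytes: bytes,
    prefix: str,
    upload: Callable[[str, bytes], None],
) -> List[str]:
    """Extract a zip archive in memory and pass each file to `upload`.

    `prefix` plays the role of OUTPUT_DIR: it is prepended to every
    file path inside the archive. `upload` is a hypothetical callback
    that stores (name, data) in whatever backend is in use.
    """
    uploaded: List[str] = []
    # Wrap the raw bytes in a file-like object so nothing touches disk.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            name = posixpath.join(prefix.strip("/"), info.filename.lstrip("/"))
            upload(name, archive.read(info))
            uploaded.append(name)
    return uploaded
```

With the `azure-storage-blob` package, the callback could plausibly wrap `ContainerClient.upload_blob(name, data, overwrite=True)`; for a quick local check, a dictionary's `__setitem__` method works as a stand-in backend.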
Storage Options Reference
!!! note "Supported cloud backends"
    Only Azure Blob Storage (--storage-type blob) is currently supported. Azure Cosmos DB and other backends are not supported.
| Option | Description |
|---|---|
| --storage-type | Storage backend: blob for Azure Blob Storage. Omit for local filesystem. |
| --container-name | The blob container name. |
| --account-url | The storage account URL (uses managed identity for authentication). |
| --connection-string | The storage connection string (alternative to --account-url). |
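One way to read how these options combine is sketched below. The `StorageOptions` class and `validate_storage_options` function are hypothetical and are not part of BenchmarkQED. The rule that exactly one of the account URL or the connection string must be supplied is an assumption, inferred from the connection string being described as an alternative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageOptions:
    """Hypothetical container mirroring the four CLI options."""
    storage_type: Optional[str] = None
    container_name: Optional[str] = None
    account_url: Optional[str] = None
    connection_string: Optional[str] = None


def validate_storage_options(opts: StorageOptions) -> str:
    """Return the mode implied by the options, or raise ValueError."""
    if opts.storage_type is None:
        return "local"  # no storage type: local filesystem
    if opts.storage_type != "blob":
        # Only Azure Blob Storage is documented as supported.
        raise ValueError(f"unsupported storage type: {opts.storage_type}")
    if not opts.container_name:
        raise ValueError("blob storage needs a container name")
    # Assumption: the two credentials are mutually exclusive.
    if bool(opts.account_url) == bool(opts.connection_string):
        raise ValueError("provide exactly one of account URL or connection string")
    return "managed_identity" if opts.account_url else "connection_string"
```

Checking the combination up front gives a clear error before any download starts, rather than a failure partway through an upload.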
You can also find these datasets in the datasets directory.