How To Configure Data
Olive data config organizes data loading, preprocessing, batching, and post-processing into a single JSON config, and defines several popular templates to serve Olive optimization.
Data configs are placed in the Olive run config under the data_configs field, which maps each data config name to its definition. Here is an example of data config: open_llama_sparsegpt_gpu.
"data_configs": {
    "dataset_1": {...},
    "dataset_2": {...}
}
To use a data config in an Olive pass or evaluator, refer to it by its key name. For the open_llama_sparsegpt_gpu case above, the pass and evaluator data configs look like this: open_llama_sparsegpt_gpu data_config reference.
"evaluators": {
    "common_evaluator": {
        ...,
        "data_config": "dataset_1"
    },
    ...
},
"passes": {
    "common_pass": {
        ...,
        "data_config": "dataset_2"
    },
    ...
}
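Because passes and evaluators refer to data configs only by key name, a misspelled key is easy to miss. The following is a minimal sketch of a pre-flight check, assuming the run config has the shape shown above; the helper `find_unresolved_references` and the sample config are hypothetical and are not part of Olive.

```python
import json

# Hypothetical run config fragment mirroring the structure shown above.
run_config = json.loads("""
{
    "data_configs": {
        "dataset_1": {"name": "dataset_1", "type": "DummyDataContainer"},
        "dataset_2": {"name": "dataset_2", "type": "HuggingfaceContainer"}
    },
    "evaluators": {"common_evaluator": {"data_config": "dataset_1"}},
    "passes": {"common_pass": {"data_config": "dataset_2"}}
}
""")


def find_unresolved_references(config):
    """Return (section, item, key) tuples whose data_config key is not defined."""
    defined = config.get("data_configs", {})
    missing = []
    for section in ("evaluators", "passes"):
        for item_name, item in config.get(section, {}).items():
            key = item.get("data_config")
            if key is not None and key not in defined:
                missing.append((section, item_name, key))
    return missing


unresolved = find_unresolved_references(run_config)
```

An empty result means every referenced key exists under data_configs.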
Before diving into the generic data config, let's first look at the data config templates.
Supported Data Config Template
The data config templates are defined in the olive.data.template module, which makes it easy to create a data_config.
Currently, the following data containers can be generated from olive.data.template:
1. DummyDataContainer : Convert the dummy data config to the data container.
{
    "name": "dummy_data_config_template",
    "type": "DummyDataContainer",
    "params_config": {
        "input_shapes": [[1, 128], [1, 128], [1, 128]],
        "input_names": ["input_ids", "attention_mask", "token_type_ids"],
        "input_types": ["int64", "int64", "int64"]
    }
}
from olive.data.config import DataConfig

data_config = DataConfig(
    name="dummy_data_config_template",
    type="DummyDataContainer",
    params_config={
        "input_shapes": [[1, 128], [1, 128], [1, 128]],
        "input_names": ["input_ids", "attention_mask", "token_type_ids"],
        "input_types": ["int64", "int64", "int64"],
    },
)
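In the dummy template, input_shapes, input_names, and input_types appear to be parallel lists, with entry i of each describing one model input. The sketch below pairs them up and rejects mismatched lengths. The helper `describe_dummy_inputs` is hypothetical, and the code illustrates only the config layout, not how Olive builds the dummy data.

```python
from math import prod

params_config = {
    "input_shapes": [[1, 128], [1, 128], [1, 128]],
    "input_names": ["input_ids", "attention_mask", "token_type_ids"],
    "input_types": ["int64", "int64", "int64"],
}


def describe_dummy_inputs(params):
    """Pair each input name with its shape, dtype, and element count."""
    shapes = params["input_shapes"]
    names = params["input_names"]
    types = params["input_types"]
    if not (len(shapes) == len(names) == len(types)):
        raise ValueError("input_shapes, input_names and input_types must have equal length")
    return {
        name: {"shape": shape, "dtype": dtype, "num_elements": prod(shape)}
        for name, shape, dtype in zip(names, shapes, types)
    }


dummy_inputs = describe_dummy_inputs(params_config)
```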
2. HuggingfaceContainer : Convert the huggingface data config to the data container.
{
    "name": "huggingface_data_config_template",
    "type": "HuggingfaceContainer",
    "params_config": {
        "model_name": "Intel/bert-base-uncased-mrpc",
        "task_type": "text-classification",
        "batch_size": 1,
        "data_name": "glue",
        "input_cols": ["sentence1", "sentence2"],
        "label_cols": ["label"],
        "split": "validation",
        "subset": "mrpc"
    }
}
from olive.data.config import DataConfig

data_config = DataConfig(
    name="huggingface_data_config_template",
    type="HuggingfaceContainer",
    params_config={
        "model_name": "Intel/bert-base-uncased-mrpc",
        "task_type": "text-classification",
        "batch_size": 1,
        "data_name": "glue",
        "input_cols": ["sentence1", "sentence2"],
        "label_cols": ["label"],
        "split": "validation",
        "subset": "mrpc",
    },
)
Note
If the input model for Olive is a Hugging Face model, the same dataset settings can instead be placed under input_model:
{
    "input_model": {
        "type": "PyTorchModel",
        "config": {
            "hf_config": {
                "model_name": "Intel/bert-base-uncased-mrpc",
                "task": "text-classification",
                "dataset": {
                    "data_name": "glue",
                    "subset": "mrpc",
                    "split": "validation",
                    "input_cols": ["sentence1", "sentence2"],
                    "label_cols": ["label"],
                    "batch_size": 1
                }
            }
        }
    }
}
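The hf_config form carries the same information as the HuggingfaceContainer template, except that the task appears as task here and as task_type in the template, and the dataset fields are nested under dataset. The sketch below makes that correspondence explicit. The helper `hf_config_to_params` is hypothetical and shows only how the field names in the two examples line up; it is not how Olive performs the conversion.

```python
def hf_config_to_params(hf_config):
    """Flatten an hf_config dict into HuggingfaceContainer-style params_config."""
    params = {
        "model_name": hf_config["model_name"],
        # The key is "task" in hf_config and "task_type" in the template.
        "task_type": hf_config["task"],
    }
    # Dataset fields use the same names in both forms, so copy them as-is.
    params.update(hf_config.get("dataset", {}))
    return params


hf_config = {
    "model_name": "Intel/bert-base-uncased-mrpc",
    "task": "text-classification",
    "dataset": {
        "data_name": "glue",
        "subset": "mrpc",
        "split": "validation",
        "input_cols": ["sentence1", "sentence2"],
        "label_cols": ["label"],
        "batch_size": 1,
    },
}

params_config = hf_config_to_params(hf_config)
```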
3. RawDataContainer : Convert the raw data config to the data container.
{
    "name": "raw_data",
    "type": "RawDataContainer",
    "params_config": {
        "data_dir": "data",
        "input_names": ["data"],
        "input_shapes": [[1, 3, 224, 224]],
        "input_dirs": ["."],
        "input_suffix": ".raw",
        "input_order_file": "input_order.txt"
    }
}
from olive.data.config import DataConfig

data_config = DataConfig(
    name="raw_data",
    type="RawDataContainer",
    params_config={
        "data_dir": "data",
        "input_names": ["data"],
        "input_shapes": [[1, 3, 224, 224]],
        "input_dirs": ["."],
        "input_suffix": ".raw",
        "input_order_file": "input_order.txt",
    },
)
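When preparing files for the raw data template, it can help to confirm that each .raw file is the size implied by its entry in input_shapes. The sketch below does that arithmetic. It assumes each file is a flat binary dump of float32 values (4 bytes per element), which the config above does not state, and the helpers `expected_raw_size` and `raw_file_matches` are hypothetical and not part of Olive.

```python
import os
import tempfile
from math import prod

# Assumption: float32 elements (4 bytes each); the config does not state the dtype.
BYTES_PER_ELEMENT = 4


def expected_raw_size(shape, bytes_per_element=BYTES_PER_ELEMENT):
    """Number of bytes a flat binary file must hold for one tensor of `shape`."""
    return prod(shape) * bytes_per_element


def raw_file_matches(path, shape):
    """Check that the file on disk has exactly the expected byte length."""
    return os.path.getsize(path) == expected_raw_size(shape)


shape = [1, 3, 224, 224]
size_bytes = expected_raw_size(shape)

# Demonstrate on a throwaway file filled with zero bytes of the expected length.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.raw")
    with open(sample_path, "wb") as f:
        f.write(bytes(size_bytes))
    matches = raw_file_matches(sample_path, shape)
```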
Generic Data Config
If no data config template meets the requirement, we can also define the data config directly. The data config is a dictionary with the following fields:
- name: the name of the data config.
- type: the type name of the data config. Available types:
  - DataContainer: the base class of all data containers.
  - DummyDataContainer: described in the section above.
  - HuggingfaceContainer: described in the section above.
  - RawDataContainer: described in the section above.
- components: a dictionary of four components. The available types for each component are:
  - load_dataset: local_dataset (default), simple_dataset, huggingface_dataset, raw_dataset
  - pre_process_data: pre_process (default), huggingface_pre_process, ner_huggingface_preprocess, text_generation_huggingface_pre_process
  - post_process_data: post_process (default), text_classification_post_process, ner_post_process, text_generation_post_process
  - dataloader: default_dataloader (default), skip_dataloader, no_auto_batch_dataloader
- Each component can be customized with the following fields:
  - name: the name of the component.
  - type: the type name of the component, chosen from the available component types. Besides the built-in types listed above, users can define their own component type in user_script, in the same way Olive does for huggingface_dataset. In that case, they need to provide user_script and script_dir. There is an example with customized component type.
  - params: a dictionary of component function parameters. Each key is a parameter name for the given component type, and each value is the parameter value.
  - user_script: the path to the user script that contains the customized component type.
  - script_dir: the path to the directory that contains the customized script.
Configs with built-in components:
The complete config then looks like this:
{
    "name": "data",
    "type": "DataContainer",
    "components": {
        "load_dataset": {
            "name": "_huggingface_dataset",
            "type": "huggingface_dataset",
            "params": {
                "data_dir": null,
                "data_name": "glue",
                "subset": "mrpc",
                "split": "validation",
                "data_files": null
            }
        },
        "pre_process_data": {
            "name": "_huggingface_pre_process",
            "type": "huggingface_pre_process",
            "params": {
                "model_name": "Intel/bert-base-uncased-mrpc",
                "input_cols": [
                    "sentence1",
                    "sentence2"
                ],
                "label_cols": [
                    "label"
                ],
                "max_samples": null
            }
        },
        "post_process_data": {
            "name": "_text_classification_post_process",
            "type": "text_classification_post_process",
            "params": {}
        },
        "dataloader": {
            "name": "_default_dataloader",
            "type": "default_dataloader",
            "params": {
                "batch_size": 1
            }
        }
    }
}
from olive.data.config import DataConfig

data_config = DataConfig(
    name="data",
    type="DataContainer",
    components={
        "load_dataset": {
            "name": "_huggingface_dataset",
            "type": "huggingface_dataset",
            "params": {
                "data_dir": None,
                "data_name": "glue",
                "subset": "mrpc",
                "split": "validation",
                "data_files": None,
            },
        },
        "pre_process_data": {
            "name": "_huggingface_pre_process",
            "type": "huggingface_pre_process",
            "params": {
                "model_name": "Intel/bert-base-uncased-mrpc",
                "input_cols": [
                    "sentence1",
                    "sentence2",
                ],
                "label_cols": [
                    "label",
                ],
                "max_samples": None,
            },
        },
        "post_process_data": {
            "name": "_text_classification_post_process",
            "type": "text_classification_post_process",
            "params": {},
        },
        "dataloader": {
            "name": "_default_dataloader",
            "type": "default_dataloader",
            "params": {
                "batch_size": 1,
            },
        },
    },
)
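A components dictionary can be checked before it is handed to Olive, to confirm that each component uses a built-in type. The sketch below does so against the built-in component types listed earlier in this section. The grouping of types under each component name is inferred from the type names, and the helper `check_components` is hypothetical and not part of Olive.

```python
# Built-in types per component, as listed in this section (grouping inferred from names).
BUILT_IN_TYPES = {
    "load_dataset": {
        "local_dataset", "simple_dataset", "huggingface_dataset", "raw_dataset",
    },
    "pre_process_data": {
        "pre_process", "huggingface_pre_process", "ner_huggingface_preprocess",
        "text_generation_huggingface_pre_process",
    },
    "post_process_data": {
        "post_process", "text_classification_post_process", "ner_post_process",
        "text_generation_post_process",
    },
    "dataloader": {
        "default_dataloader", "skip_dataloader", "no_auto_batch_dataloader",
    },
}


def check_components(components):
    """Return a list of human-readable problems; an empty list means all types are built in."""
    problems = []
    for slot, component in components.items():
        if slot not in BUILT_IN_TYPES:
            problems.append(f"unknown component: {slot}")
        elif component.get("type") not in BUILT_IN_TYPES[slot]:
            problems.append(f"{slot}: '{component.get('type')}' is not a built-in type")
    return problems


components = {
    "load_dataset": {"name": "_huggingface_dataset", "type": "huggingface_dataset"},
    "dataloader": {"name": "_default_dataloader", "type": "default_dataloader"},
}
problems = check_components(components)
```

A type flagged here is not necessarily wrong: it may be a customized type, in which case user_script and script_dir must be provided.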
Configs with customized components:
The example above rewrites every component in the data config, but that is not always necessary. For example, to customize only the load_dataset component of DataContainer, we can override just that component in the data config and leave the other components at their defaults.
{
    "name": "data",
    "type": "DataContainer",
    "user_script": "user_script.py",
    "script_dir": "user_dir",
    "components": {
        "load_dataset": {
            "name": "_huggingface_dataset",
            "type": "customized_huggingface_dataset",
            "params": {
                "data_dir": null,
                "data_name": "glue",
                "subset": "mrpc"
            }
        }
    }
}
# user_script.py
from olive.data.registry import Registry

@Registry.register_dataset()
def customized_huggingface_dataset(output):
    ...
from olive.data.config import DataConfig

data_config = DataConfig(
    name="data",
    type="DataContainer",
    user_script="user_script.py",
    script_dir="user_dir",
    components={
        "load_dataset": {
            "name": "_huggingface_dataset",
            "type": "customized_huggingface_dataset",
            "params": {
                "data_dir": None,
                "data_name": "glue",
                "subset": "mrpc",
            },
        },
    },
)
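The effect of overriding a single component can be pictured as overlaying the supplied components on a set of defaults. The sketch below illustrates that idea, assuming the defaults are the types marked "(default)" in the component list earlier in this section. The helper `effective_component_types` is hypothetical and is not Olive's merge logic.

```python
# Assumed defaults: the types marked "(default)" for each component in this section.
DEFAULT_COMPONENT_TYPES = {
    "load_dataset": "local_dataset",
    "pre_process_data": "pre_process",
    "post_process_data": "post_process",
    "dataloader": "default_dataloader",
}


def effective_component_types(components):
    """Overlay user-supplied component types on the defaults."""
    effective = dict(DEFAULT_COMPONENT_TYPES)  # copy so the defaults stay untouched
    for slot, component in components.items():
        effective[slot] = component["type"]
    return effective


user_components = {
    "load_dataset": {
        "name": "_huggingface_dataset",
        "type": "customized_huggingface_dataset",
        "params": {"data_dir": None, "data_name": "glue", "subset": "mrpc"},
    },
}

effective = effective_component_types(user_components)
```

Only load_dataset changes; the other three components keep their default types.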
Note
Users must provide user_script and script_dir to customize a component type. user_script should be a Python script that defines the customized component types, and script_dir should be the directory that contains user_script. Here is an example user_script:
from olive.data.registry import Registry

@Registry.register_dataset()
def customized_huggingface_dataset(dataset):
    ...

@Registry.register_pre_process()
def customized_huggingface_pre_process(dataset):
    ...

@Registry.register_post_process()
def customized_post_process(output):
    ...

@Registry.register_dataloader()
def customized_dataloader(dataset):
    ...
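The decorators above follow a common registry pattern: a decorator stores the function under a name so that a config can later refer to it by that string. The sketch below is a toy stand-in that shows the pattern in isolation. `ToyRegistry` is hypothetical, its use of the function name as the default key is an assumption, and it does not reproduce Olive's Registry class.

```python
class ToyRegistry:
    """Minimal stand-in for a component registry; not Olive's implementation."""

    _datasets = {}

    @classmethod
    def register_dataset(cls, name=None):
        """Return a decorator that records the function under `name`."""
        def decorator(func):
            # Assumption: fall back to the function name when no name is given.
            cls._datasets[name or func.__name__] = func
            return func
        return decorator

    @classmethod
    def get_dataset(cls, name):
        """Look up a registered dataset function by its string key."""
        return cls._datasets[name]


@ToyRegistry.register_dataset()
def customized_huggingface_dataset(data_name, subset):
    # Placeholder body: a real loader would return a dataset object.
    return f"{data_name}/{subset}"


# A config's "type" string is what selects the function at run time.
loader = ToyRegistry.get_dataset("customized_huggingface_dataset")
result = loader(data_name="glue", subset="mrpc")
```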
- More examples:
- dummy_dataset_dataroot: