AutoML - NLP

Requirements

This example requires GPU. Install the [automl,hf] option:

pip install "flaml[automl,hf]"

A simple sequence classification example

from flaml import AutoML
from datasets import load_dataset

train_dataset = load_dataset("glue", "mrpc", split="train").to_pandas()
dev_dataset = load_dataset("glue", "mrpc", split="validation").to_pandas()
test_dataset = load_dataset("glue", "mrpc", split="test").to_pandas()
custom_sent_keys = ["sentence1", "sentence2"]
label_key = "label"
X_train, y_train = train_dataset[custom_sent_keys], train_dataset[label_key]
X_val, y_val = dev_dataset[custom_sent_keys], dev_dataset[label_key]
X_test = test_dataset[custom_sent_keys]

automl = AutoML()
automl_settings = {
    "time_budget": 100,
    "task": "seq-classification",
    "fit_kwargs_by_estimator": {
        "transformer": {
            "output_dir": "data/output/"  # if model_path is not set, the default model is facebook/muppet-roberta-base: https://huggingface.co/facebook/muppet-roberta-base
        }
    },  # setting the huggingface arguments: output directory
    "gpu_per_trial": 1,  # set to 0 if no GPU is available
}
automl.fit(
    X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, **automl_settings
)
automl.predict(X_test)

Notice that after you run automl.fit, the intermediate checkpoints are saved under the specified output_dir data/output. You can use the following code to clean these outputs if they consume a large storage space:

if os.path.exists("data/output/"):
    shutil.rmtree("data/output/")

Sample output

[flaml.automl: 12-06 08:21:39] {1943} INFO - task = seq-classification
[flaml.automl: 12-06 08:21:39] {1945} INFO - Data split method: stratified
[flaml.automl: 12-06 08:21:39] {1949} INFO - Evaluation method: holdout
[flaml.automl: 12-06 08:21:39] {2019} INFO - Minimizing error metric: 1-accuracy
[flaml.automl: 12-06 08:21:39] {2071} INFO - List of ML learners in AutoML Run: ['transformer']
[flaml.automl: 12-06 08:21:39] {2311} INFO - iteration 0, current learner transformer
{'data/output/train_2021-12-06_08-21-53/train_8947b1b2_1_n=1e-06,s=9223372036854775807,e=1e-05,s=-1,s=0.45765,e=32,d=42,o=0.0,y=0.0_2021-12-06_08-21-53/checkpoint-53': 53}
[flaml.automl: 12-06 08:22:56] {2424} INFO - Estimated sufficient time budget=766860s. Estimated necessary time budget=767s.
[flaml.automl: 12-06 08:22:56] {2499} INFO -  at 76.7s, estimator transformer's best error=0.1740,      best estimator transformer's best error=0.1740
[flaml.automl: 12-06 08:22:56] {2606} INFO - selected model: <flaml.nlp.huggingface.trainer.TrainerForAuto object at 0x7f49ea8414f0>
[flaml.automl: 12-06 08:22:56] {2100} INFO - fit succeeded
[flaml.automl: 12-06 08:22:56] {2101} INFO - Time taken to find the best model: 76.69802761077881
[flaml.automl: 12-06 08:22:56] {2112} WARNING - Time taken to find the best model is 77% of the provided time budget and not all estimators' hyperparameter search converged. Consider increasing the time budget.

A simple sequence regression example

from flaml import AutoML
from datasets import load_dataset

train_dataset = load_dataset("glue", "stsb", split="train").to_pandas()
dev_dataset = load_dataset("glue", "stsb", split="train").to_pandas()
custom_sent_keys = ["sentence1", "sentence2"]
label_key = "label"
X_train = train_dataset[custom_sent_keys]
y_train = train_dataset[label_key]
X_val = dev_dataset[custom_sent_keys]
y_val = dev_dataset[label_key]

automl = AutoML()
automl_settings = {
    "gpu_per_trial": 0,
    "time_budget": 20,
    "task": "seq-regression",
    "metric": "rmse",
}
automl_settings["fit_kwargs_by_estimator"] = {  # setting the huggingface arguments
    "transformer": {
        "model_path": "google/electra-small-discriminator",  # if model_path is not set, the default model is facebook/muppet-roberta-base: https://huggingface.co/facebook/muppet-roberta-base
        "output_dir": "data/output/",  # setting the output directory
        "fp16": False,
    }  # setting whether to use FP16
}
automl.fit(
    X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, **automl_settings
)

Sample output

[flaml.automl: 12-20 11:47:28] {1965} INFO - task = seq-regression
[flaml.automl: 12-20 11:47:28] {1967} INFO - Data split method: uniform
[flaml.automl: 12-20 11:47:28] {1971} INFO - Evaluation method: holdout
[flaml.automl: 12-20 11:47:28] {2063} INFO - Minimizing error metric: rmse
[flaml.automl: 12-20 11:47:28] {2115} INFO - List of ML learners in AutoML Run: ['transformer']
[flaml.automl: 12-20 11:47:28] {2355} INFO - iteration 0, current learner transformer

A simple summarization example

from flaml import AutoML
from datasets import load_dataset

train_dataset = load_dataset("xsum", split="train").to_pandas()
dev_dataset = load_dataset("xsum", split="validation").to_pandas()
custom_sent_keys = ["document"]
label_key = "summary"

X_train = train_dataset[custom_sent_keys]
y_train = train_dataset[label_key]

X_val = dev_dataset[custom_sent_keys]
y_val = dev_dataset[label_key]

automl = AutoML()
automl_settings = {
    "gpu_per_trial": 1,
    "time_budget": 20,
    "task": "summarization",
    "metric": "rouge1",
}
automl_settings["fit_kwargs_by_estimator"] = {  # setting the huggingface arguments
    "transformer": {
        "model_path": "t5-small",  # if model_path is not set, the default model is t5-small: https://huggingface.co/t5-small
        "output_dir": "data/output/",  # setting the output directory
        "fp16": False,
    }  # setting whether to use FP16
}
automl.fit(
    X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, **automl_settings
)

Sample Output

[flaml.automl: 12-20 11:44:03] {1965} INFO - task = summarization
[flaml.automl: 12-20 11:44:03] {1967} INFO - Data split method: uniform
[flaml.automl: 12-20 11:44:03] {1971} INFO - Evaluation method: holdout
[flaml.automl: 12-20 11:44:03] {2063} INFO - Minimizing error metric: -rouge
[flaml.automl: 12-20 11:44:03] {2115} INFO - List of ML learners in AutoML Run: ['transformer']
[flaml.automl: 12-20 11:44:03] {2355} INFO - iteration 0, current learner transformer
loading configuration file https://huggingface.co/t5-small/resolve/main/config.json from cache at /home/xliu127/.cache/huggingface/transformers/fe501e8fd6425b8ec93df37767fcce78ce626e34cc5edc859c662350cf712e41.406701565c0afd9899544c1cb8b93185a76f00b31e5ce7f6e18bbaef02241985
Model config T5Config {
  "_name_or_path": "t5-small",
  "architectures": [
    "T5WithLMHeadModel"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to German: "
    },
    "translation_en_to_fr": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to French: "
    },
    "translation_en_to_ro": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to Romanian: "
    }
  },
  "transformers_version": "4.14.1",
  "use_cache": true,
  "vocab_size": 32128
}

A simple token classification example

There are two ways to define the label for a token classification task. The first is to define the token labels:

from flaml import AutoML
import pandas as pd

train_dataset = {
    "id": ["0", "1"],
    "ner_tags": [
        ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"],
        ["B-PER", "I-PER"],
    ],
    "tokens": [
        [
            "EU",
            "rejects",
            "German",
            "call",
            "to",
            "boycott",
            "British",
            "lamb",
            ".",
        ],
        ["Peter", "Blackburn"],
    ],
}
dev_dataset = {
    "id": ["0"],
    "ner_tags": [
        ["O"],
    ],
    "tokens": [["1996-08-22"]],
}
test_dataset = {
    "id": ["0"],
    "ner_tags": [
        ["O"],
    ],
    "tokens": [["."]],
}
custom_sent_keys = ["tokens"]
label_key = "ner_tags"

train_dataset = pd.DataFrame(train_dataset)
dev_dataset = pd.DataFrame(dev_dataset)
test_dataset = pd.DataFrame(test_dataset)

X_train, y_train = train_dataset[custom_sent_keys], train_dataset[label_key]
X_val, y_val = dev_dataset[custom_sent_keys], dev_dataset[label_key]
X_test = test_dataset[custom_sent_keys]

automl = AutoML()
automl_settings = {
    "time_budget": 10,
    "task": "token-classification",
    "fit_kwargs_by_estimator": {
        "transformer": {
            "output_dir": "data/output/"
            # if model_path is not set, the default model is facebook/muppet-roberta-base: https://huggingface.co/facebook/muppet-roberta-base
        }
    },  # setting the huggingface arguments: output directory
    "gpu_per_trial": 1,  # set to 0 if no GPU is available
    "metric": "seqeval:overall_f1",
}

automl.fit(
    X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, **automl_settings
)
automl.predict(X_test)

The second is to define the id labels + a token label list:

from flaml import AutoML
import pandas as pd

train_dataset = {
    "id": ["0", "1"],
    "ner_tags": [
        [3, 0, 7, 0, 0, 0, 7, 0, 0],
        [1, 2],
    ],
    "tokens": [
        [
            "EU",
            "rejects",
            "German",
            "call",
            "to",
            "boycott",
            "British",
            "lamb",
            ".",
        ],
        ["Peter", "Blackburn"],
    ],
}
dev_dataset = {
    "id": ["0"],
    "ner_tags": [
        [0],
    ],
    "tokens": [["1996-08-22"]],
}
test_dataset = {
    "id": ["0"],
    "ner_tags": [
        [0],
    ],
    "tokens": [["."]],
}
custom_sent_keys = ["tokens"]
label_key = "ner_tags"

train_dataset = pd.DataFrame(train_dataset)
dev_dataset = pd.DataFrame(dev_dataset)
test_dataset = pd.DataFrame(test_dataset)

X_train, y_train = train_dataset[custom_sent_keys], train_dataset[label_key]
X_val, y_val = dev_dataset[custom_sent_keys], dev_dataset[label_key]
X_test = test_dataset[custom_sent_keys]

automl = AutoML()
automl_settings = {
    "time_budget": 10,
    "task": "token-classification",
    "fit_kwargs_by_estimator": {
        "transformer": {
            "output_dir": "data/output/",
            # if model_path is not set, the default model is facebook/muppet-roberta-base: https://huggingface.co/facebook/muppet-roberta-base
            "label_list": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC",
                "B-MISC",
                "I-MISC",
            ],
        }
    },  # setting the huggingface arguments: output directory
    "gpu_per_trial": 1,  # set to 0 if no GPU is available
    "metric": "seqeval:overall_f1",
}

automl.fit(
    X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, **automl_settings
)
automl.predict(X_test)

Sample Output

[flaml.automl: 06-30 03:10:02] {2423} INFO - task = token-classification
[flaml.automl: 06-30 03:10:02] {2425} INFO - Data split method: stratified
[flaml.automl: 06-30 03:10:02] {2428} INFO - Evaluation method: holdout
[flaml.automl: 06-30 03:10:02] {2497} INFO - Minimizing error metric: seqeval:overall_f1
[flaml.automl: 06-30 03:10:02] {2637} INFO - List of ML learners in AutoML Run: ['transformer']
[flaml.automl: 06-30 03:10:02] {2929} INFO - iteration 0, current learner transformer

For tasks that are not currently supported, use flaml.tune for customized tuning.

Link to Jupyter notebook

To run more examples, especially examples using Ray Tune, please go to:

Link to notebook | Open in colab

Requirements​

A simple sequence classification example​

Sample output​

A simple sequence regression example​

Sample output​

A simple summarization example​

Sample Output​

A simple token classification example​

Sample Output​

Link to Jupyter notebook​

Requirements

A simple sequence classification example

Sample output

A simple sequence regression example

Sample output

A simple summarization example

Sample Output

A simple token classification example

Sample Output

Link to Jupyter notebook