# Bert Embedding

This tutorial presents a benchmark of applying batch inference to a BERT embedding scenario; the code used is provided below.

Two batching methods, Sequence Batcher and Bulk Sequence Batcher, were tested. Sequence Batcher pads the input sequences within a batch to the same length. Bulk Sequence Batcher additionally divides input sequences of different lengths into four buckets, [1, 16], [17, 32], [33, 64], and (64, ∞), based on the given bucket setting, so that only sequences of similar length are batched and padded together.
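
As an illustration of the bucketing idea (a simplified sketch, not the library's actual implementation), each request can be routed to the first bucket whose upper bound covers its sequence length, and padding is then applied only within each bucket group:

```python
from typing import Dict, List

import torch


def assign_bucket(length: int, buckets: List[int]) -> int:
    """Return the index of the first bucket whose upper bound covers `length`.

    With buckets=[16, 32, 64] the groups are [1, 16], [17, 32], [33, 64],
    plus an overflow group for anything longer.
    """
    for i, upper in enumerate(buckets):
        if length <= upper:
            return i
    return len(buckets)


def pad_group(seqs: List[torch.Tensor], pad_token: int = 0) -> torch.Tensor:
    """Pad a group of 1-D token tensors to the longest sequence in the group."""
    max_len = max(s.size(0) for s in seqs)
    batch = torch.full((len(seqs), max_len), pad_token, dtype=seqs[0].dtype)
    for i, s in enumerate(seqs):
        batch[i, : s.size(0)] = s
    return batch


# Group queued requests by bucket, then pad each group independently.
requests = [torch.randint(1, 1000, (n,)) for n in (5, 12, 20, 60)]
groups: Dict[int, List[torch.Tensor]] = {}
for seq in requests:
    groups.setdefault(assign_bucket(seq.size(0), [16, 32, 64]), []).append(seq)
batches = {idx: pad_group(seqs) for idx, seqs in groups.items()}
# bucket 0 -> shape (2, 12), bucket 1 -> shape (1, 20), bucket 2 -> shape (1, 60)
```

Compared to padding all queued requests together, this trades smaller per-bucket batches for far fewer padded positions, which is consistent with the Bulk Sequence Batcher results below (smaller average batch size, higher throughput).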

As shown in the table below, Bulk Sequence Batcher achieves roughly 4.7x the baseline throughput, and Sequence Batcher achieves roughly 3.6x.

The experiments were run on an NVIDIA V100 GPU.

| Method | Query Count | Execution Time | Throughput vs. Baseline | Max Batch Size Setting | Avg Batch Size |
| --- | --- | --- | --- | --- | --- |
| Baseline | 2000 | 37.75s | 1x | / | 1 |
| Sequence Batcher | 2000 | 14.88s | 2.5x | 4 | 3.99 |
| Sequence Batcher | 2000 | 10.54s | 3.5x | 32 | 31.25 |
| Sequence Batcher | 2000 | 10.38s | 3.6x | 64 | 48.78 |
| Bulk Sequence Batcher | 2000 | 7.93s | 4.7x | 32 | 15.74 |

BERT embedding with Bulk Sequence Batcher:

```python
import torch
from transformers import BertModel

from batch_inference import batching
from batch_inference.batcher.bucket_seq_batcher import BucketSeqBatcher


# Requests are grouped into length buckets [1, 16], [17, 32], [33, 64], (64, ∞)
# and padded with 0 for both input_ids and attention_mask before batching.
@batching(
    batcher=BucketSeqBatcher(padding_tokens=[0, 0], buckets=[16, 32, 64], tensor="pt"),
    max_batch_size=32,
)
class BertEmbeddingModel:
    def __init__(self):
        self.model = BertModel.from_pretrained("bert-base-uncased")

    def predict_batch(
        self, input_ids: torch.Tensor, attention_mask: torch.Tensor
    ) -> torch.Tensor:
        # Return the last hidden states (one embedding per token) for the batch.
        with torch.no_grad():
            outputs = self.model(input_ids, attention_mask)
            embedding = outputs[0]
        return embedding
```
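
For reference, the two tensors passed to `predict_batch` are the standard tokenizer outputs. `[PAD]` has token id 0 in `bert-base-uncased` and padded positions carry attention mask 0, which is presumably what the `padding_tokens=[0, 0]` setting above correspondsds to (one pad value per input tensor). A minimal sketch of encoding a single query, not part of the benchmark itself:

```python
from transformers import BertTokenizer

# Illustrative only: encode one query into the (input_ids, attention_mask)
# pair that the batcher pads and stacks before calling predict_batch.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("an example sentence to embed", return_tensors="pt")
input_ids = encoded["input_ids"]            # shape: (1, sequence_length)
attention_mask = encoded["attention_mask"]  # shape: (1, sequence_length)
```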

Benchmark code
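
The benchmark itself is not reproduced here. As a rough, hypothetical sketch of the measurement loop, assuming queries are submitted concurrently to the served model and throughput is derived from query count over wall-clock time (`submit_query`, `queries`, and `num_threads` are illustrative names, not part of the library):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_benchmark(submit_query, queries, num_threads: int = 32) -> float:
    """Submit all queries from a thread pool and report elapsed time and QPS."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(submit_query, queries))  # wait for every query to finish
    elapsed = time.perf_counter() - start
    print(f"{len(queries)} queries in {elapsed:.2f}s "
          f"({len(queries) / elapsed:.1f} queries/s)")
    return elapsed
```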