BERT Embedding
This tutorial benchmarks applying batch-inference to a BERT embedding scenario; the code is provided below.
Two batching methods, Sequence Batcher and Bulk Sequence Batcher, were tested. Sequence Batcher pads input sequences to the same length within a batch. Bulk Sequence Batcher additionally divides input sequences of different lengths into four buckets, [1, 16], [17, 32], [33, 64], and (64, ∞), based on the given bucket setting, so that each sequence is only batched and padded together with sequences of similar length.
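To make the bucketing rule concrete, here is a minimal sketch of how sequence lengths map to those buckets. This is illustrative only (not the library's internal implementation), assuming bucket boundaries of [16, 32, 64] as in the code example further below:

```python
from bisect import bisect_left

# Upper bounds of the buckets; sequences longer than the last bound
# fall into a final open-ended bucket.
BUCKET_BOUNDARIES = [16, 32, 64]

def bucket_index(seq_len: int) -> int:
    """Return the index of the bucket a sequence of length seq_len falls into."""
    return bisect_left(BUCKET_BOUNDARIES, seq_len)

assert bucket_index(10) == 0   # lengths 1-16
assert bucket_index(20) == 1   # lengths 17-32
assert bucket_index(64) == 2   # lengths 33-64
assert bucket_index(100) == 3  # lengths > 64
```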
As shown in the table below, Bulk Sequence Batcher achieves about 4.7x the baseline throughput, while Sequence Batcher achieves up to about 3.6x.
The experiments were run on an NVIDIA V100 GPU.
| Method | Query Count | Execution Time | Throughput vs. Baseline | Max Batch Size Setting | Avg Batch Size |
| --- | --- | --- | --- | --- | --- |
| Baseline | 2000 | 37.75s | 1x | / | 1 |
| Sequence Batcher | 2000 | 14.88s | 2.5x | 4 | 3.99 |
| Sequence Batcher | 2000 | 10.54s | 3.5x | 32 | 31.25 |
| Sequence Batcher | 2000 | 10.38s | 3.6x | 64 | 48.78 |
| Bulk Sequence Batcher | 2000 | 7.93s | 4.7x | 32 | 15.74 |
BERT embedding with Bulk Sequence Batcher:
```python
import torch
from transformers import BertModel

from batch_inference import batching
from batch_inference.batcher.bucket_seq_batcher import BucketSeqBatcher


@batching(
    batcher=BucketSeqBatcher(padding_tokens=[0, 0], buckets=[16, 32, 64], tensor="pt"),
    max_batch_size=32,
)
class BertEmbeddingModel:
    def __init__(self):
        self.model = BertModel.from_pretrained("bert-base-uncased")

    def predict_batch(
        self, input_ids: torch.Tensor, attention_mask: torch.Tensor
    ) -> torch.Tensor:
        # input_ids and attention_mask arrive already grouped into a bucket
        # and padded to the same length by the batcher.
        with torch.no_grad():
            outputs = self.model(input_ids, attention_mask)
            embedding = outputs[0]  # last hidden state
        return embedding
```
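A minimal usage sketch follows. The tokenization uses the standard Hugging Face `BertTokenizer`; the `host()` / `start()` / `predict()` / `stop()` calls follow the general pattern from the batch-inference README and are an assumption here, so verify the names and the expected tensor shapes against the version you have installed.

```python
from transformers import BertTokenizer

# Hypothetical client-side sketch: tokenize one query into the tensors that
# predict_batch expects, then submit it through the hosted model so that
# concurrent requests get bucketed and batched together.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

host = BertEmbeddingModel.host()  # assumed hosting API from the batch-inference README
host.start()

encoded = tokenizer("what is the capital of France?", return_tensors="pt")
# Shapes may need adjusting (e.g. squeezing the leading batch dimension)
# depending on how the batcher concatenates per-request tensors.
embedding = host.predict(encoded["input_ids"], encoded["attention_mask"])

host.stop()
```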