Receipt PDF Parsing with DuckDB + Multimodal¶
Parse multiple receipt PDF files into a structured DuckDB table using SQL.
This example demonstrates:
duckdb_ext.responses_udfwith Pydanticresponse_format→ DuckDBSTRUCT- Multimodal mode — PDFs are uploaded via the Files API automatically
glob()for file discovery + pure SQL for field access and aggregation
1. Setup¶
In [1]:
Copied!
import tempfile
from pathlib import Path
import duckdb
from pydantic import BaseModel
import openaivec
from openaivec.duckdb_ext import responses_udf
openaivec.set_responses_model("gpt-5.4")
import tempfile
from pathlib import Path
import duckdb
from pydantic import BaseModel
import openaivec
from openaivec.duckdb_ext import responses_udf
openaivec.set_responses_model("gpt-5.4")
2. Define receipt schema¶
The Pydantic model is automatically converted to a DuckDB STRUCT type,
so you can access fields directly in SQL.
In [2]:
Copied!
class ReceiptItem(BaseModel):
"""A single line item on a receipt."""
description: str
quantity: int
unit_price: float
amount: float
class Receipt(BaseModel):
"""Structured representation of a receipt."""
store_name: str
date: str
items: list[ReceiptItem]
subtotal: float
tax: float
total: float
payment_method: str
class ReceiptItem(BaseModel):
"""A single line item on a receipt."""
description: str
quantity: int
unit_price: float
amount: float
class Receipt(BaseModel):
"""Structured representation of a receipt."""
store_name: str
date: str
items: list[ReceiptItem]
subtotal: float
tax: float
total: float
payment_method: str
3. Generate receipt PDFs and discover with glob()¶
We generate minimal PDFs programmatically (no external dependencies).
DuckDB’s glob() table function scans the directory for matching files.
In [3]:
Copied!
tmpdir = tempfile.mkdtemp()
def make_pdf(text: str) -> bytes:
"""Generate a minimal valid PDF containing the given text."""
lines = text.strip().split("\n")
stream_parts, y = [], 750
for line in lines:
escaped = line.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
stream_parts.append(f"BT /F1 10 Tf 50 {y} Td ({escaped}) Tj ET")
y -= 14
stream = "\n".join(stream_parts).encode("latin-1")
objs = [
b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj",
b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj",
b"3 0 obj\n<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792]"
b" /Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>\nendobj",
b"4 0 obj\n<< /Length " + str(len(stream)).encode() + b" >>\nstream\n"
+ stream + b"\nendstream\nendobj",
b"5 0 obj\n<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>\nendobj",
]
body = b"%PDF-1.4\n"
offsets = []
for obj in objs:
offsets.append(len(body))
body += obj + b"\n"
xref_pos = len(body)
xref = b"xref\n0 6\n0000000000 65535 f \n"
for off in offsets:
xref += f"{off:010d} 00000 n \n".encode()
return body + xref + f"trailer\n<< /Size 6 /Root 1 0 R >>\nstartxref\n{xref_pos}\n%%EOF".encode()
receipts_raw = {
"receipt_coffee_shop.pdf": """
SUNNY CAFE
123 Main Street, Tokyo
Date: 2025-03-15
Latte x2 550 1,100
Croissant x1 380 380
Blueberry Muffin x3 420 1,260
Subtotal: 2,740
Tax (10%): 274
Total: 3,014
Payment: Credit Card (Visa)
""",
"receipt_electronics.pdf": """
TECH WORLD AKIHABARA
456 Electric Town, Tokyo
Date: 2025-03-16
USB-C Cable x2 980 1,960
Wireless Mouse x1 3,500 3,500
Screen Protector x3 650 1,950
Subtotal: 7,410
Tax (10%): 741
Total: 8,151
Payment: Cash
""",
"receipt_grocery.pdf": """
FRESH MART SHIBUYA
789 Center Gai, Tokyo
Date: 2025-03-17
Organic Milk 1L x2 320 640
Whole Wheat Bread x1 280 280
Avocado x4 198 792
Salmon Fillet 200g x2 580 1,160
Subtotal: 2,872
Tax (8%): 230
Total: 3,102
Payment: IC Card (Suica)
""",
}
for name, content in receipts_raw.items():
(Path(tmpdir) / name).write_bytes(make_pdf(content))
conn = duckdb.connect()
conn.sql(f"SELECT file FROM glob('{tmpdir}/receipt_*.pdf')").to_df()
tmpdir = tempfile.mkdtemp()
def make_pdf(text: str) -> bytes:
"""Generate a minimal valid PDF containing the given text."""
lines = text.strip().split("\n")
stream_parts, y = [], 750
for line in lines:
escaped = line.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
stream_parts.append(f"BT /F1 10 Tf 50 {y} Td ({escaped}) Tj ET")
y -= 14
stream = "\n".join(stream_parts).encode("latin-1")
objs = [
b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj",
b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj",
b"3 0 obj\n<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792]"
b" /Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>\nendobj",
b"4 0 obj\n<< /Length " + str(len(stream)).encode() + b" >>\nstream\n"
+ stream + b"\nendstream\nendobj",
b"5 0 obj\n<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>\nendobj",
]
body = b"%PDF-1.4\n"
offsets = []
for obj in objs:
offsets.append(len(body))
body += obj + b"\n"
xref_pos = len(body)
xref = b"xref\n0 6\n0000000000 65535 f \n"
for off in offsets:
xref += f"{off:010d} 00000 n \n".encode()
return body + xref + f"trailer\n<< /Size 6 /Root 1 0 R >>\nstartxref\n{xref_pos}\n%%EOF".encode()
receipts_raw = {
"receipt_coffee_shop.pdf": """
SUNNY CAFE
123 Main Street, Tokyo
Date: 2025-03-15
Latte x2 550 1,100
Croissant x1 380 380
Blueberry Muffin x3 420 1,260
Subtotal: 2,740
Tax (10%): 274
Total: 3,014
Payment: Credit Card (Visa)
""",
"receipt_electronics.pdf": """
TECH WORLD AKIHABARA
456 Electric Town, Tokyo
Date: 2025-03-16
USB-C Cable x2 980 1,960
Wireless Mouse x1 3,500 3,500
Screen Protector x3 650 1,950
Subtotal: 7,410
Tax (10%): 741
Total: 8,151
Payment: Cash
""",
"receipt_grocery.pdf": """
FRESH MART SHIBUYA
789 Center Gai, Tokyo
Date: 2025-03-17
Organic Milk 1L x2 320 640
Whole Wheat Bread x1 280 280
Avocado x4 198 792
Salmon Fillet 200g x2 580 1,160
Subtotal: 2,872
Tax (8%): 230
Total: 3,102
Payment: IC Card (Suica)
""",
}
for name, content in receipts_raw.items():
(Path(tmpdir) / name).write_bytes(make_pdf(content))
conn = duckdb.connect()
conn.sql(f"SELECT file FROM glob('{tmpdir}/receipt_*.pdf')").to_df()
Out[3]:
| file | |
|---|---|
| 0 | /var/folders/09/9cq3p9sn3hg5s3xc0q87xgth0000gn... |
| 1 | /var/folders/09/9cq3p9sn3hg5s3xc0q87xgth0000gn... |
| 2 | /var/folders/09/9cq3p9sn3hg5s3xc0q87xgth0000gn... |
4. Register the parsing UDF¶
responses_udf with a Pydantic response_format creates a UDF that returns
a DuckDB STRUCT. With multimodal=True, PDF files are automatically
uploaded via the Files API.
In [4]:
Copied!
responses_udf(
conn,
"parse_receipt",
instructions=(
"You are a receipt parser. Extract all information from the receipt "
"into the structured format. Use numeric values without currency symbols or commas. "
"Dates should be in YYYY-MM-DD format."
),
response_format=Receipt,
batch_size=10,
multimodal=True,
)
responses_udf(
conn,
"parse_receipt",
instructions=(
"You are a receipt parser. Extract all information from the receipt "
"into the structured format. Use numeric values without currency symbols or commas. "
"Dates should be in YYYY-MM-DD format."
),
response_format=Receipt,
batch_size=10,
multimodal=True,
)
5. Parse all receipts with SQL¶
Call the UDF on each file from . The result is a — access fields with dot notation.
In [5]:
Copied!
conn.sql(f"""
SELECT
parse_receipt(file).store_name AS store,
parse_receipt(file).date AS date,
parse_receipt(file).total AS total,
parse_receipt(file).tax AS tax,
parse_receipt(file).payment_method AS payment
FROM glob('{tmpdir}/receipt_*.pdf')
""").to_df()
conn.sql(f"""
SELECT
parse_receipt(file).store_name AS store,
parse_receipt(file).date AS date,
parse_receipt(file).total AS total,
parse_receipt(file).tax AS tax,
parse_receipt(file).payment_method AS payment
FROM glob('{tmpdir}/receipt_*.pdf')
""").to_df()
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
Out[5]:
| store | date | total | tax | payment | |
|---|---|---|---|---|---|
| 0 | SUNNY CAFE | 2025-03-15 | 3014.0 | 274.0 | Credit Card (Visa) |
| 1 | TECH WORLD AKIHABARA | 2025-03-16 | 8151.0 | 741.0 | Cash |
| 2 | FRESH MART SHIBUYA | 2025-03-17 | 3102.0 | 230.0 | IC Card (Suica) |
6. Unnest line items¶
Use UNNEST to flatten the items array into individual rows.
In [6]:
Copied!
conn.sql(f"""
WITH parsed AS (
SELECT parse_receipt(file) AS r
FROM glob('{tmpdir}/receipt_*.pdf')
),
items AS (
SELECT
r.store_name AS store,
r.date AS date,
UNNEST(r.items) AS item,
r.payment_method AS payment
FROM parsed
)
SELECT
store, date,
item.description,
item.quantity AS qty,
item.unit_price,
item.amount,
payment
FROM items
""").to_df()
conn.sql(f"""
WITH parsed AS (
SELECT parse_receipt(file) AS r
FROM glob('{tmpdir}/receipt_*.pdf')
),
items AS (
SELECT
r.store_name AS store,
r.date AS date,
UNNEST(r.items) AS item,
r.payment_method AS payment
FROM parsed
)
SELECT
store, date,
item.description,
item.quantity AS qty,
item.unit_price,
item.amount,
payment
FROM items
""").to_df()
Out[6]:
| store | date | description | qty | unit_price | amount | payment | |
|---|---|---|---|---|---|---|---|
| 0 | SUNNY CAFE | 2025-03-15 | Latte | 2 | 550.0 | 1100.0 | Credit Card (Visa) |
| 1 | SUNNY CAFE | 2025-03-15 | Croissant | 1 | 380.0 | 380.0 | Credit Card (Visa) |
| 2 | SUNNY CAFE | 2025-03-15 | Blueberry Muffin | 3 | 420.0 | 1260.0 | Credit Card (Visa) |
| 3 | TECH WORLD AKIHABARA | 2025-03-16 | USB-C Cable | 2 | 980.0 | 1960.0 | Cash |
| 4 | TECH WORLD AKIHABARA | 2025-03-16 | Wireless Mouse | 1 | 3500.0 | 3500.0 | Cash |
| 5 | TECH WORLD AKIHABARA | 2025-03-16 | Screen Protector | 3 | 650.0 | 1950.0 | Cash |
| 6 | FRESH MART SHIBUYA | 2025-03-17 | Organic Milk 1L | 2 | 320.0 | 640.0 | IC Card (Suica) |
| 7 | FRESH MART SHIBUYA | 2025-03-17 | Whole Wheat Bread | 1 | 280.0 | 280.0 | IC Card (Suica) |
| 8 | FRESH MART SHIBUYA | 2025-03-17 | Avocado | 4 | 198.0 | 792.0 | IC Card (Suica) |
| 9 | FRESH MART SHIBUYA | 2025-03-17 | Salmon Fillet 200g | 2 | 580.0 | 1160.0 | IC Card (Suica) |
7. Aggregate by store¶
In [7]:
Copied!
conn.sql(f"""
WITH parsed AS (
SELECT parse_receipt(file) AS r
FROM glob('{tmpdir}/receipt_*.pdf')
)
SELECT
r.store_name AS store,
len(r.items) AS num_items,
r.subtotal,
r.tax,
r.total,
r.payment_method AS payment
FROM parsed
ORDER BY r.total DESC
""").to_df()
conn.sql(f"""
WITH parsed AS (
SELECT parse_receipt(file) AS r
FROM glob('{tmpdir}/receipt_*.pdf')
)
SELECT
r.store_name AS store,
len(r.items) AS num_items,
r.subtotal,
r.tax,
r.total,
r.payment_method AS payment
FROM parsed
ORDER BY r.total DESC
""").to_df()
Out[7]:
| store | num_items | subtotal | tax | total | payment | |
|---|---|---|---|---|---|---|
| 0 | TECH WORLD AKIHABARA | 3 | 7410.0 | 741.0 | 8151.0 | Cash |
| 1 | FRESH MART SHIBUYA | 4 | 2872.0 | 230.0 | 3102.0 | IC Card (Suica) |
| 2 | SUNNY CAFE | 3 | 2740.0 | 274.0 | 3014.0 | Credit Card (Visa) |
8. Cleanup¶
In [8]:
Copied!
conn.close()
import shutil
shutil.rmtree(tmpdir)
print("Done!")
conn.close()
import shutil
shutil.rmtree(tmpdir)
print("Done!")
Done!
Notes¶
- PDF files are uploaded via the Files API and processed individually
when
multimodal=Trueis set. - Text files (
.txt,.csv,.md,.py, etc.) would be read as strings and processed through the batched text path with deduplication. - The Pydantic model maps to a DuckDB
STRUCT— access nested fields with dot notation andUNNESTarrays, all in SQL. glob()eliminates the need to manually list file paths.