Microsoft Fabric × DuckDB × openaivec — Structured PDF Extraction (Inventory Transfer Slips)¶
This notebook is designed to run inside a Microsoft Fabric Notebook and builds the following pipeline:
- Read inventory transfer slip PDFs from
Files/transfer/*.pdfof a Fabric Lakehouse via DuckDB. - Use openaivec's
responses_udfto parse each PDF into a Pydantic schema with an Azure OpenAI multimodal model. - Convert the result to a pandas DataFrame and write it to Delta Lake on OneLake.
responses_udf takes care of batching, deduplication, retry, and structured output (Pydantic response_format), so your code is essentially just SQL + a Pydantic schema.
Prerequisites¶
1. Fabric / OneLake¶
- This notebook is placed in a Fabric workspace.
- A Lakehouse (e.g.
bronze.Lakehouse) is created and attached as the default Lakehouse of this notebook. - Input PDFs are uploaded to
Files/transfer/of that Lakehouse (accessible from the kernel via the POSIX path/lakehouse/default/Files/transfer/*.pdf). - The Delta write target (e.g.
abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>.Lakehouse/Tables/...) is writable.
2. Azure OpenAI / AI Foundry¶
- An Azure OpenAI or AI Foundry endpoint in v1 API format:
- Example:
https://<your-resource>.services.ai.azure.com/openai/v1/ - Foundry project variant:
https://<your-resource>.services.ai.azure.com/api/projects/<project>/openai/v1/
- Example:
- A multimodal-capable deployment (e.g. a
gpt-4.1-minifamily model) is deployed under that endpoint.
3. Entra ID authentication (Service Principal + Key Vault, accessed via the Fabric Workspace identity)¶
Fabric Notebooks do not support DefaultAzureCredential directly, so the Service Principal's clientSecret is retrieved from Key Vault and a ClientSecretCredential is constructed. openaivec's _provider.py performs this automatically when KEY_VAULT_URL and KEY_VAULT_SECRET_NAME are set.
Important — Key Vault access uses the Fabric Workspace identity: under the hood, notebookutils.credentials.getSecret(...) calls Key Vault as the Fabric workspace (identified by its Workspace ID), not as the Service Principal you are about to use against Azure OpenAI, and not as the interactive user. You therefore need to grant Key Vault read permission to the workspace itself.
| Item | Details |
|---|---|
| Service Principal (SP) | Create an App Registration; note its tenantId and clientId. |
| SP role on Azure OpenAI / Foundry | Grant the SP the Cognitive Services OpenAI User role on the AI resource. |
| Key Vault — store secret | Store the SP's clientSecret as a Key Vault secret. |
| Key Vault — RBAC (workspace) | Grant Key Vault Secrets User to the Fabric workspace identity (object id = Fabric Workspace ID). This is the principal that notebookutils.credentials.getSecret impersonates. |
| Key Vault access model | Use Azure RBAC (not legacy access policies) on the Key Vault so the workspace assignment is honored. |
How to find the Fabric Workspace ID: open the workspace in the Fabric portal, click Workspace settings → About, and copy the Workspace ID (GUID). Assign roles against this object id in Key Vault's Access control (IAM).
Security: all Tenant ID / Client ID / Key Vault URI / endpoint values in the cells below are placeholders. Replace them with your own values and never commit real secrets to a repository.
1. Install dependencies¶
Uninstall the older openai shipped with the Fabric runtime, then install openaivec (which pulls in the latest openai) and azure-identity.
%pip uninstall -y openai
%pip install -U openaivec
%pip install -U azure-identity
Found existing installation: openai 1.109.1 Uninstalling openai-1.109.1: Successfully uninstalled openai-1.109.1 Note: you may need to restart the kernel to use updated packages. Collecting openaivec Downloading openaivec-2.3.1-py3-none-any.whl.metadata (21 kB) Requirement already satisfied: aiohttp>=3.9.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from openaivec) (3.12.15) Requirement already satisfied: azure-identity>=1.25.1 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from openaivec) (1.25.1) Requirement already satisfied: duckdb>=1.0.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from openaivec) (1.4.4) Requirement already satisfied: ipywidgets>=8.1.7 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from openaivec) (8.1.7) Collecting openai>=2.0.0 (from openaivec) Downloading openai-2.38.0-py3-none-any.whl.metadata (31 kB) Collecting pandas>=3.0.0 (from openaivec) Downloading pandas-3.0.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (79 kB) Requirement already satisfied: pyarrow>=19.0.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from openaivec) (22.0.0) Collecting tiktoken>=0.9.0 (from openaivec) Downloading tiktoken-0.13.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (6.7 kB) Requirement already satisfied: tqdm>=4.67.1 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from openaivec) (4.67.1) Requirement already satisfied: aiohappyeyeballs>=2.5.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from aiohttp>=3.9.0->openaivec) (2.6.1) Requirement already satisfied: aiosignal>=1.4.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from aiohttp>=3.9.0->openaivec) (1.4.0) Requirement already satisfied: attrs>=17.3.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from aiohttp>=3.9.0->openaivec) (25.4.0) Requirement already satisfied: frozenlist>=1.1.1 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from aiohttp>=3.9.0->openaivec) (1.8.0) Requirement already satisfied: multidict<7.0,>=4.5 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from aiohttp>=3.9.0->openaivec) (6.7.1) Requirement already satisfied: propcache>=0.2.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from aiohttp>=3.9.0->openaivec) (0.4.1) Requirement already satisfied: yarl<2.0,>=1.17.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from aiohttp>=3.9.0->openaivec) (1.22.0) Requirement already satisfied: idna>=2.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from yarl<2.0,>=1.17.0->aiohttp>=3.9.0->openaivec) (3.11) Requirement already satisfied: typing-extensions>=4.2 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from aiosignal>=1.4.0->aiohttp>=3.9.0->openaivec) (4.15.0) Requirement already satisfied: azure-core>=1.31.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from azure-identity>=1.25.1->openaivec) (1.36.0) Requirement already satisfied: cryptography>=2.5 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from azure-identity>=1.25.1->openaivec) (46.0.4) Requirement already satisfied: msal>=1.30.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from azure-identity>=1.25.1->openaivec) (1.34.0) Requirement already satisfied: msal-extensions>=1.2.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from azure-identity>=1.25.1->openaivec) (1.3.1) Requirement already satisfied: requests>=2.21.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from azure-core>=1.31.0->azure-identity>=1.25.1->openaivec) (2.32.5) Requirement already satisfied: cffi>=2.0.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from cryptography>=2.5->azure-identity>=1.25.1->openaivec) (2.0.0) Requirement already satisfied: pycparser in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from cffi>=2.0.0->cryptography>=2.5->azure-identity>=1.25.1->openaivec) (2.23) Requirement already satisfied: comm>=0.1.3 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from ipywidgets>=8.1.7->openaivec) (0.2.3) Requirement already satisfied: ipython>=6.1.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from ipywidgets>=8.1.7->openaivec) (9.7.0) Requirement already satisfied: traitlets>=4.3.1 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from ipywidgets>=8.1.7->openaivec) (5.14.3) Requirement already satisfied: widgetsnbextension~=4.0.14 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from ipywidgets>=8.1.7->openaivec) (4.0.14) Requirement already satisfied: jupyterlab_widgets~=3.0.15 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from ipywidgets>=8.1.7->openaivec) (3.0.16) Requirement already satisfied: decorator>=4.3.2 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from ipython>=6.1.0->ipywidgets>=8.1.7->openaivec) (5.2.1) Requirement already satisfied: ipython-pygments-lexers>=1.0.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from ipython>=6.1.0->ipywidgets>=8.1.7->openaivec) (1.1.1) Requirement already satisfied: jedi>=0.18.1 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from ipython>=6.1.0->ipywidgets>=8.1.7->openaivec) (0.19.2) Requirement already satisfied: matplotlib-inline>=0.1.5 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from ipython>=6.1.0->ipywidgets>=8.1.7->openaivec) (0.2.1) Requirement already satisfied: pexpect>4.3 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from ipython>=6.1.0->ipywidgets>=8.1.7->openaivec) (4.9.0) Requirement already satisfied: prompt_toolkit<3.1.0,>=3.0.41 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from ipython>=6.1.0->ipywidgets>=8.1.7->openaivec) (3.0.52) Requirement already satisfied: pygments>=2.11.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from ipython>=6.1.0->ipywidgets>=8.1.7->openaivec) (2.19.2) Requirement already satisfied: stack_data>=0.6.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from ipython>=6.1.0->ipywidgets>=8.1.7->openaivec) (0.6.3) Requirement already satisfied: wcwidth in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from prompt_toolkit<3.1.0,>=3.0.41->ipython>=6.1.0->ipywidgets>=8.1.7->openaivec) (0.2.14) Requirement already satisfied: parso<0.9.0,>=0.8.4 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from jedi>=0.18.1->ipython>=6.1.0->ipywidgets>=8.1.7->openaivec) (0.8.5) Requirement already satisfied: PyJWT<3,>=1.0.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from PyJWT[crypto]<3,>=1.0.0->msal>=1.30.0->azure-identity>=1.25.1->openaivec) (2.10.1) Requirement already satisfied: charset_normalizer<4,>=2 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from requests>=2.21.0->azure-core>=1.31.0->azure-identity>=1.25.1->openaivec) (3.4.4) Requirement already satisfied: urllib3<3,>=1.21.1 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from requests>=2.21.0->azure-core>=1.31.0->azure-identity>=1.25.1->openaivec) (2.5.0) Requirement already satisfied: certifi>=2017.4.17 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from requests>=2.21.0->azure-core>=1.31.0->azure-identity>=1.25.1->openaivec) (2025.10.5) Requirement already satisfied: anyio<5,>=3.5.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from openai>=2.0.0->openaivec) (4.10.0) Requirement already satisfied: distro<2,>=1.7.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from openai>=2.0.0->openaivec) (1.9.0) Requirement already satisfied: httpx<1,>=0.23.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from openai>=2.0.0->openaivec) (0.28.1) Requirement already satisfied: jiter<1,>=0.10.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from openai>=2.0.0->openaivec) (0.12.0) Requirement already satisfied: pydantic<3,>=1.9.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from openai>=2.0.0->openaivec) (2.12.4) Requirement already satisfied: sniffio in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from openai>=2.0.0->openaivec) (1.3.1) Requirement already satisfied: httpcore==1.* in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from httpx<1,>=0.23.0->openai>=2.0.0->openaivec) (1.0.9) Requirement already satisfied: h11>=0.16 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from httpcore==1.*->httpx<1,>=0.23.0->openai>=2.0.0->openaivec) (0.16.0) Requirement already satisfied: annotated-types>=0.6.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from pydantic<3,>=1.9.0->openai>=2.0.0->openaivec) (0.6.0) Requirement already satisfied: pydantic-core==2.41.5 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from pydantic<3,>=1.9.0->openai>=2.0.0->openaivec) (2.41.5) Requirement already satisfied: typing-inspection>=0.4.2 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from pydantic<3,>=1.9.0->openai>=2.0.0->openaivec) (0.4.2) Requirement already satisfied: numpy>=1.26.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from pandas>=3.0.0->openaivec) (1.26.4) Requirement already satisfied: python-dateutil>=2.8.2 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from pandas>=3.0.0->openaivec) (2.9.0.post0) Requirement already satisfied: ptyprocess>=0.5 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from pexpect>4.3->ipython>=6.1.0->ipywidgets>=8.1.7->openaivec) (0.7.0) Requirement already satisfied: six>=1.5 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas>=3.0.0->openaivec) (1.17.0) Requirement already satisfied: executing>=1.2.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from stack_data>=0.6.0->ipython>=6.1.0->ipywidgets>=8.1.7->openaivec) (2.2.1) Requirement already satisfied: asttokens>=2.1.0 in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from stack_data>=0.6.0->ipython>=6.1.0->ipywidgets>=8.1.7->openaivec) (3.0.0) Requirement already satisfied: pure_eval in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from stack_data>=0.6.0->ipython>=6.1.0->ipywidgets>=8.1.7->openaivec) (0.2.3) Requirement already satisfied: regex in /home/trusted-service-user/jupyter-env/python3.12/lib/python3.12/site-packages (from tiktoken>=0.9.0->openaivec) (2025.9.1) Downloading openaivec-2.3.1-py3-none-any.whl (124 kB) Downloading openai-2.38.0-py3-none-any.whl (1.3 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 38.2 MB/s 0:00:00 Downloading pandas-3.0.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (10.9 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.9/10.9 MB 112.3 MB/s 0:00:00 Downloading tiktoken-0.13.0-cp312-cp312-manylinux_2_28_x86_64.whl (1.1 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 30.4 MB/s 0:00:00 Installing collected packages: tiktoken, pandas, openai, openaivec Attempting uninstall: pandas Found existing installation: pandas 2.3.3 Uninstalling pandas-2.3.3:0m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/4 [pandas] ━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/4 [pandas]
2. Configure environment variables (placeholders)¶
openaivec evaluates the following environment variables to decide how to authenticate:
| Variable | Purpose |
|---|---|
AZURE_TENANT_ID |
Tenant ID of the Service Principal used against Azure OpenAI. |
AZURE_CLIENT_ID |
Client ID of that Service Principal. |
KEY_VAULT_URL |
URL of the Key Vault that stores the SP's client secret. |
KEY_VAULT_SECRET_NAME |
Name of the secret inside that Key Vault. |
AZURE_OPENAI_BASE_URL |
Azure OpenAI / Foundry base URL in v1 format (must end with /openai/v1/). |
Inside Fabric, openaivec detects notebookutils, calls notebookutils.credentials.getSecret(KEY_VAULT_URL, KEY_VAULT_SECRET_NAME) as the Fabric workspace identity, and then builds a ClientSecretCredential (or its async equivalent from azure.identity.aio) from the resulting secret. Make sure the Fabric workspace itself has been granted Key Vault Secrets User on the vault — see the prerequisites above.
Replace every
<...>placeholder with your own value before running the next cell.
import os
# --- Service Principal used against Azure OpenAI / Foundry ---
os.environ["AZURE_TENANT_ID"] = "<YOUR_TENANT_ID>"
os.environ["AZURE_CLIENT_ID"] = "<YOUR_SP_CLIENT_ID>"
# --- Key Vault that stores the SP client secret.
# The Fabric workspace identity (Workspace ID) must have
# `Key Vault Secrets User` on this vault. ---
os.environ["KEY_VAULT_URL"] = "https://<your-keyvault>.vault.azure.net/"
os.environ["KEY_VAULT_SECRET_NAME"] = "<your-secret-name>"
# --- Azure OpenAI / Foundry endpoint (v1 format) ---
os.environ["AZURE_OPENAI_BASE_URL"] = (
"https://<your-resource>.services.ai.azure.com/openai/v1/"
)
3. Import libraries and initialize openaivec¶
All symbols used by the notebook are imported here in one place. Importing pandas_ext registers the .ai / .aio accessors on pandas Series and DataFrame. The final line runs a tiny smoke test (English → French translation) to confirm that authentication, model selection, and network reachability all work end-to-end.
Replace the model name with one of your multimodal-capable deployments — it is required because the pipeline feeds PDF bytes directly to the model.
# Standard library
from enum import Enum
# Third-party
import duckdb
import pandas as pd
from deltalake.writer import write_deltalake
from pydantic import BaseModel
# openaivec
import openaivec
from openaivec import pandas_ext # noqa: F401 registers the .ai / .aio accessor
from openaivec.duckdb_ext import responses_udf
# Name of a deployed multimodal-capable model
openaivec.set_responses_model("<YOUR_MULTIMODAL_DEPLOYMENT>")
# Smoke test: if this returns a French translation,
# auth + model + network are all working.
pd.Series(["apple", "banana"]).ai.responses("Translate to French.")
0 pomme 1 banane dtype: str
4. Open an in-memory DuckDB connection¶
DuckDB is a single-process OLAP engine. It runs entirely inside the Fabric Notebook's Python kernel and can read OneLake files directly via the POSIX path that Fabric mounts under /lakehouse/default/Files/....
We create one connection that will host the UDF registration and all subsequent SQL.
conn = duckdb.connect()
5. Define the extraction schema (Pydantic)¶
We declare what to extract from each PDF as a Pydantic model. responses_udf forwards this schema to Azure OpenAI as a Structured Output (response_format), which gives us:
- the LLM can only return JSON that conforms to the schema,
- values are automatically validated and converted to Python types,
Enumfields constrain allowed values to a fixed set.
For this sample we model an inventory transfer slip as InventoryTransferSlip, along with its line items (TransferItem) and two enums.
class InventoryCategory(str, Enum):
"""Inventory item category."""
RAW_MATERIAL = "原材料"
PARTS = "部品"
FINISHED_GOODS = "製品"
PACKAGING = "梱包資材"
CONSUMABLES = "消耗品"
class TransferStatus(str, Enum):
"""Slip status."""
PENDING = "申請中"
APPROVED = "承認済"
IN_TRANSIT = "移動中"
COMPLETED = "完了"
PARTIAL = "一部完了"
class TransferItem(BaseModel):
"""A single line item on a transfer slip."""
sku: str
name: str
category: InventoryCategory
unit: str
quantity: int
unit_price: float
amount: float
class InventoryTransferSlip(BaseModel):
"""Structured representation of an inventory transfer slip."""
slip_number: str
date: str
source_warehouse_code: str
source_warehouse_name: str
destination_warehouse_code: str
destination_warehouse_name: str
department: str
requester: str
approver: str
reason: str
status: TransferStatus
items: list[TransferItem]
total_quantity: int
total_amount: float
6. Register parse_transfer_slip as a DuckDB UDF¶
openaivec.duckdb_ext.responses_udf registers a scalar function on a DuckDB connection that calls the Azure OpenAI Responses API. Key parameters:
instructions: system prompt describing what to extract and how to format it.response_format: the Pydantic model from the previous cell — LLM output is guaranteed to match it.batch_size=1: one PDF per request (multimodal inputs are most reliable when sent individually).max_concurrency=36: number of parallel workers; tune against your Azure OpenAI rate limit.multimodal=True: required flag telling the UDF to treat the input column as files (PDF / images).
After registration you can simply call SELECT parse_transfer_slip(file) FROM glob(...) — it behaves like any other SQL function.
responses_udf(
conn,
"parse_transfer_slip",
instructions=(
"You are a parser for inventory transfer slips. "
"Extract all information from the slip and return it in the structured format. "
"Numbers must not contain currency symbols or thousand separators. "
"Dates must be formatted as YYYY-MM-DD. "
"Split warehouse codes (e.g. 'WH-TK01') and warehouse names into separate fields. "
"Classify each line item into the appropriate inventory category."
),
response_format=InventoryTransferSlip,
batch_size=1,
max_concurrency=36,
multimodal=True,
)
7. List the input PDF files¶
The glob() table function enumerates PDFs under Files/transfer/ of the attached Lakehouse. OneLake is mounted on the Fabric kernel's local filesystem, so the POSIX path /lakehouse/default/Files/transfer/*.pdf works directly.
conn.sql("""
select
*
from glob('/lakehouse/default/Files/transfer/*.pdf')
""")
┌────────────────────────────────────────────────────┐ │ file │ │ varchar │ ├────────────────────────────────────────────────────┤ │ /lakehouse/default/Files/transfer/transfer_001.pdf │ │ /lakehouse/default/Files/transfer/transfer_002.pdf │ │ /lakehouse/default/Files/transfer/transfer_003.pdf │ │ /lakehouse/default/Files/transfer/transfer_004.pdf │ │ /lakehouse/default/Files/transfer/transfer_005.pdf │ │ /lakehouse/default/Files/transfer/transfer_006.pdf │ │ /lakehouse/default/Files/transfer/transfer_007.pdf │ │ /lakehouse/default/Files/transfer/transfer_008.pdf │ │ /lakehouse/default/Files/transfer/transfer_009.pdf │ │ /lakehouse/default/Files/transfer/transfer_010.pdf │ │ · │ │ · │ │ · │ │ /lakehouse/default/Files/transfer/transfer_041.pdf │ │ /lakehouse/default/Files/transfer/transfer_042.pdf │ │ /lakehouse/default/Files/transfer/transfer_043.pdf │ │ /lakehouse/default/Files/transfer/transfer_044.pdf │ │ /lakehouse/default/Files/transfer/transfer_045.pdf │ │ /lakehouse/default/Files/transfer/transfer_046.pdf │ │ /lakehouse/default/Files/transfer/transfer_047.pdf │ │ /lakehouse/default/Files/transfer/transfer_048.pdf │ │ /lakehouse/default/Files/transfer/transfer_049.pdf │ │ /lakehouse/default/Files/transfer/transfer_050.pdf │ ├────────────────────────────────────────────────────┤ │ 50 rows (20 shown) │ └────────────────────────────────────────────────────┘
8. Parse the PDFs and flatten line items¶
This is the core of the notebook. The SQL is structured as three CTEs:
parsed: callparse_transfer_slip(file)on each PDF, producing oneInventoryTransferSlipSTRUCT column.items:unnest()theitemslist (list[TransferItem]) so each line item becomes its own row.- Final SELECT: join header columns with item columns on
slip_numberto build a flat one-row-per-item table.
The result is materialized as a pandas DataFrame via .to_df() and handed off to the Delta writer. responses_udf handles deduplication, auto-batching, and retries underneath — your code only needs to express the SQL.
pandas_dataframe = conn.sql("""
with parsed as (
select
parse_transfer_slip(file) as r
from glob('/lakehouse/default/Files/transfer/*.pdf')
), items as (
select
r.slip_number,
unnest(r.items) as item
from parsed
)
select
r.slip_number,
r.date,
r.source_warehouse_code,
r.source_warehouse_name,
r.destination_warehouse_code,
r.destination_warehouse_name,
r.department,
r.requester,
r.approver,
r.reason,
r.status,
items.item.sku as item_sku,
items.item.name as item_name,
items.item.category as item_category,
items.item.unit as item_unit,
items.item.quantity as item_quantity,
items.item.unit_price as item_unit_price,
items.item.amount as item_amount
from parsed
left join items
on parsed.r.slip_number = items.slip_number
""").to_df()
pandas_dataframe
| slip_number | date | source_warehouse_code | source_warehouse_name | destination_warehouse_code | destination_warehouse_name | department | requester | approver | reason | status | item_sku | item_name | item_category | item_unit | item_quantity | item_unit_price | item_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TR-202505-5966 | 2025-05-26 | WH-CB01 | 千葉物流センター | WH-FK01 | 福岡倉庫 | 購買部 | 伊藤 直子 | 佐藤 花子 | 拠点統合に伴う移管 | 承認済 | PT-2006 | スプリング SWP φ10 | 部品 | 個 | 120 | 89.0 | 10680.0 |
| 1 | TR-202412-3100 | 2025-07-15 | WH-TK02 | 東京第二倉庫 | WH-OS01 | 大阪中央倉庫 | 購買部 | 伊藤 直子 | 山田 真理 | 拠点統合に伴う移管 | 申請中 | FG-3002 | センサーモジュール SM-120 | 製品 | 個 | 173 | 15000.0 | 2595000.0 |
| 2 | TR-202509-0094 | 2026-01-17 | WH-TK01 | 東京第一倉庫 | WH-NG01 | 名古屋倉庫 | 物流管理部 | 渡辺 健太 | 山本 大輔 | 拠点統合に伴う移管 | 承認済 | PK-4001 | 段ボール箱 60サイズ | 梱包資材 | 箱 | 81 | 85.0 | 6885.0 |
| 3 | TR-202604-9017 | 2026-01-21 | WH-CB01 | 千葉物流センター | WH-FK01 | 福岡倉庫 | 営業部 | 田中 太郎 | 小林 誠 | 定期補充 | 完了 | FG-3002 | センサーモジュール SM-120 | 製品 | 個 | 160 | 15000.0 | 2400000.0 |
| 4 | TR-202410-7873 | 2026-02-25 | WH-NG01 | 名古屋倉庫 | WH-FK01 | 福岡倉庫 | 品質管理部 | 田中 太郎 | 加藤 恵 | 棚卸差異是正 | 完了 | CS-5001 | 作業手袋 Lサイズ | 消耗品 | 双 | 125 | 180.0 | 22500.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 225 | TR-202611-6250 | 2025-11-03 | WH-NG01 | 名古屋倉庫 | WH-YK01 | 横浜倉庫 | 出荷管理課 | 伊藤 直子 | 吉田 拓也 | 棚卸差異是正 | 移動中 | FG-3004 | 表示パネル DP-7inch | 製品 | 個 | 10 | 9500.0 | 95000.0 |
| 226 | TR-202510-2948 | 2025-04-21 | WH-YK01 | 横浜倉庫 | WH-ST01 | 埼玉配送センター | 製造部 | 佐藤 花子 | 小林 誠 | 拠点統合に伴う移管 | 一部完了 | CS-5004 | 潤滑油 タービン VG32 | 消耗品 | L | 173 | 650.0 | 112450.0 |
| 227 | TR-202404-6557 | 2026-01-19 | WH-TK02 | 東京第二倉庫 | WH-NG01 | 名古屋倉庫 | 製造部 | 鈴木 一郎 | 中村 由美 | 返品処理 | 完了 | FG-3002 | センサーモジュール SM-120 | 製品 | 個 | 140 | 15000.0 | 2100000.0 |
| 228 | TR-202609-3895 | 2025-06-02 | WH-TK01 | 東京第一倉庫 | WH-YK01 | 横浜倉庫 | 製造部 | 高橋 美咲 | 中村 由美 | 生産計画に基づく移動 | 完了 | RM-1002 | ステンレス鋼 SUS304 丸棒 | 原材料 | 本 | 90 | 3500.0 | 315000.0 |
| 229 | TR-202612-9948 | 2026-03-30 | WH-NG01 | 名古屋倉庫 | WH-TK01 | 東京第一倉庫 | 出荷管理課 | 中村 由美 | 山田 真理 | 定期補充 | 申請中 | RM-1004 | ポリカーボネート樹脂ペレット | 原材料 | kg | 160 | 450.0 | 72000.0 |
230 rows × 18 columns
9. Write to Delta Lake on OneLake¶
Save the extracted DataFrame to the attached Lakehouse under Tables/transfers as a Delta table. Substitute your own workspace and Lakehouse names into the abfss:// URI. Once written, the table appears in the Lakehouse Explorer and is queryable from the SQL endpoint, Power BI, and other notebooks.
write_deltalake(
"abfss://<your-workspace>@onelake.dfs.fabric.microsoft.com/<your-lakehouse>.Lakehouse/Tables/transfers",
pandas_dataframe,
mode="overwrite",
)