Pandas Extension¶
openaivec.pandas_ext ¶
Pandas Series / DataFrame extension for OpenAI.
Setup¶
from openai import OpenAI, AzureOpenAI, AsyncOpenAI, AsyncAzureOpenAI
from openaivec import pandas_ext
# Option 1: Use environment variables (automatic detection)
# Set OPENAI_API_KEY or Azure OpenAI environment variables
# (AZURE_OPENAI_API_KEY, AZURE_OPENAI_BASE_URL, AZURE_OPENAI_API_VERSION)
# No explicit setup needed - clients are automatically created
# Option 2: Use an existing OpenAI client instance
client = OpenAI(api_key="your-api-key")
pandas_ext.use(client)
# Option 3: Use an existing Azure OpenAI client instance
azure_client = AzureOpenAI(
api_key="your-azure-key",
base_url="https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/",
api_version="preview"
)
pandas_ext.use(azure_client)
# Option 4: Use async Azure OpenAI client instance
async_azure_client = AsyncAzureOpenAI(
api_key="your-azure-key",
base_url="https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/",
api_version="preview"
)
pandas_ext.use_async(async_azure_client)
# Set up model names (optional, defaults shown)
pandas_ext.responses_model("gpt-4.1-mini")
pandas_ext.embeddings_model("text-embedding-3-small")
This module provides `.ai` and `.aio` accessors for pandas Series and DataFrames to easily interact with OpenAI APIs for tasks like generating responses or embeddings.
Classes¶
OpenAIVecSeriesAccessor ¶
pandas Series accessor (`.ai`) that adds OpenAI helpers.
Source code in src/openaivec/pandas_ext.py
Functions¶
responses_with_cache ¶
responses_with_cache(
instructions: str,
cache: BatchingMapProxy[str, ResponseFormat],
response_format: type[ResponseFormat] = str,
**api_kwargs,
) -> pd.Series
Call an LLM once for every Series element using a provided cache.
This is a lower-level method that allows explicit cache management for advanced use cases. Most users should use the standard `responses` method instead.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | System prompt prepended to every user message. | required |
`cache` | `BatchingMapProxy[str, ResponseFormat]` | Explicit cache instance for batching and deduplication control. | required |
`response_format` | `type[ResponseFormat]` | Pydantic model or built-in type the assistant should return. Defaults to `str`. | `str` |
`**api_kwargs` | | Arbitrary OpenAI Responses API parameters (e.g. `temperature`, `top_p`) forwarded to the underlying client. | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series whose values are instances of `response_format`. |
Source code in src/openaivec/pandas_ext.py
responses ¶
responses(
instructions: str,
response_format: type[ResponseFormat] = str,
batch_size: int | None = None,
show_progress: bool = False,
**api_kwargs,
) -> pd.Series
Call an LLM once for every Series element.
Example
animals = pd.Series(["cat", "dog", "elephant"])
# Basic usage
animals.ai.responses("translate to French")
# With progress bar in Jupyter notebooks
large_series = pd.Series(["data"] * 1000)
large_series.ai.responses(
"analyze this data",
batch_size=32,
show_progress=True
)
# With custom temperature
animals.ai.responses(
"translate creatively",
temperature=0.8
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | System prompt prepended to every user message. | required |
`response_format` | `type[ResponseFormat]` | Pydantic model or built-in type the assistant should return. Defaults to `str`. | `str` |
`batch_size` | `int \| None` | Number of prompts grouped into a single request. Defaults to `None` (automatic batch size optimization). | `None` |
`show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `False`. | `False` |
`**api_kwargs` | | Additional OpenAI API parameters (temperature, top_p, etc.). | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series whose values are instances of `response_format`. |
Source code in src/openaivec/pandas_ext.py
embeddings_with_cache ¶
Compute OpenAI embeddings for every Series element using a provided cache.
This method allows external control over caching behavior by accepting a pre-configured BatchingMapProxy instance, enabling cache sharing across multiple operations or custom batch size management.
Example
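A minimal sketch mirroring the async variant later in this page; it assumes `BatchingMapProxy` is importable from `openaivec._proxy`:
import numpy as np
import pandas as pd
from openaivec._proxy import BatchingMapProxy
# Create a shared cache with a custom batch size
shared_cache = BatchingMapProxy[str, np.ndarray](batch_size=64)
animals = pd.Series(["cat", "dog", "elephant"])
embeddings = animals.ai.embeddings_with_cache(cache=shared_cache)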
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`cache` | `BatchingMapProxy[str, ndarray]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | required |
`**api_kwargs` | | Additional keyword arguments to pass to the OpenAI API. | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series whose values are `numpy.ndarray` embeddings. |
Source code in src/openaivec/pandas_ext.py
embeddings ¶
embeddings(
batch_size: int | None = None,
show_progress: bool = False,
**api_kwargs,
) -> pd.Series
Compute OpenAI embeddings for every Series element.
Example
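A minimal sketch; `dimensions` is a standard OpenAI embeddings parameter for text-embedding-3 models:
import pandas as pd
animals = pd.Series(["cat", "dog", "elephant"])
# Basic usage
embeddings = animals.ai.embeddings()
# With reduced dimensionality and a progress bar
small = animals.ai.embeddings(dimensions=256, show_progress=True)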
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`batch_size` | `int \| None` | Number of inputs grouped into a single request. Defaults to `None` (automatic batch size optimization). | `None` |
`show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `False`. | `False` |
`**api_kwargs` | | Additional OpenAI API parameters (e.g., dimensions for text-embedding-3 models). | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series whose values are `numpy.ndarray` embeddings. |
Source code in src/openaivec/pandas_ext.py
task_with_cache ¶
task_with_cache(
task: PreparedTask[ResponseFormat],
cache: BatchingMapProxy[str, ResponseFormat],
) -> pd.Series
Execute a prepared task on every Series element using a provided cache.
This mirrors `responses_with_cache` but uses the task's stored instructions, response format, and API parameters. A supplied `BatchingMapProxy` enables cross-operation deduplicated reuse and external batch size / progress control.
Example
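A minimal sketch adapted from the async variant later in this page; the prepared task itself is a placeholder:
import pandas as pd
from openaivec._model import PreparedTask
from openaivec._proxy import BatchingMapProxy
# Shared cache enables deduplication across operations
shared_cache = BatchingMapProxy(batch_size=64)
sentiment_task = PreparedTask(...)  # assume a prepared sentiment task
reviews = pd.Series(["Great product!", "Not satisfied"])
results = reviews.ai.task_with_cache(sentiment_task, cache=shared_cache)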
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`task` | `PreparedTask` | Prepared task (instructions + response_format + sampling params). | required |
`cache` | `BatchingMapProxy[str, ResponseFormat]` | Pre-configured cache instance. | required |
Note
The task's stored API parameters are used. Core routing keys (`model`, system instructions, user input) are managed internally and cannot be overridden.
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Task results aligned with the original Series index. |
Source code in src/openaivec/pandas_ext.py
task ¶
task(
task: PreparedTask,
batch_size: int | None = None,
show_progress: bool = False,
) -> pd.Series
Execute a prepared task on every Series element.
Example
from openaivec._model import PreparedTask
# Assume you have a prepared task for sentiment analysis
sentiment_task = PreparedTask(...)
reviews = pd.Series(["Great product!", "Not satisfied", "Amazing quality"])
# Basic usage
results = reviews.ai.task(sentiment_task)
# With progress bar for large datasets
large_reviews = pd.Series(["review text"] * 2000)
results = large_reviews.ai.task(
sentiment_task,
batch_size=50,
show_progress=True
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`task` | `PreparedTask` | A pre-configured task containing instructions, response format, and other parameters for processing the inputs. | required |
`batch_size` | `int \| None` | Number of prompts grouped into a single request to optimize API usage. Defaults to `None` (automatic batch size optimization). | `None` |
`show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `False`. | `False` |
Note
The task's stored API parameters are used. Core batching / routing keys (`model`, `instructions` / system message, user `input`) are managed by the library and cannot be overridden.
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series whose values are instances of the task's response format. |
Source code in src/openaivec/pandas_ext.py
parse_with_cache ¶
parse_with_cache(
instructions: str,
cache: BatchingMapProxy[str, ResponseFormat],
response_format: type[ResponseFormat] | None = None,
max_examples: int = 100,
**api_kwargs,
) -> pd.Series
Parse Series values using an LLM with a provided cache.
This method allows external control over caching behavior while parsing Series content into structured data. If no response format is provided, the method automatically infers an appropriate schema by analyzing the data patterns.
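Example
A minimal sketch, assuming `BatchingMapProxy` from `openaivec._proxy` and a hypothetical `Contact` model:
import pandas as pd
from pydantic import BaseModel
from openaivec._proxy import BatchingMapProxy
class Contact(BaseModel):
    name: str
    email: str
cache = BatchingMapProxy(batch_size=32)
texts = pd.Series(["Reach Alice at alice@example.com"])
parsed = texts.ai.parse_with_cache(
    "Extract the contact name and email address",
    cache=cache,
    response_format=Contact,
)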
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | Plain language description of what information to extract (e.g., "Extract customer information including name and contact details"). This guides both the extraction process and schema inference. | required |
`cache` | `BatchingMapProxy[str, ResponseFormat]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | required |
`response_format` | `type[ResponseFormat] \| None` | Target structure for the parsed data. Can be a Pydantic model class, built-in type (str, int, float, bool, list, dict), or None. If None, the method infers an appropriate schema based on the instructions and data. Defaults to None. | `None` |
`max_examples` | `int` | Maximum number of Series values to analyze when inferring the schema. Only used when response_format is None. Defaults to 100. | `100` |
`**api_kwargs` | | Additional OpenAI API parameters (temperature, top_p, frequency_penalty, presence_penalty, seed, etc.) forwarded to the underlying API calls. | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series containing parsed structured data. Each value is an instance of the specified response_format or the inferred schema model, aligned with the original Series index. |
Source code in src/openaivec/pandas_ext.py
parse ¶
parse(
instructions: str,
response_format: type[ResponseFormat] | None = None,
max_examples: int = 100,
batch_size: int | None = None,
show_progress: bool = False,
**api_kwargs,
) -> pd.Series
Parse Series values into structured data using an LLM.
This method extracts structured information from unstructured text in the Series. When no response format is provided, it automatically infers an appropriate schema by analyzing patterns in the data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | Plain language description of what information to extract (e.g., "Extract product details including price, category, and availability"). This guides both the extraction process and schema inference. | required |
`response_format` | `type[ResponseFormat] \| None` | Target structure for the parsed data. Can be a Pydantic model class, built-in type (str, int, float, bool, list, dict), or None. If None, automatically infers a schema. Defaults to None. | `None` |
`max_examples` | `int` | Maximum number of Series values to analyze when inferring schema. Only used when response_format is None. Defaults to 100. | `100` |
`batch_size` | `int \| None` | Number of requests to process per batch. None enables automatic optimization. Defaults to None. | `None` |
`show_progress` | `bool` | Display progress bar in Jupyter notebooks. Defaults to False. | `False` |
`**api_kwargs` | | Additional OpenAI API parameters (temperature, top_p, frequency_penalty, presence_penalty, seed, etc.). | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series containing parsed structured data as instances of response_format or the inferred schema model. |
Example
# With explicit schema
from pydantic import BaseModel
class Product(BaseModel):
name: str
price: float
in_stock: bool
descriptions = pd.Series([
"iPhone 15 Pro - $999, available now",
"Samsung Galaxy S24 - $899, out of stock"
])
products = descriptions.ai.parse(
"Extract product information",
response_format=Product
)
# With automatic schema inference
reviews = pd.Series([
"Great product! 5 stars. Fast shipping.",
"Poor quality. 2 stars. Slow delivery."
])
parsed = reviews.ai.parse(
"Extract review rating and shipping feedback"
)
Source code in src/openaivec/pandas_ext.py
infer_schema ¶
Infer a structured data schema from Series content using AI.
This method analyzes a sample of Series values to automatically generate a Pydantic model that captures the relevant information structure. The inferred schema supports both flat and hierarchical (nested) structures, making it suitable for complex data extraction tasks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | Plain language description of the extraction goal (e.g., "Extract customer information for CRM system", "Parse event details for calendar integration"). This guides which fields to include and their purpose. | required |
`max_examples` | `int` | Maximum number of Series values to analyze for pattern detection. The method samples randomly up to this limit. Higher values may improve schema quality but increase inference time. Defaults to 100. | `100` |
`**api_kwargs` | | Additional OpenAI API parameters for fine-tuning the inference process. | `{}` |
Returns:
Name | Type | Description |
---|---|---|
`InferredSchema` | `InferredSchema` | A comprehensive schema object containing: instructions (refined extraction objective statement); fields (hierarchical field specifications with names, types, descriptions, and nested structures where applicable); inference_prompt (optimized prompt for consistent extraction); model (dynamically generated Pydantic model class supporting both flat and nested structures); task (PreparedTask configured for batch extraction using the inferred schema). |
Example
# Simple flat structure
reviews = pd.Series([
"5 stars! Great product, fast shipping to NYC.",
"2 stars. Product broke, slow delivery to LA."
])
schema = reviews.ai.infer_schema(
"Extract review ratings and shipping information"
)
# Hierarchical structure
orders = pd.Series([
"Order #123: John Doe, 123 Main St, NYC. Items: iPhone ($999), Case ($29)",
"Order #456: Jane Smith, 456 Oak Ave, LA. Items: iPad ($799)"
])
schema = orders.ai.infer_schema(
"Extract order details including customer and items"
)
# Inferred schema may include nested structures like:
# - customer: {name: str, address: str, city: str}
# - items: [{product: str, price: float}]
# Apply the schema for extraction
extracted = orders.ai.task(schema.task)
Note
The inference process uses multiple AI iterations to ensure schema validity. Nested structures are automatically detected when the data contains hierarchical relationships. The generated Pydantic model ensures type safety and validation for all extracted data.
Source code in src/openaivec/pandas_ext.py
count_tokens ¶
Count `tiktoken` tokens per row.
Example
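A minimal sketch; the counts depend on the model configured via `responses_model`:
import pandas as pd
s = pd.Series(["hello world", "a longer sentence to tokenize"])
token_counts = s.ai.count_tokens()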
This method uses the `tiktoken` library to count tokens based on the model name set by `responses_model`.
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Token counts for each element. |
Source code in src/openaivec/pandas_ext.py
extract ¶
Expand a Series of Pydantic models/dicts into columns.
Example
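A minimal sketch with a hypothetical `Animal` model; the exact prefixed column names follow the Series-name convention described below:
import pandas as pd
from pydantic import BaseModel
class Animal(BaseModel):
    name: str
    legs: int
s = pd.Series([Animal(name="cat", legs=4)], name="animal")
df = s.ai.extract()  # columns prefixed with the Series name, e.g. "animal_name" (separator assumed)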
This method returns a DataFrame with the same index as the Series, where each column corresponds to a key in the dictionaries. If the Series has a name, extracted columns are prefixed with it.
Returns:
Type | Description |
---|---|
`DataFrame` | pandas.DataFrame: Expanded representation. |
Source code in src/openaivec/pandas_ext.py
OpenAIVecDataFrameAccessor ¶
pandas DataFrame accessor (`.ai`) that adds OpenAI helpers.
Source code in src/openaivec/pandas_ext.py
Functions¶
responses_with_cache ¶
responses_with_cache(
instructions: str,
cache: BatchingMapProxy[str, ResponseFormat],
response_format: type[ResponseFormat] = str,
**api_kwargs,
) -> pd.Series
Generate a response for each row after serializing it to JSON using a provided cache.
This method allows external control over caching behavior by accepting a pre-configured BatchingMapProxy instance, enabling cache sharing across multiple operations or custom batch size management.
Example
from openaivec._proxy import BatchingMapProxy
# Create a shared cache with custom batch size
shared_cache = BatchingMapProxy(batch_size=64)
df = pd.DataFrame([
{"name": "cat", "legs": 4},
{"name": "dog", "legs": 4},
{"name": "elephant", "legs": 4},
])
result = df.ai.responses_with_cache(
"what is the animal's name?",
cache=shared_cache
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | System prompt for the assistant. | required |
`cache` | `BatchingMapProxy[str, ResponseFormat]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | required |
`response_format` | `type[ResponseFormat]` | Desired Python type of the responses. Defaults to `str`. | `str` |
`**api_kwargs` | | Additional OpenAI API parameters (temperature, top_p, etc.). | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Responses aligned with the DataFrame's original index. |
Source code in src/openaivec/pandas_ext.py
responses ¶
responses(
instructions: str,
response_format: type[ResponseFormat] = str,
batch_size: int | None = None,
show_progress: bool = False,
**api_kwargs,
) -> pd.Series
Generate a response for each row after serializing it to JSON.
Example
df = pd.DataFrame([
{"name": "cat", "legs": 4},
{"name": "dog", "legs": 4},
{"name": "elephant", "legs": 4},
])
# Basic usage
df.ai.responses("what is the animal's name?")
# With progress bar for large datasets
large_df = pd.DataFrame({"id": list(range(1000))})
large_df.ai.responses(
"generate a name for this ID",
batch_size=20,
show_progress=True
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | System prompt for the assistant. | required |
`response_format` | `type[ResponseFormat]` | Desired Python type of the responses. Defaults to `str`. | `str` |
`batch_size` | `int \| None` | Number of requests sent in one batch. Defaults to `None` (automatic batch size optimization). | `None` |
`show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `False`. | `False` |
`**api_kwargs` | | Additional OpenAI API parameters (temperature, top_p, etc.). | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Responses aligned with the DataFrame's original index. |
Source code in src/openaivec/pandas_ext.py
task_with_cache ¶
task_with_cache(
task: PreparedTask[ResponseFormat],
cache: BatchingMapProxy[str, ResponseFormat],
) -> pd.Series
Execute a prepared task on each DataFrame row after serializing it to JSON using a provided cache.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`task` | `PreparedTask` | Prepared task (instructions + response_format + sampling params). | required |
`cache` | `BatchingMapProxy[str, ResponseFormat]` | Pre-configured cache instance. | required |
Note
The task's stored API parameters are used. Core routing keys are managed internally.
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Task results aligned with the DataFrame's original index. |
Source code in src/openaivec/pandas_ext.py
task ¶
task(
task: PreparedTask,
batch_size: int | None = None,
show_progress: bool = False,
) -> pd.Series
Execute a prepared task on each DataFrame row after serializing it to JSON.
Example
from openaivec._model import PreparedTask
# Assume you have a prepared task for data analysis
analysis_task = PreparedTask(...)
df = pd.DataFrame([
{"name": "cat", "legs": 4},
{"name": "dog", "legs": 4},
{"name": "elephant", "legs": 4},
])
# Basic usage
results = df.ai.task(analysis_task)
# With progress bar for large datasets
large_df = pd.DataFrame({"id": list(range(1000))})
results = large_df.ai.task(
analysis_task,
batch_size=50,
show_progress=True
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`task` | `PreparedTask` | A pre-configured task containing instructions, response format, and other parameters for processing the inputs. | required |
`batch_size` | `int \| None` | Number of requests sent in one batch to optimize API usage. Defaults to `None` (automatic batch size optimization). | `None` |
`show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `False`. | `False` |
Note
The task's stored API parameters are used. Core batching / routing keys (`model`, `instructions` / system message, user `input`) are managed by the library and cannot be overridden.
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series whose values are instances of the task's response format, aligned with the DataFrame's original index. |
Source code in src/openaivec/pandas_ext.py
parse_with_cache ¶
parse_with_cache(
instructions: str,
cache: BatchingMapProxy[str, ResponseFormat],
response_format: type[ResponseFormat] | None = None,
max_examples: int = 100,
**api_kwargs,
) -> pd.Series
Parse DataFrame rows into structured data using an LLM with a provided cache.
This method processes each DataFrame row (converted to JSON) and extracts structured information using an LLM. External cache control enables deduplication across operations and custom batch management.
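Example
A minimal sketch, assuming `BatchingMapProxy` from `openaivec._proxy` and automatic schema inference:
import pandas as pd
from openaivec._proxy import BatchingMapProxy
cache = BatchingMapProxy(batch_size=32)
df = pd.DataFrame({"order": ["#123 shipped to NYC", "#456 delayed in LA"]})
parsed = df.ai.parse_with_cache(
    "Extract shipping details and order status",
    cache=cache,
)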
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | Plain language description of what information to extract from each row (e.g., "Extract shipping details and order status"). Guides both extraction and schema inference. | required |
`cache` | `BatchingMapProxy[str, ResponseFormat]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` for automatic optimization. | required |
`response_format` | `type[ResponseFormat] \| None` | Target structure for parsed data. Can be a Pydantic model, built-in type, or None for automatic schema inference. Defaults to None. | `None` |
`max_examples` | `int` | Maximum rows to analyze when inferring schema (only used when response_format is None). Defaults to 100. | `100` |
`**api_kwargs` | | Additional OpenAI API parameters (temperature, top_p, frequency_penalty, presence_penalty, seed, etc.). | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series containing parsed structured data as instances of response_format or the inferred schema model, indexed like the original DataFrame. |
Source code in src/openaivec/pandas_ext.py
parse ¶
parse(
instructions: str,
response_format: type[ResponseFormat] | None = None,
max_examples: int = 100,
batch_size: int | None = None,
show_progress: bool = False,
**api_kwargs,
) -> pd.Series
Parse DataFrame rows into structured data using an LLM.
Each row is converted to JSON and processed to extract structured information. When no response format is provided, the method automatically infers an appropriate schema from the data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | Plain language description of extraction goals (e.g., "Extract transaction details including amount, date, and merchant"). Guides extraction and schema inference. | required |
`response_format` | `type[ResponseFormat] \| None` | Target structure for parsed data. Can be a Pydantic model, built-in type, or None for automatic inference. Defaults to None. | `None` |
`max_examples` | `int` | Maximum rows to analyze for schema inference (when response_format is None). Defaults to 100. | `100` |
`batch_size` | `int \| None` | Rows per API batch. None enables automatic optimization. Defaults to None. | `None` |
`show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to False. | `False` |
`**api_kwargs` | | Additional OpenAI API parameters. | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Parsed structured data indexed like the original DataFrame. |
Example
df = pd.DataFrame({
'log': [
'2024-01-01 10:00 ERROR Database connection failed',
'2024-01-01 10:05 INFO Service started successfully'
]
})
# With automatic schema inference
parsed = df.ai.parse("Extract timestamp, level, and message")
# Returns Series with inferred structure like:
# {timestamp: str, level: str, message: str}
Source code in src/openaivec/pandas_ext.py
infer_schema ¶
Infer a structured data schema from DataFrame rows using AI.
This method analyzes a sample of DataFrame rows to automatically infer a structured schema that can be used for consistent data extraction. Each row is converted to JSON format and analyzed to identify patterns, field types, and potential categorical values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | Plain language description of how the extracted structured data will be used (e.g., "Extract operational metrics for dashboard", "Parse customer attributes for segmentation"). This guides field relevance and helps exclude irrelevant information. | required |
`max_examples` | `int` | Maximum number of rows to analyze from the DataFrame. The method will sample randomly up to this limit. Defaults to 100. | `100` |
Returns:
Name | Type | Description |
---|---|---|
`InferredSchema` | `InferredSchema` | An object containing: instructions (normalized statement of the extraction objective); fields (list of field specifications with names, types, and descriptions); inference_prompt (reusable prompt for future extractions); model (dynamically generated Pydantic model for parsing); task (PreparedTask for batch extraction operations). |
Example
df = pd.DataFrame({
'text': [
"Order #123: Shipped to NYC, arriving Tuesday",
"Order #456: Delayed due to weather, new ETA Friday",
"Order #789: Delivered to customer in LA"
],
'timestamp': ['2024-01-01', '2024-01-02', '2024-01-03']
})
# Infer schema for logistics tracking
schema = df.ai.infer_schema(
instructions="Extract shipping status and location data for logistics tracking"
)
# Apply the schema to extract structured data
extracted_df = df.ai.task(schema.task)
Note
Each row is converted to JSON before analysis. The inference process automatically detects hierarchical relationships and creates appropriate nested structures when present. The generated Pydantic model ensures type safety and validation.
Source code in src/openaivec/pandas_ext.py
extract ¶
Flatten one column of Pydantic models/dicts into top‑level columns.
Example
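A minimal sketch with a hypothetical `Animal` model stored in one column:
import pandas as pd
from pydantic import BaseModel
class Animal(BaseModel):
    name: str
    legs: int
df = pd.DataFrame({"animal": [Animal(name="cat", legs=4)]})
flat = df.ai.extract("animal")  # expands "animal" into top-level columns, drops the source column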
This method returns a DataFrame with the same index as the original, where each column corresponds to a key in the dictionaries. The source column is dropped.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`column` | `str` | Column to expand. | required |
Returns:
Type | Description |
---|---|
`DataFrame` | pandas.DataFrame: Original DataFrame with the extracted columns; the source column is dropped. |
Source code in src/openaivec/pandas_ext.py
fillna ¶
fillna(
target_column_name: str,
max_examples: int = 500,
batch_size: int | None = None,
show_progress: bool = False,
) -> pd.DataFrame
Fill missing values in a DataFrame column using AI-powered inference.
This method uses machine learning to intelligently fill missing (NaN) values in a specified column by analyzing patterns from non-missing rows in the DataFrame. It creates a prepared task that provides examples of similar rows to help the AI model predict appropriate values for the missing entries.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`target_column_name` | `str` | The name of the column containing missing values that need to be filled. | required |
`max_examples` | `int` | The maximum number of example rows to use for context when predicting missing values. Higher values may improve accuracy but increase API costs and processing time. Defaults to 500. | `500` |
`batch_size` | `int \| None` | Number of requests sent in one batch to optimize API usage. Defaults to `None` (automatic batch size optimization). | `None` |
`show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `False`. | `False` |
Returns:
Type | Description |
---|---|
`DataFrame` | pandas.DataFrame: A new DataFrame with missing values filled in the target column. The original DataFrame is not modified. |
Example
df = pd.DataFrame({
'name': ['Alice', 'Bob', None, 'David'],
'age': [25, 30, 35, None],
'city': ['Tokyo', 'Osaka', 'Kyoto', 'Tokyo']
})
# Fill missing values in the 'name' column
filled_df = df.ai.fillna('name')
# With progress bar for large datasets
large_df = pd.DataFrame({'name': [None] * 1000, 'age': list(range(1000))})
filled_df = large_df.ai.fillna('name', batch_size=32, show_progress=True)
Note
If the target column has no missing values, the original DataFrame is returned unchanged.
Source code in src/openaivec/pandas_ext.py
similarity ¶
Compute cosine similarity between two columns containing embedding vectors.
This method calculates the cosine similarity between vectors stored in two columns of the DataFrame. The vectors should be numpy arrays or array-like objects that support dot product operations.
Example
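A minimal sketch; the embedding columns are typically produced with `.ai.embeddings()`:
import pandas as pd
df = pd.DataFrame({"a": ["cat", "dog"], "b": ["kitten", "puppy"]})
df["emb_a"] = df["a"].ai.embeddings()
df["emb_b"] = df["b"].ai.embeddings()
scores = df.ai.similarity("emb_a", "emb_b")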
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`col1` | `str` | Name of the first column containing embedding vectors. | required |
`col2` | `str` | Name of the second column containing embedding vectors. | required |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series containing cosine similarity scores between corresponding vectors in col1 and col2, with values ranging from -1 to 1, where 1 indicates identical direction. |
Source code in src/openaivec/pandas_ext.py
AsyncOpenAIVecSeriesAccessor ¶
pandas Series accessor (`.aio`) that adds OpenAI helpers.
Source code in src/openaivec/pandas_ext.py
Functions¶
responses_with_cache async ¶
responses_with_cache(
instructions: str,
cache: AsyncBatchingMapProxy[str, ResponseFormat],
response_format: type[ResponseFormat] = str,
**api_kwargs,
) -> pd.Series
Call an LLM once for every Series element using a provided cache (asynchronously).
This method allows external control over caching behavior by accepting a pre-configured AsyncBatchingMapProxy instance, enabling cache sharing across multiple operations or custom batch size management. The concurrency is controlled by the cache instance itself.
Example
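A minimal sketch mirroring the DataFrame variant later in this page; it assumes `AsyncBatchingMapProxy` from `openaivec._proxy`:
import pandas as pd
from openaivec._proxy import AsyncBatchingMapProxy
shared_cache = AsyncBatchingMapProxy(batch_size=64, max_concurrency=4)
animals = pd.Series(["cat", "dog", "elephant"])
# Must be awaited
results = await animals.aio.responses_with_cache(
    "translate to French",
    cache=shared_cache,
)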
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | System prompt prepended to every user message. | required |
`cache` | `AsyncBatchingMapProxy[str, ResponseFormat]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | required |
`response_format` | `type[ResponseFormat]` | Pydantic model or built-in type the assistant should return. Defaults to `str`. | `str` |
`**api_kwargs` | | Additional keyword arguments forwarded verbatim to the underlying client. | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series whose values are instances of `response_format`. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext.py
responses async ¶
responses(
instructions: str,
response_format: type[ResponseFormat] = str,
batch_size: int | None = None,
max_concurrency: int = 8,
show_progress: bool = False,
**api_kwargs,
) -> pd.Series
Call an LLM once for every Series element (asynchronously).
Example
animals = pd.Series(["cat", "dog", "elephant"])
# Must be awaited
results = await animals.aio.responses("translate to French")
# With progress bar for large datasets
large_series = pd.Series(["data"] * 1000)
results = await large_series.aio.responses(
"analyze this data",
batch_size=32,
max_concurrency=4,
show_progress=True
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | System prompt prepended to every user message. | required |
`response_format` | `type[ResponseFormat]` | Pydantic model or built-in type the assistant should return. Defaults to `str`. | `str` |
`batch_size` | `int \| None` | Number of prompts grouped into a single request. Defaults to `None` (automatic batch size optimization). | `None` |
`max_concurrency` | `int` | Maximum number of concurrent requests. Defaults to `8`. | `8` |
`show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `False`. | `False` |
`**api_kwargs` | | Additional keyword arguments forwarded verbatim to the underlying client. | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series whose values are instances of `response_format`. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext.py
embeddings_with_cache async ¶
Compute OpenAI embeddings for every Series element using a provided cache (asynchronously).
This method allows external control over caching behavior by accepting a pre-configured AsyncBatchingMapProxy instance, enabling cache sharing across multiple operations or custom batch size management. The concurrency is controlled by the cache instance itself.
Example
from openaivec._proxy import AsyncBatchingMapProxy
import numpy as np
# Create a shared cache with custom batch size and concurrency
shared_cache = AsyncBatchingMapProxy[str, np.ndarray](
batch_size=64, max_concurrency=4
)
animals = pd.Series(["cat", "dog", "elephant"])
# Must be awaited
embeddings = await animals.aio.embeddings_with_cache(cache=shared_cache)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`cache` | `AsyncBatchingMapProxy[str, ndarray]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | required |
`**api_kwargs` | | Additional OpenAI API parameters (e.g., dimensions for text-embedding-3 models). | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series whose values are `numpy.ndarray` embeddings. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext.py
embeddings async ¶
embeddings(
batch_size: int | None = None,
max_concurrency: int = 8,
show_progress: bool = False,
**api_kwargs,
) -> pd.Series
Compute OpenAI embeddings for every Series element (asynchronously).
Example
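A minimal sketch; must be awaited:
import pandas as pd
animals = pd.Series(["cat", "dog", "elephant"])
embeddings = await animals.aio.embeddings(
    batch_size=32,
    max_concurrency=4,
    show_progress=True,
)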
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`batch_size` | `int \| None` | Number of inputs grouped into a single request. Defaults to `None` (automatic batch size optimization). | `None` |
`max_concurrency` | `int` | Maximum number of concurrent requests. Defaults to `8`. | `8` |
`show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `False`. | `False` |
`**api_kwargs` | | Additional OpenAI API parameters (e.g., dimensions for text-embedding-3 models). | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series whose values are `numpy.ndarray` embeddings. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext.py
task_with_cache async ¶
task_with_cache(
task: PreparedTask[ResponseFormat],
cache: AsyncBatchingMapProxy[str, ResponseFormat],
) -> pd.Series
Execute a prepared task on every Series element using a provided cache (asynchronously).
This method allows external control over caching behavior by accepting a pre-configured AsyncBatchingMapProxy instance, enabling cache sharing across multiple operations or custom batch size management. The concurrency is controlled by the cache instance itself.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`task` | `PreparedTask` | A pre-configured task containing instructions, response format, and other parameters for processing the inputs. | required |
`cache` | `AsyncBatchingMapProxy[str, ResponseFormat]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | required |
Example
from openaivec._model import PreparedTask
from openaivec._proxy import AsyncBatchingMapProxy
# Create a shared cache with custom batch size and concurrency
shared_cache = AsyncBatchingMapProxy(batch_size=64, max_concurrency=4)
# Assume you have a prepared task for sentiment analysis
sentiment_task = PreparedTask(...)
reviews = pd.Series(["Great product!", "Not satisfied", "Amazing quality"])
# Must be awaited
results = await reviews.aio.task_with_cache(sentiment_task, cache=shared_cache)
Additional Keyword Args
Arbitrary OpenAI Responses API parameters (e.g. `frequency_penalty`, `presence_penalty`, `seed`, etc.) are forwarded verbatim to the underlying client. Core batching / routing keys (`model`, `instructions` / system message, user `input`) are managed by the library and cannot be overridden.
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series whose values are instances of the task's response format, aligned with the original Series index. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext.py
task async ¶
task(
task: PreparedTask,
batch_size: int | None = None,
max_concurrency: int = 8,
show_progress: bool = False,
) -> pd.Series
Execute a prepared task on every Series element (asynchronously).
Example
from openaivec._model import PreparedTask
# Assume you have a prepared task for sentiment analysis
sentiment_task = PreparedTask(...)
reviews = pd.Series(["Great product!", "Not satisfied", "Amazing quality"])
# Must be awaited
results = await reviews.aio.task(sentiment_task)
# With progress bar for large datasets
large_reviews = pd.Series(["review text"] * 2000)
results = await large_reviews.aio.task(
sentiment_task,
batch_size=50,
max_concurrency=4,
show_progress=True
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`task` | `PreparedTask` | A pre-configured task containing instructions, response format, and other parameters for processing the inputs. | required |
`batch_size` | `int \| None` | Number of prompts grouped into a single request to optimize API usage. Defaults to `None` (automatic batch size optimization). | `None` |
`max_concurrency` | `int` | Maximum number of concurrent requests. Defaults to 8. | `8` |
`show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `False`. | `False` |
Note
The task's stored API parameters are used. Core batching / routing keys (`model`, `instructions` / system message, user `input`) are managed by the library and cannot be overridden.
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series whose values are instances of the task's response format, aligned with the original Series index. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext.py
parse_with_cache async ¶
parse_with_cache(
instructions: str,
cache: AsyncBatchingMapProxy[str, ResponseFormat],
response_format: type[ResponseFormat] | None = None,
max_examples: int = 100,
**api_kwargs,
) -> pd.Series
Parse Series values into structured data using an LLM with a provided cache (asynchronously).
This async method provides external cache control while parsing Series content into structured data. Automatic schema inference is performed when no response format is specified.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | Plain language description of what to extract (e.g., "Extract dates, amounts, and descriptions from receipts"). Guides both extraction and schema inference. | required |
`cache` | `AsyncBatchingMapProxy[str, ResponseFormat]` | Pre-configured async cache for managing concurrent API calls and deduplication. Set `cache.batch_size=None` for automatic optimization. | required |
`response_format` | `type[ResponseFormat] \| None` | Target structure for parsed data. Can be a Pydantic model, built-in type, or None for automatic inference. Defaults to None. | `None` |
`max_examples` | `int` | Maximum values to analyze for schema inference (when response_format is None). Defaults to 100. | `100` |
`**api_kwargs` | | Additional OpenAI API parameters. | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series containing parsed structured data aligned with the original index. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext.py
parse async ¶
parse(
instructions: str,
response_format: type[ResponseFormat] | None = None,
max_examples: int = 100,
batch_size: int | None = None,
max_concurrency: int = 8,
show_progress: bool = False,
**api_kwargs,
) -> pd.Series
Parse Series values into structured data using an LLM (asynchronously).
Async version of the parse method, extracting structured information from unstructured text with automatic schema inference when needed.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | Plain language extraction goals (e.g., "Extract product names, prices, and categories from descriptions"). | required |
`response_format` | `type[ResponseFormat] \| None` | Target structure. None triggers automatic schema inference. Defaults to None. | `None` |
`max_examples` | `int` | Maximum values for schema inference. Defaults to 100. | `100` |
`batch_size` | `int \| None` | Requests per batch. None for automatic optimization. Defaults to None. | `None` |
`max_concurrency` | `int` | Maximum concurrent API requests. Defaults to 8. | `8` |
`show_progress` | `bool` | Show progress bar. Defaults to False. | `False` |
`**api_kwargs` | | Additional OpenAI API parameters. | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Parsed structured data indexed like the original Series. |
Example
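A minimal sketch with automatic schema inference; must be awaited:
import pandas as pd
reviews = pd.Series([
    "Great product! 5 stars. Fast shipping.",
    "Poor quality. 2 stars. Slow delivery.",
])
parsed = await reviews.aio.parse(
    "Extract review rating and shipping feedback",
    max_concurrency=4,
)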
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext.py
AsyncOpenAIVecDataFrameAccessor ¶
pandas DataFrame accessor (`.aio`) that adds OpenAI helpers.
Source code in src/openaivec/pandas_ext.py
Functions¶
responses_with_cache async ¶
responses_with_cache(
instructions: str,
cache: AsyncBatchingMapProxy[str, ResponseFormat],
response_format: type[ResponseFormat] = str,
**api_kwargs,
) -> pd.Series
Generate a response for each row after serializing it to JSON using a provided cache (asynchronously).
This method allows external control over caching behavior by accepting a pre-configured AsyncBatchingMapProxy instance, enabling cache sharing across multiple operations or custom batch size management. The concurrency is controlled by the cache instance itself.
Example
from openaivec._proxy import AsyncBatchingMapProxy
# Create a shared cache with custom batch size and concurrency
shared_cache = AsyncBatchingMapProxy(batch_size=64, max_concurrency=4)
df = pd.DataFrame([
{"name": "cat", "legs": 4},
{"name": "dog", "legs": 4},
{"name": "elephant", "legs": 4},
])
# Must be awaited
result = await df.aio.responses_with_cache(
"what is the animal's name?",
cache=shared_cache
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | System prompt for the assistant. | required |
`cache` | `AsyncBatchingMapProxy[str, ResponseFormat]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | required |
`response_format` | `type[ResponseFormat]` | Desired Python type of the responses. Defaults to `str`. | `str` |
`**api_kwargs` | | Additional OpenAI API parameters (temperature, top_p, etc.). | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Responses aligned with the DataFrame's original index. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext.py
responses async ¶
responses(
instructions: str,
response_format: type[ResponseFormat] = str,
batch_size: int | None = None,
max_concurrency: int = 8,
show_progress: bool = False,
**api_kwargs,
) -> pd.Series
Generate a response for each row after serializing it to JSON (asynchronously).
Example
df = pd.DataFrame([
{"name": "cat", "legs": 4},
{"name": "dog", "legs": 4},
{"name": "elephant", "legs": 4},
])
# Must be awaited
results = await df.aio.responses("what is the animal's name?")
# With progress bar for large datasets
large_df = pd.DataFrame({"id": list(range(1000))})
results = await large_df.aio.responses(
"generate a name for this ID",
batch_size=20,
max_concurrency=4,
show_progress=True
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | System prompt for the assistant. | required |
`response_format` | `type[ResponseFormat]` | Desired Python type of the responses. Defaults to `str`. | `str` |
`batch_size` | `int \| None` | Number of requests sent in one batch. Defaults to `None` (automatic batch size optimization). | `None` |
`max_concurrency` | `int` | Maximum number of concurrent requests. Defaults to `8`. | `8` |
`show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `False`. | `False` |
`**api_kwargs` | | Additional OpenAI API parameters (temperature, top_p, etc.). | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Responses aligned with the DataFrame's original index. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext.py
task_with_cache async ¶
task_with_cache(
task: PreparedTask[ResponseFormat],
cache: AsyncBatchingMapProxy[str, ResponseFormat],
) -> pd.Series
Execute a prepared task on each DataFrame row using a provided cache (asynchronously).
After serializing each row to JSON, this method executes the prepared task.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`task` | `PreparedTask` | Prepared task (instructions + response_format + sampling params). | required |
`cache` | `AsyncBatchingMapProxy[str, ResponseFormat]` | Pre-configured async cache instance. | required |
Note
The task's stored API parameters are used. Core routing keys are managed internally.
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Task results aligned with the DataFrame's original index. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext.py
task async ¶
task(
task: PreparedTask,
batch_size: int | None = None,
max_concurrency: int = 8,
show_progress: bool = False,
) -> pd.Series
Execute a prepared task on each DataFrame row after serializing it to JSON (asynchronously).
Example
from openaivec._model import PreparedTask
# Assume you have a prepared task for data analysis
analysis_task = PreparedTask(...)
df = pd.DataFrame([
{"name": "cat", "legs": 4},
{"name": "dog", "legs": 4},
{"name": "elephant", "legs": 4},
])
# Must be awaited
results = await df.aio.task(analysis_task)
# With progress bar for large datasets
large_df = pd.DataFrame({"id": list(range(1000))})
results = await large_df.aio.task(
analysis_task,
batch_size=50,
max_concurrency=4,
show_progress=True
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`task` | `PreparedTask` | A pre-configured task containing instructions, response format, and other parameters for processing the inputs. | required |
`batch_size` | `int \| None` | Number of requests sent in one batch to optimize API usage. Defaults to `None` (automatic batch size optimization). | `None` |
`max_concurrency` | `int` | Maximum number of concurrent requests. Defaults to 8. | `8` |
`show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `False`. | `False` |
Note
The task's stored API parameters are used. Core batching / routing keys (`model`, `instructions` / system message, user `input`) are managed by the library and cannot be overridden.
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Series whose values are instances of the task's response format, aligned with the DataFrame's original index. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext.py
parse_with_cache async ¶
parse_with_cache(
instructions: str,
cache: AsyncBatchingMapProxy[str, ResponseFormat],
response_format: type[ResponseFormat] | None = None,
max_examples: int = 100,
**api_kwargs,
) -> pd.Series
Parse DataFrame rows into structured data using an LLM with cache (asynchronously).
Async method for parsing DataFrame rows (as JSON) with external cache control, enabling deduplication across operations and concurrent processing.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | Plain language extraction goals (e.g., "Extract invoice details including items, quantities, and totals"). | required |
`cache` | `AsyncBatchingMapProxy[str, ResponseFormat]` | Pre-configured async cache for concurrent API call management. | required |
`response_format` | `type[ResponseFormat] \| None` | Target structure. None triggers automatic schema inference. Defaults to None. | `None` |
`max_examples` | `int` | Maximum rows for schema inference. Defaults to 100. | `100` |
`**api_kwargs` | | Additional OpenAI API parameters. | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Parsed structured data indexed like the original DataFrame. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext.py
parse async ¶
parse(
instructions: str,
response_format: type[ResponseFormat] | None = None,
max_examples: int = 100,
batch_size: int | None = None,
max_concurrency: int = 8,
show_progress: bool = False,
**api_kwargs,
) -> pd.Series
Parse DataFrame rows into structured data using an LLM (asynchronously).
Async version for extracting structured information from DataFrame rows, with automatic schema inference when no format is specified.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`instructions` | `str` | Plain language extraction goals (e.g., "Extract customer details, order items, and payment information"). | required |
`response_format` | `type[ResponseFormat] \| None` | Target structure. None triggers automatic inference. Defaults to None. | `None` |
`max_examples` | `int` | Maximum rows for schema inference. Defaults to 100. | `100` |
`batch_size` | `int \| None` | Rows per batch. None for automatic optimization. Defaults to None. | `None` |
`max_concurrency` | `int` | Maximum concurrent requests. Defaults to 8. | `8` |
`show_progress` | `bool` | Show progress bar. Defaults to False. | `False` |
`**api_kwargs` | | Additional OpenAI API parameters. | `{}` |
Returns:
Type | Description |
---|---|
`Series` | pandas.Series: Parsed structured data indexed like the original DataFrame. |
Example
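A minimal sketch adapted from the synchronous DataFrame example above; must be awaited:
import pandas as pd
df = pd.DataFrame({
    "log": [
        "2024-01-01 10:00 ERROR Database connection failed",
        "2024-01-01 10:05 INFO Service started successfully",
    ]
})
parsed = await df.aio.parse("Extract timestamp, level, and message")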
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext.py
pipe async ¶
Apply a function to the DataFrame, supporting both synchronous and asynchronous functions.
This method allows chaining operations on the DataFrame, similar to pandas' `pipe` method, but with support for asynchronous functions.
Example
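A minimal sketch passing an asynchronous function; the call must be awaited:
import asyncio
import pandas as pd
async def add_flag(df: pd.DataFrame) -> pd.DataFrame:
    await asyncio.sleep(0)  # stand-in for asynchronous work
    return df.assign(flag=True)
df = pd.DataFrame({"x": [1, 2, 3]})
result = await df.aio.pipe(add_flag)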
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`func` | `Callable[[DataFrame], Awaitable[T] \| T]` | A function that takes a DataFrame as input and returns either a result or an awaitable result. | required |
Returns:
Name | Type | Description |
---|---|---|
`T` | `T` | The result of applying the function, either directly or after awaiting it. |
Note
This is an asynchronous method and must be awaited if the function returns an awaitable.
Source code in src/openaivec/pandas_ext.py
assign async ¶
Asynchronously assign new columns to the DataFrame, evaluating sequentially.
This method extends pandas' `assign` method by supporting asynchronous functions as column values and evaluating assignments sequentially, allowing later assignments to refer to columns created earlier in the same call.
For each key-value pair in `kwargs`:
- If the value is a callable, it is invoked with the current state of the DataFrame (including columns created in previous steps of this `assign` call). If the result is awaitable, it is awaited; otherwise, it is used directly.
- If the value is not callable, it is assigned directly to the new column.
Example
async def compute_column(df):
# Simulate an asynchronous computation
await asyncio.sleep(1)
return df["existing_column"] * 2
async def use_new_column(df):
# Access the column created in the previous step
await asyncio.sleep(1)
return df["new_column"] + 5
df = pd.DataFrame({"existing_column": [1, 2, 3]})
# Must be awaited
df = await df.aio.assign(
new_column=compute_column,
another_column=use_new_column
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`**kwargs` | | Column names as keys and either static values or callables (synchronous or asynchronous) as values. | `{}` |
Returns:
Type | Description |
---|---|
`DataFrame` | pandas.DataFrame: A new DataFrame with the assigned columns. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext.py
fillna async ¶
fillna(
target_column_name: str,
max_examples: int = 500,
batch_size: int | None = None,
max_concurrency: int = 8,
show_progress: bool = False,
) -> pd.DataFrame
Fill missing values in a DataFrame column using AI-powered inference (asynchronously).
This method uses machine learning to intelligently fill missing (NaN) values in a specified column by analyzing patterns from non-missing rows in the DataFrame. It creates a prepared task that provides examples of similar rows to help the AI model predict appropriate values for the missing entries.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`target_column_name` | `str` | The name of the column containing missing values that need to be filled. | required |
`max_examples` | `int` | The maximum number of example rows to use for context when predicting missing values. Higher values may improve accuracy but increase API costs and processing time. Defaults to 500. | `500` |
`batch_size` | `int \| None` | Number of requests sent in one batch to optimize API usage. Defaults to `None` (automatic batch size optimization). | `None` |
`max_concurrency` | `int` | Maximum number of concurrent requests. Defaults to 8. | `8` |
`show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `False`. | `False` |
Returns:
Type | Description |
---|---|
`DataFrame` | pandas.DataFrame: A new DataFrame with missing values filled in the target column. The original DataFrame is not modified. |
Example
df = pd.DataFrame({
'name': ['Alice', 'Bob', None, 'David'],
'age': [25, 30, 35, None],
'city': ['Tokyo', 'Osaka', 'Kyoto', 'Tokyo']
})
# Fill missing values in the 'name' column (must be awaited)
filled_df = await df.aio.fillna('name')
# With progress bar for large datasets
large_df = pd.DataFrame({'name': [None] * 1000, 'age': list(range(1000))})
filled_df = await large_df.aio.fillna(
'name',
batch_size=32,
max_concurrency=4,
show_progress=True
)
Note
This is an asynchronous method and must be awaited. If the target column has no missing values, the original DataFrame is returned unchanged.
Source code in src/openaivec/pandas_ext.py
Functions¶
use ¶
Register a custom OpenAI‑compatible client.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`client` | `OpenAI` | A pre-configured OpenAI-compatible client instance (e.g. `OpenAI` or `AzureOpenAI`). | required |
Source code in src/openaivec/pandas_ext.py
use_async ¶
Register a custom asynchronous OpenAI‑compatible client.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`client` | `AsyncOpenAI` | A pre-configured asynchronous OpenAI-compatible client instance (e.g. `AsyncOpenAI` or `AsyncAzureOpenAI`). | required |
Source code in src/openaivec/pandas_ext.py
responses_model ¶
Override the model used for text responses.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`name` | `str` | For Azure OpenAI, use your deployment name. For OpenAI, use the model name (for example, `gpt-4.1-mini`). | required |
Source code in src/openaivec/pandas_ext.py
embeddings_model ¶
Override the model used for text embeddings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`name` | `str` | For Azure OpenAI, use your deployment name. For OpenAI, use the model name, e.g. `text-embedding-3-small`. | required |