Pandas Extension¶
Configuration¶
openaivec.pandas_ext._config ¶
Deprecated configuration helpers — use openaivec.set_* / openaivec.get_* instead.
These wrappers emit a DeprecationWarning and delegate to
openaivec._provider.
Functions¶
set_client ¶
Deprecated: use openaivec.set_client().
get_client ¶
Deprecated: use openaivec.get_client().
set_async_client ¶
Deprecated: use openaivec.set_async_client().
get_async_client ¶
Deprecated: use openaivec.get_async_client().
set_responses_model ¶
Deprecated: use openaivec.set_responses_model().
get_responses_model ¶
Deprecated: use openaivec.get_responses_model().
set_embeddings_model ¶
Deprecated: use openaivec.set_embeddings_model().
get_embeddings_model ¶
Deprecated: use openaivec.get_embeddings_model().
Series Accessor (.ai)¶
openaivec.pandas_ext._series_sync.OpenAIVecSeriesAccessor ¶
pandas Series accessor (.ai) that adds OpenAI helpers.
Source code in src/openaivec/pandas_ext/_series_sync.py
Functions¶
responses_with_cache ¶
responses_with_cache(
instructions: str,
cache: BatchCache[str, ResponseFormat],
response_format: type[ResponseFormat] = str,
**api_kwargs,
) -> pd.Series
Call an LLM once for every Series element using a provided cache.
This is a lower-level method that allows explicit cache management for advanced
use cases. Most users should use the standard responses method instead.
Example
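A minimal sketch mirroring the DataFrame `responses_with_cache` example later on this page (assumes a configured OpenAI client; the `BatchCache` import path is the one used elsewhere in these docs):

```python
import pandas as pd
from openaivec._cache import BatchCache

# Create a shared cache with a custom batch size
shared_cache = BatchCache(batch_size=64)

animals = pd.Series(["cat", "dog", "elephant"])
# The same cache can be reused across calls, deduplicating repeated inputs
translations = animals.ai.responses_with_cache(
    "translate to French",
    cache=shared_cache,
)
```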
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | System prompt prepended to every user message. | *required* |
| `cache` | `BatchCache[str, ResponseFormat]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | *required* |
| `response_format` | `type[ResponseFormat]` | Pydantic model or built-in type the assistant should return. Defaults to `str`. | `str` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series whose values are instances of `response_format`. |
Source code in src/openaivec/pandas_ext/_series_sync.py
responses ¶
responses(
instructions: str,
response_format: type[ResponseFormat] = str,
batch_size: int | None = None,
show_progress: bool = True,
**api_kwargs,
) -> pd.Series
Call an LLM once for every Series element.
Example
animals = pd.Series(["cat", "dog", "elephant"])
# Basic usage
animals.ai.responses("translate to French")
# With progress bar in Jupyter notebooks
large_series = pd.Series(["data"] * 1000)
large_series.ai.responses(
"analyze this data",
batch_size=32,
show_progress=True
)
# With custom temperature
animals.ai.responses(
"translate creatively",
temperature=0.8
)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | System prompt prepended to every user message. | *required* |
| `response_format` | `type[ResponseFormat]` | Pydantic model or built-in type the assistant should return. Defaults to `str`. | `str` |
| `batch_size` | `int \| None` | Number of prompts grouped into a single request. Defaults to `None`. | `None` |
| `show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `True`. | `True` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series whose values are instances of `response_format`. |
Source code in src/openaivec/pandas_ext/_series_sync.py
embeddings_with_cache ¶
Compute OpenAI embeddings for every Series element using a provided cache.
This method allows external control over caching behavior by accepting a pre-configured BatchCache instance, enabling cache sharing across multiple operations or custom batch size management.
Example
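A sketch adapted from the async `embeddings_with_cache` example later on this page (assumes a configured OpenAI client):

```python
import pandas as pd
from openaivec._cache import BatchCache

# Shared cache with a custom batch size
shared_cache = BatchCache(batch_size=64)

animals = pd.Series(["cat", "dog", "elephant"])
embeddings = animals.ai.embeddings_with_cache(cache=shared_cache)
# Each value is a numpy array; later calls can reuse the same cache
```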
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cache` | `BatchCache[str, ndarray]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | *required* |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series whose values are `numpy.ndarray` embeddings. |
Source code in src/openaivec/pandas_ext/_series_sync.py
embeddings ¶
Compute OpenAI embeddings for every Series element.
Example
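A usage sketch in the style of the other accessor examples on this page (assumes a configured OpenAI client; parameters as documented below):

```python
import pandas as pd

animals = pd.Series(["cat", "dog", "elephant"])
# Basic usage
vectors = animals.ai.embeddings()

# With explicit batching and a progress bar for large Series
large_series = pd.Series(["text"] * 1000)
vectors = large_series.ai.embeddings(batch_size=32, show_progress=True)
```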
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch_size` | `int \| None` | Number of inputs grouped into a single request. Defaults to `None`. | `None` |
| `show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `True`. | `True` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series whose values are `numpy.ndarray` embeddings. |
Source code in src/openaivec/pandas_ext/_series_sync.py
task_with_cache ¶
task_with_cache(
task: PreparedTask[ResponseFormat],
cache: BatchCache[str, ResponseFormat],
**api_kwargs,
) -> pd.Series
Execute a prepared task on every Series element using a provided cache.
This mirrors responses_with_cache but uses the task's stored instructions
and response format. A supplied BatchCache enables cross‑operation
deduplicated reuse and external batch size / progress control.
Example
from openaivec._model import PreparedTask
from openaivec._cache import BatchCache
shared_cache = BatchCache(batch_size=64)
sentiment_task = PreparedTask(...)
reviews = pd.Series(["Great product!", "Not satisfied", "Amazing quality"])
results = reviews.ai.task_with_cache(sentiment_task, cache=shared_cache)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `task` | `PreparedTask` | A pre-configured task containing the instructions and response format for processing the inputs. | *required* |
| `cache` | `BatchCache[str, ResponseFormat]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | *required* |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series whose values are instances of the task's response format, aligned with the original Series index. |
Note
Core routing keys (model, system instructions, user input) are managed
internally and cannot be overridden.
Source code in src/openaivec/pandas_ext/_series_sync.py
task ¶
task(
task: PreparedTask,
batch_size: int | None = None,
show_progress: bool = True,
**api_kwargs,
) -> pd.Series
Execute a prepared task on every Series element.
Example
from openaivec._model import PreparedTask
# Assume you have a prepared task for sentiment analysis
sentiment_task = PreparedTask(...)
reviews = pd.Series(["Great product!", "Not satisfied", "Amazing quality"])
# Basic usage
results = reviews.ai.task(sentiment_task)
# With progress bar for large datasets
large_reviews = pd.Series(["review text"] * 2000)
results = large_reviews.ai.task(
sentiment_task,
batch_size=50,
show_progress=True
)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `task` | `PreparedTask` | A pre-configured task containing the instructions and response format for processing the inputs. | *required* |
| `batch_size` | `int \| None` | Number of prompts grouped into a single request to optimize API usage. Defaults to `None`. | `None` |
| `show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `True`. | `True` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series whose values are instances of the task's response format, aligned with the original Series index. |
Note
Core batching / routing keys (model, instructions / system message,
user input) are managed by the library and cannot be overridden.
Source code in src/openaivec/pandas_ext/_series_sync.py
parse_with_cache ¶
parse_with_cache(
instructions: str,
cache: BatchCache[str, ResponseFormat],
response_format: type[ResponseFormat] | None = None,
max_examples: int = 100,
**api_kwargs,
) -> pd.Series
Parse Series values into structured data using an LLM with a provided cache.
This method allows external control over caching behavior while parsing Series content into structured data. If no response format is provided, the method automatically infers an appropriate schema by analyzing the data patterns.
Example
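A sketch adapted from the `parse` example further down this page, with an explicit cache (assumes a configured OpenAI client):

```python
import pandas as pd
from pydantic import BaseModel
from openaivec._cache import BatchCache

class Product(BaseModel):
    name: str
    price: float

# A shared cache enables deduplication across multiple parse calls
shared_cache = BatchCache(batch_size=64)

descriptions = pd.Series([
    "iPhone 15 Pro - $999, available now",
    "Samsung Galaxy S24 - $899, out of stock",
])
products = descriptions.ai.parse_with_cache(
    "Extract product information",
    cache=shared_cache,
    response_format=Product,
)
```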
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | Plain language description of what information to extract (e.g., "Extract customer information including name and contact details"). This guides both the extraction process and schema inference. | *required* |
| `cache` | `BatchCache[str, ResponseFormat]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | *required* |
| `response_format` | `type[ResponseFormat] \| None` | Target structure for the parsed data. Can be a Pydantic model class, built-in type (str, int, float, bool, list, dict), or None. If None, the method infers an appropriate schema based on the instructions and data. Defaults to None. | `None` |
| `max_examples` | `int` | Maximum number of Series values to analyze when inferring the schema. Only used when response_format is None. Defaults to 100. | `100` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series containing parsed structured data. Each value is an instance of the specified response_format or the inferred schema model, aligned with the original Series index. |
Source code in src/openaivec/pandas_ext/_series_sync.py
parse ¶
parse(
instructions: str,
response_format: type[ResponseFormat] | None = None,
max_examples: int = 100,
batch_size: int | None = None,
show_progress: bool = True,
**api_kwargs,
) -> pd.Series
Parse Series values into structured data using an LLM.
This method extracts structured information from unstructured text in the Series. When no response format is provided, it automatically infers an appropriate schema by analyzing patterns in the data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | Plain language description of what information to extract (e.g., "Extract product details including price, category, and availability"). This guides both the extraction process and schema inference. | *required* |
| `response_format` | `type[ResponseFormat] \| None` | Target structure for the parsed data. Can be a Pydantic model class, built-in type (str, int, float, bool, list, dict), or None. If None, automatically infers a schema. Defaults to None. | `None` |
| `max_examples` | `int` | Maximum number of Series values to analyze when inferring schema. Only used when response_format is None. Defaults to 100. | `100` |
| `batch_size` | `int \| None` | Number of requests to process per batch. None enables automatic optimization. Defaults to None. | `None` |
| `show_progress` | `bool` | Display progress bar in Jupyter notebooks. Defaults to True. | `True` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series containing parsed structured data as instances of response_format or the inferred schema model. |
Example
# With explicit schema
from pydantic import BaseModel
class Product(BaseModel):
name: str
price: float
in_stock: bool
descriptions = pd.Series([
"iPhone 15 Pro - $999, available now",
"Samsung Galaxy S24 - $899, out of stock"
])
products = descriptions.ai.parse(
"Extract product information",
response_format=Product
)
# With automatic schema inference
reviews = pd.Series([
"Great product! 5 stars. Fast shipping.",
"Poor quality. 2 stars. Slow delivery."
])
parsed = reviews.ai.parse(
"Extract review rating and shipping feedback"
)
Source code in src/openaivec/pandas_ext/_series_sync.py
infer_schema ¶
Infer a structured data schema from Series content using AI.
This method analyzes a sample of Series values to automatically generate a Pydantic model that captures the relevant information structure. The inferred schema supports both flat and hierarchical (nested) structures, making it suitable for complex data extraction tasks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | Plain language description of the extraction goal (e.g., "Extract customer information for CRM system", "Parse event details for calendar integration"). This guides which fields to include and their purpose. | *required* |
| `max_examples` | `int` | Maximum number of Series values to analyze for pattern detection. The method samples randomly up to this limit. Higher values may improve schema quality but increase inference time. Defaults to 100. | `100` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Name | Type | Description |
|---|---|---|
| `InferredSchema` | `SchemaInferenceOutput` | A comprehensive schema object containing: instructions (refined extraction objective statement); fields (hierarchical field specifications with names, types, descriptions, and nested structures where applicable); inference_prompt (optimized prompt for consistent extraction); model (dynamically generated Pydantic model class supporting both flat and nested structures); and task (PreparedTask configured for batch extraction using the inferred schema). |
Example
# Simple flat structure
reviews = pd.Series([
"5 stars! Great product, fast shipping to NYC.",
"2 stars. Product broke, slow delivery to LA."
])
schema = reviews.ai.infer_schema(
"Extract review ratings and shipping information"
)
# Hierarchical structure
orders = pd.Series([
"Order #123: John Doe, 123 Main St, NYC. Items: iPhone ($999), Case ($29)",
"Order #456: Jane Smith, 456 Oak Ave, LA. Items: iPad ($799)"
])
schema = orders.ai.infer_schema(
"Extract order details including customer and items"
)
# Inferred schema may include nested structures like:
# - customer: {name: str, address: str, city: str}
# - items: [{product: str, price: float}]
# Apply the schema for extraction
extracted = orders.ai.task(schema.task)
Note
The inference process uses multiple AI iterations to ensure schema validity. Nested structures are automatically detected when the data contains hierarchical relationships. The generated Pydantic model ensures type safety and validation for all extracted data.
Source code in src/openaivec/pandas_ext/_series_sync.py
count_tokens ¶
Count tiktoken tokens per element.
Uses encode_batch for vectorized tokenization when available,
falling back to per-element map otherwise.
Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Token counts for each element. |
Source code in src/openaivec/pandas_ext/_series_sync.py
extract ¶
Expand a Series of Pydantic models or dicts into DataFrame columns.
Each element should be a Pydantic model or dict. If the Series has a name, extracted columns are prefixed with it.
Example
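The expansion is roughly equivalent to this plain-pandas sketch for dict elements (illustrative only, not the accessor's implementation; the accessor additionally prefixes the resulting columns with the Series name):

```python
import pandas as pd

s = pd.Series(
    [{"name": "iPhone", "price": 999.0}, {"name": "Case", "price": 29.0}],
    name="product",
)
# One column per dict key, original index preserved
expanded = pd.DataFrame(list(s), index=s.index)
print(expanded.columns.tolist())  # ['name', 'price']
```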
Returns:

| Type | Description |
|---|---|
| `DataFrame` | pandas.DataFrame: Expanded representation. |
Source code in src/openaivec/pandas_ext/_series_sync.py
DataFrame Accessor (.ai)¶
openaivec.pandas_ext._dataframe_sync.OpenAIVecDataFrameAccessor ¶
pandas DataFrame accessor (.ai) that adds OpenAI helpers.
Source code in src/openaivec/pandas_ext/_dataframe_sync.py
Functions¶
responses_with_cache ¶
responses_with_cache(
instructions: str,
cache: BatchCache[str, ResponseFormat],
response_format: type[ResponseFormat] = str,
**api_kwargs,
) -> pd.Series
Call an LLM once for every DataFrame row using a provided cache.
This method allows external control over caching behavior by accepting a pre-configured BatchCache instance, enabling cache sharing across multiple operations or custom batch size management.
Example
from openaivec._cache import BatchCache
# Create a shared cache with custom batch size
shared_cache = BatchCache(batch_size=64)
df = pd.DataFrame([
{"name": "cat", "legs": 4},
{"name": "dog", "legs": 4},
{"name": "elephant", "legs": 4},
])
result = df.ai.responses_with_cache(
"what is the animal's name?",
cache=shared_cache
)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | System prompt prepended to every user message. | *required* |
| `cache` | `BatchCache[str, ResponseFormat]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | *required* |
| `response_format` | `type[ResponseFormat]` | Pydantic model or built-in type the assistant should return. Defaults to `str`. | `str` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Responses aligned with the DataFrame's original index. |
Source code in src/openaivec/pandas_ext/_dataframe_sync.py
responses ¶
responses(
instructions: str,
response_format: type[ResponseFormat] = str,
batch_size: int | None = None,
show_progress: bool = True,
**api_kwargs,
) -> pd.Series
Call an LLM once for every DataFrame row.
Example
df = pd.DataFrame([
{"name": "cat", "legs": 4},
{"name": "dog", "legs": 4},
{"name": "elephant", "legs": 4},
])
# Basic usage
df.ai.responses("what is the animal's name?")
# With progress bar for large datasets
large_df = pd.DataFrame({"id": list(range(1000))})
large_df.ai.responses(
"generate a name for this ID",
batch_size=20,
show_progress=True
)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | System prompt prepended to every user message. | *required* |
| `response_format` | `type[ResponseFormat]` | Pydantic model or built-in type the assistant should return. Defaults to `str`. | `str` |
| `batch_size` | `int \| None` | Number of prompts grouped into a single request. Defaults to `None`. | `None` |
| `show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `True`. | `True` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Responses aligned with the DataFrame's original index. |
Source code in src/openaivec/pandas_ext/_dataframe_sync.py
task_with_cache ¶
task_with_cache(
task: PreparedTask[ResponseFormat],
cache: BatchCache[str, ResponseFormat],
**api_kwargs,
) -> pd.Series
Execute a prepared task on every DataFrame row using a provided cache.
Example
from openaivec._model import PreparedTask
from openaivec._cache import BatchCache
shared_cache = BatchCache(batch_size=64)
analysis_task = PreparedTask(...)
df = pd.DataFrame([
{"name": "cat", "legs": 4},
{"name": "dog", "legs": 4},
{"name": "elephant", "legs": 4},
])
results = df.ai.task_with_cache(analysis_task, cache=shared_cache)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `task` | `PreparedTask` | A pre-configured task containing the instructions and response format for processing the inputs. | *required* |
| `cache` | `BatchCache[str, ResponseFormat]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | *required* |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series whose values are instances of the task's response format, aligned with the DataFrame's original index. |
Note
Core routing keys are managed internally.
Source code in src/openaivec/pandas_ext/_dataframe_sync.py
task ¶
task(
task: PreparedTask,
batch_size: int | None = None,
show_progress: bool = True,
**api_kwargs,
) -> pd.Series
Execute a prepared task on every DataFrame row.
Example
from openaivec._model import PreparedTask
# Assume you have a prepared task for data analysis
analysis_task = PreparedTask(...)
df = pd.DataFrame([
{"name": "cat", "legs": 4},
{"name": "dog", "legs": 4},
{"name": "elephant", "legs": 4},
])
# Basic usage
results = df.ai.task(analysis_task)
# With progress bar for large datasets
large_df = pd.DataFrame({"id": list(range(1000))})
results = large_df.ai.task(
analysis_task,
batch_size=50,
show_progress=True
)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `task` | `PreparedTask` | A pre-configured task containing the instructions and response format for processing the inputs. | *required* |
| `batch_size` | `int \| None` | Number of requests sent in one batch to optimize API usage. Defaults to `None`. | `None` |
| `show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `True`. | `True` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series whose values are instances of the task's response format, aligned with the DataFrame's original index. |
Note
Core batching / routing keys (model, instructions / system message, user input)
are managed by the library and cannot be overridden.
Source code in src/openaivec/pandas_ext/_dataframe_sync.py
parse_with_cache ¶
parse_with_cache(
instructions: str,
cache: BatchCache[str, ResponseFormat],
response_format: type[ResponseFormat] | None = None,
max_examples: int = 100,
**api_kwargs,
) -> pd.Series
Parse DataFrame rows into structured data using an LLM with a provided cache.
This method processes each DataFrame row (converted to JSON) and extracts structured information using an LLM. External cache control enables deduplication across operations and custom batch management.
Example
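A sketch adapted from the DataFrame `parse` example below, with an explicit cache (assumes a configured OpenAI client):

```python
import pandas as pd
from openaivec._cache import BatchCache

shared_cache = BatchCache(batch_size=64)

df = pd.DataFrame({
    "log": [
        "2024-01-01 10:00 ERROR Database connection failed",
        "2024-01-01 10:05 INFO Service started successfully",
    ]
})
# Schema is inferred automatically when response_format is omitted
parsed = df.ai.parse_with_cache(
    "Extract timestamp, level, and message",
    cache=shared_cache,
)
```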
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | Plain language description of what information to extract from each row (e.g., "Extract shipping details and order status"). Guides both extraction and schema inference. | *required* |
| `cache` | `BatchCache[str, ResponseFormat]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | *required* |
| `response_format` | `type[ResponseFormat] \| None` | Target structure for parsed data. Can be a Pydantic model, built-in type, or None for automatic schema inference. Defaults to None. | `None` |
| `max_examples` | `int` | Maximum rows to analyze when inferring schema (only used when response_format is None). Defaults to 100. | `100` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series containing parsed structured data as instances of response_format or the inferred schema model, indexed like the original DataFrame. |
Source code in src/openaivec/pandas_ext/_dataframe_sync.py
parse ¶
parse(
instructions: str,
response_format: type[ResponseFormat] | None = None,
max_examples: int = 100,
batch_size: int | None = None,
show_progress: bool = True,
**api_kwargs,
) -> pd.Series
Parse DataFrame rows into structured data using an LLM.
Each row is converted to JSON and processed to extract structured information. When no response format is provided, the method automatically infers an appropriate schema from the data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | Plain language description of extraction goals (e.g., "Extract transaction details including amount, date, and merchant"). Guides extraction and schema inference. | *required* |
| `response_format` | `type[ResponseFormat] \| None` | Target structure for parsed data. Can be a Pydantic model, built-in type, or None for automatic inference. Defaults to None. | `None` |
| `max_examples` | `int` | Maximum rows to analyze for schema inference (when response_format is None). Defaults to 100. | `100` |
| `batch_size` | `int \| None` | Rows per API batch. None enables automatic optimization. Defaults to None. | `None` |
| `show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to True. | `True` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Parsed structured data indexed like the original DataFrame. |
Example
df = pd.DataFrame({
'log': [
'2024-01-01 10:00 ERROR Database connection failed',
'2024-01-01 10:05 INFO Service started successfully'
]
})
# With automatic schema inference
parsed = df.ai.parse("Extract timestamp, level, and message")
# Returns Series with inferred structure like:
# {timestamp: str, level: str, message: str}
Source code in src/openaivec/pandas_ext/_dataframe_sync.py
infer_schema ¶
Infer a structured data schema from DataFrame rows using AI.
This method analyzes a sample of DataFrame rows to automatically infer a structured schema that can be used for consistent data extraction. Each row is converted to JSON format and analyzed to identify patterns, field types, and potential categorical values.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | Plain language description of how the extracted structured data will be used (e.g., "Extract operational metrics for dashboard", "Parse customer attributes for segmentation"). This guides field relevance and helps exclude irrelevant information. | *required* |
| `max_examples` | `int` | Maximum number of rows to analyze from the DataFrame. The method will sample randomly up to this limit. Defaults to 100. | `100` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Name | Type | Description |
|---|---|---|
| `InferredSchema` | `SchemaInferenceOutput` | An object containing: instructions (normalized statement of the extraction objective); fields (list of field specifications with names, types, and descriptions); inference_prompt (reusable prompt for future extractions); model (dynamically generated Pydantic model for parsing); and task (PreparedTask for batch extraction operations). |
Example
df = pd.DataFrame({
'text': [
"Order #123: Shipped to NYC, arriving Tuesday",
"Order #456: Delayed due to weather, new ETA Friday",
"Order #789: Delivered to customer in LA"
],
'timestamp': ['2024-01-01', '2024-01-02', '2024-01-03']
})
# Infer schema for logistics tracking
schema = df.ai.infer_schema(
instructions="Extract shipping status and location data for logistics tracking"
)
# Apply the schema to extract structured data
extracted_df = df.ai.task(schema.task)
Note
Each row is converted to JSON before analysis. The inference process automatically detects hierarchical relationships and creates appropriate nested structures when present. The generated Pydantic model ensures type safety and validation.
Source code in src/openaivec/pandas_ext/_dataframe_sync.py
extract ¶
Flatten one column of Pydantic models or dicts into top-level columns.
The target column should contain Pydantic models or dicts. The source column is dropped from the result.
Example
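The flattening is roughly equivalent to this plain-pandas sketch for dict cells (illustrative only, not the accessor's implementation; the data and column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "order": [{"item": "iPhone", "qty": 1}, {"item": "iPad", "qty": 2}],
})
# Roughly what df.ai.extract("order") produces: the dict keys become
# top-level columns and the source column is dropped
expanded = df.drop(columns=["order"]).join(
    pd.DataFrame(list(df["order"]), index=df.index)
)
print(expanded.columns.tolist())  # ['id', 'item', 'qty']
```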
This method returns a DataFrame with the same index as the original, where each column corresponds to a key in the dictionaries. The source column is dropped.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `column` | `str` | Column to expand. | *required* |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | pandas.DataFrame: Original DataFrame with the extracted columns; the source column is dropped. |
Source code in src/openaivec/pandas_ext/_dataframe_sync.py
fillna ¶
fillna(
target_column_name: str,
max_examples: int = 500,
batch_size: int | None = None,
show_progress: bool = True,
) -> pd.DataFrame
Fill missing values in a DataFrame column using AI-powered inference.
This method uses machine learning to intelligently fill missing (NaN) values in a specified column by analyzing patterns from non-missing rows in the DataFrame. It creates a prepared task that provides examples of similar rows to help the AI model predict appropriate values for the missing entries.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `target_column_name` | `str` | The name of the column containing missing values that need to be filled. | *required* |
| `max_examples` | `int` | The maximum number of example rows to use for context when predicting missing values. Higher values may improve accuracy but increase API costs and processing time. Defaults to 500. | `500` |
| `batch_size` | `int \| None` | Number of requests sent in one batch to optimize API usage. Defaults to `None`. | `None` |
| `show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `True`. | `True` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | pandas.DataFrame: A new DataFrame with missing values filled in the target column. The original DataFrame is not modified. |
Example
df = pd.DataFrame({
'name': ['Alice', 'Bob', None, 'David'],
'age': [25, 30, 35, None],
'city': ['Tokyo', 'Osaka', 'Kyoto', 'Tokyo']
})
# Fill missing values in the 'name' column
filled_df = df.ai.fillna('name')
# With progress bar for large datasets
large_df = pd.DataFrame({'name': [None] * 1000, 'age': list(range(1000))})
filled_df = large_df.ai.fillna('name', batch_size=32, show_progress=True)
Note
If the target column has no missing values, the original DataFrame is returned unchanged.
Source code in src/openaivec/pandas_ext/_dataframe_sync.py
similarity ¶
Compute cosine similarity between two columns containing embedding vectors.
This method calculates the cosine similarity between vectors stored in two columns of the DataFrame. The vectors should be numpy arrays or array-like objects that support dot product operations.
Example
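The per-row computation is standard cosine similarity; a self-contained numpy sketch of the equivalent math (illustrative, not the accessor itself; the embedding values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "emb1": [np.array([1.0, 0.0]), np.array([1.0, 1.0])],
    "emb2": [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
})
# cosine(a, b) = a . b / (||a|| * ||b||), computed row by row
scores = df.apply(
    lambda row: float(
        np.dot(row["emb1"], row["emb2"])
        / (np.linalg.norm(row["emb1"]) * np.linalg.norm(row["emb2"]))
    ),
    axis=1,
)
print(scores.tolist())  # approximately [1.0, 0.7071]
```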
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `col1` | `str` | Name of the first column containing embedding vectors. | *required* |
| `col2` | `str` | Name of the second column containing embedding vectors. | *required* |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series containing cosine similarity scores between corresponding vectors in col1 and col2, with values ranging from -1 to 1, where 1 indicates identical direction. |
Source code in src/openaivec/pandas_ext/_dataframe_sync.py
Async Series Accessor (.aio)¶
openaivec.pandas_ext._series_async.AsyncOpenAIVecSeriesAccessor ¶
pandas Series accessor (.aio) that adds OpenAI helpers.
Source code in src/openaivec/pandas_ext/_series_async.py
Functions¶
responses_with_cache async ¶
responses_with_cache(
instructions: str,
cache: AsyncBatchCache[str, ResponseFormat],
response_format: type[ResponseFormat] = str,
**api_kwargs,
) -> pd.Series
Call an LLM once for every Series element using a provided cache (asynchronously).
This method allows external control over caching behavior by accepting a pre-configured AsyncBatchCache instance, enabling cache sharing across multiple operations or custom batch size management. The concurrency is controlled by the cache instance itself.
Example
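A sketch adapted from the async `embeddings_with_cache` example below (assumes a configured async OpenAI client; `AsyncBatchCache` import path as used elsewhere on this page):

```python
import asyncio

import pandas as pd
from openaivec._cache import AsyncBatchCache

async def main() -> pd.Series:
    # Concurrency and batching are controlled by the cache itself
    shared_cache = AsyncBatchCache(batch_size=64, max_concurrency=4)
    animals = pd.Series(["cat", "dog", "elephant"])
    return await animals.aio.responses_with_cache(
        "translate to French",
        cache=shared_cache,
    )

results = asyncio.run(main())
```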
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | System prompt prepended to every user message. | *required* |
| `cache` | `AsyncBatchCache[str, ResponseFormat]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | *required* |
| `response_format` | `type[ResponseFormat]` | Pydantic model or built-in type the assistant should return. Defaults to `str`. | `str` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series whose values are instances of `response_format`. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext/_series_async.py
responses async ¶
responses(
instructions: str,
response_format: type[ResponseFormat] = str,
batch_size: int | None = None,
max_concurrency: int = 8,
show_progress: bool = True,
**api_kwargs,
) -> pd.Series
Call an LLM once for every Series element (asynchronously).
Example
animals = pd.Series(["cat", "dog", "elephant"])
# Must be awaited
results = await animals.aio.responses("translate to French")
# With progress bar for large datasets
large_series = pd.Series(["data"] * 1000)
results = await large_series.aio.responses(
"analyze this data",
batch_size=32,
max_concurrency=4,
show_progress=True
)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | System prompt prepended to every user message. | *required* |
| `response_format` | `type[ResponseFormat]` | Pydantic model or built-in type the assistant should return. Defaults to `str`. | `str` |
| `batch_size` | `int \| None` | Number of prompts grouped into a single request. Defaults to `None`. | `None` |
| `max_concurrency` | `int` | Maximum number of concurrent requests. Defaults to `8`. | `8` |
| `show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `True`. | `True` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series whose values are instances of `response_format`. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext/_series_async.py
embeddings_with_cache async ¶
embeddings_with_cache(
cache: AsyncBatchCache[str, np.ndarray],
**api_kwargs,
) -> pd.Series
Compute OpenAI embeddings for every Series element using a provided cache (asynchronously).
This method allows external control over caching behavior by accepting a pre-configured AsyncBatchCache instance, enabling cache sharing across multiple operations or custom batch size management. The concurrency is controlled by the cache instance itself.
Example
from openaivec._cache import AsyncBatchCache
import numpy as np
# Create a shared cache with custom batch size and concurrency
shared_cache = AsyncBatchCache[str, np.ndarray](
batch_size=64, max_concurrency=4
)
animals = pd.Series(["cat", "dog", "elephant"])
# Must be awaited
embeddings = await animals.aio.embeddings_with_cache(cache=shared_cache)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cache` | `AsyncBatchCache[str, ndarray]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | *required* |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series whose values are `numpy.ndarray` embedding vectors, aligned with the original Series index. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext/_series_async.py
embeddings async ¶
embeddings(
batch_size: int | None = None,
max_concurrency: int = 8,
show_progress: bool = True,
**api_kwargs,
) -> pd.Series
Compute OpenAI embeddings for every Series element (asynchronously).
Example
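A usage sketch (hypothetical data; assumes a configured OpenAI client):

```python
animals = pd.Series(["cat", "dog", "elephant"])
# Must be awaited
embeddings = await animals.aio.embeddings()

# With a progress bar for large datasets
large_series = pd.Series(["text"] * 1000)
embeddings = await large_series.aio.embeddings(
    batch_size=64,
    max_concurrency=4,
    show_progress=True,
)
```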
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch_size` | `int \| None` | Number of inputs grouped into a single request. Defaults to `None` (automatic batch size optimization). | `None` |
| `max_concurrency` | `int` | Maximum number of concurrent requests. Defaults to `8`. | `8` |
| `show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `True`. | `True` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series whose values are `numpy.ndarray` embedding vectors, aligned with the original Series index. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext/_series_async.py
task_with_cache async ¶
task_with_cache(
task: PreparedTask[ResponseFormat],
cache: AsyncBatchCache[str, ResponseFormat],
**api_kwargs,
) -> pd.Series
Execute a prepared task on every Series element using a provided cache (asynchronously).
This method allows external control over caching behavior by accepting a pre-configured AsyncBatchCache instance, enabling cache sharing across multiple operations or custom batch size management. The concurrency is controlled by the cache instance itself.
Example
from openaivec._model import PreparedTask
from openaivec._cache import AsyncBatchCache
shared_cache = AsyncBatchCache(batch_size=64, max_concurrency=4)
sentiment_task = PreparedTask(...)
reviews = pd.Series(["Great product!", "Not satisfied", "Amazing quality"])
results = await reviews.aio.task_with_cache(sentiment_task, cache=shared_cache)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `task` | `PreparedTask` | A pre-configured task containing instructions and a response format for processing the inputs. | *required* |
| `cache` | `AsyncBatchCache[str, ResponseFormat]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | *required* |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series whose values are instances of the task's response format, aligned with the original Series index. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext/_series_async.py
task async ¶
task(
task: PreparedTask,
batch_size: int | None = None,
max_concurrency: int = 8,
show_progress: bool = True,
**api_kwargs,
) -> pd.Series
Execute a prepared task on every Series element (asynchronously).
Example
from openaivec._model import PreparedTask
# Assume you have a prepared task for sentiment analysis
sentiment_task = PreparedTask(...)
reviews = pd.Series(["Great product!", "Not satisfied", "Amazing quality"])
# Must be awaited
results = await reviews.aio.task(sentiment_task)
# With progress bar for large datasets
large_reviews = pd.Series(["review text"] * 2000)
results = await large_reviews.aio.task(
sentiment_task,
batch_size=50,
max_concurrency=4,
show_progress=True
)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `task` | `PreparedTask` | A pre-configured task containing instructions and a response format for processing the inputs. | *required* |
| `batch_size` | `int \| None` | Number of prompts grouped into a single request to optimize API usage. Defaults to `None` (automatic optimization). | `None` |
| `max_concurrency` | `int` | Maximum number of concurrent requests. Defaults to `8`. | `8` |
| `show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `True`. | `True` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series whose values are instances of the task's response format, aligned with the original Series index. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext/_series_async.py
parse_with_cache async ¶
parse_with_cache(
instructions: str,
cache: AsyncBatchCache[str, ResponseFormat],
response_format: type[ResponseFormat] | None = None,
max_examples: int = 100,
**api_kwargs,
) -> pd.Series
Parse Series values into structured data using an LLM with a provided cache (asynchronously).
This async method provides external cache control while parsing Series content into structured data. Automatic schema inference is performed when no response format is specified.
Example
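A usage sketch (the receipt strings and instruction are hypothetical; assumes a configured OpenAI client):

```python
from openaivec._cache import AsyncBatchCache

shared_cache = AsyncBatchCache(batch_size=32, max_concurrency=4)
receipts = pd.Series([
    "2024-01-15 Coffee $4.50",
    "2024-01-16 Lunch $12.00",
])
# Must be awaited; the schema is inferred when response_format is None
parsed = await receipts.aio.parse_with_cache(
    "Extract dates, amounts, and descriptions from receipts",
    cache=shared_cache,
)
```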
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | Plain language description of what to extract (e.g., "Extract dates, amounts, and descriptions from receipts"). Guides both extraction and schema inference. | *required* |
| `cache` | `AsyncBatchCache[str, ResponseFormat]` | Pre-configured async cache for managing concurrent API calls and deduplication. Set `cache.batch_size=None` for automatic optimization. | *required* |
| `response_format` | `type[ResponseFormat] \| None` | Target structure for parsed data. Can be a Pydantic model, built-in type, or `None` for automatic inference. Defaults to `None`. | `None` |
| `max_examples` | `int` | Maximum values to analyze for schema inference (when `response_format` is `None`). Defaults to `100`. | `100` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series containing parsed structured data aligned with the original index. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext/_series_async.py
parse async ¶
parse(
instructions: str,
response_format: type[ResponseFormat] | None = None,
max_examples: int = 100,
batch_size: int | None = None,
max_concurrency: int = 8,
show_progress: bool = True,
**api_kwargs,
) -> pd.Series
Parse Series values into structured data using an LLM (asynchronously).
Async version of the parse method, extracting structured information from unstructured text with automatic schema inference when needed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | Plain language extraction goals (e.g., "Extract product names, prices, and categories from descriptions"). | *required* |
| `response_format` | `type[ResponseFormat] \| None` | Target structure. `None` triggers automatic schema inference. Defaults to `None`. | `None` |
| `max_examples` | `int` | Maximum values for schema inference. Defaults to `100`. | `100` |
| `batch_size` | `int \| None` | Requests per batch. `None` for automatic optimization. Defaults to `None`. | `None` |
| `max_concurrency` | `int` | Maximum concurrent API requests. Defaults to `8`. | `8` |
| `show_progress` | `bool` | Show progress bar. Defaults to `True`. | `True` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Parsed structured data indexed like the original Series. |
Example
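A usage sketch (hypothetical product strings; assumes a configured OpenAI client):

```python
products = pd.Series([
    "Laptop - $999 - Electronics",
    "Coffee mug - $12 - Kitchen",
])
# Must be awaited; with no response_format, the schema is inferred
parsed = await products.aio.parse(
    "Extract product names, prices, and categories from descriptions",
    batch_size=32,
    max_concurrency=4,
)
```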
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext/_series_async.py
Async DataFrame Accessor (.aio)¶
openaivec.pandas_ext._dataframe_async.AsyncOpenAIVecDataFrameAccessor ¶
pandas DataFrame accessor (.aio) that adds OpenAI helpers.
Source code in src/openaivec/pandas_ext/_dataframe_async.py
Functions¶
responses_with_cache async ¶
responses_with_cache(
instructions: str,
cache: AsyncBatchCache[str, ResponseFormat],
response_format: type[ResponseFormat] = str,
**api_kwargs,
) -> pd.Series
Call an LLM once for every DataFrame row using a provided cache (asynchronously).
This method allows external control over caching behavior by accepting a pre-configured AsyncBatchCache instance, enabling cache sharing across multiple operations or custom batch size management. The concurrency is controlled by the cache instance itself.
Example
from openaivec._cache import AsyncBatchCache
# Create a shared cache with custom batch size and concurrency
shared_cache = AsyncBatchCache(batch_size=64, max_concurrency=4)
df = pd.DataFrame([
{"name": "cat", "legs": 4},
{"name": "dog", "legs": 4},
{"name": "elephant", "legs": 4},
])
# Must be awaited
result = await df.aio.responses_with_cache(
"what is the animal's name?",
cache=shared_cache
)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | System prompt prepended to every user message. | *required* |
| `cache` | `AsyncBatchCache[str, ResponseFormat]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | *required* |
| `response_format` | `type[ResponseFormat]` | Pydantic model or built-in type the assistant should return. Defaults to `str`. | `str` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Responses aligned with the DataFrame's original index. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext/_dataframe_async.py
responses async ¶
responses(
instructions: str,
response_format: type[ResponseFormat] = str,
batch_size: int | None = None,
max_concurrency: int = 8,
show_progress: bool = True,
**api_kwargs,
) -> pd.Series
Call an LLM once for every DataFrame row (asynchronously).
Example
df = pd.DataFrame([
{"name": "cat", "legs": 4},
{"name": "dog", "legs": 4},
{"name": "elephant", "legs": 4},
])
# Must be awaited
results = await df.aio.responses("what is the animal's name?")
# With progress bar for large datasets
large_df = pd.DataFrame({"id": list(range(1000))})
results = await large_df.aio.responses(
"generate a name for this ID",
batch_size=20,
max_concurrency=4,
show_progress=True
)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | System prompt prepended to every user message. | *required* |
| `response_format` | `type[ResponseFormat]` | Pydantic model or built-in type the assistant should return. Defaults to `str`. | `str` |
| `batch_size` | `int \| None` | Number of prompts grouped into a single request. Defaults to `None` (automatic batch size optimization). | `None` |
| `max_concurrency` | `int` | Maximum number of concurrent requests. Defaults to `8`. | `8` |
| `show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `True`. | `True` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Responses aligned with the DataFrame's original index. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext/_dataframe_async.py
task_with_cache async ¶
task_with_cache(
task: PreparedTask[ResponseFormat],
cache: AsyncBatchCache[str, ResponseFormat],
**api_kwargs,
) -> pd.Series
Execute a prepared task on every DataFrame row using a provided cache (asynchronously).
Example
from openaivec._model import PreparedTask
from openaivec._cache import AsyncBatchCache
shared_cache = AsyncBatchCache(batch_size=64, max_concurrency=4)
analysis_task = PreparedTask(...)
df = pd.DataFrame([
{"name": "cat", "legs": 4},
{"name": "dog", "legs": 4},
{"name": "elephant", "legs": 4},
])
results = await df.aio.task_with_cache(analysis_task, cache=shared_cache)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `task` | `PreparedTask` | A pre-configured task containing instructions and a response format for processing the inputs. | *required* |
| `cache` | `AsyncBatchCache[str, ResponseFormat]` | Pre-configured cache instance for managing API call batching and deduplication. Set `cache.batch_size=None` to enable automatic batch size optimization. | *required* |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series whose values are instances of the task's response format, aligned with the DataFrame's original index. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext/_dataframe_async.py
task async ¶
task(
task: PreparedTask,
batch_size: int | None = None,
max_concurrency: int = 8,
show_progress: bool = True,
**api_kwargs,
) -> pd.Series
Execute a prepared task on every DataFrame row (asynchronously).
Example
from openaivec._model import PreparedTask
# Assume you have a prepared task for data analysis
analysis_task = PreparedTask(...)
df = pd.DataFrame([
{"name": "cat", "legs": 4},
{"name": "dog", "legs": 4},
{"name": "elephant", "legs": 4},
])
# Must be awaited
results = await df.aio.task(analysis_task)
# With progress bar for large datasets
large_df = pd.DataFrame({"id": list(range(1000))})
results = await large_df.aio.task(
analysis_task,
batch_size=50,
max_concurrency=4,
show_progress=True
)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `task` | `PreparedTask` | A pre-configured task containing instructions and a response format for processing the inputs. | *required* |
| `batch_size` | `int \| None` | Number of requests sent in one batch to optimize API usage. Defaults to `None` (automatic optimization). | `None` |
| `max_concurrency` | `int` | Maximum number of concurrent requests. Defaults to `8`. | `8` |
| `show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `True`. | `True` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Series whose values are instances of the task's response format, aligned with the DataFrame's original index. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext/_dataframe_async.py
parse_with_cache async ¶
parse_with_cache(
instructions: str,
cache: AsyncBatchCache[str, ResponseFormat],
response_format: type[ResponseFormat] | None = None,
max_examples: int = 100,
**api_kwargs,
) -> pd.Series
Parse DataFrame rows into structured data using an LLM with a provided cache (asynchronously).
Async method for parsing DataFrame rows (as JSON) with external cache control, enabling deduplication across operations and concurrent processing.
Example
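A usage sketch (the invoice rows and instruction are hypothetical; assumes a configured OpenAI client):

```python
from openaivec._cache import AsyncBatchCache

shared_cache = AsyncBatchCache(batch_size=32, max_concurrency=4)
invoices = pd.DataFrame([
    {"text": "Invoice #1: 2 widgets @ $5, total $10"},
    {"text": "Invoice #2: 1 gadget @ $25, total $25"},
])
# Must be awaited; each row is serialized as JSON before parsing
parsed = await invoices.aio.parse_with_cache(
    "Extract invoice details including items, quantities, and totals",
    cache=shared_cache,
)
```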
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | Plain language extraction goals (e.g., "Extract invoice details including items, quantities, and totals"). | *required* |
| `cache` | `AsyncBatchCache[str, ResponseFormat]` | Pre-configured async cache for concurrent API call management. | *required* |
| `response_format` | `type[ResponseFormat] \| None` | Target structure. `None` triggers automatic schema inference. Defaults to `None`. | `None` |
| `max_examples` | `int` | Maximum rows for schema inference. Defaults to `100`. | `100` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Parsed structured data indexed like the original DataFrame. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext/_dataframe_async.py
parse async ¶
parse(
instructions: str,
response_format: type[ResponseFormat] | None = None,
max_examples: int = 100,
batch_size: int | None = None,
max_concurrency: int = 8,
show_progress: bool = True,
**api_kwargs,
) -> pd.Series
Parse DataFrame rows into structured data using an LLM (asynchronously).
Async version for extracting structured information from DataFrame rows, with automatic schema inference when no format is specified.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | Plain language extraction goals (e.g., "Extract customer details, order items, and payment information"). | *required* |
| `response_format` | `type[ResponseFormat] \| None` | Target structure. `None` triggers automatic inference. Defaults to `None`. | `None` |
| `max_examples` | `int` | Maximum rows for schema inference. Defaults to `100`. | `100` |
| `batch_size` | `int \| None` | Rows per batch. `None` for automatic optimization. Defaults to `None`. | `None` |
| `max_concurrency` | `int` | Maximum concurrent requests. Defaults to `8`. | `8` |
| `show_progress` | `bool` | Show progress bar. Defaults to `True`. | `True` |
| `**api_kwargs` | | Additional OpenAI API parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Series` | pandas.Series: Parsed structured data indexed like the original DataFrame. |
Example
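A usage sketch (hypothetical order notes; assumes a configured OpenAI client):

```python
orders = pd.DataFrame([
    {"note": "Alice ordered 2 books, paid by card"},
    {"note": "Bob ordered a lamp, paid in cash"},
])
# Must be awaited; with no response_format, the schema is inferred
parsed = await orders.aio.parse(
    "Extract customer details, order items, and payment information",
    batch_size=32,
    max_concurrency=4,
)
```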
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext/_dataframe_async.py
pipe async ¶
Apply a function to the DataFrame, supporting both synchronous and asynchronous functions.
This method allows chaining operations on the DataFrame, similar to pandas' pipe method,
but with support for asynchronous functions.
Example
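The sync/async dispatch that `pipe` performs can be sketched standalone with plain pandas and asyncio (`pipe_like` is a hypothetical stand-in for the accessor method, not part of openaivec):

```python
import asyncio
import inspect

import pandas as pd


async def pipe_like(df: pd.DataFrame, func):
    """Apply func to df; await the result only if it is awaitable."""
    result = func(df)
    if inspect.isawaitable(result):
        result = await result
    return result


async def main():
    df = pd.DataFrame({"x": [1, 2, 3]})

    # A plain synchronous function works...
    doubled = await pipe_like(df, lambda d: d["x"] * 2)

    # ...and so does an async one.
    async def total(d):
        await asyncio.sleep(0)
        return int(d["x"].sum())

    return list(doubled), await pipe_like(df, total)


doubled, total = asyncio.run(main())
```

With the accessor, the equivalent calls are `await df.aio.pipe(func)` for either kind of function.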
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `func` | `Callable[[DataFrame], Awaitable[T] \| T]` | A function that takes a DataFrame as input and returns either a result or an awaitable result. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `T` | `T` | The result of applying the function, either directly or after awaiting it. |
Note
This is an asynchronous method and must be awaited if the function returns an awaitable.
Source code in src/openaivec/pandas_ext/_dataframe_async.py
assign async ¶
Asynchronously assign new columns to the DataFrame, evaluating sequentially.
This method extends pandas' assign method by supporting asynchronous
functions as column values and evaluating assignments sequentially, allowing
later assignments to refer to columns created earlier in the same call.
For each key-value pair in kwargs:
- If the value is a callable, it is invoked with the current state of the DataFrame
(including columns created in previous steps of this assign call).
If the result is awaitable, it is awaited; otherwise, it is used directly.
- If the value is not callable, it is assigned directly to the new column.
Example
async def compute_column(df):
# Simulate an asynchronous computation
await asyncio.sleep(1)
return df["existing_column"] * 2
async def use_new_column(df):
# Access the column created in the previous step
await asyncio.sleep(1)
return df["new_column"] + 5
df = pd.DataFrame({"existing_column": [1, 2, 3]})
# Must be awaited
df = await df.aio.assign(
new_column=compute_column,
another_column=use_new_column
)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | | Column names as keys and either static values or callables (synchronous or asynchronous) as values. | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | pandas.DataFrame: A new DataFrame with the assigned columns. |
Note
This is an asynchronous method and must be awaited.
Source code in src/openaivec/pandas_ext/_dataframe_async.py
fillna async ¶
fillna(
target_column_name: str,
max_examples: int = 500,
batch_size: int | None = None,
max_concurrency: int = 8,
show_progress: bool = True,
) -> pd.DataFrame
Fill missing values in a DataFrame column using AI-powered inference (asynchronously).
This method uses an LLM to fill missing (NaN) values in the specified column by analyzing patterns in the DataFrame's non-missing rows. It builds a prepared task that supplies examples of similar rows so the model can predict appropriate values for the missing entries.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `target_column_name` | `str` | The name of the column containing missing values that need to be filled. | *required* |
| `max_examples` | `int` | The maximum number of example rows to use for context when predicting missing values. Higher values may improve accuracy but increase API costs and processing time. Defaults to `500`. | `500` |
| `batch_size` | `int \| None` | Number of requests sent in one batch to optimize API usage. Defaults to `None` (automatic optimization). | `None` |
| `max_concurrency` | `int` | Maximum number of concurrent requests. Defaults to `8`. | `8` |
| `show_progress` | `bool` | Show progress bar in Jupyter notebooks. Defaults to `True`. | `True` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | pandas.DataFrame: A new DataFrame with missing values filled in the target column. The original DataFrame is not modified. |
Example
df = pd.DataFrame({
'name': ['Alice', 'Bob', None, 'David'],
'age': [25, 30, 35, None],
'city': ['Tokyo', 'Osaka', 'Kyoto', 'Tokyo']
})
# Fill missing values in the 'name' column (must be awaited)
filled_df = await df.aio.fillna('name')
# With progress bar for large datasets
large_df = pd.DataFrame({'name': [None] * 1000, 'age': list(range(1000))})
filled_df = await large_df.aio.fillna(
'name',
batch_size=32,
max_concurrency=4,
show_progress=True
)
Note
This is an asynchronous method and must be awaited. If the target column has no missing values, the original DataFrame is returned unchanged.
Source code in src/openaivec/pandas_ext/_dataframe_async.py