AI-Powered Data Processing for Pandas & Spark

Welcome to openaivec, a library that brings OpenAI's language models to your data analysis. It lets you run text generation, sentiment analysis, other NLP tasks, and embeddings directly on pandas DataFrames and in Apache Spark workflows, so the same code scales from a notebook to a cluster.

🚀 Quick Start Example

Transform your data with AI in just one line:

import pandas as pd
from openaivec import pandas_ext

# AI-powered data processing
fruits = pd.Series(["apple", "banana", "orange", "grape", "kiwi"])
fruits.ai.responses("Translate this fruit name into French.")
# Result: ['pomme', 'banane', 'orange', 'raisin', 'kiwi']

Perfect for data scientists, analysts, and ML engineers who want to leverage AI for text processing at scale.

📦 Installation

pip

pip install openaivec

uv

uv add openaivec

With Spark Support

pip install "openaivec[spark]"
# or
uv add "openaivec[spark]"

🎯 Key Features

🚀 Vectorized Processing: Handle thousands of records in minutes, not hours
⚡ Asynchronous Interface: .aio accessor with batch_size and max_concurrency control
💰 Cost Efficient: Automatic deduplication significantly reduces API costs
🔗 Seamless Integration: Works within existing pandas/Spark workflows
📈 Enterprise Scale: From 100s to millions of records
🤖 Advanced NLP: Pre-built tasks for sentiment analysis, translation, NER, and more
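
To see why deduplication saves money, consider the general idea in isolation. The sketch below is not openaivec's implementation. It uses only the standard library, and map_unique and fake_model are hypothetical names introduced for illustration. It sends each distinct input to the model once and then fans the results back out to every row:

```python
from typing import Callable, Dict, List


def map_unique(values: List[str], fn: Callable[[List[str]], List[str]]) -> List[str]:
    """Call fn once on the distinct inputs, then map results back to every row."""
    unique = list(dict.fromkeys(values))  # distinct values, first-seen order
    results = fn(unique)
    lookup: Dict[str, str] = dict(zip(unique, results))
    return [lookup[value] for value in values]


calls: List[List[str]] = []


def fake_model(batch: List[str]) -> List[str]:
    # Hypothetical stand-in for a paid API call; it records what it was asked.
    calls.append(batch)
    return [text.upper() for text in batch]


labels = map_unique(["apple", "kiwi", "apple", "apple", "kiwi"], fake_model)
print(labels)
print(calls)
```

Five rows go in, but the stand-in model only sees the two distinct values, which is where the savings come from on columns with many repeats.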

Links

📚 Examples & Tutorials

Get started with these comprehensive examples:

📓 Getting Started - Basic pandas integration and usage
📓 Customer Feedback Analysis - Sentiment analysis & prioritization
📓 Survey Data Transformation - Unstructured to structured data
📓 Spark Processing - Enterprise-scale distributed processing
📓 Async Workflows - High-performance async processing
📓 Prompt Engineering - Advanced prompting techniques
📓 FAQ Generation - Auto-generate FAQs from documents

📖 API Reference

Detailed documentation for all components:

🔗 Main Package - Core classes (BatchResponses, BatchEmbeddings, FewShotPromptBuilder)
🔗 pandas_ext - Pandas Series and DataFrame extensions
🔗 spark - Apache Spark UDF builders
🔗 task - Pre-built task modules for NLP and customer support

Quick Start

Here is a simple example of how to use openaivec with pandas:

import pandas as pd
from openai import OpenAI
from openaivec import pandas_ext

from typing import List

# Set OpenAI/Azure client (optional; auto-detected from environment variables)
pandas_ext.set_client(OpenAI())

# Set models for responses and embeddings (optional; defaults shown)
pandas_ext.set_responses_model("gpt-4.1-mini")
pandas_ext.set_embeddings_model("text-embedding-3-small")


fruits: List[str] = ["apple", "banana", "orange", "grape", "kiwi", "mango", "peach", "pear", "pineapple", "strawberry"]
fruits_df = pd.DataFrame({"name": fruits})

fruits_df is a pandas DataFrame with a single column, name, containing fruit names. Calling the ai accessor on the name column inside assign adds a new column, color, holding the color of each fruit:

fruits_df.assign(
    color=lambda df: df["name"].ai.responses("What is the color of this fruit?")
)

The result is a new DataFrame with the same number of rows as fruits_df, but with an additional column color containing the color of each fruit. The ai accessor uses the OpenAI API to generate the responses for each fruit name in the name column.

name	color
apple	red
banana	yellow
orange	orange
grape	purple
kiwi	brown
mango	orange
peach	orange
pear	green
pineapple	brown
strawberry	red

Structured output is also supported. For example, passing a Pydantic model as response_format makes the ai accessor return a translation column whose values hold each fruit name in English, French, Japanese, Spanish, German, Italian, Portuguese, and Russian:

from pydantic import BaseModel

class Translation(BaseModel):
    en: str  # English
    fr: str  # French
    ja: str  # Japanese
    es: str  # Spanish
    de: str  # German
    it: str  # Italian
    pt: str  # Portuguese
    ru: str  # Russian

fruits_df.assign(
    translation=lambda df: df["name"].ai.responses(
        instructions="Translate this fruit name into English, French, Japanese, Spanish, German, Italian, Portuguese and Russian.",
        response_format=Translation,
    )
)

name	translation
apple	en='Apple' fr='Pomme' ja='リンゴ' es='Manzana' de...
banana	en='Banana' fr='Banane' ja='バナナ' es='Banana' de...
orange	en='Orange' fr='Orange' ja='オレンジ' es='Naranja' de...
grape	en='Grape' fr='Raisin' ja='ブドウ' es='Uva' de='T...
kiwi	en='Kiwi' fr='Kiwi' ja='キウイ' es='Kiwi' de='Kiw...
mango	en='Mango' fr='Mangue' ja='マンゴー' es='Mango' de...
peach	en='Peach' fr='Pêche' ja='モモ' es='Durazno' de...
pear	en='Pear' fr='Poire' ja='梨' es='Pera' de='Birn...
pineapple	en='Pineapple' fr='Ananas' ja='パイナップル' es='Piñ...
strawberry	en='Strawberry' fr='Fraise' ja='イチゴ' es='Fresa...

Structured output can be extracted into separate columns using the extract method. For example, we can extract the translations into separate columns for each language:

fruits_df.assign(
    translation=lambda df: df["name"].ai.responses(
        instructions="Translate this fruit name into English, French, Japanese, Spanish, German, Italian, Portuguese and Russian.",
        response_format=Translation,
    )
).ai.extract("translation")

name	translation_en	translation_fr	translation_ja	translation_es	translation_de	translation_it	translation_pt	translation_ru
apple	Apple	Pomme	リンゴ	Manzana	Apfel	Mela	Maçã	Яблоко
banana	Banana	Banane	バナナ	Banana	Banane	Banana	Banana	Банан
orange	Orange	Orange	オレンジ	Naranja	Orange	Arancia	Laranja	Апельсин
grape	Grape	Raisin	ブドウ	Uva	Traube	Uva	Uva	Виноград
kiwi	Kiwi	Kiwi	キウイ	Kiwi	Kiwi	Kiwi	Kiwi	Киви
mango	Mango	Mangue	マンゴー	Mango	Mango	Mango	Manga	Манго
peach	Peach	Pêche	モモ	Durazno	Pfirsich	Pesca	Pêssego	Персик
pear	Pear	Poire	梨	Pera	Birne	Pera	Pêra	Груша
pineapple	Pineapple	Ananas	パイナップル	Piña	Ananas	Ananas	Abacaxi	Ананас
strawberry	Strawberry	Fraise	イチゴ	Fresa	Erdbeere	Fragola	Morango	Клубника
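
To make the column naming concrete, here is a rough plain-pandas sketch of the same kind of flattening. It is not how openaivec implements extract. The dict values are an assumed stand-in for the Translation objects, and only two languages are shown to keep it short:

```python
import pandas as pd

# Stand-in data: plain dicts instead of Translation model instances.
df = pd.DataFrame(
    {
        "name": ["apple", "banana"],
        "translation": [
            {"en": "Apple", "fr": "Pomme"},
            {"en": "Banana", "fr": "Banane"},
        ],
    }
)

# Expand each dict into its own columns, prefixed with the source column name.
expanded = pd.DataFrame(df["translation"].tolist(), index=df.index).add_prefix("translation_")

# Replace the nested column with the expanded ones.
flat = pd.concat([df.drop(columns="translation"), expanded], axis=1)
print(flat)
```

Each field becomes a column named after the original column plus the field name, matching the translation_en, translation_fr pattern in the table above.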

Asynchronous Processing for High Performance

For processing large datasets efficiently, openaivec provides the .aio accessor that enables asynchronous, concurrent processing:

import asyncio
import pandas as pd
from openaivec import pandas_ext

# Large dataset processing
df = pd.DataFrame({
    "customer_feedback": [
        "Love the new features!",
        "App crashes frequently",
        "Great customer support",
        # ... thousands more rows
    ]
})

async def analyze_feedback():
    # Process with optimized parameters
    sentiments = await df["customer_feedback"].aio.responses(
        "Classify sentiment as positive, negative, or neutral",
        batch_size=64,         # Group 64 rows into each API request
        max_concurrency=16     # Allow up to 16 requests in flight at once
    )

    # Also works with embeddings
    embeddings = await df["customer_feedback"].aio.embeddings(
        batch_size=128,        # Larger batches for embeddings
        max_concurrency=8      # Conservative concurrency for embeddings
    )

    return sentiments, embeddings

# Execute async processing
results = asyncio.run(analyze_feedback())

Performance Tuning Parameters

batch_size (default: adaptive auto-tuning):

- Leave unset (None) to let BatchingMapProxy pick an efficient size (targets 30–60 seconds per batch)
- Set a positive integer for deterministic batch sizes when coordinating with rate limits
- Use 0 or a negative value only when everything fits in a single request
- Typical ranges when you need fixed sizes: 32–128 for responses, 64–256 for embeddings

max_concurrency (default: 8):

- Limits the number of simultaneous API requests
- Higher values: faster processing, but may hit rate limits
- Lower values: more conservative, better for shared API quotas
- Recommended: 4–16, depending on your OpenAI tier and usage patterns
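
To build intuition for how the two parameters interact, the following standard-library sketch splits rows into fixed-size batches and caps the number of in-flight requests with a semaphore. It illustrates the general pattern only and is not openaivec's internals. run_batched and fake_request are hypothetical names, and the sleep stands in for network latency:

```python
import asyncio
from typing import List, Tuple


async def run_batched(
    items: List[str], batch_size: int, max_concurrency: int
) -> Tuple[List[str], int, int]:
    """Return (outputs, number of requests, peak simultaneous requests)."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def fake_request(batch: List[str]) -> List[str]:
        # Hypothetical stand-in for one API request carrying a whole batch.
        nonlocal in_flight, peak
        async with semaphore:  # at most max_concurrency requests pass this point
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # simulated network latency
            in_flight -= 1
            return [text.upper() for text in batch]

    # batch_size controls how many rows travel in each request.
    batches = [items[i : i + batch_size] for i in range(0, len(items), batch_size)]
    responses = await asyncio.gather(*(fake_request(b) for b in batches))
    outputs = [row for response in responses for row in response]
    return outputs, len(batches), peak


rows = [f"row-{i}" for i in range(10)]
outputs, n_requests, peak = asyncio.run(run_batched(rows, batch_size=3, max_concurrency=2))
print(n_requests, peak)
```

With 10 rows and batch_size=3, the sketch issues 4 requests, and the semaphore keeps at most 2 of them running at once. Larger batches mean fewer requests, while higher concurrency shortens wall-clock time at the cost of more pressure on rate limits.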

When to Use Async vs Sync

Use .aio for: Large datasets (1000+ rows), time-sensitive processing, concurrent workflows
Use .ai for: Small datasets, interactive analysis, simple one-off operations