AI-Powered Data Processing for Pandas & Spark¶
Welcome to openaivec - Transform your data analysis with OpenAI's language models! This library enables seamless integration of AI text processing, sentiment analysis, NLP tasks, and embeddings into your Pandas DataFrames and Apache Spark workflows for scalable data insights.
🚀 Quick Start Example¶
Transform your data with AI in just one line:
```python
import pandas as pd
from openaivec import pandas_ext

# AI-powered data processing
fruits = pd.Series(["apple", "banana", "orange", "grape", "kiwi"])
fruits.ai.responses("Translate this fruit name into French.")
# Result: ['pomme', 'banane', 'orange', 'raisin', 'kiwi']
```
Perfect for data scientists, analysts, and ML engineers who want to leverage AI for text processing at scale.
📦 Installation¶
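openaivec is distributed on PyPI under the same name as the import, so it can typically be installed with `pip install openaivec`.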
🎯 Key Features¶
- 🚀 Vectorized Processing: Handle thousands of records in minutes, not hours
- ⚡ Asynchronous Interface: `.aio` accessor with `batch_size` and `max_concurrency` control
- 💰 Cost Efficient: Automatic deduplication significantly reduces API costs (see the sketch after this list)
- 🔗 Seamless Integration: Works within existing pandas/Spark workflows
- 📈 Enterprise Scale: From 100s to millions of records
- 🤖 Advanced NLP: Pre-built tasks for sentiment analysis, translation, NER, and more
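A quick illustration of the deduplication point: in the sketch below, six rows contain only three distinct values, so (per the behaviour described above) only the unique inputs need to be sent to the API. The data and the prompt are made up for illustration.

```python
import pandas as pd
from openaivec import pandas_ext  # registers the .ai accessor

# Six rows, three distinct values: duplicates are answered once and the
# results are mapped back to every row.
reviews = pd.Series(
    ["great app", "crashes a lot", "great app", "okay", "great app", "crashes a lot"]
)
sentiments = reviews.ai.responses("Classify the sentiment as positive, negative, or neutral.")
```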
Links¶
📚 Examples & Tutorials¶
Get started with these comprehensive examples:
- 📓 Getting Started - Basic pandas integration and usage
- 📓 Customer Feedback Analysis - Sentiment analysis & prioritization
- 📓 Survey Data Transformation - Unstructured to structured data
- 📓 Spark Processing - Enterprise-scale distributed processing
- 📓 Async Workflows - High-performance async processing
- 📓 Prompt Engineering - Advanced prompting techniques
- 📓 FAQ Generation - Auto-generate FAQs from documents
📖 API Reference¶
Detailed documentation for all components:
- 🔗 Main Package - Core classes (BatchResponses, BatchEmbeddings, FewShotPromptBuilder)
- 🔗 pandas_ext - Pandas Series and DataFrame extensions
- 🔗 spark - Apache Spark UDF builders
- 🔗 task - Pre-built task modules for NLP and customer support
Quick Start¶
Here is a simple example of how to use `openaivec` with `pandas`:
```python
import pandas as pd
from openai import OpenAI
from openaivec import pandas_ext
from typing import List

# Set the OpenAI client (optional: this is the default client when the
# OPENAI_API_KEY environment variable is set)
pandas_ext.use(OpenAI())

# Set the models for responses and embeddings (optional: these are the defaults)
pandas_ext.responses_model("gpt-4.1-nano")
pandas_ext.embeddings_model("text-embedding-3-small")

fruits: List[str] = ["apple", "banana", "orange", "grape", "kiwi", "mango", "peach", "pear", "pineapple", "strawberry"]
fruits_df = pd.DataFrame({"name": fruits})
```
`fruits_df` is a pandas DataFrame with a single column, `name`, containing the names of fruits. Using the `ai` accessor on the `name` column, we can add a new column, `color`, with the color of each fruit:
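A minimal sketch of that call (the instruction string here is illustrative, not a fixed part of the library):

```python
fruits_df.assign(
    color=lambda df: df["name"].ai.responses("Name the typical color of this fruit in one word."),
)
```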
The result is a new DataFrame with the same number of rows as `fruits_df`, but with an additional column `color` containing the color of each fruit. The `ai` accessor uses the OpenAI API to generate a response for each fruit name in the `name` column.
| name | color |
|---|---|
| apple | red |
| banana | yellow |
| orange | orange |
| grape | purple |
| kiwi | brown |
| mango | orange |
| peach | orange |
| pear | green |
| pineapple | brown |
| strawberry | red |
Structured output is also supported. For example, we will translate the name of each fruit into multiple languages. We can use the `ai` accessor to generate a new column `translation` with the translation of each fruit name into English, French, Japanese, Spanish, German, Italian, Portuguese, and Russian:
```python
from pydantic import BaseModel


class Translation(BaseModel):
    en: str  # English
    fr: str  # French
    ja: str  # Japanese
    es: str  # Spanish
    de: str  # German
    it: str  # Italian
    pt: str  # Portuguese
    ru: str  # Russian


fruits_df.assign(
    translation=lambda df: df["name"].ai.responses(
        instructions="Translate this fruit name into English, French, Japanese, Spanish, German, Italian, Portuguese and Russian.",
        response_format=Translation,
    )
)
```
| name | translation |
|---|---|
| apple | en='Apple' fr='Pomme' ja='リンゴ' es='Manzana' de... |
| banana | en='Banana' fr='Banane' ja='バナナ' es='Banana' de... |
| orange | en='Orange' fr='Orange' ja='オレンジ' es='Naranja' de... |
| grape | en='Grape' fr='Raisin' ja='ブドウ' es='Uva' de='T... |
| kiwi | en='Kiwi' fr='Kiwi' ja='キウイ' es='Kiwi' de='Kiw... |
| mango | en='Mango' fr='Mangue' ja='マンゴー' es='Mango' de... |
| peach | en='Peach' fr='Pêche' ja='モモ' es='Durazno' de... |
| pear | en='Pear' fr='Poire' ja='梨' es='Pera' de='Birn... |
| pineapple | en='Pineapple' fr='Ananas' ja='パイナップル' es='Piñ... |
| strawberry | en='Strawberry' fr='Fraise' ja='イチゴ' es='Fresa... |
Structured output can be extracted into separate columns using the `extract` method. For example, we can extract the translations into separate columns for each language:
```python
fruits_df.assign(
    translation=lambda df: df["name"].ai.responses(
        instructions="Translate this fruit name into English, French, Japanese, Spanish, German, Italian, Portuguese and Russian.",
        response_format=Translation,
    )
).ai.extract("translation")
```
| name | translation_en | translation_fr | translation_ja | translation_es | translation_de | translation_it | translation_pt | translation_ru |
|---|---|---|---|---|---|---|---|---|
| apple | Apple | Pomme | リンゴ | Manzana | Apfel | Mela | Maçã | Яблоко |
| banana | Banana | Banane | バナナ | Banana | Banane | Banana | Banana | Банан |
| orange | Orange | Orange | オレンジ | Naranja | Orange | Arancia | Laranja | Апельсин |
| grape | Grape | Raisin | ブドウ | Uva | Traube | Uva | Uva | Виноград |
| kiwi | Kiwi | Kiwi | キウイ | Kiwi | Kiwi | Kiwi | Kiwi | Киви |
| mango | Mango | Mangue | マンゴー | Mango | Mango | Mango | Manga | Манго |
| peach | Peach | Pêche | モモ | Durazno | Pfirsich | Pesca | Pêssego | Персик |
| pear | Pear | Poire | 梨 | Pera | Birne | Pera | Pêra | Груша |
| pineapple | Pineapple | Ananas | パイナップル | Piña | Ananas | Ananas | Abacaxi | Ананас |
| strawberry | Strawberry | Fraise | イチゴ | Fresa | Erdbeere | Fragola | Morango | Клубника |
Asynchronous Processing for High Performance¶
For processing large datasets efficiently, openaivec provides the `.aio` accessor, which enables asynchronous, concurrent processing:
```python
import asyncio
import pandas as pd
from openaivec import pandas_ext

# Large dataset processing
df = pd.DataFrame({
    "customer_feedback": [
        "Love the new features!",
        "App crashes frequently",
        "Great customer support",
        # ... thousands more rows
    ]
})


async def analyze_feedback():
    # Process with optimized parameters
    sentiments = await df["customer_feedback"].aio.responses(
        "Classify sentiment as positive, negative, or neutral",
        batch_size=64,       # Group 64 requests per API call
        max_concurrency=16,  # Allow 16 concurrent requests
    )

    # Also works with embeddings
    embeddings = await df["customer_feedback"].aio.embeddings(
        batch_size=128,      # Larger batches for embeddings
        max_concurrency=8,   # Conservative concurrency for embeddings
    )

    return sentiments, embeddings


# Execute async processing
results = asyncio.run(analyze_feedback())
```
Performance Tuning Parameters¶
`batch_size` (default: 128 for both responses and embeddings):
- Controls how many inputs are processed in a single API request
- Larger values: Fewer API calls, reduced overhead, but higher memory usage
- Smaller values: More granular processing, better for rate-limited scenarios
- Recommended: 32-128 for responses, 64-256 for embeddings
`max_concurrency` (default: 8):
- Limits the number of simultaneous API requests
- Higher values: Faster processing but may hit rate limits
- Lower values: More conservative, better for shared API quotas
- Recommended: 4-16 depending on your OpenAI tier and usage patterns
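Putting the two parameters together: inside an async function such as `analyze_feedback` above, a call tuned for a rate-limited or shared-quota scenario might look like the sketch below (the values are illustrative picks from the recommended ranges, not library defaults).

```python
summaries = await df["customer_feedback"].aio.responses(
    "Summarize this feedback in one sentence.",
    batch_size=32,      # smaller batches are gentler on rate limits
    max_concurrency=4,  # conservative concurrency for a shared quota
)
```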
When to Use Async vs Sync¶
- Use `.aio` for: Large datasets (1000+ rows), time-sensitive processing, concurrent workflows
- Use `.ai` for: Small datasets, interactive analysis, simple one-off operations