Spark Extension¶
openaivec.spark_ext ¶
Asynchronous Spark UDFs for the OpenAI and Azure OpenAI APIs.
This module provides functions (responses_udf, task_udf, embeddings_udf,
count_tokens_udf, split_to_chunks_udf, similarity_udf, parse_udf)
for creating asynchronous Spark UDFs that communicate with either the public
OpenAI API or Azure OpenAI.
It supports UDFs for generating responses, creating embeddings, parsing text,
and computing similarities asynchronously. The UDFs operate on Spark DataFrames
and leverage asyncio for improved performance in I/O-bound operations.
Performance Optimization: All AI-powered UDFs (responses_udf, task_udf, embeddings_udf, parse_udf)
automatically cache duplicate inputs within each partition, significantly reducing
API calls and costs when processing datasets with overlapping content.
Setup¶
First, obtain a Spark session and configure authentication:
```python
from pyspark.sql import SparkSession
from openaivec.spark_ext import setup, setup_azure

spark = SparkSession.builder.getOrCreate()

# Option 1: Using OpenAI
setup(
    spark,
    api_key="your-openai-api-key",
    responses_model_name="gpt-4.1-mini",  # Optional: set default model
    embeddings_model_name="text-embedding-3-small",  # Optional: set default model
)

# Option 2: Using Azure OpenAI
# setup_azure(
#     spark,
#     api_key="your-azure-openai-api-key",
#     base_url="https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/",
#     api_version="v1",
#     responses_model_name="my-gpt4-deployment",  # Optional: set default deployment
#     embeddings_model_name="my-embedding-deployment",  # Optional: set default deployment
# )

# Option 3: Using Azure OpenAI with Entra ID (no API key)
# Set AZURE_OPENAI_BASE_URL and AZURE_OPENAI_API_VERSION in your environment.
# openaivec automatically uses DefaultAzureCredential when AZURE_OPENAI_API_KEY is not set.
```
Next, create UDFs and register them:
```python
from openaivec.spark_ext import (
    responses_udf,
    task_udf,
    embeddings_udf,
    count_tokens_udf,
    split_to_chunks_udf,
    similarity_udf,
)
from pydantic import BaseModel

# Define a Pydantic model for structured responses (optional)
class Translation(BaseModel):
    en: str
    fr: str
    # ... other languages

# Register the asynchronous responses UDF with performance tuning
spark.udf.register(
    "translate_async",
    responses_udf(
        instructions="Translate the text to multiple languages.",
        response_format=Translation,
        model_name="gpt-4.1-mini",  # For Azure: deployment name; for OpenAI: model name
        batch_size=64,  # Rows per API request within a partition
        max_concurrency=8,  # Concurrent requests PER EXECUTOR
    ),
)

# Or use a predefined task with task_udf
from openaivec.task import nlp

spark.udf.register(
    "sentiment_async",
    task_udf(nlp.sentiment_analysis()),
)

# Register the asynchronous embeddings UDF with performance tuning
spark.udf.register(
    "embed_async",
    embeddings_udf(
        model_name="text-embedding-3-small",  # For Azure: deployment name; for OpenAI: model name
        batch_size=128,  # Larger batches for embeddings
        max_concurrency=8,  # Concurrent requests PER EXECUTOR
    ),
)

# Register token counting, text chunking, and similarity UDFs
spark.udf.register("count_tokens", count_tokens_udf())
spark.udf.register("split_chunks", split_to_chunks_udf(max_tokens=512, sep=[".", "!", "?"]))
spark.udf.register("compute_similarity", similarity_udf())
```
You can now invoke the UDFs from Spark SQL:
```sql
SELECT
    text,
    translate_async(text) AS translation,
    sentiment_async(text) AS sentiment,
    embed_async(text) AS embedding,
    count_tokens(text) AS token_count,
    split_chunks(text) AS chunks,
    compute_similarity(embed_async(text1), embed_async(text2)) AS similarity
FROM your_table;
```
Performance Considerations¶
When using these UDFs in distributed Spark environments:
- `batch_size`: Controls rows processed per API request within each partition. Recommended: 32-128 for responses, 64-256 for embeddings.
- `max_concurrency`: Sets concurrent API requests PER EXECUTOR, not per cluster. Total cluster concurrency = max_concurrency × number_of_executors. Recommended: 4-12 per executor to avoid overwhelming OpenAI rate limits.
- Rate Limit Management: Monitor OpenAI API usage when scaling executors. Consider your OpenAI tier limits and adjust max_concurrency accordingly.
Example for a 5-executor cluster with max_concurrency=8: Total concurrent requests = 8 × 5 = 40 simultaneous API calls.
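The capacity arithmetic above is simple enough to encode when planning a rollout; a minimal sketch (the executor count and settings are illustrative, not defaults):

```python
def total_concurrency(max_concurrency: int, num_executors: int) -> int:
    """Upper bound on simultaneous OpenAI API calls across the cluster."""
    return max_concurrency * num_executors

# 5 executors, max_concurrency=8 => 40 simultaneous API calls
print(total_concurrency(max_concurrency=8, num_executors=5))
```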
Note: AI-powered UDFs run one reusable asyncio event loop per invocation and use partition-local caches to avoid duplicate remote calls inside a partition.
Classes¶
Functions¶
setup ¶
```python
setup(
    spark: SparkSession,
    api_key: str,
    responses_model_name: str | None = None,
    embeddings_model_name: str | None = None,
)
```
Setup OpenAI authentication and default model names in the Spark environment.

1. Configures the OpenAI API key in the SparkContext environment.
2. Configures the OpenAI API key in the local process environment.
3. Optionally registers default model names for responses and embeddings in the DI container.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `spark` | `SparkSession` | The Spark session to configure. | required |
| `api_key` | `str` | OpenAI API key for authentication. | required |
| `responses_model_name` | `str \| None` | Default model name for response generation. If provided, registers `ResponsesModelName` in the DI container. | `None` |
| `embeddings_model_name` | `str \| None` | Default model name for embeddings. If provided, registers `EmbeddingsModelName` in the DI container. | `None` |
Example
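A minimal usage sketch, mirroring the Setup section above (the API key and model names are placeholders):

```python
from pyspark.sql import SparkSession
from openaivec.spark_ext import setup

spark = SparkSession.builder.getOrCreate()
setup(
    spark,
    api_key="your-openai-api-key",
    responses_model_name="gpt-4.1-mini",
    embeddings_model_name="text-embedding-3-small",
)
```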
Source code in src/openaivec/spark_ext.py
setup_azure ¶
```python
setup_azure(
    spark: SparkSession,
    api_key: str | None = None,
    base_url: str | None = None,
    api_version: str = "v1",
    responses_model_name: str | None = None,
    embeddings_model_name: str | None = None,
)
```
Setup Azure OpenAI authentication and default model names in the Spark environment.

1. Configures the Azure OpenAI base URL and API version in the SparkContext environment.
2. Optionally configures the Azure OpenAI API key in the SparkContext environment.
3. Configures the Azure OpenAI base URL and API version in the local process environment.
4. Optionally configures the Azure OpenAI API key in the local process environment.
5. Optionally registers default model names for responses and embeddings in the DI container.
Note
For API-key authentication, provide api_key. For Entra ID authentication,
omit api_key and configure only base_url and api_version.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `spark` | `SparkSession` | The Spark session to configure. | required |
| `api_key` | `str \| None` | Azure OpenAI API key for authentication. When not provided, `AZURE_OPENAI_API_KEY` is cleared and Entra ID can be used. | `None` |
| `base_url` | `str \| None` | Base URL for the Azure OpenAI resource. Required. | `None` |
| `api_version` | `str` | API version to use. Defaults to `"v1"`. | `'v1'` |
| `responses_model_name` | `str \| None` | Default model name for response generation. If provided, registers `ResponsesModelName` in the DI container. | `None` |
| `embeddings_model_name` | `str \| None` | Default model name for embeddings. If provided, registers `EmbeddingsModelName` in the DI container. | `None` |
Example
```python
from pyspark.sql import SparkSession
from openaivec.spark_ext import setup_azure

spark = SparkSession.builder.getOrCreate()
setup_azure(
    spark,
    api_key="azure-key",
    base_url="https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/",
    api_version="v1",
    responses_model_name="gpt4-deployment",
    embeddings_model_name="embedding-deployment",
)
```
Raises:

| Type | Description |
|---|---|
| `ValueError` | If `base_url` is not provided. |
Source code in src/openaivec/spark_ext.py
setup_entra_id ¶
```python
setup_entra_id(
    spark: SparkSession,
    base_url: str,
    tenant_id: str,
    client_id: str,
    client_secret: str | None = None,
    kv_url: str | None = None,
    kv_secret_name: str | None = None,
    api_version: str = "v1",
    responses_model_name: str | None = None,
    embeddings_model_name: str | None = None,
)
```
Setup Entra ID (Service Principal) authentication for Spark environment.
Configures Azure OpenAI with Service Principal credentials using
DefaultAzureCredential (via EnvironmentCredential).
Propagates credentials to both the local process and Spark executors
via sc.environment.
The client secret can be provided directly via client_secret, or
retrieved from Key Vault via kv_url and kv_secret_name (requires
notebookutils on the Fabric driver).
Note
Any existing OPENAI_API_KEY or AZURE_OPENAI_API_KEY is
cleared to ensure the Entra ID path is used.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `spark` | `SparkSession` | The Spark session to configure. | required |
| `base_url` | `str` | Base URL for the Azure OpenAI resource (e.g. `https://YOUR-RESOURCE.services.ai.azure.com/openai/v1/`). | required |
| `tenant_id` | `str` | Entra ID tenant ID. | required |
| `client_id` | `str` | Service Principal (App Registration) client ID. | required |
| `client_secret` | `str \| None` | Service Principal client secret. If `None`, the secret is retrieved from Key Vault via `kv_url` and `kv_secret_name`. | `None` |
| `kv_url` | `str \| None` | Azure Key Vault URL for secret retrieval. Required when `client_secret` is not provided. | `None` |
| `kv_secret_name` | `str \| None` | Secret name in Key Vault. Required when `client_secret` is not provided. | `None` |
| `api_version` | `str` | API version to use. Defaults to `"v1"`. | `'v1'` |
| `responses_model_name` | `str \| None` | Default model name for response generation. If provided, registers `ResponsesModelName` in the DI container. | `None` |
| `embeddings_model_name` | `str \| None` | Default model name for embeddings. If provided, registers `EmbeddingsModelName` in the DI container. | `None` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If neither `client_secret` nor both `kv_url` and `kv_secret_name` are provided. |
Example
```python
from pyspark.sql import SparkSession
from openaivec.spark_ext import setup_entra_id

spark = SparkSession.builder.getOrCreate()

# Option 1: Provide client_secret directly
setup_entra_id(
    spark,
    base_url="https://YOUR-RESOURCE.services.ai.azure.com/openai/v1/",
    tenant_id="your-tenant-id",
    client_id="your-client-id",
    client_secret="your-secret",
)

# Option 2: Retrieve from Key Vault (Fabric driver only)
setup_entra_id(
    spark,
    base_url="https://YOUR-RESOURCE.services.ai.azure.com/openai/v1/",
    tenant_id="your-tenant-id",
    client_id="your-client-id",
    kv_url="https://YOUR-KEYVAULT.vault.azure.net/",
    kv_secret_name="your-secret-name",
)
```
Source code in src/openaivec/spark_ext.py
responses_udf ¶
```python
responses_udf(
    instructions: str,
    response_format: type[ResponseFormat] = str,
    model_name: str | None = None,
    batch_size: int | None = None,
    max_concurrency: int = 8,
    multimodal: bool = False,
    **api_kwargs,
) -> UserDefinedFunction
```
Create an asynchronous Spark pandas UDF for generating responses.
Configures and builds UDFs that use AsyncBatchResponses on a single
reusable event loop per UDF invocation. Each partition maintains its own
bounded cache to eliminate duplicate API calls within the partition while
avoiding repeated event-loop setup per Arrow batch.
Note

Authentication must be configured via SparkContext environment variables. Set the appropriate variables on the SparkContext:

```python
# For OpenAI:
sc.environment["OPENAI_API_KEY"] = "your-openai-api-key"

# For Azure OpenAI (API key auth):
sc.environment["AZURE_OPENAI_API_KEY"] = "your-azure-openai-api-key"
sc.environment["AZURE_OPENAI_BASE_URL"] = "https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/"
sc.environment["AZURE_OPENAI_API_VERSION"] = "v1"

# For Azure OpenAI (Entra ID auth):
sc.environment["AZURE_OPENAI_BASE_URL"] = "https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/"
sc.environment["AZURE_OPENAI_API_VERSION"] = "v1"
# Do not set AZURE_OPENAI_API_KEY
```
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | The system prompt or instructions for the model. | required |
| `response_format` | `type[ResponseFormat]` | The desired output format. Either `str` for plain text or a Pydantic `BaseModel` subclass for structured output. | `str` |
| `model_name` | `str \| None` | For Azure OpenAI, use your deployment name (e.g., "my-gpt4-deployment"). For OpenAI, use the model name (e.g., "gpt-4.1-mini"). Defaults to the model registered in the DI container via `ResponsesModelName` if not provided. | `None` |
| `batch_size` | `int \| None` | Number of rows per async batch request within each partition. Larger values reduce API call overhead but increase memory usage. Defaults to `None` (automatic batch size optimization that dynamically adjusts based on execution time, targeting 30-60 seconds per batch). Set to a positive integer (e.g., 32-128) for a fixed batch size. | `None` |
| `max_concurrency` | `int` | Maximum number of concurrent API requests PER EXECUTOR. Total cluster concurrency = max_concurrency × number_of_executors. Higher values increase throughput but may hit OpenAI rate limits. Recommended: 4-12 per executor. Defaults to 8. | `8` |
| `**api_kwargs` | | Additional OpenAI API parameters (e.g. `temperature`, `top_p`, `seed`, `max_output_tokens`) forwarded verbatim to the underlying API calls. | `{}` |
Returns:

| Name | Type | Description |
|---|---|---|
| UserDefinedFunction | `UserDefinedFunction` | A Spark pandas UDF configured to generate responses asynchronously. Output schema is `StringType` for `str` response format or a struct derived from `response_format` for `BaseModel`. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If |
Example
```python
from pyspark.sql import SparkSession
from openaivec.spark_ext import responses_udf, setup

spark = SparkSession.builder.getOrCreate()
setup(spark, api_key="sk-***", responses_model_name="gpt-4.1-mini")

udf = responses_udf("Reply with one word.")
spark.udf.register("short_answer", udf)
df = spark.createDataFrame([("hello",), ("bye",)], ["text"])
df.selectExpr("short_answer(text) as reply").show()
```
Note

For optimal performance in distributed environments:

- Automatic Caching: Duplicate inputs within each partition are cached, significantly reducing API calls and costs on datasets with repeated content.
- Monitor OpenAI API rate limits when scaling executor count.
- Consider your OpenAI tier limits: total_requests = max_concurrency × executors.
- Use the Spark UI to optimize partition sizes relative to batch_size.
- Multimodal: Local file paths are not accessible from executors. Use HTTP(S) URLs or pre-encoded data URIs when `multimodal=True`.
Source code in src/openaivec/spark_ext.py
task_udf ¶
```python
task_udf(
    task: PreparedTask[ResponseFormat],
    model_name: str | None = None,
    batch_size: int | None = None,
    max_concurrency: int = 8,
    multimodal: bool = False,
    **api_kwargs,
) -> UserDefinedFunction
```
Create an asynchronous Spark pandas UDF from a predefined task.
This function allows users to create UDFs from predefined tasks such as sentiment analysis, translation, or other common NLP operations defined in the openaivec.task module. Each partition maintains its own cache to eliminate duplicate API calls within the partition, significantly reducing API usage and costs when processing datasets with overlapping content.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `task` | `PreparedTask` | A predefined task configuration containing instructions and response format. | required |
| `model_name` | `str \| None` | For Azure OpenAI, use your deployment name (e.g., "my-gpt4-deployment"). For OpenAI, use the model name (e.g., "gpt-4.1-mini"). Defaults to the model registered in the DI container via `ResponsesModelName` if not provided. | `None` |
| `batch_size` | `int \| None` | Number of rows per async batch request within each partition. Larger values reduce API call overhead but increase memory usage. Defaults to `None` (automatic batch size optimization that dynamically adjusts based on execution time, targeting 30-60 seconds per batch). Set to a positive integer (e.g., 32-128) for a fixed batch size. | `None` |
| `max_concurrency` | `int` | Maximum number of concurrent API requests PER EXECUTOR. Total cluster concurrency = max_concurrency × number_of_executors. Higher values increase throughput but may hit OpenAI rate limits. Recommended: 4-12 per executor. Defaults to 8. | `8` |
Additional Keyword Args
Arbitrary OpenAI Responses API parameters (e.g. temperature, top_p,
frequency_penalty, presence_penalty, seed, max_output_tokens, etc.)
are forwarded verbatim to the underlying API calls. These parameters are applied to
all API requests made by the UDF.
Returns:

| Name | Type | Description |
|---|---|---|
| UserDefinedFunction | `UserDefinedFunction` | A Spark pandas UDF configured to execute the specified task asynchronously with automatic caching for duplicate inputs within each partition. Output schema is `StringType` for `str` response format or a struct derived from the task's response format for `BaseModel`. |
Example
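A sketch using the predefined sentiment-analysis task (credentials are placeholders; requires a live Spark session and API access):

```python
from pyspark.sql import SparkSession
from openaivec.spark_ext import setup, task_udf
from openaivec.task import nlp

spark = SparkSession.builder.getOrCreate()
setup(spark, api_key="sk-***", responses_model_name="gpt-4.1-mini")

spark.udf.register("sentiment", task_udf(nlp.sentiment_analysis()))
df = spark.createDataFrame([("great product",), ("bad service",)], ["text"])
df.selectExpr("sentiment(text) AS sentiment").show()
```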
Note
Automatic Caching: Duplicate inputs within each partition are cached, reducing API calls and costs significantly on datasets with repeated content.
Source code in src/openaivec/spark_ext.py
infer_schema ¶
```python
infer_schema(
    instructions: str,
    example_table_name: str,
    example_field_name: str,
    max_examples: int = 100,
) -> SchemaInferenceOutput
```
Infer the schema for a response format based on example data.
This function retrieves examples from a Spark table and infers the schema for the response format using the provided instructions. It is useful when you want to dynamically generate a schema based on existing data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | Instructions for the model to infer the schema. | required |
| `example_table_name` | `str` | Name of the Spark table containing example data. | required |
| `example_field_name` | `str` | Name of the field in the table to use as examples. | required |
| `max_examples` | `int` | Maximum number of examples to retrieve for schema inference. | `100` |
Returns:

| Name | Type | Description |
|---|---|---|
| InferredSchema | `SchemaInferenceOutput` | An object containing the inferred schema and response format. |
Example
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(
    [("great product",), ("bad service",)],
    ["text"],
).createOrReplaceTempView("examples")

infer_schema(
    instructions="Classify sentiment as positive or negative.",
    example_table_name="examples",
    example_field_name="text",
    max_examples=2,
)
```
Source code in src/openaivec/spark_ext.py
parse_udf ¶
```python
parse_udf(
    instructions: str,
    response_format: type[ResponseFormat] | None = None,
    example_table_name: str | None = None,
    example_field_name: str | None = None,
    max_examples: int = 100,
    model_name: str | None = None,
    batch_size: int | None = None,
    max_concurrency: int = 8,
    multimodal: bool = False,
    **api_kwargs,
) -> UserDefinedFunction
```
Create an asynchronous Spark pandas UDF for parsing responses.

This function allows users to create UDFs that parse responses based on provided instructions and either a predefined response format or example data. It supports both structured responses using Pydantic models and plain text responses. Each partition maintains its own cache to eliminate duplicate API calls within the partition, significantly reducing API usage and costs when processing datasets with overlapping content.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `instructions` | `str` | The system prompt or instructions for the model. | required |
| `response_format` | `type[ResponseFormat] \| None` | The desired output format. Either `str` for plain text or a Pydantic `BaseModel` subclass for structured output. If `None`, the format is inferred from example data. | `None` |
| `example_table_name` | `str \| None` | Name of the Spark table containing example data. If provided, `example_field_name` must also be provided. | `None` |
| `example_field_name` | `str \| None` | Name of the field in the table to use as examples. If provided, `example_table_name` must also be provided. | `None` |
| `max_examples` | `int` | Maximum number of examples to retrieve for schema inference. Defaults to 100. | `100` |
| `model_name` | `str \| None` | For Azure OpenAI, use your deployment name (e.g., "my-gpt4-deployment"). For OpenAI, use the model name (e.g., "gpt-4.1-mini"). Defaults to the model registered in the DI container via `ResponsesModelName` if not provided. | `None` |
| `batch_size` | `int \| None` | Number of rows per async batch request within each partition. Larger values reduce API call overhead but increase memory usage. Defaults to `None` (automatic batch size optimization that dynamically adjusts based on execution time, targeting 30-60 seconds per batch). Set to a positive integer (e.g., 32-128) for a fixed batch size. | `None` |
| `max_concurrency` | `int` | Maximum number of concurrent API requests PER EXECUTOR. Total cluster concurrency = max_concurrency × number_of_executors. Higher values increase throughput but may hit OpenAI rate limits. Recommended: 4-12 per executor. Defaults to 8. | `8` |
| `**api_kwargs` | | Additional OpenAI API parameters (e.g. `temperature`, `top_p`, `seed`, `max_output_tokens`) forwarded verbatim to the underlying API calls. | `{}` |
Example

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(
    [("Order #123 delivered",), ("Order #456 delayed",)],
    ["body"],
).createOrReplaceTempView("messages")

udf = parse_udf(
    instructions="Extract order id as `order_id` and status as `status`.",
    example_table_name="messages",
    example_field_name="body",
)
spark.udf.register("parse_ticket", udf)
spark.sql("SELECT parse_ticket(body) AS parsed FROM messages").show()
```
Returns:

| Name | Type | Description |
|---|---|---|
| UserDefinedFunction | `UserDefinedFunction` | A Spark pandas UDF configured to parse text asynchronously. Output schema is `StringType` for `str` response format or a struct derived from the `response_format` for `BaseModel`. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If neither `response_format` nor both `example_table_name` and `example_field_name` are provided. |
Source code in src/openaivec/spark_ext.py
embeddings_udf ¶
```python
embeddings_udf(
    model_name: str | None = None,
    batch_size: int | None = None,
    max_concurrency: int = 8,
    **api_kwargs,
) -> UserDefinedFunction
```
Create an asynchronous Spark pandas UDF for generating embeddings.
Configures and builds UDFs that use AsyncBatchEmbeddings on a single
reusable event loop per UDF invocation. Each partition maintains its own
bounded cache to eliminate duplicate API calls within the partition while
avoiding repeated event-loop setup per Arrow batch.
Note

Authentication must be configured via SparkContext environment variables. Set the appropriate variables on the SparkContext:

```python
# For OpenAI:
sc.environment["OPENAI_API_KEY"] = "your-openai-api-key"

# For Azure OpenAI (API key auth):
sc.environment["AZURE_OPENAI_API_KEY"] = "your-azure-openai-api-key"
sc.environment["AZURE_OPENAI_BASE_URL"] = "https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/"
sc.environment["AZURE_OPENAI_API_VERSION"] = "v1"

# For Azure OpenAI (Entra ID auth):
sc.environment["AZURE_OPENAI_BASE_URL"] = "https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/"
sc.environment["AZURE_OPENAI_API_VERSION"] = "v1"
# Do not set AZURE_OPENAI_API_KEY
```
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str \| None` | For Azure OpenAI, use your deployment name (e.g., "my-embedding-deployment"). For OpenAI, use the model name (e.g., "text-embedding-3-small"). Defaults to the model registered in the DI container via `EmbeddingsModelName` if not provided. | `None` |
| `batch_size` | `int \| None` | Number of rows per async batch request within each partition. Larger values reduce API call overhead but increase memory usage. Defaults to `None` (automatic batch size optimization that dynamically adjusts based on execution time, targeting 30-60 seconds per batch). Set to a positive integer (e.g., 64-256) for a fixed batch size. Embeddings typically handle larger batches efficiently. | `None` |
| `max_concurrency` | `int` | Maximum number of concurrent API requests PER EXECUTOR. Total cluster concurrency = max_concurrency × number_of_executors. Higher values increase throughput but may hit OpenAI rate limits. Recommended: 4-12 per executor. Defaults to 8. | `8` |
| `**api_kwargs` | | Additional OpenAI API parameters (e.g., `dimensions` for text-embedding-3 models). | `{}` |
Returns:

| Name | Type | Description |
|---|---|---|
| UserDefinedFunction | `UserDefinedFunction` | A Spark pandas UDF configured to generate embeddings asynchronously with automatic caching for duplicate inputs within each partition, returning an `ArrayType(FloatType())` column. |
Note

For optimal performance in distributed environments:

- Automatic Caching: Duplicate inputs within each partition are cached, significantly reducing API calls and costs on datasets with repeated content.
- Monitor OpenAI API rate limits when scaling executor count.
- Consider your OpenAI tier limits: total_requests = max_concurrency × executors.
- The Embeddings API typically has higher throughput than chat completions.
- Use a larger batch_size for embeddings compared to response generation.
Source code in src/openaivec/spark_ext.py
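A usage sketch mirroring the responses_udf example (credentials are placeholders; requires a live Spark session and API access):

```python
from pyspark.sql import SparkSession
from openaivec.spark_ext import embeddings_udf, setup

spark = SparkSession.builder.getOrCreate()
setup(spark, api_key="sk-***", embeddings_model_name="text-embedding-3-small")

spark.udf.register("embed", embeddings_udf(batch_size=128))
df = spark.createDataFrame([("hello",), ("bye",)], ["text"])
df.selectExpr("embed(text) AS embedding").show(truncate=False)
```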
split_to_chunks_udf ¶
Create a pandas UDF that splits text into token-bounded chunks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `max_tokens` | `int` | Maximum tokens allowed per chunk. | required |
| `sep` | `list[str]` | Ordered list of separator strings used when splitting the text. | required |

Returns:

| Type | Description |
|---|---|
| `UserDefinedFunction` | A pandas UDF producing an `ArrayType(StringType())` column of chunks. |
Source code in src/openaivec/spark_ext.py
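The chunking behaviour can be pictured with a simplified, pure-Python sketch: split on the separators, then greedily pack segments up to the token budget. The real UDF counts tokens with tiktoken; here `len(text.split())` stands in as the token count, so this is only illustrative:

```python
import re

def split_to_chunks(text: str, max_tokens: int, sep: list[str]) -> list[str]:
    # Split on any separator, keeping each separator attached to its segment.
    pattern = "(" + "|".join(re.escape(s) for s in sep) + ")"
    parts = re.split(pattern, text)
    segments = ["".join(parts[i:i + 2]).strip() for i in range(0, len(parts), 2)]
    segments = [s for s in segments if s]

    chunks: list[str] = []
    current = ""
    for seg in segments:
        candidate = (current + " " + seg).strip()
        # Stand-in token count; the real UDF uses tiktoken.
        # The first segment is always accepted, even if over budget.
        if len(candidate.split()) <= max_tokens or not current:
            current = candidate
        else:
            chunks.append(current)
            current = seg
    if current:
        chunks.append(current)
    return chunks

print(split_to_chunks("One. Two. Three!", max_tokens=2, sep=[".", "!"]))
```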
count_tokens_udf ¶
Create a pandas UDF that counts tokens for every string cell.

The UDF uses tiktoken to approximate tokenisation and caches the resulting `Encoding` object per executor.
Returns:

| Type | Description |
|---|---|
| `UserDefinedFunction` | A pandas UDF producing an `IntegerType` column of token counts. |
Source code in src/openaivec/spark_ext.py
similarity_udf ¶
Create a pandas-UDF that computes cosine similarity between embedding vectors.
Returns:

| Name | Type | Description |
|---|---|---|
| UserDefinedFunction | `UserDefinedFunction` | A Spark pandas UDF that takes two embedding vector columns and returns their cosine similarity as a `FloatType` column. |
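The returned value is standard cosine similarity; a pure-Python sketch of the underlying computation (the actual UDF operates on whole columns, this is per-pair and illustrative only):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```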