Basics of Spark
This notebook demonstrates how to integrate Apache Spark with OpenAI's API to perform token counting, embedding generation, and multilingual translation using Spark UDFs.
# Initialize the Spark session
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Forward the OpenAI API key from the driver's environment to the worker processes
sc.environment["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY")
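Because os.environ.get returns None when a variable is unset, a missing key would only surface later, inside a UDF call. The following is a minimal sketch of failing early instead; the helper name require_env is hypothetical and is not part of openaivec or PySpark.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible use in the cell above (left as a comment because it needs a live session):
# sc.environment["OPENAI_API_KEY"] = require_env("OPENAI_API_KEY")
```

With this check, a missing key stops the notebook at the setup cell rather than partway through a Spark job.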
Create Dummy Data
Create a simple DataFrame containing names of fruits.
# Create DataFrame with fruit names
fruit_data = [("apple",), ("banana",), ("cherry",), ("mango",), ("orange",), ("peach",), ("pear",), ("pineapple",), ("plum",), ("strawberry",)]
df = spark.createDataFrame(fruit_data, ["name"])
df.createOrReplaceTempView("fruits")
# Display the fruits DataFrame
spark.sql("select * from fruits").show()
+----------+
|      name|
+----------+
|     apple|
|    banana|
|    cherry|
|     mango|
|    orange|
|     peach|
|      pear|
| pineapple|
|      plum|
|strawberry|
+----------+
Count Tokens
Count the number of tokens in each fruit name, as tokenized for an OpenAI GPT model.
# Register UDF to count tokens using OpenAI GPT model
from openaivec.spark import count_tokens_udf
spark.udf.register("count_tokens", count_tokens_udf())
# Show token counts for each fruit name
spark.sql("""
select
name,
count_tokens(name) as token_count
from fruits
""").show()
+----------+-----------+
|      name|token_count|
+----------+-----------+
|     apple|          1|
|    banana|          1|
|    cherry|          2|
|     mango|          2|
|    orange|          1|
|     peach|          2|
|      pear|          1|
| pineapple|          2|
|      plum|          2|
|strawberry|          3|
+----------+-----------+
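One common use of per-row token counts is grouping inputs so that each group stays within a token budget. The sketch below does this greedily in plain Python, using the counts from the table above. The function name pack_by_token_budget and the budget of 5 tokens are hypothetical values chosen for illustration; neither comes from openaivec or the OpenAI API.

```python
from typing import List, Sequence, Tuple


def pack_by_token_budget(rows: Sequence[Tuple[str, int]], budget: int) -> List[List[str]]:
    """Greedily group names, in order, so each group's token total is at most `budget`."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, tokens in rows:
        if tokens > budget:
            raise ValueError(f"{name!r} alone exceeds the budget of {budget} tokens")
        if used + tokens > budget:
            # The current group is full: close it and start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += tokens
    if current:
        batches.append(current)
    return batches


# Token counts copied from the table above.
token_counts = [
    ("apple", 1), ("banana", 1), ("cherry", 2), ("mango", 2), ("orange", 1),
    ("peach", 2), ("pear", 1), ("pineapple", 2), ("plum", 2), ("strawberry", 3),
]

# A budget of 5 tokens per group is an arbitrary, illustrative value.
batches = pack_by_token_budget(token_counts, budget=5)
for batch in batches:
    print(batch)
```

The greedy approach preserves row order and is simple, at the cost of not always producing the fewest possible groups.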
Generate Embeddings
Generate an embedding vector for each fruit name using OpenAI's text-embedding-3-small model.
# Register UDF to generate embeddings
from openaivec.spark import embeddings_udf
spark.udf.register("embed", embeddings_udf(model_name="text-embedding-3-small", batch_size=1024))
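The batch_size argument presumably caps how many inputs are grouped into a single embeddings request. How openaivec batches internally is not shown in this notebook, so the following is only a plain-Python illustration of the general idea; the function name chunked is hypothetical.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


names = ["apple", "banana", "cherry", "mango", "orange"]

# With a batch size of 2, five inputs become three groups; the last one is partial.
for batch in chunked(names, batch_size=2):
    print(batch)
```

With ten fruit names and a batch size of 1024, all rows in a partition would fit in a single group under this scheme.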
# Display embeddings for each fruit name
spark.sql("""
select
name,
embed(name) as embedding
from fruits
""").show()
+----------+--------------------+
|      name|           embedding|
+----------+--------------------+
|     apple|[0.01763439, -0.0...|
|    banana|[0.013411593, -0....|
|    cherry|[0.036222804, -0....|
|     mango|[0.055474974, -0....|
|    orange|[-0.025922043, -0...|
|     peach|[0.030673496, -0....|
|      pear|[0.023664422, -0....|
| pineapple|[0.020983547, -0....|
|      plum|[0.0049052937, 6....|
|strawberry|[0.020106195, -0....|
+----------+--------------------+
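Embedding vectors like these are typically compared with cosine similarity. The sketch below shows the computation in plain Python. The three-dimensional vectors are made-up toy values, not real model output, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar(query: str, vectors: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the other key whose vector is closest to `query`, with its similarity."""
    candidates = [
        (name, cosine_similarity(vectors[query], vec))
        for name, vec in vectors.items()
        if name != query
    ]
    return max(candidates, key=lambda pair: pair[1])


# Toy vectors for illustration only; real embeddings have far more dimensions.
toy_vectors = {
    "apple": [1.0, 0.0, 1.0],
    "pear": [1.0, 0.1, 0.9],
    "plum": [0.0, 1.0, 0.0],
}

name, score = most_similar("apple", toy_vectors)
print(name, round(score, 4))
```

On real data, the same comparison could be applied to the arrays in the embedding column after collecting them to the driver.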
Multilingual Translation
Translate each fruit name into eight languages with an OpenAI GPT model, returning the result as a structured Translation record with one field per language.
# Register UDF for multilingual translation
from pydantic import BaseModel
from openaivec.spark import responses_udf

# Structured response schema: one string field per target language
class Translation(BaseModel):
    en: str
    fr: str
    ja: str
    es: str
    de: str
    it: str
    pt: str
    ru: str

spark.udf.register("translate", responses_udf(
    instructions="Translate the following text to English, French, Japanese, Spanish, German, Italian, Portuguese, and Russian.",
    response_format=Translation,
    model_name="gpt-4.1-nano"
))
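Passing a schema as response_format suggests that the model's output is parsed into a record with exactly these eight fields. The sketch below illustrates that idea using only the standard library, with a dataclass standing in for the Pydantic model. It does not reproduce what responses_udf does internally; the names TranslationRecord and parse_translation are hypothetical, and the sample payload is hand-written.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TranslationRecord:
    """Standard-library stand-in for the Translation schema (illustration only)."""
    en: str
    fr: str
    ja: str
    es: str
    de: str
    it: str
    pt: str
    ru: str


def parse_translation(payload: str) -> TranslationRecord:
    """Parse a JSON object and check it has exactly the expected string fields."""
    data = json.loads(payload)
    expected = {f.name for f in fields(TranslationRecord)}
    missing = expected - set(data)
    extra = set(data) - expected
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)}, unexpected={sorted(extra)}")
    for key, value in data.items():
        if not isinstance(value, str):
            raise TypeError(f"field {key!r} must be a string")
    return TranslationRecord(**data)


# Hand-written sample payload for the input "apple".
sample = json.dumps(
    {"en": "apple", "fr": "pomme", "ja": "リンゴ", "es": "manzana",
     "de": "Apfel", "it": "mela", "pt": "maçã", "ru": "яблоко"},
    ensure_ascii=False,
)

record = parse_translation(sample)
print(record.fr, record.ja)
```

A fixed schema like this is what makes it possible to address individual languages as struct fields in Spark SQL.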
# Display translations for each fruit name
spark.sql("""
select
name,
translate(name) as t,
t.en as en,
t.fr as fr,
t.ja as ja,
t.es as es,
t.de as de,
t.it as it,
t.pt as pt,
t.ru as ru
from fruits
""").show()
+----------+-----------------------+----------+------+------------+-------+--------+--------+-------+--------+
|      name|                      t|        en|    fr|          ja|     es|      de|      it|     pt|      ru|
+----------+-----------------------+----------+------+------------+-------+--------+--------+-------+--------+
|     apple| {apple, pomme, リン...|     apple| pomme|      リンゴ|manzana|   Apfel|    mela|   maçã|  яблоко|
|    banana|   {banana, banane, ...|    banana|banane|      バナナ|plátano|  Banane|  banana| banana|   банан|
|    cherry|   {cherry, cerise, ...|    cherry|cerise|  さくらんぼ| cereza| Kirsche|ciliegia| cereja|   вишня|
|     mango|  {mango, mangue, マ...|     mango|mangue|    マンゴー|  mango|   Mango|   mango|  manga|   манго|
|    orange|   {orange, orange, ...|    orange|orange|    オレンジ|naranja|  Orange| arancia|laranja|апельсин|
|     peach| {peach, pêche, もも...|     peach| pêche|        もも|durazno|Pfirsich|   pesca|pêssego|  персик|
|      pear|  {pear, poire, 梨, ...|      pear| poire|          梨|   pera|   Birne|    pera|   pêra|   груша|
| pineapple|   {Pineapple, Anana...| Pineapple|Ananas|パイナップル|   Piña|  Ananas|  Ananas|Abacaxi|  Ананас|
|      plum|{plum, prune, プラム...|      plum| prune|      プラム|ciruela| Pflaume|  prugna| ameixa|   слива|
|strawberry|   {strawberry, frai...|strawberry|fraise|      イチゴ|  fresa|Erdbeere| fragola|morango|клубника|
+----------+-----------------------+----------+------+------------+-------+--------+--------+-------+--------+
Conclusion
This notebook showed how to call OpenAI's API from Apache Spark through UDFs registered for Spark SQL, covering token counting, embedding generation, and multilingual translation with structured output.