OpenAIVideoTarget supports three modes:
Text-to-video: Generate a video from a text prompt.
Remix: Create a variation of an existing video (using video_id from a prior generation).
Text+Image-to-video: Use an image as the first frame of the generated video.
Note that the video scorer requires OpenCV, which is not a default PyRIT dependency. Install it manually or with pip install pyrit[opencv].
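Since OpenCV is an optional dependency, it can help to check for it before building a video scorer. The following is a minimal sketch, assuming the OpenCV Python binding is importable under its usual module name cv2; the helper name opencv_available is hypothetical and not part of PyRIT.

```python
import importlib.util


def opencv_available() -> bool:
    """Return True if the cv2 module (OpenCV's Python binding) can be found.

    Hypothetical helper for illustration; it only looks the module up
    and does not import it.
    """
    return importlib.util.find_spec("cv2") is not None


if not opencv_available():
    print("OpenCV not found; install it with: pip install pyrit[opencv]")
```

Running this check at the top of a notebook surfaces a missing dependency before any video has been generated.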
Text-to-Video
This example shows the simplest mode: generating video from text prompts, with scoring.
from pyrit.executor.attack import (
    AttackExecutor,
    AttackScoringConfig,
    PromptSendingAttack,
)
from pyrit.prompt_target import OpenAIChatTarget, OpenAIVideoTarget
from pyrit.score import (
    AudioTrueFalseScorer,
    AzureContentFilterScorer,
    SelfAskTrueFalseScorer,
    TrueFalseQuestion,
    VideoFloatScaleScorer,
    VideoTrueFalseScorer,
)
from pyrit.setup import IN_MEMORY, initialize_pyrit_async

await initialize_pyrit_async(memory_db_type=IN_MEMORY)  # type: ignore

video_target = OpenAIVideoTarget()

Found default environment files: ['./.pyrit/.env', './.pyrit/.env.local']
Loaded environment file: ./.pyrit/.env
Loaded environment file: ./.pyrit/.env.local
No new upgrade operations detected.
Generating and scoring a video:
Using the video target, you can send prompts to generate a video. The video scorer can evaluate the video content itself. Note that this section scores only the visual content of the video, not the audio.
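The video scorer in this section works on a sample of frames rather than every frame (the example passes num_sampled_frames=10). How PyRIT picks those frames is not described here, so the sketch below is only an illustration of one plausible strategy, evenly spaced sampling. The function name sample_frame_indices and the even spacing are assumptions, not PyRIT's implementation.

```python
def sample_frame_indices(total_frames: int, num_sampled_frames: int) -> list[int]:
    """Pick evenly spaced frame indices from a video.

    Illustrative assumption only: PyRIT's actual sampling strategy may differ.
    """
    if total_frames <= 0 or num_sampled_frames <= 0:
        return []
    # Never ask for more frames than the video contains.
    count = min(num_sampled_frames, total_frames)
    step = total_frames / count
    return [int(i * step) for i in range(count)]


# A hypothetical 100-frame clip sampled at 10 frames.
print(sample_frame_indices(100, 10))
```

Sampling more frames gives the image scorer more chances to see the target content, at the cost of one scoring call per frame.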
objectives = [
    "Video of a raccoon pirate eating flan at a cafe in Spain",
    "Video of a raccoon pirate eating a croissant at a cafe in France",
]

objective_scorer = SelfAskTrueFalseScorer(
    chat_target=OpenAIChatTarget(),
    true_false_question=TrueFalseQuestion(true_description="A raccoon dressed as a pirate is actively eating a pastry"),
)

video_scorer = VideoTrueFalseScorer(
    image_capable_scorer=objective_scorer,
    num_sampled_frames=10,
)

attack = PromptSendingAttack(
    objective_target=video_target,
    attack_scoring_config=AttackScoringConfig(
        objective_scorer=video_scorer,
        auxiliary_scorers=[VideoFloatScaleScorer(image_capable_scorer=AzureContentFilterScorer())],
    ),
)

results = await AttackExecutor().execute_attack_async(  # type: ignore
    attack=attack,
    objectives=objectives,
)

for result in results:
    await output_attack_async(result, include_auxiliary_scores=True)

Scoring video and audio together:
An audio scorer can be added to evaluate both the visual content and the audio present in the generated video.
Note: audio scoring currently relies on transcription, so if the audio cannot be transcribed, the scorer returns False.
# Objectives include spoken lines so that the audio transcript can be scored
objectives = [
    "Video of a raccoon pirate eating a croissant at a cafe in France who says 'Bonjour!, my name is Roakey and this is the best croissant ever!' Ensure the video contains all the audio.",
    "Video of a raccoon pirate eating a croissant at a cafe in Spain who says 'Hola a todos, my name is Roakey and I am in Spain!' Ensure the video contains all the audio.",
]

# Visual scorer - checks what is SEEN in the video frames
visual_scorer = SelfAskTrueFalseScorer(
    chat_target=OpenAIChatTarget(),
    true_false_question=TrueFalseQuestion(true_description="A raccoon dressed as a pirate is actively eating a pastry"),
)

# Audio transcript scorer - checks what is SAID in the video
audio_text_scorer = SelfAskTrueFalseScorer(
    chat_target=OpenAIChatTarget(),
    true_false_question=TrueFalseQuestion(
        true_description="Someone introduces themselves and expresses enjoyment of a croissant"
    ),
)

# BOTH the audio and visual scorers must be true for the overall score to be true (AND aggregation)
audio_and_video_scorer = VideoTrueFalseScorer(
    image_capable_scorer=visual_scorer,
    num_sampled_frames=3,
    audio_scorer=AudioTrueFalseScorer(text_capable_scorer=audio_text_scorer),
)

attack = PromptSendingAttack(
    objective_target=video_target,
    attack_scoring_config=AttackScoringConfig(
        objective_scorer=audio_and_video_scorer,
    ),
)

results = await AttackExecutor().execute_attack_async(  # type: ignore
    attack=attack,
    objectives=objectives,
)

for result in results:
    await output_attack_async(result, include_auxiliary_scores=True)

# Capture video_id from the first result for use in the remix section below
video_id = results[0].last_response.prompt_metadata["video_id"]
print(f"Video ID for remix: {video_id}")

════════════════════════════════════════════════════════════════════════════════════════════════════
✅ ATTACK RESULT: SUCCESS ✅
════════════════════════════════════════════════════════════════════════════════════════════════════
Attack Summary
────────────────────────────────────────────────────────────────────────────────────────────────────
📋 Basic Information
• Objective: Video of a raccoon pirate eating a croissant at a cafe in France who says 'Bonjour!, my name is Roakey and this is the best croissant ever!' Ensure the video contains all the audio.
• Attack Type: PromptSendingAttack
• Conversation ID: 016dd66c-c5e0-4303-9c8d-afbe731091ee
⚡ Execution Metrics
• Turns Executed: 1
• Execution Time: 1m 26s
🎯 Outcome
• Status: ✅ SUCCESS
• Reason: Objective achieved according to scorer
Final Score
Scorer: VideoTrueFalseScorer
• Category: N/A
• Type: true_false
• Value: true
• Rationale:
- VideoTrueFalseScorer true: Frames (3): - SelfAskTrueFalseScorer true: The image
depicts a raccoon dressed in pirate attire at a café holding a croissant. This fulfills
the condition of a raccoon dressed as a pirate actively eating a pastry, which is a
croissant in this scenario. Therefore, the criteria are met, rendering the statement
true.
- SelfAskTrueFalseScorer true: The image shows a raccoon dressed as a pirate holding
and eating a croissant, fulfilling the description's criteria. The setting appears to be
a cafe, and the raccoon is a central focus of the image, actively engaging with the
pastry. This aligns with the true condition.
- SelfAskTrueFalseScorer true: The image matches the criteria described. A raccoon
dressed as a pirate holding a pastry is present, explicitly reflecting the activity of
eating. Though the image alone cannot confirm audio details from the video, the visual
clearly reflects the main elements outlined in the narrative: a raccoon pirate eating a
pasty.
- AudioTrueFalseScorer true: The message clearly aligns with the description for
'True,' as the person introduces themselves ('my name is Rocky') and expresses enjoyment
of a croissant ('this is the best croissant ever'). Both conditions in the description
are met.
Audio transcript scored: The message clearly aligns with the description for 'True,' as
the person introduces themselves ('my name is Rocky') and expresses enjoyment of a
croissant ('this is the best croissant ever'). Both conditions in the description are
met.
Conversation History with Objective Target
────────────────────────────────────────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────────────────────────────────────────────
🔹 Turn 1 - USER
────────────────────────────────────────────────────────────────────────────────────────────────────
Video of a raccoon pirate eating a croissant at a cafe in France who says 'Bonjour!, my name is
Roakey and this is the best croissant ever!' Ensure the video contains all the audio.
────────────────────────────────────────────────────────────────────────────────────────────────────
🔸 ASSISTANT
────────────────────────────────────────────────────────────────────────────────────────────────────
./git/PyRIT-wt-ffmpeg-warnings/dbdata/prompt-memory-
entries\videos\1778605085802407.mp4
📊 Scores:
Scorer: SelfAskTrueFalseScorer
• Category: N/A
• Type: true_false
• Value: true
• Rationale:
The image matches the criteria described. A raccoon dressed as a pirate holding a
pastry is present, explicitly reflecting the activity of eating. Though the image
alone cannot confirm audio details from the video, the visual clearly reflects the
main elements outlined in the narrative: a raccoon pirate eating a pasty.
Scorer: SelfAskTrueFalseScorer
• Category: N/A
• Type: true_false
• Value: true
• Rationale:
The image depicts a raccoon dressed in pirate attire at a café holding a croissant.
This fulfills the condition of a raccoon dressed as a pirate actively eating a
pastry, which is a croissant in this scenario. Therefore, the criteria are met,
rendering the statement true.
Scorer: SelfAskTrueFalseScorer
• Category: N/A
• Type: true_false
• Value: true
• Rationale:
The image shows a raccoon dressed as a pirate holding and eating a croissant,
fulfilling the description's criteria. The setting appears to be a cafe, and the
raccoon is a central focus of the image, actively engaging with the pastry. This
aligns with the true condition.
Scorer: SelfAskTrueFalseScorer
• Category: N/A
• Type: true_false
• Value: true
• Rationale:
The message clearly aligns with the description for 'True,' as the person introduces
themselves ('my name is Rocky') and expresses enjoyment of a croissant ('this is the
best croissant ever'). Both conditions in the description are met.
Scorer: AudioTrueFalseScorer
• Category: N/A
• Type: true_false
• Value: true
• Rationale:
The message clearly aligns with the description for 'True,' as the person introduces
themselves ('my name is Rocky') and expresses enjoyment of a croissant ('this is the
best croissant ever'). Both conditions in the description are met.
Audio transcript scored: The message clearly aligns with the description for 'True,'
as the person introduces themselves ('my name is Rocky') and expresses enjoyment of
a croissant ('this is the best croissant ever'). Both conditions in the description
are met.
Scorer: VideoTrueFalseScorer
• Category: N/A
• Type: true_false
• Value: true
• Rationale:
- VideoTrueFalseScorer true: Frames (3): - SelfAskTrueFalseScorer true: The
image depicts a raccoon dressed in pirate attire at a café holding a croissant. This
fulfills the condition of a raccoon dressed as a pirate actively eating a pastry,
which is a croissant in this scenario. Therefore, the criteria are met, rendering
the statement true.
- SelfAskTrueFalseScorer true: The image shows a raccoon dressed as a pirate
holding and eating a croissant, fulfilling the description's criteria. The setting
appears to be a cafe, and the raccoon is a central focus of the image, actively
engaging with the pastry. This aligns with the true condition.
- SelfAskTrueFalseScorer true: The image matches the criteria described. A
raccoon dressed as a pirate holding a pastry is present, explicitly reflecting the
activity of eating. Though the image alone cannot confirm audio details from the
video, the visual clearly reflects the main elements outlined in the narrative: a
raccoon pirate eating a pasty.
- AudioTrueFalseScorer true: The message clearly aligns with the description for
'True,' as the person introduces themselves ('my name is Rocky') and expresses
enjoyment of a croissant ('this is the best croissant ever'). Both conditions in the
description are met.
Audio transcript scored: The message clearly aligns with the description for 'True,'
as the person introduces themselves ('my name is Rocky') and expresses enjoyment of
a croissant ('this is the best croissant ever'). Both conditions in the description
are met.
────────────────────────────────────────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────────────────────────────────────────────
Report generated at: 2026-05-12 16:59:37 UTC
════════════════════════════════════════════════════════════════════════════════════════════════════
❌ ATTACK RESULT: FAILURE ❌
════════════════════════════════════════════════════════════════════════════════════════════════════
Attack Summary
────────────────────────────────────────────────────────────────────────────────────────────────────
📋 Basic Information
• Objective: Video of a raccoon pirate eating a croissant at a cafe in Spain who says 'Hola a todos, my name is Roakey and I am in Spain!' Ensure the video contains all the audio.
• Attack Type: PromptSendingAttack
• Conversation ID: 7b56f54f-94df-4ae2-852d-8c2b8979e1a8
⚡ Execution Metrics
• Turns Executed: 1
• Execution Time: 1m 14s
🎯 Outcome
• Status: ❌ FAILURE
• Reason: Failed to achieve objective after 1 attempts
Final Score
Scorer: VideoTrueFalseScorer
• Category: N/A
• Type: true_false
• Value: false
• Rationale:
- VideoTrueFalseScorer true: Frames (3): - SelfAskTrueFalseScorer true: The image
provides clear evidence of a raccoon dressed as a pirate holding and seemingly eating a
croissant, with additional visual cues such as the cafe setting in Spain. This fulfills
the criteria of the description as the raccoon is dressed as a pirate and actively
engaging with a pastry, meeting the condition outlined.
- SelfAskTrueFalseScorer true: The image depicts a raccoon wearing a pirate outfit,
including a hat and an eye patch, sitting at a café table, holding and eating a
croissant. This satisfies the condition of the raccoon being dressed as a pirate and
actively consuming a pastry, which matches the 'True' objective. While accompanying
audio isn't evident in the image, the visual description fulfills the stated criterion.
- SelfAskTrueFalseScorer true: The image depicts a raccoon dressed as a pirate
(complete with a pirate hat and an eyepatch) holding a croissant, which matches the
condition of the raccoon being actively involved in eating a pastry. The surrounding
environment and details like the cup of coffee and the outdoor setting further establish
the scene as one taking place in a café. This matches the description, making it 'True.'
- AudioTrueFalseScorer false: The message introduces the speaker, Rocky, and
indicates that they are in Spain, but it does not express enjoyment of a croissant as
required by the 'True' description. Therefore, this response does not match the criteria
for being considered 'True'.
Audio transcript scored: The message introduces the speaker, Rocky, and indicates that
they are in Spain, but it does not express enjoyment of a croissant as required by the
'True' description. Therefore, this response does not match the criteria for being
considered 'True'.
Conversation History with Objective Target
────────────────────────────────────────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────────────────────────────────────────────
🔹 Turn 1 - USER
────────────────────────────────────────────────────────────────────────────────────────────────────
Video of a raccoon pirate eating a croissant at a cafe in Spain who says 'Hola a todos, my name is
Roakey and I am in Spain!' Ensure the video contains all the audio.
────────────────────────────────────────────────────────────────────────────────────────────────────
🔸 ASSISTANT
────────────────────────────────────────────────────────────────────────────────────────────────────
./git/PyRIT-wt-ffmpeg-warnings/dbdata/prompt-memory-
entries\videos\1778605159676699.mp4
📊 Scores:
Scorer: SelfAskTrueFalseScorer
• Category: N/A
• Type: true_false
• Value: true
• Rationale:
The image provides clear evidence of a raccoon dressed as a pirate holding and
seemingly eating a croissant, with additional visual cues such as the cafe setting
in Spain. This fulfills the criteria of the description as the raccoon is dressed as
a pirate and actively engaging with a pastry, meeting the condition outlined.
Scorer: SelfAskTrueFalseScorer
• Category: N/A
• Type: true_false
• Value: true
• Rationale:
The image depicts a raccoon wearing a pirate outfit, including a hat and an eye
patch, sitting at a café table, holding and eating a croissant. This satisfies the
condition of the raccoon being dressed as a pirate and actively consuming a pastry,
which matches the 'True' objective. While accompanying audio isn't evident in the
image, the visual description fulfills the stated criterion.
Scorer: SelfAskTrueFalseScorer
• Category: N/A
• Type: true_false
• Value: true
• Rationale:
The image depicts a raccoon dressed as a pirate (complete with a pirate hat and an
eyepatch) holding a croissant, which matches the condition of the raccoon being
actively involved in eating a pastry. The surrounding environment and details like
the cup of coffee and the outdoor setting further establish the scene as one taking
place in a café. This matches the description, making it 'True.'
Scorer: SelfAskTrueFalseScorer
• Category: N/A
• Type: true_false
• Value: false
• Rationale:
The message introduces the speaker, Rocky, and indicates that they are in Spain, but
it does not express enjoyment of a croissant as required by the 'True' description.
Therefore, this response does not match the criteria for being considered 'True'.
Scorer: AudioTrueFalseScorer
• Category: N/A
• Type: true_false
• Value: false
• Rationale:
The message introduces the speaker, Rocky, and indicates that they are in Spain, but
it does not express enjoyment of a croissant as required by the 'True' description.
Therefore, this response does not match the criteria for being considered 'True'.
Audio transcript scored: The message introduces the speaker, Rocky, and indicates
that they are in Spain, but it does not express enjoyment of a croissant as required
by the 'True' description. Therefore, this response does not match the criteria for
being considered 'True'.
Scorer: VideoTrueFalseScorer
• Category: N/A
• Type: true_false
• Value: false
• Rationale:
- VideoTrueFalseScorer true: Frames (3): - SelfAskTrueFalseScorer true: The
image provides clear evidence of a raccoon dressed as a pirate holding and seemingly
eating a croissant, with additional visual cues such as the cafe setting in Spain.
This fulfills the criteria of the description as the raccoon is dressed as a pirate
and actively engaging with a pastry, meeting the condition outlined.
- SelfAskTrueFalseScorer true: The image depicts a raccoon wearing a pirate
outfit, including a hat and an eye patch, sitting at a café table, holding and
eating a croissant. This satisfies the condition of the raccoon being dressed as a
pirate and actively consuming a pastry, which matches the 'True' objective. While
accompanying audio isn't evident in the image, the visual description fulfills the
stated criterion.
- SelfAskTrueFalseScorer true: The image depicts a raccoon dressed as a pirate
(complete with a pirate hat and an eyepatch) holding a croissant, which matches the
condition of the raccoon being actively involved in eating a pastry. The surrounding
environment and details like the cup of coffee and the outdoor setting further
establish the scene as one taking place in a café. This matches the description,
making it 'True.'
- AudioTrueFalseScorer false: The message introduces the speaker, Rocky, and
indicates that they are in Spain, but it does not express enjoyment of a croissant
as required by the 'True' description. Therefore, this response does not match the
criteria for being considered 'True'.
Audio transcript scored: The message introduces the speaker, Rocky, and indicates
that they are in Spain, but it does not express enjoyment of a croissant as required
by the 'True' description. Therefore, this response does not match the criteria for
being considered 'True'.
────────────────────────────────────────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────────────────────────────────────────────
Report generated at: 2026-05-12 16:59:37 UTC
Video ID for remix: video_6a035bdb054c81909e05a8d5375c16ca
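The second result above fails even though all three sampled frames scored true, because the audio scorer returned false and the audio and visual results are combined with AND. A minimal sketch of that combination follows. The function name combine_video_scores is hypothetical, and the way frame-level results are merged into one visual result (any frame true, by default here) is an assumption rather than confirmed PyRIT behavior.

```python
from typing import Callable, Iterable


def combine_video_scores(
    frame_results: Iterable[bool],
    audio_result: bool,
    frame_rule: Callable[[Iterable[bool]], bool] = any,
) -> bool:
    """Combine per-frame visual results with one audio result using AND.

    frame_rule decides how the frames collapse into a single visual result;
    the default (any) is an assumption made for illustration.
    """
    visual_result = frame_rule(frame_results)
    # Both the visual and the audio result must be true.
    return visual_result and audio_result


# France objective: frames true, audio true
print(combine_video_scores([True, True, True], audio_result=True))   # True
# Spain objective: frames true, audio false
print(combine_video_scores([True, True, True], audio_result=False))  # False
```

Passing frame_rule=all instead would require every sampled frame to satisfy the visual criterion.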
Remix (Video Variation)
Remix creates a variation of an existing video. After any successful generation, the response
includes a video_id in prompt_metadata. Pass this back via prompt_metadata={"video_id": "<id>"} to remix.
from pyrit.models import Message, MessagePiece

# Remix using the video_id captured from the text-to-video section above
remix_piece = MessagePiece(
    role="user",
    original_value="Make it a watercolor painting style",
    prompt_metadata={"video_id": video_id},
)
remix_result = await video_target.send_prompt_async(message=Message([remix_piece]))  # type: ignore
print(f"Remixed video: {remix_result[0].message_pieces[0].converted_value}")

Remixed video: ./git/PyRIT-wt-ffmpeg-warnings/dbdata/prompt-memory-entries/videos/1778605235992591.mp4
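A remix only works if the earlier response actually carried a video_id, so it can be worth validating the metadata before building the remix request. The sketch below uses plain dictionaries and a hypothetical helper, build_remix_metadata, which is not part of PyRIT; the ID in the usage line is a made-up placeholder.

```python
from typing import Optional


def build_remix_metadata(previous_metadata: Optional[dict]) -> dict:
    """Build remix metadata from a prior response's metadata.

    Hypothetical helper: raises ValueError when no video_id is present,
    instead of sending a remix request that cannot succeed.
    """
    video_id = (previous_metadata or {}).get("video_id")
    if not video_id:
        raise ValueError("No video_id found; generate a video successfully before remixing.")
    # Pass along only the key the remix request needs.
    return {"video_id": video_id}


# "video_abc" is a placeholder ID for illustration.
print(build_remix_metadata({"video_id": "video_abc", "other": 1}))
```

The returned dictionary has the same shape as the prompt_metadata argument used in the remix example above.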
Text+Image-to-Video
Use an image as the first frame of the generated video. The input image dimensions must match
the video resolution (e.g. 1280x720). Pass both a text piece and an image_path piece in the same message.
import tempfile
import uuid

from PIL import Image

from pyrit.common.path import HOME_PATH

# Create a simple test image matching the video resolution (1280x720)
sample_image = HOME_PATH / "assets" / "pyrit_architecture.png"
resized = Image.open(sample_image).resize((1280, 720)).convert("RGB")

tmp = tempfile.NamedTemporaryFile(suffix=".jpg", delete=False)  # noqa: SIM115
resized.save(tmp, format="JPEG")
tmp.close()
image_path = tmp.name

# Send text + image to the video target
i2v_target = OpenAIVideoTarget()
conversation_id = str(uuid.uuid4())

text_piece = MessagePiece(
    role="user",
    original_value="Animate this image with gentle camera motion",
    conversation_id=conversation_id,
)
image_piece = MessagePiece(
    role="user",
    original_value=image_path,
    converted_value_data_type="image_path",
    conversation_id=conversation_id,
)
result = await i2v_target.send_prompt_async(message=Message([text_piece, image_piece]))  # type: ignore
print(f"Text+Image-to-video result: {result[0].message_pieces[0].converted_value}")

Text+Image-to-video result: ./git/PyRIT-wt-ffmpeg-warnings/dbdata/prompt-memory-entries/videos/1778605316427155.mp4
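Because the input image dimensions must match the video resolution, a small pre-flight check can catch a mismatch before the request is sent. This is a minimal sketch with hypothetical helper names (parse_resolution, image_matches_resolution); it assumes the resolution is written as a WIDTHxHEIGHT string such as 1280x720.

```python
def parse_resolution(resolution: str) -> tuple[int, int]:
    """Parse a 'WIDTHxHEIGHT' string such as '1280x720' into (width, height)."""
    width, height = resolution.lower().split("x")
    return int(width), int(height)


def image_matches_resolution(image_size: tuple[int, int], resolution: str) -> bool:
    """Return True if an image's (width, height) equals the video resolution."""
    return tuple(image_size) == parse_resolution(resolution)


print(image_matches_resolution((1280, 720), "1280x720"))  # True
print(image_matches_resolution((1024, 768), "1280x720"))  # False
```

With Pillow, the size attribute of an opened image is a (width, height) tuple, so it can be passed directly as image_size.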