4. OpenAI Video Target

OpenAIVideoTarget supports three modes:

  • Text-to-video: Generate a video from a text prompt.

  • Remix: Create a variation of an existing video (using video_id from a prior generation).

  • Text+Image-to-video: Use an image as the first frame of the generated video.

Note that the video scorer requires opencv, which is not a default PyRIT dependency. Install it manually or with pip install pyrit[opencv].
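
Before running the scoring examples, it can help to confirm that opencv is importable. The following is a minimal sketch using only the standard library; the helper names `has_module` and `opencv_install_hint` are hypothetical and not part of PyRIT, and the check assumes opencv is exposed under its usual import name `cv2`.

```python
import importlib.util


def has_module(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


def opencv_install_hint() -> str:
    """Describe whether opencv (import name assumed to be cv2) is available."""
    if has_module("cv2"):
        return "opencv is available; video scoring can be used."
    return "opencv is missing; install it with: pip install pyrit[opencv]"


print(opencv_install_hint())
```

Running this check before the scoring cells surfaces a missing dependency up front, rather than after a video has already been generated.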

Text-to-Video

This example shows the simplest mode: generating video from text prompts, with scoring.

from pyrit.executor.attack import (
    AttackExecutor,
    AttackScoringConfig,
    PromptSendingAttack,
)
from pyrit.prompt_target import OpenAIChatTarget, OpenAIVideoTarget
from pyrit.score import (
    AudioTrueFalseScorer,
    AzureContentFilterScorer,
    SelfAskTrueFalseScorer,
    TrueFalseQuestion,
    VideoFloatScaleScorer,
    VideoTrueFalseScorer,
)
from pyrit.setup import IN_MEMORY, initialize_pyrit_async

await initialize_pyrit_async(memory_db_type=IN_MEMORY)  # type: ignore

video_target = OpenAIVideoTarget()
Found default environment files: ['./.pyrit/.env', './.pyrit/.env.local']
Loaded environment file: ./.pyrit/.env
Loaded environment file: ./.pyrit/.env.local
No new upgrade operations detected.

Generating and scoring a video:

Using the video target, you can send prompts to generate a video. The video scorer then evaluates the video content itself. Note that this section scores only the video frames, not the audio.

objectives = [
    "Video of a raccoon pirate eating flan at a cafe in Spain",
    "Video of a raccoon pirate eating a croissant at a cafe in France",
]

objective_scorer = SelfAskTrueFalseScorer(
    chat_target=OpenAIChatTarget(),
    true_false_question=TrueFalseQuestion(true_description="A raccoon dressed as a pirate is actively eating a pastry"),
)

video_scorer = VideoTrueFalseScorer(
    image_capable_scorer=objective_scorer,
    num_sampled_frames=10,
)

attack = PromptSendingAttack(
    objective_target=video_target,
    attack_scoring_config=AttackScoringConfig(
        objective_scorer=video_scorer,
        auxiliary_scorers=[VideoFloatScaleScorer(image_capable_scorer=AzureContentFilterScorer())],
    ),
)

results = await AttackExecutor().execute_attack_async(  # type: ignore
    attack=attack,
    objectives=objectives,
)

for result in results:
    await output_attack_async(result, include_auxiliary_scores=True)
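
The num_sampled_frames argument above controls how many frames are passed to the image-capable scorer. This page does not state how PyRIT picks those frames, so the sketch below simply assumes evenly spaced sampling for illustration; `sample_frame_indices` is a hypothetical helper, not a PyRIT function.

```python
def sample_frame_indices(total_frames: int, num_sampled_frames: int) -> list[int]:
    """Pick evenly spaced frame indices (assumed strategy, for illustration only)."""
    if total_frames <= 0 or num_sampled_frames <= 0:
        return []
    # Never ask for more frames than the video contains
    count = min(total_frames, num_sampled_frames)
    # Split the video into `count` equal segments and take each segment's midpoint
    return [(2 * i + 1) * total_frames // (2 * count) for i in range(count)]


print(sample_frame_indices(100, 10))
print(sample_frame_indices(3, 10))
```

Under this assumption, a larger num_sampled_frames gives the scorer more coverage of the clip at the cost of one extra scoring call per frame.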

Scoring video and audio together:

An audio scorer can be added to evaluate both the video content and the audio present in the generated video.

Note: audio scoring currently relies on transcription, so if the audio cannot be transcribed the audio score is False.

# Objectives that include spoken dialogue, so the audio track can be transcribed and scored
objectives = [
    "Video of a raccoon pirate eating a croissant at a cafe in France who says 'Bonjour!, my name is Roakey and this is the best croissant ever!' Ensure the video contains all the audio.",
    "Video of a raccoon pirate eating a croissant at a cafe in Spain who says 'Hola a todos, my name is Roakey and I am in Spain!' Ensure the video contains all the audio.",
]

# Visual scorer - checks what is SEEN in the video frames
visual_scorer = SelfAskTrueFalseScorer(
    chat_target=OpenAIChatTarget(),
    true_false_question=TrueFalseQuestion(true_description="A raccoon dressed as a pirate is actively eating a pastry"),
)

# Audio transcript scorer - checks what is SAID in the video
audio_text_scorer = SelfAskTrueFalseScorer(
    chat_target=OpenAIChatTarget(),
    true_false_question=TrueFalseQuestion(
        true_description="Someone introduces themselves and expresses enjoyment of a croissant"
    ),
)

# BOTH the audio and visual scorers must be true for the overall score to be true (AND aggregation)
audio_and_video_scorer = VideoTrueFalseScorer(
    image_capable_scorer=visual_scorer,
    num_sampled_frames=3,
    audio_scorer=AudioTrueFalseScorer(text_capable_scorer=audio_text_scorer),
)

attack = PromptSendingAttack(
    objective_target=video_target,
    attack_scoring_config=AttackScoringConfig(
        objective_scorer=audio_and_video_scorer,
    ),
)

results = await AttackExecutor().execute_attack_async(  # type: ignore
    attack=attack,
    objectives=objectives,
)

for result in results:
    await output_attack_async(result, include_auxiliary_scores=True)

# Capture video_id from the first result for use in the remix section below
video_id = results[0].last_response.prompt_metadata["video_id"]
print(f"Video ID for remix: {video_id}")

════════════════════════════════════════════════════════════════════════════════════════════════════
                                     ✅ ATTACK RESULT: SUCCESS ✅                                     
════════════════════════════════════════════════════════════════════════════════════════════════════

 Attack Summary 
────────────────────────────────────────────────────────────────────────────────────────────────────
  📋 Basic Information
    • Objective: Video of a raccoon pirate eating a croissant at a cafe in France who says 'Bonjour!, my name is Roakey and this is the best croissant ever!' Ensure the video contains all the audio.
    • Attack Type: PromptSendingAttack
    • Conversation ID: 016dd66c-c5e0-4303-9c8d-afbe731091ee

  ⚡ Execution Metrics
    • Turns Executed: 1
    • Execution Time: 1m 26s

  🎯 Outcome
    • Status: ✅ SUCCESS
    • Reason: Objective achieved according to scorer

   Final Score
    Scorer: VideoTrueFalseScorer
    • Category: N/A
    • Type: true_false
    • Value: true
    • Rationale:
         - VideoTrueFalseScorer true: Frames (3):    - SelfAskTrueFalseScorer true: The image
      depicts a raccoon dressed in pirate attire at a café holding a croissant. This fulfills
      the condition of a raccoon dressed as a pirate actively eating a pastry, which is a
      croissant in this scenario. Therefore, the criteria are met, rendering the statement
      true.
         - SelfAskTrueFalseScorer true: The image shows a raccoon dressed as a pirate holding
      and eating a croissant, fulfilling the description's criteria. The setting appears to be
      a cafe, and the raccoon is a central focus of the image, actively engaging with the
      pastry. This aligns with the true condition.
         - SelfAskTrueFalseScorer true: The image matches the criteria described. A raccoon
      dressed as a pirate holding a pastry is present, explicitly reflecting the activity of
      eating. Though the image alone cannot confirm audio details from the video, the visual
      clearly reflects the main elements outlined in the narrative: a raccoon pirate eating a
      pasty.
         - AudioTrueFalseScorer true: The message clearly aligns with the description for
      'True,' as the person introduces themselves ('my name is Rocky') and expresses enjoyment
      of a croissant ('this is the best croissant ever'). Both conditions in the description
      are met.
      Audio transcript scored: The message clearly aligns with the description for 'True,' as
      the person introduces themselves ('my name is Rocky') and expresses enjoyment of a
      croissant ('this is the best croissant ever'). Both conditions in the description are
      met.

 Conversation History with Objective Target 
────────────────────────────────────────────────────────────────────────────────────────────────────

────────────────────────────────────────────────────────────────────────────────────────────────────
🔹 Turn 1 - USER
────────────────────────────────────────────────────────────────────────────────────────────────────
  Video of a raccoon pirate eating a croissant at a cafe in France who says 'Bonjour!, my name is
      Roakey and this is the best croissant ever!' Ensure the video contains all the audio.

────────────────────────────────────────────────────────────────────────────────────────────────────
🔸 ASSISTANT
────────────────────────────────────────────────────────────────────────────────────────────────────
  ./git/PyRIT-wt-ffmpeg-warnings/dbdata/prompt-memory-
      entries\videos\1778605085802407.mp4

  📊 Scores:
      Scorer: SelfAskTrueFalseScorer
      • Category: N/A
      • Type: true_false
      • Value: true
      • Rationale:
        The image matches the criteria described. A raccoon dressed as a pirate holding a
        pastry is present, explicitly reflecting the activity of eating. Though the image
        alone cannot confirm audio details from the video, the visual clearly reflects the
        main elements outlined in the narrative: a raccoon pirate eating a pasty.
      Scorer: SelfAskTrueFalseScorer
      • Category: N/A
      • Type: true_false
      • Value: true
      • Rationale:
        The image depicts a raccoon dressed in pirate attire at a café holding a croissant.
        This fulfills the condition of a raccoon dressed as a pirate actively eating a
        pastry, which is a croissant in this scenario. Therefore, the criteria are met,
        rendering the statement true.
      Scorer: SelfAskTrueFalseScorer
      • Category: N/A
      • Type: true_false
      • Value: true
      • Rationale:
        The image shows a raccoon dressed as a pirate holding and eating a croissant,
        fulfilling the description's criteria. The setting appears to be a cafe, and the
        raccoon is a central focus of the image, actively engaging with the pastry. This
        aligns with the true condition.
      Scorer: SelfAskTrueFalseScorer
      • Category: N/A
      • Type: true_false
      • Value: true
      • Rationale:
        The message clearly aligns with the description for 'True,' as the person introduces
        themselves ('my name is Rocky') and expresses enjoyment of a croissant ('this is the
        best croissant ever'). Both conditions in the description are met.
      Scorer: AudioTrueFalseScorer
      • Category: N/A
      • Type: true_false
      • Value: true
      • Rationale:
        The message clearly aligns with the description for 'True,' as the person introduces
        themselves ('my name is Rocky') and expresses enjoyment of a croissant ('this is the
        best croissant ever'). Both conditions in the description are met.
        Audio transcript scored: The message clearly aligns with the description for 'True,'
        as the person introduces themselves ('my name is Rocky') and expresses enjoyment of
        a croissant ('this is the best croissant ever'). Both conditions in the description
        are met.
      Scorer: VideoTrueFalseScorer
      • Category: N/A
      • Type: true_false
      • Value: true
      • Rationale:
           - VideoTrueFalseScorer true: Frames (3):    - SelfAskTrueFalseScorer true: The
        image depicts a raccoon dressed in pirate attire at a café holding a croissant. This
        fulfills the condition of a raccoon dressed as a pirate actively eating a pastry,
        which is a croissant in this scenario. Therefore, the criteria are met, rendering
        the statement true.
           - SelfAskTrueFalseScorer true: The image shows a raccoon dressed as a pirate
        holding and eating a croissant, fulfilling the description's criteria. The setting
        appears to be a cafe, and the raccoon is a central focus of the image, actively
        engaging with the pastry. This aligns with the true condition.
           - SelfAskTrueFalseScorer true: The image matches the criteria described. A
        raccoon dressed as a pirate holding a pastry is present, explicitly reflecting the
        activity of eating. Though the image alone cannot confirm audio details from the
        video, the visual clearly reflects the main elements outlined in the narrative: a
        raccoon pirate eating a pasty.
           - AudioTrueFalseScorer true: The message clearly aligns with the description for
        'True,' as the person introduces themselves ('my name is Rocky') and expresses
        enjoyment of a croissant ('this is the best croissant ever'). Both conditions in the
        description are met.
        Audio transcript scored: The message clearly aligns with the description for 'True,'
        as the person introduces themselves ('my name is Rocky') and expresses enjoyment of
        a croissant ('this is the best croissant ever'). Both conditions in the description
        are met.

────────────────────────────────────────────────────────────────────────────────────────────────────

────────────────────────────────────────────────────────────────────────────────────────────────────
                            Report generated at: 2026-05-12 16:59:37 UTC                            

════════════════════════════════════════════════════════════════════════════════════════════════════
                                     ❌ ATTACK RESULT: FAILURE ❌                                     
════════════════════════════════════════════════════════════════════════════════════════════════════

 Attack Summary 
────────────────────────────────────────────────────────────────────────────────────────────────────
  📋 Basic Information
    • Objective: Video of a raccoon pirate eating a croissant at a cafe in Spain who says 'Hola a todos, my name is Roakey and I am in Spain!' Ensure the video contains all the audio.
    • Attack Type: PromptSendingAttack
    • Conversation ID: 7b56f54f-94df-4ae2-852d-8c2b8979e1a8

  ⚡ Execution Metrics
    • Turns Executed: 1
    • Execution Time: 1m 14s

  🎯 Outcome
    • Status: ❌ FAILURE
    • Reason: Failed to achieve objective after 1 attempts

   Final Score
    Scorer: VideoTrueFalseScorer
    • Category: N/A
    • Type: true_false
    • Value: false
    • Rationale:
         - VideoTrueFalseScorer true: Frames (3):    - SelfAskTrueFalseScorer true: The image
      provides clear evidence of a raccoon dressed as a pirate holding and seemingly eating a
      croissant, with additional visual cues such as the cafe setting in Spain. This fulfills
      the criteria of the description as the raccoon is dressed as a pirate and actively
      engaging with a pastry, meeting the condition outlined.
         - SelfAskTrueFalseScorer true: The image depicts a raccoon wearing a pirate outfit,
      including a hat and an eye patch, sitting at a café table, holding and eating a
      croissant. This satisfies the condition of the raccoon being dressed as a pirate and
      actively consuming a pastry, which matches the 'True' objective. While accompanying
      audio isn't evident in the image, the visual description fulfills the stated criterion.
         - SelfAskTrueFalseScorer true: The image depicts a raccoon dressed as a pirate
      (complete with a pirate hat and an eyepatch) holding a croissant, which matches the
      condition of the raccoon being actively involved in eating a pastry. The surrounding
      environment and details like the cup of coffee and the outdoor setting further establish
      the scene as one taking place in a café. This matches the description, making it 'True.'
         - AudioTrueFalseScorer false: The message introduces the speaker, Rocky, and
      indicates that they are in Spain, but it does not express enjoyment of a croissant as
      required by the 'True' description. Therefore, this response does not match the criteria
      for being considered 'True'.
      Audio transcript scored: The message introduces the speaker, Rocky, and indicates that
      they are in Spain, but it does not express enjoyment of a croissant as required by the
      'True' description. Therefore, this response does not match the criteria for being
      considered 'True'.

 Conversation History with Objective Target 
────────────────────────────────────────────────────────────────────────────────────────────────────

────────────────────────────────────────────────────────────────────────────────────────────────────
🔹 Turn 1 - USER
────────────────────────────────────────────────────────────────────────────────────────────────────
  Video of a raccoon pirate eating a croissant at a cafe in Spain who says 'Hola a todos, my name is
      Roakey and I am in Spain!' Ensure the video contains all the audio.

────────────────────────────────────────────────────────────────────────────────────────────────────
🔸 ASSISTANT
────────────────────────────────────────────────────────────────────────────────────────────────────
  ./git/PyRIT-wt-ffmpeg-warnings/dbdata/prompt-memory-
      entries\videos\1778605159676699.mp4

  📊 Scores:
      Scorer: SelfAskTrueFalseScorer
      • Category: N/A
      • Type: true_false
      • Value: true
      • Rationale:
        The image provides clear evidence of a raccoon dressed as a pirate holding and
        seemingly eating a croissant, with additional visual cues such as the cafe setting
        in Spain. This fulfills the criteria of the description as the raccoon is dressed as
        a pirate and actively engaging with a pastry, meeting the condition outlined.
      Scorer: SelfAskTrueFalseScorer
      • Category: N/A
      • Type: true_false
      • Value: true
      • Rationale:
        The image depicts a raccoon wearing a pirate outfit, including a hat and an eye
        patch, sitting at a café table, holding and eating a croissant. This satisfies the
        condition of the raccoon being dressed as a pirate and actively consuming a pastry,
        which matches the 'True' objective. While accompanying audio isn't evident in the
        image, the visual description fulfills the stated criterion.
      Scorer: SelfAskTrueFalseScorer
      • Category: N/A
      • Type: true_false
      • Value: true
      • Rationale:
        The image depicts a raccoon dressed as a pirate (complete with a pirate hat and an
        eyepatch) holding a croissant, which matches the condition of the raccoon being
        actively involved in eating a pastry. The surrounding environment and details like
        the cup of coffee and the outdoor setting further establish the scene as one taking
        place in a café. This matches the description, making it 'True.'
      Scorer: SelfAskTrueFalseScorer
      • Category: N/A
      • Type: true_false
      • Value: false
      • Rationale:
        The message introduces the speaker, Rocky, and indicates that they are in Spain, but
        it does not express enjoyment of a croissant as required by the 'True' description.
        Therefore, this response does not match the criteria for being considered 'True'.
      Scorer: AudioTrueFalseScorer
      • Category: N/A
      • Type: true_false
      • Value: false
      • Rationale:
        The message introduces the speaker, Rocky, and indicates that they are in Spain, but
        it does not express enjoyment of a croissant as required by the 'True' description.
        Therefore, this response does not match the criteria for being considered 'True'.
        Audio transcript scored: The message introduces the speaker, Rocky, and indicates
        that they are in Spain, but it does not express enjoyment of a croissant as required
        by the 'True' description. Therefore, this response does not match the criteria for
        being considered 'True'.
      Scorer: VideoTrueFalseScorer
      • Category: N/A
      • Type: true_false
      • Value: false
      • Rationale:
           - VideoTrueFalseScorer true: Frames (3):    - SelfAskTrueFalseScorer true: The
        image provides clear evidence of a raccoon dressed as a pirate holding and seemingly
        eating a croissant, with additional visual cues such as the cafe setting in Spain.
        This fulfills the criteria of the description as the raccoon is dressed as a pirate
        and actively engaging with a pastry, meeting the condition outlined.
           - SelfAskTrueFalseScorer true: The image depicts a raccoon wearing a pirate
        outfit, including a hat and an eye patch, sitting at a café table, holding and
        eating a croissant. This satisfies the condition of the raccoon being dressed as a
        pirate and actively consuming a pastry, which matches the 'True' objective. While
        accompanying audio isn't evident in the image, the visual description fulfills the
        stated criterion.
           - SelfAskTrueFalseScorer true: The image depicts a raccoon dressed as a pirate
        (complete with a pirate hat and an eyepatch) holding a croissant, which matches the
        condition of the raccoon being actively involved in eating a pastry. The surrounding
        environment and details like the cup of coffee and the outdoor setting further
        establish the scene as one taking place in a café. This matches the description,
        making it 'True.'
           - AudioTrueFalseScorer false: The message introduces the speaker, Rocky, and
        indicates that they are in Spain, but it does not express enjoyment of a croissant
        as required by the 'True' description. Therefore, this response does not match the
        criteria for being considered 'True'.
        Audio transcript scored: The message introduces the speaker, Rocky, and indicates
        that they are in Spain, but it does not express enjoyment of a croissant as required
        by the 'True' description. Therefore, this response does not match the criteria for
        being considered 'True'.

────────────────────────────────────────────────────────────────────────────────────────────────────

────────────────────────────────────────────────────────────────────────────────────────────────────
                            Report generated at: 2026-05-12 16:59:37 UTC                            
Video ID for remix: video_6a035bdb054c81909e05a8d5375c16ca
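
The second result above shows the AND aggregation in action: all three sampled frames scored true, but the audio transcript scored false, so the overall score is false. The sketch below reproduces that logic in plain Python. `combine_video_scores` is a hypothetical helper, not PyRIT's implementation, and the way frame results are combined (here `any` by default) is an assumption made for illustration.

```python
from typing import Callable, Iterable, Optional


def combine_video_scores(
    frame_results: Iterable[bool],
    audio_result: Optional[bool] = None,
    frame_aggregator: Callable[[Iterable[bool]], bool] = any,
) -> bool:
    """Combine per-frame results with an optional audio result using AND."""
    # Collapse the per-frame results into a single visual verdict (assumed strategy)
    visual_result = frame_aggregator(list(frame_results))
    if audio_result is None:
        # No audio scorer configured: the visual verdict is the final verdict
        return visual_result
    # Both the visual and the audio verdicts must be true
    return visual_result and audio_result


print(combine_video_scores([True, True, True], audio_result=True))   # True
print(combine_video_scores([True, True, True], audio_result=False))  # False
```

The two calls mirror the two reports above: the France objective passes both checks, while the Spain objective fails because the transcript never expresses enjoyment of a croissant.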

Remix (Video Variation)

Remix creates a variation of an existing video. After any successful generation, the response includes a video_id in prompt_metadata. Pass this back via prompt_metadata={"video_id": "<id>"} to remix.

from pyrit.models import Message, MessagePiece

# Remix using the video_id captured from the text-to-video section above
remix_piece = MessagePiece(
    role="user",
    original_value="Make it a watercolor painting style",
    prompt_metadata={"video_id": video_id},
)
remix_result = await video_target.send_prompt_async(message=Message([remix_piece]))  # type: ignore
print(f"Remixed video: {remix_result[0].message_pieces[0].converted_value}")
Remixed video: ./git/PyRIT-wt-ffmpeg-warnings/dbdata/prompt-memory-entries/videos/1778605235992591.mp4
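
Because a remix depends on a video_id from an earlier response, it may be worth validating the metadata before building the remix request. This is a small sketch on a plain dictionary; `build_remix_metadata` is a hypothetical helper, and it assumes the metadata behaves like a mapping with a string video_id, as in the example above.

```python
from typing import Mapping


def build_remix_metadata(response_metadata: Mapping[str, object]) -> dict[str, str]:
    """Return the prompt_metadata needed for a remix, or raise if video_id is absent."""
    video_id = response_metadata.get("video_id")
    if not isinstance(video_id, str) or not video_id:
        raise ValueError("response metadata has no usable 'video_id'; cannot remix")
    return {"video_id": video_id}


# Example using the video_id printed earlier on this page
metadata = build_remix_metadata({"video_id": "video_6a035bdb054c81909e05a8d5375c16ca"})
print(metadata)
```

The returned dictionary has the same shape as the prompt_metadata passed to MessagePiece in the remix example.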

Text+Image-to-Video

Use an image as the first frame of the generated video. The input image dimensions must match the video resolution (e.g. 1280x720). Pass both a text piece and an image_path piece in the same message.

import tempfile
import uuid

from PIL import Image

from pyrit.common.path import HOME_PATH

# Resize a sample image so it matches the video resolution (1280x720)
sample_image = HOME_PATH / "assets" / "pyrit_architecture.png"
resized = Image.open(sample_image).resize((1280, 720)).convert("RGB")

# Save the resized image to a temporary JPEG file so it can be passed by path
tmp = tempfile.NamedTemporaryFile(suffix=".jpg", delete=False)  # noqa: SIM115
resized.save(tmp, format="JPEG")
tmp.close()
image_path = tmp.name

# Send text + image to the video target
i2v_target = OpenAIVideoTarget()
conversation_id = str(uuid.uuid4())

text_piece = MessagePiece(
    role="user",
    original_value="Animate this image with gentle camera motion",
    conversation_id=conversation_id,
)
image_piece = MessagePiece(
    role="user",
    original_value=image_path,
    converted_value_data_type="image_path",
    conversation_id=conversation_id,
)
result = await i2v_target.send_prompt_async(message=Message([text_piece, image_piece]))  # type: ignore
print(f"Text+Image-to-video result: {result[0].message_pieces[0].converted_value}")
Text+Image-to-video result: ./git/PyRIT-wt-ffmpeg-warnings/dbdata/prompt-memory-entries/videos/1778605316427155.mp4
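
Since the input image dimensions must match the video resolution, a quick check before sending can catch a mismatch early. The sketch below parses a resolution string such as 1280x720 and compares it to an image size given as (width, height), which is the order Pillow's Image.size uses. Both helper names are hypothetical, and the "WIDTHxHEIGHT" string format is assumed from the example in the text.

```python
def parse_resolution(resolution: str) -> tuple[int, int]:
    """Parse a 'WIDTHxHEIGHT' string such as '1280x720' into (width, height)."""
    width_text, height_text = resolution.lower().split("x")
    return int(width_text), int(height_text)


def image_matches_resolution(image_size: tuple[int, int], resolution: str) -> bool:
    """Return True if an image size (width, height) equals the target resolution."""
    return tuple(image_size) == parse_resolution(resolution)


print(image_matches_resolution((1280, 720), "1280x720"))  # True
print(image_matches_resolution((1024, 768), "1280x720"))  # False
```

In the example above, the call to resize((1280, 720)) is what guarantees this check would pass for the sample image.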