
Video Alt Text

GenAIScript supports speech transcription and video frame extraction which can be combined to analyze videos.


The HTML video element does not support an alt attribute, but you can still attach an accessible description using the aria-label attribute. We will build a script that generates this description from the transcript and video frames.

Transcript

We use the transcribe function to generate the transcript. It uses the transcription model alias to compute the transcription; for OpenAI, this defaults to openai:whisper-1.

Transcriptions are useful to reduce hallucinations of LLMs when analyzing images, and they also provide good timestamp candidates for screenshotting the video stream.

const file = env.files[0]
const transcript = await transcribe(file) // OpenAI whisper
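To illustrate why transcript cues make good screenshot candidates, here is a rough sketch that parses the start time of each SRT cue into seconds. The srtStartTimes helper and the sample SRT text are hypothetical illustrations, not part of the GenAIScript API; GenAIScript performs this mapping internally.

```javascript
// Hypothetical helper: pull candidate screenshot times (in seconds)
// out of an SRT transcript by reading each cue's start timestamp.
function srtStartTimes(srt) {
    const times = []
    for (const m of srt.matchAll(/(\d{2}):(\d{2}):(\d{2}),(\d{3}) -->/g)) {
        const [, h, min, s, ms] = m
        times.push(+h * 3600 + +min * 60 + +s + +ms / 1000)
    }
    return times
}

// Sample SRT with two cues.
const sample = `1
00:00:01,000 --> 00:00:03,500
Hello world

2
00:00:04,200 --> 00:00:06,000
Goodbye`

console.log(srtStartTimes(sample)) // [ 1, 4.2 ]
```

Each cue start marks a moment where speech begins, which tends to coincide with a meaningful scene in the video.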

Video Frames

The next step is to use the transcript to screenshot the video stream. GenAIScript uses ffmpeg to render the frames, so make sure it is installed and configured.

const frames = await ffmpeg.extractFrames(file, {
    transcript,
})

Context

Both the transcript and the frames are added to the prompt context. Since some videos may be silent, we ignore empty transcripts. We also use low detail for the frames to improve performance.

def("TRANSCRIPT", transcript?.srt, { ignoreEmpty: true }) // ignore silent videos
defImages(frames, { detail: "low" }) // low detail for better performance
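To see what ignoreEmpty buys us, here is a hypothetical sketch of the behavior. The defBlock function is an illustrative stand-in, not GenAIScript's actual def implementation: when a silent video yields a blank transcript, the block is simply dropped from the prompt.

```javascript
// Illustrative stand-in for def(...): wraps text in a named block,
// but emits nothing when ignoreEmpty is set and the text is blank.
function defBlock(name, text, { ignoreEmpty = false } = {}) {
    if (ignoreEmpty && (!text || !text.trim())) return ""
    return `<${name}>\n${text}\n</${name}>`
}

console.log(defBlock("TRANSCRIPT", "", { ignoreEmpty: true })) // prints an empty line (block dropped)
console.log(defBlock("TRANSCRIPT", "1\n00:00:01,000 --> 00:00:02,000\nHi"))
```

Dropping empty blocks keeps the prompt from containing a misleading, empty TRANSCRIPT section that the model might try to interpret.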

Prompting it together

Finally, we give the task to the LLM to generate the alt text.

$`You are an expert in assistive technology.
You will analyze the video and generate a description alt text for the video.
`

Using this script, you can automatically generate high-quality alt text for videos.

genaiscript run video-alt-text path_to_video.mp4

Full source

video-alt-text.genai.mjs
script({
    description: "Generate a description alt text for a video",
    accept: ".mp4,.webm",
    system: [
        "system.output_plaintext",
        "system.safety_jailbreak",
        "system.safety_harmful_content",
        "system.safety_validate_harmful_content",
    ],
    files: "src/audio/helloworld.mp4",
    model: "vision",
})
const file = env.files[0]
const transcript = await transcribe(file, { cache: "alt-text" }) // OpenAI whisper
const frames = await ffmpeg.extractFrames(file, {
    transcript,
}) // ffmpeg to extract frames
def("TRANSCRIPT", transcript?.srt, { ignoreEmpty: true }) // ignore silent videos
defImages(frames, { detail: "low" }) // low detail for better performance
$`You are an expert in assistive technology.
You will analyze the video and generate a description alt text for the video.
- The video is included as a set of <FRAMES> images and the <TRANSCRIPT>.
- Do not include alt text in the description.
- Keep it short but descriptive.
- Do not generate the [ character.`