How Vimeo Implemented AI-Powered Subtitles

Mar 11

In this article, we will look at how the Vimeo engineering team overcame this problem and the decisions it made

8 Comments

“First, separate the creative work from the structural work. Asking one LLM call to be both brilliant and obedient means optimizing for competing goals, and research suggests that makes it worse at both”

this is great advice even for more general use cases. aligning llms with different areas in vector space really has a performance effect. great read :)

Hozefa

Mar 13

At Loom, we also built AI-powered subtitles (and captions). Below is the thinking behind our prompt:

- Along with preserving the content and tone, we asked the model to translate every caption block and keep the number of blocks unchanged: Ensure that every single input caption block is translated in the output. The output must contain the exact same number of caption blocks as the input, from the first to the very last (including the final block, regardless of length). Do not stop early.

- We also asked it to maintain the timing and conciseness of the original captions, and passed in the original-language captions so that the LLM could translate each caption phrase.

Strive to match the original caption's length (word count/character count) closely for on-screen readability and synchronization with dialogue. Use concise phrasing, common synonyms, and efficient sentence structures.

- To prevent timing sync problems, we asked the LLM to preserve the timestamps of the original captions: All timestamps must match the input captions exactly, with none missing.

- We wanted translations to read easily and naturally in the new language, and specifically did not want a word-for-word translation that would be hard to understand, since grammar and syntax differ between languages: Use vocabulary and current slang widely understood by native speakers of the target language. Avoid jargon, archaic terms, or overly formal language unless explicitly required by the source material's register.

- If there were sections that did not make sense when translated, we asked it to keep the word or phrase from the original transcript: If a phrase or part of a phrase cannot be translated clearly and naturally into the target language while preserving context and intent, keep the original language text from the input.

- What helped were the examples we provided to show that we wanted timestamps and context preserved.
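The constraints above (same number of blocks, identical timestamps, original text as a fallback) can also be checked after the model responds rather than trusted to the prompt alone. Below is a minimal sketch, assuming each caption is a start time, an end time, and a text string. The `Caption` type and the `validate_translation` and `repair` helpers are hypothetical names for illustration, not Loom's or Vimeo's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caption:
    start: str  # timestamp as it appears in the caption file
    end: str
    text: str


def validate_translation(source, translated):
    """Return a list of problems found; an empty list means the output passed."""
    problems = []
    if len(translated) != len(source):
        problems.append(f"expected {len(source)} blocks, got {len(translated)}")
    # zip stops at the shorter list, so only blocks present in both are compared
    for i, (src, out) in enumerate(zip(source, translated)):
        if (src.start, src.end) != (out.start, out.end):
            problems.append(f"block {i}: timestamps changed")
        if not out.text.strip():
            problems.append(f"block {i}: empty text")
    return problems


def repair(source, translated):
    """Rebuild the output on the source timing, keeping the original text
    wherever a translated block is missing or empty."""
    repaired = []
    for i, src in enumerate(source):
        text = translated[i].text.strip() if i < len(translated) else ""
        repaired.append(Caption(src.start, src.end, text or src.text))
    return repaired
```

A caller could run `validate_translation` first and only fall back to `repair` (or a retry) when it reports problems. Whether to retry or repair is a product decision this sketch does not make.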

Alicia Wong

Mar 11

I really liked the fallback chain idea. From a user perspective, seeing a repeated subtitle is an understandable error, while seeing a blank could lead the user to think something is broken or skipped. Great example of when to override the pure model output.

Jianan

Mar 22

That’s not Japanese

Jay Stansell

Mar 14 (edited)

the part that interests me is the threshold where teams stop manually verifying the AI output and just ship it. 95% accuracy sounds great until you're the 5% and it's your name on the product

Kevin Choppin

Mar 13

Good article. Another option for the Japanese example would be to simply repeat the same condensed line across slots so it appears for longer. That way it still lines up with the length of the audio.
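The suggestion above amounts to a forward fill over the time slots. A minimal sketch, assuming the translated lines arrive as a list of strings with one entry per slot and an empty string where the condensed translation left a slot unfilled (`fill_empty_slots` is a hypothetical helper name):

```python
def fill_empty_slots(lines):
    """Replace each empty slot with the most recent non-empty line.

    Slots before the first non-empty line have nothing to repeat,
    so they are left empty.
    """
    filled = []
    last = ""
    for line in lines:
        if line.strip():
            last = line
        filled.append(last)
    return filled
```

For example, `fill_empty_slots(["A", "", "", "B", ""])` gives `["A", "A", "A", "B", "B"]`, so each condensed line stays on screen for the full span of the audio it covers.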

Opinion AI

Mar 11

Really useful example. One AI step should not try to do everything at once. Separating good translation from strict formatting feels like the real lesson here.

Sumit Kumar

Mar 11

Nice

ByteByteGo Newsletter
