Cue
Subtitle tool that uses AI to generate, edit, style, and burn subtitles into videos.
Outcome: Lowered the cost barrier to creating professional-looking subtitles with an easy-to-use app for independent creators.
Overview
Cue is an open source desktop app for adding AI-generated subtitles to videos in any language. Users can edit subtitles, style them with different visual effects, and export the final video with the subtitles burned in.
Why I built Cue
My wife creates content and teaches online, and she was paying for subtitle tools to subtitle her videos. She wanted something free and easy to use.
OpenAI's Whisper already existed as a free tool for generating subtitles, but it doesn't come with a UI out of the box. Free UI wrappers do exist, but in my experience they aren't simple enough for non-technical users, and most don't support the workflow my wife needed: generate subtitles, fix mistakes, style them, preview the result, and export the video.
At the same time, I was learning how to build with AI agents, so Cue became a good opportunity to practice. I wanted to see whether I could create a tool that was useful and simple, yet carefully designed rather than merely functional.
How I designed and built Cue
I started by researching what the experience needed to be: the ideal workflow for someone like my wife, what tools already existed, where they felt too complex, and what the shortest path to seeing subtitles on screen could look like. After that, I built the basic workflow: select video, transcribe, preview, export.
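The basic workflow above can be sketched roughly in code. This is a minimal illustration, not Cue's actual implementation: it assumes the openai-whisper Python package and an ffmpeg binary on PATH, and only the SRT formatting is shown in full.

```python
import subprocess


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Convert Whisper-style segments ({'start', 'end', 'text'}) to SRT text."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


def transcribe_and_burn(video: str, srt_path: str, output: str) -> None:
    """Transcribe a video and burn the resulting subtitles into a new file."""
    import whisper  # assumes `pip install openai-whisper`

    model = whisper.load_model("base")   # default speed/accuracy trade-off
    result = model.transcribe(video)     # auto-detects the spoken language
    with open(srt_path, "w", encoding="utf-8") as f:
        f.write(segments_to_srt(result["segments"]))
    # Burn-in via ffmpeg's subtitles filter (re-encodes the video).
    subprocess.run(
        ["ffmpeg", "-y", "-i", video, "-vf", f"subtitles={srt_path}", output],
        check=True,
    )
```

Editing and styling then operate on the segment data before the burn step, which is why the SRT conversion is kept separate from transcription.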
Design work followed: I mapped the main screens, what users needed to do on each one, and how to navigate between them. I then created a design system and worked through the happy flow and supporting states, such as reopening saved projects, changing settings, loading states, and empty states.
Testing with my wife throughout helped surface confusing, unnecessary, and repetitive parts of the experience.
UX challenges and solutions
Shortest path to subtitle preview
Before transcription starts, I noticed other apps tend to ask users how they want to transcribe: Do they prefer speed or accuracy? Do they want to clean up the audio first, even if that takes longer? Where should the subtitled video be saved? Those decisions add friction right where simplicity and speed matter most. So I hid them behind defaults and moved them into Settings for users who want more control. The initial state stays focused on value instead: a simple invitation to select a video and a quick way to demo Cue before doing any work.
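Hiding those decisions behind defaults can be as simple as a settings object with sensible values that the Settings screen overrides explicitly. This is a hypothetical sketch; the field names and defaults are illustrative, not Cue's actual settings.

```python
from dataclasses import dataclass, field, replace
from pathlib import Path


@dataclass(frozen=True)
class TranscriptionSettings:
    """Defaults chosen so first-time users never see these questions."""

    model_size: str = "base"      # speed vs. accuracy trade-off
    clean_audio: bool = False     # slower audio pre-processing, off by default
    output_dir: Path = field(default_factory=lambda: Path.home() / "Videos")

    def with_overrides(self, **kwargs) -> "TranscriptionSettings":
        """The Settings screen applies explicit choices on top of defaults."""
        return replace(self, **kwargs)
```

Because the object is frozen, an override produces a new settings value rather than mutating shared state, so the initial screen can always rely on the defaults.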
The waiting problem
Transcription time varies based on the user's hardware and the length of the video, so waiting was an important part of the experience to design.
First, users are not forced to sit and watch while transcription runs. Cue auto-saves their work, so they can leave and come back later. They can also start another project while one is still running and keep multiple projects open as tabs in the title bar. On the home page, they can see each project's status as it progresses.
Second, instead of showing only a progress bar, the progress screen shows what the AI is doing. Users can see when the app is detecting the video's language, which language it detected, when it is extracting audio, and so on. That reduces uncertainty and makes the wait feel more meaningful.
Third, I added an option to play calm jazz while transcription is running. It adds a bit of playfulness, but it also gives users a practical signal that transcription is done when they are doing something else while Cue works.
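The stage-by-stage status described above can be modeled as the pipeline emitting named events that the UI renders as human-readable progress. A minimal sketch, with illustrative stage names and a placeholder where the real work and language detection would happen:

```python
from typing import Iterator, Tuple


def pipeline_events(video: str) -> Iterator[Tuple[str, str]]:
    """Yield (stage, detail) pairs the progress screen can display.

    Stage names are illustrative; a real pipeline would do actual work
    between yields and report the language Whisper detects.
    """
    yield ("extract_audio", f"Extracting audio from {video}")
    detected = "en"  # placeholder for the detected language
    yield ("detect_language", f"Detected language: {detected}")
    yield ("transcribe", "Transcribing speech")
    yield ("done", "Subtitles ready")
```

Driving the UI from discrete events like these, rather than a single percentage, is what lets the progress screen say what the app is doing at each moment.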
Making subtitle editing discoverable
One of the design goals was to let users edit subtitles directly on the video preview instead of adding a separate editing area. That kept the screen cleaner and made the experience feel more direct. But in-video subtitles are not something people expect to click or edit.
I explored several solutions, including written instructions, a keyboard shortcut hint, and a Pendo-like popover. But each one added noise and complexity.
Instead, I revisited the workflow. After transcription, users may want to fix text the AI got wrong. So rather than explain that subtitles are editable, I made that state visible by default: when transcription is complete, subtitles are already selected, the text cursor is blinking, and the text toolbar is visible above the subtitles. Users who don't want to edit text can ignore that state and move on to styling or exporting.
Making styling feel simpler but still powerful
The Editor includes 26 subtitle styling options. To keep that from feeling overwhelming, I used progressive disclosure and contextual placement of controls, grouping related options together and placing them where users would expect to find them.
Controls directly related to text, such as font, text color, and line spacing, live in a floating toolbar above the subtitles. Controls related to visual effects, such as subtitle background, shadow, and karaoke highlighting, live in a side panel. The panel itself is progressive: only the first section is enabled and expanded by default, and each effect reveals its related controls only when it's turned on.
What I learned
AI agents are not yet reliable substitutes for human design thinking. For example, no matter how hard I pushed agents to suggest better solutions for the problem of the side panel feeling overwhelming, they kept returning to the same ideas: regroup and reorganize controls, trim options, apply progressive disclosure. That helped, but it didn't solve the issue, and none of the suggestions questioned whether the side panel was the right pattern in the first place.
I also found that goals need to be described carefully if you want agents to recommend the right tech stack, because missing details cost time. Early on, I explained that I wanted Cue to be a local, offline desktop app for transcribing videos with Whisper and burning subtitles into videos. But I failed to mention that visual flexibility and pixel-perfect custom components mattered to me, and I didn't provide design references. The suggested UI framework greatly limited what the interface could become. By the time I realized that, I had already lost time. Restating my goals more clearly and including the missing information led to a stack that allowed more control and made it easier to guide agents toward the design I wanted.
What's next for Cue
Cue is currently in beta. It already supports the full flow from transcription to styled export, but it's still being refined. Right now, I'm focused on making editing faster by giving users a way to review and edit all subtitles in one place, in addition to editing directly on top of the video preview. I'm also adding a subtitle track above the seek bar so users can quickly see where subtitles appear in the video. Beyond that, I want to give users more expressive control over subtitle styling through more style presets, more fonts, support for custom fonts, words-per-subtitle controls, and more.