SKILL FILE

Local Video Clip Engine with AI

Build a local Opus Clip replacement — Whisper for transcription, FFmpeg for rendering, ASS karaoke captions, zero subscriptions.

$0/mo vs $20-40/mo for Opus Clip
~3s per 60s clip (VideoToolbox)
500 lines of TypeScript total

How video content flows across your company

One recording generates clips, quotes, and assets for every department — automatically

Video Recorded: podcast, interview, webinar
  1. Whisper Transcription: word-level timestamps with speaker diarization; runs locally, $0
  2. AI Identifies Clip-Worthy Segments: heuristic scorer ranks segments by engagement signals; runs locally, $0
  3. FFmpeg Renders Clips: hardware-accelerated crop, encode, and format; ~3s per 60s clip
  4. ASS Captions Overlaid: word-by-word karaoke animation burned into each clip
Marketing
  • Social clips for Reels & Shorts
  • YouTube Shorts publishing
  • Carousel content from quotes
  • Blog post pull-quotes
Sales
  • Demo highlight reels
  • Testimonial clips for outreach
  • Case study video snippets
  • Proposal video attachments
Growth
  • Engagement-optimized clip formats
  • A/B test thumbnails
  • Platform performance tracking
  • Content velocity metrics
CRM
  • Clip links on contact records
  • Meeting highlight reels
  • Video engagement events logged
Content Outputs
Instagram Reels from marketing
YouTube Shorts from marketing
LinkedIn Video Posts from marketing
Podcast Audiograms from sales
Quote Carousels from marketing
Everything Tracked
Content pieces tracked
Clip performance logged
Source video linked
Replaces Opus Clip
$19/mo → $0/mo
$228/yr saved

Cancel your Opus Clip subscription

CANCEL THIS

Opus Clip

$20/mo
  • × Subscription fees
  • × Data locked in their dashboard
  • × Per-seat pricing
  • × Export limits
vs
BUILD THIS

SoloStack + Claude Code

$0/mo
  • Pay-per-use, no subscription
  • Your data in your repo
  • Zero vendor lock-in
  • Unlimited exports
Save $240/year

What this skill file teaches Claude

Drop one markdown file into your repo. Claude Code learns how to run this entire workflow.

1. Word-by-Word Karaoke Captions

ASS subtitle \kf tags progressively fill each word with brand yellow (#FEBB02) as it's spoken — the exact same visual effect as Opus Clip, with full control over fonts, colors, and positioning.
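
As a rough sketch of how one karaoke caption line could be built: each word gets a \kf tag whose duration (in centiseconds) runs until the next word starts, so the fill stays in sync with the speech. The Word shape, the karaokeDialogue helper, and the "Karaoke" style name are assumptions for illustration, not the skill file's actual code. The fill colours themselves come from the ASS style (SecondaryColour before a word is spoken, PrimaryColour after), not from the tag.

```typescript
// Hypothetical shape for one transcribed word (times in seconds).
interface Word {
  text: string;
  start: number;
  end: number;
}

// Format seconds as an ASS timestamp: H:MM:SS.cc (centiseconds).
function assTime(seconds: number): string {
  const total = Math.round(seconds * 100);
  const pad = (n: number): string => String(n).padStart(2, "0");
  const h = Math.floor(total / 360000);
  const m = Math.floor((total % 360000) / 6000);
  const s = Math.floor((total % 6000) / 100);
  return `${h}:${pad(m)}:${pad(s)}.${pad(total % 100)}`;
}

// Build one Dialogue event whose words fill progressively via \kf.
// Each duration spans from a word's start to the next word's start,
// so pauses between words do not make the highlight drift.
function karaokeDialogue(words: Word[], style = "Karaoke"): string {
  const first = words[0];
  const last = words[words.length - 1];
  if (!first || !last) throw new Error("No words to caption");
  const cs = (t: number): number => Math.round(t * 100);
  const parts = words.map((w, i) => {
    const next = words[i + 1];
    const until = next ? cs(next.start) : cs(w.end);
    const duration = Math.max(0, until - cs(w.start));
    return `{\\kf${duration}}${w.text}`;
  });
  // Fields: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
  return `Dialogue: 0,${assTime(first.start)},${assTime(last.end)},${style},,0,0,0,,${parts.join(" ")}`;
}
```

Because the output is plain text, a mistimed word can be fixed by editing its \kf number by hand before rendering.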

2. Automatic Clip Selection

Heuristic scorer slides a window across the transcript and ranks segments by engagement signals: word density, sentence boundaries, questions, keyword triggers like 'here's the thing' and 'the key is'.
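
A minimal sketch of what such a sliding-window scorer might look like. The signals (word density, clean sentence endings, questions, trigger phrases) come from the description above; the weights, window size, step size, and every function name here are illustrative assumptions, not the skill file's tuned values.

```typescript
// Hypothetical shape for one transcribed word (times in seconds).
interface Word {
  text: string;
  start: number;
  end: number;
}

interface Candidate {
  start: number;
  end: number;
  score: number;
  text: string;
}

// The two trigger phrases named on this page; a real list would be longer.
const TRIGGERS = ["here's the thing", "the key is"];

// Score one window of words. All weights are made-up placeholders.
function scoreWindow(words: Word[]): number {
  const first = words[0];
  const last = words[words.length - 1];
  if (!first || !last) return 0;
  const text = words.map((w) => w.text).join(" ").toLowerCase();
  const duration = Math.max(last.end - first.start, 0.001);
  const density = words.length / duration; // words per second
  const questions = (text.match(/\?/g) ?? []).length;
  const triggers = TRIGGERS.filter((t) => text.includes(t)).length;
  const endsOnSentence = /[.!?]["']?$/.test(last.text) ? 1 : 0;
  return density * 10 + questions * 5 + triggers * 15 + endsOnSentence * 10;
}

// Slide a fixed-length window across the transcript and keep the top N.
// Overlapping windows are not de-duplicated in this sketch.
function rankWindows(words: Word[], windowSec = 60, stepSec = 15, topN = 5): Candidate[] {
  const candidates: Candidate[] = [];
  const lastWord = words[words.length - 1];
  if (!lastWord) return candidates;
  for (let t = 0; t < lastWord.end; t += stepSec) {
    const slice = words.filter((w) => w.start >= t && w.end <= t + windowSec);
    const first = slice[0];
    const last = slice[slice.length - 1];
    if (!first || !last) continue;
    candidates.push({
      start: first.start,
      end: last.end,
      score: scoreWindow(slice),
      text: slice.map((w) => w.text).join(" "),
    });
  }
  return candidates.sort((a, b) => b.score - a.score).slice(0, topN);
}
```

Keeping the scorer as a pure function over word timestamps makes it easy to adjust the weights and re-rank without re-running transcription.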

3. Hardware-Accelerated Rendering

FFmpeg's VideoToolbox encoder (h264_videotoolbox) uses the hardware video encoder in Apple Silicon to render clips 10-20x faster than software encoding. A 60-second clip renders in ~3 seconds at an 8 Mbps video bitrate.
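
For illustration, here is one way the FFmpeg argument list for a single clip might be assembled. The RenderJob shape and buildFfmpegArgs are hypothetical names; the flags are standard FFmpeg options, with libx264 as the software fallback the FAQ mentions. This sketch assumes the ASS file's timestamps are relative to the start of the clip (because the seek is placed before the input) and that the file path needs no filter escaping.

```typescript
// Hypothetical description of one clip to render.
interface RenderJob {
  input: string;
  output: string;
  start: number;     // clip start in the source video, seconds
  duration: number;  // clip length, seconds
  assFile: string;   // karaoke captions, timed relative to the clip start
  hardware: boolean; // true on Macs where VideoToolbox is available
}

// Build the argument list for one ffmpeg invocation (does not run it).
function buildFfmpegArgs(job: RenderJob): string[] {
  const videoCodec = job.hardware
    ? ["-c:v", "h264_videotoolbox", "-b:v", "8M"]
    : ["-c:v", "libx264", "-preset", "medium", "-b:v", "8M"];
  return [
    "-y",                        // overwrite existing output
    "-ss", String(job.start),    // seek before decoding the input
    "-t", String(job.duration),
    "-i", job.input,
    "-vf", `ass=${job.assFile}`, // burn captions in (needs libass)
    ...videoCodec,
    "-c:a", "aac",
    "-movflags", "+faststart",   // lets playback start before full download
    job.output,
  ];
}
```

The returned array can be handed to Node's child_process spawn("ffmpeg", args), which avoids shell quoting problems with file names.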

4. Flexible Output Formats

Crop to vertical (9:16 for Reels/TikTok), square (1:1 for Instagram feed), or keep original aspect ratio. One command flag controls the output format.
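
One plausible way to turn that format flag into an FFmpeg filter, sketched under a few assumptions: the source is landscape, the crop is centred horizontally with an optional pixel offset, and vertical output is scaled to 1080x1920 (the 1080x1080 square size matches the demo output below). The Format type and cropFilter are illustrative names, not the tool's real API.

```typescript
type Format = "vertical" | "square" | "original";

// Target aspect ratio (as an FFmpeg expression) and output size per format.
const TARGETS = {
  vertical: { ratio: "9/16", size: "1080:1920" },
  square: { ratio: "1", size: "1080:1080" },
} as const;

// Return a -vf filter string, or null when the original framing is kept.
// offsetX shifts the crop window right (positive) or left (negative) in pixels.
function cropFilter(format: Format, offsetX = 0): string | null {
  if (format === "original") return null;
  const { ratio, size } = TARGETS[format];
  const shift = offsetX === 0 ? "" : offsetX > 0 ? `+${offsetX}` : `-${-offsetX}`;
  // crop=width:height:x:y, where iw/ih are input dimensions and ow is the crop width
  return `crop=ih*${ratio}:ih:(iw-ow)/2${shift}:0,scale=${size}`;
}
```

For a 1920x1080 source, the vertical filter keeps a centred 607-pixel-wide column (1080 × 9/16 ≈ 607) and scales it up to 1080x1920.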

5. Fully Local & Private

Nothing leaves your machine. No cloud uploads, no API keys for transcription, no render queues. Your raw podcast footage stays on your disk.

6. Human-Editable Subtitles

Captions are standard ASS subtitle files — open them in any text editor to fix typos, adjust timing, or change styling before the final render.
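
To make "change styling" concrete, here is a hedged sketch of generating the Style line that sits in an ASS file's [V4+ Styles] section. It uses the defaults this page describes (white text, #FEBB02 karaoke fill, 4px black outline, bold, bottom-center); the font name, font size, and margins are placeholder assumptions, and CaptionStyle and assStyleLine are hypothetical names.

```typescript
interface CaptionStyle {
  name: string;
  font: string;
  size: number;
  spokenHex: string;   // colour once a word has been spoken (karaoke fill)
  unspokenHex: string; // colour before a word is spoken
  outlinePx: number;
  bold: boolean;
  marginV: number;     // distance from the bottom edge, in pixels
}

// Convert "#RRGGBB" to the ASS style form "&HAABBGGRR" (alpha 00 = opaque).
function assStyleColor(hex: string): string {
  const m = /^#?([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/i.exec(hex);
  if (!m) throw new Error(`Invalid hex colour: ${hex}`);
  return `&H00${m[3]}${m[2]}${m[1]}`.toUpperCase();
}

// Emit one Style line in the standard V4+ field order.
function assStyleLine(s: CaptionStyle): string {
  const fields = [
    s.name, s.font, s.size,
    assStyleColor(s.spokenHex),   // PrimaryColour
    assStyleColor(s.unspokenHex), // SecondaryColour
    "&H00000000",                 // OutlineColour (black)
    "&H00000000",                 // BackColour
    s.bold ? -1 : 0, 0, 0, 0,     // Bold, Italic, Underline, StrikeOut
    100, 100, 0, 0,               // ScaleX, ScaleY, Spacing, Angle
    1, s.outlinePx, 0,            // BorderStyle, Outline, Shadow
    2,                            // Alignment: 2 = bottom-center
    40, 40, s.marginV,            // MarginL, MarginR, MarginV
    1,                            // Encoding
  ];
  return `Style: ${fields.join(",")}`;
}
```

Note that ASS stores colours blue-first, so brand yellow #FEBB02 becomes &H0002BBFE. Editing that one value in the file is enough to recolour every caption.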

Apify Actor: N/A — fully local pipeline · $0 (runs entirely on your machine)

Build it with plain English

Tell Claude Code what to do. It handles the rest.

claude — solostack/
you: |
Processing podcast.mp4 (45:12, 1920x1080, 30fps)...

✓ Transcribed 45:12 with Whisper medium (8,247 words)
✓ Scored 156 candidate windows
✓ Selected top 5 clips (38s, 62s, 45s, 71s, 54s)
✓ Generated karaoke captions for 5 clips
✓ Rendered 5 vertical clips (9:16) via VideoToolbox

Output:
  clips/clip-001.mp4 (38s) — score: 87
  clips/clip-002.mp4 (62s) — score: 84
  clips/clip-003.mp4 (45s) — score: 81
  clips/clip-004.mp4 (71s) — score: 78
  clips/clip-005.mp4 (54s) — score: 74

Total render time: 14s
you: |
Transcribing interview.mp4 with Whisper medium...

✓ 32:18 transcribed (5,891 words, 94% avg confidence)

Top 8 clip candidates:
  #1 [12:04-13:12] score:91 — "Here's what most people get wrong about..."
  #2 [05:22-06:30] score:88 — "The key insight we discovered was..."
  #3 [24:15-25:18] score:85 — "If I had to start over tomorrow..."
  #4 [08:41-09:52] score:82 — "Let me tell you exactly what happened..."
  #5 [18:33-19:28] score:79 — "The biggest mistake I see is..."
  ...

Use: npx tsx clip.ts render interview.mp4 --clips 1,3,5
you: |
Rendering clips 1, 3, 5 in square format (1:1)...

✓ clip-001.mp4 — 68s, 1080x1080, captions burned ✓ (2.8s render)
✓ clip-003.mp4 — 63s, 1080x1080, captions burned ✓ (2.6s render)
✓ clip-005.mp4 — 55s, 1080x1080, captions burned ✓ (2.3s render)

3 clips rendered in 7.7s total
Output: clips/

What you can build with this

Podcast clip repurposing

Turn a 60-minute podcast episode into 5-10 vertical clips with animated captions, ready for TikTok, Reels, and YouTube Shorts — in under 2 minutes of render time.

Interview highlight reels

Score and extract the most quotable moments from recorded interviews. The heuristic scorer catches questions, keyword triggers, and natural pause boundaries.

Course content snippets

Chop online course recordings into bite-sized clips for social media promotion. Each clip gets word-by-word captions that make it watchable on mute.

Webinar repurposing

Extract the best segments from hour-long webinars for LinkedIn, Twitter, and email campaigns. Square format for feed posts, vertical for Stories/Reels.

Things to know

  • Whisper's medium model downloads ~1.5GB on first run. After that, it's cached locally and runs offline.
  • Transcription runs at ~1x realtime on M-series Macs, so a 60-minute podcast takes ~60 minutes to transcribe. Use the 'small' model for faster (but less accurate) results.
  • Clip selection uses heuristics, not AI. It finds likely-good segments, but you should review the candidates and pick your favorites. As a rough estimate, the selection is about 80% as good as AI-powered tools.
  • Center-crop works for most talking-head content. If your video has important action at the edges, you may need to adjust the crop offset manually.

Get the full skill file

Everything above covers about 80% of the skill file. Download the complete version with full implementation details, agent prompts, and ready-to-run scripts.

Common questions

How does this compare to Opus Clip?

Opus Clip uses AI to select clips and generates word-by-word captions. This tool does both: clip selection via heuristic scoring (word density, questions, keyword triggers) and captions via ASS karaoke subtitles (\kf tags). The visual caption effect is identical. The main difference: Opus Clip's AI selection may catch some clips the heuristics miss, but you review clips manually anyway. For $0/mo vs $20-40/mo, the trade-off is worth it.

Do I need an API key or a cloud service?

No. Everything runs locally. Whisper runs on-device (no OpenAI API key needed, because it's the open-source model, not the API). FFmpeg is a local binary. Nothing is uploaded anywhere.

What hardware do I need?

Any Mac with Apple Silicon (M1/M2/M3/M4) runs this well. Whisper medium model uses ~2GB RAM. The VideoToolbox hardware encoder that FFmpeg uses is built into every Mac. On Intel Macs or Linux, it works too, just slower (software encoding via libx264 instead of VideoToolbox).

Can I customize the captions?

Yes. The captions are standard ASS subtitle files. You can change the font, size, color, outline thickness, position, and animation timing by editing the ASS style definition. Default: white text with brand yellow (#FEBB02) karaoke fill, 4px black outline, bold, bottom-center.

What video formats are supported?

Any format FFmpeg can read: MP4, MOV, MKV, AVI, WebM, and more. Output is always MP4 with H.264 video and AAC audio, optimized for social media playback.

Can I process multiple videos at once?

Yes. The CLI accepts multiple input files or a glob pattern. Each video is processed sequentially (transcribe → select → caption → render) to avoid memory issues with Whisper.

Ready to automate?

SoloStack gives you every skill pre-installed — scraping, marketing, sales, CRM, and more. One repo. Every department.
