
faster-whisper – OpenClaw Skill

faster-whisper is an OpenClaw Skills integration for coding workflows that provides local speech-to-text via the faster-whisper library. It runs 4-6x faster than OpenAI Whisper with identical accuracy, and GPU acceleration enables ~20x realtime transcription. It supports standard and distilled models with word-level timestamps.

7.8k stars · 2.9k forks · Security L1
Updated Feb 7, 2026 · Created Feb 7, 2026 · coding

Skill Snapshot

| Field | Value |
|---|---|
| name | faster-whisper |
| description | Local speech-to-text using faster-whisper. 4-6x faster than OpenAI Whisper with identical accuracy; GPU acceleration enables ~20x realtime transcription. Supports standard and distilled models with word-level timestamps. OpenClaw Skills integration. |
| owner | theplasmak |
| repository | theplasmak/faster-whisper |
| language | Markdown |
| license | MIT |
| topics | |
| security | L1 |
| install | `openclaw add @theplasmak/faster-whisper` |
| last updated | Feb 7, 2026 |

Maintainer

theplasmak


Maintains faster-whisper in the OpenClaw Skills directory.

File Explorer (7 files)

  • scripts/transcribe.py (6.1 KB)
  • _meta.json (638 B)
  • requirements.txt (117 B)
  • setup.sh (5.1 KB)
  • skill.json (357 B)
  • SKILL.md (11.9 KB)
SKILL.md

name: faster-whisper
description: Local speech-to-text using faster-whisper. 4-6x faster than OpenAI Whisper with identical accuracy; GPU acceleration enables ~20x realtime transcription. Supports standard and distilled models with word-level timestamps.
version: 1.0.4
author: ThePlasmak
homepage: https://github.com/ThePlasmak/faster-whisper
tags: ["audio", "transcription", "whisper", "speech-to-text", "ml", "cuda", "gpu"]
platforms: ["windows", "linux", "macos", "wsl2"]
metadata: {"moltbot":{"emoji":"🗣️","requires":{"bins":["ffmpeg","python3"]}}}

Faster Whisper

Local speech-to-text using faster-whisper — a CTranslate2 reimplementation of OpenAI's Whisper that runs 4-6x faster with identical accuracy. With GPU acceleration, expect ~20x realtime transcription (a 10-minute audio file in ~30 seconds).
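
Under the hood, the bundled transcribe.py drives the faster_whisper Python package. A minimal sketch of a direct call to that API (not the skill's script itself; the model name and file path are placeholders):

# Minimal direct use of the faster_whisper package (illustrative; the
# bundled transcribe.py adds CLI handling on top of calls like these).
from faster_whisper import WhisperModel

# device="auto" selects CUDA when available; compute_type="default" lets
# CTranslate2 pick a sensible quantization for that device.
model = WhisperModel("distil-large-v3", device="auto", compute_type="default")

segments, info = model.transcribe("audio.mp3", beam_size=5)
print(f"Detected language: {info.language} ({info.language_probability:.0%})")
for segment in segments:
    print(f"[{segment.start:6.2f}s -> {segment.end:6.2f}s] {segment.text}")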

When to Use

Use this skill when you need to:

  • Transcribe audio/video files — meetings, interviews, podcasts, lectures, YouTube videos
  • Convert speech to text locally — no API costs, works offline (after model download)
  • Batch process multiple audio files — efficient for large collections
  • Generate subtitles/captions — word-level timestamps available
  • Multilingual transcription — supports 99+ languages with auto-detection

Trigger phrases: "transcribe this audio", "convert speech to text", "what did they say", "make a transcript", "audio to text", "subtitle this video"

When NOT to use:

  • Real-time/streaming transcription (use streaming-optimized tools instead)
  • Cloud-only environments without local compute
  • Very short files (<10 seconds), where a cloud API's call latency is negligible anyway

Quick Reference

| Task | Command | Notes |
|---|---|---|
| Basic transcription | `./scripts/transcribe audio.mp3` | Uses default distil-large-v3 |
| Faster English | `./scripts/transcribe audio.mp3 --model distil-medium.en --language en` | English-only, 6.8x faster |
| Maximum accuracy | `./scripts/transcribe audio.mp3 --model large-v3-turbo --beam-size 10` | Slower but best quality |
| Word timestamps | `./scripts/transcribe audio.mp3 --word-timestamps` | For subtitles/captions |
| JSON output | `./scripts/transcribe audio.mp3 --json -o output.json` | Programmatic access |
| Multilingual | `./scripts/transcribe audio.mp3 --model large-v3-turbo` | Auto-detects language |
| Remove silence | `./scripts/transcribe audio.mp3 --vad` | Voice activity detection |

Model Selection

Choose the right model for your needs:

digraph model_selection {
    rankdir=LR;
    node [shape=box, style=rounded];

    start [label="Start", shape=doublecircle];
    need_accuracy [label="Need maximum\naccuracy?", shape=diamond];
    multilingual [label="Multilingual\ncontent?", shape=diamond];
    resource_constrained [label="Resource\nconstraints?", shape=diamond];

    large_v3 [label="large-v3\nor\nlarge-v3-turbo", style="rounded,filled", fillcolor=lightblue];
    large_turbo [label="large-v3-turbo", style="rounded,filled", fillcolor=lightblue];
    distil_large [label="distil-large-v3\n(default)", style="rounded,filled", fillcolor=lightgreen];
    distil_medium [label="distil-medium.en", style="rounded,filled", fillcolor=lightyellow];
    distil_small [label="distil-small.en", style="rounded,filled", fillcolor=lightyellow];

    start -> need_accuracy;
    need_accuracy -> large_v3 [label="yes"];
    need_accuracy -> multilingual [label="no"];
    multilingual -> large_turbo [label="yes"];
    multilingual -> resource_constrained [label="no (English)"];
    resource_constrained -> distil_small [label="mobile/edge"];
    resource_constrained -> distil_medium [label="some limits"];
    resource_constrained -> distil_large [label="no"];
}

Model Table

Standard Models (Full Whisper)

| Model | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| tiny / tiny.en | 39M | Fastest | Basic | Quick drafts |
| base / base.en | 74M | Very fast | Good | General use |
| small / small.en | 244M | Fast | Better | Most tasks |
| medium / medium.en | 769M | Moderate | High | Quality transcription |
| large-v1/v2/v3 | 1.5GB | Slower | Best | Maximum accuracy |
| large-v3-turbo | 809M | Fast | Excellent | Recommended for accuracy |

Distilled Models (~6x Faster, ~1% WER difference)

| Model | Size | Speed vs Standard | Accuracy | Use Case |
|---|---|---|---|---|
| distil-large-v3 | 756M | ~6.3x faster | 9.7% WER | Default, best balance |
| distil-large-v2 | 756M | ~5.8x faster | 10.1% WER | Fallback |
| distil-medium.en | 394M | ~6.8x faster | 11.1% WER | English-only, resource-constrained |
| distil-small.en | 166M | ~5.6x faster | 12.1% WER | Mobile/edge devices |

.en models are English-only and slightly faster/better for English content.

Setup

Linux / macOS / WSL2

# Run the setup script (creates venv, installs deps, auto-detects GPU)
./setup.sh

Windows (Native)

# Run from PowerShell (auto-installs Python & ffmpeg if missing via winget)
.\setup.ps1

The Windows setup script will:

  • Auto-install Python 3.12 via winget if not found
  • Auto-install ffmpeg via winget if not found
  • Detect NVIDIA GPU and install CUDA-enabled PyTorch
  • Create venv and install all dependencies

Requirements:

  • Linux/macOS/WSL2: Python 3.10+, ffmpeg
  • Windows: Nothing! Setup auto-installs prerequisites via winget

Platform Support

| Platform | Acceleration | Speed | Auto-Install |
|---|---|---|---|
| Windows + NVIDIA GPU | CUDA | ~20x realtime 🚀 | ✅ Full |
| Linux + NVIDIA GPU | CUDA | ~20x realtime 🚀 | Manual prereqs |
| WSL2 + NVIDIA GPU | CUDA | ~20x realtime 🚀 | Manual prereqs |
| macOS Apple Silicon | CPU* | ~3-5x realtime | Manual prereqs |
| macOS Intel | CPU | ~1-2x realtime | Manual prereqs |
| Windows (no GPU) | CPU | ~1x realtime | ✅ Full |
| Linux (no GPU) | CPU | ~1x realtime | Manual prereqs |

*faster-whisper uses CTranslate2, which is CPU-only on macOS, but Apple Silicon is fast enough for practical use.

GPU Support (IMPORTANT!)

The setup script auto-detects your GPU and installs PyTorch with CUDA. Always use GPU if available — CPU transcription is extremely slow.

| Hardware | Speed | 9-min video |
|---|---|---|
| RTX 3070 (GPU) | ~20x realtime | ~27 sec |
| CPU (int8) | ~0.3x realtime | ~30 min |
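
To confirm that CUDA is actually visible before blaming the model, a quick check from the skill's virtual environment (assumes setup installed torch, as described above):

# CUDA sanity check; assumes the venv's Python and that setup installed torch.
import torch

if torch.cuda.is_available():
    print("CUDA available:", torch.cuda.get_device_name(0))
else:
    print("CUDA not available; transcription will fall back to slow CPU mode")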

If setup didn't detect your GPU, manually install PyTorch with CUDA:

Linux/macOS/WSL2:

# For CUDA 12.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu121

# For CUDA 11.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu118

Windows:

# For CUDA 12.x
.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121

# For CUDA 11.x
.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu118

Usage

Linux/macOS/WSL2:

# Basic transcription
./scripts/transcribe audio.mp3

# With specific model
./scripts/transcribe audio.wav --model large-v3-turbo

# With word timestamps
./scripts/transcribe audio.mp3 --word-timestamps

# Specify language (faster than auto-detect)
./scripts/transcribe audio.mp3 --language en

# JSON output
./scripts/transcribe audio.mp3 --json

Windows (cmd or PowerShell):

# Basic transcription
.\scripts\transcribe.cmd audio.mp3

# With specific model
.\scripts\transcribe.cmd audio.wav --model large-v3-turbo

# With word timestamps (PowerShell native syntax also works)
.\scripts\transcribe.ps1 audio.mp3 -WordTimestamps

# JSON output
.\scripts\transcribe.cmd audio.mp3 --json

Options

--model, -m        Model name (default: distil-large-v3)
--language, -l     Language code (e.g., en, es, fr - auto-detect if omitted)
--word-timestamps  Include word-level timestamps
--beam-size        Beam search size (default: 5, higher = more accurate but slower)
--vad              Enable voice activity detection (removes silence)
--json, -j         Output as JSON
--output, -o       Save transcript to file
--device           cpu or cuda (auto-detected)
--compute-type     int8, float16, float32 (default: auto-optimized)
--quiet, -q        Suppress progress messages
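
The --word-timestamps flag corresponds to faster_whisper's word_timestamps argument, which attaches per-word start/end times to every segment. A sketch of building subtitle-style output from it (the SRT formatting is illustrative, not what transcribe.py emits):

# Sketch: word-level timestamps via faster_whisper's word_timestamps flag.
# The SRT-style formatting is illustrative, not the skill's actual output.
from faster_whisper import WhisperModel

def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:01:02,500."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02}:{int(m):02}:{s:06.3f}".replace(".", ",")

model = WhisperModel("distil-large-v3", device="auto")
segments, _ = model.transcribe("audio.mp3", word_timestamps=True)

for i, seg in enumerate(segments, start=1):
    print(f"{i}\n{srt_time(seg.start)} --> {srt_time(seg.end)}\n{seg.text.strip()}\n")
    for word in seg.words:  # each word carries .start, .end, .word
        print(f"  {word.start:7.2f}-{word.end:7.2f}  {word.word}")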

Examples

# Transcribe YouTube audio (after extraction with yt-dlp)
yt-dlp -x --audio-format mp3 <URL> -o audio.mp3
./scripts/transcribe audio.mp3

# Batch transcription with JSON output
for file in *.mp3; do
  ./scripts/transcribe "$file" --json > "${file%.mp3}.json"
done

# High-accuracy transcription with larger beam size
./scripts/transcribe audio.mp3 \
  --model large-v3-turbo --beam-size 10 --word-timestamps

# Fast English-only transcription
./scripts/transcribe audio.mp3 \
  --model distil-medium.en --language en

# Transcribe with VAD (removes silence)
./scripts/transcribe audio.mp3 --vad
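
The --vad flag presumably maps to faster_whisper's built-in Silero VAD filter. The equivalent direct API call looks roughly like this (the vad_parameters tuning is illustrative, not necessarily the skill's default):

# Sketch: silence removal with faster_whisper's built-in Silero VAD filter.
from faster_whisper import WhisperModel

model = WhisperModel("distil-large-v3", device="auto")
segments, _ = model.transcribe(
    "audio.mp3",
    vad_filter=True,  # drop stretches where no speech is detected
    vad_parameters={"min_silence_duration_ms": 500},  # illustrative tuning
)
print(" ".join(seg.text.strip() for seg in segments))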

Common Mistakes

| Mistake | Problem | Solution |
|---|---|---|
| Using CPU when GPU available | 10-20x slower transcription | Check `nvidia-smi` on Windows/Linux; verify CUDA installation |
| Not specifying language | Wastes time auto-detecting on known content | Use `--language en` when you know the language |
| Using wrong model | Unnecessary slowness or poor accuracy | Default distil-large-v3 is excellent; only use large-v3 if you hit accuracy issues |
| Ignoring distilled models | Missing 6x speedup with <1% accuracy loss | Try distil-large-v3 before reaching for standard models |
| Forgetting ffmpeg | Setup fails or audio can't be processed | Setup script handles this; manual installs need ffmpeg separately |
| Out of memory errors | Model too large for available VRAM/RAM | Use a smaller model or `--compute-type int8` |
| Over-engineering beam size | Diminishing returns past beam-size 5-7 | Default 5 is fine; try 10 for critical transcripts |

Performance Notes

  • First run: Downloads model to ~/.cache/huggingface/ (one-time)
  • GPU: Automatically uses CUDA if available (~10-20x faster)
  • Quantization: INT8 used on CPU for ~4x speedup with minimal accuracy loss (see the sketch after this list)
  • Memory:
    • distil-large-v3: ~2GB RAM / ~1GB VRAM
    • large-v3-turbo: ~4GB RAM / ~2GB VRAM
    • tiny/base: <1GB RAM
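
The cache, quantization, and memory figures above all surface as WhisperModel constructor arguments. A sketch of an explicit low-memory CPU configuration (download_root is shown only to make the cache location visible; None is its default):

# Sketch: explicit low-memory CPU configuration with INT8 quantization.
from faster_whisper import WhisperModel

model = WhisperModel(
    "distil-large-v3",
    device="cpu",
    compute_type="int8",  # ~4x CPU speedup with minimal accuracy loss
    download_root=None,   # default cache: ~/.cache/huggingface/
)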

Why faster-whisper?

  • Speed: ~4-6x faster than OpenAI's original Whisper
  • Accuracy: Identical (uses same model weights)
  • Efficiency: Lower memory usage via quantization
  • Production-ready: Stable C++ backend (CTranslate2)
  • Distilled models: ~6x faster with <1% accuracy loss

Troubleshooting

"CUDA not available — using CPU": Install PyTorch with CUDA (see GPU Support above) Setup fails: Make sure Python 3.10+ is installed Out of memory: Use smaller model or --compute-type int8 Slow on CPU: Expected — use GPU for practical transcription Model download fails: Check ~/.cache/huggingface/ permissions (Linux/macOS) or %USERPROFILE%\.cache\huggingface\ (Windows)

Windows-Specific

"winget not found": Install App Installer from Microsoft Store, or install Python/ffmpeg manually "Python not in PATH after install": Close and reopen your terminal, then run setup.ps1 again PowerShell execution policy error: Run Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned or use transcribe.cmd instead nvidia-smi not found but have GPU: Install NVIDIA drivers — the Game Ready or Studio drivers include nvidia-smi


README.md

No README available.

Permissions & Security

Security level L1: Low-risk skills with minimal permissions. Review inputs and outputs before running in production.

Requirements

  • OpenClaw CLI installed and configured.
  • Language: Markdown
  • License: MIT
  • Topics:

FAQ

How do I install faster-whisper?

Run openclaw add @theplasmak/faster-whisper in your terminal. This installs faster-whisper into your OpenClaw Skills catalog.

Does this skill run locally or in the cloud?

OpenClaw Skills execute locally by default. Review the SKILL.md and permissions before running any skill.

Where can I verify the source code?

The source repository is available at https://github.com/openclaw/skills/tree/main/skills/theplasmak/faster-whisper. Review commits and README documentation before installing.