Free, local speech-to-text using OpenAI Whisper.
Install dependencies (one-time setup):
pip install openai-whisper torch
Optional: Install ffmpeg for broader format support:
macOS:
brew install ffmpeg
Ubuntu/Debian:
sudo apt install ffmpeg
Basic usage:
python ~/.openclaw/skills/whisper-stt/scripts/transcribe.py <audio_file>
| Option | Description |
|---|---|
| --model | Model size: tiny, base, small, medium, large, large-v3-turbo (default: base) |
| --language, -l | Language code: zh, en, ja, etc. (auto-detected if not specified) |
| --output, -o | Output format: json, txt, srt, vtt (default: json) |
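The options above map naturally onto an argparse parser. The following is a minimal sketch of how such a command line could be declared. It is not the actual source of transcribe.py: the `build_parser` name is hypothetical, and restricting `--model` and `--output` to the documented values is an assumption.

```python
import argparse

# Values copied from the options table above.
MODELS = ["tiny", "base", "small", "medium", "large", "large-v3-turbo"]
FORMATS = ["json", "txt", "srt", "vtt"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(description="Local speech-to-text with Whisper")
    parser.add_argument("audio_file", help="Path to the audio or video file")
    # Assumption: only the documented model names are accepted.
    parser.add_argument("--model", choices=MODELS, default="base")
    # None means the language is left to auto-detection.
    parser.add_argument("--language", "-l", default=None)
    parser.add_argument("--output", "-o", choices=FORMATS, default="json")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["recording.m4a", "-l", "zh", "-o", "txt"])
    print(args.model, args.language, args.output)
```

Leaving `--language` as `None` by default keeps the "auto-detect if not specified" behavior explicit in one place.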
Chinese audio to text:
python ~/.openclaw/skills/whisper-stt/scripts/transcribe.py recording.m4a --language zh --output txt
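For reference, the command above roughly corresponds to the following use of the openai-whisper Python package. This is a sketch under the assumption that the script wraps `whisper.load_model` and `model.transcribe`; the `transcribe_text` helper is hypothetical and not part of the skill.

```python
def transcribe_text(path, model_name="base", language=None):
    """Return the plain-text transcript of an audio file.

    Hypothetical helper: assumes the openai-whisper package is installed.
    """
    # Imported lazily so the helper can be defined without the package present.
    import whisper

    model = whisper.load_model(model_name)
    # language=None lets Whisper detect the language itself.
    result = model.transcribe(path, language=language)
    return result["text"].strip()


if __name__ == "__main__":
    print(transcribe_text("recording.m4a", language="zh"))
```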
Generate subtitles (SRT):
python ~/.openclaw/skills/whisper-stt/scripts/transcribe.py video.mp4 --output srt > subtitles.srt
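To illustrate what the SRT output contains, here is a minimal sketch that turns timed segments into SRT cues. It assumes each segment is a dict with `start`, `end` (in seconds) and `text`, which is the shape Whisper uses for its segments; the function names are hypothetical and the script's own formatter may differ.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Hello "},
        {"start": 2.5, "end": 4.0, "text": " world "},
    ]
    print(segments_to_srt(demo))
```

Note that SRT uses a comma before the milliseconds, whereas VTT uses a period.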
Use a faster model:
python ~/.openclaw/skills/whisper-stt/scripts/transcribe.py audio.mp3 --model tiny --output txt
High accuracy (slower):
python ~/.openclaw/skills/whisper-stt/scripts/transcribe.py audio.mp3 --model large --output txt
| Model | Speed | Accuracy | VRAM/RAM | Best For |
|---|---|---|---|---|
| tiny | ~32x | Basic | ~1GB | Quick tests, low resource |
| base | ~16x | Good | ~1GB | Balanced speed/accuracy |
| small | ~6x | Better | ~2GB | Better accuracy |
| medium | ~2x | Very Good | ~5GB | High accuracy |
| large | 1x | Excellent | ~10GB | Best quality |
| large-v3-turbo | ~8x | Excellent | ~6GB | Fast + accurate (recommended) |
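The table can also be read as a simple selection rule: among the models that fit in memory, take the most accurate one, and break ties by speed. The sketch below encodes that heuristic. The numbers come from the table, while the `pick_model` helper and the numeric accuracy ranks are assumptions made for illustration.

```python
# name: (relative speed, accuracy rank, approx. memory in GB), from the table above.
# Assumption: accuracy labels are ranked Basic < Good < Better < Very Good < Excellent.
MODELS = {
    "tiny": (32, 0, 1),
    "base": (16, 1, 1),
    "small": (6, 2, 2),
    "medium": (2, 3, 5),
    "large": (1, 4, 10),
    "large-v3-turbo": (8, 4, 6),
}


def pick_model(available_gb: float):
    """Return the most accurate model that fits, preferring speed on ties.

    Returns None when even the smallest model does not fit.
    """
    fitting = {name: spec for name, spec in MODELS.items() if spec[2] <= available_gb}
    if not fitting:
        return None
    # Sort key: accuracy rank first, then relative speed.
    return max(fitting, key=lambda name: (fitting[name][1], fitting[name][0]))


if __name__ == "__main__":
    for gb in (1, 2, 5, 8, 16):
        print(gb, "GB ->", pick_model(gb))
```

Under this rule large-v3-turbo wins over large whenever both fit, which matches the recommendation in the table.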
"ModuleNotFoundError: No module named 'whisper'"
→ Run: pip install openai-whisper torch
"ffmpeg not found"
→ Install ffmpeg or convert audio to WAV format first
Slow transcription
→ Use a smaller model (tiny/base), or make sure a GPU is available (Apple Silicon MPS, NVIDIA CUDA)
Poor accuracy on Chinese
→ Pass --language zh explicitly and consider a larger model (medium/large)
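Most of the problems above can be checked up front. The following sketch reports whether the `whisper` module is importable, whether `ffmpeg` is on the PATH, and which PyTorch device is available. The helper names are hypothetical, and whether Whisper actually runs well on a reported device depends on the installed versions.

```python
import importlib.util
import shutil


def pick_device() -> str:
    """Return "cuda", "mps" or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no accelerator to detect.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def check_environment() -> dict:
    """Collect the facts the troubleshooting entries above depend on."""
    return {
        "whisper": importlib.util.find_spec("whisper") is not None,
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "device": pick_device(),
    }


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

If `whisper` is reported as False, reinstall the dependencies; if `device` is cpu, a smaller model is the more practical way to speed things up.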
Powered by OpenAI Whisper - open source speech recognition.