speak-tts

Give your agent the ability to speak to you in real time. Talk to your Claude! Local text-to-speech (TTS) with voice synthesis, audio generation, and voice cloning on Apple Silicon. Use it for reading articles aloud, audiobook narration, or voice responses. Runs entirely on-device via MLX: private, no API keys.

Updated 1/28/2026

speak - Talk to your Claude!

Give your agent the ability to speak to you in real time. Local text-to-speech, voice cloning, and audio generation on Apple Silicon.

Prerequisites

Requirement	Check	Install
Apple Silicon Mac	`uname -m` → arm64	Intel not supported
macOS 12.0+	`sw_vers`	-
sox	`which sox`	`brew install sox`
ffmpeg	`which ffmpeg`	`brew install ffmpeg`
poppler (PDF)	`which pdftotext`	`brew install poppler`
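
The checks in the table can be bundled into one preflight helper. This is a minimal sketch, not part of the speak CLI: the function names `missing_tools` and `report_prereqs` are made up for illustration, and only the command-line tools from the table are checked.

```shell
# Print the name of every listed tool that is not on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Summarize what still needs installing; the tool list mirrors the table above.
report_prereqs() {
  missing=$(missing_tools sox ffmpeg pdftotext)
  if [ -z "$missing" ]; then
    echo "all tools found"
  else
    echo "missing: $(printf '%s' "$missing" | tr '\n' ' ')"
  fi
}

# Usage:
# report_prereqs
# [ "$(uname -m)" = "arm64" ] || echo "Apple Silicon required"
```

Each missing tool can then be installed with the matching `brew install` command from the table.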

Input Sources

Source	Example
Text file	`speak article.txt`
Markdown	`speak doc.md`
Direct string	`speak "Hello"`
Clipboard	`pbpaste | speak`
Stdin	`cat file.txt | speak`

Web Articles

lynx -dump -nolist "https://example.com/article" | speak --output article.wav

Converting Formats

Format	Convert Command
PDF	`pdftotext doc.pdf doc.txt`
DOCX	`textutil -convert txt doc.docx`
HTML	`pandoc -f html -t plain doc.html > doc.txt`

Output Modes

Goal	Command
Save for later	`speak text.txt --output file.wav`
Listen now (streaming)	`speak text.txt --stream`
Listen now (complete)	`speak text.txt --play`
Both	`speak text.txt --stream --output file.wav`

Default Behavior

speak article.txt          # → ~/Audio/speak/article.wav (no playback)
speak "Hello"              # → ~/Audio/speak/speak_<timestamp>.wav

Directory Auto-Creation

Directory	Auto-Created?
`~/Audio/speak/`	✓ Yes
`~/.chatter/voices/`	✗ No
Custom directories	✗ No

Always create custom directories first:

mkdir -p ~/.chatter/voices/
mkdir -p ~/Audio/custom/
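
Since custom directories are not created automatically, a small wrapper can create the parent directory of any output path before generation starts. A sketch only; `ensure_parent_dir` is a hypothetical helper name, not a speak feature.

```shell
# Create the parent directory of an output path if it does not exist yet.
ensure_parent_dir() {
  mkdir -p "$(dirname "$1")"
}

# Usage (the speak call is commented out so the sketch runs without it):
# out="$HOME/Audio/custom/article.wav"
# ensure_parent_dir "$out" && speak article.txt --output "$out"
```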

Voice Cloning

Voice cloning generates speech that matches your vocal characteristics (pitch, tone, cadence) from a short recording.

Quality Expectations

Output captures general voice characteristics but is not a perfect replica
Quality depends heavily on sample quality
15-25 seconds is optimal (10s minimum, 30s maximum)

Recording Your Voice

Using QuickTime:

Open QuickTime Player → File → New Audio Recording
Record 20 seconds of clear speech
File → Export As → Audio Only (.m4a)
Convert to WAV (see below)

Using sox (command line):

# -d = use default microphone
# Recording starts immediately and stops after 25 seconds
sox -d -r 24000 -c 1 ~/.chatter/voices/my_voice.wav trim 0 25

Converting to Required Format

Voice samples MUST be: WAV, 24000 Hz, mono, 10-30 seconds.

# From MP3
ffmpeg -i voice.mp3 -ar 24000 -ac 1 voice.wav

# From M4A (QuickTime)
ffmpeg -i voice.m4a -ar 24000 -ac 1 voice.wav

# Trim to 25 seconds
ffmpeg -i long.wav -t 25 -ar 24000 -ac 1 trimmed.wav

# Check sample properties
ffprobe -i voice.wav 2>&1 | grep -E "Duration|Stream"
# Should show: Duration ~15-25s, 24000 Hz, mono
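
The manual ffprobe check above can also be scripted. The sketch below assumes the requirements stated in this section (24000 Hz, mono, 10-30 seconds) and does not verify the WAV container itself. The function names are hypothetical; the ffprobe options used (`-show_entries`, `-select_streams`, `-of default=noprint_wrappers=1:nokey=1`) are standard ffprobe flags.

```shell
# Return 0 if the given properties meet the stated requirements:
# 24000 Hz, 1 channel, duration between 10 and 30 seconds (inclusive).
sample_specs_ok() {
  rate=$1; channels=$2; duration=$3
  [ "$rate" = "24000" ] || return 1
  [ "$channels" = "1" ] || return 1
  # awk handles fractional durations such as 24.96
  awk -v d="$duration" 'BEGIN { exit !(d + 0 >= 10 && d + 0 <= 30) }'
}

# Read the properties of a file with ffprobe and check them.
check_voice_sample() {
  file=$1
  rate=$(ffprobe -v error -select_streams a:0 -show_entries stream=sample_rate \
    -of default=noprint_wrappers=1:nokey=1 "$file")
  channels=$(ffprobe -v error -select_streams a:0 -show_entries stream=channels \
    -of default=noprint_wrappers=1:nokey=1 "$file")
  duration=$(ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$file")
  sample_specs_ok "$rate" "$channels" "$duration"
}

# Usage:
# check_voice_sample voice.wav && echo "sample looks usable"
```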

Using Your Voice

# Create directory
mkdir -p ~/.chatter/voices/

# Move sample
mv voice.wav ~/.chatter/voices/my_voice.wav

# Test
speak "Testing my voice" --voice ~/.chatter/voices/my_voice.wav --stream

# Use for content
speak notes.txt --voice ~/.chatter/voices/my_voice.wav --output presentation.wav

Path requirements:

✓ Works: ~/.chatter/voices/my_voice.wav (tilde expanded by shell)
✓ Works: /Users/name/.chatter/voices/my_voice.wav
✗ Fails: my_voice.wav (relative path)
✗ Fails: ./voices/my_voice.wav (relative path)
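
Because relative paths fail, one option is to resolve a path to its absolute form before passing it to `--voice`. A minimal sketch; `abs_path` is a hypothetical helper, and it assumes the containing directory already exists.

```shell
# Print an absolute path for a file given relative to the current directory.
# The containing directory must exist; the file itself need not.
abs_path() {
  case $1 in
    /*) printf '%s\n' "$1"; return 0 ;;   # already absolute
  esac
  dir=$(cd "$(dirname "$1")" && pwd) || return 1
  printf '%s/%s\n' "$dir" "$(basename "$1")"
}

# Usage:
# speak "Testing my voice" --voice "$(abs_path ./voices/my_voice.wav)" --stream
```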

Voice Sample Tips

Good Sample	Bad Sample
Quiet room	Background noise
Natural pace	Rushed or monotone
Clear diction	Mumbling
Varied content	Repetitive phrases

Default Voice

When --voice is omitted, a built-in default voice is used:

speak "Hello world" --stream  # Uses default voice

Emotion Tags

Tags produce audible effects (actual sounds), not spoken words:

speak "[sigh] Monday again." --stream
# Output: (sigh sound) "Monday again."

Tag	Effect
`[laugh]`	Laughter
`[chuckle]`	Light chuckle
`[sigh]`	Sighing
`[gasp]`	Gasping
`[groan]`	Groaning
`[clear throat]`	Throat clearing
`[cough]`	Coughing
`[crying]`	Crying
`[singing]`	Sung speech

Not supported: [pause] and [whisper]. Both tags are ignored.

For pauses, use punctuation instead: "Wait... let me think."
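
If existing scripts already contain the unsupported tags, they can be rewritten before synthesis. This is a sketch under one assumption: that replacing `[pause]` with an ellipsis is an acceptable stand-in, following the punctuation advice above. `normalize_tags` is a hypothetical filter, not a speak command.

```shell
# Rewrite text on stdin: "[pause]" becomes an ellipsis, "[whisper]" is dropped.
# Supported tags such as [sigh] are left untouched.
normalize_tags() {
  sed -e 's/ *\[pause\] */... /g' -e 's/\[whisper\] *//g'
}

# Usage:
# normalize_tags < script.txt | speak --stream
```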

Batch Processing

mkdir -p ~/Audio/book/
speak ch01.txt ch02.txt ch03.txt --output-dir ~/Audio/book/
# Creates: ch01.wav, ch02.wav, ch03.wav

# With auto-chunking (for long files)
speak chapters/*.txt --output-dir ~/Audio/book/ --auto-chunk

# Skip completed files
speak chapters/*.txt --output-dir ~/Audio/book/ --skip-existing

Auto-Chunk Behavior

When using --auto-chunk with batch processing:

Each input file is chunked independently
Chunks are generated and automatically concatenated per file
Final output: one .wav per input file (e.g., ch01.wav)
Intermediate chunks are deleted unless --keep-chunks is set

You don't need to concatenate chunks manually; only the final chapter files need concatenating.
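
To get a feel for how many chunks a file produces, divide its character count by the chunk size and round up. This is a rough estimate only: how `--auto-chunk` actually picks split points is not described here, so the real count may differ. The default of 6000 comes from the `--chunk-size` entry in the options table; `chunk_count` is a hypothetical helper.

```shell
# Ceiling division: how many chunks of at most $2 characters cover $1 characters.
chunk_count() {
  chars=$1
  size=${2:-6000}   # 6000 is the documented --chunk-size default
  echo $(( (chars + size - 1) / size ))
}

# Usage:
# chunk_count "$(wc -m < ch01.txt | tr -d ' ')"
```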

Concatenating Audio

# Explicit order (recommended)
speak concat ch01.wav ch02.wav ch03.wav --output book.wav

# Glob pattern (REQUIRES zero-padded filenames)
speak concat audiobook/*.wav --output book.wav

Zero-Padding Rules

Critical for correct concatenation order:

Files	Correct	Wrong
1-9	`01`, `02`, ..., `09`	`1`, `2`, ..., `9`
10-99	`01`, `02`, ..., `99`	`1`, `10`, `2`, ...
100+	`001`, `002`, ..., `999`	`1`, `100`, `2`, ...

Why: Shell glob expansion sorts names character by character, not numerically. Unpadded names expand as 1, 10, 2; zero-padded names expand as 01, 02, 10.
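
The ordering problem is easy to reproduce without any audio files. In this sketch, `padded_name` builds a zero-padded filename and `glob_order` shows the sorted order of a set of names. Both are hypothetical helpers, and `LC_ALL=C` is used only to make the sort reproducible.

```shell
# Build a zero-padded filename, e.g. padded_name ch 3 2 txt -> ch03.txt
# Pass the number without leading zeros (printf would read "08" as invalid octal).
padded_name() {
  prefix=$1; number=$2; width=$3; ext=$4
  printf "%s%0${width}d.%s\n" "$prefix" "$number" "$ext"
}

# Print names in sorted order on one line, as a glob would expand them.
glob_order() {
  printf '%s\n' "$@" | LC_ALL=C sort | tr '\n' ' ' | sed 's/ $//'
}

# Usage:
# glob_order ch1.wav ch2.wav ch10.wav      # ch1.wav ch10.wav ch2.wav
# glob_order ch01.wav ch02.wav ch10.wav    # ch01.wav ch02.wav ch10.wav
```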

PDF to Audiobook (Complete Workflow)

Step 1: Find Chapter Boundaries

# Preview table of contents
pdftotext -f 1 -l 5 textbook.pdf toc.txt
cat toc.txt  # Note chapter page numbers

# Or search for "Chapter" markers
pdftotext textbook.pdf - | grep -n "Chapter"

Step 2: Extract Chapters (Zero-Padded!)

# For 100-page book with ~10 chapters
pdftotext -f 1 -l 12 -layout textbook.pdf ch01.txt
pdftotext -f 13 -l 25 -layout textbook.pdf ch02.txt
pdftotext -f 26 -l 38 -layout textbook.pdf ch03.txt
# ... continue for all chapters
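
For books with many chapters, the extraction commands can be generated from the list of chapter start pages rather than typed by hand. A sketch, assuming each chapter ends on the page before the next one begins. `chapter_commands` is a hypothetical helper that only prints commands, and it does not quote filenames containing spaces.

```shell
# Print one pdftotext command per chapter.
# Arguments: PDF file, last page of the book, then each chapter's first page.
chapter_commands() {
  pdf=$1; last_page=$2; shift 2
  n=0
  while [ $# -gt 0 ]; do
    n=$((n + 1))
    first=$1; shift
    if [ $# -gt 0 ]; then
      last=$(( $1 - 1 ))     # chapter ends just before the next one starts
    else
      last=$last_page        # final chapter runs to the end of the book
    fi
    printf 'pdftotext -f %d -l %d -layout %s ch%02d.txt\n' \
      "$first" "$last" "$pdf" "$n"
  done
}

# Usage: review the printed commands, then pipe them to sh to run them.
# chapter_commands textbook.pdf 38 1 13 26
# chapter_commands textbook.pdf 38 1 13 26 | sh
```

With start pages 1, 13, and 26 and a last page of 38, this prints the same three commands shown above.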

Step 3: Estimate Time

speak --estimate ch*.txt
# Shows: total audio duration, generation time, storage needed

# Quick estimates:
# 1 page ≈ 2 min audio ≈ 1 min generation
# 100 pages ≈ 200 min audio ≈ 100 min generation ≈ 500 MB
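
When `speak --estimate` is not at hand, the quick estimates can be computed from a page count. The sketch uses only this document's rules of thumb: 2 minutes of audio per page, 1 minute of generation per page, and about 2.5 MB per minute of audio. `estimate_pages` is a hypothetical helper, and real numbers will vary with page density and hardware.

```shell
# Rule-of-thumb estimate from a page count:
# 2 min of audio per page, 1 min of generation per page, 2.5 MB per audio minute.
estimate_pages() {
  pages=$1
  audio_min=$((pages * 2))
  gen_min=$((pages * 1))
  size_mb=$((audio_min * 25 / 10))   # 2.5 MB/min in integer arithmetic
  echo "${audio_min} min audio, ${gen_min} min generation, ${size_mb} MB"
}

# Usage:
# estimate_pages 100    # 200 min audio, 100 min generation, 500 MB
```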

Step 4: Generate Audio

mkdir -p audiobook/
speak ch01.txt ch02.txt ch03.txt --output-dir audiobook/ --auto-chunk
# Creates: audiobook/ch01.wav, audiobook/ch02.wav, audiobook/ch03.wav

Step 5: Concatenate

speak concat audiobook/ch01.wav audiobook/ch02.wav audiobook/ch03.wav --output complete_audiobook.wav
# Or with glob (only if zero-padded):
speak concat audiobook/ch*.wav --output complete_audiobook.wav

PDF Troubleshooting

Issue	Solution
Empty/garbled text	Scanned PDF — use OCR: `brew install tesseract`
Wrong encoding	Try: `pdftotext -enc UTF-8 doc.pdf`
Check word count	`pdftotext doc.pdf - | wc -w` (should be >100)

Multi-Voice Content

mkdir -p podcast/scripts podcast/wav

echo "Welcome to the show." > podcast/scripts/01_host.txt
echo "Thanks for having me." > podcast/scripts/02_guest.txt

speak podcast/scripts/01_host.txt --voice ~/.chatter/voices/host.wav --output podcast/wav/01.wav
speak podcast/scripts/02_guest.txt --voice ~/.chatter/voices/guest.wav --output podcast/wav/02.wav

speak concat podcast/wav/01.wav podcast/wav/02.wav --output podcast.wav
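
For longer scripts, the per-line commands can be derived from the filenames. The sketch assumes the `NN_speaker.txt` naming used above and one voice sample per speaker at `~/.chatter/voices/<speaker>.wav`. All three function names are hypothetical, and the last one only prints the command rather than running it.

```shell
# Extract the speaker from a script named NN_speaker.txt (01_host.txt -> host).
speaker_of() {
  name=$(basename "$1" .txt)
  printf '%s\n' "${name#*_}"
}

# Extract the sequence prefix (01_host.txt -> 01).
sequence_of() {
  name=$(basename "$1" .txt)
  printf '%s\n' "${name%%_*}"
}

# Print the speak command for one script file.
speak_command_for() {
  printf 'speak %s --voice %s/.chatter/voices/%s.wav --output podcast/wav/%s.wav\n' \
    "$1" "$HOME" "$(speaker_of "$1")" "$(sequence_of "$1")"
}

# Usage:
# for f in podcast/scripts/*.txt; do speak_command_for "$f"; done
```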

Options Reference

Option	Description	Default
`--stream`	Stream as it generates	false
`--play`	Play after complete	false
`--output <path>`	Output file	~/Audio/speak/
`--output-dir <dir>`	Batch output directory	-
`--voice <path>`	Voice sample (full path)	default
`--timeout <sec>`	Timeout per file	300
`--auto-chunk`	Split long documents	false
`--chunk-size <n>`	Chars per chunk	6000
`--resume <file>`	Resume from manifest	-
`--keep-chunks`	Keep intermediate files	false
`--skip-existing`	Skip if output exists	false
`--estimate`	Show duration estimate	false
`--dry-run`	Preview only	false
`--quiet`	Suppress output	false

Commands

Command	Description
`speak setup`	Set up environment
`speak health`	Check system status
`speak models`	List TTS models
`speak concat`	Concatenate audio
`speak daemon kill`	Stop TTS server
`speak config`	Show configuration

Performance

Metric	Value
Cold start	~4-8s
Warm start	~3-8s
Speed	RTF 0.3-0.5 (generation takes 30-50% of the audio duration, so faster than real time)
Storage	~2.5 MB/min, ~150 MB/hour

Resume Capability

For interrupted long generations:

# Single file with auto-chunk — use --resume
speak long.txt --auto-chunk --output book.wav
# If interrupted, manifest saved at ~/Audio/speak/manifest.json
speak --resume ~/Audio/speak/manifest.json

# Batch processing — use --skip-existing
speak ch*.txt --output-dir audiobook/ --auto-chunk
# If interrupted, re-run same command:
speak ch*.txt --output-dir audiobook/ --auto-chunk --skip-existing
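
Before re-running an interrupted batch, it can help to see which inputs still lack an output file. This sketch assumes the output naming shown earlier (`ch01.txt` produces `ch01.wav` in the output directory) and checks only whether the file exists. A partially written file would count as done, and whether `--skip-existing` does anything stricter is not stated here. `pending_inputs` is a hypothetical helper.

```shell
# Print each input file whose expected output (<outdir>/<name>.wav) is missing.
pending_inputs() {
  outdir=$1; shift
  for input in "$@"; do
    base=$(basename "$input")
    [ -e "$outdir/${base%.*}.wav" ] || printf '%s\n' "$input"
  done
}

# Usage:
# pending_inputs audiobook ch*.txt
```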

Common Errors

Error	Cause	Solution
"Voice file not found"	Relative path	Use full path: `~/.chatter/voices/x.wav`
"Invalid WAV format"	Wrong specs	Convert: `ffmpeg -i in.wav -ar 24000 -ac 1 out.wav`
"Voice sample too short"	<10 seconds	Record 15-25 seconds
"Output directory doesn't exist"	Not created	`mkdir -p dirname/`
"sox not found"	Not installed	`brew install sox`
Scrambled concat order	Non-zero-padded	Use `01`, `02`, not `1`, `2`
Timeout	>5 min generation	Use `--auto-chunk` or `--timeout 600`
"Server not running"	Stale daemon	`speak daemon kill && speak health`

Setup

speak "test"     # Auto-setup on first run (downloads model ~500MB)
speak setup      # Or manual setup
speak health     # Verify everything works

Server Management

The server starts automatically and shuts down after 1 hour of idle time.

speak health        # Check status
speak daemon kill   # Stop manually
