
listenhub
Explain anything — turn ideas into podcasts, explainer videos, or voice narration. Use when the user wants to "make a podcast", "create an explainer video", "read this aloud", "generate an image", or share knowledge in audio/visual form. Supports: topic descriptions, YouTube links, article URLs, plain text, and image prompts.
Four modes, one entry point:
- Podcast — Two-person dialogue, ideal for deep discussions
- Explain — Single narrator + AI visuals, ideal for product intros
- TTS/Flow Speech — Pure voice reading, ideal for articles
- Image Generation — AI image creation, ideal for creative visualization
Users don't need to remember APIs, modes, or parameters. Just say what you want.
⛔ Hard Constraints (Inviolable)
The scripts are the ONLY interface. Period.
AI Agent ──▶ ./scripts/*.sh ──▶ ListenHub API

This is the ONLY path. Direct API calls are FORBIDDEN.
MUST:
- Execute functionality ONLY through provided scripts in **/skills/listenhub/scripts/
- Pass user intent as script arguments exactly as documented
- Trust script outputs; do not second-guess internal logic
MUST NOT:
- Write curl commands to ListenHub/Marswave API directly
- Construct JSON bodies for API calls manually
- Guess or fabricate speakerIds, endpoints, or API parameters
- Assume API structure based on patterns or web searches
- Hallucinate features not exposed by existing scripts
Why: The API is proprietary. Endpoints, parameters, and speakerIds are NOT publicly documented. Web searches will NOT find this information. Any attempt to bypass scripts will produce incorrect, non-functional code.
Script Location
Scripts are located at **/skills/listenhub/scripts/ relative to your working context.
Different AI clients use different dot-directories:
- Claude Code: .claude/skills/listenhub/scripts/
- Other clients: may vary (.cursor/, .windsurf/, etc.)
Resolution: Use glob pattern **/skills/listenhub/scripts/*.sh to locate scripts reliably, or resolve from the SKILL.md file's own path.
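A minimal resolution sketch, assuming a POSIX shell; the search root and the SCRIPTS variable name are illustrative, not part of the skill:
# Resolve the scripts directory once and reuse it (first match wins).
SCRIPTS=$(find . -type d -path '*/skills/listenhub/scripts' 2>/dev/null | head -n 1)
[ -n "$SCRIPTS" ] || { echo "listenhub scripts not found" >&2; exit 1; }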
Private Data (Cannot Be Searched)
The following are internal implementation details that AI cannot reliably know:
| Category | Examples | How to Obtain |
|---|---|---|
| API Base URL | api.marswave.ai/... | ✗ Cannot — internal to scripts |
| Endpoints | podcast/episodes, etc. | ✗ Cannot — internal to scripts |
| Speaker IDs | cozy-man-english, etc. | ✓ Call get-speakers.sh |
| Request schemas | JSON body structure | ✗ Cannot — internal to scripts |
| Response formats | Episode ID, status codes | ✓ Documented per script |
Rule: If information is not in this SKILL.md or retrievable via a script (like get-speakers.sh), assume you don't know it.
Design Philosophy
Hide complexity, reveal magic.
Users don't need to know: Episode IDs, API structure, polling mechanisms, credits, endpoint differences.
Users only need: Say idea → wait a moment → get the link.
Environment
ListenHub API Key
API key stored in $LISTENHUB_API_KEY. Check on first use:
source ~/.zshrc 2>/dev/null; [ -n "$LISTENHUB_API_KEY" ] && echo "ready" || echo "need_setup"
If setup needed, guide user:
- Visit https://listenhub.ai/settings/api-keys
- Paste key (only the lh_sk_... part)
- Auto-save to ~/.zshrc
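A sketch of that guided setup, assuming zsh as in the check above; the prompt text and the key variable name are illustrative:
# Prompt for the key, persist it to ~/.zshrc, and verify (sketch only).
printf "Paste your ListenHub API key (lh_sk_...): "
read -r key
echo "export LISTENHUB_API_KEY=\"$key\"" >> ~/.zshrc
source ~/.zshrc
[ -n "$LISTENHUB_API_KEY" ] && echo "ready"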
Image Generation API Key
Image generation uses the same ListenHub API key stored in $LISTENHUB_API_KEY.
Image generation output path defaults to the user downloads directory, stored in $LISTENHUB_OUTPUT_DIR.
On first image generation, the script auto-guides configuration:
- Visit https://listenhub.ai/settings/api-keys (requires subscription)
- Paste API key
- Configure output path (default: ~/Downloads)
- Auto-save to shell rc file
Security: Never expose full API keys in output.
Mode Detection
Auto-detect mode from user input:
→ Podcast (1-2 speakers)
Supports single-speaker or dual-speaker podcasts. Debate mode requires 2 speakers.
Default mode is quick unless another mode (e.g., deep) is explicitly requested.
If speakers are not specified, call get-speakers.sh and select the first speakerId
matching the chosen language.
If reference materials are provided, pass them as --source-url or --source-text.
When the user only provides a topic (e.g., "I want a podcast about X"), proceed with:
- detect language from user input,
- set mode=quick,
- choose one speaker via get-speakers.sh matching the language,
- create a single-speaker podcast without further clarification (see the sketch below).
- Keywords: "podcast", "chat about", "discuss", "debate", "dialogue"
- Use case: Topic exploration, opinion exchange, deep analysis
- Feature: Two voices, interactive feel
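A sketch of the topic-only default flow, assuming get-speakers.sh emits the JSON structure documented under "Get Available Speakers" below (the jq filter is based on that structure, and English is the detected language here):
# Pick the first speaker for the detected language, then create in quick mode.
SPEAKER=$($SCRIPTS/get-speakers.sh --language en | jq -r '.data.items[0].speakerId')
$SCRIPTS/create-podcast.sh --query "X" --language en --mode quick --speakers "$SPEAKER"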
→ Explain (Explainer video)
- Keywords: "explain", "introduce", "video", "explainer", "tutorial"
- Use case: Product intro, concept explanation, tutorials
- Feature: Single narrator + AI-generated visuals, can export video
→ TTS (Text-to-speech)
TTS defaults to FlowSpeech direct for single-pass text or URL narration.
Script arrays and multi-speaker dialogue belong to Speech as an advanced path, not the default TTS entry.
Text-to-speech input is limited to 10,000 characters; split or use a URL when longer.
- Keywords: "read aloud", "convert to speech", "tts", "voice"
- Use case: Article to audio, note review, document narration
- Feature: Fastest (1-2 min), pure audio
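A pre-flight length check for the 10,000-character limit, as a sketch; $CONTENT is a hypothetical variable holding the user's text:
# Reject over-length input before calling create-tts.sh.
if [ "$(printf %s "$CONTENT" | wc -m)" -gt 10000 ]; then
  echo "Text exceeds 10,000 characters; split it or pass a URL instead." >&2
fi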
Ambiguous "Convert to speech" Guidance
When the request is ambiguous (e.g., "convert to speech", "read aloud"), apply:
- Default to FlowSpeech and prioritize direct to avoid altering content.
- Input type: URL uses type=url, plain text uses type=text.
- Speaker: if not specified, call get-speakers and pick the first speakerId matching language.
- Switch to Speech only when multi-line scripts or multi-speaker dialogue is explicitly requested, and require scripts.
Example guidance:
“This request can use FlowSpeech with the default direct mode; switch to smart for grammar and punctuation fixes. For per-line speaker assignment, provide scripts and switch to Speech.”
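Hedged example invocations for this default path; create-tts.sh's documented flags appear under "TTS" below, passing the URL through --content with --type url is an assumption based on the guidance above, and $SPEAKER stands for a speakerId selected via get-speakers.sh:
# Plain text, direct mode (content passed through unaltered):
$SCRIPTS/create-tts.sh --type text --content "Article text here" --language en --mode direct --speakers "$SPEAKER"
# Article URL:
$SCRIPTS/create-tts.sh --type url --content "https://example.com/article" --language en --mode direct --speakers "$SPEAKER"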
→ Image Generation
- Keywords: "generate image", "draw", "create picture", "visualize"
- Use case: Creative visualization, concept art, illustrations
- Feature: AI image generation via Labnana API, multiple resolutions and aspect ratios
Reference Images via Image Hosts
When reference images are local files, upload to a known image host and use the direct image URL in --reference-images.
Recommended hosts: imgbb.com, sm.ms, postimages.org, imgur.com.
Direct image URLs should end with .jpg, .png, .webp, or .gif.
Default: If unclear, ask user which format they prefer.
Explicit override: User can say "make it a podcast" / "I want explainer video" / "just voice" / "generate image" to override auto-detection.
Interaction Flow
Step 1: Receive input + detect mode
→ Got it! Preparing...
Mode: Two-person podcast
Topic: Latest developments in Manus AI
For URLs, identify type:
- youtu.be/XXX → convert to https://www.youtube.com/watch?v=XXX
- Other URLs → use directly
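A normalization sketch for short YouTube links; the pattern match is illustrative and ignores query parameters:
# Expand youtu.be/XXX to the full watch URL before passing it as --source-url.
url="https://youtu.be/XXX"
case "$url" in
  *youtu.be/*) url="https://www.youtube.com/watch?v=${url##*/}" ;;
esac
echo "$url"   # https://www.youtube.com/watch?v=XXX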
Step 2: Submit generation
→ Generation submitted
Estimated time:
• Podcast: 2-3 minutes
• Explain: 3-5 minutes
• TTS: 1-2 minutes
You can:
• Wait and ask "done yet?"
• Use check-status via scripts
• View outputs in product pages:
- Podcast: https://listenhub.ai/app/podcast
- Explain: https://listenhub.ai/app/explainer
- Text-to-Speech: https://listenhub.ai/app/text-to-speech
• Do other things, ask later
Internally remember Episode ID for status queries.
Step 3: Query status
When user says "done yet?" / "ready?" / "check status":
- Success: Show result + next options
- Processing: "Still generating, wait another minute?"
- Failed: "Generation failed, content might be unparseable. Try another?"
Step 4: Show results
Podcast result:
✓ Podcast generated!
"{title}"
Episode: https://listenhub.ai/app/episode/{episodeId}
Duration: ~{duration} minutes
Download audio: provide audioUrl or audioStreamUrl on request
One-stage podcast creation generates an online task. When status is success,
the episode detail already includes scripts and audio URLs. Download uses the
returned audioUrl or audioStreamUrl without a second create call. Two-stage
creation is only for script review or manual edits before audio generation.
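When the user explicitly asks for a download, a sketch for pulling the audio URL out of the success payload; the exact JSON shape is an assumption based on the field names above:
# Extract audioUrl (falling back to audioStreamUrl) from the episode detail.
$SCRIPTS/check-status.sh --episode "<episode-id>" --type podcast \
  | jq -r '.data.audioUrl // .data.audioStreamUrl'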
Explain result:
✓ Explainer video generated!
"{title}"
Watch: https://listenhub.ai/app/explainer
Duration: ~{duration} minutes
Need to download audio? Just say so.
Image result:
✓ Image generated!
~/Downloads/labnana-{timestamp}.jpg
Image results are file-only and not shown in the web UI.
Important: Prioritize web experience. Only provide download URLs when user explicitly requests.
Script Reference
Scripts are shell-based. Locate via **/skills/listenhub/scripts/.
Dependency: jq is required for request construction.
The AI must ensure curl and jq are installed before invoking scripts.
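A fail-fast dependency check, as a minimal sketch:
# Verify curl and jq exist before invoking any script.
for cmd in curl jq; do
  command -v "$cmd" >/dev/null 2>&1 || { echo "missing dependency: $cmd" >&2; exit 1; }
done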
⚠️ Long-running Tasks: Generation may take 1-5 minutes. Use your CLI client's native background execution feature:
- Claude Code: set run_in_background: true in Bash tool
- Other CLIs: use built-in async/background job management if available
Invocation pattern:
$SCRIPTS/script-name.sh [args]
Where $SCRIPTS = resolved path to **/skills/listenhub/scripts/
Podcast (One-Stage)
Default path. Use unless script review or manual editing is required.
$SCRIPTS/create-podcast.sh --query "The future of AI development" --language en --mode deep --speakers cozy-man-english
$SCRIPTS/create-podcast.sh --query "Analyze this article" --language en --mode deep --speakers cozy-man-english --source-url "https://example.com/article"
Podcast (Two-Stage: Text → Audio)
Advanced path. Use only when script review or edits are explicitly requested:
# Stage 1: Generate text content
$SCRIPTS/create-podcast-text.sh --query "AI history" --language en --mode deep --speakers cozy-man-english,travel-girl-english
# Stage 2: Generate audio from text
$SCRIPTS/create-podcast-audio.sh --episode "<episode-id>"
$SCRIPTS/create-podcast-audio.sh --episode "<episode-id>" --scripts modified-scripts.json
Speech (Multi-Speaker)
$SCRIPTS/create-speech.sh --scripts scripts.json
echo '{"scripts":[{"content":"Hello","speakerId":"cozy-man-english"}]}' | $SCRIPTS/create-speech.sh --scripts -
# scripts.json format:
# {
# "scripts": [
# {"content": "Script content here", "speakerId": "speaker-id"},
# ...
# ]
# }
Get Available Speakers
$SCRIPTS/get-speakers.sh --language zh
$SCRIPTS/get-speakers.sh --language en
Guidance:
- If the user does not specify a voice, call get-speakers.sh first to fetch the list of available speakers.
- Default fallback: use the first speakerId in the list matching language as the default voice.
Response structure (for AI parsing):
{
"code": 0,
"data": {
"items": [
{
"name": "Yuanye",
"speakerId": "cozy-man-english",
"gender": "male",
"language": "zh"
}
]
}
}
Usage: When user requests specific voice characteristics (gender, style), call this script first to discover available speakerId values. NEVER hardcode or assume speakerIds.
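For voice-characteristic requests, a hedged jq filter over the response structure shown above (field names taken from that example):
# First male English speakerId, if any.
$SCRIPTS/get-speakers.sh --language en \
  | jq -r '[.data.items[] | select(.gender == "male")][0].speakerId'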
Explain
$SCRIPTS/create-explainer.sh --content "Introduce ListenHub" --language en --mode info --speakers cozy-man-english
$SCRIPTS/generate-video.sh --episode "<episode-id>"
TTS
$SCRIPTS/create-tts.sh --type text --content "Welcome to ListenHub" --language en --mode smart --speakers cozy-man-english
Image Generation
$SCRIPTS/generate-image.sh --prompt "sunset over mountains" --size 2K --ratio 16:9
$SCRIPTS/generate-image.sh --prompt "style reference" --reference-images "https://example.com/ref1.jpg,https://example.com/ref2.png"
Check Status
$SCRIPTS/check-status.sh --episode "<episode-id>" --type podcast
$SCRIPTS/check-status.sh --episode "<episode-id>" --type flow-speech
$SCRIPTS/check-status.sh --episode "<episode-id>" --type explainer
Language Adaptation
Automatic Language Detection: Adapt output language based on user input and context.
Detection Rules:
- User Input Language: If user writes in Chinese, respond in Chinese. If user writes in English, respond in English.
- Context Consistency: Maintain the same language throughout the interaction unless user explicitly switches.
- CLAUDE.md Override: If project-level CLAUDE.md specifies a default language, respect it unless user input indicates otherwise.
- Mixed Input: If user mixes languages, prioritize the dominant language (>50% of content).
Application:
- Status messages: "→ Got it! Preparing..." (English) vs "→ 收到!准备中..." (Chinese)
- Error messages: Match user's language
- Result summaries: Match user's language
- Script outputs: Pass through as-is (scripts handle their own language)
Example:
User (Chinese): "生成一个关于 AI 的播客"
AI (Chinese): "→ 收到!准备双人播客..."
User (English): "Make a podcast about AI"
AI (English): "→ Got it! Preparing two-person podcast..."
Principle: Language is interface, not barrier. Adapt seamlessly to user's natural expression.
AI Responsibilities
Black Box Principle
You are a dispatcher, not an implementer.
Your job is to:
- Understand user intent (what do they want to create?)
- Select the correct script (which tool fits?)
- Format arguments correctly (what parameters?)
- Execute and relay results (what happened?)
Your job is NOT to:
- Understand or modify script internals
- Construct API calls directly
- Guess parameters not documented here
- Invent features that scripts don't expose
Mode-Specific Behavior
ListenHub modes (passthrough):
- Podcast/Explain/TTS/Speech → pass user input directly
- Server has full AI capability to process content
- If user needs specific speakers → call get-speakers.sh first to list options
Labnana mode (enhance):
- Image Generation → client-side AI optimizes prompt
- Thin forwarding layer, needs client intelligence enhancement
Prompt Optimization (Image Generation)
When generating images, optimize user prompts by adding:
Style Enhancement:
- "cyberpunk" → add "neon lights, futuristic, dystopian"
- "ink painting" → add "Chinese ink painting, traditional art style"
- "photorealistic" → add "highly detailed, 8K quality"
Scene Details:
- Time: at night / at sunset / in the morning
- Lighting: dramatic lighting / soft lighting / neon glow
- Weather: rainy / foggy / clear sky
Composition Quality:
- Composition: cinematic composition / wide-angle / close-up
- Quality: highly detailed / 8K quality / professional photography
DO:
- Understand user intent, add missing details
- Use English keywords (models trained on English)
- Add quality descriptors
- Keep user's core intent unchanged
- Show optimized prompt transparently
DON'T:
- Drastically change user's original meaning
- Add elements user explicitly doesn't want
- Over-stack complex terminology
- If user wants "simple", don't add "highly detailed"
Example outputs by mode:
Podcast submission:
→ Generation submitted, about 2-3 minutes
You can:
• Wait and ask "done yet?"
• Check listenhub.ai/app/library
Explain submission:
→ Generation submitted, explainer videos take 3-5 minutes
Includes: Script + narration + AI visuals
TTS submission:
→ TTS submitted, about 1-2 minutes
Wait a moment, or ask "done yet?" to check
Image prompt optimization:
Original: cyberpunk city at night
Optimized prompt:
"Cyberpunk city at night, neon lights reflecting on wet streets,
towering skyscrapers with holographic ads, flying vehicles,
cinematic composition, highly detailed, 8K quality"
Resolution: 4K (16:9)
✓ Image generated!
~/Downloads/labnana-20260121-143145.jpg
Image with reference images:
Prompt: a futuristic car
Reference images: 1
Reference image URL: https://example.com/style-ref.jpg
Resolution: 2K (16:9)
✓ Image generated!
~/Downloads/labnana-20260122-154230.jpg
Podcast result:
"AI Revolution: From GPT to AGI"
Listen: https://listenhub.ai/app/podcast
Duration: ~8 minutes
Need to download? Just say so.