AI Analysis

StoryFab uses multiple AI models to analyze your video: Whisper for transcription, SmartSegmenter for scene understanding, and, in Commentary Mode, an LLM for semantic understanding.

Transcription (Whisper)

StoryFab runs OpenAI Whisper entirely locally. No audio is ever sent to the cloud.

Supported Models

| Model | Size | Speed | Accuracy |
| --- | --- | --- | --- |
| tiny | ~75 MB | 10x realtime | Baseline |
| base | ~140 MB | 7x realtime | Good |
| small | ~470 MB | 4x realtime | Very Good |
| medium | ~1.5 GB | 2x realtime | Excellent |

The default model is base. You can change it in Settings → AI → Whisper Model.

How Transcription Works

  1. Audio is extracted from the video file via FFmpeg
  2. Audio is split into 30-second chunks
  3. Each chunk is fed to Whisper for transcription with timestamps
  4. Results are merged into a continuous subtitle track (SRT format)
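
The merge in step 4 can be sketched as follows. This is a minimal illustration, not StoryFab's actual code: the `ChunkSegment` shape and the function names are hypothetical, and it assumes each chunk reports timestamps relative to its own start, so they must be shifted by the chunk's 30-second offset before being written as SRT cues.

```typescript
// Hypothetical shape of one timestamped piece of text returned for a chunk.
interface ChunkSegment {
  startSec: number; // relative to the start of its chunk (assumption)
  endSec: number;
  text: string;
}

const CHUNK_SECONDS = 30; // chunk length from step 2

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(totalSec: number): string {
  const ms = Math.round(totalSec * 1000);
  const pad = (n: number, width: number): string => String(n).padStart(width, "0");
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms % 1000, 3)}`;
}

// Shift chunk-relative times by the chunk offset, then number the cues from 1.
function mergeChunksToSrt(chunks: ChunkSegment[][]): string {
  const cues: string[] = [];
  chunks.forEach((segments, chunkIndex) => {
    const offset = chunkIndex * CHUNK_SECONDS;
    for (const seg of segments) {
      const index = cues.length + 1;
      const start = toSrtTime(seg.startSec + offset);
      const end = toSrtTime(seg.endSec + offset);
      cues.push(`${index}\n${start} --> ${end}\n${seg.text.trim()}\n`);
    }
  });
  // A blank line separates consecutive cues in SRT.
  return cues.join("\n");
}
```

Passing one array of segments per chunk, in order, yields a single SRT string with continuous cue numbering and absolute timestamps.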

Smart Segmentation

After transcription, StoryFab analyzes the audio energy, visual scene changes, and speech activity to segment the video into meaningful chunks.

Scoring Factors

  • Audio Energy: Loud, dynamic audio segments score higher (e.g., exclamations, applause, music peaks)
  • Scene Change: Sharp visual transitions often indicate topic changes or key moments
  • Speech Activity: Segments with clear speech (not silence) are preferred
  • Pause Detection: Natural breakpoints in speech are used as clip boundaries
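
One way to picture how the first three factors could combine is a weighted sum. The sketch below is purely illustrative: the `SegmentFeatures` shape, the normalization to 0..1, and the weights are all assumptions, since this page does not document how SmartSegmenter weighs the factors. Pause detection is left out because it determines clip boundaries rather than contributing to the score.

```typescript
// Hypothetical per-segment features, each assumed to be normalized to 0..1.
interface SegmentFeatures {
  audioEnergy: number;    // louder, more dynamic audio gives a higher value
  sceneChange: number;    // strength of visual transitions in the segment
  speechActivity: number; // fraction of the segment that contains speech
}

// Illustrative weights only; the real values are not documented here.
const WEIGHTS = { audioEnergy: 0.4, sceneChange: 0.3, speechActivity: 0.3 };

// Weighted sum of the clamped features; the result is in the range 0..1.
function scoreSegment(f: SegmentFeatures): number {
  const clamp = (x: number): number => Math.min(1, Math.max(0, x));
  return (
    WEIGHTS.audioEnergy * clamp(f.audioEnergy) +
    WEIGHTS.sceneChange * clamp(f.sceneChange) +
    WEIGHTS.speechActivity * clamp(f.speechActivity)
  );
}
```

With any such scheme, a loud segment with a scene cut and continuous speech ranks above a quiet, static, mostly silent one.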

Speed Derivation

Each segment gets a suggested playback speed based on its energy profile:

| Energy Ratio (vs. average) | Suggested Speed | Description |
| --- | --- | --- |
| > 1.1x | 1x | High energy: keep original pacing |
| 0.85–1.1x | 2x | Normal energy: mild acceleration |
| 0.5–0.85x | 4x | Low energy: skip dead time |
| < 0.5x | 6x | Near silence: maximum compression |
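
The thresholds above translate directly into a lookup function. This is a sketch under stated assumptions: the function names are hypothetical, the average is taken as the arithmetic mean of the per-segment energies, and values that fall exactly on a boundary (0.5, 0.85, 1.1) are assigned to the tier shown in the comments, which the table itself does not specify.

```typescript
// Map an energy ratio (segment energy / average energy) to a playback speed.
function suggestedSpeed(energyRatio: number): number {
  if (energyRatio > 1.1) return 1;   // high energy: keep original pacing
  if (energyRatio >= 0.85) return 2; // normal energy (boundary handling assumed)
  if (energyRatio >= 0.5) return 4;  // low energy (boundary handling assumed)
  return 6;                          // near silence
}

// Derive speeds for a list of per-segment energies, relative to their mean.
function suggestSpeeds(energies: number[]): number[] {
  if (energies.length === 0) return [];
  const mean = energies.reduce((sum, e) => sum + e, 0) / energies.length;
  // If every segment is silent, treat all ratios as 0 (maximum compression).
  return energies.map((e) => suggestedSpeed(mean > 0 ? e / mean : 0));
}
```

Because the ratio is relative to the video's own average, a uniformly quiet video is not compressed across the board; only the segments that are quiet compared with the rest are sped up.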

Tuning Detection Parameters

In Settings → AI → Highlight Detection, you can tune:

| Parameter | Range | Default | Effect |
| --- | --- | --- | --- |
| Min clip duration | 5–60 s | 15 s | Longer = fewer, more substantial clips |
| Max clips | 3–20 | 10 | Upper limit on clips per video |
| Sensitivity | Low / Medium / High | Medium | Higher = more clips detected |
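
For readers scripting around these settings, the ranges and defaults can be captured in a small validation helper. The type and field names below are hypothetical (the page only documents the settings UI), and clamping out-of-range values is an assumed policy rather than documented behavior.

```typescript
type Sensitivity = "low" | "medium" | "high";

// Hypothetical settings shape mirroring the table above.
interface HighlightSettings {
  minClipDurationSec: number; // 5 to 60, default 15
  maxClips: number;           // 3 to 20, default 10
  sensitivity: Sensitivity;   // default "medium"
}

const DEFAULTS: HighlightSettings = {
  minClipDurationSec: 15,
  maxClips: 10,
  sensitivity: "medium",
};

// Fill in missing values with defaults and clamp numbers to the documented ranges.
function normalizeSettings(partial: Partial<HighlightSettings>): HighlightSettings {
  const clamp = (v: number, lo: number, hi: number): number =>
    Math.min(hi, Math.max(lo, Math.round(v)));
  return {
    minClipDurationSec: clamp(partial.minClipDurationSec ?? DEFAULTS.minClipDurationSec, 5, 60),
    maxClips: clamp(partial.maxClips ?? DEFAULTS.maxClips, 3, 20),
    sensitivity: partial.sensitivity ?? DEFAULTS.sensitivity,
  };
}
```

Calling `normalizeSettings({})` returns the defaults, while a value such as `minClipDurationSec: 120` is pulled back to the upper bound of 60.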

Semantic Segmentation 🆕

Available in Commentary Mode only

In Commentary Mode, after Smart Segmentation, StoryFab uses an LLM to add semantic understanding to each segment.

What It Does

  1. Plot Understanding: The LLM reads each segment's transcript and summarizes what happens
  2. Character Tracking: Identifies which characters appear in each segment
  3. Emotional Tone: Classifies the scene's emotional tone (e.g., happy, sad, tense, comedic, or surprising)
  4. Commentary Potential: Scores each segment on how compelling it would be as narration content

Semantic Segment Output

```typescript
interface SemanticSegment {
  start_ms: number;            // Segment start, in milliseconds
  end_ms: number;              // Segment end, in milliseconds
  segment_type: string;        // "dialogue" | "action" | "transition" | "silence" | "content"
  plot_summary: string;        // One- or two-sentence plot summary
  characters: string[];        // Characters who appear in the segment
  emotional_tone: string;      // "happy" | "sad" | "tense" | "calm" | "comedic"
  commentary_tone: string;     // Suggested commentary tone
  highlight_potential: number; // 0.0-1.0, potential as commentary material
}
```

Example

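| Segment | Plot Summary | Characters | Emotional Tone | Commentary Tone |
| --- | --- | --- | --- | --- |
| 0:15–0:32 | The male lead is bumped into by the female lead as soon as he walks in | 王霸天, 林小雨 | tense | Shocked (震惊版) |
| 0:32–1:05 | The two apologize to each other and realize it was a misunderstanding | 王霸天, 林小雨 | comedic | Humorous (幽默版) |
| 1:05–1:20 | The female lead notices the male lead's outfit | 王霸天, 林小雨 | calm | Down-to-earth (接地气版) |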
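
As an illustration of how downstream code might consume this output, the sketch below picks the segments most worth narrating. The `topSegments` helper, its `limit` parameter, and the default threshold of 0.5 are assumptions made for the example and are not part of StoryFab's documented API; the interface is repeated so the snippet is self-contained.

```typescript
interface SemanticSegment {
  start_ms: number;
  end_ms: number;
  segment_type: string;
  plot_summary: string;
  characters: string[];
  emotional_tone: string;
  commentary_tone: string;
  highlight_potential: number; // 0.0-1.0
}

// Keep the `limit` highest-scoring segments above a threshold,
// then return them in timeline order so narration follows the video.
function topSegments(
  segments: SemanticSegment[],
  limit: number,
  minPotential = 0.5 // hypothetical cutoff
): SemanticSegment[] {
  return segments
    .filter((s) => s.highlight_potential >= minPotential) // filter() copies, so the input is not mutated
    .sort((a, b) => b.highlight_potential - a.highlight_potential)
    .slice(0, limit)
    .sort((a, b) => a.start_ms - b.start_ms);
}
```

Sorting twice is deliberate: the first sort selects by score, the second restores chronological order.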

Subtitle Generation

Transcription automatically produces SRT subtitle files. Subtitles are:

  • Word-level accurate
  • Time-synced to the audio
  • Stored locally in the project folder

You can also import existing subtitle files (.srt, .ass, .vtt).
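
When handling imported files in your own tooling, the format can be identified from the file extension, and SRT timestamps can be converted to milliseconds to line up with fields such as `start_ms`. Both helpers below are hypothetical sketches and do not describe how StoryFab itself parses subtitles.

```typescript
type SubtitleFormat = "srt" | "ass" | "vtt";

// Identify a supported subtitle format from the file name, case-insensitively.
function detectSubtitleFormat(fileName: string): SubtitleFormat | null {
  const match = /\.(srt|ass|vtt)$/i.exec(fileName.trim());
  return match ? (match[1].toLowerCase() as SubtitleFormat) : null;
}

// Convert an SRT timestamp (HH:MM:SS,mmm) to milliseconds; null if malformed.
function parseSrtTimestamp(stamp: string): number | null {
  const match = /^(\d{2,}):([0-5]\d):([0-5]\d),(\d{3})$/.exec(stamp.trim());
  if (!match) return null;
  const [h, m, s, ms] = match.slice(1).map(Number);
  return ((h * 60 + m) * 60 + s) * 1000 + ms;
}
```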

Open source under the MIT License