Building DnD Scribe: Engineering a Scalable Discord Voice Transcription & AI Campaign Memory System

DnD Scribe began as a simple question:
Can a Discord bot listen to a live Dungeons & Dragons session, transcribe everything automatically, and turn hours of chaotic role-play into structured, searchable campaign knowledge?
What emerged is a multi-service system that captures live voice, chunks and streams audio for transcription, stores structured session data in MongoDB, and applies large language models to transform raw dialogue into usable narrative artifacts.
Live site: https://www.dndscribe.com
Repository: https://github.com/f00d4tehg0dz/DiscordTranscribeDnD
This article focuses on how it works internally: the audio pipeline, chunking strategy, data modeling, AI usage, and scalability decisions.

System Overview
At a high level, DnD Scribe consists of:
- Discord bot (Node.js + discord.js)
- Audio ingestion + chunking layer
- Speech-to-text pipeline using OpenAI Whisper
- Summarization and enrichment using GPT models
- Persistence layer using MongoDB
- Web UI for browsing campaigns and sessions
Conceptual flow:
Discord Voice Channel
↓
Per-user audio capture
↓
PCM buffer → WAV chunk
↓
Whisper transcription
↓
MongoDB (raw segments)
↓
LLM summarization
↓
MongoDB (summaries + metadata)
↓
Web UI / Discord output
The key engineering challenge is handling long-running voice sessions reliably without exhausting memory, hitting API limits, or losing context.
Capturing and Chunking Discord Audio
Discord voice data arrives as continuous PCM frames. Streaming that entire feed into a single transcription request is not viable:
- Whisper has practical duration limits
- Memory would grow unbounded
- Network failures would corrupt large chunks
Instead, DnD Scribe uses a rolling chunk buffer.
Chunking Strategy
Each speaking user maintains:
- An in-memory PCM buffer
- Timestamp of the last received audio frame
- Byte length counter
The buffer is flushed into a WAV file and queued for transcription when either condition is met:
- The buffer exceeds N seconds of audio (e.g., 20–30s)
- The silence gap exceeds M milliseconds
Pseudo-logic:
if (bufferDuration >= MAX_CHUNK_SECONDS || silenceGap > MAX_SILENCE_MS) {
  flushBufferToWav();
  enqueueForTranscription(wavPath);
  resetBuffer();
}
This yields:
- Predictable file sizes
- Fast turnaround for transcripts
- Minimal memory pressure
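The per-user buffer described above can be sketched as a small class. The thresholds and PCM format constants here are illustrative values, not the project's actual configuration:

```javascript
// Rolling chunk buffer kept per speaking user. Frames accumulate until
// either the duration cap or the silence gap triggers a flush.
const MAX_CHUNK_SECONDS = 25;
const MAX_SILENCE_MS = 1500;
const BYTES_PER_SECOND = 48000 * 2 * 2; // 48kHz, 16-bit samples, stereo

class UserChunkBuffer {
  constructor() {
    this.frames = [];
    this.byteLength = 0;
    this.lastFrameAt = Date.now();
  }

  push(frame, now = Date.now()) {
    this.frames.push(frame);
    this.byteLength += frame.length;
    this.lastFrameAt = now; // timestamp of the last received audio frame
  }

  shouldFlush(now = Date.now()) {
    const durationSec = this.byteLength / BYTES_PER_SECOND;
    const silenceGap = now - this.lastFrameAt;
    return durationSec >= MAX_CHUNK_SECONDS || silenceGap > MAX_SILENCE_MS;
  }

  flush() {
    const pcm = Buffer.concat(this.frames);
    this.frames = [];
    this.byteLength = 0;
    return pcm; // caller writes this out as a WAV chunk
  }
}
```

Because the byte counter maps directly to seconds of audio, the duration check costs nothing per frame.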
Why Not Stream Directly?
Streaming APIs are fragile under long sessions and transient network issues. File-based chunking provides:
- Natural retry boundaries
- Persistent audit trail
- Easier debugging
If a chunk fails, it can simply be retried.
WAV Conversion Pipeline
Discord voice packets arrive as Opus → decoded into PCM → written to WAV.
Typical flow:
const opusDecoder = new prism.opus.Decoder({
  rate: 48000,
  channels: 2,
  frameSize: 960
});
opusStream.pipe(opusDecoder).pipe(wavWriter);
Design choices:
- 48kHz stereo to preserve clarity
- Standard WAV container for Whisper compatibility
- Temporary filesystem storage instead of memory buffers
This decouples capture from transcription and prevents memory ballooning.
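As a point of reference, the WAV container itself is simple enough to build by hand: a 44-byte RIFF header in front of the raw PCM. A minimal sketch, assuming the same 48kHz/16-bit/stereo format (the project itself can lean on prism-media or a wav library for this):

```javascript
// Wrap a flushed PCM buffer in a standard WAV container so Whisper can
// consume it directly. Hand-rolled RIFF header; format values assumed.
function pcmToWav(pcm, { sampleRate = 48000, channels = 2, bitDepth = 16 } = {}) {
  const byteRate = sampleRate * channels * (bitDepth / 8);
  const blockAlign = channels * (bitDepth / 8);
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4); // total size minus 8 bytes
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);             // fmt chunk size for PCM
  header.writeUInt16LE(1, 20);              // audio format 1 = PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitDepth, 34);
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);     // payload length in bytes
  return Buffer.concat([header, pcm]);
}
```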
Transcription Queue Architecture
A lightweight in-process queue is used:
Audio Chunk → Queue → Worker → Whisper API
Workers:
- Process chunks sequentially per guild
- Apply exponential backoff on failures
- Tag results with guildId, sessionId, and userId
This avoids:
- Bursting too many API calls
- Out-of-order transcripts
- Partial session corruption
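The exponential backoff can be sketched as a generic wrapper around the transcription call. Attempt counts and delays here are illustrative, not the project's actual values:

```javascript
// Retry a failing job with exponentially growing delays:
// 500ms, 1s, 2s, 4s, ... up to the attempt cap.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withBackoff(fn, { attempts = 5, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      await sleep(baseDelayMs * 2 ** i);
    }
  }
  throw lastError; // give up after the final attempt
}
```

Transient Whisper or network errors then resolve on a later attempt without hammering the API.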
Pseudo-worker:
while (queue.hasItems()) {
  const job = queue.next();
  const text = await whisperTranscribe(job.file);
  storeTranscript(job.meta, text);
}
Using Whisper for Speech-to-Text
Whisper is used because it:
- Handles noisy audio well
- Supports long-form speech
- Performs reliably with multiple speakers
Each request includes:
- Language hint (if known)
- Temperature near zero
- No prompt injection (pure transcription)
Result:
{
  "text": "I cast fireball at the goblins on the ridge..."
}
This raw output is never overwritten — only appended.
That decision enables:
- Reprocessing with improved models later
- Auditing
- Debugging hallucinations in summaries
MongoDB Data Modeling
Rather than one massive “session” document, data is segmented into collections:
1. Guilds
{
  guildId,
  name,
  openaiKeyEncrypted,
  settings
}
2. Campaigns
{
  campaignId,
  guildId,
  name,
  createdAt
}
3. Sessions
{
  sessionId,
  campaignId,
  startTime,
  endTime,
  status
}
4. Transcripts
{
  sessionId,
  userId,
  timestamp,
  text
}
5. Summaries
{
  sessionId,
  summaryType, // "interval", "final"
  content,
  createdAt
}
Why This Matters
- Transcripts scale linearly
- Summaries can be regenerated
- Sessions stay lightweight
- Queries remain fast
No document grows without bounds.
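Keeping those queries fast mostly comes down to a few compound indexes. An illustrative sketch using the Node MongoDB driver, with the database handle injected (the project's actual indexes may differ):

```javascript
// Compound indexes matching the main access patterns: transcripts read
// in timestamp order per session, sessions listed per campaign, and
// summaries fetched per session by type. Field names follow the schemas above.
async function ensureIndexes(db) {
  await db.collection("transcripts").createIndex({ sessionId: 1, timestamp: 1 });
  await db.collection("sessions").createIndex({ campaignId: 1, startTime: -1 });
  await db.collection("summaries").createIndex({ sessionId: 1, summaryType: 1 });
}
```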
Periodic Summarization
Instead of summarizing only at session end, DnD Scribe performs interval summarization.
Example:
- Every 30 minutes
- Or after N transcript chunks
Flow:
- Pull the last N transcript entries
- Feed into LLM
- Store interval summary
Prompt structure:
Summarize the following Dungeons & Dragons session dialogue.
Focus on:
- Major events
- NPC interactions
- Player decisions
- Combat outcomes
Return structured bullet points.
Benefits:
- Reduces context window size
- Enables near-real-time summaries
- Prevents a single huge prompt at the end
The final summary is then generated from the interval summaries plus transcripts, not from raw text alone.
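The interval request itself can be sketched as a small prompt-assembly step. The entry shape follows the Transcripts schema above, and the message format targets a chat-completion style API:

```javascript
// System prompt reproducing the structure described above.
const SUMMARY_PROMPT = [
  "Summarize the following Dungeons & Dragons session dialogue.",
  "Focus on:",
  "- Major events",
  "- NPC interactions",
  "- Player decisions",
  "- Combat outcomes",
  "Return structured bullet points.",
].join("\n");

// Build the messages array from the last N transcript entries,
// attributing each line to its speaker.
function buildSummaryMessages(entries) {
  const dialogue = entries
    .map((entry) => `[${entry.userId}] ${entry.text}`)
    .join("\n");
  return [
    { role: "system", content: SUMMARY_PROMPT },
    { role: "user", content: dialogue },
  ];
}
```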

Hallucination Mitigation
Several guardrails are used:
- Only feed the actual transcript text
- No worldbuilding prompts
- No “creative writing” instructions
- Temperature kept low
The AI is instructed to summarize, not embellish.
This keeps the output as factual as possible.
Horizontal Scalability
DnD Scribe can be scaled along three axes:
Bot Instances
Multiple bot processes can run:
- Same code
- Different guilds
- Shared MongoDB
Transcription Workers
The in-process queue can later be externalized (e.g., Redis, SQS, or RabbitMQ).
Stateless Design
All state is stored in the database:
- If the bot crashes, the session resumes
- If the worker crashes, the queue continues
This makes containerized deployment straightforward.
Cost Controls
- Per-guild API keys
- Chunk size limits
- Summarization intervals
- No continuous streaming to LLMs
Guilds control their own usage footprint.
Security Considerations
- API keys encrypted at rest
- No transcripts exposed publicly
- Guild isolation by ID
- Environment variables for secrets
Why This Architecture Works Well for D&D
Tabletop sessions are:
- Long
- Unstructured
- Multi-speaker
- Noisy
Chunked audio + append-only transcripts + layered summarization matches that reality.
The system does not try to “understand” the story in real time.
It records first, then reasons later.
That separation is the core design principle.
Closing Thoughts
DnD Scribe is not just a Discord bot. It is a campaign memory engine.
By treating voice as a data stream, transcripts as immutable logs, and summaries as derived artifacts, the system stays reliable, debuggable, and scalable.
Future expansions (NPC extraction, timeline graphs, character arcs, vector search, RAG) build naturally on top of this foundation.
Bonus: A quick stat page!
