How to Build a Video Q&A Search App with OpenAI API in 2025

Prerequisites and Project Setup Checklist

Before writing a single line of application code, make sure your environment is fully equipped. Missing a dependency mid-build — especially ffmpeg — causes subtle failures that waste hours.

  • [ ] Node.js 20+ — required for native fetch, top-level await, and stable ESM support
  • [ ] OpenAI API key — needs access to Whisper (whisper-1), text-embedding-3-small, and gpt-4o
  • [ ] ffmpeg 6.x — must be on your system PATH; verify with ffmpeg -version
  • [ ] faiss-node 0.5+ — local FAISS bindings for Node.js (or Supabase account for pgvector alternative)
  • [ ] Sample MP4/WebM files — at least one video to test the full pipeline
  • [ ] 8GB RAM minimum — FAISS in-memory index for a few hundred hours of video fits comfortably here

Estimated time: 45–60 minutes for a working end-to-end pipeline.

Required Tools and Accounts

| Tool | Version | Purpose |
|---|---|---|
| Node.js | 20.x LTS | Runtime for all pipeline scripts |
| openai (npm) | 4.x | Whisper, embeddings, GPT-4o API calls |
| ffmpeg | 6.x | Audio extraction and segmentation |
| faiss-node | 0.5.x | Local vector similarity search |
| langchain | 0.2.x | Optional: document loaders and text splitters |
| dotenv | 16.x | Environment variable management |

Project Directory Structure and Dependency Installation

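mkdir video-qa-search && cd video-qa-search
npm init -y
npm pkg set type=module   # the scripts below use ES module imports
npm install openai faiss-node dotenv express langchain @langchain/openai
mkdir -p src data/videos data/audio data/transcripts index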

Your directory should look like:

video-qa-search/
├── src/
│   ├── transcribe.js
│   ├── embed.js
│   ├── query.js
│   └── server.js
├── data/
│   ├── videos/          # raw MP4/WebM files
│   ├── audio/           # extracted audio segments
│   └── transcripts/     # JSON transcript documents
├── index/               # persisted FAISS index files
├── .env
└── package.json

Environment Variables Configuration

# .env
OPENAI_API_KEY=sk-...
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
OPENAI_CHAT_MODEL=gpt-4o
FAISS_INDEX_PATH=./index/transcripts.index
TRANSCRIPTS_DIR=./data/transcripts
VIDEOS_DIR=./data/videos
AUDIO_DIR=./data/audio
PORT=3001

// src/config.js
import 'dotenv/config';
export const config = {
  openaiKey: process.env.OPENAI_API_KEY,
  embeddingModel: process.env.OPENAI_EMBEDDING_MODEL,
  chatModel: process.env.OPENAI_CHAT_MODEL,
  faissIndexPath: process.env.FAISS_INDEX_PATH,
  transcriptsDir: process.env.TRANSCRIPTS_DIR,
  videosDir: process.env.VIDEOS_DIR,
  audioDir: process.env.AUDIO_DIR,
  port: parseInt(process.env.PORT, 10) || 3001,
};
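
Because every later script reads from this config object, it can help to fail fast when a variable is missing rather than discover it mid-pipeline. The following is a minimal sketch and not part of the original project; `findMissingEnv` and `REQUIRED_ENV_KEYS` are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper (not in the original project): report which required
// environment variables are unset or blank so the app can fail fast.
const REQUIRED_ENV_KEYS = [
  'OPENAI_API_KEY',
  'OPENAI_EMBEDDING_MODEL',
  'OPENAI_CHAT_MODEL',
  'FAISS_INDEX_PATH',
  'TRANSCRIPTS_DIR',
  'VIDEOS_DIR',
  'AUDIO_DIR',
];

function findMissingEnv(env, requiredKeys = REQUIRED_ENV_KEYS) {
  return requiredKeys.filter((key) => {
    const value = env[key];
    return typeof value !== 'string' || value.trim() === '';
  });
}

// Example usage near the top of src/config.js:
// const missing = findMissingEnv(process.env);
// if (missing.length > 0) {
//   throw new Error(`Missing environment variables: ${missing.join(', ')}`);
// }
```

PORT is left out of the required list because config.js already falls back to 3001.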

Step 1 — Extracting and Transcribing Audio from Video Files

The OpenAI Whisper API accepts audio files up to 25MB. At 128kbps, the audio track of a typical MP4 alone comes to roughly 57MB per hour, and the full video file is far larger. You therefore need to extract the audio track, re-encode it at a lower bitrate, and split it into segments before sending anything to the API.
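
The size arithmetic behind that limit can be sketched as follows. This is an illustrative calculation, not part of the pipeline; the function names are hypothetical, and the limit is treated as 25,000,000 bytes as a conservative assumption.

```javascript
// Assumption: treat the 25MB upload limit as 25,000,000 bytes (conservative).
const WHISPER_LIMIT_BYTES = 25_000_000;

// Size of a constant-bitrate audio stream, ignoring container overhead.
function estimateAudioBytes(bitrateKbps, seconds) {
  return (bitrateKbps * 1000 / 8) * seconds;
}

// Longest segment (whole seconds) that stays under the limit at a given bitrate.
function maxSegmentSeconds(bitrateKbps, limitBytes = WHISPER_LIMIT_BYTES) {
  return Math.floor((limitBytes * 8) / (bitrateKbps * 1000));
}

console.log(estimateAudioBytes(128, 3600)); // 57600000 bytes, about 57.6MB per hour
console.log(estimateAudioBytes(64, 600));   // 4800000 bytes, about 4.8MB per 10 minutes
console.log(maxSegmentSeconds(64));         // 3125 seconds, about 52 minutes
```

Ten-minute segments at 64kbps leave a wide margin under the limit, which is why the ffmpeg settings in this step use those values.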

Using ffmpeg to Extract Audio Tracks from MP4/WebM Files

The Node.js child_process.spawn approach gives you streaming stderr output, which is useful for logging progress on long videos.

// src/transcribe.js
import { spawn } from 'child_process';
import path from 'path';
import { config } from './config.js';

export async function extractAndSegmentAudio(videoFile) {
  const baseName = path.basename(videoFile, path.extname(videoFile));
  const outputPattern = path.join(config.audioDir, `${baseName}_seg%03d.mp3`);

  return new Promise((resolve, reject) => {
    // Extract audio as MP3, segment every 600 seconds (10 minutes)
    // 10-minute segments at 64kbps mono = ~4.8MB — well under the 25MB Whisper limit
    const ffmpeg = spawn('ffmpeg', [
      '-i', videoFile,
      '-vn',                    // no video
      '-acodec', 'libmp3lame', // encode as MP3
      '-ab', '64k',            // 64kbps bitrate keeps files small
      '-ac', '1',              // mono audio (Whisper handles mono fine)
      '-ar', '16000',          // 16kHz sample rate — Whisper's native rate
      '-f', 'segment',
      '-segment_time', '600',  // 600 seconds = 10 minutes
      '-reset_timestamps', '1',
      outputPattern,
    ]);

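    const segmentFiles = [];
    ffmpeg.stderr.on('data', (data) => {
      // ffmpeg writes progress to stderr, not stdout. A single chunk can contain
      // several "Opening" lines, so collect every match rather than only the first.
      // A line split across two chunks would still be missed; listing config.audioDir
      // after ffmpeg exits is a more robust alternative.
      for (const match of data.toString().matchAll(/Opening '(.+?)' for writing/g)) {
        segmentFiles.push(match[1]);
      }
    });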

    ffmpeg.on('close', (code) => {
      if (code !== 0) return reject(new Error(`ffmpeg exited with code ${code}`));
      resolve(segmentFiles);
    });

    ffmpeg.on('error', reject);
  });
}

Sending Audio Chunks to OpenAI Whisper API for Transcription

// src/transcribe.js (continued): config is already imported at the top of this file
import OpenAI from 'openai';
import fs from 'fs';

const openai = new OpenAI({ apiKey: config.openaiKey });

export async function transcribeSegment(audioFilePath, segmentIndex, videoMeta) {
  const fileStream = fs.createReadStream(audioFilePath);

  const response = await openai.audio.transcriptions.create({
    file: fileStream,
    model: 'whisper-1',
    response_format: 'verbose_json',  // needed for timestamps; we request segment-level ones
    timestamp_granularities: ['segment'],
    // prompt helps Whisper handle domain-specific terms correctly
    prompt: videoMeta.whisperPrompt || '',
    language: 'en',
  });

  // Each segment object in verbose_json has: id, seek, start, end, text
  // We add the segment offset so timestamps refer to the full video, not just the chunk
  const segmentOffsetSeconds = segmentIndex * 600;

  return response.segments.map((seg) => ({
    text: seg.text.trim(),
    start: seg.start + segmentOffsetSeconds,
    end: seg.end + segmentOffsetSeconds,
    chunkIndex: segmentIndex,
  }));
}

Note: response_format: 'verbose_json' is what makes the API return the segments array with per-segment timing. timestamp_granularities defaults to ['segment'], so setting it here is explicit rather than required. Add 'word' to the array if you also want word-level timestamps.
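
The offset calculation in transcribeSegment assumes the array position of each file matches its segment number. A minimal sketch of deriving the number from the file name instead is shown below; `segmentIndexFromFile` and `toAbsoluteSeconds` are hypothetical helpers, and the result is approximate because ffmpeg may not cut every segment at exactly 600 seconds.

```javascript
// Hypothetical helper: recover the segment number from a file name such as
// "iss-expedition47-qa_seg004.mp3" (the "_seg%03d.mp3" pattern used earlier).
function segmentIndexFromFile(filePath) {
  const match = /_seg(\d+)\.mp3$/.exec(filePath);
  if (!match) {
    throw new Error(`Not a segment file name: ${filePath}`);
  }
  return Number.parseInt(match[1], 10);
}

// Convert a chunk-relative timestamp into a full-video timestamp.
// Assumption: every segment except the last is exactly segmentSeconds long.
function toAbsoluteSeconds(filePath, relativeSeconds, segmentSeconds = 600) {
  return segmentIndexFromFile(filePath) * segmentSeconds + relativeSeconds;
}

console.log(toAbsoluteSeconds('data/audio/talk_seg002.mp3', 12.5)); // 1212.5
```

Deriving the index from the name keeps timestamps correct even if the file list is later re-sorted or a segment is retried on its own.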


Step 2 — Structuring Transcripts with Timestamps and Metadata

Raw Whisper output is useful but not queryable. The goal here is to enrich each transcript segment with the speaker identity, topic tags, and source file reference — the same metadata structure used by Q&A archives like Ask an Astronaut, which organizes 333+ hours of footage by crew member, mission, and topic so viewers can jump directly to relevant moments.

Parsing Whisper verbose_json Response and Associating Metadata

// src/transcribe.js (continued)
// Same file as the functions above, so path, config, extractAndSegmentAudio, and
// transcribeSegment are already in scope. Only writeFile is new; importing it by
// name avoids clashing with the existing fs import.
import { writeFile } from 'fs/promises';

export async function buildTranscriptDocument(videoFile, meta) {
  // meta shape: { speaker: string, date: string, topics: string[], whisperPrompt: string }
  const segmentFiles = await extractAndSegmentAudio(videoFile);
  const allSegments = [];

  // Transcribe sequentially so segment order (and therefore timestamps) is preserved
  for (let i = 0; i < segmentFiles.length; i++) {
    const segments = await transcribeSegment(segmentFiles[i], i, meta);
    allSegments.push(...segments);
  }

  const document = {
    id: path.basename(videoFile, path.extname(videoFile)),
    sourceFile: videoFile,
    speaker: meta.speaker,
    date: meta.date,
    topics: meta.topics,
    durationSeconds: allSegments.at(-1)?.end ?? 0,
    segments: allSegments,
    createdAt: new Date().toISOString(),
  };

  const outPath = path.join(config.transcriptsDir, `${document.id}.json`);
  await writeFile(outPath, JSON.stringify(document, null, 2));
  console.log(`Saved transcript: ${outPath}`);
  return document;
}

A saved JSON document looks like this:

{
  "id": "iss-expedition47-qa",
  "sourceFile": "./data/videos/iss-expedition47-qa.mp4",
  "speaker": "Cmdr. Jeff Williams",
  "date": "2016-06-15",
  "topics": ["EVA", "microgravity", "life support"],
  "durationSeconds": 3842,
  "segments": [
    { "text": "The SAFER unit is your last resort if you become untethered.", "start": 142.3, "end": 147.1, "chunkIndex": 0 },
    { "text": "EMU gloves lose dexterity below minus 150 Fahrenheit.", "start": 147.1, "end": 151.8, "chunkIndex": 0 }
  ],
  "createdAt": "2025-01-10T09:23:11.000Z"
}
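
Since the embedding step depends on this shape, a quick sanity check before indexing may catch malformed documents early. The validator below is a sketch under the assumption that the fields shown above are all required; `validateTranscriptDocument` is a hypothetical name.

```javascript
// Hypothetical validator: returns a list of human-readable problems,
// or an empty array when the document matches the shape shown above.
function validateTranscriptDocument(doc) {
  const problems = [];
  for (const key of ['id', 'sourceFile', 'speaker', 'date']) {
    if (typeof doc[key] !== 'string' || doc[key].length === 0) {
      problems.push(`missing or empty field: ${key}`);
    }
  }
  if (!Array.isArray(doc.topics)) problems.push('topics must be an array');
  if (!Array.isArray(doc.segments)) {
    problems.push('segments must be an array');
    return problems;
  }
  doc.segments.forEach((seg, i) => {
    if (typeof seg.text !== 'string' || seg.text.length === 0) {
      problems.push(`segment ${i}: empty text`);
    }
    if (!(seg.start >= 0) || !(seg.end >= seg.start)) {
      problems.push(`segment ${i}: invalid start/end`);
    }
    if (i > 0 && seg.start < doc.segments[i - 1].start) {
      problems.push(`segment ${i}: starts before the previous segment`);
    }
  });
  return problems;
}
```

Returning a list instead of throwing on the first problem makes it easier to log every issue in a batch of transcripts at once.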

Step 3 — Embedding Transcripts and Building a Vector Index

Semantic search only works when your transcript segments are represented as dense vectors in the same embedding space as incoming user questions. We use text-embedding-3-small — it's 5x cheaper than text-embedding-ada-002 and performs better on retrieval benchmarks.

Chunking Strategy

Rather than embedding individual Whisper segments (which can be 3–5 words), we group them into ~300-token passages with a 50-token overlap. This ensures context isn't cut off at boundaries.
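
To get a feel for how many vectors this strategy produces, here is a rough estimator. It is a sketch only: the function names are hypothetical, the 12,000 tokens-per-hour figure assumes about 150 spoken words per minute, and real counts will differ because chunks end on segment boundaries.

```javascript
// Approximate number of chunks produced by a sliding window over a transcript.
// Each chunk after the first contributes (maxTokens - overlapTokens) new tokens.
function estimateChunkCount(totalTokens, maxTokens = 300, overlapTokens = 50) {
  if (totalTokens <= 0) return 0;
  if (totalTokens <= maxTokens) return 1;
  const stride = maxTokens - overlapTokens;
  return 1 + Math.ceil((totalTokens - maxTokens) / stride);
}

// Raw vector storage for a flat float32 index (metadata not included).
function estimateIndexBytes(chunkCount, dimension = 1536) {
  return chunkCount * dimension * 4;
}

// Assumption: roughly 12,000 tokens per hour of speech.
const chunksPerHour = estimateChunkCount(12000);
console.log(chunksPerHour);                     // 48
console.log(estimateIndexBytes(chunksPerHour)); // 294912 bytes, under 0.3MB per hour
```

At that rate even a few hundred hours of video needs well under 1GB for the vectors themselves.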

FAISS vs. Supabase pgvector

| Option | Cost | Query Speed | SQL Filtering | Setup Effort |
|---|---|---|---|---|
| FAISS (local) | Free | Very fast (in-memory) | No | Low |
| Supabase pgvector | ~$25/mo (Pro) | Fast (indexed) | Yes | Medium |

For archives under ~50 hours, FAISS is usually the simplest and fastest option. Beyond that, Supabase pgvector lets you filter by speaker = 'X' at the SQL level before the vector search, which can noticeably improve precision when a question targets one speaker.
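
If you stay on FAISS, one workaround for the missing SQL filter is to over-fetch candidates and filter them in application code. The sketch below assumes you keep a parallel array of per-chunk metadata records (with a speaker field) alongside the index; `filterBySpeaker` and the sample data are hypothetical.

```javascript
// Hypothetical post-filter: FAISS has no WHERE clause, so over-fetch
// candidates and keep only those whose metadata matches the speaker.
function filterBySpeaker(labels, metadata, speaker, topK) {
  const matches = [];
  for (const label of labels) {
    const entry = metadata[label];
    if (entry && entry.speaker === speaker) {
      matches.push(entry);
      if (matches.length === topK) break;
    }
  }
  return matches;
}

// Stand-in data. In the real pipeline, labels would come from a vector search
// that requests more results than topK, ordered from nearest to farthest.
const sampleMetadata = [
  { speaker: 'A', text: 'first' },
  { speaker: 'B', text: 'second' },
  { speaker: 'A', text: 'third' },
];
console.log(filterBySpeaker([1, 2, 0], sampleMetadata, 'A', 1)); // [ { speaker: 'A', text: 'third' } ]
```

Post-filtering can return fewer than topK results when the requested speaker is rare, which is the main case where filtering before the search (as pgvector allows) pays off.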

Batch Embedding with Exponential Backoff

// src/embed.js
import OpenAI from 'openai';
import { IndexFlatL2 } from 'faiss-node';
import fs from 'fs/promises';
import path from 'path';
import { config } from './config.js';

const openai = new OpenAI({ apiKey: config.openaiKey });

async function embedWithRetry(texts, retries = 5) {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      const response = await openai.embeddings.create({
        model: config.embeddingModel,  // text-embedding-3-small
        input: texts,
      });
      return response.data.map((d) => d.embedding);
    } catch (err) {
      if (err.status === 429 && attempt < retries - 1) {
        const delay = Math.pow(2, attempt) * 1000 + Math.random() * 500;
        console.warn(`Rate limited. Retrying in ${Math.round(delay)}ms...`);
        await new Promise((r) => setTimeout(r, delay));
      } else {
        throw err;
      }
    }
  }
}

function chunkSegments(segments, maxTokens = 300, overlapTokens = 50) {
  // Rough token estimate: 1 token ≈ 4 characters
  const chunks = [];
  let currentChunk = [];
  let currentLen = 0;

  for (const seg of segments) {
    const tokenEst = Math.ceil(seg.text.length / 4);
    if (currentLen + tokenEst > maxTokens && currentChunk.length > 0) {
      chunks.push([...currentChunk]);
      // Keep last overlap worth of segments for continuity
      const overlapChars = overlapTokens * 4;
      let trimLen = 0;
      const overlapSegs = [];
      for (let i = currentChunk.length - 1; i >= 0; i--) {
        trimLen += currentChunk[i].text.length;
        overlapSegs.unshift(currentChunk[i]);
        if (trimLen >= overlapChars) break;
      }
      currentChunk = overlapSegs;
      currentLen = overlapSegs.reduce((s, seg) => s + Math.ceil(seg.text.length / 4), 0);
    }
    currentChunk.push(seg);
    currentLen += tokenEst;
  }
  if (currentChunk.length > 0) chunks.push(currentChunk);
  return chunks;
}

export async function buildIndex(transcriptDocuments) {
  const DIMENSION = 1536; // text-embedding-3-small output dimension
  const index = new IndexFlatL2(DIMENSION);
  const metadata = []; // parallel array to FAISS vectors

  for (const doc of transcriptDocuments) {
    const chunks = chunkSegments(doc.segments);
    const BATCH_SIZE = 100;

    for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
      const batch = chunks.slice(i, i + BATCH_SIZE);
      const texts = batch.map((segs) => segs.map((s) => s.text).join(' '));
      const embeddings = await embedWithRetry(texts);

      embeddings.forEach((vector, j) => {
        index.add(vector);
        const segs = batch[j];
        metadata.push({
          docId: doc.id,
          sourceFile: doc.sourceFile,
          speaker: doc.speaker,
          startSeconds: segs[0].start,
          endSeconds: segs[segs.length - 1].end,
          text: texts[j],
        });
      });
    }
  }

  index.write(config.faissIndexPath);
  await fs.writeFile(
    config.faissIndexPath + '.meta.json',
    JSON.stringify(metadata, null, 2)
  );
  console.log(`Index built: ${metadata.length} chunks from ${transcriptDocuments.length} videos`);
  return { index, metadata };
}

Note: index.write(path) persists the FAISS index to disk. Call IndexFlatL2.read(path) on server startup to reload it without rebuilding.


Step 4 — Building the Semantic Search and Answer Generation Pipeline

This is the RAG core: embed the user's question in the same vector space as your transcript chunks, retrieve the most relevant passages, and feed them to GPT-4o with a strict system prompt that anchors the answer to the provided context.

// src/query.js
import OpenAI from 'openai';
import { IndexFlatL2 } from 'faiss-node';
import fs from 'fs/promises';
import { config } from './config.js';

const openai = new OpenAI({ apiKey: config.openaiKey });

let _index = null;
let _metadata = null;

async function loadIndex() {
  if (_index && _metadata) return { index: _index, metadata: _metadata };
  _index = IndexFlatL2.read(config.faissIndexPath);
  const raw = await fs.readFile(config.faissIndexPath + '.meta.json', 'utf-8');
  _metadata = JSON.parse(raw);
  return { index: _index, metadata: _metadata };
}

export async function queryVideoArchive(question, topK = 3) {
  const { index, metadata } = await loadIndex();

  // Step 1: Embed the user question
  const qEmbedResponse = await openai.embeddings.create({
    model: config.embeddingModel,
    input: [question],
  });
  const questionVector = qEmbedResponse.data[0].embedding;

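  // Step 2: Nearest-neighbor search. IndexFlatL2 ranks by L2 distance, which orders
  // results the same way as cosine similarity when vectors are unit length
  // (text-embedding-3-small outputs are already normalized).
  // faiss-node returns flat arrays for a single query, so labels is number[].
  // Clamp k so it never exceeds the number of stored vectors.
  const k = Math.min(topK, index.ntotal());
  const { labels } = index.search(questionVector, k);
  const retrievedChunks = labels.map((idx) => metadata[idx]).filter(Boolean);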

  // Step 3: Build the system prompt with retrieved passages
  const contextBlock = retrievedChunks
    .map((chunk, i) =>
      `[Source ${i + 1}] Speaker: ${chunk.speaker} | Video: ${chunk.docId} | Time: ${chunk.startSeconds.toFixed(1)}s\n"${chunk.text}"`
    )
    .join('\n\n');

  const systemPrompt = `You are an expert assistant answering questions about a video archive of astronaut Q&A sessions.
Answer the user's question using ONLY the transcript excerpts provided below.
If the answer is not present in the excerpts, say: "I couldn't find a clear answer in the indexed videos."
Always cite the source number (e.g., [Source 1]) when using information from a specific excerpt.
Do not speculate beyond what the transcripts say.

TRANSCRIPT EXCERPTS:
${contextBlock}`;

  // Step 4: Call GPT-4o
  const completion = await openai.chat.completions.create({
    model: config.chatModel,  // gpt-4o
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: question },
    ],
    temperature: 0.2,  // low temperature keeps answers close to the excerpts
    max_tokens: 512,
  });

  const answer = completion.choices[0].message.content;

  return {
    answer,
    sources: retrievedChunks.map((chunk) => ({
      docId: chunk.docId,
      sourceFile: chunk.sourceFile,
      speaker: chunk.speaker,
      startSeconds: chunk.startSeconds,
      text: chunk.text,
    })),
  };
}

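Note: Keep temperature low (0.2 here) on the GPT-4o call. Higher values make the model more likely to extrapolate beyond what's in the transcripts, which undermines RAG grounding. A low temperature reduces this risk but does not eliminate it, so keep the refusal instruction in the system prompt as well.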
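
A complementary safeguard is to drop weak retrieval results before they reach the prompt. For unit-length vectors, squared L2 distance and cosine similarity are related by ‖a − b‖² = 2 − 2·cos(a, b). The sketch below assumes the search returns squared L2 distances, as FAISS flat L2 indexes do; the helper names and the 0.3 threshold are placeholders to tune on your own data.

```javascript
// For unit-length vectors a and b: ||a - b||^2 = 2 - 2 * cos(a, b).
// Assumption: distances are squared L2 values from a flat L2 index.
function cosineFromSquaredL2(squaredDistance) {
  return 1 - squaredDistance / 2;
}

// Hypothetical guard: keep only labels whose similarity clears a threshold.
// The 0.3 default is an arbitrary placeholder, not a recommended value.
function keepConfidentMatches(labels, distances, minCosine = 0.3) {
  return labels.filter((label, i) =>
    label >= 0 && cosineFromSquaredL2(distances[i]) >= minCosine
  );
}

console.log(cosineFromSquaredL2(0)); // 1 (identical direction)
console.log(cosineFromSquaredL2(2)); // 0 (orthogonal)
console.log(keepConfidentMatches([4, 9, 7], [0.6, 1.2, 1.8])); // [ 4, 9 ]
```

If nothing clears the threshold, returning the "couldn't find a clear answer" message directly avoids a chat completion call altogether.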


Step 5 — Surfacing Deep Links Back to Source Video Timestamps

The real value of this system isn't just the answer — it's that users can verify the answer by jumping directly to the moment in the video where it was said. This mirrors how archives like Ask an Astronaut let viewers navigate 333+ hours of footage without watching everything.

Generating HTML5 Video Seek Links

HTML5 video elements support the #t= fragment identifier: video.mp4#t=142.3 tells the browser to seek to 142.3 seconds on load. For YouTube, the equivalent is the t query parameter in whole seconds: &t=142 on a watch?v= URL, or ?t=142 on a youtu.be short link.
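
Both link formats can be generated with small helpers like the ones sketched here. The function names are hypothetical, the /videos/ prefix assumes files are served from that route, and the YouTube variant assumes you store a video ID in your metadata.

```javascript
// Build a media-fragment seek link for a self-hosted file, e.g. "/videos/a.mp4#t=142".
function buildSeekUrl(fileName, startSeconds) {
  const seconds = Math.max(0, Math.floor(startSeconds));
  return `/videos/${encodeURIComponent(fileName)}#t=${seconds}`;
}

// Build a YouTube watch link; videoId is assumed to be stored in your metadata.
function buildYouTubeUrl(videoId, startSeconds) {
  const seconds = Math.max(0, Math.floor(startSeconds));
  return `https://www.youtube.com/watch?v=${encodeURIComponent(videoId)}&t=${seconds}`;
}

// Human-readable label for the UI, e.g. 3842 -> "1:04:02".
function formatTimestamp(totalSeconds) {
  const s = Math.floor(totalSeconds);
  const hours = Math.floor(s / 3600);
  const minutes = Math.floor((s % 3600) / 60);
  const seconds = s % 60;
  const mm = hours > 0 ? String(minutes).padStart(2, '0') : String(minutes);
  const ss = String(seconds).padStart(2, '0');
  return hours > 0 ? `${hours}:${mm}:${ss}` : `${mm}:${ss}`;
}

console.log(buildSeekUrl('iss-expedition47-qa.mp4', 142.3)); // /videos/iss-expedition47-qa.mp4#t=142
console.log(formatTimestamp(142.3)); // 2:22
console.log(formatTimestamp(3842));  // 1:04:02
```

Encoding the file name keeps links valid when video files contain spaces or other reserved characters.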

Express.js API Endpoint

// src/server.js
import express from 'express';
import path from 'path';
import { queryVideoArchive } from './query.js';
import { config } from './config.js';

const app = express();
app.use(express.json());

// Serve local video files so the frontend can display them with seek links
app.use('/videos', express.static(path.resolve(config.videosDir)));

app.post('/api/ask', async (req, res) => {
  const { question } = req.body;
  if (!question || typeof question !== 'string' || question.trim().length < 3) {
    return res.status(400).json({ error: 'question must be a string of at least 3 characters' });
  }

  try {
    const result = await queryVideoArchive(question.trim(), 3);

    // Enrich each source with a deep-link URL the frontend can use directly
    const enrichedSources = result.sources.map((src) => {
      const fileName = path.basename(src.sourceFile);
      // Encode the file name so spaces and reserved characters do not break the URL
      const videoUrl = `/videos/${encodeURIComponent(fileName)}#t=${Math.floor(src.startSeconds)}`;

      return {
        file: fileName,
        startSeconds: src.startSeconds,
        speaker: src.speaker,
        excerpt: src.text.slice(0, 200),  // preview for UI tooltip
        videoUrl,                          // HTML5 seek link
        youtubeUrl: null,                  // populate if you store YT video IDs in metadata
      };
    });

    res.json({
      answer: result.answer,
      sources: enrichedSources,
    });
  } catch (err) {
    console.error('Query error:', err);
    res.status(500).json({ error: 'Internal server error' });
  }
});

app.listen(config.port, () => {
  console.log(`Video Q&A API running on http://localhost:${config.port}`);
});

A response from POST /api/ask looks like:

{
  "answer": "According to Cmdr. Williams [Source 1], the SAFER unit is your last resort if you become untethered during an EVA. It provides short-duration propulsion to return to the airlock.",
  "sources": [
    {
      "file": "iss-expedition47-qa.mp4",
      "startSeconds": 142.3,
      "speaker": "Cmdr. Jeff Williams",
      "excerpt": "The SAFER unit is your last resort if you become untethered.",
      "videoUrl": "/videos/iss-expedition47-qa.mp4#t=142",
      "youtubeUrl": null
    }
  ]
}

On the frontend, wire videoUrl directly to a <video src="..."> element or an anchor tag, and the browser handles the seek automatically.


Common Issues and Fixes

| Issue | Fix |
|---|---|
| Whisper returns garbled text for acronyms (ISS, EVA, SAFER, EMU) | Pass a prompt parameter with domain vocabulary: prompt: "ISS, EVA, SAFER, EMU, ECLSS, spacewalk, microgravity" |
| 429 Too Many Requests on embedding batches | Use exponential backoff (already in embedWithRetry above); also reduce BATCH_SIZE from 100 to 20 |
| FAISS index lost on server restart | Call index.write(path) after every build; call IndexFlatL2.read(path) on startup |
| GPT-4o answers questions not covered in retrieved chunks | Add to system prompt: "If the answer is not present in the excerpts, say 'I couldn't find a clear answer'" and set temperature: 0.2 |
| ffmpeg No such file or directory error | Confirm ffmpeg is on PATH with which ffmpeg; on Windows, add ffmpeg bin/ folder to system environment variables |

Error: Whisper Returns Garbled Domain-Specific Text

Whisper may not reliably recognize your organization's acronyms and jargon. The prompt parameter works as a priming hint: Whisper is biased toward the spellings and terms that appear in the prompt. For a space agency archive:

const NASA_PROMPT = 'ISS, EVA, SAFER, EMU, ECLSS, MMOD, RWS, JAXA, Roscosmos, Soyuz, spacewalk, microgravity, depressurization';
// Pass this as the prompt field in openai.audio.transcriptions.create
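
Long vocabulary lists are worth trimming, since OpenAI's speech-to-text guide notes that Whisper only considers roughly the final 224 tokens of the prompt. The sketch below builds a prompt from a term list within a rough budget; `buildWhisperPrompt` is a hypothetical helper, and the 4-characters-per-token estimate is the same approximation used for chunking.

```javascript
// Hypothetical helper: join a vocabulary list into a Whisper prompt while
// staying within a rough token budget (1 token is about 4 characters).
function buildWhisperPrompt(terms, maxTokens = 224) {
  const maxChars = maxTokens * 4;
  const seen = new Set();
  const kept = [];
  let length = 0;
  for (const raw of terms) {
    const term = raw.trim();
    if (term === '' || seen.has(term)) continue;
    const added = (kept.length > 0 ? 2 : 0) + term.length; // 2 accounts for ", "
    if (length + added > maxChars) break;
    seen.add(term);
    kept.push(term);
    length += added;
  }
  return kept.join(', ');
}

console.log(buildWhisperPrompt(['ISS', 'EVA', 'ISS', ' SAFER '])); // ISS, EVA, SAFER
```

Put the most error-prone terms first so they survive if the list has to be cut short.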

Error: FAISS Index Not Persisting Between Restarts

// After building the index:
index.write('./index/transcripts.index');  // binary FAISS file

// On server startup:
import { IndexFlatL2 } from 'faiss-node';
const index = IndexFlatL2.read('./index/transcripts.index');

Error: GPT-4o Hallucinating Answers Outside the Transcript

The fix is two-part: a strict system prompt instruction and a low temperature. The instruction alone is often not enough: at temperature: 1.0, GPT-4o is more likely to paraphrase loosely or fill gaps with outside knowledge.


FAQ

Q: Can I use this approach with YouTube videos instead of local files?

Yes — use yt-dlp to download audio directly: yt-dlp -x --audio-format mp3 -o '%(id)s.%(ext)s' <URL>. Store the YouTube video ID in your document metadata, then construct deep links as https://www.youtube.com/watch?v=VIDEO_ID&t=SECONDS. The rest of the pipeline is identical. Be aware of YouTube's Terms of Service regarding automated downloading for indexing purposes — review them before deploying at scale.

Q: How many hours of video can I index before hitting significant OpenAI costs?

Here's a rough cost table at January 2025 pricing:

| Operation | Cost | Per Unit |
|---|---|---|
| Whisper transcription | $0.006 | per minute of audio |
| text-embedding-3-small | $0.020 | per 1M tokens |
| GPT-4o (input) | $2.50 | per 1M tokens |
| GPT-4o (output) | $10.00 | per 1M tokens |

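For 100 hours of video: Whisper costs ~$36 (6,000 minutes × $0.006). At a typical speaking rate of about 150 words per minute, the transcripts come to roughly 1.2M tokens, so embedding them costs about $0.03 even with chunk overlap. Each user query costs roughly $0.01 or less with GPT-4o at the settings above (about 1,000 input tokens and up to 512 output tokens). Total indexing cost for 100 hours: under $40.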
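
The indexing estimate can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions (about 150 spoken words per minute and about 1.33 tokens per word); `estimateIndexingCost` and `PRICES` are hypothetical names, and the prices are copied from the table above.

```javascript
// Prices copied from the table above (USD); update them if pricing changes.
const PRICES = {
  whisperPerMinute: 0.006,
  embeddingPerMillionTokens: 0.02,
};

// Assumptions: about 150 spoken words per minute and about 1.33 tokens per word.
function estimateIndexingCost(hours, wordsPerMinute = 150, tokensPerWord = 1.33) {
  const minutes = hours * 60;
  const tokens = minutes * wordsPerMinute * tokensPerWord;
  const whisper = minutes * PRICES.whisperPerMinute;
  const embedding = (tokens / 1e6) * PRICES.embeddingPerMillionTokens;
  return { tokens, whisper, embedding, total: whisper + embedding };
}

const estimate = estimateIndexingCost(100);
console.log(Math.round(estimate.tokens));   // 1197000
console.log(estimate.whisper.toFixed(2));   // 36.00
console.log(estimate.embedding.toFixed(3)); // 0.024
```

Transcription dominates the bill, so the speaking-rate assumption barely changes the total.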

Q: Is GPT-4o necessary, or can I use gpt-3.5-turbo to reduce costs?

gpt-3.5-turbo works well for straightforward factual retrieval at a lower cost per token (roughly 5x cheaper than GPT-4o on input tokens at early-2025 list prices). The trade-off is instruction following: GPT-3.5 is more likely to ignore your "only answer from the provided context" instruction and hallucinate when retrieved chunks are only partially relevant. For production archives where answer accuracy matters (like a public-facing Q&A tool), stay with GPT-4o. For internal prototyping or high-volume, low-stakes queries, gpt-3.5-turbo is a reasonable choice; just tighten the system prompt further and add a confidence threshold check on your retrieval similarity scores.
