Claude Skills Guide

Building a Chrome extension for live meeting transcription requires understanding browser permissions, Web Speech API integration, and audio capture mechanisms. This guide covers the technical implementation for developers and power users who want to create or customize real-time transcription tools.

Understanding the Core Components

A live meeting transcription extension relies on three main pillars: audio capture, speech recognition, and display/output. Chrome provides APIs for each layer, though each comes with specific constraints you need to work around.

The Web Speech API serves as the foundation for speech-to-text conversion. Modern Chrome versions support both SpeechRecognition and webkitSpeechRecognition interfaces. These APIs process audio streams locally within the browser, which has privacy implications worth considering for enterprise deployments.

// Initialize speech recognition
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();

recognition.continuous = true;
recognition.interimResults = true;
recognition.lang = 'en-US';

recognition.onresult = (event) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const transcript = event.results[i][0].transcript;
    const isFinal = event.results[i].isFinal;
    // Handle interim vs final results
  }
};

The continuous flag keeps recognition active across pauses, while interimResults provides real-time feedback as speakers talk. Setting the correct language code matters significantly for accuracy—always match it to your expected speakers.

Audio Source Handling

Meeting transcription extensions must handle multiple audio sources: microphone input, system audio (via tab capture), and external sources. Chrome’s getUserMedia API handles microphone access, while chrome.tabCapture enables capturing audio from browser tabs running video conferencing software.

// Request microphone access
async function getMicrophoneStream() {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    return stream;
  } catch (err) {
    console.error('Microphone access denied:', err);
  }
}

// Capture tab audio (requires permissions)
async function captureTabAudio(tabId) {
  const stream = await chrome.tabCapture.capture({
    audio: true,
    video: false
  });
  return stream;
}

The chrome.tabCapture API requires the tabCapture permission in your manifest and triggers a user prompt. Tab capture only works when the extension icon is clicked—this is a security restriction you cannot bypass.

For Zoom, Google Meet, and similar platforms, you may need to inject content scripts to access their audio streams directly. This approach works because the audio is already decoded in the page context:

// Content script injection pattern
chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
  if (request.action === 'getAudioData') {
    const audioElement = document.querySelector('audio');
    // Process audio element for transcription
    sendResponse({ success: true });
  }
});

Manifest Configuration

Your manifest.json must declare the right permissions. For a full-featured transcription extension:

{
  "manifest_version": 3,
  "name": "Live Meeting Transcriber",
  "version": "1.0",
  "permissions": [
    "tabCapture",
    "scripting",
    "activeTab"
  ],
  "host_permissions": [
    "*://*.zoom.us/*",
    "*://*.meet.google.com/*",
    "*://*.teams.microsoft.com/*"
  ],
  "background": {
    "service_worker": "background.js"
  }
}

Manifest V3 introduced stricter background worker limits. Speech recognition creates unique challenges because SpeechRecognition objects must remain instantiated. Consider using a persistent background page (via "background": {"page": "background.html"}} instead of service worker) if you need continuous recognition, though this approach has limitations.

Real-Time Display and Timestamping

Live transcription requires efficient UI updates. React the DOM directly from your content script or popup, but batch updates to prevent performance degradation:

class TranscriptionDisplay {
  constructor() {
    this.buffer = [];
    this.updateInterval = null;
  }

  start() {
    this.updateInterval = setInterval(() => this.flush(), 500);
  }

  addEntry(text, timestamp, speaker = 'Unknown') {
    this.buffer.push({ text, timestamp, speaker });
  }

  flush() {
    if (this.buffer.length === 0) return;
    
    const display = document.getElementById('transcription-output');
    this.buffer.forEach(entry => {
      const line = document.createElement('div');
      line.className = 'transcript-line';
      line.innerHTML = `<span class="timestamp">${entry.timestamp}</span> <span class="speaker">${entry.speaker}:</span> ${entry.text}`;
      display.appendChild(line);
    });
    
    this.buffer = [];
  }
}

Adding timestamps helps users navigate long transcripts. Store sessions in chrome.storage to enable searching past meetings:

// Save transcript to storage
async function saveTranscript(sessionId, transcriptData) {
  await chrome.storage.local.set({
    [`transcript_${sessionId}`]: {
      timestamp: Date.now(),
      entries: transcriptData
    }
  });
}

Handling API Limitations

The Web Speech API has constraints you should plan for. It requires an active internet connection for Chrome’s cloud-based recognition (free tier). Recognition accuracy varies by accent, audio quality, and background noise. The API may stop unexpectedly—implement robust reconnection logic:

recognition.onend = () => {
  // Automatically restart recognition
  if (shouldContinueRecording) {
    setTimeout(() => recognition.start(), 100);
  }
};

recognition.onerror = (event) => {
  console.error('Recognition error:', event.error);
  if (event.error === 'no-speech') {
    // Expected in silent moments, restart quietly
    recognition.start();
  }
};

Speaker Diarization Challenges

Distinguishing between speakers remains difficult with the base Web Speech API. Several approaches help:

  1. Spatial audio analysis: Use Web Audio API to analyze audio input directionality when multiple microphones exist
  2. Voice activity detection: Segment audio by pause patterns to estimate speaker changes
  3. External services: Integrate with AssemblyAI, Deepgram, or similar APIs that provide speaker diarization

For enterprise use, combining Chrome extension capture with backend AI services typically yields the best results:

// Send audio chunks to external service
async function sendToTranscriptionService(audioBlob) {
  const formData = new FormData();
  formData.append('audio', audioBlob, 'segment.webm');
  
  const response = await fetch('https://api.transcription.service/v1/transcribe', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`
    },
    body: formData
  });
  
  return response.json();
}

Extension Architecture Recommendations

For production deployments, structure your extension with clear separation:

This separation keeps memory usage low and makes debugging easier. Test extensively across Chrome versions, as speech API behavior varies.

Building a Chrome extension for live meeting transcription involves navigating browser APIs, managing audio streams, and handling the inherent limitations of client-side speech recognition. Start with the Web Speech API for prototyping, then evaluate external services for production accuracy requirements.

Built by theluckystrike — More at zovo.one