Voice Communication Systems
From acoustic pressure waves to digital packets: a complete study of voice capture, modulation, transmission, echo cancellation, and reconstruction.
Voice communication converts acoustic energy into a transmittable form and reconstructs it at the destination. Every system, whether analog telephone, GSM, VoIP, or satellite, follows the same chain: capture → process → encode → transmit → receive → reconstruct.
The engineering challenge is preserving intelligibility while minimizing delay, bandwidth, and noise — across media ranging from copper wire to radio waves to optical fiber.
Human voice spans 80 Hz – 8 kHz. Telephony uses only 300–3,400 Hz (ITU-T G.711) — sufficient for full intelligibility at just 64 kbps.
Analog: PSTN, AM/FM radio. Continuous signals; simple, but susceptible to noise accumulation along the path.
Digital: GSM, VoIP, ISDN. Sampled and quantized; noise-resistant, with support for error correction, compression, and encryption.
Packet (IP): SIP/RTP, WebRTC. Voice is fragmented into IP packets, enabling global routing, conferencing, and data network integration.
| # | Stage | Process | Signal | Key Value |
|---|---|---|---|---|
| 1 | Vocal Cords | Larynx vibrates air forming pressure waves | Acoustic | 80–300 Hz fundamental |
| 2 | Microphone | Converts acoustic pressure → voltage | Analog electrical | Sensitivity: –40 dBV/Pa |
| 3 | Pre-Amp + LPF | Amplify; anti-alias filter ≤ fs/2 | Analog | Gain 20–60 dB |
| 4 | ADC / Sampler | Sample 8 kHz; 8-bit μ-law quantization | Digital | 64 kbps PCM |
| 5 | Codec / Encoder | Compress PCM (G.711, G.729, Opus) | Digital | 8–64 kbps |
| 6 | Modulator | Impress signal on carrier (AM/FM/QAM) | RF Signal | Carrier varies |
| 7 | TX + Antenna | Amplify and radiate through medium | EM/RF | 10 mW – 1 kW |
| 8 | RX + Demod | Receive, demodulate, decode, DAC → speaker | RF → Acoustic | SNR >30 dB |
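Stage 4 of the table relies on μ-law companding to fit speech into 8 bits. The sketch below uses the continuous μ-law formula with μ = 255 followed by a uniform 8-bit quantizer. This is an assumption made for illustration: real G.711 uses a piecewise-linear segment approximation, so the code is not bit-exact. The function names and the test amplitude are hypothetical.

```python
import math

MU = 255.0  # mu-law constant (continuous-formula illustration, not bit-exact G.711)

def mu_compress(x: float) -> float:
    """Map a sample in [-1, 1] to the companded domain [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y: float) -> float:
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize_8bit(v: float) -> int:
    """Uniformly quantize a value in [-1, 1] to a signed code in -127..127."""
    return max(-127, min(127, round(v * 127)))

def companded_roundtrip(x: float) -> float:
    """Compress, quantize to 8 bits, then expand back to linear amplitude."""
    return mu_expand(quantize_8bit(mu_compress(x)) / 127)

def linear_roundtrip(x: float) -> float:
    """Quantize directly to 8 bits without companding."""
    return quantize_8bit(x) / 127

quiet = 0.005  # a low-level sample, where companding helps most
err_mu = abs(companded_roundtrip(quiet) - quiet)
err_lin = abs(linear_roundtrip(quiet) - quiet)
print(f"mu-law error: {err_mu:.6f}  linear error: {err_lin:.6f}")
```

For quiet samples the companded path should show a much smaller reconstruction error than plain 8-bit linear quantization, which is the reason telephony uses it.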
Voice changes representation at each stage of the communication chain. [Figure: five waveforms showing the same speech segment at successive stages.]
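Complex voice decomposes into sinusoidal components: x(t) = Σₖ Aₖ·sin(2π·k·f₀·t + φₖ), where f₀ is the fundamental frequency and k indexes the harmonics.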
Fundamental + harmonics define timbre. Formant peaks F1–F4 determine vowel identity.
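As a rough illustration of this decomposition, the sketch below builds a waveform from a fundamental and two harmonics, then recovers each amplitude by correlating against sine and cosine at that frequency. The fundamental (125 Hz), the amplitudes, and the function names are assumptions chosen so that every harmonic completes a whole number of cycles in the analysis window.

```python
import math

FS = 8000                 # sample rate in Hz (telephony rate)
F0 = 125.0                # assumed fundamental, inside the 80-300 Hz range
AMPS = [1.0, 0.5, 0.25]   # illustrative amplitudes of harmonics 1..3
N = 640                   # 80 ms window: exactly 10 periods of F0

def synthesize(n_samples):
    """Sum of harmonically related sinusoids (zero phase for simplicity)."""
    return [sum(a * math.sin(2 * math.pi * (k + 1) * F0 * n / FS)
                for k, a in enumerate(AMPS))
            for n in range(n_samples)]

def amplitude_at(signal, freq):
    """Estimate the sinusoid amplitude at freq via a single-bin DFT."""
    re = sum(s * math.cos(2 * math.pi * freq * n / FS) for n, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq * n / FS) for n, s in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)

x = synthesize(N)
for k in range(1, 5):
    print(f"{k * F0:6.1f} Hz -> amplitude {amplitude_at(x, k * F0):.3f}")
```

The fourth line of output probes a frequency that was not synthesized, so its amplitude should come out near zero.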
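The sample rate must be at least twice the highest signal frequency for perfect reconstruction (Nyquist–Shannon): fs ≥ 2·fmax.
Telephony: fmax = 4 kHz → fs = 8,000 samples/s; at 8 bits per sample this gives 8,000 × 8 = 64 kbps G.711 PCM.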
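A small sketch of why the anti-alias filter matters: at fs = 8 kHz, a 5 kHz tone produces exactly the same samples as a 3 kHz tone, so the two cannot be told apart after sampling. The helper names are hypothetical, and cosine tones are used so the aliased samples match without a sign flip.

```python
import math

def sample_tone(freq_hz, fs_hz, n_samples):
    """Sample a unit-amplitude cosine at the given rate."""
    return [math.cos(2 * math.pi * freq_hz * n / fs_hz) for n in range(n_samples)]

def alias_frequency(freq_hz, fs_hz):
    """Frequency at which a tone appears after sampling at fs_hz."""
    f = freq_hz % fs_hz
    return min(f, fs_hz - f)

def pcm_bitrate(fs_hz, bits_per_sample):
    """Uncompressed PCM bitrate in bits per second."""
    return fs_hz * bits_per_sample

above_nyquist = sample_tone(5000, 8000, 64)
alias = sample_tone(alias_frequency(5000, 8000), 8000, 64)
max_diff = max(abs(a - b) for a, b in zip(above_nyquist, alias))

print("5 kHz sampled at 8 kHz appears at", alias_frequency(5000, 8000), "Hz")
print("largest sample difference:", max_diff)
print("G.711 bitrate:", pcm_bitrate(8000, 8), "bit/s")
```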
Echo is the most destructive quality problem in real-time voice. When a speaker’s voice travels over the network to a remote device, exits the loudspeaker, reflects off surfaces, re-enters the microphone, and is transmitted back — the original speaker hears their own voice with a 50–400 ms delay. This makes conversation impossible. AEC is therefore mandatory in every phone, conferencing system, and VoIP implementation.
Standard: ITU-T G.168 defines the digital network echo canceller specification. Target: Echo Return Loss Enhancement (ERLE) >40 dB. Filter convergence: <300 ms. Double-talk protection required to prevent filter divergence.
Echo path: Speaker A’s voice → network → Speaker B’s loudspeaker → room reflections → B’s microphone → network → Speaker A, who hears their own voice 50–400 ms late. Echo becomes noticeable and increasingly disruptive once the round-trip delay exceeds roughly 30 ms.
Cancellation: an adaptive filter continuously models the room impulse response h(t), using the loudspeaker signal x(t) as reference. A 512–4096 tap FIR filter produces the echo estimate ŷ(t), which is subtracted from the microphone input d(t). A non-linear processor (NLP) then suppresses the residual echo. The target is ERLE >40 dB.
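The adaptive-filter idea can be sketched with a normalized LMS (NLMS) update, one common choice for this job; the source does not say which adaptation algorithm is intended, so NLMS is an assumption here. The toy echo path, the 16-tap filter length, the step size, and the white-noise far-end signal are all illustrative, and the sketch omits double-talk detection and the NLP stage.

```python
import math
import random

def nlms_echo_cancel(x, d, taps=16, mu=0.5, eps=1e-8):
    """Return e[n] = d[n] - y_hat[n], adapting an FIR echo-path estimate with NLMS."""
    w = [0.0] * taps      # adaptive estimate of the echo path h
    buf = [0.0] * taps    # most recent far-end samples, newest first
    errors = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y_hat = sum(wi * bi for wi, bi in zip(w, buf))   # echo estimate
        err = dn - y_hat                                 # echo-cancelled output
        norm = sum(b * b for b in buf) + eps             # input power normalization
        w = [wi + mu * err * bi / norm for wi, bi in zip(w, buf)]
        errors.append(err)
    return errors

def erle_db(d, e):
    """Echo Return Loss Enhancement: mic power over residual power, in dB."""
    pd = sum(v * v for v in d)
    pe = sum(v * v for v in e) + 1e-30   # guard against log of zero
    return 10 * math.log10(pd / pe)

random.seed(0)
room = [0.6, 0.3, -0.2, 0.1, 0.05, -0.03, 0.02, 0.01]   # toy echo path (assumed)
x = [random.uniform(-1, 1) for _ in range(4000)]         # far-end (loudspeaker) signal
d = [sum(room[k] * x[n - k] for k in range(len(room)) if n - k >= 0)
     for n in range(len(x))]                             # mic signal: echo only
e = nlms_echo_cancel(x, d)

print("ERLE, first 50 samples:", round(erle_db(d[:50], e[:50]), 1), "dB")
print("ERLE, last 500 samples:", round(erle_db(d[-500:], e[-500:]), 1), "dB")
```

Because this simulation has no near-end speech or noise and the filter is longer than the echo path, the converged ERLE is far higher than a real room would allow; production cancellers need hundreds to thousands of taps and must freeze adaptation during double-talk.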
- Acoustic Wave Propagation: Longitudinal pressure waves travel at 343 m/s in air. Frequency = pitch; amplitude = loudness (dB SPL). Formants F1–F4 encode vowel identity.
- Transduction (Microphone): Dynamic: a coil moves in a magnetic field. Condenser: capacitance C = ε₀A/d varies as the diaphragm changes the gap d. MEMS: micro-fabricated silicon capacitor. All convert acoustic → electrical.
- Shannon Channel Capacity: C = B·log₂(1+S/N). PSTN (B = 3.1 kHz, SNR = 30 dB): C ≈ 31 kbps theoretical maximum. MIMO channels extend this significantly.
- Quantization & Companding: μ-law/A-law non-uniform quantization allocates more levels to small amplitudes, matching the roughly logarithmic response of hearing. An 8-bit companded sample covers about the dynamic range of 13–14-bit linear PCM.
- Perceptual Coding: Opus and AMR-WB exploit auditory masking, in which loud sounds mask quieter sounds at nearby frequencies. Only perceptually significant components receive bit allocation.
- Delay Budget: One-way end-to-end delay ≤150 ms (ITU-T G.114). Jitter buffers of 20–80 ms absorb packet timing variation. Playout algorithms trade latency for continuity.
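The Shannon capacity figure in the list above can be checked with a few lines of arithmetic. The function name is hypothetical; the bandwidth and SNR are the PSTN values quoted in the list.

```python
import math

def shannon_capacity_bps(bandwidth_hz, snr_db):
    """Channel capacity C = B * log2(1 + S/N), with SNR given in dB."""
    snr_linear = 10 ** (snr_db / 10)
    return bandwidth_hz * math.log2(1 + snr_linear)

c = shannon_capacity_bps(3100, 30)
print(f"PSTN capacity: {c / 1000:.1f} kbps")   # -> PSTN capacity: 30.9 kbps
```

Each extra 10 dB of SNR adds roughly B·log₂(10) ≈ 3.3 bits per second per hertz of bandwidth, which is why capacity grows only slowly with signal power.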
PSTN: G.711 64 kbps. Circuit-switched. μ/A-law. <5 ms local.
Cellular: TDMA/OFDMA. AMR 4.75–12.2 kbps. HD Voice AMR-WB.
VoIP: RTP/UDP. Opus/G.729. Jitter buffer + AEC + PLC.
Satellite: VSAT/Iridium. 250+ ms GEO. CELP codecs. AEC critical.
Land mobile radio: DMR/P25/TETRA. Emergency services. AMBE+. Encrypted PTT.
WebRTC: Opus codec. SRTP. ICE/STUN/TURN. Browser-native.
FM broadcast: 88–108 MHz. Pre-emphasis 50/75 μs. Stereo pilot 19 kHz.
Telehealth: HD voice, noise suppression, HIPAA SRTP. SIP/WebRTC.
| Codec | Bitrate | Bandwidth | Algorithm | Latency | Use Case |
|---|---|---|---|---|---|
| G.711 | 64 kbps | 300–3400 Hz | PCM + μ/A-law | <1 ms | PSTN, VoIP baseline |
| G.729A | 8 kbps | 300–3400 Hz | CS-ACELP | 10 ms | Low-BW VoIP |
| G.722 | 48–64 kbps | 50–7000 Hz | SB-ADPCM | <2 ms | Wideband VoIP |
| AMR-WB | 6.6–23.85 kbps | 50–7000 Hz | ACELP | 20 ms | 4G HD Voice |
| Opus | 6–510 kbps | 20–20000 Hz | SILK+CELT | 2.5–60 ms | WebRTC, Discord |
| EVS | 5.9–128 kbps | 50–20000 Hz | TCX/ACELP | 20–32 ms | VoLTE, 5G VoNR Super HD |
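The bitrates in the table are codec payload rates. On an IP network each packet also carries IPv4, UDP, and RTP headers, so the rate on the wire is higher. The sketch below estimates that figure assuming 20 ms packetization (a common default, but an assumption here) and no header compression or link-layer overhead; the function name is hypothetical.

```python
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12   # IPv4 + UDP + RTP, no options or extensions

def ip_bandwidth_kbps(codec_kbps, packet_ms=20):
    """IP-layer bandwidth for one voice stream in one direction."""
    payload_bytes = codec_kbps * 1000 * packet_ms / 1000 / 8
    packets_per_s = 1000 / packet_ms
    return (payload_bytes + IP_UDP_RTP_HEADER_BYTES) * 8 * packets_per_s / 1000

for name, rate in [("G.711", 64), ("G.729A", 8)]:
    print(f"{name}: {rate} kbps payload -> {ip_bandwidth_kbps(rate):.1f} kbps on IP")
# -> G.711: 64 kbps payload -> 80.0 kbps on IP
# -> G.729A: 8 kbps payload -> 24.0 kbps on IP
```

The fixed 40-byte header dominates at low bitrates: G.729A triples in size on the wire, which is why low-rate deployments often use longer packets or header compression.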
Key Insight: Voice communication is the most complete intersection of acoustic physics, signal processing, information theory, and RF engineering. Every millisecond of delay, every dB of noise, and every Hz of bandwidth has been meticulously studied over 150 years — from Bell’s telephone to 5G EVS super-wideband codecs delivering near-CD quality voice over mobile networks globally.