Complete Communications Engineering


Analog Tubes for Infiltrating Spoofing Detection Systems

Introduction

Machine learning algorithms and Automatic Speaker Identification (ASI) systems are commonly used to protect against artificially generated voices. These systems typically assume that spoofing requires digital voice synthesis. However, a live human attacker can speak through analog acoustic tubes to mimic a target voice and bypass an ASI system by physically changing how the speech propagates. Unlike digital spoofing, this approach exploits the sensitivity of ASI systems to analog signal transformations. As a result, it challenges the notion that human speech is uniquely identifiable. This white paper reviews recent research on analog tubes as a physical-layer attack vector against ASI systems, with particular acknowledgment of the work by Fawaz & Ahmed (2024).

Human Speech Production

Human speech begins with air pressure from the lungs passing through the vocal folds, or glottis, producing vibrations known as glottal excitation. This excitation determines the pitch of the voice and can be modeled either as an impulse train, g(t), in the time domain, or as a harmonic spectrum, G(f), in the frequency domain. The airflow then travels through the articulators (such as the vocal tract, tongue, lips, and nasal cavity), which act as a linear acoustic filter, Hv(f). The resulting speech signal can be described in the frequency domain as the product of the excitation and the filter:

S(f) = G(f) \cdot H_v(f)

This model, in which the vocal tract functions as a filter, is key to understanding how analog tubes can replicate the resonance characteristics of the human vocal tract. Figure 1 summarizes the human speech production system.
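The source-filter idea can be sketched numerically by multiplying a harmonic spectrum G(f) by a filter Hv(f). The sketch below is illustrative only: the 1/k² harmonic roll-off, the second-order resonance shape, and the formant values are all assumptions that do not come from the cited work.

```python
def glottal_harmonics(pitch_hz, n_harmonics):
    """G(f): amplitudes at integer multiples of the pitch.

    The 1/k^2 roll-off (about -12 dB per octave) is an assumed shape.
    """
    return {k * pitch_hz: 1.0 / k ** 2 for k in range(1, n_harmonics + 1)}


def resonance(freq_hz, center_hz, q):
    """Complex response of a second-order resonance with unit peak gain."""
    detune = freq_hz / center_hz - center_hz / freq_hz
    return 1.0 / complex(1.0, q * detune)


def vocal_tract(freq_hz, formants):
    """Hv(f), sketched as a sum of formant resonances (an assumption)."""
    return sum(resonance(freq_hz, fc, q) for fc, q in formants)


def speech_spectrum(pitch_hz, n_harmonics, formants):
    """|S(f)| = |G(f)| * |Hv(f)|, evaluated at each harmonic."""
    g = glottal_harmonics(pitch_hz, n_harmonics)
    return {f: amp * abs(vocal_tract(f, formants)) for f, amp in g.items()}


# Hypothetical (center frequency in Hz, Q) pairs, chosen for illustration.
formants = [(700.0, 8.0), (1200.0, 10.0), (2600.0, 12.0)]
spectrum = speech_spectrum(pitch_hz=100.0, n_harmonics=40, formants=formants)
```

In this toy setup the harmonic that falls on a formant (700 Hz) comes out stronger than its neighbors at 600 Hz and 800 Hz, which is the filtering effect a tube would need to imitate.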

Figure 1. Human Speech system

Automatic Speaker Identification (ASI)

ASI systems are designed to classify speakers based on their voice signatures. During enrollment, users typically speak a predefined text, and the system uses deep learning algorithms to extract distinguishing voice features. Many systems also include spoof-detection mechanisms for added security, especially in contexts like home protection or phone banking. However, these systems often assume that spoofing must be done digitally. This assumption leaves them open to attacks by live human impersonators who use analog tools to manipulate speech acoustics. Figure 2 illustrates how an attacker speaking through tubes can get past spoof-detection mechanisms.

Figure 2. Attacker infiltrating a spoof detection system via analog tubes (Fawaz & Ahmed, 2024)

Analog Tubes as Acoustic Filters

Acoustic tubes function as resonant bandpass filters, amplifying certain frequencies while reducing others. The resonance frequency f0 of a cylindrical tube is given by:

f_0 = \frac{c_{air}}{\lambda}, where \lambda = 2(L+0.8d)

Here, c_{air} is the speed of sound in air, L is the tube length, and d is the tube diameter. The sharpness of the filter, or how selective it is around its resonant frequency, is determined by the quality factor Q, which depends on resonance losses.
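As a quick worked example, the resonance frequency can be computed directly from the tube dimensions using the wavelength relation λ = 2(L + 0.8d). The speed of sound (343 m/s), the tube dimensions, and the Q value below are assumed for illustration, and the bandwidth uses the standard resonator relation Δf = f0 / Q rather than anything specific to the cited work.

```python
SPEED_OF_SOUND_AIR = 343.0  # m/s, assumed value for air at about 20 °C


def tube_resonance_hz(length_m, diameter_m, c_air=SPEED_OF_SOUND_AIR):
    """Resonance frequency f0 = c_air / lambda, with lambda = 2(L + 0.8d)."""
    wavelength_m = 2.0 * (length_m + 0.8 * diameter_m)
    return c_air / wavelength_m


def bandwidth_hz(f0_hz, q):
    """Bandwidth of a resonance with quality factor Q (delta_f = f0 / Q)."""
    return f0_hz / q


# Hypothetical tube: 30 cm long, 3 cm in diameter, Q of 10.
f0 = tube_resonance_hz(length_m=0.30, diameter_m=0.03)
delta_f = bandwidth_hz(f0, q=10.0)
print(f"f0 = {f0:.1f} Hz, bandwidth = {delta_f:.1f} Hz")
```

For these assumed dimensions the resonance lands at roughly 529 Hz. A longer or wider tube lowers f0, and a higher Q narrows the band around it.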

Impersonation Mechanism

An attacker can use such a tube to alter their voice so that it resembles the target’s voice from the perspective of an ASI system. This process involves several steps:

H_{res}(f) = \sum_{i} H_i(f)

Each harmonic filter Hi(f) has a center frequency fi = i · f0 and a bandwidth ∆fi determined by Qi.

s_{out}(t) = F_{tube}\left(s_{in}, p\right) = \mathcal{F}^{-1}\left(H_{res}(f) \cdot S_{in}(f)\right)

where p = (L, d) holds the tube parameters, s_{in}(t) is the input speech signal, and S_{in}(f) is its frequency spectrum.
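The filtering relation above can be sketched in a few lines: transform the input, multiply by a sum of harmonic resonances, and transform back. This is a minimal sketch under stated assumptions, not the authors' implementation. The second-order resonance shape of each H_i(f), the naive DFT, and the function names (`harmonic_filter`, `tube_response`, `apply_tube`) are all illustrative choices.

```python
import cmath
import math


def harmonic_filter(freq_hz, center_hz, q):
    """H_i(f): second-order resonance (assumed shape) with unit peak gain."""
    if freq_hz == 0.0:
        return 0j  # a bandpass resonance blocks DC
    detune = freq_hz / center_hz - center_hz / freq_hz
    return 1.0 / complex(1.0, q * detune)


def tube_response(freq_hz, f0_hz, q_values):
    """H_res(f) = sum_i H_i(f), with H_i centered at f_i = i * f0."""
    return sum(harmonic_filter(freq_hz, i * f0_hz, q)
               for i, q in enumerate(q_values, start=1))


def dft(x):
    """Naive discrete Fourier transform, adequate for short signals."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def apply_tube(s_in, sample_rate_hz, f0_hz, q_values):
    """s_out = inverse transform of H_res(f) * S_in(f)."""
    n = len(s_in)
    spectrum_out = []
    for k, value in enumerate(dft(s_in)):
        # Upper bins represent negative frequencies; this resonance formula
        # yields the complex conjugate there, which keeps the output real.
        signed_k = k if k <= n // 2 else k - n
        freq_hz = signed_k * sample_rate_hz / n
        spectrum_out.append(tube_response(freq_hz, f0_hz, q_values) * value)
    return [v.real for v in idft(spectrum_out)]


# Two equal tones: one on the tube resonance (500 Hz), one off it (750 Hz).
fs, n = 8000.0, 160
s_in = [math.sin(2 * math.pi * 500.0 * t / fs)
        + math.sin(2 * math.pi * 750.0 * t / fs) for t in range(n)]
s_out = apply_tube(s_in, fs, f0_hz=500.0, q_values=[20.0, 20.0])
```

With these assumed parameters, the 500 Hz tone passes almost unchanged while the 750 Hz tone, which sits between the first two resonances, is strongly attenuated.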

Multi-Tube Systems

Using multiple tubes can increase the success rate of the impersonation. In interconnected tube systems, the resonance frequencies fi must satisfy the condition:

A_1 \cot\left(\frac{2\pi f L_1}{c_{air}}\right) = A_2 \cot\left(\frac{2\pi f L_2}{c_{air}}\right)

Here, A1 and A2 are the cross-sectional areas, and L1 and L2 are the lengths of the two tubes. This nonlinear equation is typically solved numerically to determine the resulting resonance frequencies fi.
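One way to solve this condition numerically is a grid scan followed by bisection. The sketch below is one possible approach, not necessarily the method used in the cited work. To avoid the poles of the cotangent, it multiplies the equation through by sin(x1)·sin(x2) and then discards any root where either sine vanishes. The tube dimensions, the search range, and the function names are assumptions.

```python
import math


def _phases(freq_hz, l1, l2, c_air):
    """Phase terms x = 2*pi*f*L / c_air for each tube."""
    scale = 2.0 * math.pi * freq_hz / c_air
    return scale * l1, scale * l2


def pole_free_residual(freq_hz, a1, l1, a2, l2, c_air=343.0):
    """(A1*cot(x1) - A2*cot(x2)) * sin(x1) * sin(x2), which has no poles."""
    x1, x2 = _phases(freq_hz, l1, l2, c_air)
    return a1 * math.cos(x1) * math.sin(x2) - a2 * math.cos(x2) * math.sin(x1)


def two_tube_resonances(a1, l1, a2, l2, f_min=50.0, f_max=1500.0,
                        step=1.0, c_air=343.0):
    """Scan [f_min, f_max] for sign changes, then refine each by bisection."""
    def g(f):
        return pole_free_residual(f, a1, l1, a2, l2, c_air)

    roots = []
    f_lo, g_lo = f_min, g(f_min)
    while f_lo < f_max:
        f_hi = min(f_lo + step, f_max)
        g_hi = g(f_hi)
        if g_lo * g_hi < 0.0:
            lo, hi, g_at_lo = f_lo, f_hi, g_lo
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                g_mid = g(mid)
                if g_at_lo * g_mid <= 0.0:
                    hi = mid
                else:
                    lo, g_at_lo = mid, g_mid
            root = 0.5 * (lo + hi)
            x1, x2 = _phases(root, l1, l2, c_air)
            # Skip spurious roots created by the sin(x1)*sin(x2) multiplication.
            if abs(math.sin(x1)) > 1e-6 and abs(math.sin(x2)) > 1e-6:
                roots.append(root)
        f_lo, g_lo = f_hi, g_hi
    return roots


# Hypothetical tubes: 30 cm long and 4 cm wide, joined to 20 cm long and 2 cm wide.
area_1 = math.pi * 0.02 ** 2
area_2 = math.pi * 0.01 ** 2
resonances = two_tube_resonances(area_1, 0.30, area_2, 0.20)
```

A bracketing method is a natural fit here because the pole-free residual is continuous, so every sign change on the grid encloses a root. A finer `step` reduces the risk of missing two closely spaced roots.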

Security Implications

ASI systems are susceptible to analog spoofing for several reasons:

Conclusion

Analog acoustic tubes undermine the assumption that spoofing must be digital. By physically altering speech through resonance filtering, live attackers can convincingly impersonate enrolled speakers and bypass ASI systems. This work shows that human speech is not inherently unforgeable and emphasizes the need for ASI designs that are robust against analog manipulation. Future research should explore countermeasures to defend against such physical spoofing techniques.