Analog Tubes for Infiltrating Spoofing Detection Systems

Introduction

Machine learning algorithms and Automatic Speaker Identification (ASI) systems are commonly used to protect against artificially generated voices. These systems typically assume that spoofing requires digital voice synthesis. However, live human attackers can use analog acoustic tubes to mimic target voices and bypass ASI systems by physically manipulating how speech propagates. Unlike digital spoofing, this approach exploits the sensitivity of ASI systems to analog signal transformations. As a result, it challenges the notion that human speech is uniquely identifiable. This white paper reviews recent research on analog tubes as a physical-layer attack vector against ASI systems, drawing in particular on the work of Fawaz & Ahmed (2024).

Human Speech Production

Human speech begins with air pressure from the lungs passing through the vocal folds at the glottis, producing vibrations known as glottal excitation. This excitation determines the pitch of the voice and can be modeled either as an impulse train, g(t), in the time domain, or as a harmonic spectrum, G(f), in the frequency domain. The airflow then travels through the articulators (such as the vocal tract, tongue, lips, and nasal cavity), which act as a linear acoustic filter, H_v(f). The resulting speech signal can be described in the frequency domain as:

$S(f) = G(f) \cdot H_v(f)$

This source-filter model, in which the vocal tract functions as a filter, is key to understanding how analog tubes can replicate the resonance characteristics of the human vocal tract. Figure 1 summarizes the human speech production system.
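The source-filter view can be made concrete with a small numerical sketch. The Python below treats the speech spectrum as the product of a harmonic excitation G(f) and a filter magnitude |H_v(f)|. The 1/k harmonic roll-off, the peak shape of the toy filter, and the formant values are all assumptions chosen for illustration, not measurements or part of the cited work.

```python
import math

def glottal_harmonics(pitch_hz, n_harmonics):
    """Harmonic spectrum G(f) as (frequency, amplitude) pairs.

    The 1/k amplitude roll-off is an assumption made for illustration.
    """
    return [(k * pitch_hz, 1.0 / k) for k in range(1, n_harmonics + 1)]

def vocal_tract_gain(f, formants):
    """Toy magnitude |H_v(f)|: one resonance peak per (center_hz, bandwidth_hz)."""
    return sum(
        1.0 / math.sqrt(1.0 + ((f - center) / (bandwidth / 2.0)) ** 2)
        for center, bandwidth in formants
    )

def speech_spectrum(pitch_hz, n_harmonics, formants):
    """Source-filter product |S(f)| = |G(f)| * |H_v(f)| at each harmonic."""
    return [
        (f, amplitude * vocal_tract_gain(f, formants))
        for f, amplitude in glottal_harmonics(pitch_hz, n_harmonics)
    ]

# Hypothetical speaker: 100 Hz pitch, formants near 500 Hz and 1500 Hz.
spectrum = speech_spectrum(100.0, 20, [(500.0, 100.0), (1500.0, 150.0)])
```

Harmonics that fall near a formant are emphasized relative to their neighbors. A tube placed in front of the mouth exploits the same effect by adding resonances of its own.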

Figure 1. Human Speech system

Automatic Speaker Identification (ASI)

ASI systems are designed to classify speakers based on their voice signatures. During enrollment, users typically speak a predefined text, and the system uses deep learning algorithms to extract distinguishing voice features. Many systems also include spoof-detection mechanisms for added security, especially in contexts like home protection or phone banking. However, these systems often assume that spoofing must be done digitally. This assumption leaves them open to attacks by live human impersonators who use analog tools to manipulate speech acoustics. Figure 2 illustrates how an attacker speaking through tubes can bypass spoof-detection mechanisms.

Figure 2. Attacker infiltrating a spoof detection system via analog tubes (Fawaz & Ahmed, 2024)

Analog Tubes as Acoustic Filters

Acoustic tubes function as resonant bandpass filters, amplifying certain frequencies while reducing others. The resonance frequency f₀ of a cylindrical tube is given by:

$f_0 = \frac{c_{air}}{\lambda}$, where $\lambda = 2(L + 0.8d)$

Here, c_air is the speed of sound, L is the tube length, and d is the diameter. The sharpness of the filter, or how selective it is around its resonant frequency, is determined by the quality factor Q, which depends on resonance losses.
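As a quick numerical check, the sketch below evaluates the resonance under the assumption that the fundamental is f₀ = c_air / λ with λ = 2(L + 0.8d). The default speed of sound (343 m/s), the example tube dimensions, and the convention Q = f / Δf used in `bandwidth_hz` are assumptions added here for illustration.

```python
def tube_resonance_hz(length_m, diameter_m, c_air=343.0):
    """Fundamental resonance f0 = c_air / lambda, with lambda = 2 * (L + 0.8 * d)."""
    wavelength_m = 2.0 * (length_m + 0.8 * diameter_m)
    return c_air / wavelength_m

def bandwidth_hz(center_hz, q):
    """Bandwidth under the common convention Q = f / bandwidth (assumed here)."""
    return center_hz / q

# Hypothetical tube: 30 cm long, 2 cm in diameter.
f0 = tube_resonance_hz(0.30, 0.02)
print(f"f0 = {f0:.1f} Hz, bandwidth at Q=20: {bandwidth_hz(f0, 20.0):.1f} Hz")
```

For this hypothetical tube the fundamental comes out at about 543 Hz. Lengthening or widening the tube increases λ and therefore lowers f₀.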

Impersonation Mechanism

An attacker can use such a tube to alter their voice so that it resembles the target’s voice from the perspective of an ASI system. This process involves several steps:

The tube is modeled as a bandpass filter with a transfer function:

$H_{res}(f) = \sum_{i} H_i(f)$

Each harmonic filter $H_i(f)$ has a center frequency $f_i = i \cdot f_0$ and a bandwidth $\Delta f_i$ determined by $Q_i$.

By adjusting the tube’s length (L) and diameter (d), the attacker can optimize the filter characteristics to fool the ASI into recognizing the attacker’s voice as that of the target.
The output speech is generated using the formula:

$s_{out}(t) = F_{tube}(s_{in}, p) = \mathcal{F}^{-1}\left(H_{res}(f) \cdot S_{in}(f)\right)$

where $p = (L, d)$ and $S_{in}(f)$ is the frequency spectrum of the input speech $s_{in}(t)$.
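The filtering steps above can be sketched in plain Python. The source does not give the exact form of each H_i(f), so this sketch assumes a standard second-order bandpass with unit gain at f_i and one shared Q for every harmonic. The function names, the naive DFT, and the test tones are illustrative choices, not the implementation used by Fawaz & Ahmed.

```python
import cmath
import math

def harmonic_response(f, f_i, q_i):
    """Assumed second-order bandpass H_i(f): unit gain at f_i, bandwidth f_i / q_i."""
    if f == 0.0:
        return 0j  # a bandpass blocks DC
    return 1.0 / (1.0 + 1j * q_i * (f / f_i - f_i / f))

def tube_response(f, f0, q, n_harmonics=3):
    """H_res(f) as the sum of H_i(f) centered at f_i = i * f0 (same Q for all i)."""
    return sum(harmonic_response(f, i * f0, q) for i in range(1, n_harmonics + 1))

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform, adequate for short frames."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    scale = 1.0 / n if inverse else 1.0
    return [
        scale * sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def apply_tube(s_in, sample_rate, f0, q, n_harmonics=3):
    """s_out(t) = F^-1( H_res(f) * S_in(f) ) for a real-valued frame s_in."""
    n = len(s_in)
    filtered = []
    for k, value in enumerate(dft(s_in)):
        # Bins above n/2 hold negative frequencies; use the conjugate response
        # there so the filtered spectrum stays conjugate-symmetric.
        f = k * sample_rate / n if k <= n // 2 else (k - n) * sample_rate / n
        h = tube_response(abs(f), f0, q, n_harmonics)
        filtered.append(value * (h if f >= 0 else h.conjugate()))
    # Any residual imaginary part (rounding, Nyquist bin) is discarded.
    return [v.real for v in dft(filtered, inverse=True)]

def tone(freq_hz, sample_rate=8000, n=200):
    """Pure sine test signal."""
    return [math.sin(2.0 * math.pi * freq_hz * t / sample_rate) for t in range(n)]

def energy(x):
    return sum(v * v for v in x)

# A tone on the tube's fundamental passes; a tone between harmonics is attenuated.
on_resonance = apply_tube(tone(400.0), 8000, f0=400.0, q=20.0)
off_resonance = apply_tube(tone(600.0), 8000, f0=400.0, q=20.0)
```

In an attack setting, f0 would follow from the tube parameters p = (L, d), and those parameters would be tuned against the ASI system's response. That search loop is omitted here because it depends on the target model.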

Multi-Tube Systems

Using multiple tubes can increase the success rate of the impersonation. In interconnected tube systems, the resonance frequencies f_i must satisfy the condition:

$A_1\cot\left(\frac{2\pi f L_1}{c_{air}}\right) = A_2\cot\left(\frac{2\pi f L_2}{c_{air}}\right)$

Here, A₁ and A₂ are the cross-sectional areas, and L₁ and L₂ are the lengths of the two tubes. This nonlinear equation is typically solved numerically to determine the resulting resonance frequencies f_i.
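One way to solve this condition numerically is a grid scan followed by bisection, sketched below. The function name, the frequency range, the 1 Hz scan step, and the example tube dimensions are assumptions for illustration. To avoid the poles of the cotangent, the sketch finds zeros of the condition multiplied by both sines and then discards points where the original expression is undefined.

```python
import math

def two_tube_resonances(a1, l1, a2, l2, f_min=50.0, f_max=5000.0,
                        step=1.0, c_air=343.0):
    """Frequencies in [f_min, f_max] where A1*cot(2*pi*f*L1/c) = A2*cot(2*pi*f*L2/c)."""
    k1 = 2.0 * math.pi * l1 / c_air
    k2 = 2.0 * math.pi * l2 / c_air

    def h(f):
        # The condition multiplied by sin(k1*f) * sin(k2*f), which removes the
        # poles of cot so that sign changes mark genuine crossings.
        return (a1 * math.cos(k1 * f) * math.sin(k2 * f)
                - a2 * math.cos(k2 * f) * math.sin(k1 * f))

    roots = []
    lo = f_min
    while lo < f_max:
        hi = min(lo + step, f_max)
        if h(lo) * h(hi) < 0.0:
            left, right = lo, hi
            for _ in range(60):  # bisection down to floating-point resolution
                mid = 0.5 * (left + right)
                if h(left) * h(mid) <= 0.0:
                    right = mid
                else:
                    left = mid
            f = 0.5 * (left + right)
            # Discard points where the sines vanish: they zero h(f), but the
            # original cot expressions are undefined there.
            if abs(math.sin(k1 * f)) > 1e-6 and abs(math.sin(k2 * f)) > 1e-6:
                roots.append(f)
        lo = hi
    return roots

# Hypothetical pair of tubes: areas in a 2:1 ratio, lengths 10 cm and 17 cm.
resonances = two_tube_resonances(a1=2.0e-4, l1=0.10, a2=1.0e-4, l2=0.17)
```

Only the ratio of the two areas affects the solutions, so the absolute values of A₁ and A₂ matter less than their proportion. A finer scan step may be needed if two resonances lie less than 1 Hz apart.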

Security Implications

ASI systems are susceptible to analog spoofing for several reasons:

Attackers do not need access to digital voice models—just knowledge of the victim’s enrollment process.
Analog tubes reshape the glottal excitation of the attacker’s voice into a form that closely resembles the target’s voice, at least from the ASI system’s perspective.
By optimizing tube parameters, attackers can create speech that is highly convincing in the ASI system’s feature space, even if it doesn’t sound exactly right to the human ear.

Conclusion

Analog acoustic tubes undermine the assumption that spoofing must be digital. By physically altering speech through resonance filtering, live attackers can convincingly impersonate enrolled speakers and bypass ASI systems. This work shows that human speech is not inherently unforgeable and emphasizes the need for ASI designs that are robust against analog manipulation. Future research should explore countermeasures to defend against such physical spoofing techniques.