Spatial Audio Introduction

Introduction

Audio played through headphones generally sounds as if it originates inside the listener’s head. One reason is that the audio reaches both ears at the same time, whereas sound from a real source arrives at each ear at a slightly different time. VOCAL offers a spatial audio solution that can make an audio signal sound as if it’s coming from any desired direction.

Figure 1. Spherical coordinates.

The standard approach to 3D audio is to convolve the input signal with an appropriate head-related impulse response (HRIR). The term head-related transfer function (HRTF), which strictly denotes the frequency-domain counterpart (the Fourier transform) of the HRIR, is often used interchangeably with HRIR. A different HRIR exists for every direction $(\theta, \varphi)$ and for each ear. The left/right HRIR for the direction $(\theta, \varphi)$ describes what a user whose head is at the origin would hear in their left/right ear if a sound emanated from that direction. Thus, if a signal $x$ were played from direction $(\theta, \varphi)$ and the left/right HRIRs for that direction were $h_L$ and $h_R$, then the signal heard at the left ear would be $x * h_L$ and the signal heard at the right ear would be $x * h_R$, where $*$ denotes convolution.
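The convolution described above can be sketched in a few lines of Python. This is a minimal illustration, not VOCAL's implementation: the function names `convolve` and `spatialize` are hypothetical, and the two HRIRs are made-up toy values (a louder, earlier arrival at the right ear and a delayed, attenuated copy at the left ear) rather than measured responses.

```python
def convolve(x, h):
    """Full linear convolution of two finite sequences (direct form)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def spatialize(x, h_left, h_right):
    """Return the (left, right) ear signals x * h_L and x * h_R."""
    return convolve(x, h_left), convolve(x, h_right)


# Toy HRIRs (assumed values, for illustration only): the source is on the
# listener's right, so the right ear receives the sound immediately and the
# left ear receives it two samples later at half the amplitude.
h_right = [1.0, 0.0, 0.0]
h_left = [0.0, 0.0, 0.5]

x = [1.0, 2.0, 3.0]  # mono input signal
left, right = spatialize(x, h_left, h_right)
```

In practice the HRIR pair would be looked up for the desired direction $(\theta, \varphi)$, and a long signal would normally be processed with an FFT-based convolution rather than the direct double loop shown here.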

HRIRs are typically measured using a specialized setup and a lengthy procedure. An individual sits in place for a couple of hours, or however long the measurement takes, while microphones at each ear record sound from a loudspeaker that rotates around their head. The differences between the audio at the loudspeaker and at each ear determine the HRTFs for each ear. These HRTFs contain all the information about how this listener’s perception of an audio source changes depending on the source’s location.
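One idealized way to picture how the loudspeaker signal and the ear recording determine a response is frequency-domain deconvolution: divide the spectrum of the recording by the spectrum of the played signal and transform back. The sketch below assumes a noise-free recording and a test signal whose spectrum is nonzero in every bin; the names `dft` and `estimate_hrir` and all signal values are hypothetical. Real measurement procedures are more involved, and nothing here is taken from a specific measurement system.

```python
import cmath


def dft(seq, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(seq[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def estimate_hrir(played, recorded, n):
    """Estimate an impulse response of length n by spectral division.

    Assumes no noise and that the played signal has no spectral zeros.
    n must be at least len(recorded) so circular and linear convolution agree.
    """
    x = list(played) + [0.0] * (n - len(played))
    y = list(recorded) + [0.0] * (n - len(recorded))
    X, Y = dft(x), dft(y)
    H = [yk / xk for xk, yk in zip(X, Y)]  # transfer function per bin
    return [v.real for v in dft(H, inverse=True)]


# Toy data: the "ear" recording equals the played signal convolved with
# the (assumed) true response [0.0, 0.8, 0.3].
played = [1.0, 0.5]
recorded = [0.0, 0.8, 0.7, 0.15]
estimate = estimate_hrir(played, recorded, n=8)
```

With these toy values the estimate recovers the assumed response up to floating-point rounding, with the remaining samples close to zero.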

Generic Head-Related Transfer Functions and Spatial Localization

The standard approach to 3D audio is simple and effective, but it’s insufficient for general use. A major drawback of using HRIRs to synthesize spatial audio is that they’re highly individual due to idiosyncratic high-frequency spectral cues. No two people have the same HRIRs, so creating accurate 3D audio in this way requires measuring the individual HRIRs of each and every user. Doing so is time-consuming and requires special equipment, so generic HRTFs are often used instead.

Generic HRTFs are HRTFs that were measured on a different person or, in some cases, a dummy head. They can be used in the same way as individual HRTFs, but they introduce a variety of issues because they don’t contain the spectral characteristics of the user’s individual HRTFs. The most common issues are front/back errors, where the directions $(\theta, \varphi)$ and $(\theta, 180^\circ - \varphi)$ are difficult for a user to distinguish from each other. Other common issues include up/down errors, where the directions $(\theta, \varphi)$ and $(180^\circ - \theta, \varphi)$ are confused, and in-head localization, where sound appears to originate from inside the user’s head.
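The mirrored directions above translate directly into a small helper for labeling responses in a localization test. This is a sketch under stated assumptions: angles are in degrees and follow the $(\theta, \varphi)$ convention of the text, no angle wrapping is applied, and the function names and the 15-degree tolerance are hypothetical choices rather than values from the source.

```python
def front_back_mirror(theta, phi):
    """Direction commonly confused with (theta, phi) in a front/back error."""
    return (theta, 180.0 - phi)


def up_down_mirror(theta, phi):
    """Direction commonly confused with (theta, phi) in an up/down error."""
    return (180.0 - theta, phi)


def classify_response(target, response, tol=15.0):
    """Label a perceived direction relative to the true target direction.

    tol is an assumed per-angle tolerance in degrees.
    """
    def close(a, b):
        return abs(a[0] - b[0]) <= tol and abs(a[1] - b[1]) <= tol

    if close(response, target):
        return "correct"
    if close(response, front_back_mirror(*target)):
        return "front/back error"
    if close(response, up_down_mirror(*target)):
        return "up/down error"
    return "other"


# A source at (90, 30) perceived near (90, 150) counts as a front/back error.
label = classify_response((90.0, 30.0), (95.0, 145.0))
```

Counting these labels over many trials gives a front/back error rate, which is one way to compare a generic HRTF set against a modified one.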

Generic HRTFs can be improved by incorporating additional spectral cues that exaggerate the differences between directions. Although HRTFs vary significantly from person to person, they share common features that can be exploited: a generic HRTF can be modified to provide a more pleasing listening experience while simultaneously improving spatial localization accuracy. Compared with the standard approach, VOCAL’s Spatial Audio system significantly reduces the frequency of front/back errors without adversely distorting the input signal or requiring the user’s individual HRTFs.

Psychoacoustics and Height Localization

When listening to binaural audio through headphones, it’s important to be aware of the intrinsic biases humans have when perceiving sound. Perhaps the most dramatic bias we have is to perceive high frequencies as originating from higher than our head and low frequencies as originating from lower than our head.

Figure 2. Effect of source spectrum and vertical location on vertical localization ability [1].

Figure 2 shows the results of an experiment in which sounds were played from loudspeakers at five different elevations and participants were asked where they perceived each sound to be coming from. As can be seen from the figure, participants perceived frequencies below 1000 Hz as originating from below the head regardless of the actual source location. Similarly, participants perceived sounds above 5000 Hz as originating from above the head regardless of the actual source location. The pink noise results show that this bias disappears when a broadband source consisting of many different frequencies is used.

Biases such as this are valuable in providing insight into how 3D audio can be used most effectively. For example, if a 400 Hz tone were to be used as an alert sound, that tone would never be perceived as coming from above the listener’s head. If accurate elevation localization is required in this situation, it would be better to use a broadband warning sound that consists of a wide range of frequencies.

References

[1] Cabrera, Densil; Tilley, Steven; 2003; Vertical Localization and Image Size Effects in Loudspeaker Reproduction [PDF]; School of Architecture, Design Science and Planning, University of Sydney, NSW, Australia; Paper 46; Available from: https://aes2.org/publications/elibrary-page/?id=12269
