Audio fingerprinting is the process of representing an audio signal compactly by extracting relevant features of the audio content. The idea is analogous to human fingerprinting: the fingerprint of ingested audio content is recorded and can later be matched against audio or playlists played on a mobile phone, TV or any other device. Audio fingerprinting allows audio to be monitored independent of its format and without the need for metadata. A robust acoustic fingerprint identifies an audio track even after compression or other degradation in sound quality. Major applications of acoustic fingerprinting include content-based audio retrieval and broadcast monitoring. Shazam is a popular music retrieval application available in the market.
Why is Audio Fingerprinting Needed?
There can be multiple reasons to use audio fingerprinting technology. Along with identifying the audio track, one might want to know the singer or writer of the song, or the position within the track that is currently playing. Knowing the position allows multiple audio pieces to be synchronized and enables more interactive experiences. Fingerprinting is also an efficient mechanism to establish the perceptual equality of two audio objects. The major advantages of a system built on audio fingerprints are reduced memory and storage requirements, efficient comparison, removal of perceptual irrelevancies, and efficient search.
An effective audio fingerprinting algorithm has the following qualities:
- Discriminative power
- Distortion invariance
- Computational simplicity
- Minimum length of query to retrieve
Applications of Audio Fingerprinting
Acoustic fingerprinting technology needs to be tuned for each use case to achieve the best accuracy. Some of the major applications of audio fingerprinting technology include:
Music recognition is an integral application of audio fingerprinting. A distinctive feature of a song or music signal is captured as a fingerprint. This compact metadata makes it possible to identify and retrieve the song from a database of millions of songs. An ideal audio fingerprinting system returns accurate retrieval results even in a noisy environment. An efficient search algorithm is also required to return precise results quickly.
The simplest approach is direct comparison of the digitized waveforms, but it is neither efficient nor effective. A more efficient implementation of this approach could use a hash function such as MD5 and compare the hash values instead of the whole files. The comparison is still exact, however, so it fails whenever two perceptually identical recordings differ in their bits.
Based on the method of querying, the two use cases of music retrieval are Query by example and Query by humming. In the former, a part of the original music is used as the query; in the latter, a hummed rendition of the music is used as the query.
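The limitation of hash comparison can be illustrated with a minimal sketch. The sample values and the helper name `md5_of_samples` are made up for this illustration; the point is only that changing a single 16-bit sample by one quantization step yields a completely different MD5 digest.

```python
import hashlib
import struct


def md5_of_samples(samples):
    """Return the MD5 hex digest of 16-bit PCM samples (little-endian)."""
    raw = struct.pack(f"<{len(samples)}h", *samples)
    return hashlib.md5(raw).hexdigest()


# Hypothetical sample values, chosen only for illustration.
original = [0, 1000, 2000, 1000, 0, -1000, -2000, -1000]
exact_copy = list(original)

# Same audio with one sample changed by a single quantization step.
slightly_altered = list(original)
slightly_altered[3] += 1

same = md5_of_samples(original) == md5_of_samples(exact_copy)
altered_matches = md5_of_samples(original) == md5_of_samples(slightly_altered)

print("exact copy matches:", same)
print("slightly altered copy matches:", altered_matches)
```

A bit-exact copy matches, while the minimally altered copy does not, which is why a fingerprint has to be derived from perceptual features rather than from raw bytes.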
Digital audio watermarking embeds a secret digital signature in the audio signal. Audio fingerprinting provides a signal processing approach to generate that digital signature. The embedded signature can be used to verify the authenticity of the audio signal. The advantage of using fingerprinting techniques is that the signature is less vulnerable to attacks, since any attempt to change the fingerprint will alter the quality of the audio.
Audio fingerprinting techniques are used to monitor songs or advertisements broadcast by TV or radio channels. This is useful because royalties are due each time a song is broadcast on TV or radio. Similarly, advertisers pay based on the number of times an advertisement airs. Audio fingerprinting techniques are also commonly used for TV viewership analytics and content protection.
Voice Identification – For Automotive Use-case
Audio fingerprinting techniques can be used to identify a speaker's voice. With the advancement of smart devices, the demand for robust speaker identification systems has increased. Many automotive use cases, such as car infotainment systems with voice-based commands tied to user profiles, can be realized with an accurate fingerprinting system.
Other applications include automatic music library organization using genre classifications, music trainer using vocal analysis, audio geographical mapping, etc.
Query by Example
There are multiple approaches to retrieving an audio file using fingerprints. The essential difference between these approaches is the kind of feature used as the fingerprint or metadata for an audio file. The basis of every approach is the audio spectrum, where the frequency bins of each frame are plotted along the time axis.
Figure 1: Audio Spectrum
The simplest approach, directly comparing the audio or its spectrograms, will not always give the right result. The query and the stored version of a song may be aurally similar yet have distinct bit representations. They may have been recorded with different compression schemes, quality levels or equalization settings. Any of these factors would make direct comparison ineffective.
A simple and efficient model to extract relevant features is given below.
Figure 2: Blocks of feature extraction in a basic audio fingerprinting approach
The input audio (.wav) is split into frames of equal duration, and each frame corresponds to one time point in the output metadata.
Windowing is most often used in spectral analysis to view a short-time segment of a longer signal and analyze its frequency content. Windows are also used to create short sound segments of a few milliseconds, called grains, which can be combined into granular sound clouds for unique sorts of synthesis. In general, any finite sound with a starting point and a stopping point can be thought of as a windowed segment in time. Windowing is useful for shaping the amplitude envelope of sampled sounds or other audio signals to avoid clicks and to create the desired attack and release characteristics, whether for a short grain or a longer excerpt.
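The framing step can be sketched as follows. The function name `split_into_frames` and the parameters `frame_length` and `hop_length` are illustrative assumptions; the article does not state frame sizes or whether frames overlap, so the overlap shown here is only one possible choice.

```python
def split_into_frames(samples, frame_length, hop_length):
    """Split samples into equal-length frames.

    Consecutive frames start hop_length samples apart, so they overlap
    when hop_length < frame_length. Trailing samples that do not fill a
    whole frame are dropped.
    """
    frames = []
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frames.append(samples[start:start + frame_length])
    return frames


# Toy signal: ten samples, frames of four samples, 50% overlap.
signal = list(range(10))
frames = split_into_frames(signal, frame_length=4, hop_length=2)

for index, frame in enumerate(frames):
    print(index, frame)
```

Each frame index then serves as one time point for the features computed in the later stages.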
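As a rough illustration of windowing a frame before spectral analysis, the sketch below applies a Hann window. The choice of the Hann window is an assumption made for this example; the article does not say which window function is used.

```python
import math


def hann_window(length):
    """Symmetric Hann window: zero at both ends, one in the middle."""
    return [
        0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def apply_window(frame, window):
    """Multiply a frame by a window of the same length, sample by sample."""
    return [sample * weight for sample, weight in zip(frame, window)]


window = hann_window(5)

# A constant frame makes the shape of the window directly visible.
shaped = apply_window([1.0] * 5, window)
print(shaped)
```

Tapering the frame edges toward zero avoids the abrupt start and stop that would otherwise smear energy across frequency bins.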
Equal Loudness Filter
This filter enhances frequencies to which we are perceptually more sensitive and attenuates those to which we are less sensitive. Apart from making perceptual sense, the filter enhances the frequency range in which the audio content is usually found and attenuates the low-frequency range, where the audio content (the melody, in the case of music) is less likely to be found.
Short-Time Fourier Transform
We apply the FFT to each preprocessed audio frame to find the frequency pattern within that frame. Finally, the frequency patterns of all the frames are stitched together, post-processed and analyzed.
Different audio sources might have different cut-off frequencies, and the predominant patterns of the audio content may be similar but lie in different frequency bands. The spectrum pattern derived from STFT analysis and stitching is therefore split into multiple bands across frequency. Then the peak (the frequency bin with maximum amplitude) in each band is calculated for every frame. Once the peaks of all the frames across bands have been calculated, the peaks in each band are joined across frames to form a contour. In the end, each audio file has as many metadata contours as there are frequency bands.
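The article does not specify the filter design, so the sketch below swaps in the standard A-weighting curve as a stand-in for an equal-loudness filter. It is applied here as a per-bin gain on a magnitude spectrum rather than as a time-domain filter, which is also an assumption. The function names and the toy spectrum are hypothetical.

```python
import math


def a_weighting_db(freq_hz):
    """Gain in dB of the standard A-weighting curve at freq_hz (> 0)."""
    f2 = freq_hz ** 2
    numerator = (12194.0 ** 2) * f2 ** 2
    denominator = (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * math.log10(numerator / denominator) + 2.0


def weight_spectrum(magnitudes, sample_rate, frame_length):
    """Scale each magnitude bin by the A-weighting gain at its frequency."""
    weighted = []
    for k, magnitude in enumerate(magnitudes):
        freq = k * sample_rate / frame_length
        # The curve tends to minus infinity dB at 0 Hz, so DC is zeroed.
        gain = 0.0 if freq == 0 else 10.0 ** (a_weighting_db(freq) / 20.0)
        weighted.append(magnitude * gain)
    return weighted


# Flat toy spectrum with bins at 0, 1000, 2000, 3000 and 4000 Hz.
weighted = weight_spectrum([1.0] * 5, sample_rate=8000, frame_length=8)
print([round(value, 3) for value in weighted])
print(round(a_weighting_db(100.0), 1), "dB at 100 Hz")
```

The curve leaves the region around 1 kHz essentially unchanged and strongly attenuates low frequencies, which matches the behaviour described above.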
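A minimal sketch of this step is shown below. For the sake of a self-contained example it uses a naive O(N^2) discrete Fourier transform instead of an FFT; the result is the same, and a real system would use an FFT routine from a numerical library. The function names and the test tone are assumptions for illustration.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the non-negative frequency bins of one frame."""
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n)
            for t in range(n)
        )
        magnitudes.append(abs(total))
    return magnitudes


def spectrogram(frames):
    """Stack the per-frame spectra along time: one row per frame."""
    return [dft_magnitudes(frame) for frame in frames]


# Toy frame: a sine wave completing exactly 4 cycles in 32 samples.
frame = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
spec = spectrogram([frame, frame])

peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print("strongest bin:", peak_bin)
```

Each row of `spec` is the frequency pattern of one frame, and the list of rows is the stitched time-frequency representation.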
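The band splitting and contour stitching described above can be sketched as follows. The function name `band_peak_contours`, the equal-width bands and the toy spectrogram are assumptions; the article does not state how band edges are chosen.

```python
def band_peak_contours(spectrogram, num_bands):
    """Return one contour per band: the peak bin of that band in each frame.

    spectrogram is a list of frames, each a list of bin magnitudes.
    Bands are equal-width slices of the bin axis (num_bands <= bins).
    """
    num_bins = len(spectrogram[0])
    edges = [b * num_bins // num_bands for b in range(num_bands + 1)]
    contours = [[] for _ in range(num_bands)]
    for frame in spectrogram:
        for band in range(num_bands):
            low, high = edges[band], edges[band + 1]
            # Peak = bin with maximum amplitude inside this band.
            peak = max(range(low, high), key=lambda k: frame[k])
            contours[band].append(peak)
    return contours


# Toy spectrogram: 2 frames, 8 bins, split into 2 bands of 4 bins.
toy_spectrogram = [
    [0, 5, 1, 0, 0, 0, 9, 0],
    [0, 0, 7, 0, 3, 0, 0, 0],
]
contours = band_peak_contours(toy_spectrogram, num_bands=2)
print(contours)
```

The output contains one contour per band, each with one peak bin per frame, which is the metadata stored for the audio file.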
Figure 3: Polyphonic Analysis by splitting frequency band
Figure 4: Contour stitching along sub-bands
Once the metadata has been collected for all the audio files in the database, the same process is applied to the query, and the query's metadata is searched against the database.
A robust audio fingerprinting system can be designed using machine learning techniques. By adding noise to the training data, we can train the system to give good results for similar sounds in the future. We can train the system on multiple features, such as the characteristics of instruments’ frequency contours in a polyphonic signal, the beat of the music, vocal melody patterns extracted from a polyphonic music file, etc.
Audio fingerprinting is a technology with considerable commercial value and many use cases in broadcasting and content retrieval applications. In this article, we covered the major applications of audio fingerprinting and the details of one of the most popular, Query by example, which is an audio retrieval solution.
With over 14 years of experience in audio signal processing, PathPartner can provide the best design for your audio fingerprinting product development. If you’d like to know more about PathPartner, reach out to us for a quick consultation at firstname.lastname@example.org
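The article does not describe the search itself, so the following is only one simple possibility: slide the query contours along each stored track and score the fraction of peak bins that agree exactly. The names `match_score` and `identify`, the exact-match criterion and the toy database are all assumptions; a practical system would likely tolerate small deviations and use an index instead of a linear scan.

```python
def match_score(query, reference):
    """Best fraction of agreeing peak bins over all time offsets.

    query and reference are lists of contours (one per band); the query
    is slid along the reference and the best alignment is kept.
    """
    query_len, ref_len = len(query[0]), len(reference[0])
    if query_len > ref_len:
        return 0.0
    total = query_len * len(query)
    best = 0.0
    for offset in range(ref_len - query_len + 1):
        hits = sum(
            1
            for q_band, r_band in zip(query, reference)
            for t in range(query_len)
            if q_band[t] == r_band[offset + t]
        )
        best = max(best, hits / total)
    return best


def identify(query, database):
    """Return (track name, score) of the best-scoring database entry."""
    scored = ((name, match_score(query, ref)) for name, ref in database.items())
    return max(scored, key=lambda item: item[1])


# Hypothetical database: two tracks, two bands, six frames each.
database = {
    "track_a": [[1, 2, 2, 3, 1, 1], [6, 4, 5, 5, 7, 6]],
    "track_b": [[3, 3, 1, 0, 2, 2], [4, 4, 6, 7, 5, 4]],
}
query = [[2, 3, 1], [5, 5, 7]]  # taken from frames 2 to 4 of track_a

name, score = identify(query, database)
print(name, score)
```

Because the query is compared at every offset, the position of the best alignment could also be reported to tell where in the track the query excerpt occurs.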
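Adding noise to training audio can be sketched as below. The use of white Gaussian noise at a target signal-to-noise ratio, the function name `add_noise` and the toy signal are assumptions for illustration; the article does not say what kind of noise is added or at what level.

```python
import math
import random


def add_noise(samples, snr_db, rng):
    """Return samples plus white Gaussian noise at roughly snr_db."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


# Seeded generator so the augmentation is reproducible.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 5 * t / 200) for t in range(2000)]
noisy = add_noise(clean, snr_db=20.0, rng=rng)

print(len(noisy), "noisy samples generated")
```

Generating several noisy variants of each training example at different SNR values is one way to expose the model to the degradations it will encounter in real queries.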