Research article

BAG-OF-AUDIO-VISUAL WORDS BASED APPROACH FOR SOUND EVENT AND ACOUSTIC SCENE RECOGNITION TASKS FOR INDUSTRIAL MACHINERIES

S Chandrakala, Sreenithi B, G Revathy & R Sathya

Online First: November 21, 2022


Sound Event Recognition (SER) and Acoustic Scene Recognition (ASR) tasks are gaining importance due to their applications in personal and public security. Factors that complicate SER and ASR include the quality of the audio recording devices, the number of audio sources in a given environment, and overlapping sound and scene classes. Hence there is a need to extract different kinds of information from audio in order to learn a more robust representation of sound events and acoustic scenes. This can be achieved by representing sound in multiple forms so that the complementary information present in sound data is utilized. In this paper, we propose a Bag-of-Audio-Visual Words (BoAVW) approach for the sound event and acoustic scene recognition tasks. The proposed approach constructs Bag-of-Audio words from Mel-Frequency Cepstral Coefficient (MFCC) features and Bag-of-Visual words from Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), and moments-based visual features extracted from auditory images. A Support Vector Machine (SVM) classifier is used to classify these representations into sound events and acoustic scenes. The proposed BoAVW approach is evaluated on the benchmark datasets ESC-50 (sound events), DCASE-2016 (sound events), and DCASE-2017 (acoustic scenes), where it achieves accuracies of 66.6%, 93.2%, and 82.58%, respectively, and shows improved results compared with a few recent state-of-the-art methods.
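To make the bag-of-words idea in the abstract concrete, the following is a minimal sketch of the encoding step only, not the authors' implementation. It assumes the audio and visual codebooks have already been learned (for example by clustering, which the abstract does not specify), that the two histograms are fused by simple concatenation, and that tiny 2-D toy vectors stand in for real MFCC frames and SIFT/SURF descriptors. The names `bow_histogram` and `boavw_vector` are hypothetical, and the SVM classification step is omitted.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def bow_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword and return an
    L1-normalized histogram of codeword counts."""
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)), key=lambda k: dist(d, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)


def boavw_vector(audio_desc, audio_codebook, visual_desc, visual_codebook):
    """Concatenate the audio-word and visual-word histograms.
    Concatenation is an assumed fusion strategy for illustration."""
    return (bow_histogram(audio_desc, audio_codebook)
            + bow_histogram(visual_desc, visual_codebook))


# Toy data (assumption): 2-D points standing in for MFCC frames and
# for keypoint descriptors extracted from an auditory image.
audio_codebook = [(0.0, 0.0), (1.0, 1.0)]
visual_codebook = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
audio_frames = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.8), (0.2, 0.1)]
visual_keypoints = [(4.8, 5.1), (9.7, 10.2), (5.3, 4.9)]

feature = boavw_vector(audio_frames, audio_codebook,
                       visual_keypoints, visual_codebook)
print(feature)
```

In a full pipeline, one such fixed-length vector per recording would be passed to a classifier such as an SVM. The vector length equals the sum of the two codebook sizes, regardless of how many frames or keypoints a recording produces.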

Keywords

Sound Event Recognition (SER), Acoustic Scene Recognition (ASR), Bag-of-Audio-Visual Words (BoAVW), Mel-Frequency Cepstral Coefficients (MFCCs), Auditory image, Spectrogram, Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF).