Sanjay Majumder
Published: 2022
Audio source separation is not a new concept in signal processing, yet it remains a complex task. Because of its wide range of practical applications, researchers from varied backgrounds have applied advanced algorithms from deep learning, signal processing, data augmentation, and computer listening to isolate individual voices or instruments from audio mixtures with precision and clarity. Among these technologies, neural networks, especially recurrent neural networks (RNNs), have shown promising results on multimedia problems, and work to improve their accuracy is ongoing. This thesis contributes to this field by introducing the Bi-directional Gated Recurrent Unit (Bi-GRU), a more recent RNN variant, to separate audio stems from an audio mixture in the time-frequency domain. The GRU architecture is robust yet simpler than its predecessor, Long Short-Term Memory (LSTM), and its gating mechanism helps mitigate the exploding- and vanishing-gradient problems that make plain RNNs difficult to train on long sequences. However, because information passes only in the forward direction (left to right), both the general RNN and the GRU lack information from future cells. To resolve this issue, this study exploits the bi-directionality feature of RNNs, which lets the GRU learn from both previous and future cells and produces a better result. The audio data are transformed into spectrograms, and the Bi-GRU model extracts the essential temporal and spectral information to train and test the system, which separates four well-defined audio stems in a supervised manner.
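To make the bi-directional idea concrete, the following is a minimal pure-Python sketch of a forward-only Bi-GRU pass over spectrogram frames. It is not the thesis implementation: the names `GRUCell`, `run_gru`, and `run_bigru` are hypothetical, the weights are random and untrained, and the gate equations follow one common GRU formulation among several in use. Each frame is assumed to be a list of magnitude values, one per frequency bin.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Hypothetical minimal GRU cell (illustrative, untrained)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # One (W, U, b) triple per gate: update z, reset r, candidate n.
        self.params = {
            gate: (
                init_matrix(hidden_size, input_size, rng),
                init_matrix(hidden_size, hidden_size, rng),
                [0.0] * hidden_size,
            )
            for gate in ("z", "r", "n")
        }

    def _affine(self, gate, x, h):
        W, U, b = self.params[gate]
        return [a + c + d for a, c, d in zip(matvec(W, x), matvec(U, h), b)]

    def step(self, x, h):
        z = [sigmoid(v) for v in self._affine("z", x, h)]  # update gate
        r = [sigmoid(v) for v in self._affine("r", x, h)]  # reset gate
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(v) for v in self._affine("n", x, reset_h)]  # candidate
        # Blend the candidate state with the previous state.
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def run_gru(cell, frames):
    """Run one GRU over the frames in the given order; return all hidden states."""
    h = [0.0] * cell.hidden_size
    outputs = []
    for x in frames:
        h = cell.step(x, h)
        outputs.append(h)
    return outputs


def run_bigru(fwd_cell, bwd_cell, frames):
    """Concatenate a left-to-right pass with a right-to-left pass per frame."""
    forward = run_gru(fwd_cell, frames)
    backward = run_gru(bwd_cell, frames[::-1])[::-1]  # re-align to frame order
    return [hf + hb for hf, hb in zip(forward, backward)]


rng = random.Random(0)
fwd, bwd = GRUCell(3, 2, rng), GRUCell(3, 2, rng)
frames = [[0.1, 0.2, 0.3], [0.0, 0.5, 0.1], [0.4, 0.4, 0.4]]  # toy "spectrogram"
features = run_bigru(fwd, bwd, frames)
```

In this sketch the first half of each output vector depends only on the current and earlier frames, while the second half depends on the current and later frames, so the representation of every frame carries context from both directions.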
The newly developed source separation model is tested on the MUSDB18 [45] dataset, and its performance is assessed with the museval [61] evaluation toolbox and the Mean Opinion Score (MOS). The measured performance is then compared with that of other known models. In addition, this thesis provides a detailed survey of audio source separation work and concludes with a discussion of some observations and shortcomings of the system.
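As a rough illustration of the two kinds of assessment mentioned above, the sketch below computes a simplified signal-to-distortion ratio (an objective measure) and a mean opinion score (a subjective one). It is not the museval implementation, which computes a more elaborate set of separation metrics; the function names `sdr_db` and `mean_opinion_score` and the sample values are hypothetical, and the MOS ratings are assumed to be listener scores on a 1 to 5 scale.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB: reference energy over error energy.

    Assumption: both signals are equal-length lists of samples.
    """
    signal = sum(s * s for s in reference)
    error = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal == 0:
        raise ValueError("reference signal is silent")
    if error == 0:
        return float("inf")  # perfect reconstruction
    return 10.0 * math.log10(signal / error)


def mean_opinion_score(ratings):
    """Average of listener ratings (assumed to lie on a 1 to 5 scale)."""
    if not ratings:
        raise ValueError("no ratings given")
    return sum(ratings) / len(ratings)


reference = [1.0, -1.0, 1.0, -1.0]   # toy ground-truth stem
estimate = [0.9, -0.9, 0.9, -0.9]    # toy separated stem
sdr = sdr_db(reference, estimate)
mos = mean_opinion_score([4, 5, 3, 4])
```

A higher SDR means the separated stem is closer to the ground-truth stem, while the MOS summarizes how listeners judged the perceived quality.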