

Table 1 - Divide & Conquer in Computer Science and Music Genre Classification.

Although the definitions are not the same, they overlap substantially, and personally, I find the term divide & conquer very fitting in both cases. In music genre classification, the term was first used (to my knowledge) by Dong (2018). However, other researchers like Nasrullah & Zhao (2019) have also used this method, just not under the same name.

Often, genre classification is based on exactly one 30-second snippet of a track. This is partly because commonly used music data sources like the GTZAN or FMA datasets or the Spotify Web API provide tracks of this length. However, there are three major advantages to applying the divide & conquer approach:

1. Given that you have full-length tracks available, taking a 30-second slice and throwing away the rest of a 3–4 minute track is very data-inefficient. Using divide & conquer, most of the audio signal can be used. Moreover, by allowing for overlap between slices, even more snippets can be drawn. In fact, I would argue that this is a form of natural and fairly seamless data augmentation.

For example, you can get more than 80x the training data from a 3-minute track if you draw 3-second snippets with an overlap of 1 second compared to drawing one 30-second snippet per track. Even if you only have 30-second tracks available, you can get 14 snippets out of each of them with 3-second snippets and 1 second of overlap.

2. Lower-Dimensional Data

A 30-second snippet produces quite a large spectrogram. With common parameters, one spectrogram can have a shape of (1290 x 120), i.e., over 150k data points. Naturally, a 3-second snippet with the same parameters will produce a ~(129 x 120) spectrogram with only 15k data points. Depending on the machine learning model and architecture you are using, this can reduce the model complexity significantly.

Side note: In case you are unaware of what spectrograms are or why and how they are used for audio classification, I recommend this article by The Experimental Writer for a nice and intuitive explanation.

3. If you want to apply your trained model in the real world, you are going to encounter tracks of all lengths. Instead of having to struggle with finding just the right 30-second snippet to extract from each track, you just start drawing 3-second snippets until the whole track is covered. And what if the input track is less than 30 seconds long? Is your model robust enough to deal with 10-second jingles which are zero-padded to reach 30 seconds? With divide & conquer, this is no issue.

There are three major downsides to using divide & conquer. Firstly, it adds extra processing steps to split the audio files and to perform the aggregated predictions for a full track. Moreover, the snippet-based approach requires a track-wise train-validation-test split to avoid correlated training, validation, and test datasets. Luckily, both of these steps are already included in my single-label audio processing pipeline SLAPP, which is freely available on GitHub. Lastly, the individual snippet predictions are usually aggregated using some sort of majority vote. With that, your model is completely oblivious to any musical relationships which unfold over more than, e.g., 3 seconds of audio.
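To make the snippet arithmetic concrete, here is a minimal sketch of drawing overlapping 3-second snippets from a track. It is not taken from SLAPP: the function names `snippet_starts` and `draw_snippets` and the 22,050 Hz sample rate are assumptions for illustration, and a trailing remainder shorter than one snippet is simply dropped.

```python
from typing import List, Sequence


def snippet_starts(n_samples: int, sample_rate: int,
                   snippet_sec: float = 3.0,
                   overlap_sec: float = 1.0) -> List[int]:
    """Return the start index (in samples) of every full-length snippet."""
    snippet_len = int(round(snippet_sec * sample_rate))
    # Consecutive snippets share `overlap_sec` seconds, so the window
    # advances by (snippet length - overlap) each step.
    hop = int(round((snippet_sec - overlap_sec) * sample_rate))
    if hop <= 0:
        raise ValueError("overlap must be shorter than the snippet length")
    if n_samples < snippet_len:
        return []  # track shorter than one snippet; padding is up to the caller
    return list(range(0, n_samples - snippet_len + 1, hop))


def draw_snippets(signal: Sequence, sample_rate: int,
                  snippet_sec: float = 3.0,
                  overlap_sec: float = 1.0) -> list:
    """Slice a 1-D signal (list or NumPy array) into overlapping snippets."""
    snippet_len = int(round(snippet_sec * sample_rate))
    starts = snippet_starts(len(signal), sample_rate, snippet_sec, overlap_sec)
    return [signal[s:s + snippet_len] for s in starts]


SAMPLE_RATE = 22050  # assumed sample rate, not specified in the text

# 3-minute track, 3-second snippets, 1 second of overlap
print(len(snippet_starts(180 * SAMPLE_RATE, SAMPLE_RATE)))  # 89

# 30-second track, same settings
print(len(snippet_starts(30 * SAMPLE_RATE, SAMPLE_RATE)))   # 14
```

With these settings, a 3-minute track yields 89 snippets instead of a single 30-second slice, and a 30-second track yields 14, which matches the numbers quoted in the text.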

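The two extra processing steps that divide & conquer requires, a track-wise split and a majority vote over snippet predictions, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the SLAPP implementation: `trackwise_split` and `majority_vote` are hypothetical helper names, and the 10% validation and test fractions are arbitrary.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence


def trackwise_split(track_ids: Sequence[str], val_frac: float = 0.1,
                    test_frac: float = 0.1, seed: int = 0) -> Dict[str, List[int]]:
    """Assign snippet indices to train/val/test so that all snippets of
    one track end up in the same subset."""
    unique_tracks = sorted(set(track_ids))
    random.Random(seed).shuffle(unique_tracks)
    n_test = int(round(test_frac * len(unique_tracks)))
    n_val = int(round(val_frac * len(unique_tracks)))
    test_tracks = set(unique_tracks[:n_test])
    val_tracks = set(unique_tracks[n_test:n_test + n_val])

    split: Dict[str, List[int]] = {"train": [], "val": [], "test": []}
    for idx, track in enumerate(track_ids):
        if track in test_tracks:
            split["test"].append(idx)
        elif track in val_tracks:
            split["val"].append(idx)
        else:
            split["train"].append(idx)
    return split


def majority_vote(snippet_predictions: Sequence[str]) -> str:
    """Return the most frequent snippet label as the track-level prediction.
    Ties go to the label that was seen first."""
    if not snippet_predictions:
        raise ValueError("need at least one snippet prediction")
    return Counter(snippet_predictions).most_common(1)[0][0]


# Example: 10 tracks with 14 snippets each
track_ids = [f"track_{t}" for t in range(10) for _ in range(14)]
split = trackwise_split(track_ids)
print({name: len(indices) for name, indices in split.items()})

# Example: aggregate per-snippet genre predictions for one track
print(majority_vote(["rock", "rock", "pop", "rock", "jazz"]))  # rock
```

Splitting by track rather than by snippet matters because overlapping snippets from the same track are highly similar; if they were spread across subsets, validation and test scores would be overly optimistic.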