Deep Learning Approaches for Automatic Drum Transcription

  • Zakiya Azizah Cahyaningtyas Institut Teknologi Sepuluh Nopember, Indonesia
  • Diana Purwitasari Institut Teknologi Sepuluh Nopember, Indonesia
  • Chastine Fatichah Institut Teknologi Sepuluh Nopember, Indonesia
Keywords: Audio Classification, Automatic Drum Transcription, Deep Learning, Multi-Objective Optimization


Drum transcription is the task of transcribing audio or music into drum notation. Drum notation serves as instruction for drummers when playing and can also help students learn drum music theory. Unfortunately, transcribing music is not an easy task: a good transcription can usually be produced only by an experienced musician, yet musical notation is beneficial not only for professionals but also for amateurs. This study develops an Automatic Drum Transcription (ADT) application using the segment-and-classify method, with deep learning as the classification method. The segment-and-classify method consists of two steps. First, the segmentation step detects note onsets; after a grid search to tune its parameters, it achieved a macro F1 score of 76.14%. Second, a spectrogram feature is extracted at each detected onset as the input to the classification models. The models are evaluated using a multi-objective optimization (MOO) score that combines macro F1 and prediction time. The results show that the LSTM model outperformed the other models, with MOO scores of 77.42%, 86.97%, and 82.87% on the MDB Drums, IDMT-SMT Drums, and combined datasets, respectively. The best model is then used in the ADT application, which is built with the FastAPI framework and delivers the transcription result as a drum tab.
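The two-step pipeline described above can be sketched in miniature. Both pieces below are illustrative assumptions, not the paper's published formulas: the paper tunes a maximum-filter onset detector via grid search, and it does not state how macro F1 and prediction time are weighted in the MOO score, so a simple local-maximum rule and an equal-weight combination stand in here.

```python
def pick_onsets(envelope, threshold=0.5):
    """Toy segmentation step: return indices of local maxima of an
    onset-strength envelope (a list of floats) that exceed `threshold`.
    Real systems use more robust peak picking, e.g. maximum-filter
    vibrato suppression."""
    return [i for i in range(1, len(envelope) - 1)
            if envelope[i] > threshold
            and envelope[i] >= envelope[i - 1]
            and envelope[i] > envelope[i + 1]]


def moo_score(macro_f1, pred_time, max_time, w_f1=0.5, w_time=0.5):
    """Hypothetical MOO score: a weighted sum of macro F1 and a
    normalized speed term, so a faster model scores higher at equal
    accuracy. Weights and normalization are assumptions."""
    speed = 1.0 - pred_time / max_time  # 1.0 = instantaneous prediction
    return w_f1 * macro_f1 + w_time * speed
```

Under this assumed weighting, a model with macro F1 of 0.80 whose prediction takes half the slowest model's time would score 0.5 * 0.80 + 0.5 * 0.5 = 0.65.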


How to Cite
Cahyaningtyas, Z. A., Purwitasari, D., & Fatichah, C. (2023). Deep Learning Approaches for Automatic Drum Transcription. EMITTER International Journal of Engineering Technology, 11(1), 21-34.