Common neural network models suffer from low accuracy and low efficiency in music sentiment classification tasks. To further mine the sentiment information contained in the audio spectrum and improve the accuracy of music sentiment classification, an improved Vision Transformer model is proposed. Because existing public data sets do not meet the requirements of the music sentiment classification task, this paper constructs a four-category music sentiment data set. The audio is first preprocessed, and the resulting spectral features are reshaped to fit the input structure of the Vision Transformer before training. The model's position parameters preserve the relationships between audio features, and its encoder structure learns both local and global features. Because training this model is slow, a SoftPool pooling layer is introduced, which retains the emotional features and speeds up computation while preserving the model's accuracy. Experimental results show that the Vision Transformer model reaches a classification accuracy of 86.5%, a better classification result than neural networks such as ResNet. Meanwhile, the improved Vision Transformer reduces training time by 10.4% while lowering accuracy by only 0.3%. On the public GTZAN data set, the model reaches an accuracy of 90.7%.
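The abstract does not include the authors' implementation, so the following is only a minimal PyTorch sketch of the two ideas it describes: patch-embedding a spectrogram with learned position parameters and a Transformer encoder, and inserting a SoftPool layer (the exponentially weighted pooling of Stergiou et al. [13]) to shrink the token sequence and speed up training. The 128x128 mel-spectrogram input, patch size, layer sizes, and SoftPool placement are all illustrative assumptions, not the paper's configuration; only the four emotion classes come from the abstract.

```python
# Minimal sketch (not the authors' code), assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftPool2d(nn.Module):
    """SoftPool: each activation a_i in a window is weighted by softmax(a_i),
    so salient features contribute more than in average pooling while more of
    the signal survives than with max pooling."""
    def __init__(self, kernel_size=2, stride=2):
        super().__init__()
        self.kernel_size, self.stride = kernel_size, stride

    def forward(self, x):
        # sum_i exp(a_i) * a_i / sum_i exp(a_i) per window, computed via
        # avg_pool2d: avg(exp(x) * x) / avg(exp(x)) equals the SoftPool output.
        # (exp is applied to raw activations; a real implementation would
        # subtract the window max for numerical stability.)
        e = torch.exp(x)
        return F.avg_pool2d(e * x, self.kernel_size, self.stride) / (
            F.avg_pool2d(e, self.kernel_size, self.stride) + 1e-8)

class SpectrogramViT(nn.Module):
    """Toy ViT-style classifier: SoftPool downsamples the spectrogram before
    patch embedding, shrinking the token sequence and hence training time."""
    def __init__(self, n_mels=128, frames=128, patch=16, dim=256,
                 depth=4, heads=8, num_classes=4):
        super().__init__()
        self.pool = SoftPool2d(2, 2)                       # 128x128 -> 64x64
        n_patches = (n_mels // 2 // patch) * (frames // 2 // patch)
        self.embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))    # [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # position parameters
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                  # x: (B, 1, n_mels, frames)
        x = self.pool(x)                                   # SoftPool downsampling
        x = self.embed(x).flatten(2).transpose(1, 2)       # (B, N, dim) patch tokens
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1) + self.pos
        x = self.encoder(x)
        return self.head(x[:, 0])                          # classify from [CLS]

logits = SpectrogramViT()(torch.randn(2, 1, 128, 128))     # -> shape (2, 4)
```

With these toy sizes, SoftPool halves each spatial dimension before patch embedding, so the encoder sees 16 patch tokens instead of 64; a shorter token sequence is where the reported training-time saving would come from.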
Published in | American Journal of Computer Science and Technology (Volume 6, Issue 1)
DOI | 10.11648/j.ajcst.20230601.16
Page(s) | 42-49
Creative Commons | This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright | © The Author(s), 2023. Published by Science Publishing Group
Keywords | Vision Transformer, Musical Sentiment, Sentiment Classification
[1] | Xiao Xiaohong, Zhang Yi, Liu Dongsheng, Ouyang Chunjuan. Music Classification Based on Hidden Markov Model [J]. Computer Engineering and Applications, 2017, 53 (16): 138-143+165.
[2] | Kang J, Wang H L, Su G B, et al. Survey of Music Emotion Recognition [J]. Computer Engineering and Applications, 2022, 58 (04): 64-72.
[3] | Feng P Y. A Music Classification Recommendation Method Based on GRU and Attention Mechanism [D]. Guangdong University of Technology, 2021. DOI: 10.27029/d.cnki.ggdgu.2021.001410.
[4] | Jia N, Zhen C J. Model of Music Theme Recommendation Based on Attention LSTM [J]. Computer Science, 2019, 46 (S2): 230-235.
[5] | Chen Changfeng. Song Audio Emotion Classification Based on CNN-LSTM [J]. Communications Technology, 2019, 52 (05): 1114-1118.
[6] | Zhang Yusha, Jiang Shengyi. Research on Speech Emotion Data Mining Classification and Recognition Method Based on MFCC Feature Extraction and Improved SVM [J]. Computer Applications and Software, 2020, 37 (08): 160-165+212.
[7] | Cai X, Zhang H. Music genre classification based on auditory image, spectral and acoustic features [J]. Multimedia Systems, 2022, 28 (3): 779-791.
[8] | Tang X, Zhang C X, Li J F. Music Emotion Recognition Based on Deep Learning [J]. Computer Knowledge and Technology, 2019, 15 (11): 232-237. DOI: 10.14004/j.cnki.ckt.2019.1170.
[9] | Tian Yonglin, Wang Yutong, Wang Jiangong, Wang Xiao, Wang Feiyue. Key problems and progress of vision Transformers: The state of the art and prospects [J]. Acta Automatica Sinica, 2022, 48 (4): 957−9.
[10] | Hassani A, Walton S, Shah N, et al. Escaping the Big Data Paradigm with Compact Transformers [J]. arXiv preprint arXiv:2104.05704, 2021.
[11] | Liu Wenting, Lu Xinming. Research Progress of Transformer Based on Computer Vision [J]. Computer Engineering and Applications, 2022, 58 (06): 1-16.
[12] | Dosovitskiy A, Beyer L, Kolesnikov A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [C]// International Conference on Learning Representations (ICLR), 2021.
[13] | Stergiou A, Poppe R, Kalliatakis G. Refining activation downsampling with SoftPool [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. DOI: 10.48550/arXiv.2101.00440.
[14] | Song Yang. Research on Mongolian Music Classification Based on Transformer [D]. Inner Mongolia Normal University, 2022. DOI: 10.27230/d.cnki.gnmsu.2022.001124.
[15] | Dong Anming, Liu Zongyin, Yu Jiguo, Han Yubing, Zhou You. Automatic Music Genre Classification Based on Visual Transformation Network [J]. Journal of Computer Applications, 2022, 42 (S1): 54-58.
APA Style
Chen Zhen, Liu Changhui. (2023). Music Audio Sentiment Classification Based on Improved Vision Transformer. American Journal of Computer Science and Technology, 6(1), 42-49. https://doi.org/10.11648/j.ajcst.20230601.16
ACS Style
Chen Zhen; Liu Changhui. Music Audio Sentiment Classification Based on Improved Vision Transformer. Am. J. Comput. Sci. Technol. 2023, 6(1), 42-49. doi: 10.11648/j.ajcst.20230601.16
AMA Style
Chen Zhen, Liu Changhui. Music Audio Sentiment Classification Based on Improved Vision Transformer. Am J Comput Sci Technol. 2023;6(1):42-49. doi: 10.11648/j.ajcst.20230601.16
BibTeX
@article{10.11648/j.ajcst.20230601.16,
  author = {Chen Zhen and Liu Changhui},
  title = {Music Audio Sentiment Classification Based on Improved Vision Transformer},
  journal = {American Journal of Computer Science and Technology},
  volume = {6},
  number = {1},
  pages = {42-49},
  doi = {10.11648/j.ajcst.20230601.16},
  url = {https://doi.org/10.11648/j.ajcst.20230601.16},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ajcst.20230601.16},
  year = {2023}
}
RIS
TY - JOUR
T1 - Music Audio Sentiment Classification Based on Improved Vision Transformer
AU - Chen Zhen
AU - Liu Changhui
Y1 - 2023/03/31
PY - 2023
N1 - https://doi.org/10.11648/j.ajcst.20230601.16
DO - 10.11648/j.ajcst.20230601.16
T2 - American Journal of Computer Science and Technology
JF - American Journal of Computer Science and Technology
JO - American Journal of Computer Science and Technology
SP - 42
EP - 49
PB - Science Publishing Group
SN - 2640-012X
UR - https://doi.org/10.11648/j.ajcst.20230601.16
VL - 6
IS - 1
ER -