LLMs Are Not a Cure-All: A Practical Test of Metadata-Based Music Classification

The question was whether current large language models (LLMs) such as GPT-4 or DeepSeek can automatically and reliably classify music tracks, specifically salsa songs, as "Salsa Cubana" or "Salsa Línea" based on title, artist, lyrics, and other metadata. It was known in advance that the available information (metadata, genre tags, lyrics) is incomplete and partly inconsistent. The test was explicitly designed to determine the practical limits of today's LLMs in this context.

Approach
For each song in a Spotify playlist, all available metadata was collected: title, artist, album, genres, release date, lyrics (via Genius), and additional tags and biographies (via Last.fm). This data was passed to the LLM in structured form. The prompt was detailed: in addition to a role definition and classification criteria, it included examples and explicit instructions to answer exclusively with "Cubana" or "Línea".
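The structured input and the two-label restriction described above could look roughly like the following sketch. It is not the prompt used in the test: the `Track` container, the field names, the `max_lyrics_chars` limit, and the wording of the instruction are all assumptions made for illustration, and no real Spotify, Genius, or Last.fm call is shown.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import List, Optional

LABELS = ("Cubana", "Línea")  # the only answers the prompt allows


@dataclass
class Track:
    """Hypothetical container for the metadata fields listed above."""
    title: str
    artist: str
    album: str = ""
    genres: List[str] = field(default_factory=list)
    release_date: str = ""
    lyrics: str = ""
    tags: List[str] = field(default_factory=list)
    biography: str = ""


def build_prompt(track: Track, max_lyrics_chars: int = 1500) -> str:
    """Render one track as labeled lines followed by the answer instruction."""
    fields = [
        ("Title", track.title),
        ("Artist", track.artist),
        ("Album", track.album),
        ("Genres", ", ".join(track.genres)),
        ("Release date", track.release_date),
        ("Tags", ", ".join(track.tags)),
        ("Biography", track.biography),
        ("Lyrics", track.lyrics[:max_lyrics_chars]),
    ]
    # Mark missing values explicitly instead of silently dropping the field.
    lines = [f"{name}: {value or 'unknown'}" for name, value in fields]
    instruction = "Answer with exactly one word: " + " or ".join(LABELS) + "."
    return "\n".join(lines + ["", instruction])


def _normalize(text: str) -> str:
    """Strip accents and case so that 'LINEA' and 'Línea' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def parse_label(response: str) -> Optional[str]:
    """Return the label only if exactly one allowed label occurs in the reply."""
    normalized = _normalize(response)
    hits = [label for label in LABELS if _normalize(label) in normalized]
    return hits[0] if len(hits) == 1 else None
```

Rejecting replies that mention both labels or neither (`parse_label` returns `None`) keeps ambiguous answers out of the results instead of forcing them into one of the two classes.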

Observations and Insights

The classification by the LLM was purely text-based; genuine musical understanding can therefore be ruled out.
The available metadata was insufficient to distinguish the styles: genre tags were generic, and labels such as "Salsa Cubana" or "Línea" were mostly missing.
Lyrics rarely provided clear clues, because the thematic range in salsa is very broad.
The LLMs produced partly inconsistent results for identical prompts and data. Different models (e.g. DeepSeek, GPT-4o) sometimes arrived at different outcomes for the same input.
Adding more metadata and external sources such as Last.fm did not improve the quality of the assignments. The results remained unreliable; often "Cubana" was assigned indiscriminately, regardless of actual stylistic features.
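The observations about inconsistency and the bias toward "Cubana" can be put into numbers with a few simple measures. The sketch below is one possible way to do that, not the evaluation used in the test: the function names are hypothetical, and the labels are assumed to have already been collected from repeated runs or from two different models.

```python
from collections import Counter
from typing import Dict, List


def self_consistency(labels: List[str]) -> float:
    """Share of repeated runs that match the most frequent label (1.0 = stable)."""
    if not labels:
        raise ValueError("need at least one label")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def inter_model_agreement(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Fraction of tracks, keyed by track id, that two models labeled identically."""
    shared = a.keys() & b.keys()
    if not shared:
        raise ValueError("the two result sets share no tracks")
    return sum(a[track] == b[track] for track in shared) / len(shared)


def label_distribution(labels: List[str]) -> Dict[str, float]:
    """Relative frequency of each label; a heavy skew hints at a default answer."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

A self-consistency well below 1.0 for identical input points to unstable output, while a label distribution dominated by one class, compared against a hand-labeled sample, would make the indiscriminate "Cubana" assignments visible.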

Conclusion
A reliable music-style classification based solely on metadata and lyrics is not possible with current LLMs. The results are inconsistent, the error rate is high, and the assignments are mostly unexplainable. Additional metadata and elaborate prompt templates with examples and instructions could not overcome the models' limitations. Precise stylistic classification of music still requires human expertise or dedicated musical analysis.