While listening to a piece of music, listeners automatically build a mental image of the song by abstracting its most prominent musical elements. The mental representation of these elements is used to compare characteristics of different songs. Experiments have shown that musical timbre, tempo and genre play an important role in the perception of both inter- and intra-song similarity. However, it is not clear which musical cues dominate the listeners' perception of similarity, and no theoretical or experimental framework has addressed the problem of establishing a hierarchical description of cue relevance. One of the main limitations of previous studies is the narrow experimental methodology and the small number of songs and genres typically used in the perceptual experiments. Recent literature suggests that a large perceptual database could improve the performance ceiling reached by existing signal-based music-similarity algorithms. The aims of the present thesis are to gain a better understanding of the listener's perception of similarity between songs of Western popular music and to collect perceptual data on an extended music database for both the testing of theoretical models and the implementation of algorithmic applications. To investigate the perception of music similarity and collect a perceptual database, two perceptual experiments were conducted: a lab-based exploratory experiment to test and optimize the experimental method, and a larger-scale web-based experiment to extend the experimental paradigm to a larger set of stimuli and control variables. Both experiments used triadic comparisons of song excerpts selected from several genres of Western popular music: the participants listened iteratively to three song excerpts and chose the most similar and the least similar pair. The experimental method was conceived to maximize the stimulus set size while keeping the experimental time reasonable for the participants.
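The triadic-comparison paradigm described above can be sketched as a simple tally: each trial contributes a positive vote to the pair judged most similar and a negative vote to the pair judged least similar. This is a minimal illustration, not the thesis's actual scoring scheme; the excerpt names and the +1/−1 tallying are assumptions for the example.

```python
from collections import Counter

def record_triad(counts, most_similar, least_similar):
    """Tally one triadic-comparison trial (hypothetical scoring):
    +1 for the pair judged most similar, -1 for the pair judged
    least similar. Pairs are stored order-independently."""
    counts[frozenset(most_similar)] += 1
    counts[frozenset(least_similar)] -= 1
    return counts

# Toy trial: two rock excerpts and one jazz excerpt (invented labels).
counts = Counter()
record_triad(counts,
             most_similar=("rock_a", "rock_b"),
             least_similar=("rock_b", "jazz_c"))
```

Aggregating such tallies over many participants and triads yields a pairwise similarity matrix over the stimulus set, which downstream analyses (rankings, perceptual-space projections) can then use.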
Data analyses include an examination of participant concordance to evaluate the existence of a stable and common perception of music similarity across and within participants, a comparison of the relative influence of control variables, and an investigation of the factors underlying the organization of the participants' perceptual space. The first part of this thesis focuses on the description of the experimental design used to collect the perceptual data. Several cross-checks of participant concordance under various conditions, together with side experiments, support the overall robustness of the experimental design and the simplicity of the task for the participant. The statistically significant concordance found within and across participants suggests the existence of a stable and common basis for the perception of music similarity. No difference in consistency was found between musicians and non-musicians, or between participants classified as familiar and unfamiliar with the stimulus material. Within our experimental and selected-song context, we found statistically significant evidence for a hierarchical salience of the control variables used in the stimulus selection on participants' rankings: genre > tempo > timbre. The second part of the thesis includes a deeper analysis of the participants' perceptual space using features calculated from the rankings of the second, large-scale experiment. A quadratic discriminant analysis quantitatively confirmed the qualitative hierarchy of relevance in control variables found in the first experiment. We defined and labeled three axes, "slow-fast", "vocal-non-vocal" and "synthetic-acoustic", that show significant separation of the excerpt classes. On the tempo axis, we found a high correlation between the logarithm of the excerpts' beats per minute and their projected positions.
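The reported relation between log-BPM and position on the "slow-fast" axis can be illustrated with a plain Pearson correlation. The BPM values and axis projections below are invented for the example; only the idea of correlating log(BPM) with the projected position comes from the text.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical excerpts: tempo in BPM and projection on the
# "slow-fast" axis (illustrative values, not thesis data).
bpm = [60, 90, 120, 150, 180]
axis_pos = [-1.0, -0.4, 0.1, 0.5, 0.9]
r = pearson_r([math.log(b) for b in bpm], axis_pos)
```

Correlating against log(BPM) rather than raw BPM reflects the roughly logarithmic character of tempo perception: doubling the tempo is perceived as a comparable step anywhere on the scale.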
Finally, we found that the hierarchical order of relevance of the control variables differs when evaluated globally, on the whole set of stimuli, or contextually, on a specific stimulus subset. In the last part of the thesis, we used commonly available feature-extraction algorithms to map the physical properties of each song signal to the participants' perceptual space, in order to build an algorithm able to predict participant behavior. In this process, we evaluated the performance of the specific feature-extraction algorithms and the relevance of musicologically grouped feature subsets: pitch, loudness, rhythm and timbre. A trained linear model can correctly predict 52.3 ± 0.5% of the rankings of the most similar pair within song triads. This is a good result considering that the theoretical limit of algorithmic performance, estimated from participant concordance in the perceptual experiment, is 78 ± 8%. In predicting the perceptual similarity data, our model outperforms the state-of-the-art algorithm from the MIREX 2006 competition. Timbre features were found to be the most important subset for the prediction of inter-song perceptual similarity.
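The prediction task can be sketched as follows: given signal-derived feature vectors for the three excerpts of a triad, predict the most similar pair as the pair with the smallest distance in feature space. This is a stand-in for the trained linear model described above; the feature vectors and the unweighted Euclidean distance are assumptions made for the example.

```python
from itertools import combinations
import math

def euclid(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_most_similar(triad_features):
    """Predict the most-similar pair in a triad as the pair of
    excerpts with the smallest feature-space distance (a simplified
    stand-in for a trained linear similarity model)."""
    pairs = combinations(range(3), 2)
    return min(pairs, key=lambda p: euclid(triad_features[p[0]],
                                           triad_features[p[1]]))

# Hypothetical 2-D feature vectors (e.g. timbre/tempo summaries)
# for three excerpts; excerpts 0 and 1 lie close together.
feats = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8)]
pred = predict_most_similar(feats)
```

Note that guessing a pair uniformly at random would be correct about 33% of the time, which puts the reported 52.3 ± 0.5% accuracy, and the 78 ± 8% ceiling estimated from participant concordance, into perspective.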
Qualification: Doctor of Philosophy
Award date: 9 Jun 2009
Place of Publication: Eindhoven
Publication status: Published - 2009