Wednesday, May 23, 2007

Kmeans clustering


After a great deal of research and meetings with Professors Belew and de Sa, I finally decided to use k-means to cluster the tags. Initially I wanted to use a fuzzy clustering algorithm, but Belew deterred me from that and suggested going with a simpler method.

During my discussion with Belew he also pointed out that it might be more meaningful to use the tags from the songs, not from the artists, to learn how people conceptualize genres, since it is the song, not the artist, that is the unit of experience. While I agree that he is probably correct, I think I have spent too long going down this path to switch now.

Despite this, I don't think it poses a major problem for the project, because the tags applied to an artist seem to point metonymically to the music that the artist produces. So, in a sense, the artist tags are for the music. However, because the tags for the songs are going to be more specific, I intend to use the tags applied to the songs if I can continue this research.

My meeting with Professor de Sa was very informative. She suggested using a 'cityblock' distance parameter for clustering the data, which makes a lot of sense because the data is noncontinuous. She also suggested adding clusters until I found an elbow in the amount of error.

I ran the algorithm on the IDM data set and found nice breaking points in the summed error at 7 and 11 clusters. I think that 11 clusters is too many for an array with 50 parameters, so I am inclined to use 7. I also think that 7 will be a good number when I use the clusters to organize the features in the models.
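The elbow search described above can be sketched roughly as follows. This is an illustration in plain Python, not the script that produced the results in this post. It assumes that clustering with a city-block distance means assigning each point to the nearest centroid by summed absolute differences and updating each centroid to the component-wise median of its members. The names `kmedians` and `error_by_k` are made up for this example.

```python
import random
from statistics import median


def cityblock(a, b):
    """City-block (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: cityblock(p, centroids[c]))
        for p in points
    ]


def kmedians(points, k, iterations=50, seed=0):
    """Cluster with L1 distance; return (centroids, labels, summed error)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = assign(points, centroids)
    for _ in range(iterations):
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                # The component-wise median minimizes summed L1 distance.
                centroids[c] = [median(col) for col in zip(*members)]
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    error = sum(cityblock(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, error


def error_by_k(points, ks, restarts=5):
    """Best summed error over a few random restarts, for each k in ks."""
    return {
        k: min(kmedians(points, k, seed=s)[2] for s in range(restarts))
        for k in ks
    }
```

Calling `error_by_k(rows, range(1, 16))` on a list of artist tag vectors and plotting the result against the number of clusters should show where the summed error stops dropping sharply, which is the kind of breaking point reported above.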

Wednesday, May 9, 2007

Filtering

In order to eliminate extraneous attributes from the arrays and create a more manageable data set, I have been working on filtering the tag arrays. I took the normalized values and filtered out all of the tags that were more than 0.5 z-scores below the mean. This seemed like a good cutoff point because it excluded all tags that only one artist has and most tags that only two artists share.
This, so far, has produced arrays for each genre with about 50 tags. This is about how many tags I was hoping to work with. Now that I have normalized arrays for all of the data I can move on to working on performing a fuzzy clustering analysis.
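A minimal sketch of this kind of cutoff, in Python rather than the original script. The post does not say exactly which quantity was z-scored, so this example assumes each tag has one total (for instance, its normalized values summed over the artists in a genre) and that the z-scores use the population standard deviation. The function name `filter_tags` is invented for the illustration.

```python
from statistics import mean, pstdev


def filter_tags(tag_totals, cutoff=-0.5):
    """Keep tags whose z-score is at or above the cutoff.

    tag_totals maps a tag name to a single number (an assumed per-tag total).
    """
    values = list(tag_totals.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # Every tag has the same total, so nothing falls below the mean.
        return dict(tag_totals)
    return {
        tag: total
        for tag, total in tag_totals.items()
        if (total - mu) / sigma >= cutoff
    }
```

With totals such as `{"idm": 10, "glitch": 9, "ambient": 8, "seen live": 1}`, only the last tag falls more than half a standard deviation below the mean and is dropped.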

Monday, May 7, 2007

Normalization

In order to get an idea of how the different concepts for each tag fit together, I felt that it was necessary to normalize all of the tags. To normalize the tag data for the musicians I created a script that takes the minimum and maximum for each tag, subtracts the minimum from the maximum, and defines that value as the range. It then subtracts the minimum from each value and divides the result by the range.
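The procedure above is ordinary min-max scaling, and a short Python sketch of it (not the original script) might look like this. The `fixed_min` argument is an addition for illustration: it lets the minimum be pinned to a chosen value such as zero instead of being taken from the data. Returning zeros when the range is zero is also an assumption of this sketch.

```python
def normalize_tag(values, fixed_min=None):
    """Min-max scale one tag's values across musicians.

    With fixed_min=None the minimum comes from the data; otherwise the
    supplied value (for example 0) is used as the minimum.
    """
    low = min(values) if fixed_min is None else fixed_min
    span = max(values) - low  # the range
    if span == 0:
        # All values equal the minimum, so there is nothing to scale.
        return [0.0 for _ in values]
    return [(v - low) / span for v in values]
```

With a data-derived minimum, the musician with the lowest count for a tag is always scaled to 0, even when the tag was applied to that musician many times. Passing `fixed_min=0` keeps that musician's value proportional to the highest count instead.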

After filtering the tags this way, I realized that this didn't weight the tags evenly if a tag was shared by all of the musicians. To compensate for this, I went back and changed the minimum value to zero in the script and re-sorted the arrays. Below are some examples of the filtered data.

Wednesday, April 25, 2007

Preliminary Data Analysis






These are the figures based on the number of tags for each musician. Each different symbol represents a different musician.

Methodology

In order to investigate the question of how people use tags to conceptualize musical genres, I have been downloading tag information made available through lastFM’s Audioscrobbler Web Services. The information is available as XML pages, from which I have been extracting the tag values. I chose to download the tag information for the top 10 musicians in each genre. As the top tagged members, they are more central to the genre and therefore should be the best exemplars for modeling.

After downloading and parsing the data, I removed all of the tags that were only applied one time. I made this decision because the purpose of the project is to model how the users of lastFM conceptualize a given genre, and if only one person feels that something is deserving of a tag, then that is not representative enough of our culture to be accounted for.
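The parsing and single-use filtering steps could be sketched along these lines. The XML layout in `SAMPLE_XML` is a made-up stand-in, not the actual Audioscrobbler schema, and the element names `tag`, `name`, and `count`, along with both function names, are assumptions for the sake of the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page layout; the real service's XML may differ.
SAMPLE_XML = """<toptags artist="Example Artist">
  <tag><name>idm</name><count>100</count></tag>
  <tag><name>ambient</name><count>42</count></tag>
  <tag><name>seen live</name><count>1</count></tag>
</toptags>"""


def parse_tags(xml_text):
    """Map each tag name to its count for one artist page."""
    root = ET.fromstring(xml_text)
    return {
        tag.findtext("name"): int(tag.findtext("count"))
        for tag in root.iter("tag")
    }


def drop_single_use(tags):
    """Remove tags that were applied only once."""
    return {name: count for name, count in tags.items() if count > 1}
```

Running `drop_single_use(parse_tags(SAMPLE_XML))` keeps the two tags with counts above one and discards the tag that was applied a single time.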

This week I have begun computing descriptive statistics of the data to get an idea of what I have to work with. I intend to continue investigating the information by doing an analysis based on fuzzy set theory. This will involve writing a script in MATLAB that will compare the values of each of the tags on the musician to see how central that tag is, for the musician, to the genre in general. Scaling the values as a series of fuzzy sets will remove the weighting that would occur by just doing an analysis of the number of tags. The tags that appear the most frequently and appear to be good predictors of centrality will be used to inform the parameters that are modeled.

Motivation

After considering which genres would be interesting to explore and then model, I settled on electronic, electronica, ambient, house and IDM.

Electronic- Used 338,561 times by 44,618 people
Ambient- Used 116,422 times by 22,782 people
IDM- Used 41,921 times by 6,679 people
Electronica- Used 164,348 times by 25,479 people
House- Used 53,008 times by 9,899 people

I chose these tags because of the ways in which they overlap and complement each other. Electronic is the umbrella genre that covers all of the others. The other genres could be seen as subgenres of electronic music, but because electronic music has become so popular, a variety of ill-defined subgenres exist. Unlike many genres, such as rock, punk and punk-rock, these electronic genres are not considered derivatives of one another. IDM is thought of as distinct from house or ambient while still being a subgenre of electronic. For that reason I felt that these would be interesting genres to model, in order to see if the distinctly different concepts that individuals have of the genres can be made visually apparent.

Introduction

I decided to create this blog so that I could have an easy way to display my progress on the independent research project that I am doing for Professor Hollan in the Human Computer Interaction lab at UCSD. In this project I set out to investigate how our cultural concept of music and genres is represented by the tags applied to music on the social/music networking site lastFM.

By exploring how people tag music, I intend to develop a better understanding of what motivates tagging and how we can use this rich source of user-generated content to visually represent complicated, ubiquitous ideas like musical genre.