Wednesday, May 23, 2007

K-means clustering


After a great deal of research and meetings with Professors Belew and de Sa, I finally decided to use k-means to cluster the tags. Initially I wanted to use a fuzzy clustering algorithm, but Belew steered me away from that and suggested going with a simpler method.

During my discussion with Belew, he also pointed out that it might be more meaningful to use the tags from the songs, not from the artists, to learn how people conceptualize genres, since it is the song, not the artist, that is the unit of experience. While I agree that he is probably correct, I think I have spent too long going down this path to switch back now.

Despite this, I don't think it poses a major problem for the project, because the tags applied to an artist seem to point metonymically to the music that the artist produces. So, in a sense, the artist tags are for the music. However, because the tags for the songs are going to be more specific, I intend to use the tags applied to the songs if I can continue this research.

My meeting with Professor de Sa was very informative. She suggested using the 'cityblock' distance parameter for clustering the data, which makes a lot of sense because the data is non-continuous. She also suggested adding clusters until I found an elbow in the amount of error.
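To make the two suggestions concrete, here is a minimal sketch in Python with NumPy. It is not the code I actually ran. It assumes that city-block clustering means assigning each row to the nearest centroid by L1 distance and updating each centroid to the per-feature median of its members. The function names (cityblock_kmeans, summed_error_by_k) and the tiny example array are made up for illustration.

```python
import numpy as np

def cityblock_kmeans(X, k, n_iter=100, seed=0):
    """Lloyd-style clustering under city-block (L1) distance.

    Assumption: centroids are updated with the per-feature median,
    which minimizes the summed L1 distance within a cluster.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Start from k distinct rows chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # L1 distance from every row to every centroid: shape (n, k).
        dists = np.abs(X[:, None, :] - centroids[None, :, :]).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its old centroid
                centroids[j] = np.median(members, axis=0)
    # Recompute assignments and the summed error for the final centroids.
    dists = np.abs(X[:, None, :] - centroids[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    error = dists[np.arange(len(X)), labels].sum()
    return labels, centroids, error

def summed_error_by_k(X, ks, seed=0):
    """Summed L1 error for each candidate number of clusters."""
    return {k: cityblock_kmeans(X, k, seed=seed)[2] for k in ks}

# Made-up toy data: two obvious groups of two rows each.
toy = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])
for k, err in summed_error_by_k(toy, range(1, 5)).items():
    print(k, err)
```

Plotting the summed error against the number of clusters and looking for the point where the curve flattens is the elbow check. Because the starting centroids are random, a real run would probably want several restarts per value of k, keeping the lowest error.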

I ran the algorithm on the IDM data set and found nice breaking points in the summed error at 7 and 11 clusters. I think that 11 clusters is too many for an array with 50 parameters, so I am inclined to use 7. I also think that 7 will be a good number when I use the clusters to organize the features in the models.
