Wednesday, May 23, 2007

Kmeans clustering


After a great deal of research and meetings with Professors Belew and de Sa, I finally decided to use k-means to cluster the tags. Initially I wanted to use a fuzzy clustering algorithm, but Belew deterred me from that and suggested going with a simpler method.

During my discussion with Belew he also pointed out that it might be more meaningful to use the tags from the songs, not from the artists, to learn how people conceptualize genres, since it is the song, not the artist, that is the unit of experience. While I agree that he is probably correct, I think I have spent too long going down this path to switch now.

Despite this, I don't think it poses a major problem for the project, because the tags applied to an artist seem to point metonymically to the music that the artist produces. So, in a sense, the artist tags are for the music. However, because the song tags are going to be more specific, I intend to use them if I can continue this research.

My meeting with Professor de Sa was very informative. She suggested using a 'cityblock' distance parameter for clustering the data, which makes a lot of sense because the data is noncontinuous. She also suggested adding clusters until I found an elbow in the amount of error.

I ran the algorithm on the IDM data set and found nice breaking points in the summed error at 7 and 11 clusters. I think that 11 clusters is too many for an array with 50 parameters, so I am inclined to use 7. I also think that 7 will be a good number when I use the clusters to organize the features in the models.
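
The elbow search described above can be sketched roughly as follows. This is only an illustration, not the script I ran: it assumes 'cityblock' means Manhattan (L1) distance, in which case the center that minimizes the error within a cluster is the component-wise median rather than the mean. The function names, the restart count, and the toy data are all made up for the example.

```python
import numpy as np

def kmedians(data, k, n_iter=100, seed=0):
    """Cluster rows of `data` into k groups using city-block (L1) distance."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Start from k distinct data points chosen at random.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # L1 distance from every point to every center: shape (n_points, k).
        dist = np.abs(data[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = data[labels == j]
            if len(members):  # leave an empty cluster's center where it was
                new_centers[j] = np.median(members, axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dist = np.abs(data[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dist.argmin(axis=1)
    error = dist[np.arange(len(data)), labels].sum()  # summed L1 error
    return labels, centers, error

def elbow_curve(data, k_values, n_restarts=5):
    """Best summed error over a few random restarts, for each k."""
    return {k: min(kmedians(data, k, seed=s)[2] for s in range(n_restarts))
            for k in k_values}

# Toy data (not the IDM set): three separated blobs in two dimensions.
rng = np.random.default_rng(42)
blobs = [rng.normal(loc=c, scale=0.5, size=(20, 2))
         for c in ([0, 0], [10, 10], [20, 0])]
toy = np.vstack(blobs)
errors = elbow_curve(toy, range(1, 6))
for k, err in errors.items():
    print(k, round(err, 2))
```

Reading down the printed errors, the elbow is the value of k after which adding another cluster stops reducing the summed error by much.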

Wednesday, May 9, 2007

Filtering

In order to eliminate extraneous attributes from the arrays and create a more manageable data set, I have been working on filtering the tag arrays. I took the normalized values and filtered out all of the tags that were more than 0.5 z-scores below the mean. That seemed like a good cutoff point because it excluded every tag that only one artist has and most tags that only two artists share.
So far, this has produced arrays of about 50 tags for each genre, which is about how many I was hoping to work with. Now that I have normalized arrays for all of the data, I can move on to performing a fuzzy clustering analysis.
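
A rough sketch of this kind of filter is below. It makes an assumption the post does not spell out: that the z-score is computed on each tag's total normalized weight summed over all artists, and compared across tags. The function name, the array layout (rows are artists, columns are tags), and the toy values are hypothetical.

```python
import numpy as np

def filter_tags(normalized, tag_names, cutoff=-0.5):
    """Drop tags whose total weight is more than 0.5 z-scores below the mean.

    normalized: 2-D array, rows = artists, columns = tags (assumed layout).
    """
    normalized = np.asarray(normalized, dtype=float)
    totals = normalized.sum(axis=0)  # per-tag weight summed over artists
    std = totals.std()
    if std == 0:  # every tag has the same total, so nothing to filter
        return normalized, list(tag_names)
    z = (totals - totals.mean()) / std
    keep = z >= cutoff
    kept_names = [name for name, k in zip(tag_names, keep) if k]
    return normalized[:, keep], kept_names

# Toy values: the last tag is used by only one artist, and only weakly.
toy = np.array([
    [1.0, 0.5, 0.6, 0.1],
    [0.8, 0.6, 0.5, 0.0],
    [0.9, 0.4, 0.7, 0.0],
    [0.7, 0.7, 0.4, 0.0],
])
names = ["tag_a", "tag_b", "tag_c", "tag_d"]
filtered, kept = filter_tags(toy, names)
print(kept)
```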

Monday, May 7, 2007

Normalization

In order to get an idea of how the different concepts for each tag fit together, I felt that it was necessary to normalize all of the tags. To normalize the tag data for the musicians I created a script that takes the minimum and maximum for each tag, subtracts the minimum from the maximum, and defines that value as the range. It then takes each value, subtracts the minimum from it, and divides the result by the range.
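
For illustration, the min-max scaling just described might look something like this. It is a sketch, not the original script: the function name, the array layout (rows are musicians, columns are tags), and the toy counts are assumptions. The `zero_min` flag is there for the variant that pins every tag's minimum to zero instead of using the observed minimum.

```python
import numpy as np

def min_max_normalize(counts, zero_min=False):
    """Scale each tag column to the range [0, 1].

    counts: 2-D array, rows = musicians, columns = tags (assumed layout).
    zero_min: if True, use 0 as every tag's minimum instead of the observed one.
    """
    counts = np.asarray(counts, dtype=float)
    lo = np.zeros(counts.shape[1]) if zero_min else counts.min(axis=0)
    hi = counts.max(axis=0)
    rng = hi - lo  # the range for each tag
    # A tag with zero range would divide by zero; map it to 0 instead.
    safe = np.where(rng == 0, 1.0, rng)
    return (counts - lo) / safe

# Toy counts: the second tag is used heavily by every musician.
example = np.array([[10, 50], [20, 75], [30, 100]])
scaled = min_max_normalize(example)
scaled_zero = min_max_normalize(example, zero_min=True)
print(scaled)
print(scaled_zero)
```

With the observed minimum, the musician with a count of 50 on the second tag ends up at 0, the same as if the tag had barely been applied. With the minimum pinned to zero, that value becomes 0.5.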

After normalizing the tags this way, I realized that this didn't weight the tags evenly if a tag was shared by all of the musicians. To compensate, I went back and changed the minimum value in the script to zero and re-sorted the arrays. Below are some examples of the filtered data.