Tuesday, June 4, 2013

Learning with Categorical Features

I recently found out about Kaggle, which is a really neat website that hosts machine learning contests in a similar style to TopCoder's marathon matches. You are given training data and asked to make predictions on a test set using essentially whatever tools you want. They have a live scoreboard and a forum associated with each contest so people can see how they are doing and collaborate with others. I think it is an awesome idea and an excellent way to put all of the machine learning theory that you learn about in classrooms to a practical test. The first contest I decided to take a look at was the Amazon Employee Access Challenge; the problem is to, given information about an employee such as their manager, role, department, etc and a resource in question, determine whether the employee should be granted access to that resource. For example, a software engineer needs access to the development servers while a salesperson does not.

Since there are a lot of readily available machine learning libraries and packages out there, most of the interesting things happen in the data preparation and feature selection stage. Software like libsvm and liblinear provide good implementations of algorithms so it is a matter of figuring out how to manipulate the data to be usable by them. In the case of the Amazon contest, the interesting thing is that all of the information provided about the employees is categorical; two different departments, for example, are not comparable numerically. The data provided is also already prepared in such a way that the features are all integers representing the different possible values (e.g. software engineers might have role 123), but we are given no further information about what those integers mean. In that sense, choosing the category values as the features makes no sense in the context of algorithms like support vector machines (SVM) or logistic regression. If they were given as strings, it would be even less obvious how to produce numerical features.

So how does one handle categorical features? It turns out that the most common method is to take the feature and turn it into N different binary features, one for each of the N categories. Then each instance has exactly one of the features set to 1 and the rest to 0. For example, if we had color as a feature and potential values of red, blue, and green, we would create three new binary features (is_red, is_blue, is_green) which correspond to what the original feature was set to. So red would have be (1, 0, 0), blue would be (0, 1, 0), and green would be (0, 0, 1). In this way, there is a meaningful numerical relationship between 0 and 1 that SVM and logistic regression can handle. It is important to note, however, that if the number of potential categories is large, the number of resulting features also becomes very large. For the Amazon contest, there are eight categorical features provided, but after processing them in the described manner each instance has over 16,000 features. It is also the case that each instance will only have eight of those features active, though, so they are very sparse, allowing many algorithms to still operate efficiently. Applying logistic regression to the new data with 16,000 features gives a pretty reasonable prediction score. Not bad!

No comments:

Post a Comment