In this post I present a paper that I wrote with Mathieu Dumoulin. We present a fully scalable approach to improving classification by adding confidently labelable examples from a large dataset of unlabeled examples to a small original training set.
In real-world applications of machine learning, companies have to spend a great deal of money on experts to build a correctly labeled training dataset that is big enough to learn a classifier efficiently. For this reason, we usually only have access to a small training dataset and an enormous dataset of unlabeled examples. Improving classification performance becomes even more difficult when classifiers have to deal with big data. This paper presents a fully scalable approach to improving classification by adding confidently labelable examples from a big dataset of unlabeled examples to the original training set. We used Mahout’s distributed machine learning framework to implement scalable classification algorithms, and we used Hadoop’s distributed platform to learn and classify on large-scale data. We used a dual truncated perceptron with a radial basis function kernel, trained on the positive examples of the training set, to find unlabeled examples that improve the full training dataset. With the augmented dataset, Mahout’s online logistic regression algorithm gave better results than the same algorithm trained on the initial dataset alone. This makes our solution particularly well suited to machine learning with big data when the training dataset is small.
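To make the idea in the abstract more concrete, here is a minimal single-machine sketch in plain Python. It is not the code from the paper and it does not use Mahout or Hadoop. In particular, the scorer below is a simple stand-in (the mean RBF similarity to the labeled positives) and not the dual truncated perceptron we actually used. The function names, the `hi` and `lo` thresholds, the `gamma` value, and the toy data are all assumptions made for illustration.

```python
import math

def rbf(x, y, gamma=1.0):
    """Radial basis function kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def positive_score(x, positives, gamma=1.0):
    """Mean RBF similarity of x to the labeled positive examples.
    Stand-in scorer (assumption), NOT the dual truncated perceptron."""
    return sum(rbf(x, p, gamma) for p in positives) / len(positives)

def select_confident(unlabeled, positives, gamma=1.0, hi=0.8, lo=0.05):
    """Keep only unlabeled examples whose score is clearly high or clearly low.
    The hi/lo thresholds are hypothetical values chosen for the toy data."""
    selected = []
    for x in unlabeled:
        s = positive_score(x, positives, gamma)
        if s >= hi:
            selected.append((x, 1))   # confidently positive
        elif s <= lo:
            selected.append((x, 0))   # confidently negative
        # anything in between is ambiguous and is left out
    return selected

def train_online_logreg(examples, epochs=200, lr=0.5):
    """Plain online (SGD) logistic regression; the bias is the last weight."""
    dim = len(examples[0][0])
    w = [0.0] * (dim + 1)
    for _ in range(epochs):
        for x, y in examples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            p = 1.0 / (1.0 + math.exp(-z))
            err = y - p
            for i, xi in enumerate(x):
                w[i] += lr * err * xi
            w[-1] += lr * err
    return w

def predict(w, x):
    """Return 1 if the linear score is non-negative, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
    return 1 if z >= 0 else 0

# Toy data: a small labeled set and a larger unlabeled pool.
labeled = [((1.0, 1.0), 1), ((1.2, 0.9), 1), ((-1.0, -1.0), 0)]
unlabeled = [(1.1, 1.0), (0.9, 1.1), (0.0, 0.0), (-1.2, -0.9), (-0.9, -1.1)]

positives = [x for x, y in labeled if y == 1]
added = select_confident(unlabeled, positives)
weights = train_online_logreg(labeled + added)
```

On this toy data the ambiguous point `(0.0, 0.0)` is left out of the augmented set, while the four points close to or far from the positives are added with a label. In the real setting, the scoring and the training both run as distributed jobs, and the quality of the result depends heavily on how strict the confidence thresholds are.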
If you want to see the whole article, contact me. I cannot make it available to everyone just yet, since it has to be published first!
If you want to access the code, I would love to share it, but it’s a private project. Nevertheless, just contact me and I’ll be glad to share the repository with you.