Machine Learning

SciKit Random Forest – Brian Eoff


Matrix Factorization Resources

Matrix factorization has been used heavily in recommendation systems, text mining, and spectral data analysis. This post is just a place to collect the resources on matrix factorization that I have found. Keep in mind that the code listed here might not directly fit your requirements. If you have a sparse matrix, check whether the program treats missing entries as zeros or as unknowns; this will affect your results drastically. Also make sure that you have at least one observed entry in each row and each column. Some theoretical results suggest that on the order of (n log n) entries may be required to successfully recover the unknown matrix (these bounds are quite loose at the moment), so make sure you have enough data.
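The zeros-versus-unknowns point is worth seeing concretely. Below is a minimal sketch (toy data, not from any of the listed packages) of how the same squared-error loss changes depending on whether missing ratings are scored as literal zeros or masked out as unknowns:

```python
import numpy as np

# Toy ratings matrix: 0 marks a missing entry (the user hasn't rated the item).
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0]])

pred = np.full_like(R, 3.0)  # some model's constant predictions, for illustration

# Treating missing entries as literal zeros penalizes the model for
# not predicting 0 where the rating is simply unknown:
err_as_zeros = np.sum((R - pred) ** 2)

# Treating them as unknowns: score only the observed entries.
mask = R > 0
err_as_unknowns = np.sum((R - pred)[mask] ** 2)

print(err_as_zeros, err_as_unknowns)  # prints 27.0 9.0
```

The two losses differ substantially even on this tiny example, so a program that silently assumes one convention will optimize a very different objective than one that assumes the other.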


  1. Matrix Factorization Techniques for Recommender Systems: An article by Koren, Bell and Volinsky in IEEE Computer magazine.

Algorithms and Code

  1. Netflix Update: Try this at Home: This is such a nice online resource that it has been cited by many research papers.
  2. Matrix Factorization: A Simple Tutorial and Implementation in Python: This page explains all the math behind stochastic gradient descent for matrix factorization.
  3. Timely Development: Some Analysis of the Netflix data with C++ code.
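For reference, the stochastic-gradient-descent approach explained in the Python tutorial above can be sketched in a few lines. This is a minimal illustration (my own toy parameters and data, not the tutorial's code): factor R into P and Q by repeatedly updating both factors on each observed (non-zero) entry, with a small regularization term.

```python
import numpy as np

def factorize(R, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Approximate R as P @ Q.T via SGD on the observed (non-zero) entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.standard_normal((n_users, k)) * 0.1
    Q = rng.standard_normal((n_items, k)) * 0.1
    observed = [(i, j) for i in range(n_users)
                        for j in range(n_items) if R[i, j] > 0]
    for _ in range(steps):
        for i, j in observed:
            e = R[i, j] - P[i] @ Q[j]  # error on this observed rating
            # Simultaneous regularized updates for the two factor rows.
            P[i], Q[j] = (P[i] + lr * (e * Q[j] - reg * P[i]),
                          Q[j] + lr * (e * P[i] - reg * Q[j]))
    return P, Q

R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [1.0, 0.0, 0.0, 4.0],
              [0.0, 1.0, 5.0, 4.0]])
P, Q = factorize(R)
approx = P @ Q.T  # observed entries reconstructed closely; zeros get predictions
```

Note that the loop skips zero entries entirely, i.e. it treats them as unknowns, which is the behavior to verify in any third-party code you pick up.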

Large Scale Implementation

  1. Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent: KDD 2011 paper by R. Gemulla et al.
  2. Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce: WWW 2010 paper by C. Liu et al.

Machine Learning Course

I have started attending Andrew Ng's online machine learning lectures. The lectures give very good intuition and understanding of the topic through a collection of short videos (none longer than 15 minutes). The course also includes assignments and programming exercises. If you are interested in machine learning and want to learn it on your own with very little prior knowledge of the subject, then this is the course for you.

Wikipedia’s Participation Challenge

On June 28, ICDM 2011 announced this year's ICDM contest, called the “Wikipedia's Participation Challenge”; the competition details are available at Kaggle. The aim of the project is to predict the number of edits a user will make in a five-month period.

I have downloaded the dataset. This will be my first experience with a dataset this large and with machine learning algorithms in practice. Let's see how it goes.
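Since the target is a count of future edits, one natural first attempt is a regression model such as the scikit-learn random forest mentioned at the top of these notes. The sketch below uses entirely made-up features (recent edit counts and account age are my hypothetical choices, not the challenge's actual fields) just to show the fit/predict workflow:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical per-user features (NOT the real challenge features):
# edits in the last 1, 3, and 12 months, and account age in days.
rng = np.random.default_rng(0)
X = rng.poisson(lam=(5, 15, 60, 400), size=(1000, 4)).astype(float)

# Fake target loosely tied to recent activity, for illustration only.
y = 0.8 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0.0, 1.0, 1000)

# Train on the first 800 users, predict edit counts for the remaining 200.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:800], y[:800])
preds = model.predict(X[800:])
```

Once I have explored the real data, the feature set will obviously need to come from the actual edit histories rather than synthetic draws like these.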

List of data sets

Here I will keep a list of various datasets that are available to the public for research in machine learning and related areas.

  1. Movielens dataset: This is a movie-ratings dataset.
  2. Yahoo! Music: This is a music-ratings dataset.
  3. Heritage Health Prize: The goal of the prize is to develop a predictive algorithm that can identify patients who will be admitted to the hospital within the next year, using historical claims data.

Above I have mentioned just a few datasets. Many more are available from the website Infochimps. Kaggle is a website dedicated to hosting large dataset competitions.