Matrix Factorization Resources
Matrix factorization has been used heavily in recommendation systems, text mining, and spectral data analysis. This post collects the resources on matrix factorization that I have found. Keep in mind that the code listed here might not directly fulfill your requirements. If you have a sparse matrix, check whether each program treats missing entries as zeros or as unknowns; this will affect your results drastically. Also, make sure that you have at least one observed entry in each row and each column. Some theoretical results suggest that on the order of n log n observed entries might be required to successfully recover the unknown matrix (these bounds are very loose at the moment). Make sure you have enough data.
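The row and column check mentioned above can be sketched in a few lines. This is only an illustration: it assumes the observed entries are available as (row, col, value) triples, and the function name and toy data are made up for the example.

```python
def find_empty_rows_and_cols(entries, n_rows, n_cols):
    """Return (rows, cols) index lists that have no observed entry.

    `entries` is an iterable of (row, col, value) triples for the observed
    cells only; anything not listed is treated as unknown, not as zero.
    """
    seen_rows = set()
    seen_cols = set()
    for row, col, _value in entries:
        seen_rows.add(row)
        seen_cols.add(col)
    empty_rows = [r for r in range(n_rows) if r not in seen_rows]
    empty_cols = [c for c in range(n_cols) if c not in seen_cols]
    return empty_rows, empty_cols


# Toy 3x4 matrix with five observed entries (hypothetical data).
observed = [(0, 0, 5.0), (0, 2, 3.0), (1, 0, 4.0), (1, 3, 1.0), (2, 2, 2.0)]
empty_rows, empty_cols = find_empty_rows_and_cols(observed, n_rows=3, n_cols=4)
print(empty_rows, empty_cols)  # [] [1]
```

Here column 1 has no observed entry, so a factorization could not learn anything about it from the data.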
Applications
- Matrix Factorization Techniques for Recommender Systems: An article by Koren, Bell and Volinsky in IEEE Computer magazine.
Algorithms and Code
- Netflix Update: Try this at Home: This is such a nice online resource that it has been cited by many research papers.
- Matrix Factorization: A Simple Tutorial and Implementation in Python: This page explains all the math behind stochastic gradient descent for matrix factorization.
- Timely Development: Some Analysis of the Netflix data with C++ code.
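For a feel of what the gradient-descent approach in these resources looks like, here is a minimal sketch in plain Python. It is not the code from any of the linked pages; the function names, hyperparameters (rank, lr, reg, steps) and toy ratings are assumptions chosen for illustration. Only observed ratings are used, so missing entries are treated as unknowns rather than zeros.

```python
import random


def factorize(ratings, n_users, n_items, rank=2, steps=2000,
              lr=0.01, reg=0.02, seed=0):
    """Fit user factors P and item factors Q with per-rating gradient updates.

    `ratings` holds (user, item, value) triples for observed cells only.
    """
    rng = random.Random(seed)
    P = [[rng.random() for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.random() for _ in range(rank)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            pred = sum(P[u][k] * Q[i][k] for k in range(rank))
            err = r - pred
            for k in range(rank):
                p_uk, q_ik = P[u][k], Q[i][k]
                # Gradient step on squared error with L2 regularization.
                P[u][k] += lr * (err * q_ik - reg * p_uk)
                Q[i][k] += lr * (err * p_uk - reg * q_ik)
    return P, Q


def rmse(ratings, P, Q):
    """Root mean squared error over the observed ratings."""
    total = 0.0
    for u, i, r in ratings:
        pred = sum(p * q for p, q in zip(P[u], Q[i]))
        total += (r - pred) ** 2
    return (total / len(ratings)) ** 0.5


# Hypothetical 5-user x 4-item data; unlisted cells are unknown.
ratings = [(0, 0, 5), (0, 1, 3), (0, 3, 1), (1, 0, 4), (1, 3, 1),
           (2, 0, 1), (2, 1, 1), (2, 3, 5), (3, 0, 1), (3, 3, 4),
           (4, 1, 1), (4, 2, 5), (4, 3, 4)]

P0, Q0 = factorize(ratings, 5, 4, steps=0)   # random initial factors
P, Q = factorize(ratings, 5, 4)              # trained factors
print(rmse(ratings, P0, Q0), rmse(ratings, P, Q))
```

The second printed number should be much smaller than the first. The error here is measured on the training ratings only; a real evaluation would hold out some ratings.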
Large Scale Implementation
- Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent: KDD 2011 paper by R. Gemulla et al.
- Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce: WWW 2010 paper by C. Liu et al.
NumPy arrays and MATLAB
I needed to move some NumPy and SciPy matrices generated in Python to MATLAB so that I could use the CVX package for optimization. I used the Matrix Market format to export the data from Python. On the Python side, this is all that is needed:
import scipy.io as sio
# A is the required matrix, sparse or dense
sio.mmwrite(filename, A)
# note: scipy gives the filename the extension .mtx
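Before switching to MATLAB, it can be worth reading the file back in Python to confirm the export worked. The following is a small round-trip sketch, assuming NumPy and SciPy are installed; the matrix and the file name are made up for the example.

```python
import os
import tempfile

import numpy as np
import scipy.io as sio
import scipy.sparse as sp

# Small sparse matrix used only for illustration.
A = sp.coo_matrix(np.array([[1.0, 0.0, 2.0],
                            [0.0, 0.0, 3.0]]))

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "Mat1.mtx")  # hypothetical file name

sio.mmwrite(path, A)   # write the matrix in Matrix Market format
B = sio.mmread(path)   # a sparse matrix comes back as a sparse matrix

# The round trip should preserve shape and values.
print(B.shape, np.allclose(B.toarray(), A.toarray()))
```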
To work with the Matrix Market format in MATLAB we need the files mminfo.m, mmread.m and mmwrite.m, all of which can be found on the Matrix Market website. These files must be either in MATLAB's current directory or in one of the directories on its path. Suppose the file ‘Mat1.txt.mtx’ contains the matrix that we saved from Python. To read it in MATLAB we just need the following code.
A = mmread('Mat1.txt.mtx')
The required matrix will be stored in the variable A.
Pythonic way: 1
Here is how to check if a list is sorted in Python.
all(l[i] <= l[i+1] for i in range(len(l)-1))
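As a quick illustration, the same expression can be wrapped in a small helper and tried on a few lists. The function names are made up for the example, and the zip-based variant is simply an alternative spelling of the same pairwise comparison.

```python
def is_sorted(seq):
    """Return True if seq is in non-decreasing order."""
    return all(seq[i] <= seq[i + 1] for i in range(len(seq) - 1))


def is_sorted_zip(seq):
    """Same check, comparing each element with its successor via zip."""
    return all(a <= b for a, b in zip(seq, seq[1:]))


print(is_sorted([1, 2, 2, 5]))  # True
print(is_sorted([3, 1, 2]))     # False
print(is_sorted([]))            # True: an empty list is trivially sorted
```

Both versions stop at the first out-of-order pair because all() short-circuits.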
Machine Learning Course
I have started attending the online machine learning lectures by Andrew Ng. The lectures are available at www.ml-class.org. They give very good intuition and understanding of the topic through a collection of short videos (no more than 15 minutes each). The course also includes assignments and programming exercises. If you are interested in machine learning and want to study it on your own with very little prior knowledge of the subject, then this is the course for you.
Getting Python working in Emacs
Here are the steps to get Python running with auto-complete in Emacs on Ubuntu.
- Install the following packages: pymacs, python-mode, python-rope, python-ropemacs, auto-complete-el
- Update the .emacs file as given in: http://stackoverflow.com/questions/2855378/ropemacs-usage-tutorial/2855895#2855895
Filling missing data in python timeseries
I am using the scikits.timeseries Python library for time series analysis. Here is how to fill in the missing dates and give them a default value. Version 0.91.3 has a bug in its timeseries.fill_missing_dates() method. One of the arguments it takes is fill_value, which is the default value we want to set for the missing data, but it does not work as intended: the missing data is masked instead. To fill in the required value one must then use the timeseries.filled(fill_value) method. Here is an example:
>>> import scikits.timeseries as ts
>>> datarr = ts.date_array(['2009-01-01', '2009-01-05'], freq='D')
>>> datarr
DateArray([01-Jan-2009, 05-Jan-2009],
freq='D')
>>> sr1 = ts.time_series([3,4], datarr)
>>> sr1
timeseries([3 4],
dates = [01-Jan-2009 05-Jan-2009],
freq = D)
>>> m1 = sr1.fill_missing_dates(fill_value=0)
>>> m1
timeseries([3 -- -- -- 4],
dates = [01-Jan-2009 ... 05-Jan-2009],
freq = D)
>>> m1.filled(0)
timeseries([3 0 0 0 4],
dates = [01-Jan-2009 ... 05-Jan-2009],
freq = D)
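For readers without scikits.timeseries at hand, the same idea (insert the missing dates, then give them a default value) can be sketched with only the standard library. This is not the library's implementation; the function name and the dict-based representation are assumptions for illustration, and it handles daily frequency only.

```python
from datetime import date, timedelta


def fill_daily_gaps(series, fill_value=None):
    """Return a daily series covering every date from the first to the last key.

    `series` maps datetime.date to a value; dates absent from it get `fill_value`.
    """
    if not series:
        return {}
    start, end = min(series), max(series)
    n_days = (end - start).days
    filled = {}
    for k in range(n_days + 1):
        day = start + timedelta(days=k)
        filled[day] = series.get(day, fill_value)
    return filled


# Same data as the session above: values on 1 Jan and 5 Jan 2009.
sr1 = {date(2009, 1, 1): 3, date(2009, 1, 5): 4}
m1 = fill_daily_gaps(sr1, fill_value=0)
print([m1[d] for d in sorted(m1)])  # [3, 0, 0, 0, 4]
```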