In this series of blog posts, I do not wish to get into all the theoretical details of the algorithms. I am working through the Machine Learning course offered by Coursera and want to record the implementation details. Since I'm new to both machine learning and Octave, there could be some blatant errors; I'd very much like to correct them as I continue learning.
Let's start off with linear regression. The main modules that need to be implemented in any regression are the cost function and the computation of the parameter values. The cost function for linear regression is defined as:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
where the hypothesis $h_\theta(x)$ is given by the linear model

$$h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$$
Note that the parameters of our model are the $\theta_j$ values. These are the values that need to be adjusted to minimize the cost $J(\theta)$. One way to do this is to use the batch gradient descent algorithm. In batch gradient descent, each iteration performs the update:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
The update happens simultaneously for all $j$. With each iteration of gradient descent, the parameters $\theta_j$ come closer to the optimal values that will achieve the lowest cost $J(\theta)$.
Instead of iterating, we could use a fully vectorized, closed-form solution. The normal equation is applied in this case:

$$\theta = (X^T X)^{-1} X^T y$$
One disadvantage of using the normal equation is that it is slower than the iterative method when the number of features becomes very large (say, on the order of a million), since it requires inverting an $(n+1) \times (n+1)$ matrix.
Let us compute the cost first. Say we have a matrix $X$ which contains the training examples. Let there be $m$ training examples and $n$ features. Our matrix $X$ has one example in each row, thus having $m$ rows. Also, let us prepend a 1 to each row of $X$ (a column of 1s before the first feature column). Thus the dimensions of $X$ will be $m \times (n+1)$. Let $y$ be the output vector of size $m \times 1$. Finally, our $\theta$ will be an $(n+1) \times 1$ vector denoting the parameters of our regression model.
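For concreteness, the setup could look something like the snippet below. The file name data.txt is just a placeholder for wherever the training data actually comes from; the only assumption is that the outputs sit in the last column.

```octave
data = load('data.txt');        % m x (n+1): n feature columns plus the output
X = data(:, 1:end-1);           % m x n feature matrix
y = data(:, end);               % m x 1 output vector
m = size(X, 1);

X = [ones(m, 1), X];            % prepend the column of 1s -> m x (n+1)
theta = zeros(size(X, 2), 1);   % initial parameters, (n+1) x 1
```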
Here's the Octave function to compute the cost of using the given parameters $\theta$.
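Something along these lines should work; the name computeCost and the (X, y, theta) signature simply follow the convention used in the course exercises.

```octave
function J = computeCost(X, y, theta)
  m = length(y);              % number of training examples

  predictions = X * theta;    % h_theta(x) for every example, m x 1
  sqErrors = (predictions - y) .^ 2;

  J = sum(sqErrors) / (2 * m);
end
```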
In the above code, X * theta computes the value of $h_\theta(x^{(i)})$ for every training example. We then subtract the corresponding value of $y$ for that training example. This value is then squared, summed, and finally divided by $2m$.
The value of the cost function can be used for debugging the iterations when the iterative batch gradient descent algorithm is used. A continuous reduction of the cost is a sign that gradient descent is working properly. If the cost rises continuously instead, then there is some error in the implementation – perhaps the learning rate $\alpha$ is too large?
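For example, assuming the gradient descent routine records the cost of every iteration in a vector J_history (as the sketch further below does), the trend is easy to eyeball with a plot:

```octave
% Cost should decrease steadily if alpha is chosen well
plot(1:numel(J_history), J_history, '-b', 'LineWidth', 2);
xlabel('Number of iterations');
ylabel('Cost J');
```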
Now that we have computed the cost, let us compute the parameters $\theta$.
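Here is a sketch of the gradient descent loop, again following the course's gradientDescent(X, y, theta, alpha, num_iters) signature. The cost after every iteration is stored in J_history for the debugging discussed above.

```octave
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);                      % number of training examples
  J_history = zeros(num_iters, 1);

  for iter = 1:num_iters
    % Vectorized form of the simultaneous update for all theta(j):
    % theta_j := theta_j - (alpha/m) * sum((h(x) - y) .* x_j)
    errors = X * theta - y;           % m x 1
    theta = theta - (alpha / m) * (X' * errors);

    % Record the cost so convergence can be checked later
    J_history(iter) = computeCost(X, y, theta);
  end
end
```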
Note that in our solutions we have assumed that X is a matrix with an unspecified number of columns. The above solutions can be applied no matter the number of features.
One important step when running gradient descent is feature normalization. When there is a huge variation in the values that two features can take, gradient descent could take a very long time to converge. One way to avoid this is to normalize each feature by subtracting its mean from each value and dividing by the standard deviation (or the range) of that feature. Here's an Octave function to normalize features:
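The sketch below is meant to be called on the raw feature columns, before the column of 1s is added (that column has zero variance and must not be scaled). It returns mu and sigma so that the same scaling can later be applied to any new example we want to predict on.

```octave
function [X_norm, mu, sigma] = featureNormalize(X)
  mu = mean(X);        % 1 x n row vector of per-feature means
  sigma = std(X);      % 1 x n row vector of per-feature standard deviations

  % Subtract each feature's mean and divide by its standard deviation;
  % bsxfun expands the row vectors across all m rows.
  X_norm = bsxfun(@minus, X, mu);
  X_norm = bsxfun(@rdivide, X_norm, sigma);
end
```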
In the iterative gradient descent algorithm, one has to guess a learning rate $\alpha$ and also decide upon the number of iterations. It might take multiple runs of the algorithm to settle on suitable values for these variables. Using the normal equation obviates the need to guess these values; it computes the value of $\theta$ in one shot. The code for the same is:
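A sketch of the closed-form computation; pinv is used instead of inv so that the call still works even if X'X happens to be singular (for example, because of redundant features).

```octave
function theta = normalEqn(X, y)
  % theta = (X' * X)^-1 * X' * y
  theta = pinv(X' * X) * X' * y;
end
```

Also note that feature normalization is not required when computing $\theta$ through the normal equation.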