Ridge Regression with Gradient Descent Converges to OLS estimates

I'm implementing a homespun version of Ridge Regression with gradient descent, and to my surprise it always converges to the same answers as OLS, not the closed form of Ridge Regression.

This is true regardless of what size alpha I'm using. I don't know if this is due to something wrong with how I'm setting up the problem, or something about how gradient descent works on different types of data.

The dataset I'm using is the boston housing dataset, which is very small (506 rows), and has a number of correlated variables.

BASIC INTUITION:

I'd love to be corrected if my priors are inaccurate, but this is how I frame the problem.

The cost function for OLS regression:

enter image description here

Where output is the dot product of the feature matrix and the weights for each column + intercept.

The cost function for regression with L2 regularization (ie, Ridge Regression):

enter image description here

Where alpha is the tuning parameter and omega represents the regression coefficient, squared and summed together.

The derivative of the above statement can be written like this:

enter image description here

We then take the result of this expression and multiply it by the learning rate, and adjust each weight appropriately.

To test this out on my own, I made my own python class for Ridge Regression according to the above postulates, and it looks like so:

RIDGE REGRESSION CODE:

class RidgeRegression():

    def __init__(self, alpha=1, eta =.0001, random_state=None, n_iter=10000):

        self.eta            = eta

        self.random_state   = random_state

        self.n_iter         = n_iter

        self.alpha          = alpha

        self.w_             = 



    def output(self, X):

        return X.dot(self.w_[1:]) + self.w_[0]



    def fit(self, X, y):

        rgen                   = np.random.RandomState(self.random_state)  # random number generator

        self.w_                = rgen.normal(0, .01, size= 1 + X.shape[1]) # fill weights w/ near 0 values

        self.l2_regularization = self.alpha * self.w_[1:].dot(self.w_[1:]) # alpha * sum of squared weights

        self.l2_gradient       = self.alpha * np.sum(self.w_[1:])          # alpha * sum of weights



        for i in range(self.n_iter):

            gradient               = (y - self.output(X)) + self.l2_gradient   # the gradient

            self.w_[1:]           += (X.T.dot(gradient) * (self.eta))          # update each weight w.r.t. its gradient

            self.w_[0]            += (y - self.output(X)).sum() * self.eta     # update the intercept w.r.t. the gradient

            self.l2_regularization = self.alpha * self.w_[1:].dot(self.w_[1:]) # adjust regularization term to reflect new values of weights

            self.l2_gradient       = self.alpha * np.sum(self.w_[1:])          # update derivate of regularization term to be used in gradient

This looks right to me, but I'm not sure how to interpret the fact that the results are always the same as OLS, vs what I'd get using the closed form of Ridge Regression.

EDIT:

Even if I set the value of alpha to something very large, like 1000, the results do not change.

CODE:

from sklearn.linear_model import Ridge

r = RidgeRegression(alpha=1000, n_iter=5000, eta=0.0001) # my algorithm

r.fit(X, y)

rreg = Ridge(alpha=1000)  # scikit learn's ridge algorithm

rreg.fit(X, y)

ols_coeffs = np.linalg.inv(X.T @ X) @ (X.T @ y) # ols algorithm

If we compare these, we can see the ridge coefficients in my homespun version match the OLS estimates almost exactly:

pd.DataFrame({

    'Variable'     : X.columns,

    'gd_Weight'    : r.w_[1:],     # coefficients for ridge regression w/ gd

    'cf_Weight'    : ridge_coeffs, # closed form version of ridge

    'ols_weight'   : coeffs        # regular OLS, closed form

})

Which gives the following dataframe:

enter image description here

They all have roughly the same intercept of 22.5.

edited 2 hours ago

asked 2 hours ago

Jonathan Bechtel

1234

$begingroup$
By "the weight of each variable in the dataset", do you mean the regression coefficient?
$endgroup$
– beta1_equals_beta2
2 hours ago

$begingroup$
Are you sure alpha is not zero? Your code doesn't show what it is.
$endgroup$
– Mark L. Stone
2 hours ago

$begingroup$
@beta1_equals_beta2 - yes. Changed to make it more clear. I also added some sample code to demonstrate what happens when alpha is set to 1000.
$endgroup$
– Jonathan Bechtel
2 hours ago

add a comment |

I'm implementing a homespun version of Ridge Regression with gradient descent, and to my surprise it always converges to the same answers as OLS, not the closed form of Ridge Regression.

The dataset I'm using is the boston housing dataset, which is very small (506 rows), and has a number of correlated variables.

BASIC INTUITION:

I'd love to be corrected if my priors are inaccurate, but this is how I frame the problem.

The cost function for OLS regression:

enter image description here

Where output is the dot product of the feature matrix and the weights for each column + intercept.

The cost function for regression with L2 regularization (ie, Ridge Regression):

enter image description here

Where alpha is the tuning parameter and omega represents the regression coefficient, squared and summed together.

The derivative of the above statement can be written like this:

enter image description here

We then take the result of this expression and multiply it by the learning rate, and adjust each weight appropriately.

To test this out on my own, I made my own python class for Ridge Regression according to the above postulates, and it looks like so:

RIDGE REGRESSION CODE:

class RidgeRegression():

    def __init__(self, alpha=1, eta =.0001, random_state=None, n_iter=10000):

        self.eta            = eta

        self.random_state   = random_state

        self.n_iter         = n_iter

        self.alpha          = alpha

        self.w_             = 



    def output(self, X):

        return X.dot(self.w_[1:]) + self.w_[0]



    def fit(self, X, y):

        rgen                   = np.random.RandomState(self.random_state)  # random number generator

        self.w_                = rgen.normal(0, .01, size= 1 + X.shape[1]) # fill weights w/ near 0 values

        self.l2_regularization = self.alpha * self.w_[1:].dot(self.w_[1:]) # alpha * sum of squared weights

        self.l2_gradient       = self.alpha * np.sum(self.w_[1:])          # alpha * sum of weights



        for i in range(self.n_iter):

            gradient               = (y - self.output(X)) + self.l2_gradient   # the gradient

            self.w_[1:]           += (X.T.dot(gradient) * (self.eta))          # update each weight w.r.t. its gradient

            self.w_[0]            += (y - self.output(X)).sum() * self.eta     # update the intercept w.r.t. the gradient

            self.l2_regularization = self.alpha * self.w_[1:].dot(self.w_[1:]) # adjust regularization term to reflect new values of weights

            self.l2_gradient       = self.alpha * np.sum(self.w_[1:])          # update derivate of regularization term to be used in gradient

This looks right to me, but I'm not sure how to interpret the fact that the results are always the same as OLS, vs what I'd get using the closed form of Ridge Regression.

EDIT:

Even if I set the value of alpha to something very large, like 1000, the results do not change.

CODE:

from sklearn.linear_model import Ridge

r = RidgeRegression(alpha=1000, n_iter=5000, eta=0.0001) # my algorithm

r.fit(X, y)

rreg = Ridge(alpha=1000)  # scikit learn's ridge algorithm

rreg.fit(X, y)

ols_coeffs = np.linalg.inv(X.T @ X) @ (X.T @ y) # ols algorithm

If we compare these, we can see the ridge coefficients in my homespun version match the OLS estimates almost exactly:

pd.DataFrame({

    'Variable'     : X.columns,

    'gd_Weight'    : r.w_[1:],     # coefficients for ridge regression w/ gd

    'cf_Weight'    : ridge_coeffs, # closed form version of ridge

    'ols_weight'   : coeffs        # regular OLS, closed form

})

Which gives the following dataframe:

enter image description here

They all have roughly the same intercept of 22.5.

edited 2 hours ago

asked 2 hours ago

Jonathan Bechtel

1234

$begingroup$
By "the weight of each variable in the dataset", do you mean the regression coefficient?
$endgroup$
– beta1_equals_beta2
2 hours ago

$begingroup$
Are you sure alpha is not zero? Your code doesn't show what it is.
$endgroup$
– Mark L. Stone
2 hours ago

$begingroup$
@beta1_equals_beta2 - yes. Changed to make it more clear. I also added some sample code to demonstrate what happens when alpha is set to 1000.
$endgroup$
– Jonathan Bechtel
2 hours ago

add a comment |

I'm implementing a homespun version of Ridge Regression with gradient descent, and to my surprise it always converges to the same answers as OLS, not the closed form of Ridge Regression.

The dataset I'm using is the boston housing dataset, which is very small (506 rows), and has a number of correlated variables.

BASIC INTUITION:

I'd love to be corrected if my priors are inaccurate, but this is how I frame the problem.

The cost function for OLS regression:

enter image description here

Where output is the dot product of the feature matrix and the weights for each column + intercept.

The cost function for regression with L2 regularization (ie, Ridge Regression):

enter image description here

Where alpha is the tuning parameter and omega represents the regression coefficient, squared and summed together.

The derivative of the above statement can be written like this:

enter image description here

We then take the result of this expression and multiply it by the learning rate, and adjust each weight appropriately.

To test this out on my own, I made my own python class for Ridge Regression according to the above postulates, and it looks like so:

RIDGE REGRESSION CODE:

class RidgeRegression():

    def __init__(self, alpha=1, eta =.0001, random_state=None, n_iter=10000):

        self.eta            = eta

        self.random_state   = random_state

        self.n_iter         = n_iter

        self.alpha          = alpha

        self.w_             = 



    def output(self, X):

        return X.dot(self.w_[1:]) + self.w_[0]



    def fit(self, X, y):

        rgen                   = np.random.RandomState(self.random_state)  # random number generator

        self.w_                = rgen.normal(0, .01, size= 1 + X.shape[1]) # fill weights w/ near 0 values

        self.l2_regularization = self.alpha * self.w_[1:].dot(self.w_[1:]) # alpha * sum of squared weights

        self.l2_gradient       = self.alpha * np.sum(self.w_[1:])          # alpha * sum of weights



        for i in range(self.n_iter):

            gradient               = (y - self.output(X)) + self.l2_gradient   # the gradient

            self.w_[1:]           += (X.T.dot(gradient) * (self.eta))          # update each weight w.r.t. its gradient

            self.w_[0]            += (y - self.output(X)).sum() * self.eta     # update the intercept w.r.t. the gradient

            self.l2_regularization = self.alpha * self.w_[1:].dot(self.w_[1:]) # adjust regularization term to reflect new values of weights

            self.l2_gradient       = self.alpha * np.sum(self.w_[1:])          # update derivate of regularization term to be used in gradient

This looks right to me, but I'm not sure how to interpret the fact that the results are always the same as OLS, vs what I'd get using the closed form of Ridge Regression.

EDIT:

Even if I set the value of alpha to something very large, like 1000, the results do not change.

CODE:

from sklearn.linear_model import Ridge

r = RidgeRegression(alpha=1000, n_iter=5000, eta=0.0001) # my algorithm

r.fit(X, y)

rreg = Ridge(alpha=1000)  # scikit learn's ridge algorithm

rreg.fit(X, y)

ols_coeffs = np.linalg.inv(X.T @ X) @ (X.T @ y) # ols algorithm

If we compare these, we can see the ridge coefficients in my homespun version match the OLS estimates almost exactly:

pd.DataFrame({

    'Variable'     : X.columns,

    'gd_Weight'    : r.w_[1:],     # coefficients for ridge regression w/ gd

    'cf_Weight'    : ridge_coeffs, # closed form version of ridge

    'ols_weight'   : coeffs        # regular OLS, closed form

})

Which gives the following dataframe:

enter image description here

They all have roughly the same intercept of 22.5.

edited 2 hours ago

asked 2 hours ago

Jonathan Bechtel

1234

I'm implementing a homespun version of Ridge Regression with gradient descent, and to my surprise it always converges to the same answers as OLS, not the closed form of Ridge Regression.

The dataset I'm using is the boston housing dataset, which is very small (506 rows), and has a number of correlated variables.

BASIC INTUITION:

I'd love to be corrected if my priors are inaccurate, but this is how I frame the problem.

The cost function for OLS regression:

enter image description here

Where output is the dot product of the feature matrix and the weights for each column + intercept.

The cost function for regression with L2 regularization (ie, Ridge Regression):

enter image description here

Where alpha is the tuning parameter and omega represents the regression coefficient, squared and summed together.

The derivative of the above statement can be written like this:

enter image description here

We then take the result of this expression and multiply it by the learning rate, and adjust each weight appropriately.

To test this out on my own, I made my own python class for Ridge Regression according to the above postulates, and it looks like so:

RIDGE REGRESSION CODE:

class RidgeRegression():

    def __init__(self, alpha=1, eta =.0001, random_state=None, n_iter=10000):

        self.eta            = eta

        self.random_state   = random_state

        self.n_iter         = n_iter

        self.alpha          = alpha

        self.w_             = 



    def output(self, X):

        return X.dot(self.w_[1:]) + self.w_[0]



    def fit(self, X, y):

        rgen                   = np.random.RandomState(self.random_state)  # random number generator

        self.w_                = rgen.normal(0, .01, size= 1 + X.shape[1]) # fill weights w/ near 0 values

        self.l2_regularization = self.alpha * self.w_[1:].dot(self.w_[1:]) # alpha * sum of squared weights

        self.l2_gradient       = self.alpha * np.sum(self.w_[1:])          # alpha * sum of weights



        for i in range(self.n_iter):

            gradient               = (y - self.output(X)) + self.l2_gradient   # the gradient

            self.w_[1:]           += (X.T.dot(gradient) * (self.eta))          # update each weight w.r.t. its gradient

            self.w_[0]            += (y - self.output(X)).sum() * self.eta     # update the intercept w.r.t. the gradient

            self.l2_regularization = self.alpha * self.w_[1:].dot(self.w_[1:]) # adjust regularization term to reflect new values of weights

            self.l2_gradient       = self.alpha * np.sum(self.w_[1:])          # update derivate of regularization term to be used in gradient

This looks right to me, but I'm not sure how to interpret the fact that the results are always the same as OLS, vs what I'd get using the closed form of Ridge Regression.

EDIT:

Even if I set the value of alpha to something very large, like 1000, the results do not change.

CODE:

from sklearn.linear_model import Ridge

r = RidgeRegression(alpha=1000, n_iter=5000, eta=0.0001) # my algorithm

r.fit(X, y)

rreg = Ridge(alpha=1000)  # scikit learn's ridge algorithm

rreg.fit(X, y)

ols_coeffs = np.linalg.inv(X.T @ X) @ (X.T @ y) # ols algorithm

If we compare these, we can see the ridge coefficients in my homespun version match the OLS estimates almost exactly:

pd.DataFrame({

    'Variable'     : X.columns,

    'gd_Weight'    : r.w_[1:],     # coefficients for ridge regression w/ gd

    'cf_Weight'    : ridge_coeffs, # closed form version of ridge

    'ols_weight'   : coeffs        # regular OLS, closed form

})

Which gives the following dataframe:

enter image description here

They all have roughly the same intercept of 22.5.

regression python regularization ridge-regression numpy

edited 2 hours ago

asked 2 hours ago

Jonathan Bechtel

1234

edited 2 hours ago

asked 2 hours ago

Jonathan Bechtel

1234

edited 2 hours ago

asked 2 hours ago

Jonathan Bechtel

1234

asked 2 hours ago

Jonathan Bechtel

1234

asked 2 hours ago

Jonathan Bechtel

1234

$begingroup$
By "the weight of each variable in the dataset", do you mean the regression coefficient?
$endgroup$
– beta1_equals_beta2
2 hours ago

$begingroup$
Are you sure alpha is not zero? Your code doesn't show what it is.
$endgroup$
– Mark L. Stone
2 hours ago

$begingroup$
@beta1_equals_beta2 - yes. Changed to make it more clear. I also added some sample code to demonstrate what happens when alpha is set to 1000.
$endgroup$
– Jonathan Bechtel
2 hours ago

add a comment |

$begingroup$
By "the weight of each variable in the dataset", do you mean the regression coefficient?
$endgroup$
– beta1_equals_beta2
2 hours ago

$begingroup$
Are you sure alpha is not zero? Your code doesn't show what it is.
$endgroup$
– Mark L. Stone
2 hours ago

$begingroup$
@beta1_equals_beta2 - yes. Changed to make it more clear. I also added some sample code to demonstrate what happens when alpha is set to 1000.
$endgroup$
– Jonathan Bechtel
2 hours ago

By "the weight of each variable in the dataset", do you mean the regression coefficient?

– beta1_equals_beta2
2 hours ago

Are you sure alpha is not zero? Your code doesn't show what it is.

– Mark L. Stone
2 hours ago

@beta1_equals_beta2 - yes. Changed to make it more clear. I also added some sample code to demonstrate what happens when alpha is set to 1000.

– Jonathan Bechtel
2 hours ago

add a comment |

1 Answer
1

active

oldest

votes

I'm not very well versed in Python, but if I interpret your code correctly then X.T.dot(gradient) takes the gradient variable you computed on the previous line, and computes its dot product with $X$. This doesn't seem right since 'gradient' includes the "L2-gradient" which shouldn't be multiplied by $X$. Only the residual (y-self.output(X)) should be in that product. You want to add the L2-gradient only afterwards, and then multiply the result by eta.

Also, the L2-gradient shouldn't sum over the $w$'s (remember the gradient should be vector-valued since you're taking the derivative of a scalar-valued function w.r.t a vector). Together these errors probably explain the confusing results you're getting, although I would have expected with an incorrect gradient like that the output would actually just be garbage rather than the OLS-solution, so there is probably something subtle happening that I'm not seeing.

(Note that your mathematical expressions for the gradient also aren't quite right, but there you actually omitted the pre-multiplication of the residuals with $X^T$. And you also again have a sum over $w$'s in the L2-gradient which shouldn't be there.)

Going forward I would recommend first checking your implementation of each individual part of the cost function and gradient and satisfy yourself that they are correct, before running gradient descent.

answered 55 mins ago

Ruben van Bergen

3,8141923

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "65"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f399230%2fridge-regression-with-gradient-descent-converges-to-ols-estimates%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

answered 55 mins ago

Ruben van Bergen

3,8141923

add a comment |

answered 55 mins ago

Ruben van Bergen

3,8141923

add a comment |

answered 55 mins ago

Ruben van Bergen

3,8141923

answered 55 mins ago

Ruben van Bergen

3,8141923

answered 55 mins ago

Ruben van Bergen

3,8141923

answered 55 mins ago

Ruben van Bergen

3,8141923

answered 55 mins ago

Ruben van Bergen

3,8141923

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Cross Validated!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Btukfyl