Let us say we are modelling the data set below. The left graph is a reasonable fit for a straight line, but it does not capture the trend of the data as well as the middle graph; this is called underfitting. The rightmost graph is overfitted to the data set, as the line almost connects the dots. The first graph is linear, the second is fitted to the second degree, and the last is fitted to the 5th degree (see Fig 1).
[Fig 1: the same data set fitted with a 1st-degree (underfit), 2nd-degree (good fit), and 5th-degree (overfit) polynomial]
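To see the effect numerically, here is a minimal sketch on made-up noisy quadratic data (not the data set from Fig 1) that fits polynomials of degree 1, 2, and 5 with NumPy; the training error keeps shrinking as the degree grows, even though the degree-5 curve is mostly chasing noise:

```python
# A sketch of under- vs overfitting on hypothetical noisy quadratic data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 12)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=2.0, size=x.shape)

for degree in (1, 2, 5):
    coeffs = np.polyfit(x, y, deg=degree)     # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)             # predictions on the training points
    train_error = np.mean((y_hat - y) ** 2)
    print(f"degree {degree}: training error = {train_error:.2f}")
```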
Underfitting generally occurs when using too few features, and overfitting occurs when using too many features. The problem is how to find this perfect “Goldilocks” zone, giving us the right number of features to use. The two ways you can solve this are to reduce the number of features (manually go through and select which features are more important) or to shrink the thetas attached to the features that are not so important. The latter approach is called regularization, and it is what this article is about.
Regularization can be applied to both linear and logistic regression. First, let us discuss how the cost function changes with linear regression regularization (a bit of a tongue twister 😊).
Let us say we have an equation:
h_theta(x) = theta_0 + theta_1*x + theta_2*x^2 + theta_3*x^3 + theta_4*x^4
and we want to make this equation more quadratic. To do this we want to make the last two terms in the equation above (theta_3*x^3 and theta_4*x^4) as small as possible. So, to force theta_3 and theta_4 to be very small, we attach a large penalty (say, a factor of 1000) to each of them, and our cost function becomes:
J(theta) = (1/2m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i))^2 + 1000*theta_3^2 + 1000*theta_4^2
Now, when theta_3 and theta_4 are chosen during minimization, they must end up very small, because the whole goal is to make the cost function as small as possible. Our problem is solved, hurray!
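In code, that targeted penalty might look like the sketch below; it assumes a design matrix X whose columns are 1, x, x^2, x^3, x^4, and the 1000s are just arbitrarily large constants:

```python
# A sketch of the cost above with large penalties on theta_3 and theta_4.
# Assumes X has columns [1, x, x^2, x^3, x^4] and y is the target vector.
import numpy as np

def cost_with_targeted_penalty(theta, X, y):
    m = len(y)
    residuals = X @ theta - y
    base_cost = (residuals @ residuals) / (2 * m)   # usual squared-error cost
    # The large multipliers force the optimizer to keep theta_3 and theta_4 tiny.
    penalty = 1000 * theta[3] ** 2 + 1000 * theta[4] ** 2
    return base_cost + penalty
```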
Wait, we still need a way to decide whether we want to make a function more quadratic in the first place. (For example, how would the model know which thetas to shrink when it does not know what kind of curve would fit the data best?) The answer may not feel intuitive at first, but the way to regularize the function is to make all of the thetas a bit smaller. Therefore, our new cost function becomes:
J(theta) = (1/2m) * [ sum_{i=1..m} (h_theta(x^(i)) - y^(i))^2 + lambda * sum_{j=1..n} theta_j^2 ]
Where lambda is our regularization parameter and controls how heavily large parameter values are penalized. The intuition is that smaller thetas produce a smoother, simpler hypothesis, which is less likely to overfit; it is still not completely obvious to me why shrinking all of the thetas works so well, but that’s the way it is.
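As a sketch, the regularized cost could be computed like this; X is assumed to carry a leading column of ones for theta_0, which by convention is not penalized:

```python
# A sketch of the regularized linear regression cost.
# Assumes X has a leading column of ones and theta_0 is not penalized.
import numpy as np

def regularized_cost(theta, X, y, lam):
    m = len(y)
    residuals = X @ theta - y
    base_cost = (residuals @ residuals) / (2 * m)
    reg_term = (lam / (2 * m)) * np.sum(theta[1:] ** 2)  # skip theta_0
    return base_cost + reg_term
```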
Now, connecting all the pieces together, the regularized gradient descent update for linear regression is (note that theta_0 is conventionally not regularized, so it keeps its own update):
Repeat until convergence {
theta_0 := theta_0 - alpha * (1/m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i)) * x_0^(i)
theta_j := theta_j - alpha * [ (1/m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i)) * x_j^(i) + (lambda/m) * theta_j ]   (for j = 1, ..., n)
}
And with some rearranging of the theta_j update, you get the final form:
theta_j := theta_j * (1 - alpha*lambda/m) - alpha * (1/m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i)) * x_j^(i)
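Putting the rearranged update into code, a regularized gradient descent loop might look like this sketch (the learning rate alpha, the lambda value, and the column-of-ones convention for X are all assumptions here):

```python
# A sketch of regularized gradient descent for linear regression using the
# rearranged update above. Assumes X has a leading column of ones for theta_0.
import numpy as np

def regularized_gradient_descent(X, y, alpha=0.01, lam=1.0, iterations=1000):
    m, n = X.shape
    theta = np.zeros(n)
    shrink = 1 - alpha * lam / m                   # the (1 - alpha*lambda/m) factor
    for _ in range(iterations):
        gradient = X.T @ (X @ theta - y) / m       # unregularized gradient
        theta_0 = theta[0] - alpha * gradient[0]   # theta_0 is not shrunk
        theta_rest = theta[1:] * shrink - alpha * gradient[1:]
        theta = np.concatenate(([theta_0], theta_rest))
    return theta
```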
Now for logistic regression:
[Figure: training examples of two classes with an overly wiggly (blue) decision boundary that overfits them]
You do not want the decision boundary to be something like the blue line, which overfits the data. Using the same idea as before, we just add the regularization term to the logistic regression cost function, which becomes:
J(theta) = -(1/m) * sum_{i=1..m} [ y^(i)*log(h_theta(x^(i))) + (1 - y^(i))*log(1 - h_theta(x^(i))) ] + (lambda/2m) * sum_{j=1..n} theta_j^2
Our gradient descent formula will look exactly the same as the regularized linear regression one; however, the two are not actually identical, because the hypothesis h_theta(x) is now the sigmoid function rather than a linear function.
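As a final sketch, here is what the regularized logistic regression cost and gradient could look like; the only substantive change from the linear case is that the hypothesis runs through the sigmoid (the function names are my own):

```python
# A sketch of the regularized logistic regression cost and gradient.
# Assumes X has a leading column of ones and theta_0 is not regularized.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_cost_and_gradient(theta, X, y, lam):
    m = len(y)
    h = sigmoid(X @ theta)                          # hypothesis is now the sigmoid
    cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    cost += (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    gradient = X.T @ (h - y) / m
    gradient[1:] += (lam / m) * theta[1:]           # skip theta_0
    return cost, gradient
```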