Basically there isn’t a lot to talk about in this article: gradient descent is almost unchanged from linear regression, and the multiclass problem is intuitive to grasp and implement.
The gradient descent update rule is exactly the same as in linear regression, but the hypothesis h(x) has changed!
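With the sigmoid plugged in, the hypothesis and the per-parameter update rule look like this (standard notation: α is the learning rate, m the number of training examples):

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

The update rule is syntactically identical to linear regression’s; only the definition of h(x) differs.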

And a vectorized approach:
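In vectorized form (X is the design matrix with a leading column of ones, y the vector of labels, and the sigmoid g applied element-wise):

$$\theta := \theta - \frac{\alpha}{m} X^T \left( g(X\theta) - y \right)$$

Here’s a minimal NumPy sketch of that update; the function and variable names are my own choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Vectorized gradient descent for logistic regression.

    X: (m, n) design matrix, first column all ones for the intercept
    y: (m,) vector of 0/1 labels
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)                 # hypotheses for all m examples at once
        theta -= (alpha / m) * X.T @ (h - y)   # simultaneous update of every theta_j
    return theta
```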

Instead of doing gradient descent, there is a function called fminunc (in Octave/MATLAB) that can be used given the cost function and the derivative of the cost function. I am not going to explain it because it is out of scope for this article.
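For readers working in Python instead of Octave, the closest analogue is probably scipy.optimize.minimize, which likewise takes the cost and its gradient. A rough sketch under that assumption (the toy data here is made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta, X, y):
    """Logistic regression cost and gradient, returned together."""
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    cost = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return cost, grad

# Toy data: bias column plus one feature, deliberately not separable
X = np.array([[1.0, 0.5], [1.0, 2.5], [1.0, 1.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# jac=True tells minimize that cost_and_grad returns (cost, gradient)
res = minimize(cost_and_grad, np.zeros(X.shape[1]), args=(X, y), jac=True, method="BFGS")
print(res.x)  # fitted parameters
```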
What if we want to perform logistic regression for more than two classes? For example, suppose you wanted to tag an email as junk, school, work, etc. Let’s use the example of an email classifier trying to differentiate between emails that are for work (class 1), school (class 2), or personal (class 3). Here a “one versus all” approach is taken: instead of creating one hypothesis, we create three separate hypotheses, each estimating the probability that y = i given the values x for that specific class i. Hopefully it makes more sense when you look at the formula below:
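In standard notation, each class i gets its own hypothesis, estimating the probability that an example belongs to that class:

$$h_\theta^{(i)}(x) = P(y = i \mid x; \theta), \quad i \in \{1, 2, 3\}$$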

Here’s a more decomposed version of the multiclass formula:
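Spelled out: train a standard binary logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$, then to classify a new input $x$, run all of them and pick the class whose hypothesis returns the highest probability:

$$\hat{y} = \arg\max_{i} \; h_\theta^{(i)}(x)$$

And a minimal NumPy sketch of that train-then-argmax loop, reusing the gradient descent update from earlier; the names and 0-indexed classes are my own choices, not from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, num_classes, alpha=0.1, iters=2000):
    """Fit one binary logistic regression per class (classes 0..num_classes-1).

    Classifier i is trained on relabeled targets: 1 where y == i, else 0.
    Returns a (num_classes, n) matrix, one row of parameters per class.
    """
    m, n = X.shape
    all_theta = np.zeros((num_classes, n))
    for i in range(num_classes):
        y_i = (y == i).astype(float)              # one-vs-all relabeling
        theta = np.zeros(n)
        for _ in range(iters):                    # same vectorized update as above
            h = sigmoid(X @ theta)
            theta -= (alpha / m) * X.T @ (h - y_i)
        all_theta[i] = theta
    return all_theta

def predict_one_vs_all(all_theta, X):
    """Pick, for each row of X, the class with the highest probability."""
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)
```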
