Now the problem arises of how to create a cost function for our logistic regression model. You may think back to our linear regression article and wonder why we cannot continue to use the squared error as the cost. The reason is that our hypothesis is now h(x) = 1/(1 + e^(-theta transpose * x)), so the graph of the cost function using the squared difference is non-convex (no clean parabolic shape with a single local minimum, like the one below).
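You can verify the non-convexity numerically. Below is a minimal sketch (my own illustration, not from the original article) using a single made-up training example with x = 1 and y = 1: the discrete second derivative of the squared-error cost changes sign as theta varies, which a convex function can never do.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_error_cost(theta, x=1.0, y=1.0):
    # Squared-error cost for a single training example (x, y)
    return (sigmoid(theta * x) - y) ** 2

def second_difference(f, theta, h=1e-3):
    # Discrete approximation of the second derivative f''(theta)
    return (f(theta + h) - 2 * f(theta) + f(theta - h)) / h ** 2

# The curvature changes sign, so the cost is not convex in theta:
print(second_difference(squared_error_cost, -3.0))  # negative (concave here)
print(second_difference(squared_error_cost, 2.0))   # positive (convex here)
```

With many training examples the cost is a sum of such terms, and gradient descent on a non-convex surface can get stuck in a poor local minimum.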

To turn the exponential term to our advantage, our cost function should be a piecewise function built from logarithms.
First, let's take our old cost function and break it down like so:
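The image that belongs here is missing; the breakdown is almost certainly the standard one, where the overall cost J(theta) is the average of a per-example cost term:

```latex
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right)
```

This leaves us free to choose what Cost(h(x), y) should be for a single training example.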

Now, instead of our Cost(h(x), y) being that squared difference (like in linear regression), look at the example below:
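The missing image presumably showed the standard piecewise, log-based cost for logistic regression:

```latex
\mathrm{Cost}(h_\theta(x), y) =
\begin{cases}
  -\log\left(h_\theta(x)\right)     & \text{if } y = 1 \\
  -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0
\end{cases}
```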

The reasoning behind choosing these functions is very clever. If you envision the graph of the first equation, -log(x), you get a graph similar to:

(Assume that the y-axis is the cost.) This basically says that if the training example's output is 1 (the correct output) and the hypothesis of the learning model outputs 0, the cost goes up to infinity, penalizing the model greatly. However, if the hypothesis is 1, the cost is 0. The farther the hypothesis gets from 1 (the correct answer), the more the learning model gets punished. An extremely smart solution.
In a similar vein, if the training example's correct answer is 0, envision the graph of the second equation above. It looks like:

Now, the closer the hypothesis gets to 1 (the wrong answer), the more greatly the model is punished, with the cost going to infinity. Remember, the y-axis represents the cost. Writing the cost function in this piecewise manner guarantees that our cost function J(theta) is convex.
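Both branches can be sketched in a few lines of Python (my own illustration): for y = 1 the penalty explodes as the hypothesis approaches 0, and for y = 0 it explodes as the hypothesis approaches 1.

```python
import math

def cost(h, y):
    # Piecewise log cost: -log(h) when y = 1, -log(1 - h) when y = 0
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# y = 1: cost shrinks toward 0 as the hypothesis approaches 1,
# and blows up as it approaches 0
print(cost(0.99, 1))   # small penalty
print(cost(0.01, 1))   # large penalty

# y = 0: the mirror image
print(cost(0.01, 0))   # small penalty
print(cost(0.99, 0))   # large penalty
```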
Let us simplify our cost function: instead of the piecewise function, we can take advantage of the fact that y is always 0 or 1 to cancel out one term in a consolidated equation. Let us look at the image below for the equation.
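The equation in the missing image is, in its standard form, the piecewise definition collapsed into a single line; when y = 1 the second term vanishes, and when y = 0 the first term vanishes:

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]
```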

If you want to code it up in MATLAB or Octave, here is a vectorized approach:
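The original snippet is missing, so as a stand-in here is the same vectorized idea in Python with NumPy (the MATLAB/Octave version would be a one-liner along the same lines, using `*` and `log` on the hypothesis vector). The data set below is made up purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_function(theta, X, y):
    # Vectorized logistic regression cost:
    # J(theta) = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h))
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

# Tiny made-up data set: a column of ones (intercept) plus one feature
X = np.array([[1.0,  0.5],
              [1.0, -1.5],
              [1.0,  2.0]])
y = np.array([1.0, 0.0, 1.0])
theta = np.zeros(2)
print(cost_function(theta, X, y))  # log(2) ~= 0.693 when theta is all zeros
```

With theta at all zeros the hypothesis is 0.5 for every example, so the cost is -log(0.5) = log(2), a handy sanity check for any implementation.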
