Standard Linear regression
Goal: Given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. Function h
Training example , where
x(i) = input variable (our data)
y(i) = target variable (what we’re trying to predict)
Training set is a list of m training examples {();i = 1,…,m}
X = Y = R, where X denotes the space of input values, and Y the space of output values
 Building the hypothesis function h(x)
To perform supervised learning, we must decide how we’re going to represent functions/hypotheses h in a computer. As an initial choice, let’s say we decide to approximate y as a linear function of x:
We let x0 = 1 (this is the intercept term), further simplifying to
 Selecting the best parameters θ to predict y
We could have many parameters in our training set (ie. number of rooms, lot size, taxes). So which parameters do we include in our hypothesis function in order to best predict our outcome (y)? Ans: Pick values of θ where the predicted value h(x) is close to the actual value y.
Cost Function: The following formula defines a function J of θ with an objective of seeking the minimal difference between h(x) and y.
This least mean squares function is a Parametric Learning algorithm, where even as our training set m approaches ∞ the size of h0(x) is constant. Thus memory sizes are constant.
How do we optimize J(θ)? Ans: We want to choose θ so as to minimize J(θ)
There are several ways to do this:
Gradient Descent
Normal Equations
 Gradient descent algorithm: A search algorithm that starts with some “initial guess” for θ, and that repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ).
Where α is called the learning rate and is typically a small number
1. For a single training example: WidrowHoff learning rule or LMS update rule.
simplifies to , this results in the update rule
Characteristics: The magnitude of the update is proportional to the error term , such that if hθ is close to y, the parameter changes by a small increment
Courtesy of Stanford SCPD – CS229 Machine Learning
Can be susceptible to local optima
For linear regression, the cost function J(θ) does not have a local optimum other than a local optimum.
2. For multiple training examples:
M1: Batch Gradient Descent
The quantity in the summation in the update rule above is just ∂J(θ)/∂θj (for the original definition of J). So, this is simply gradient descent on the original cost function J.
Evaluation: Batch gradient descent has to scan through the entire training set before taking a single step
M2: Stochastic Gradient Descent (also incremental gradient descent)
Evaluation: We repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only.
This method converges to the minimum more rapidly, but has the potential of overshooting the minimum and then oscillating around it. By slowly letting the learning rate α decrease to zero as the algorithm runs, it is also possible to ensure that the parameters will converge to the global minimum rather then merely oscillate around the minimum.
 Normal Equation: The goal is to minimize J by explicitly taking its derivatives with respect to the θj’s, and setting them to zero (gradient == 0 or minima).
Matrix derivatives
For a function mapping from m x n matrices to real numbers, the gradient of f w.r.t. A is ∇Af(A)
Trace is the sum of the diagonal entries for a square matrix
Properties of Traces
trAB = trBA
trABC = trCAB = trBCA
(AB)^{T}= B^{T}A^{T}
Additional rules
For (1), we could treat trace as inner product of two long vectors [A(1,1),…,A(1,n),…,A(n,1),…,A(n,n)] (by row) and [B(1,1),…,B(n,1),…,B(1,n),…,B(n,n)] (by column)
Thus, the trace is , which is linear function of all elements of A.
Since only one term in the RHS numerator contains Apq , after applying the partial derivative, we conclude (∇Atr(AB))pq = Bqp ⟹ ∇Atr(AB) = B⊤
Least Squares method (Applying Matrix rules)
We use a compressed vector notation forusing a design matrix X to depict
input values from our training example
*Proof that =
 Define the design matrix X which contains the training example’s input values in rows
 Let ⃗y be the mdimensional vector containing all the target values from the training set
 = =
 Since aka the Hypothesis is equivalent to ,
from the linear equation
*Note the transpose is simply a way of expressing x in rows
 =
 The inner product of a vector by itself is equivalent to the sum of the squares of its elements …
=
 With a simple transformation,
=

Combining Equations (2) and (3),
Given =
We set this derivative == 0 and solve for θ
NORMAL EQUATION
Pseudo Inverse is