### Standard Linear regression

**Goal**: Given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. Function h

**Training example** , where

x(i) = input variable (our data)

y(i) = target variable (what we’re trying to predict)

**Training set** is a list of m training examples {();i = 1,…,m}

X = Y = R, where X denotes the space of input values, and Y the space of output values

**Building the hypothesis function h(x)**

To perform supervised learning, we must decide how we’re going to represent functions/hypotheses h in a computer. As an initial choice, let’s say we decide to approximate y as a linear function of x:

We let x0 = 1 (this is the intercept term), further simplifying to

**Selecting the best parameters θ to predict y**

We could have many parameters in our training set (ie. number of rooms, lot size, taxes). So which parameters do we include in our hypothesis function in order to best predict our outcome (y)? Ans: Pick values of θ where the predicted value * h(x)* is close to the actual value

*.*

**y****Cost Function**: The following formula defines a function J of θ with an objective of seeking the minimal difference between h(x) and y.

*This least mean squares function is a Parametric Learning algorithm, where even as our training set m approaches ∞ the size of h0(x) is constant. Thus memory sizes are constant.*

*How do we optimize J(θ)? Ans: We want to choose θ so as to minimize J(θ)*

There are several ways to do this:

Gradient Descent

Normal Equations

**Gradient descent algorithm**: A search algorithm that starts with some “initial guess” for θ, and that repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ).

Where α is called the learning rate and is typically a small number

1. For a single training example: **Widrow-Hoff** learning rule or **LMS update rule**.

simplifies to , this results in the update rule

**Characteristics**: The magnitude of the update is proportional to the error term , such that if hθ is close to y, the parameter changes by a small increment

*Courtesy of Stanford SCPD – CS229 Machine Learning
*Can be susceptible to local optima

For linear regression, the cost function J(θ) does not have a local optimum other than a local optimum.

2. For multiple training examples:

**M1: Batch Gradient Descent **

The quantity in the summation in the update rule above is just ∂J(θ)/∂θj (for the original definition of J). So, this is simply gradient descent on the original cost function J.

Evaluation: Batch gradient descent has to scan through the entire training set before taking a single step

**M2: Stochastic Gradient Descent (also incremental gradient descent)**

Evaluation: We repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only.

This method converges to the minimum more rapidly, but has the potential of overshooting the minimum and then oscillating around it. By slowly letting the learning rate α decrease to zero as the algorithm runs, it is also possible to ensure that the parameters will converge to the global minimum rather then merely oscillate around the minimum.

**Normal Equation**: The goal is to minimize J by explicitly taking its derivatives with respect to the θj’s, and setting them to zero (gradient == 0 or minima).

**Matrix derivatives**

For a function mapping from m x n matrices to real numbers, the *gradient* of *f* w.r.t. A is **∇Af(A)**

**Trace** is the sum of the diagonal entries for a square matrix

Properties of Traces

trAB = trBA

trABC = trCAB = trBCA

(AB)^{T}= B^{T}A^{T}

Additional rules

For (1), we could treat trace as inner product of two long vectors [A(1,1),…,A(1,n),…,A(n,1),…,A(n,n)] (by row) and [B(1,1),…,B(n,1),…,B(1,n),…,B(n,n)] (by column)

Thus, the trace is , which is linear function of all elements of A.

Since only one term in the RHS numerator contains Apq , after applying the partial derivative, we conclude (∇Atr(AB))pq = Bqp ⟹ ∇Atr(AB) = B⊤

**Least Squares method (Applying Matrix rules)
**We use a compressed vector notation forusing a design matrix

**X**to depict

input values from our training example

***Proof that = **

- Define the design matrix X which contains the training example’s input values in rows

- Let ⃗y be the m-dimensional vector containing all the target values from the training set
- = =
- Since aka the Hypothesis is equivalent to ,

from the linear equation

*Note the transpose is simply a way of expressing x in rows - =
- The inner product of a vector by itself is equivalent to the sum of the squares of its elements …

= - With a simple transformation,

=

Combining Equations (2) and (3),

Given =

We set this derivative == 0 and solve for θ

**NORMAL EQUATION **

Pseudo Inverse is