Revision for “2. Locally Weighted Linear Regression” created on October 15, 2015 @ 12:13:47
Title  2. Locally Weighted Linear Regression 

Content  <h4><strong>Locally Weighted Linear Regression</strong></h4>
<em>Locally weighted linear regression is a <strong>non-parametric learning algorithm</strong>: the amount of data we must keep to represent the hypothesis hθ(x) grows linearly with the size of the training set <strong>m</strong>, so memory requirements increase with the training set.</em>
A new algorithm that makes it easier to fit curved lines:
<p id="GjNObiy"><img class="alignnone wpimage1490 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561da7d8cde48.png" alt="" width="339" height="224" /></p>
<ol>
<li>Look at the data at a small point that you're interested in</li>
<li>Build a local hypothesis just for that section and try to predict that area<img class="alignnone sizefull wpimage1491 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561da80428dcd.png" alt="" /></li>
<li>Given location X where we want to make a prediction,
<img class="alignnone sizefull wpimage1492 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561da92cb9a5d.png" alt="" />, where
<img class="alignnone sizefull wpimage1493 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561da9512a40c.png" alt="" /></li>
<li>The weights depend on the particular point x at which we're trying to evaluate h(x):
if |x(i) − x| is small, then w(i) is close to 1
if |x(i) − x| is large, then w(i) is small (close to 0)</li>
<li>So how do we determine the appropriate values of θ?
We fit θ by giving the highest weight to the training examples that are closest to the query point x</li>
<li><strong>Bandwidth Parameter</strong>: The weight function is chosen because we want a bell-shaped curve that peaks close to x and then falls off quickly
<img class="alignnone wpimage1495 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561dab16796dc.png" alt="" width="301" height="166" />
<p id="OvuubaZ"><img class="alignnone sizefull wpimage1498 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561dabe8ca075.png" alt="" /> The bandwidth parameter τ controls the width of the curve (fat vs. thin)</p>
</li>
</ol>
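The steps above can be sketched in Python. The weight function here is the Gaussian-shaped kernel w(i) = exp(−(x(i) − x)² / (2τ²)) described in the list; the function and variable names are illustrative, not from the original post.

```python
import numpy as np

def lwr_weights(X_train, x_query, tau):
    """Weight each training example by distance to the query point x.

    w_i = exp(-||x_i - x||^2 / (2 * tau^2)): close to 1 for nearby
    examples, close to 0 for distant ones. tau is the bandwidth.
    """
    diffs = X_train - x_query
    sq_dist = np.sum(diffs ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * tau ** 2))

X_train = np.array([[0.0], [1.0], [5.0]])
w = lwr_weights(X_train, np.array([0.0]), tau=1.0)
# w[0] is 1 (the query point itself); w[2] is nearly 0.
```

A smaller τ makes the curve thinner, so fewer neighbors influence the local fit.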
Regular Normal Equation: <img class="latex" title="\theta = (X^{T}X)^{-1}X^{T}Y " alt="\theta = (X^{T}X)^{-1}X^{T}Y " />
Weighted Normal Equation: <img class="latex" title="\theta = (X^{T}WX)^{-1}X^{T}WY " alt="\theta = (X^{T}WX)^{-1}X^{T}WY " />, where W is the diagonal matrix of the weights w(i)
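As an illustrative sketch (not the post's original code), both normal equations can be solved directly with NumPy; `np.linalg.solve` is used instead of forming the matrix inverse explicitly:

```python
import numpy as np

def normal_equation(X, y):
    # theta = (X^T X)^{-1} X^T y, solved without an explicit inverse
    return np.linalg.solve(X.T @ X, X.T @ y)

def weighted_normal_equation(X, y, w):
    # theta = (X^T W X)^{-1} X^T W y with W = diag(w)
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Toy data: y = 2x exactly, with an intercept column of ones.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 2.0, 4.0])
theta = normal_equation(X, y)
theta_w = weighted_normal_equation(X, y, np.array([1.0, 1.0, 1.0]))
# With uniform weights the two estimates coincide.
```

In locally weighted regression the weighted version is solved afresh for every query point, which is what makes the method non-parametric.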
<strong>Probabilistic interpretation of data</strong>
<img class="alignnone sizefull wpimage1509 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561db7c4415fb.png" alt="" />
Where <img class="alignnone sizefull wpimage1510 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561db7d28ec31.png" alt="" /> is an error term which captures unmodeled effects or random noise
The density of <img class="alignnone sizefull wpimage1510 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561db7d28ec31.png" alt="" /> is given by
<p id="FYbyXdL"><img class="alignnone sizefull wpimage1511 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561db8363fd14.png" alt="" /></p>
This implies that
<p id="RPYkSwL"><img class="alignnone sizefull wpimage1512 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561db848c1694.png" alt="" /> where</p>
the distribution of y(i) <img class="alignnone sizefull wpimage1514 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561db8a96d9ac.png" alt="" />
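Written out in LaTeX (reconstructed in the standard notation the images appear to follow, rather than transcribed), the model above is:

```latex
y^{(i)} = \theta^{T}x^{(i)} + \epsilon^{(i)},
\qquad
\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^{2}),
\qquad
p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{(\epsilon^{(i)})^{2}}{2\sigma^{2}}\right)
```

which implies that y(i), given x(i) and parameterized by θ, is distributed as N(θᵀx(i), σ²).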
<h4><strong>Likelihood function</strong></h4>
Given the design matrix X, which contains all the <img class="alignnone sizefull wpimage1517 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561db8e05e669.png" alt="" />
<img class="alignnone sizefull wpimage1515 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561db8d486e93.png" alt="" />
<p id="qbBgpds"><img class="alignnone sizefull wpimage1518 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561db904736e3.png" alt="" /></p>
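For reference, the likelihood the image shows can be written (again reconstructed in the standard notation, assuming the usual Gaussian-noise derivation):

```latex
L(\theta) = \prod_{i=1}^{m} p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)
          = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
            \exp\!\left(-\frac{\left(y^{(i)} - \theta^{T}x^{(i)}\right)^{2}}{2\sigma^{2}}\right)
```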
<h4><strong>Maximum likelihood estimation</strong></h4>
We should choose θ so as to make the observed data as probable as possible.
We can maximize the log likelihood l(θ):
<p id="kIxcTGi"><img class="alignnone sizefull wpimage1519 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561db9cd0639f.png" alt="" /></p>
Maximizing <img class="alignnone sizefull wpimage1592 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561f061a5e715.png" alt="" /> is the same as minimizing <img class="alignnone sizefull wpimage1593 " src="http://theroadchimp.com/wpcontent/uploads/sites/3/2015/10/img_561f0ae9478fb.png" alt="" /> , which is the cost function J(θ).
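A quick numeric sanity check of this equivalence (an illustrative sketch, not part of the original notes): since ℓ(θ) equals a constant minus the sum of squared errors divided by 2σ², the θ that maximizes the log-likelihood on any grid of candidates is the same θ that minimizes J(θ).

```python
import numpy as np

def log_likelihood(theta, X, y, sigma=1.0):
    # l(theta) = m * log(1/(sqrt(2*pi)*sigma)) - sum of squared errors / (2*sigma^2)
    m = len(y)
    residuals = y - X @ theta
    const = m * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma))
    return const - np.sum(residuals ** 2) / (2.0 * sigma ** 2)

def cost_J(theta, X, y):
    # J(theta) = (1/2) * sum_i (y_i - theta^T x_i)^2
    return 0.5 * np.sum((y - X @ theta) ** 2)

# Toy data, roughly y = 2x with noise; first column is the intercept.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.1, 1.9, 4.2, 5.8])

# Search a grid of candidate thetas: the argmax of l(theta)
# coincides with the argmin of J(theta).
grid = [np.array([a, b]) for a in np.linspace(-1.0, 1.0, 21)
                         for b in np.linspace(0.0, 4.0, 41)]
best_by_likelihood = max(grid, key=lambda t: log_likelihood(t, X, y))
best_by_cost = min(grid, key=lambda t: cost_J(t, X, y))
# best_by_likelihood and best_by_cost are the same theta
```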
