1. Least Mean Squares (LMS) algorithm

### Standard Linear Regression

**Goal**: Given a training set, learn a function h : X → Y so that h(x) is a "good" predictor for the corresponding value of y. The function h is called the hypothesis.

**Training example**: a pair $(x^{(i)}, y^{(i)})$, where $x^{(i)}$ is the input variable (our data) and $y^{(i)}$ is the target variable (what we're trying to predict).

**Training set**: a list of m training examples $\{(x^{(i)}, y^{(i)})\,;\ i = 1, \dots, m\}$.

Here $X = Y = \mathbb{R}$, where X denotes the space of input values and Y the space of output values.

**Building the hypothesis function h(x)**

To perform supervised learning, we must decide how we're going to represent functions/hypotheses h in a computer. As an initial choice, let's approximate y as a linear function of x:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

We let $x_0 = 1$ (this is the intercept term), further simplifying to

$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x$$

**Selecting the best parameters θ to predict y**

Our training set may have many input features (e.g. number of rooms, lot size, taxes). So how do we set the parameters θ in our hypothesis function to best predict the outcome y? Answer: pick values of θ for which the predicted value **h(x)** is close to the actual value **y**.

**Cost function**: the following formula defines a function J of θ that measures the difference between h(x) and y; the objective is to make it as small as possible:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

*This least-squares formulation is a **parametric learning algorithm**: even as the training-set size m approaches ∞, the number of parameters in $h_\theta(x)$ stays fixed, so the memory required is constant.*

*How do we optimize J(θ)? We want to choose θ so as to minimize J(θ).* There are several ways to do this: gradient descent and the normal equations.

**Gradient descent algorithm**: a search algorithm that starts with some "initial guess" for θ, and repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ):

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

where α is called the learning rate and is typically a small number.
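To make the pieces concrete, here is a minimal numpy sketch (my own illustration; the toy data and variable names are invented, not from the notes) that evaluates the hypothesis $h_\theta(x) = \theta^Tx$, computes the cost J(θ), and takes one gradient-descent step:

```python
import numpy as np

# Toy training set: m = 4 examples, one feature plus the intercept term x0 = 1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])          # each row is one x^(i), with x0 = 1
y = np.array([1.5, 2.1, 2.9, 4.2])  # target values y^(i)

theta = np.zeros(2)                 # initial guess for the parameters
alpha = 0.05                        # learning rate

def h(theta, X):
    """Hypothesis h_theta(x) = theta^T x, evaluated for every row of X."""
    return X @ theta

def J(theta, X, y):
    """Cost J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    errors = h(theta, X) - y
    return 0.5 * errors @ errors

# One gradient-descent step: theta_j := theta_j - alpha * dJ/dtheta_j,
# where dJ/dtheta_j = sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i).
grad = X.T @ (h(theta, X) - y)
theta = theta - alpha * grad
print(J(theta, X, y))  # the cost drops after the step
```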
**1. For a single training example: the Widrow-Hoff learning rule (LMS update rule)**

For one training example $(x, y)$, the partial derivative

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j}\, \frac{1}{2} \left( h_\theta(x) - y \right)^2$$

simplifies to $\left( h_\theta(x) - y \right) x_j$, and this results in the update rule

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

**Characteristics**: the magnitude of the update is proportional to the error term $\left( y^{(i)} - h_\theta(x^{(i)}) \right)$, so if $h_\theta(x)$ is already close to y, the parameters change by only a small increment.

[Figure: gradient descent descending the surface of J(θ). Courtesy of Stanford SCPD, CS229 Machine Learning.]

[Figure: contours of J(θ) with the path taken by gradient descent.]

Gradient descent can in general be susceptible to local optima. For linear regression, however, the cost function J(θ) is convex: it has no local optimum other than the global optimum.

**2. For multiple training examples**

**M1: Batch gradient descent**

Repeat until convergence, for every j:

$$\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

The quantity in the summation above is just ∂J(θ)/∂θ_j (for the original definition of J), so this is simply gradient descent on the original cost function J. Evaluation: batch gradient descent has to scan through the entire training set before taking a single step.

**M2: Stochastic gradient descent (also incremental gradient descent)**

Loop through the training set; for each example i, update every j:

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

Evaluation: we repeatedly run through the training set, and each time we encounter a training example we update the parameters according to the gradient of the error with respect to that single example only. This method makes progress toward the minimum more rapidly, but it can overshoot the minimum and then oscillate around it. By slowly letting the learning rate α decrease to zero as the algorithm runs, it is also possible to ensure that the parameters converge to the global minimum rather than merely oscillating around it.
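The contrast between the two schedules is easy to see in code. A minimal sketch, reusing the invented toy data from above (names and data are mine, not from the notes):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """M1: scan the whole training set before each parameter update."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        errors = y - X @ theta              # y^(i) - h_theta(x^(i)) for all i
        theta += alpha * X.T @ errors       # one step uses every example
    return theta

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=100):
    """M2: update the parameters after every single training example."""
    theta = np.zeros(X.shape[1])
    m = X.shape[0]
    for _ in range(n_epochs):
        for i in range(m):
            error = y[i] - X[i] @ theta     # error on example i only
            theta += alpha * error * X[i]   # LMS / Widrow-Hoff update
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([1.5, 2.1, 2.9, 4.2])
print(batch_gradient_descent(X, y))
print(stochastic_gradient_descent(X, y))   # both approach the same minimum
```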
**Normal equation**: the goal is to minimize J by explicitly taking its derivatives with respect to the θ_j's and setting them to zero (i.e. finding where the gradient vanishes, which here is the minimum).

**Matrix derivatives**

For a function $f : \mathbb{R}^{m \times n} \mapsto \mathbb{R}$ mapping from m×n matrices to real numbers, the *gradient* of f with respect to A is the m×n matrix $\nabla_A f(A)$ whose (i, j) entry is $\partial f / \partial A_{ij}$:

$$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$$

**Trace**

For a square matrix A, $\operatorname{tr} A = \sum_i A_{ii}$ is the sum of its diagonal entries.

Properties of traces:

- $\operatorname{tr} AB = \operatorname{tr} BA$
- $\operatorname{tr} ABC = \operatorname{tr} CAB = \operatorname{tr} BCA$
- $(AB)^T = B^T A^T$
- $\operatorname{tr} A = \operatorname{tr} A^T$, $\ \operatorname{tr}(A + B) = \operatorname{tr} A + \operatorname{tr} B$, $\ \operatorname{tr}\, aA = a \operatorname{tr} A$

Additional rules:

$$\nabla_A \operatorname{tr} AB = B^T \tag{1}$$
$$\nabla_{A^T} f(A) = \left( \nabla_A f(A) \right)^T \tag{2}$$
$$\nabla_A \operatorname{tr} ABA^TC = CAB + C^TAB^T \tag{3}$$
$$\nabla_A |A| = |A| \left( A^{-1} \right)^T \tag{4}$$
For (1), we can treat the trace as the inner product of two long vectors: A flattened by rows, $[A(1,1), \dots, A(1,n), \dots, A(n,1), \dots, A(n,n)]$, and B flattened by columns, $[B(1,1), \dots, B(n,1), \dots, B(1,n), \dots, B(n,n)]$. Thus the trace is

$$\operatorname{tr} AB = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} B_{ji}$$

which is a linear function of all the elements of A.
Taking the partial derivative with respect to a single entry $A_{pq}$,

$$\frac{\partial}{\partial A_{pq}} \operatorname{tr} AB = \frac{\partial}{\partial A_{pq}} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} B_{ji} = B_{qp}$$

since only one term in the sum contains $A_{pq}$. We conclude

$$\left( \nabla_A \operatorname{tr} AB \right)_{pq} = B_{qp} \implies \nabla_A \operatorname{tr} AB = B^T$$
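A quick numerical sanity check of these identities (my own illustration, not part of the original notes), using random matrices and a finite-difference approximation of the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

# Cyclic property of the trace: tr AB = tr BA, tr ABC = tr CAB.
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))

# Rule (1): grad_A tr(AB) = B^T, checked entrywise by finite differences.
eps = 1e-6
grad = np.zeros_like(A)
for p in range(3):
    for q in range(3):
        A_plus = A.copy()
        A_plus[p, q] += eps        # perturb the single entry A_pq
        grad[p, q] = (np.trace(A_plus @ B) - np.trace(A @ B)) / eps
assert np.allclose(grad, B.T, atol=1e-4)
```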
**Least squares method (applying the matrix rules)**

We use a compressed vector notation for $\nabla_\theta J(\theta)$, using a design matrix **X** to hold the input values from our training examples.

*Proof that $\nabla_\theta J(\theta) = X^TX\theta - X^T\vec{y}$:*

1. Define the design matrix X, which contains the training examples' input values in its rows:
$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$$
2. Let $\vec{y}$ be the m-dimensional vector containing all the target values from the training set: $\vec{y} = [y^{(1)}, y^{(2)}, \dots, y^{(m)}]^T$.
3. Then $X\theta - \vec{y} = \begin{bmatrix} (x^{(1)})^T\theta \\ \vdots \\ (x^{(m)})^T\theta \end{bmatrix} - \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix}$.
4. The hypothesis $h_\theta(x^{(i)})$ is equivalent to $(x^{(i)})^T\theta$, from the linear equation $h_\theta(x) = \theta^Tx$. (Note: the transpose $(x^{(i)})^T$ is simply a way of expressing x in rows.)
5. So $X\theta - \vec{y} = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{bmatrix}$.
6. The inner product of a vector with itself is the sum of the squares of its elements ($z^Tz = \sum_i z_i^2$), so
$$\frac{1}{2} \left( X\theta - \vec{y} \right)^T \left( X\theta - \vec{y} \right) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
7. With this simple transformation, $J(\theta) = \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y})$.

Combining equations (2) and (3) above (applied with $A^T = \theta$, $B = B^T = X^TX$, and $C = I$), the derivative works out to

$$\nabla_\theta J(\theta) = X^TX\theta - X^T\vec{y}$$

We set this derivative to 0 and solve for θ:

**NORMAL EQUATION**

$$X^TX\theta = X^T\vec{y}$$

Solving via the pseudo-inverse gives

$$\theta = \left( X^TX \right)^{-1} X^T\vec{y}$$
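Putting the result to work: a minimal numpy sketch (same invented toy data as earlier) that solves the normal equation in closed form, with no iteration. In practice np.linalg.solve, or np.linalg.lstsq, is preferred over forming the explicit inverse, since it is more numerically stable:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # design matrix
y = np.array([1.5, 2.1, 2.9, 4.2])                              # targets

# Normal equation: X^T X theta = X^T y  =>  theta = (X^T X)^{-1} X^T y.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, and more robust when X^T X is ill-conditioned:
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_closed)   # exact minimizer of J(theta), no learning rate needed
print(theta_lstsq)    # same answer via least squares
```

This should match the θ that batch and stochastic gradient descent converge to in the earlier sketches.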