4. Exponential family

The Exponential family of distributions can be written as:

η: natural parameter (vector)
T(y): sufficient statistic (vector), usually T(y) = y
a(η): log partition function
e−a(η): normalization constant

By fixing a, b and T, p(y; η) defines a set of distributions which vary by  η
As we vary η, we get different distributions within the set

Bernoulli (classification regression) and Gaussian (logistic regression) examples can be expressed within the exponential family

Proof: Bernoulli Regression
We need to find values of a,b and T such that as we vary η, we get different distributions of y and prove that every member of the Bernoulli set has a corresponding member in the Exponential family

Bernoulli distribution with a mean (φ), specifies a distribution over y ∈ {0, 1}
p(y = 1;φ) = φ; p(y = 0;φ) = 1−φ

The Bernoulli distribution:

We can map these components to a,b, T via 

Now that we have values for a,b and T, we can find a value in one distribution and map to another

Proof: Gaussian Regression
We set σ2 = 1 since the value of σ2 had no effect on our final choice of θ and hθ(x)

We similarly can map a,b and T

Poisson distribution

Good for counts (visitors, hits) > lots of individual people and each has a minute chance of turning up in your count

Generalized Learning Model
A way to model y given some value of y and η

Where η is a linear 
Where η is a vector

 

Multinomial
Example where there are multiple categories of data (ie. ebay listing)
Can we come up with a decision boundary that separates categories?

 

 

 

 

3. Logistic regression

Binary Classification
Classification problems differ from regression problems in that takes on only a small number of discrete values. In the case of binary classification, y takes on values of ∈ {0, 1}.

Logistic Regression algorithm

We define our hypothesis h(x) as follows:

 where 

This is a sigmoid OR logistic function, which is rotationally symmetric

Estimating the output of the function:

We can restate this as 

With this, we can define the likelihood function as:

 =  =

We can state the log likelihood as:
 =  

Stochastic gradient ascent rule
To maximize log likelihood, we create a learning algorithm that updates as follows:

Note that this is the inverse of the gradient descent rule commonly applied to Linear regression to reduce the cost function. In this case, we’re trying to maximize the log likelihood.

Given that

Perceptron Algorithm

We modify the logistic learning algorithm to force an output of 1 or 0

We determine a hypothesis of 

We redefine g as

We change the gradient ascent update rule to become

Newton’s Method

Given a function : R 􏰀→ R, we want to find a value of θ so that f(θ)=0

If we want to maximize the likelihood function l(θ), the maxima of l correspond to points where its first derivative l′(θ) is zero.

By letting f(θ) = l′(θ), we derive the update rule

Vectorization of Newton’s method
The above formula applies to scalar value of θ. We update the formula to a vectorized format using the Newton-Raphson method:

H is an n-by-n matrix called the Hessian

When Newton’s method is applied to maximize the logistic regression log likelihood function l(θ), the resulting method is also called Fisher scoring.

 

 

Supervised learning setup