
Linear Regression

Linear Function

Let’s say we decide to approximate y as a linear function of x

h_{\theta}(x) = \theta_0 * x_0 + \theta_1 * x_1 + ... + \theta_n * x_n = \sum_{i=0}^n \theta_i * x_i = \begin{pmatrix} \theta_0 \\ \theta_1 \\ ... \\ \theta_n \end{pmatrix}^T * \begin{pmatrix} x_0 \\ x_1 \\ ... \\ x_n \end{pmatrix}

where the θ_i's are the parameters (also called weights) parameterizing the space of linear functions mapping from X to Y, and θ and x are both (n+1) × 1 vectors.
For a collection of training data x and y, where x^(j) and y^(j) denote the j-th example (j = 0, ..., m), the linear function over the whole dataset can be written as

H_{\theta}(X) = \begin{pmatrix} y^{(0)} \\ y^{(1)} \\ ...\\ y^{(m)} \end{pmatrix} = \begin{pmatrix} \theta_0 * x_{0}^{(0)} + \theta_1 * x_{1}^{(0)} + ... + \theta_n * x_{n}^{(0)} \\ \theta_0 * x_{0}^{(1)} + \theta_1 * x_{1}^{(1)} + ... + \theta_n * x_{n}^{(1)} \\ ... \\ \theta_0 * x_{0}^{(m)} + \theta_1 * x_{1}^{(m)} + ... + \theta_n * x_{n}^{(m)} \end{pmatrix} = \begin{pmatrix} \sum_{i=0}^n \theta_i * x_i^{(0)} \\ \sum_{i=0}^n \theta_i * x_i^{(1)} \\ ... \\ \sum_{i=0}^n \theta_i * x_i^{(m)} \end{pmatrix} = \begin{pmatrix} x_0^{(0)}, x_1^{(0)}, ..., x_n^{(0)} \\ x_0^{(1)}, x_1^{(1)}, ..., x_n^{(1)} \\ ... \\ x_0^{(m)}, x_1^{(m)}, ..., x_n^{(m)} \end{pmatrix} * \begin{pmatrix} \theta_0, \theta_1, ..., \theta_n \end{pmatrix} ^T = X * \theta
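
A minimal NumPy sketch of the hypothesis in this matrix form, assuming x_0 is the constant intercept feature equal to 1 (the toy data and variable names are illustrative, not from the text):

```python
import numpy as np

# Design matrix X: each row is one training example (x_0, x_1, ..., x_n),
# with x_0 = 1 acting as the intercept feature.
X = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 4.0, 5.0],
    [1.0, 6.0, 7.0],
])
theta = np.array([0.5, 1.0, -2.0])  # parameters (theta_0, ..., theta_n)

# Single example: h_theta(x) = theta^T x
h_single = theta @ X[0]

# Whole training set at once: H_theta(X) = X theta
h_all = X @ theta
print(h_single, h_all)
```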

Cost Function

The cost function for a training dataset

J_{x}(\theta) = \frac{1}{2m}\sum_{i=0}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2
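
A minimal NumPy sketch of this cost (the function name `compute_cost` and the toy data are illustrative, not from the text):

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(y)                      # number of training examples
    errors = X @ theta - y          # h_theta(x^(i)) - y^(i) for every i
    return (errors ** 2).sum() / (2 * m)

# Example usage with toy data (first column of X is x_0 = 1)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 2.5, 3.5])
theta = np.zeros(2)
print(compute_cost(X, y, theta))
```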

Gradient Descent

To minimize J(θ), let’s consider the gradient descent algorithm, which starts with some initial θ and repeatedly performs the following update. For a single training example,

\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta) = \theta_j - \frac{\alpha}{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}

Here, α is called the learning rate.
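
A minimal sketch of one such update for a single training example i, following the formula above; the α/m factor comes from the 1/(2m) scaling in the cost, and the function name and arguments are illustrative:

```python
import numpy as np

def update_single_example(theta, X, y, i, alpha):
    """Apply theta_j := theta_j - (alpha/m) * (h_theta(x^(i)) - y^(i)) * x_j^(i)
    simultaneously for every j."""
    m = len(y)
    error = X[i] @ theta - y[i]          # h_theta(x^(i)) - y^(i), a scalar
    return theta - (alpha / m) * error * X[i]
```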

Batch Gradient Descent

In this algorithm, every step looks at the entire training set: we sum the gradient of the error over all training examples and only then update the parameters, so each iteration performs a single update based on the whole batch, as sketched below.
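
A minimal sketch of batch gradient descent under this definition, assuming X already contains the x_0 = 1 column; names, the learning rate, and the fixed iteration count are illustrative:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        # Accumulate the gradient over the whole training set.
        grad = np.zeros(n)
        for i in range(m):
            error = X[i] @ theta - y[i]   # h_theta(x^(i)) - y^(i)
            grad += error * X[i]
        theta -= (alpha / m) * grad       # one update per full pass
    return theta
```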

Batch Gradient Descent Matrix Representation

Again, the cost function is

J_{x}(\theta) = \frac{1}{2m}\sum_{i=0}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m}(X\theta - y)^T(X\theta - y)

\begin{aligned} (X\theta - y)^T(X\theta - y) &= ((X\theta)^T - y^T)(X\theta - y) \\ &= (\theta^TX^T - y^T)(X\theta - y) \\ &= \theta^TX^TX\theta - \theta^TX^Ty - y^TX\theta + y^Ty \end{aligned}

The partial derivative term of the cost function is

\begin{aligned} d_{\theta}((X\theta - y)^T(X\theta - y)) &= d_{\theta}(\theta^TX^TX\theta - \theta^TX^Ty - y^TX\theta + y^Ty) \\ &= d_{\theta}(\theta^TX^TX\theta) - d_{\theta}(\theta^TX^Ty) - d_{\theta}(y^TX\theta) + d_{\theta}(y^Ty) \\ &= d_{\theta}(\theta^TX^TX\theta) - d_{\theta}(\theta^TX^Ty) - d_{\theta}(y^TX\theta) + 0 \end{aligned}

\begin{aligned} d_{\theta}(\theta^TX^Ty) = X^Ty\,d(\theta^T) = ((X^Ty)^Td(\theta))^T = X^Ty\,d(\theta) \end{aligned}

d_{\theta}(y^TX\theta) = y^TX\,d(\theta)
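
Combining these terms with the standard result d_θ(θ^T X^T X θ) = 2 X^T X θ (not derived above) gives the vectorized gradient (1/m) X^T (Xθ - y). A minimal sketch of the matrix-form update built on that expression (names and hyperparameters are illustrative):

```python
import numpy as np

def batch_gradient_descent_vectorized(X, y, alpha=0.01, num_iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / m   # (1/m) * X^T (X theta - y)
        theta -= alpha * grad
    return theta
```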

  • Sidebar

d(X^T) = d(X)^T

Linear Products
Let A be a matrix or vector, and let x be a vector.

d_{x}(x^TA) = A

Proof:

d_{x}(x^TA) = Ad_{x}(x^T) + x^Td_x(A) = A * I + 0 = A
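
As a quick sanity check of this identity, a finite-difference sketch with arbitrary random vectors (not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=n)
x = rng.normal(size=n)
eps = 1e-6

# Approximate the gradient of f(x) = x^T A component by component.
grad = np.zeros(n)
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    grad[j] = ((x + e) @ A - (x - e) @ A) / (2 * eps)

print(np.allclose(grad, A))  # expected: True, matching d_x(x^T A) = A
```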