Linear Function
Let’s say we decide to approximate y as a linear function of x
$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \sum_{i=0}^{n} \theta_i x_i = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{pmatrix}^T \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{pmatrix}$$
where the $\theta_i$'s are the parameters (also called weights) parameterizing the space of linear functions mapping from $X$ to $Y$; $\theta$ and $x$ are both $(n+1) \times 1$ vectors, with $x_0 = 1$ by convention for the intercept term.
For a training set of examples $x$ and $y$, where $x^{(j)}$ and $y^{(j)}$ denote the $j$-th example, the linear function can be written as
$$H_\theta(X) = \begin{pmatrix} y^{(0)} \\ y^{(1)} \\ \vdots \\ y^{(m)} \end{pmatrix} = \begin{pmatrix} \theta_0 x_0^{(0)} + \theta_1 x_1^{(0)} + \dots + \theta_n x_n^{(0)} \\ \theta_0 x_0^{(1)} + \theta_1 x_1^{(1)} + \dots + \theta_n x_n^{(1)} \\ \vdots \\ \theta_0 x_0^{(m)} + \theta_1 x_1^{(m)} + \dots + \theta_n x_n^{(m)} \end{pmatrix} = \begin{pmatrix} \sum_{i=0}^{n} \theta_i x_i^{(0)} \\ \sum_{i=0}^{n} \theta_i x_i^{(1)} \\ \vdots \\ \sum_{i=0}^{n} \theta_i x_i^{(m)} \end{pmatrix} = \begin{pmatrix} x_0^{(0)} & x_1^{(0)} & \dots & x_n^{(0)} \\ x_0^{(1)} & x_1^{(1)} & \dots & x_n^{(1)} \\ \vdots & \vdots & & \vdots \\ x_0^{(m)} & x_1^{(m)} & \dots & x_n^{(m)} \end{pmatrix} \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{pmatrix} = X\theta$$
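As a quick sanity check, the element-wise sum $\sum_i \theta_i x_i$ and the matrix product $X\theta$ give the same predictions. A minimal NumPy sketch with hypothetical toy values:

```python
import numpy as np

# Hypothetical toy data: m = 3 examples, n = 2 features plus the
# intercept term x_0 = 1, so each row of X is (x_0, x_1, x_2).
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])
theta = np.array([0.5, 1.0, -1.0])

# Element-wise definition: h_theta(x) = sum_i theta_i * x_i
h_sum = np.array([sum(t * xi for t, xi in zip(theta, row)) for row in X])

# Matrix form: H_theta(X) = X @ theta
h_mat = X @ theta

assert np.allclose(h_sum, h_mat)
```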
Cost Function
The cost function for a training dataset
$$J(\theta) = \frac{1}{2m} \sum_{i=0}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
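The cost above can be computed directly in vectorized form. A minimal sketch, with a hypothetical dataset lying exactly on $y = 1 + x$ so the true parameters give zero cost:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum_i (h_theta(x_i) - y_i)^2, vectorized."""
    m = len(y)
    residual = X @ theta - y
    return residual @ residual / (2 * m)

# Hypothetical data lying exactly on y = 1 + x, so theta = (1, 1)
# reproduces y and gives J = 0.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

j_best = cost(np.array([1.0, 1.0]), X, y)
j_zero = cost(np.zeros(2), X, y)
```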
Gradient Descent
To minimize $J(\theta)$, let's consider the gradient descent algorithm, which starts with some initial $\theta$ and repeatedly performs the update. For a single training example,
$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} = \theta_j - \frac{\alpha}{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
Here, α is called the learning rate.
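One such update step can be sketched as follows; the example values and the learning rate are hypothetical:

```python
import numpy as np

def gd_step(theta, x_i, y_i, alpha, m):
    """One update: theta_j := theta_j - (alpha/m) * (h(x_i) - y_i) * x_ij,
    applied to every component j at once."""
    error = x_i @ theta - y_i          # h_theta(x_i) - y_i
    return theta - (alpha / m) * error * x_i

# Hypothetical single example x = (1, 2) with target y = 3.
theta = np.zeros(2)
theta_new = gd_step(theta, np.array([1.0, 2.0]), 3.0, alpha=0.1, m=1)
```

After the step the residual on this example shrinks from $-3$ to $-1.5$, illustrating that each update moves $\theta$ against the gradient of the squared error.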
Batch Gradient Descent
In this algorithm, every step sweeps over the entire training set and updates the parameters according to the gradient of the error summed over all training examples, rather than a single example at a time.
Batch Gradient Descent Matrix Representation
Again, the cost function is
$$J(\theta) = \frac{1}{2m} \sum_{i=0}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2m} (X\theta - y)^T (X\theta - y)$$
$$(X\theta - y)^T (X\theta - y) = \left( (X\theta)^T - y^T \right) (X\theta - y) = (\theta^T X^T - y^T)(X\theta - y) = \theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y$$
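The expansion can be verified numerically on arbitrary (hypothetical) values:

```python
import numpy as np

# Sanity check of the expansion:
# (X theta - y)^T (X theta - y)
#   == theta^T X^T X theta - theta^T X^T y - y^T X theta + y^T y
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
theta = rng.normal(size=3)
y = rng.normal(size=5)

lhs = (X @ theta - y) @ (X @ theta - y)
rhs = (theta @ X.T @ X @ theta
       - theta @ X.T @ y
       - y @ X @ theta
       + y @ y)
```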
The partial derivative of the cost function with respect to $\theta$ is
$$\frac{\partial}{\partial \theta}\left((X\theta - y)^T (X\theta - y)\right) = \frac{\partial}{\partial \theta}(\theta^T X^T X \theta) - \frac{\partial}{\partial \theta}(\theta^T X^T y) - \frac{\partial}{\partial \theta}(y^T X \theta) + \frac{\partial}{\partial \theta}(y^T y) = \frac{\partial}{\partial \theta}(\theta^T X^T X \theta) - \frac{\partial}{\partial \theta}(\theta^T X^T y) - \frac{\partial}{\partial \theta}(y^T X \theta) + 0$$
For the cross terms, note that $\theta^T X^T y$ is a scalar, so it equals its own transpose: $\theta^T X^T y = (\theta^T X^T y)^T = y^T X \theta$. Applying the linear-product rule below with $A = X^T y$,
$$\frac{\partial}{\partial \theta}(\theta^T X^T y) = \frac{\partial}{\partial \theta}(y^T X \theta) = X^T y$$
For the quadratic term,
$$\frac{\partial}{\partial \theta}(\theta^T X^T X \theta) = 2 X^T X \theta$$
Combining the pieces, the gradient of the cost function is
$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m}\left(2 X^T X \theta - 2 X^T y\right) = \frac{1}{m} X^T (X\theta - y)$$
Here we also used the transpose rule
$$d(X^T) = (dX)^T$$
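The gradient $\frac{1}{m} X^T (X\theta - y)$ gives the vectorized batch update $\theta := \theta - \frac{\alpha}{m} X^T (X\theta - y)$. A minimal NumPy sketch; the data and hyperparameters are hypothetical:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha, iters):
    """Repeatedly apply theta := theta - (alpha/m) * X^T (X theta - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        theta -= (alpha / m) * (X.T @ (X @ theta - y))
    return theta

# Hypothetical noiseless data generated from y = 1 + 2x, so gradient
# descent should recover theta close to (1, 2).
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])   # first column is x_0 = 1
y = 1.0 + 2.0 * x
theta = batch_gradient_descent(X, y, alpha=0.5, iters=2000)
```

The learning rate must be small enough that $\alpha$ times the largest eigenvalue of $\frac{1}{m} X^T X$ stays below 2, otherwise the iteration diverges.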
Linear Products
Let $A$ be a constant matrix/vector and $x$ a vector. Then
$$\frac{\partial}{\partial x}(x^T A) = A$$
Proof:
$$\frac{\partial}{\partial x}(x^T A) = \frac{\partial x^T}{\partial x} A + x^T \frac{\partial A}{\partial x} = I A + 0 = A$$
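The lemma can also be checked numerically with central finite differences; the values of $A$ and $x$ below are arbitrary, hypothetical choices:

```python
import numpy as np

# Numerically verify d/dx (x^T A) = A: the j-th partial derivative
# of the scalar x^T A, estimated by central differences, should
# match the j-th entry of A.
rng = np.random.default_rng(42)
n = 4
A = rng.normal(size=n)
x = rng.normal(size=n)

eps = 1e-6
grad = np.empty(n)
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    grad[j] = ((x + e) @ A - (x - e) @ A) / (2 * eps)
```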