Linear Function
Let’s say we decide to approximate y as a linear function of x
$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \sum_{i=0}^{n} \theta_i x_i = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{pmatrix}^T \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{pmatrix}$$
where the $\theta_i$'s are the parameters (also called weights) parameterizing the space of linear functions mapping from $X$ to $Y$; $\theta$ and $x$ are both $(n+1) \times 1$ vectors, with $x_0 = 1$ by convention for the intercept term.
For a training set of examples $x$ and $y$, where $x^{(j)}$ and $y^{(j)}$ denote the $j$-th example, the linear function can be written as
$$H_\theta(X) = \begin{pmatrix} y^{(0)} \\ y^{(1)} \\ \vdots \\ y^{(m)} \end{pmatrix} = \begin{pmatrix} \theta_0 x_0^{(0)} + \theta_1 x_1^{(0)} + \dots + \theta_n x_n^{(0)} \\ \theta_0 x_0^{(1)} + \theta_1 x_1^{(1)} + \dots + \theta_n x_n^{(1)} \\ \vdots \\ \theta_0 x_0^{(m)} + \theta_1 x_1^{(m)} + \dots + \theta_n x_n^{(m)} \end{pmatrix} = \begin{pmatrix} \sum_{i=0}^{n} \theta_i x_i^{(0)} \\ \sum_{i=0}^{n} \theta_i x_i^{(1)} \\ \vdots \\ \sum_{i=0}^{n} \theta_i x_i^{(m)} \end{pmatrix} = \begin{pmatrix} x_0^{(0)} & x_1^{(0)} & \dots & x_n^{(0)} \\ x_0^{(1)} & x_1^{(1)} & \dots & x_n^{(1)} \\ \vdots & \vdots & & \vdots \\ x_0^{(m)} & x_1^{(m)} & \dots & x_n^{(m)} \end{pmatrix} \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{pmatrix} = X\theta$$
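As a quick sanity check, the element-wise sum $\sum_i \theta_i x_i$ and the matrix product $X\theta$ give the same predictions. A minimal NumPy sketch with hypothetical toy values:

```python
import numpy as np

# Hypothetical toy data: m = 3 examples, n = 2 features plus the
# intercept term x_0 = 1, so each row of X is (x_0, x_1, x_2).
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])
theta = np.array([0.5, 1.0, -1.0])

# Element-wise definition: h_theta(x) = sum_i theta_i * x_i
h_sum = np.array([sum(t * xi for t, xi in zip(theta, row)) for row in X])

# Matrix form: H_theta(X) = X @ theta
h_mat = X @ theta

assert np.allclose(h_sum, h_mat)
```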
Cost Function
The cost function for a training dataset
$$J(\theta) = \frac{1}{2m} \sum_{i=0}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
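The cost above can be computed directly in vectorized form. A minimal sketch, with a hypothetical dataset lying exactly on $y = 1 + x$ so the true parameters give zero cost:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum_i (h_theta(x_i) - y_i)^2, vectorized."""
    m = len(y)
    residual = X @ theta - y
    return residual @ residual / (2 * m)

# Hypothetical data lying exactly on y = 1 + x, so theta = (1, 1)
# reproduces y and gives J = 0.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

j_best = cost(np.array([1.0, 1.0]), X, y)
j_zero = cost(np.zeros(2), X, y)
```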
Gradient Descent
To minimize $J(\theta)$, let's consider the gradient descent algorithm, which starts with some initial $\theta$ and repeatedly performs the update. For a single training example,
$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} = \theta_j - \frac{\alpha}{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
Here, α is called the learning rate.
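One such update step can be sketched as follows; the example values and the learning rate are hypothetical:

```python
import numpy as np

def gd_step(theta, x_i, y_i, alpha, m):
    """One update: theta_j := theta_j - (alpha/m) * (h(x_i) - y_i) * x_ij,
    applied to every component j at once."""
    error = x_i @ theta - y_i          # h_theta(x_i) - y_i
    return theta - (alpha / m) * error * x_i

# Hypothetical single example x = (1, 2) with target y = 3.
theta = np.zeros(2)
theta_new = gd_step(theta, np.array([1.0, 2.0]), 3.0, alpha=0.1, m=1)
```

After the step the residual on this example shrinks from $-3$ to $-1.5$, illustrating that each update moves $\theta$ against the gradient of the squared error.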
Batch Gradient Descent
In this algorithm, every step sweeps over the entire training set and updates the parameters according to the gradient of the error summed over all training examples, rather than a single example at a time.
Batch Gradient Descent Matrix Representation
Again, the cost function is
$$J(\theta) = \frac{1}{2m} \sum_{i=0}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2m} (X\theta - y)^T (X\theta - y)$$
$$(X\theta - y)^T (X\theta - y) = \left( (X\theta)^T - y^T \right) (X\theta - y) = (\theta^T X^T - y^T)(X\theta - y) = \theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y$$
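The expansion can be verified numerically on arbitrary (hypothetical) values:

```python
import numpy as np

# Sanity check of the expansion:
# (X theta - y)^T (X theta - y)
#   == theta^T X^T X theta - theta^T X^T y - y^T X theta + y^T y
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
theta = rng.normal(size=3)
y = rng.normal(size=5)

lhs = (X @ theta - y) @ (X @ theta - y)
rhs = (theta @ X.T @ X @ theta
       - theta @ X.T @ y
       - y @ X @ theta
       + y @ y)
```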
The partial derivative of the cost function with respect to $\theta$ is
$$\frac{\partial}{\partial \theta}\left((X\theta - y)^T (X\theta - y)\right) = \frac{\partial}{\partial \theta}(\theta^T X^T X \theta) - \frac{\partial}{\partial \theta}(\theta^T X^T y) - \frac{\partial}{\partial \theta}(y^T X \theta) + \frac{\partial}{\partial \theta}(y^T y) = \frac{\partial}{\partial \theta}(\theta^T X^T X \theta) - \frac{\partial}{\partial \theta}(\theta^T X^T y) - \frac{\partial}{\partial \theta}(y^T X \theta) + 0$$
For the cross terms, note that $\theta^T X^T y$ is a scalar, so it equals its own transpose: $\theta^T X^T y = (\theta^T X^T y)^T = y^T X \theta$. Applying the linear-product rule below with $A = X^T y$,
$$\frac{\partial}{\partial \theta}(\theta^T X^T y) = \frac{\partial}{\partial \theta}(y^T X \theta) = X^T y$$
For the quadratic term,
$$\frac{\partial}{\partial \theta}(\theta^T X^T X \theta) = 2 X^T X \theta$$
Combining the pieces, the gradient of the cost function is
$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m}\left(2 X^T X \theta - 2 X^T y\right) = \frac{1}{m} X^T (X\theta - y)$$
Here we also used the transpose rule
$$d(X^T) = (dX)^T$$
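The gradient $\frac{1}{m} X^T (X\theta - y)$ gives the vectorized batch update $\theta := \theta - \frac{\alpha}{m} X^T (X\theta - y)$. A minimal NumPy sketch; the data and hyperparameters are hypothetical:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha, iters):
    """Repeatedly apply theta := theta - (alpha/m) * X^T (X theta - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        theta -= (alpha / m) * (X.T @ (X @ theta - y))
    return theta

# Hypothetical noiseless data generated from y = 1 + 2x, so gradient
# descent should recover theta close to (1, 2).
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])   # first column is x_0 = 1
y = 1.0 + 2.0 * x
theta = batch_gradient_descent(X, y, alpha=0.5, iters=2000)
```

The learning rate must be small enough that $\alpha$ times the largest eigenvalue of $\frac{1}{m} X^T X$ stays below 2, otherwise the iteration diverges.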
Linear Products
Let $A$ be a constant matrix/vector and $x$ a vector. Then
$$\frac{\partial}{\partial x}(x^T A) = A$$
Proof:
$$\frac{\partial}{\partial x}(x^T A) = \frac{\partial x^T}{\partial x} A + x^T \frac{\partial A}{\partial x} = I A + 0 = A$$
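The lemma can also be checked numerically with central finite differences; the values of $A$ and $x$ below are arbitrary, hypothetical choices:

```python
import numpy as np

# Numerically verify d/dx (x^T A) = A: the j-th partial derivative
# of the scalar x^T A, estimated by central differences, should
# match the j-th entry of A.
rng = np.random.default_rng(42)
n = 4
A = rng.normal(size=n)
x = rng.normal(size=n)

eps = 1e-6
grad = np.empty(n)
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    grad[j] = ((x + e) @ A - (x - e) @ A) / (2 * eps)
```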