5

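Exercise 1

When a model is fitted to minimize the error between predicted and true values, that error is commonly measured with the vector 2-norm, and solving for the model requires gradients. Find the following gradients:

Given \(f(A)=\frac{1}{2}\lVert Ax+b-y\rVert_2^2\), find \(\frac{\partial f}{\partial A}\).

Given \(f(x)=\frac{1}{2}\lVert Ax+b-y\rVert_2^2\), find \(\frac{\partial f}{\partial x}\).

Here \(A\in R^{m\times n}\), \(x\in R^n\), and \(b,y\in R^m\).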

\(df=Tr(df)=Tr\{d[\frac{1}{2}(Ax+b-y)^T(Ax+b-y)]\}\)

\(=\frac{1}{2}Tr\{d[(Ax)^T(Ax)+(b-y)^T(Ax)+(Ax)^T(b-y)+(b-y)^T(b-y)]\}\)

\(=\frac{1}{2}Tr\{d[(Ax)^T(Ax)]+d[(b-y)^T(Ax)]+d[(Ax)^T(b-y)]+d[(b-y)^T(b-y)]\}\)

\(=\frac{1}{2}Tr\{2(Ax)^Td(Ax)+(b-y)^Td(Ax)+d(b-y)^T(Ax)+(Ax)^Td(b-y)+d(Ax)^T(b-y)+d[(b-y)^T(b-y)]\}\)

\(=\frac{1}{2}Tr\{2(Ax)^Td(Ax)+(b-y)^Td(Ax)+d(Ax)^T\cdot(b-y)+(\text{terms independent of }dA\text{ or }dx)\}\)

\(=\frac{1}{2}Tr\{2(Ax)^Td(Ax)+(b-y)^Td(Ax)+d(Ax)^T\cdot[(b-y)^T]^T+(\text{terms independent of }dA\text{ or }dx)\}\)

\(=\frac{1}{2}Tr\{2(Ax)^Td(Ax)+(b-y)^Td(Ax)+[(b-y)^Td(Ax)]^T+(\text{terms independent of }dA\text{ or }dx)\}\)

\(=Tr\{(Ax)^Td(Ax)+(b-y)^Td(Ax)\}+(\text{terms independent of }dA\text{ or }dx)\)

\(=Tr\{(Ax+b-y)^Td(Ax)\}+(\text{terms independent of }dA\text{ or }dx)\)

\(=Tr\{(Ax+b-y)^T(Adx+dA\cdot x)\}+(\text{terms independent of }dA\text{ or }dx)\)

\(=Tr\{(Ax+b-y)^TAdx+x(Ax+b-y)^TdA\}+(\text{terms independent of }dA\text{ or }dx)\)

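Also, \(df=Tr\left[\left(\frac{\partial f}{\partial A}\right)^T dA\right]\) when only \(A\) varies, and likewise \(df=Tr\left[\left(\frac{\partial f}{\partial x}\right)^T dx\right]\) when only \(x\) varies.

Matching the coefficients of \(dA\) and \(dx\) therefore gives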

\(\frac{\partial}{\partial A}f=(Ax+b-y)x^T\)

\(\frac{\partial}{\partial x}f=A^T(Ax+b-y)\)
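
Both gradients can be sanity-checked numerically. The sketch below compares the closed forms against central finite differences. The sizes, the random data, and the helper name `numeric_grad` are assumptions made for illustration and are not part of the exercise.

```python
import numpy as np

def f(A, x, b, y):
    """Half the squared 2-norm of the residual Ax + b - y."""
    r = A @ x + b - y
    return 0.5 * (r @ r)

def numeric_grad(fun, V, eps=1e-6):
    """Central-difference gradient of a scalar function with respect to array V."""
    G = np.zeros_like(V)
    for idx in np.ndindex(V.shape):
        step = np.zeros_like(V)
        step[idx] = eps
        G[idx] = (fun(V + step) - fun(V - step)) / (2 * eps)
    return G

# Arbitrary test data (illustrative sizes, fixed seed for reproducibility)
rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
b = rng.standard_normal(m)
y = rng.standard_normal(m)

r = A @ x + b - y
grad_A = np.outer(r, x)   # (Ax + b - y) x^T
grad_x = A.T @ r          # A^T (Ax + b - y)

err_A = np.max(np.abs(grad_A - numeric_grad(lambda M: f(M, x, b, y), A)))
err_x = np.max(np.abs(grad_x - numeric_grad(lambda v: f(A, v, b, y), x)))
print(err_A, err_x)       # both should be close to zero
```

Because \(f\) is quadratic in both \(A\) and \(x\), central differences are exact up to rounding, so any visible discrepancy would point to an error in the formula rather than in the step size.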

Exercise 2

Quadratic forms are commonly used functions in data analysis. Find \(\frac{\partial x^TAx}{\partial x}\) and \(\frac{\partial x^TAx}{\partial A}\), where \(A\in R^{m\times m}\) and \(x\in R^m\).

\(\frac{\partial x^TAx}{\partial x}=(A+A^T)x\) (lec15, Example 9)

\(\frac{\partial x^TAx}{\partial A_{ij}}=x_ix_j\), so \(\frac{\partial x^TAx}{\partial A}=xx^T\)
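
The gradient with respect to \(x\) can also be recovered entry by entry without the lecture result. This is a sketch in index notation, and the indices \(i,j,k\) are introduced here only for illustration:

\(x^TAx=\sum_{i=1}^m\sum_{j=1}^m A_{ij}x_ix_j\)

\(\frac{\partial x^TAx}{\partial x_k}=\sum_{j=1}^m A_{kj}x_j+\sum_{i=1}^m A_{ik}x_i=(Ax)_k+(A^Tx)_k\)

Stacking the entries over \(k\) gives \((A+A^T)x\), which reduces to \(2Ax\) when \(A\) is symmetric.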

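Exercise 3

Find \(\frac{\partial |X^k|}{\partial X}\), where \(X\in R^{m\times m}\) is an invertible matrix.

Since \(|X^k|=|X|^k\), the chain rule gives

\(\frac{\partial |X^k|}{\partial X}=\frac{\partial |X^k|}{\partial |X|}\frac{\partial |X|}{\partial X}=k|X|^{k-1}(X^*)^T=k|X|^{k-1}|X|X^{-T}=k|X|^kX^{-T}\)

where \(X^*\) is the adjugate of \(X\) (the determinant derivative uses lec15, Theorem 4).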
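
A numerical spot check of this formula is sketched below. It relies on the fact that a gradient \(G\) predicts the directional derivative along a perturbation \(D\) as \(Tr(G^TD)\). The example matrix, the choice \(k=3\), and the names `det_of_power` and `D` are assumptions for illustration.

```python
import numpy as np

k = 3
# Example matrix with determinant 13, so it is invertible
X = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 2.0]])
# Arbitrary perturbation direction
D = np.random.default_rng(1).standard_normal(X.shape)
eps = 1e-6

def det_of_power(M):
    """Determinant of the k-th matrix power, |M^k|."""
    return np.linalg.det(np.linalg.matrix_power(M, k))

# Closed form from the derivation: k |X|^k X^{-T}
grad = k * np.linalg.det(X) ** k * np.linalg.inv(X).T

# df = Tr(grad^T dX), so the derivative along D is the entrywise sum of grad * D
predicted = np.sum(grad * D)
measured = (det_of_power(X + eps * D) - det_of_power(X - eps * D)) / (2 * eps)
rel_err = abs(predicted - measured) / max(1.0, abs(measured))
print(rel_err)  # should be close to zero
```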

Exercise 4

Find \(\frac{\partial Tr(AXBX^TC)}{\partial X}\), where \(A\in R^{m\times n}\), \(X\in R^{n\times k}\), \(B\in R^{k\times k}\), and \(C\in R^{n\times m}\).

\(d(Tr(AXBX^TC))=Tr[d(AXBX^TC)]\)

\(=Tr[Ad(XBX^TC)+dA\cdot(XBX^TC)]\)

\(=Tr[AXd(BX^TC)+AdX\cdot(BX^TC)+\text{other terms independent of }dX]\)

\(=Tr[AXBd(X^TC)+AXdB\cdot(X^TC)+BX^TCAdX+\text{other terms independent of }dX]\)

\(=Tr[AXBX^TdC+AXBdX^T\cdot C+BX^TCAdX+\text{other terms independent of }dX]\)

\(=Tr\{(dX)^T[(CAXB)^T]^T+BX^TCAdX+\text{other terms independent of }dX\}\)

\(=Tr\{[(CAXB)^TdX]^T+BX^TCAdX\}+\text{other terms independent of }dX\)

\(=Tr\{[(CAXB)^T+BX^TCA]dX\}+\text{other terms independent of }dX\)

\(\frac{\partial Tr(AXBX^TC)}{\partial X}=(BX^TCA)^T+CAXB\)
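
The result can be checked with a directional finite difference, using \(df=Tr(G^TdX)\) for the gradient \(G\). The sketch below uses random matrices of illustrative sizes. The names `trace_form` and `D` are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 2, 3, 4                      # illustrative sizes
A = rng.standard_normal((m, n))
X = rng.standard_normal((n, k))
B = rng.standard_normal((k, k))
C = rng.standard_normal((n, m))
D = rng.standard_normal((n, k))        # arbitrary perturbation of X
eps = 1e-6

def trace_form(M):
    """Tr(A M B M^T C) as a function of M."""
    return np.trace(A @ M @ B @ M.T @ C)

# Closed form from the derivation: (B X^T C A)^T + C A X B
grad = (B @ X.T @ C @ A).T + C @ A @ X @ B

predicted = np.sum(grad * D)           # Tr(grad^T D)
measured = (trace_form(X + eps * D) - trace_form(X - eps * D)) / (2 * eps)
rel_err = abs(predicted - measured) / max(1.0, abs(measured))
print(rel_err)  # should be close to zero
```

Note that \((BX^TCA)^T=A^TC^TXB^T\), so the gradient has the same \(n\times k\) shape as \(X\).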

Exercise 5

Use the trace-differential method to find \(\frac{\partial Tr(W^{-1})}{\partial W}\), where \(W\in R^{m\times m}\) is invertible.

Since

\(0=dI=d(WW^{-1})=dWW^{-1}+WdW^{-1}\)

\(WdW^{-1}=-dWW^{-1}\)

\(dW^{-1}=-W^{-1}dWW^{-1}\)

it follows that

\(dTr(W^{-1})=Tr(dW^{-1})\)

\(=Tr(-W^{-1}dWW^{-1})\)

\(=Tr(-(W^{-1})^2dW)\)

\(\frac{\partial Tr(W^{-1})}{\partial W}=-(W^{-T})^2\)
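
As a rough numerical check, the sketch below perturbs a non-symmetric example matrix so that a missing transpose would show up as a mismatch. The matrix `W`, the direction `D`, and the function name `trace_inv` are assumptions for illustration.

```python
import numpy as np

# Strictly diagonally dominant, hence invertible; non-symmetric on purpose
W = np.array([[3.0, 1.0, 0.0],
              [0.0, 4.0, 1.0],
              [1.0, 0.0, 5.0]])
D = np.random.default_rng(3).standard_normal(W.shape)  # arbitrary perturbation
eps = 1e-6

def trace_inv(M):
    """Tr(M^{-1})."""
    return np.trace(np.linalg.inv(M))

W_inv_T = np.linalg.inv(W).T
grad = -(W_inv_T @ W_inv_T)            # -(W^{-T})^2

predicted = np.sum(grad * D)           # Tr(grad^T D)
measured = (trace_inv(W + eps * D) - trace_inv(W - eps * D)) / (2 * eps)
abs_err = abs(predicted - measured)
print(abs_err)  # should be close to zero
```

In the scalar case \(m=1\) the formula reduces to the familiar \(\frac{d}{dw}\frac{1}{w}=-\frac{1}{w^2}\).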

Exercise 6

Define \(exp(z)=(exp(z_1),exp(z_2),...,exp(z_n))^T\), that is, \((exp(z))_i=exp(z_i)\), and \(log(z)=(log(z_1),log(z_2),...,log(z_n))^T\), that is, \((log(z))_i=log(z_i)\). The function \(f(z)=\frac{exp(z)}{1^Texp(z)}\) is called the \(softmax\) function. Let \(q=f(z)\) and \(J=-p^Tlog(q)\), where \(p,q,z\in R^n\) and \(1^Tp=1\).

(1) Prove that \(\frac{\partial J}{\partial z}=q-p\).

(2) If \(z=Wx\), where \(W\in R^{n\times m}\) and \(x\in R^m\), does \(\frac{\partial J}{\partial W}=(q-p)x^T\) hold?

(1) Proof:

\(J=-p^Tlog\left(\frac{exp(z)}{1^Texp(z)}\right)\)

\(=-p^Tz+p^Tlog(1^Texp(z))\mathbf{1}\)

\(=-p^Tz+p^T\mathbf{1}log(1^Texp(z))\)

\(=-p^Tz+log(1^Texp(z))\)

\(\frac{\partial J}{\partial z}=-p+\frac{\partial log(1^Texp(z))}{\partial z}\)

\(=-p+\frac{\partial 1^Texp(z)}{\partial z}\frac{1}{1^Texp(z)}\)

\(=-p+\frac{exp(z)}{1^Texp(z)}\)

\(=-p+q\)

(2) Yes, it holds. Proof: by part (1), \(dJ=(q-p)^Tdz\), and \(dz=dW\cdot x\) because \(x\) is held fixed, so

\(dJ=dTr(J)=Tr(dJ)=Tr[(-p+q)^TdWx]=Tr[x(-p+q)^TdW]\)

\(\frac{\partial J}{\partial W}=(-p+q)x^T\)
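
Both parts can be verified numerically. The sketch below checks \(q-p\) and \((q-p)x^T\) against directional finite differences. The sizes, the random data, and the names `softmax`, `loss_z`, `d`, and `D` are assumptions for illustration. Subtracting \(max(z)\) inside `softmax` is a standard stabilization that does not change its value.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by max(z) to avoid overflow; result is unchanged
    return e / e.sum()

def loss_z(z, p):
    """Cross-entropy J = -p^T log(softmax(z))."""
    return -p @ np.log(softmax(z))

rng = np.random.default_rng(4)
n, m = 4, 3                     # illustrative sizes
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)
p = rng.random(n)
p = p / p.sum()                 # enforce 1^T p = 1
z = W @ x
q = softmax(z)
eps = 1e-6

# Part (1): gradient with respect to z is q - p
d = rng.standard_normal(n)      # arbitrary direction in z
pred_z = (q - p) @ d
meas_z = (loss_z(z + eps * d, p) - loss_z(z - eps * d, p)) / (2 * eps)
err_z = abs(pred_z - meas_z)

# Part (2): gradient with respect to W is (q - p) x^T
D = rng.standard_normal((n, m)) # arbitrary direction in W
pred_W = np.sum(np.outer(q - p, x) * D)
meas_W = (loss_z((W + eps * D) @ x, p) - loss_z((W - eps * D) @ x, p)) / (2 * eps)
err_W = abs(pred_W - meas_W)

print(err_z, err_W)             # both should be close to zero
```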

Exercise 7

(1) Given

\(f(t)=sin(log(t^Tt)),t\in R^m\)

find the gradient of \(f\) with respect to \(t\).

(2) Given

\(g(X)=Tr(AXB),A\in R^{m\times n},X\in R^{n\times p},B\in R^{p\times m}\)

find the gradient of \(g\) with respect to \(X\).

(1) Define:

\(g(t)=t^Tt=\sum_{i=1}^m t_i^2\)

\(h(t)=log(g(t))=log(t^Tt)\)

\(f(t)=sin(h(t))\)

Apply the chain rule:

  1. Compute \(\nabla_tg(t)\)

\(\nabla_tg=2t\)

  2. Compute \(\nabla_th(t)\)

\(\nabla_th=\frac{1}{g}\nabla g=\frac{2t}{t^Tt}\)

  3. Compute \(\nabla_tf(t)\)

\(\nabla_tf=cos(h)\cdot\nabla h=cos(log(t^Tt))\cdot\frac{2t}{t^Tt}\)
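
The three chain-rule steps above can be sketched and checked as follows. The test point `t` is an arbitrary nonzero vector chosen for illustration, and `grad_f` is a name introduced here.

```python
import numpy as np

def f(t):
    """f(t) = sin(log(t^T t)), defined for t != 0."""
    return np.sin(np.log(t @ t))

def grad_f(t):
    """Chain rule: cos(log(t^T t)) * 2t / (t^T t)."""
    s = t @ t
    return np.cos(np.log(s)) * 2.0 * t / s

t = np.array([1.0, 2.0, -0.5])  # arbitrary nonzero test point
eps = 1e-6

# Central difference along each coordinate direction (rows of the identity)
numeric = np.array([(f(t + eps * e) - f(t - eps * e)) / (2 * eps)
                    for e in np.eye(t.size)])
max_err = np.max(np.abs(numeric - grad_f(t)))
print(max_err)  # should be close to zero
```

The gradient is undefined at \(t=0\), where \(log(t^Tt)\) diverges.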

(2) Given:

\(g(X)=Tr(AXB)\)

where \(A\in R^{m\times n}\), \(X\in R^{n\times p}\), and \(B\in R^{p\times m}\).

Use the cyclic property of the trace:

\(Tr(AXB)=Tr(BAX)\)

For the matrix function \(g(X)=Tr(CX)\), where \(C=BA\), we have:

\(\nabla_XTr(CX)=C^T\)

Therefore:

\(\nabla_Xg(X)=(BA)^T=A^TB^T\)
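
The same result can be read off entry by entry. This is a sketch in index notation, and the indices \(i,j,k\) are introduced here only for illustration:

\(Tr(AXB)=\sum_{i=1}^m\sum_{j=1}^n\sum_{k=1}^p A_{ij}X_{jk}B_{ki}\)

\(\frac{\partial Tr(AXB)}{\partial X_{jk}}=\sum_{i=1}^m B_{ki}A_{ij}=(BA)_{kj}=(A^TB^T)_{jk}\)

The gradient \(A^TB^T\) is \(n\times p\), matching the shape of \(X\), and does not depend on \(X\) because \(g\) is linear in \(X\).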