[ML] 機器學習技法：第三講 Kernel SVM

ML：基礎技法學習
Package：scikit-learn
課程：機器學習技法
簡介：第三講 Kernel SVM (support vector machine)

dual SVM model

$$ \begin{align*} \underset{\boldsymbol\alpha }{min}\ \ \ &\frac{1}{2}\boldsymbol\alpha ^T\mathbf{Q_D}\boldsymbol\alpha -(\mathbf{1}_{N \times 1})^T\boldsymbol\alpha \\ subject\ to\ \ \ &\mathbf{y}^T\boldsymbol\alpha =0\\ &\alpha _n \geq0,for\ n=1,2,\cdots ,N \end{align*} $$ $q_{n,m}=y_ny_m\mathbf{z}_n^T\mathbf{z}_m$ inner product in $\mathbb{R}^{\tilde{d}}$
$\mathbf{z}_n^T\mathbf{z}_m=\Phi (\mathbf{x}_n)^T\Phi (\mathbf{x}_m)$

如何簡化 $\mathbf{z}_n^T\mathbf{z}_m=\Phi (\mathbf{x}_n)^T\Phi (\mathbf{x}_m)$ ？將 $\Phi (\mathbf{x}_n)^T\Phi (\mathbf{x}_m)$ 簡化為 $K(\mathbf{x}_n,\mathbf{x}_m)$

Kernel Hard-Margin SVM Algorithm

General Polynomial Kernel

$$ K_Q(\mathbf{x},\mathbf{{x}'})=(\zeta +\gamma \mathbf{x}^T\mathbf{{x}'})^Q\ \ \ with\ \gamma>0,\zeta\geq 0 $$

先從二次轉換來看
假設 $\Phi _2(\mathbf{x})=(1,x_1,x_2,\cdots ,x_d,x_1^2,x_1x_2,\cdots ,x_1x_d,x_2x_1,x_2^2,\cdots ,x_2x_d,\cdots ,x_d^2)$
為求化簡，故同時包含 $x_1x_2$ & $x_2x_1$，即使兩者是完全一樣的
展開轉換與內積可得
$$ \begin{align*} \Phi _2(\mathbf{x})^T\Phi _2(\mathbf{{x}'}) &= 1+\sum_{d}^{i=1}x_i{x_i}'+\sum_{i=1}^{d}\sum_{j=1}^{d}x_ix_j{x_i}'{x_j}'\\ &= 1+\sum_{d}^{i=1}x_i{x_i}'+\sum_{i=1}^{d}x_i{x_i}'\sum_{j=1}^{d}x_j{x_j}'\\ &= 1+\mathbf{x}^T\mathbf{{x}'}+(\mathbf{x}^T\mathbf{{x}'})(\mathbf{x}^T\mathbf{{x}'}) \end{align*} $$ 可看到原本若轉換後再內積需要花 $O(\tilde{d})=O(d^2)$ 的時間，但展開後只需花原本維度的 $O(d)$

定義轉換 $\Phi$ $\Leftrightarrow$ kernel function: $K_\Phi (\mathbf{x},\mathbf{{x}'})\equiv \Phi (\mathbf{x})^T\Phi (\mathbf{{x}'})$
故 $\Phi _2\Leftrightarrow K_{\Phi_2} (\mathbf{x},\mathbf{{x}'})=1+(\mathbf{x}^T\mathbf{{x}'})+(\mathbf{x}^T\mathbf{{x}'})^2$
利用此特性，可將各個參數皆轉換與 $\phi$ 有關，如下

$q_{n,m}=y_ny_m\mathbf{z}_n^T\mathbf{z}_m=y_ny_mK(\mathbf{x}_n,\mathbf{x}_m)$
從 $SV(\mathbf{x}_s,y_s)$ 取出一個
$$ \begin{align*} b&=y_s-\mathbf{w}^T\mathbf{z}_s\\ &=y_s-(\sum_{n=1}^{N}\alpha _ny_n\mathbf{z}_n)^T\mathbf{z}_s\\ &=y_s-\sum_{n=1}^{N}\alpha _ny_n\mathbf{z}_n^T\mathbf{z}_s\\ &=y_s-\sum_{n=1}^{N}\alpha _ny_nK(\mathbf{x}_n,\mathbf{x}_s)\\ \end{align*} $$
最佳方程式可得

$$ \begin{align*} g_{SVM}(\mathbf{x})&=sign(\mathbf{w}^T\Phi (\mathbf{x})+b)\\ &=sign\left ((\sum_{n=1}^{N}\alpha _ny_n\mathbf{z}_n)^T\Phi (\mathbf{x})+b \right )\\ &=sign(\sum_{n=1}^{N}\alpha _ny_n\mathbf{z}_n^T\Phi (\mathbf{x})+b)\\ &=sign(\sum_{n=1}^{N}\alpha _ny_n\Phi (\mathbf{x}_n)^T\Phi (\mathbf{x})+b)\\ &=sign(\sum_{n=1}^{N}\alpha _ny_nK(\mathbf{x}_n,\mathbf{x})+b)\\ \end{align*} $$

故計算效率已與 $\tilde{d}$ 無關

那如何再小修一下轉換 $\phi$，加入 $\sqrt {2\gamma}$ 跟 $\gamma$，$\gamma > 0$ 不然開根號會為虛數
$\Phi _2(\mathbf{x})=(1,\sqrt {2\gamma} x_1,\cdots ,\sqrt {2\gamma}x_d,\gamma x_1^2,\gamma x_1x_2,\cdots , ,\gamma x_d^2)\Leftrightarrow K_2(\mathbf{x},{\mathbf{x}}')=1+2\gamma \mathbf{x}^T{\mathbf{x}}'+(\mathbf{x}^T{\mathbf{x}}')^2$

那理論上 1 也可以替換為變數 $\zeta$，$\zeta \geq 0$ 不然開根號會為虛數
$\zeta$ 可為 0 是因為它不影響轉換，但 $\gamma$ 為 0 會只剩下常數項，跟原來的目的完全違背
故可得到一簡單的算式
$$ \begin{align*} \Phi _2(\mathbf{x})&=(\zeta ,\sqrt {2\zeta\gamma} x_1,\cdots ,\sqrt {2\zeta\gamma}x_d,\gamma x_1^2,\gamma x_1x_2,\cdots , ,\gamma x_d^2)\\ \Rightarrow K_2(\mathbf{x},{\mathbf{x}}')&=\zeta ^2+2\zeta\gamma \mathbf{x}^T{\mathbf{x}}'+(\gamma \mathbf{x}^T{\mathbf{x}}')^2\\ &= (\zeta+\gamma \mathbf{x}^T{\mathbf{x}}')^2 \end{align*} $$ 此結論可推廣至高次項，也只是略調整係數，故
$$ K_Q(\mathbf{x},\mathbf{{x}'})=(\zeta +\gamma \mathbf{x}^T\mathbf{{x}'})^Q\ \ \ with\ \gamma>0,\zeta\geq 0 $$

因使用 Polynomial Kernel，故稱作 Polynomial SVM
事實上調整 kernel 就等同在調整 margin definition
因 kernel 隱含映射的函數，不同的映射函數，即使對應到同樣的 $\mathbb{R}^2$，$\mathbf{x}_n \Rightarrow \mathbf{z}_n$ 的位置也都不一樣
如下圖，SV 的點完全不同

因有最大化 margin 的限制，故即使是十次方，表示仍可接受，甚至 margin 為 0.1

Linear Kernel 就是最原本的線性轉換，$K_1(\mathbf{x},\mathbf{{x}'})=(0 +1\cdot \mathbf{x}^T\mathbf{{x}'})^1$
記住，若 Linear Kernel 已足夠好，那就無需往下做，越簡單越準確

Inﬁnite Dimensional Transform (Gaussian SVM)

$$ K(\mathbf{x},{\mathbf{x}}')=e^{-\gamma \left \| \mathbf{x}-{\mathbf{x}}' \right \|^2}\ \ \ with\ \gamma >0\\ \Phi (\mathbf{x})=e^{-\gamma \mathbf{x}^T\mathbf{x}} \left (1, \sqrt{\frac{2^1\gamma}{1!}}(x_1, x_2,\cdots ,x_d),\\ \sqrt{\frac{2^2\gamma}{2!}}(x_1^2, x_1x_2,\cdots ,x_d^2),\\ \sqrt{\frac{2^3\gamma}{3!}}(x_1^3, x_1^2x_2,\cdots ,x_d^3),\\ \sqrt{\frac{2^i\gamma}{i!}}(x_1^i, x_1^{i-1}x_2,\cdots ,x_d^i),\cdots \right ) $$

$$ \begin{align*} K(\mathbf{x},\mathbf{{x}'}) &= e^{-\gamma \left \| \mathbf{x}-{\mathbf{x}}' \right \|^2}\\ &= e^{-\gamma (\mathbf{x}^T\mathbf{x}-2\cdot \mathbf{x}^T{\mathbf{x}}'+{\mathbf{x}}'^T{\mathbf{x}}')} \\ &= e^{-\gamma \mathbf{x}^T\mathbf{x}}\cdot e^{-\gamma{\mathbf{x}}'^T{\mathbf{x}}'} \cdot e^{\gamma 2\cdot \mathbf{x}^T{\mathbf{x}}'}\\ by\ Taylor&= e^{-\gamma \mathbf{x}^T\mathbf{x}}\cdot e^{-\gamma{\mathbf{x}}'^T{\mathbf{x}}'} \cdot \sum_{i=0}^{\infty }\frac{(2\gamma \mathbf{x}^T{\mathbf{x}}')^i}{i!}\\ &= \sum_{i=0}^{\infty }e^{-\gamma \mathbf{x}^T\mathbf{x}}\cdot e^{-\gamma{\mathbf{x}}'^T{\mathbf{x}}'} \cdot \frac{(2\gamma \mathbf{x}^T{\mathbf{x}}')^i}{i!}\\ &= \sum_{i=0}^{\infty }e^{-\gamma \mathbf{x}^T\mathbf{x}}\cdot e^{-\gamma{\mathbf{x}}'^T{\mathbf{x}}'} \cdot \sqrt{\frac{2^i\gamma}{i!}} \sqrt{\frac{2^i\gamma}{i!}}(\mathbf{x}^T{\mathbf{x}}')^i\\ &= \sum_{i=0}^{\infty }\sqrt{\frac{2^i\gamma}{i!}}e^{-\gamma \mathbf{x}^T\mathbf{x}}\cdot \sqrt{\frac{2^i\gamma}{i!}}e^{-\gamma{\mathbf{x}}'^T{\mathbf{x}}'} \cdot (\sum_{n=1}^{d}x_n{x}'_n)^i\\ &= e^{-\gamma \mathbf{x}^T\mathbf{x}} \cdot e^{-\gamma{\mathbf{x}}'^T{\mathbf{x}}'}\sum_{i=0}^{\infty }\sqrt{\frac{2^i\gamma}{i!}}\cdot \sqrt{\frac{2^i\gamma}{i!}} \cdot (\sum_{n=1}^{d}x_n{x}'_n)^i\\ &= e^{-\gamma \mathbf{x}^T\mathbf{x}} \cdot e^{-\gamma{\mathbf{x}}'^T{\mathbf{x}}'} (1 + \sqrt{\frac{2^1\gamma}{1!}} \sqrt{\frac{2^1\gamma}{1!}} \sum_{n_1=1}^{d}x_{n_1}{x}'_{n_1} +\sqrt{\frac{2^2\gamma}{2!}} \sqrt{\frac{2^2\gamma}{2!}} \sum_{n_1=1}^{d}x_{n_1}{x}'_{n_1}\sum_{n_2=1}^{d}x_{n_2}{x}'_{n_2} +\sqrt{\frac{2^3\gamma}{3!}} \sqrt{\frac{2^3\gamma}{3!}} \sum_{n_1=1}^{d}x_{n_1}{x}'_{n_1}\sum_{n_2=1}^{d}x_{n_2}{x}'_{n_2}\sum_{n_3=1}^{d}x_{n_3}{x}'_{n_3}+\cdots )\\ &= e^{-\gamma \mathbf{x}^T\mathbf{x}} \cdot e^{-\gamma{\mathbf{x}}'^T{\mathbf{x}}'} (1 + \sqrt{\frac{2^1\gamma}{1!}} \sqrt{\frac{2^1\gamma}{1!}} \sum_{n_1=1}^{d}x_{n_1}{x}'_{n_1} +\sqrt{\frac{2^2\gamma}{2!}} \sqrt{\frac{2^2\gamma}{2!}} \sum_{n_1=1}^{d}\sum_{n_2=1}^{d}x_{n_1}x_{n_2}{x}'_{n_1}{x}'_{n_2} +\sqrt{\frac{2^3\gamma}{3!}} \sqrt{\frac{2^3\gamma}{3!}} \sum_{n_1=1}^{d}\sum_{n_2=1}^{d}\sum_{n_3=1}^{d}x_{n_1}x_{n_2}x_{n_3}{x}'_{n_1}{x}'_{n_2}{x}'_{n_3}+\cdots )\\ &= e^{-\gamma \mathbf{x}^T\mathbf{x}} \cdot e^{-\gamma{\mathbf{x}}'^T{\mathbf{x}}'} (1 + \sqrt{\frac{2^1\gamma}{1!}} \sqrt{\frac{2^1\gamma}{1!}} (x_1, x_2,\cdots ,x_d)\cdot ({x}'_1, {x}'_2,\cdots ,{x}'_d)\\ &+\sqrt{\frac{2^2\gamma}{2!}} \sqrt{\frac{2^2\gamma}{2!}} (x_1^2, x_1x_2,\cdots ,x_d^2)\cdot ({x}'^2_1, {x}'_1{x}'_2,\cdots ,{x}'^2_d)\\ &+\sqrt{\frac{2^3\gamma}{3!}} \sqrt{\frac{2^3\gamma}{3!}} (x_1^3, x_1^2x_2,\cdots ,x_d^3)\cdot ({x}'^3_1, {x}'^2_1{x}'_2,\cdots ,{x}'^3_d)+\cdots )\\ &= \Phi (\mathbf{x})^T\Phi (\mathbf{{x}'})\\ \\ \Phi (\mathbf{x})&=e^{-\gamma \mathbf{x}^T\mathbf{x}} \left (1, \sqrt{\frac{2^1\gamma}{1!}}(x_1, x_2,\cdots ,x_d),\sqrt{\frac{2^2\gamma}{2!}}(x_1^2, x_1x_2,\cdots ,x_d^2), \sqrt{\frac{2^3\gamma}{3!}}(x_1^3, x_1^2x_2,\cdots ,x_d^3), \sqrt{\frac{2^i\gamma}{i!}}(x_1^i, x_1^{i-1}x_2,\cdots ,x_d^i),\cdots \right ) \end{align*} $$ 可以看到其實就是各個次項的組合，從 0次項直到無限次項皆包含在內

$$ \begin{align*} g_{SVM}(\mathbf{x})&=sign\left ( \sum_{SV}\alpha _ny_nK(\mathbf{x}_n,\mathbf{x})+b \right )\\ &=sign\left ( \sum_{SV}\alpha _ny_ne^{-\gamma \left \| \mathbf{x}-\mathbf{x}_n \right \|^2}+b \right )\\ \end{align*} $$

如上，解其實就是在 SV 上的高斯函數的線性組合
因為此特性也叫做 Radial Basis Function (RBF) kernel
能做無限多維，又有 maximum margin 限制，是否就很美好呢？不
當 $\gamma$ 選太大時，仍會有 overfit 的風險
甚至無限大時就等同於 $ K_{\gamma \to \infty }(\mathbf{x}, \mathbf{{x}'}) = (\mathbf{x}==\mathbf{{x}'})$，完全跟訓練點一樣才會為 1

Kernel 比較

Kernel 代表著兩者之間的相似性，因為是內積
Linear Kernel

$K(\mathbf{x},\mathbf{{x}'})=\mathbf{x}^T\mathbf{{x}'}$
優點

安全
快速，使用特別的 QP solver
可被解釋

缺點

若不是線性可分無法分類

Polynomial Kernel

$K(\mathbf{x},\mathbf{{x}'})=(\zeta +\gamma \mathbf{x}^T\mathbf{{x}'})^Q$
Q 若很低次，有時候回到原本的 primal 解法，用特別的 solver，還比較快
優點

比 linear 限制少一點，不需要線性可分
透過 Q 表示限制轉換空間的維度

缺點

數值範圍太大會影響電腦處理
$|\zeta +\gamma \mathbf{x}^T\mathbf{{x}'}|$
小於 1 時，Q 比較大會近乎 0
大於 1 時，Q 比較大會是很大的值
參數太多

Gaussian Kernel

$K(\mathbf{x},\mathbf{{x}'})=e^{-\gamma \left \| \mathbf{x}-\mathbf{{x}'} \right \|^2}$
優點

最強大
數值範圍比 polynomial 好
參數只有一個

缺點

無法解釋在做什麼
計算慢
容易 overfit

Mercer's condition

可自定 kernel，只要 Kernel Matrix 滿足條件，以下為充要條件

symmetric
let $k_{ij}=K(\mathbf{x}_i,\mathbf{x}_j)$，此 Kernel Matrix 必定是半正定
$\begin{align*}
\mathbf{K}&=\begin{bmatrix}
\Phi (\mathbf{x}_1)^T\Phi (\mathbf{x}_1) & \Phi (\mathbf{x}_1)^T\Phi (\mathbf{x}_2) & \cdots & \Phi (\mathbf{x}_1)^T\Phi (\mathbf{x}_N)\\
\Phi (\mathbf{x}_2)^T\Phi (\mathbf{x}_1) & \Phi (\mathbf{x}_2)^T\Phi (\mathbf{x}_2) & \cdots & \Phi (\mathbf{x}_1)^T\Phi (\mathbf{x}_N)\\
\vdots & \vdots & \ddots & \vdots\\
\Phi (\mathbf{x}_N)^T\Phi (\mathbf{x}_1) & \Phi (\mathbf{x}_N)^T\Phi (\mathbf{x}_2) & \cdots & \Phi (\mathbf{x}_1)^T\Phi (\mathbf{x}_N)
\end{bmatrix}\\
&= \begin{bmatrix}\mathbf{z}_1 & \mathbf{z}_2 & \cdots & \mathbf{z}_N\end{bmatrix}^T\begin{bmatrix}\mathbf{z}_1 & \mathbf{z}_2 & \cdots & \mathbf{z}_N\end{bmatrix}\\
&= \mathbf{ZZ}^T
\end{align*}$

程式碼

參數定義

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# 訓練資料
X = np.c_[(.4, -.7),
          (-1.5, -1),
          (-1.4, -.9),
          (-1.3, -1.2),
          (-1.1, -.2),
          (-1.2, -.4),
          (-.5, 1.2),
          (-1.5, 2.1),
          (1, 1),
          # --
          (1.3, .8),
          (1.2, .5),
          (.2, -2),
          (.5, -2.4),
          (.2, -2.3),
          (0, -2.7),
          (1.3, 2.1)].T
Y = [0] * 8 + [1] * 8

# C 越大表示越無法容忍錯誤，故設定一很大的值 for hard margin
C = 1e10
# fit 表示訓練
# x^Tx'
g_svm_linear = svm.SVC(kernel='linear', C=C).fit(X, Y)
# (5+10x^Tx')^3
g_svm_poly = svm.SVC(kernel='poly', degree=3, coef0=5, gamma=10, C=C).fit(X, Y)
# exp(-0.3|x-x'|^2)
g_svm_rbf = svm.SVC(kernel='rbf', gamma=.3, C=C).fit(X, Y)

# title for the plots
titles = ['linear kernel => $\mathbf{x}^T\mathbf{x}\'$',
          'polynomial kernel => $(5+10\mathbf{x}^T\mathbf{x}\')^3$',
          'Gaussian kernel => $e^{(-0.3|\mathbf{x}-\mathbf{x}\'|^2)}$',]

# 產生 meshgrid，共 200x100 的點
XX1, XX2 = np.mgrid[-2:2:200j, -3:3:100j]

for i, g_svm in enumerate([g_svm_linear, g_svm_poly, g_svm_rbf]):
    plt.subplot(3, 1, i + 1)
    # 調整之間的空白高度
    plt.subplots_adjust(hspace=.6)
    
    # 代進值，得到距離
    # 回傳判斷結果
    YY = g_svm.predict(np.c_[XX1.ravel(), XX2.ravel()])
    # 重新排列結果，符合 XX1
    YY = YY.reshape(XX1.shape)  

    # 畫圖，contour 不填滿顏色，contourf 填滿顏色
    plt.contourf(XX1, XX2, YY, cmap=plt.cm.bone, alpha=0.5)

    # 畫出所有點
    plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
    # 將 support vector 標示出來
    plt.scatter(g_svm.support_vectors_[:, 0], g_svm.support_vectors_[:, 1], color='g', linewidths=3, linestyle='-', s=100, facecolors='none')

    # 得到 SV 的距離
    margin = np.mean(abs(g_svm.decision_function(np.c_[g_svm.support_vectors_[:, 0],g_svm.support_vectors_[:, 1]])))
    # 回傳距離
    YY = g_svm.decision_function(np.c_[XX1.ravel(), XX2.ravel()])
    # 重新排列結果，符合 XX1
    YY = YY.reshape(XX1.shape)  
    # 畫出界線
    plt.contour(XX1, XX2, YY, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'], levels=[-margin, 0, margin])
    # 標題
    plt.title(titles[i])

plt.show()

參考

支持向量机（三）核函数
sklearn.svm.SVC

子風的知識庫

搜尋此網誌