Sklearn Feature Engineering: The Embedded Method

Chen Hua • April 28, 2022 • Artificial Intelligence • 1815 views

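The embedded method lets the algorithm itself decide which features to use: feature selection and model training happen at the same time. Compared with the filter method, its results are tied more closely to how useful each feature is to the model itself, so it is better at improving the performance of a specific model. Because it weighs each feature's contribution to the model and removes the low-contribution features, it is in essence still a form of feature filtering.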

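However, the weight coefficients used by the embedded method have no fixed scale. Features that contribute nothing to the model get a weight of 0, but when many features all contribute, and to different degrees, it is hard to define an effective cutoff. In that case the threshold on the weight coefficients can only be judged from a learning curve, which makes the computation heavy and very time-consuming.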
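To make the role of the weights and the cutoff concrete, here is a minimal sketch of how a random forest's `feature_importances_` interact with a threshold in `SelectFromModel`. It assumes synthetic data from `make_classification` rather than the digit dataset used in this article, and the names `X_demo`, `y_demo`, `forest` and `kept_by_hand` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (an assumption; not the article's dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

# Training the model also produces one importance weight per feature.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
importances = forest.feature_importances_  # non-negative weights that sum to 1

# SelectFromModel keeps the features whose importance is >= threshold.
# The mean importance is used here only as an example cutoff.
threshold = importances.mean()
selector = SelectFromModel(forest, threshold=threshold, prefit=True)
X_selected = selector.transform(X_demo)

# The same count, computed by hand from the importances.
kept_by_hand = int((importances >= threshold).sum())
print(X_selected.shape[1], kept_by_hand)
```

Because the importances are normalized to sum to 1, their scale shrinks as the number of features grows, which is one reason a fixed cutoff is hard to choose in advance.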

Basic usage of the embedded method

import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import cross_val_score

# Load the data: the first column is the label, the rest are features
data = pd.read_csv('./datas/digit_recognizor_simple.csv')

x = data.iloc[:, 1:]
y = data.iloc[:, 0]

RFC_ = RFC(random_state=42)

# Keep only the features whose importance is at least the threshold
x_embedded = SelectFromModel(RFC_, threshold=0.0005).fit_transform(x, y)
print(x_embedded.shape)  # (1000, 351)

# Evaluate the model on the selected features with 10-fold cross-validation
score = cross_val_score(RFC_, x_embedded, y, cv=10).mean()
print(score)  # 0.88
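`fit_transform` returns a bare array, so the column names are lost. The sketch below shows one way to recover which features were kept, using the selector's `get_support()` mask. It assumes synthetic data and hypothetical column names (`pixel0`, `pixel1`, ...), not the article's CSV file, and uses `threshold="median"` purely as an example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data with made-up column names (assumptions).
X_arr, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=1
)
X_df = pd.DataFrame(X_arr, columns=[f"pixel{i}" for i in range(10)])

# Fit the selector; "median" keeps features at or above the median importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=1), threshold="median"
).fit(X_df, y_demo)

# get_support() returns a boolean mask over the original columns.
mask = selector.get_support()
kept_columns = X_df.columns[mask].tolist()

# Index the DataFrame with the mask to keep the column names.
X_kept = X_df.loc[:, mask]
print(kept_columns)
```

Keeping the selected columns as a DataFrame makes it easier to check later which inputs the model actually relies on.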

Tuning the threshold with a learning curve

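import numpy as np
import matplotlib.pyplot as plt

scores = []
# 20 candidate thresholds, from 0 up to the largest feature importance
thresholds = np.linspace(0, RFC_.fit(x, y).feature_importances_.max(), 20)
for ts in thresholds:
    # Select features at this threshold, then score with 10-fold cross-validation
    x_embedded = SelectFromModel(RFC_, threshold=ts).fit_transform(x, y)
    score = cross_val_score(RFC_, x_embedded, y, cv=10).mean()
    scores.append(score)

# Plot mean accuracy against threshold
plt.plot(thresholds, scores)
plt.xticks(thresholds)
plt.show()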

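The learning curve shows the best result at threshold=0.00045, with an accuracy of about 0.92. The number of features retained is still smaller than with variance filtering, and the model performs better than it did before any selection. However, when the algorithm itself is complex, the filter method is far faster to compute than the embedded method, so on large datasets the filter method is still the first choice.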
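The best threshold can also be read off programmatically instead of from the plot. The following is a minimal sketch under stated assumptions: it uses synthetic data rather than the digit dataset, a coarser grid and fewer folds to keep it fast, and `endpoint=False` so that at least one feature always survives the selection. The names `n_kept`, `best` and `best_threshold` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the article's dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=5, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=30, random_state=0)
importances = forest.fit(X_demo, y_demo).feature_importances_

# Candidate thresholds from 0 up to (but excluding) the largest importance.
thresholds = np.linspace(0, importances.max(), 8, endpoint=False)

scores = []
n_kept = []
for ts in thresholds:
    X_sel = SelectFromModel(forest, threshold=ts).fit_transform(X_demo, y_demo)
    n_kept.append(X_sel.shape[1])
    scores.append(cross_val_score(forest, X_sel, y_demo, cv=5).mean())

# Pick the threshold with the highest mean cross-validated accuracy.
best = int(np.argmax(scores))
best_threshold = thresholds[best]
print(best_threshold, n_kept[best], scores[best])
```

Each point on the curve costs one selector fit plus a full round of cross-validation, so the total cost grows with the number of thresholds times the number of folds.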

This article is an original work by Chen Hua. Reposting is welcome, but please credit the source: http://edu.ichenhua.cn/read/272
