赞
踩
决策树(Decision Tree)是在已知各种情况发生概率的基础上,通过构成决策树来求取净现值的期望值大于等于零的概率,评价项目风险,判断其可行性的决策分析方法,是直观运用概率分析的一种图解法。由于这种决策分支画成图形很像一棵树的枝干,故称决策树。在机器学习中,决策树是一个预测模型,他代表的是对象属性与对象值之间的一种映射关系。Entropy = 系统的凌乱程度,使用算法ID3, C4.5和C5.0生成树算法使用熵。
1. 信息增益 在 ID3 决策树中使用
”信息熵“是度量样本集合纯度最常用的指标,假设当前样本集合D中第k类样本所占比例为p_k(k=1,2,3,…,|y|),则D 的信息熵定义为:
E
n
t
(
D
)
=
−
∑
k
=
1
∣
y
∣
∗
p
k
l
o
g
2
p
k
Ent(D)=-\sum_{k=1}^{|y|}*p_klog_2p_k
Ent(D)=−k=1∑∣y∣∗pklog2pk . Ent(D)的值越小,则D的纯度越高。
信息增益:
G
a
i
n
(
X
,
A
)
=
E
n
t
(
X
)
−
E
n
t
(
X
∣
A
)
Gain(X,A) = Ent(X) - Ent(X|A)
Gain(X,A)=Ent(X)−Ent(X∣A)
E
n
t
(
X
∣
A
)
=
∑
a
∈
A
p
(
a
)
E
n
t
(
X
∣
A
=
a
)
Ent(X∣A)= \sum_{a∈A} p(a)Ent(X∣A=a)
Ent(X∣A)=a∈A∑p(a)Ent(X∣A=a)
import pandas as pd import math df = pd.read_excel(r'c:/Users/Lenovo/Desktop/test3.xlsx') """ 计算不同种类中不同类别的信息熵 """ every_project_type_dict = {} for i in df.columns[0:6]: df_sel = df[[i, '好瓜']] # 同一类别添加在一个字典中 every_dicr = {} # 遍历每个类别,分别计算其占比情况,计算对应的熵值 for j in set(df_sel[i]): y = df_sel[(df_sel[i] == j) & (df_sel['好瓜'] == '是')].shape[0] x = df_sel[(df_sel[i] == j) & (df_sel['好瓜'] == '否')].shape[0] rate_y = y / (x + y) rate_x = x / (x + y) """ 熵值公式:sum(-p_i*log2*(p_i)) """ if rate_x == 0.0: ent = -rate_y * math.log2(rate_y) elif rate_y == 0.0: ent = -rate_x * math.log2(rate_x) elif rate_x != 0.0 and rate_y != 0.0: ent = -rate_y * math.log2(rate_y) - rate_x * math.log2(rate_x) print('大类:{},类别:{},熵值为:{}'.format(i, j, str(ent))) every_dicr[j] = ent every_project_type_dict[i] = every_dicr ## 计算每个类别不同种类的占比 type_rate_dict = {} for i in df.columns[0:6]: ssd_dict = {} df_sel = df[[i]] for j in set(df_sel[i]): cnt = df_sel[df_sel[i] == j].count() ssd_dict[j] = cnt[0] / df_sel.shape[0] type_rate_dict[i] = ssd_dict print(type_rate_dict) # 好坏瓜的信息熵 y = df[df['好瓜'] == '是'].shape[0] x = df[df['好瓜'] == '否'].shape[0] rate_y = y / (x + y) rate_x = x / (x + y) b = -rate_y * math.log2(rate_y) - rate_x * math.log2(rate_x) # 遍历每个大类 ent_num = {} for s_keys in every_project_type_dict.keys(): ent_sum = 0 # 遍历每个小类别 for ss_keys in every_project_type_dict[s_keys]: ss = every_project_type_dict[s_keys][ss_keys] * type_rate_dict[s_keys][ss_keys] ent_sum += ss ent = b - ent_sum ent_num[s_keys] = ent print(ent_num) {'色泽': 0.10812516526536531, '根蒂': 0.14267495956679277, '敲声': 0.14078143361499584, '纹理': 0.3805918973682686, '脐部': 0.28915878284167895, '触感': 0.006046489176565584}
2. 增益率 在 C4.5 决策树中使用
G
a
i
n
r
a
t
i
o
(
D
,
a
)
=
G
a
i
n
(
D
,
a
)
I
V
(
a
)
Gain_ratio(D,a)=\frac{Gain(D,a)}{IV(a)}
Gainratio(D,a)=IV(a)Gain(D,a)
其中,
I
V
(
a
)
=
−
∑
v
=
1
v
∣
D
v
∣
∣
D
∣
l
o
g
2
∣
D
v
∣
∣
D
∣
IV(a)=-\sum_{v=1}^{v}\frac{|D^v|}{|D|}log_2\frac{|D^v|}{|D|}
IV(a)=−v=1∑v∣D∣∣Dv∣log2∣D∣∣Dv∣
其中,a的取值数目越多(即V越大),则IV(a)的值通常会越大。
### IV 值 IV_dict = {} for key in type_rate_dict.keys(): ent_ = 0 for valus in type_rate_dict[key].values(): ent_sum = -valus * math.log2(valus) ent_ += ent_sum IV_dict[key] = ent_ print(IV_dict) # 信息熵增益率 ent_num_rate = {} for s_keys in ent_num.keys(): ss = ent_num[s_keys] / IV_dict[s_keys] ent_num_rate[s_keys] = ss print(ent_num_rate) {'色泽': 0.06843956584615815, '根蒂': 0.10175939805373684, '敲声': 0.10562670944314426, '纹理': 0.2630853587192754, '脐部': 0.1867268991844879, '触感': 0.0069183298534003}
3. 基尼系数在 CART 决策树中使用
Gini越小,数据集D的纯度越高。表示方法如下:
G
i
n
i
(
D
)
=
1
−
∑
k
=
1
∣
y
∣
p
k
2
Gini(D)=1-\sum_{k=1}^{|y|}p_k^2
Gini(D)=1−k=1∑∣y∣pk2
属性a的基尼指数定义为:
G
i
n
i
i
n
d
e
x
(
D
,
a
)
=
∑
v
=
1
v
G
i
n
i
(
D
v
)
Gini_index(D,a)=\sum_{v=1}^{v}Gini(D^v)
Giniindex(D,a)=v=1∑vGini(Dv)
## 每一项目中 不同类别的基尼系数 every_project_type_dict = {} for i in df.columns[0:6]: df_sel = df[[i, '好瓜']] every_type_dict = {} for j in set(df_sel[i]): y = df_sel[(df_sel[i] == j) & (df_sel['好瓜'] == '是')].shape[0] x = df_sel[(df_sel[i] == j) & (df_sel['好瓜'] == '否')].shape[0] rate_y = y / (x + y) rate_x = x / (x + y) gini = 1 - rate_x ** 2 - rate_y ** 2 print('项目:{},类别:{},基尼系数:{}'.format(i, j, str(gini))) every_type_dict[j] = gini every_project_type_dict[i] = every_type_dict every_project_type_dict ### 整体类别的基尼系数 every_pro_num = {} for keys in every_project_type_dict.keys(): gini = 0 for ss_keys in every_project_type_dict[keys].keys(): ss = every_project_type_dict[keys][ss_keys] * type_rate_dict[keys][ss_keys] gini += ss every_pro_num[keys] = gini print(every_pro_num) {'色泽': 0.42745098039215684, '根蒂': 0.42226890756302526, '敲声': 0.4235294117647059, '纹理': 0.2771241830065359, '脐部': 0.3445378151260504, '触感': 0.49411764705882355}
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。