
Machine Learning Basics: Classification Algorithms (7) - Case Study: Titanic Passenger Survival Prediction


I. The Titanic Data

1. Background
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. The sensational tragedy shocked the international community and led to better safety regulations for ships. One reason the wreck cost so many lives was that there were not enough lifeboats for the passengers and crew. Although some luck was involved in surviving the sinking, some groups were more likely to survive than others, such as women, children, and the upper class. In this case study, the task is to analyze which kinds of people were likely to survive, and in particular to apply machine learning tools to predict which passengers survived the tragedy.

2. Dataset fields
Pclass: passenger class (1, 2, 3), a proxy for socio-economic status
Age: contains missing values
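
As a quick check (a minimal sketch, using the same file path as the code in section IV), pandas can count the missing values per column:

  import pandas as pd

  data_train = pd.read_csv("titanic_泰坦尼克数据集/train.csv")
  # Of these three columns, only Age contains missing entries
  print(data_train[["Pclass", "Age", "Sex"]].isnull().sum())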

II. Workflow

1. Load the data
2. Preprocess the data
  handle missing values
  feature values --> dict type
3. Prepare the feature values and the target values
4. Split the dataset
5. Feature engineering: dictionary feature extraction (see the sketch after this list)
  decision trees do not require standardization
6. Run the decision tree estimator
7. Evaluate the model
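
To make step 5 concrete, here is a minimal DictVectorizer sketch on a hypothetical two-row sample: string values such as Sex are one-hot encoded, while numeric values pass through unchanged.

  from sklearn.feature_extraction import DictVectorizer

  sample = [{"Pclass": 3, "Age": 22.0, "Sex": "male"},
            {"Pclass": 1, "Age": 38.0, "Sex": "female"}]
  transfer = DictVectorizer(sparse=False)
  print(transfer.fit_transform(sample))
  # [[22.  3.  0.  1.]
  #  [38.  1.  1.  0.]]
  # (on scikit-learn < 1.0, use get_feature_names() instead)
  print(transfer.get_feature_names_out())  # ['Age' 'Pclass' 'Sex=female' 'Sex=male']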

III. The Data Files

1. Three files are provided: gender_submission.csv, test.csv, and train.csv. Their contents look like this:
train.csv:

  PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
  1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
  2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
  3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
  4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
  5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
  6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
  7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
  8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
  9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
  ......

test.csv:

  PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
  892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
  893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
  894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
  895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S
  896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S
  897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S
  898,3,"Connolly, Miss. Kate",female,30,0,0,330972,7.6292,,Q
  899,2,"Caldwell, Mr. Albert Francis",male,26,1,1,248738,29,,S
  900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18,0,0,2657,7.2292,,C
  ......

gender_submission.csv:

  PassengerId,Survived
  892,0
  893,1
  894,0
  895,0
  896,1
  897,0
  898,1
  899,0
  900,1
  ......

2. Field descriptions
PassengerId: passenger ID
Survived: survival (0 = no, 1 = yes); this is the target value
Pclass: ticket class (1, 2, 3)
Name, Sex, Age: name, sex, and age
SibSp: number of siblings/spouses aboard
Parch: number of parents/children aboard
Ticket: ticket number
Fare: passenger fare
Cabin: cabin number
Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

IV. Code

  import pandas as pd

  # 1. Load the data
  data_train = pd.read_csv("titanic_泰坦尼克数据集/train.csv")
  data_test = pd.read_csv("titanic_泰坦尼克数据集/test.csv")
  data_train.head()

  # Select the feature values and the target value
  x = data_train[["Pclass", "Age", "Sex"]].copy()
  y = data_train["Survived"]
  x.head()
  y.head()

  # 2. Data preprocessing
  # (1) Handle missing values: fill Age with its mean
  #     (assigning the result back avoids pandas' chained-assignment warning)
  x["Age"] = x["Age"].fillna(x["Age"].mean())
  x

  # Convert the DataFrame to a list of dicts
  x = x.to_dict(orient="records")
  x

  # (2) Split the dataset
  # train.csv and test.csv are already provided as separate files,
  # so apply the same preprocessing to test.csv (feature values only)
  m = data_test[["Pclass", "Age", "Sex"]].copy()
  m["Age"] = m["Age"].fillna(m["Age"].mean())
  m = m.to_dict(orient="records")
  m

  # 3. Dictionary feature extraction
  from sklearn.feature_extraction import DictVectorizer
  transfer = DictVectorizer()
  x = transfer.fit_transform(x)
  m = transfer.transform(m)

  # 4. Decision tree estimator
  from sklearn.tree import DecisionTreeClassifier, export_graphviz
  estimator = DecisionTreeClassifier(criterion="entropy", max_depth=8)
  estimator.fit(x, y)

  # 5. Load the test set's target values
  # (note: gender_submission.csv is Kaggle's sample submission, which
  # predicts survival from sex alone; it serves as stand-in labels here)
  data_test_n = pd.read_csv("titanic_泰坦尼克数据集/gender_submission.csv")
  n = data_test_n["Survived"]
  n.head()

  # 6. Model evaluation
  # Method 1: compare the predictions with the true values directly
  y_predict = estimator.predict(m)
  print("y_predict:\n", y_predict)
  print("predictions vs. true values:\n", n == y_predict)

  # Method 2: compute the accuracy
  score = estimator.score(m, n)
  print("accuracy:\n", score)

  # 7. Visualize the decision tree
  # (on scikit-learn >= 1.0, use transfer.get_feature_names_out() instead)
  export_graphviz(estimator, out_file="titanic_tree.dot",
                  feature_names=transfer.get_feature_names())
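
As a sanity check on criterion='entropy': the root node of the exported tree below reports entropy = 0.961 with value = [549, 342], and a quick computation (sketch) reproduces it:

  import math

  # binary entropy of the class distribution at the root node
  p = 342 / 891
  print(-p * math.log2(p) - (1 - p) * math.log2(1 - p))  # ≈ 0.961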

Running the code generates a titanic_tree.dot file:

  digraph Tree {
  node [shape=box] ;
  0 [label="Sex=male <= 0.5\nentropy = 0.961\nsamples = 891\nvalue = [549, 342]"] ;
  1 [label="Pclass <= 2.5\nentropy = 0.824\nsamples = 314\nvalue = [81, 233]"] ;
  0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
  2 [label="Age <= 2.5\nentropy = 0.299\nsamples = 170\nvalue = [9, 161]"] ;
  1 -> 2 ;
  3 [label="Pclass <= 1.5\nentropy = 1.0\nsamples = 2\nvalue = [1, 1]"] ;
  2 -> 3 ;
  4 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
  3 -> 4 ;
  5 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
  3 -> 5 ;
  6 [label="Age <= 23.5\nentropy = 0.276\nsamples = 168\nvalue = [8, 160]"] ;
  2 -> 6 ;
  7 [label="entropy = 0.0\nsamples = 40\nvalue = [0, 40]"] ;
  6 -> 7 ;
  8 [label="Age <= 27.5\nentropy = 0.337\nsamples = 128\nvalue = [8, 120]"] ;
  6 -> 8 ;
  9 [label="Age <= 24.5\nentropy = 0.722\nsamples = 20\nvalue = [4, 16]"] ;
  8 -> 9 ;
  10 [label="Pclass <= 1.5\nentropy = 0.414\nsamples = 12\nvalue = [1, 11]"] ;
  9 -> 10 ;
  11 [label="entropy = 0.0\nsamples = 5\nvalue = [0, 5]"] ;
  10 -> 11 ;
  12 [label="entropy = 0.592\nsamples = 7\nvalue = [1, 6]"] ;
  10 -> 12 ;
  13 [label="Pclass <= 1.5\nentropy = 0.954\nsamples = 8\nvalue = [3, 5]"] ;
  9 -> 13 ;
  14 [label="Age <= 25.5\nentropy = 1.0\nsamples = 2\nvalue = [1, 1]"] ;
  13 -> 14 ;
  15 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
  14 -> 15 ;
  16 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
  14 -> 16 ;
  17 [label="Age <= 25.5\nentropy = 0.918\nsamples = 6\nvalue = [2, 4]"] ;
  13 -> 17 ;
  18 [label="entropy = 0.0\nsamples = 2\nvalue = [0, 2]"] ;
  17 -> 18 ;
  19 [label="entropy = 1.0\nsamples = 4\nvalue = [2, 2]"] ;
  17 -> 19 ;
  20 [label="Age <= 37.0\nentropy = 0.229\nsamples = 108\nvalue = [4, 104]"] ;
  8 -> 20 ;
  21 [label="entropy = 0.0\nsamples = 56\nvalue = [0, 56]"] ;
  20 -> 21 ;
  22 [label="Pclass <= 1.5\nentropy = 0.391\nsamples = 52\nvalue = [4, 48]"] ;
  20 -> 22 ;
  23 [label="Age <= 49.5\nentropy = 0.187\nsamples = 35\nvalue = [1, 34]"] ;
  22 -> 23 ;
  24 [label="entropy = 0.0\nsamples = 20\nvalue = [0, 20]"] ;
  23 -> 24 ;
  25 [label="entropy = 0.353\nsamples = 15\nvalue = [1, 14]"] ;
  23 -> 25 ;
  26 [label="Age <= 39.0\nentropy = 0.672\nsamples = 17\nvalue = [3, 14]"] ;
  22 -> 26 ;
  27 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
  26 -> 27 ;
  28 [label="entropy = 0.544\nsamples = 16\nvalue = [2, 14]"] ;
  26 -> 28 ;
  29 [label="Age <= 38.5\nentropy = 1.0\nsamples = 144\nvalue = [72, 72]"] ;
  1 -> 29 ;
  30 [label="Age <= 1.5\nentropy = 0.996\nsamples = 132\nvalue = [61, 71]"] ;
  29 -> 30 ;
  31 [label="entropy = 0.0\nsamples = 4\nvalue = [0, 4]"] ;
  30 -> 31 ;
  32 [label="Age <= 3.5\nentropy = 0.998\nsamples = 128\nvalue = [61, 67]"] ;
  30 -> 32 ;
  33 [label="Age <= 2.5\nentropy = 0.722\nsamples = 5\nvalue = [4, 1]"] ;
  32 -> 33 ;
  34 [label="entropy = 0.811\nsamples = 4\nvalue = [3, 1]"] ;
  33 -> 34 ;
  35 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
  33 -> 35 ;
  36 [label="Age <= 5.5\nentropy = 0.996\nsamples = 123\nvalue = [57, 66]"] ;
  32 -> 36 ;
  37 [label="entropy = 0.0\nsamples = 6\nvalue = [0, 6]"] ;
  36 -> 37 ;
  38 [label="Age <= 12.0\nentropy = 1.0\nsamples = 117\nvalue = [57, 60]"] ;
  36 -> 38 ;
  39 [label="entropy = 0.0\nsamples = 8\nvalue = [8, 0]"] ;
  38 -> 39 ;
  40 [label="Age <= 32.5\nentropy = 0.993\nsamples = 109\nvalue = [49, 60]"] ;
  38 -> 40 ;
  41 [label="entropy = 0.996\nsamples = 104\nvalue = [48, 56]"] ;
  40 -> 41 ;
  42 [label="entropy = 0.722\nsamples = 5\nvalue = [1, 4]"] ;
  40 -> 42 ;
  43 [label="Age <= 55.5\nentropy = 0.414\nsamples = 12\nvalue = [11, 1]"] ;
  29 -> 43 ;
  44 [label="entropy = 0.0\nsamples = 11\nvalue = [11, 0]"] ;
  43 -> 44 ;
  45 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
  43 -> 45 ;
  46 [label="Pclass <= 1.5\nentropy = 0.699\nsamples = 577\nvalue = [468, 109]"] ;
  0 -> 46 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
  47 [label="Age <= 17.5\nentropy = 0.95\nsamples = 122\nvalue = [77, 45]"] ;
  46 -> 47 ;
  48 [label="entropy = 0.0\nsamples = 4\nvalue = [0, 4]"] ;
  47 -> 48 ;
  49 [label="Age <= 53.0\nentropy = 0.932\nsamples = 118\nvalue = [77, 41]"] ;
  47 -> 49 ;
  50 [label="Age <= 22.5\nentropy = 0.968\nsamples = 96\nvalue = [58, 38]"] ;
  49 -> 50 ;
  51 [label="entropy = 0.0\nsamples = 5\nvalue = [5, 0]"] ;
  50 -> 51 ;
  52 [label="Age <= 27.5\nentropy = 0.98\nsamples = 91\nvalue = [53, 38]"] ;
  50 -> 52 ;
  53 [label="Age <= 24.5\nentropy = 0.881\nsamples = 10\nvalue = [3, 7]"] ;
  52 -> 53 ;
  54 [label="Age <= 23.5\nentropy = 0.918\nsamples = 3\nvalue = [2, 1]"] ;
  53 -> 54 ;
  55 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
  54 -> 55 ;
  56 [label="entropy = 0.0\nsamples = 2\nvalue = [2, 0]"] ;
  54 -> 56 ;
  57 [label="Age <= 26.5\nentropy = 0.592\nsamples = 7\nvalue = [1, 6]"] ;
  53 -> 57 ;
  58 [label="entropy = 0.0\nsamples = 3\nvalue = [0, 3]"] ;
  57 -> 58 ;
  59 [label="entropy = 0.811\nsamples = 4\nvalue = [1, 3]"] ;
  57 -> 59 ;
  60 [label="Age <= 47.5\nentropy = 0.96\nsamples = 81\nvalue = [50, 31]"] ;
  52 -> 60 ;
  61 [label="Age <= 45.25\nentropy = 0.923\nsamples = 68\nvalue = [45, 23]"] ;
  60 -> 61 ;
  62 [label="entropy = 0.956\nsamples = 61\nvalue = [38, 23]"] ;
  61 -> 62 ;
  63 [label="entropy = 0.0\nsamples = 7\nvalue = [7, 0]"] ;
  61 -> 63 ;
  64 [label="Age <= 48.5\nentropy = 0.961\nsamples = 13\nvalue = [5, 8]"] ;
  60 -> 64 ;
  65 [label="entropy = 0.0\nsamples = 3\nvalue = [0, 3]"] ;
  64 -> 65 ;
  66 [label="entropy = 1.0\nsamples = 10\nvalue = [5, 5]"] ;
  64 -> 66 ;
  67 [label="Age <= 75.5\nentropy = 0.575\nsamples = 22\nvalue = [19, 3]"] ;
  49 -> 67 ;
  68 [label="Age <= 60.5\nentropy = 0.454\nsamples = 21\nvalue = [19, 2]"] ;
  67 -> 68 ;
  69 [label="Age <= 55.5\nentropy = 0.722\nsamples = 10\nvalue = [8, 2]"] ;
  68 -> 69 ;
  70 [label="entropy = 0.0\nsamples = 3\nvalue = [3, 0]"] ;
  69 -> 70 ;
  71 [label="Age <= 59.0\nentropy = 0.863\nsamples = 7\nvalue = [5, 2]"] ;
  69 -> 71 ;
  72 [label="entropy = 0.722\nsamples = 5\nvalue = [4, 1]"] ;
  71 -> 72 ;
  73 [label="entropy = 1.0\nsamples = 2\nvalue = [1, 1]"] ;
  71 -> 73 ;
  74 [label="entropy = 0.0\nsamples = 11\nvalue = [11, 0]"] ;
  68 -> 74 ;
  75 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
  67 -> 75 ;
  76 [label="Age <= 9.5\nentropy = 0.586\nsamples = 455\nvalue = [391, 64]"] ;
  46 -> 76 ;
  77 [label="Pclass <= 2.5\nentropy = 0.987\nsamples = 30\nvalue = [13, 17]"] ;
  76 -> 77 ;
  78 [label="entropy = 0.0\nsamples = 9\nvalue = [0, 9]"] ;
  77 -> 78 ;
  79 [label="Age <= 0.71\nentropy = 0.959\nsamples = 21\nvalue = [13, 8]"] ;
  77 -> 79 ;
  80 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
  79 -> 80 ;
  81 [label="Age <= 2.5\nentropy = 0.934\nsamples = 20\nvalue = [13, 7]"] ;
  79 -> 81 ;
  82 [label="Age <= 1.5\nentropy = 0.65\nsamples = 6\nvalue = [5, 1]"] ;
  81 -> 82 ;
  83 [label="entropy = 0.918\nsamples = 3\nvalue = [2, 1]"] ;
  82 -> 83 ;
  84 [label="entropy = 0.0\nsamples = 3\nvalue = [3, 0]"] ;
  82 -> 84 ;
  85 [label="Age <= 3.5\nentropy = 0.985\nsamples = 14\nvalue = [8, 6]"] ;
  81 -> 85 ;
  86 [label="entropy = 0.0\nsamples = 2\nvalue = [0, 2]"] ;
  85 -> 86 ;
  87 [label="Age <= 8.5\nentropy = 0.918\nsamples = 12\nvalue = [8, 4]"] ;
  85 -> 87 ;
  88 [label="entropy = 0.811\nsamples = 8\nvalue = [6, 2]"] ;
  87 -> 88 ;
  89 [label="entropy = 1.0\nsamples = 4\nvalue = [2, 2]"] ;
  87 -> 89 ;
  90 [label="Age <= 32.25\nentropy = 0.502\nsamples = 425\nvalue = [378, 47]"] ;
  76 -> 90 ;
  91 [label="Age <= 30.75\nentropy = 0.552\nsamples = 320\nvalue = [279, 41]"] ;
  90 -> 91 ;
  92 [label="Pclass <= 2.5\nentropy = 0.501\nsamples = 299\nvalue = [266, 33]"] ;
  91 -> 92 ;
  93 [label="Age <= 29.35\nentropy = 0.318\nsamples = 52\nvalue = [49, 3]"] ;
  92 -> 93 ;
  94 [label="Age <= 20.0\nentropy = 0.176\nsamples = 38\nvalue = [37, 1]"] ;
  93 -> 94 ;
  95 [label="entropy = 0.469\nsamples = 10\nvalue = [9, 1]"] ;
  94 -> 95 ;
  96 [label="entropy = 0.0\nsamples = 28\nvalue = [28, 0]"] ;
  94 -> 96 ;
  97 [label="Age <= 29.85\nentropy = 0.592\nsamples = 14\nvalue = [12, 2]"] ;
  93 -> 97 ;
  98 [label="entropy = 0.764\nsamples = 9\nvalue = [7, 2]"] ;
  97 -> 98 ;
  99 [label="entropy = 0.0\nsamples = 5\nvalue = [5, 0]"] ;
  97 -> 99 ;
  100 [label="Age <= 29.35\nentropy = 0.534\nsamples = 247\nvalue = [217, 30]"] ;
  92 -> 100 ;
  101 [label="Age <= 24.75\nentropy = 0.581\nsamples = 144\nvalue = [124, 20]"] ;
  100 -> 101 ;
  102 [label="entropy = 0.479\nsamples = 97\nvalue = [87, 10]"] ;
  101 -> 102 ;
  103 [label="entropy = 0.747\nsamples = 47\nvalue = [37, 10]"] ;
  101 -> 103 ;
  104 [label="Age <= 30.25\nentropy = 0.46\nsamples = 103\nvalue = [93, 10]"] ;
  100 -> 104 ;
  105 [label="entropy = 0.463\nsamples = 102\nvalue = [92, 10]"] ;
  104 -> 105 ;
  106 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
  104 -> 106 ;
  107 [label="Age <= 31.5\nentropy = 0.959\nsamples = 21\nvalue = [13, 8]"] ;
  91 -> 107 ;
  108 [label="Pclass <= 2.5\nentropy = 0.863\nsamples = 7\nvalue = [5, 2]"] ;
  107 -> 108 ;
  109 [label="entropy = 0.811\nsamples = 4\nvalue = [3, 1]"] ;
  108 -> 109 ;
  110 [label="entropy = 0.918\nsamples = 3\nvalue = [2, 1]"] ;
  108 -> 110 ;
  111 [label="Pclass <= 2.5\nentropy = 0.985\nsamples = 14\nvalue = [8, 6]"] ;
  107 -> 111 ;
  112 [label="entropy = 0.918\nsamples = 3\nvalue = [2, 1]"] ;
  111 -> 112 ;
  113 [label="entropy = 0.994\nsamples = 11\nvalue = [6, 5]"] ;
  111 -> 113 ;
  114 [label="Age <= 38.5\nentropy = 0.316\nsamples = 105\nvalue = [99, 6]"] ;
  90 -> 114 ;
  115 [label="Pclass <= 2.5\nentropy = 0.162\nsamples = 42\nvalue = [41, 1]"] ;
  114 -> 115 ;
  116 [label="Age <= 34.5\nentropy = 0.337\nsamples = 16\nvalue = [15, 1]"] ;
  115 -> 116 ;
  117 [label="Age <= 33.5\nentropy = 0.544\nsamples = 8\nvalue = [7, 1]"] ;
  116 -> 117 ;
  118 [label="entropy = 0.0\nsamples = 2\nvalue = [2, 0]"] ;
  117 -> 118 ;
  119 [label="entropy = 0.65\nsamples = 6\nvalue = [5, 1]"] ;
  117 -> 119 ;
  120 [label="entropy = 0.0\nsamples = 8\nvalue = [8, 0]"] ;
  116 -> 120 ;
  121 [label="entropy = 0.0\nsamples = 26\nvalue = [26, 0]"] ;
  115 -> 121 ;
  122 [label="Age <= 45.25\nentropy = 0.4\nsamples = 63\nvalue = [58, 5]"] ;
  114 -> 122 ;
  123 [label="Age <= 44.5\nentropy = 0.544\nsamples = 32\nvalue = [28, 4]"] ;
  122 -> 123 ;
  124 [label="Age <= 43.5\nentropy = 0.469\nsamples = 30\nvalue = [27, 3]"] ;
  123 -> 124 ;
  125 [label="entropy = 0.402\nsamples = 25\nvalue = [23, 2]"] ;
  124 -> 125 ;
  126 [label="entropy = 0.722\nsamples = 5\nvalue = [4, 1]"] ;
  124 -> 126 ;
  127 [label="entropy = 1.0\nsamples = 2\nvalue = [1, 1]"] ;
  123 -> 127 ;
  128 [label="Age <= 61.5\nentropy = 0.206\nsamples = 31\nvalue = [30, 1]"] ;
  122 -> 128 ;
  129 [label="entropy = 0.0\nsamples = 25\nvalue = [25, 0]"] ;
  128 -> 129 ;
  130 [label="Age <= 63.5\nentropy = 0.65\nsamples = 6\nvalue = [5, 1]"] ;
  128 -> 130 ;
  131 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
  130 -> 131 ;
  132 [label="entropy = 0.0\nsamples = 5\nvalue = [5, 0]"] ;
  130 -> 132 ;
  }

The .dot file can be converted to an image with Graphviz, e.g. dot -Tpng titanic_tree.dot -o titanic_tree.png (assuming Graphviz is installed). The resulting image is very large. You can cap the maximum depth (e.g. max_depth=8) and tune that hyperparameter with grid search, as sketched below.
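
A minimal grid-search sketch, assuming x and y are the vectorized training features and labels prepared in the code above:

  from sklearn.model_selection import GridSearchCV
  from sklearn.tree import DecisionTreeClassifier

  # try several candidate depths with 5-fold cross-validation
  param_grid = {"max_depth": [4, 6, 8, 10, 12]}
  grid = GridSearchCV(DecisionTreeClassifier(criterion="entropy"), param_grid, cv=5)
  grid.fit(x, y)
  print("best params:", grid.best_params_)
  print("best CV score:", grid.best_score_)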

V. Decision Tree Summary

1. Advantages:
Simple to understand and interpret, and the tree can be visualized.

2. Disadvantages:
Decision tree learners can build overly complex trees that do not generalize well; this is called overfitting.

3. Improvements:
Pruning, via the CART algorithm (already implemented in the decision tree API; see the notes on random forest parameter tuning)
Random forests (a minimal sketch follows this list)
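
A minimal random-forest sketch, assuming x, y, m, n are the variables prepared in the code above:

  from sklearn.ensemble import RandomForestClassifier

  # replace the single decision tree with an ensemble of trees
  rf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=42)
  rf.fit(x, y)
  print("random forest accuracy:", rf.score(m, n))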

Note: thanks to their strong interpretability, decision trees are widely used in important business decision-making, and they can also help with feature selection.

VI. Appendix: the Difference Between fit_transform and transform

1. The difference
fit() computes the properties intrinsic to the training set X, such as its mean, variance, maximum, and minimum.
transform() uses the statistics obtained by fit to perform operations such as standardization, dimensionality reduction, or normalization.
fit_transform() combines the two steps.

2. Why the training set uses fit_transform while the test set uses transform
fit_transform has already computed the training set's intrinsic statistics, so the test set can be standardized directly with those same statistics, with no need to recompute them. A minimal sketch, assuming std_x is a StandardScaler:
  from sklearn.preprocessing import StandardScaler
  std_x = StandardScaler()
  x_train = std_x.fit_transform(x_train)  # fit on the training set, then scale it
  x_test = std_x.transform(x_test)        # reuse the training set's statistics
 
