
Data Analysis (3): Implementing a Linear Regression Model

1. Overview of Penalized Linear Regression Models

Linear regression requires some modification of ordinary least squares (OLS) for practical use. OLS minimizes error only on the training data, so it tends to overfit rather than generalize to new data.

Penalized linear regression is a family of methods for overcoming the overfitting problem of OLS. Ridge regression is one special case: it avoids overfitting by penalizing the sum of the squared regression coefficients. Other penalized regression algorithms use different forms of penalty term.
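As a quick illustration of the ridge penalty in action (a sketch of mine, not part of the original article, assuming scikit-learn is installed), compare the coefficients that OLS and ridge regression produce on a small, overfit-prone data set; Ridge's alpha parameter scales the penalty on the sum of squared coefficients:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# toy data: 20 rows but 15 attributes -- the regime where OLS tends to overfit
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 15))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=20)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha scales the sum-of-squares penalty

# the ridge coefficients are shrunk toward zero relative to OLS
print(np.abs(ols.coef_).sum(), np.abs(ridge.coef_).sum())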

Several properties make penalized linear regression methods particularly effective:

--Model training is fast.

--They provide information about variable importance.

--Prediction at deployment time is fast.

--Performance is reliable across a wide range of problems, especially for attribute matrices that do not have many more rows (samples) than columns (attributes), or that are very sparse, where a sparse solution is desired (that is, a parsimonious model that uses only a subset of the attributes for its predictions).

--The problem may be a natural fit for a linear model.

Equation 4-6 can be described in words as follows: the vector β* and the constant β₀* are the values that minimize the expected prediction mean squared error, where that error is the average, over all data rows (i = 1, ..., n), of the squared error between yᵢ and the prediction ŷᵢ:

$$\beta^*, \beta_0^* = \operatorname*{arg\,min}_{\beta, \beta_0} \frac{1}{n} \sum_{i=1}^{n} \big( y_i - (x_i \beta + \beta_0) \big)^2$$
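To make the objective concrete, here is a minimal sketch (mine, not from the original article) that solves the unpenalized problem in Equation 4-6 directly with numpy, appending a column of ones so the last coefficient plays the role of β₀:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.5 + rng.normal(scale=0.1, size=100)

# append a column of ones so the last fitted coefficient is beta_0
Xa = np.hstack([X, np.ones((100, 1))])
coef, _, _, _ = np.linalg.lstsq(Xa, y, rcond=None)
beta, beta0 = coef[:-1], coef[-1]
print(beta, beta0)  # close to [2, 0, -1] and 0.5 for this synthetic data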

The ridge penalty is not the only useful penalty for penalized regression. Any measure of the length of the coefficient vector will do, and different length measures change important properties of the solution. Ridge regression uses the Euclidean metric (the sum of the squared β values). Another useful algorithm, called Lasso regression, derives from the taxicab geometry known as the Manhattan distance, or L1 regularization (the sum of the absolute values of the β values). The ElasticNet penalty combines the Lasso penalty with the ridge penalty.
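The three penalty terms are easy to state in code. A small sketch (my notation: alpha is the ElasticNet mixing weight, matching the glmnet convention used in the code later in this article):

import numpy as np

beta = np.array([0.5, -0.2, 0.0, 1.3])

ridge_penalty = np.sum(beta ** 2)     # L2: sum of squared coefficients
lasso_penalty = np.sum(np.abs(beta))  # L1: Manhattan distance from zero
alpha = 0.5                           # 1.0 = pure Lasso, 0.0 = pure ridge
elasticnet_penalty = alpha * lasso_penalty + (1 - alpha) / 2 * ridge_penalty
print(ridge_penalty, lasso_penalty, elasticnet_penalty)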

2. Solving the Penalized Linear Regression Problem

A large number of general-purpose numerical optimization algorithms can solve the optimization problems corresponding to Equations 4-6, 4-8, and 4-11, but the importance of penalized linear regression has driven researchers to develop dedicated algorithms that generate solutions very quickly. This article introduces these algorithms and runs the corresponding code, focusing on two of them: least angle regression (LARS) and Glmnet.

The LARS algorithm can be understood as a refined form of forward stagewise regression.

LARS is introduced here because it is very close to both the Lasso and forward stagewise regression; it is easy to understand and relatively compact to implement. By studying the LARS code, you will understand the concrete process of solving the more general ElasticNet regression and pick up the details of how penalized regression problems are solved.
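Before digging into the hand-rolled version below, note that scikit-learn also ships a production implementation of the LARS path; a minimal sketch (assuming scikit-learn is installed) on the same wine data used later:

import pandas as pd
from sklearn.linear_model import lars_path

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine = pd.read_csv(url, header=0, sep=";")
X = wine.iloc[:, :-1].to_numpy()
y = wine.iloc[:, -1].to_numpy()
X = (X - X.mean(axis=0)) / X.std(axis=0)  # normalize, as the article's code does
y = (y - y.mean()) / y.std()

# method="lasso" traces the LARS-based Lasso path; coefs has one column per knot
alphas, active, coefs = lars_path(X, y, method="lasso")
print([wine.columns[i] for i in active])  # attributes active at the end of the path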

3. Complete Code

from math import sqrt

import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm


def x_normalized(xList, xMeans, xSD):
    """Normalize each column of xList to zero mean and unit standard deviation."""
    nrows = len(xList)
    ncols = len(xList[0])
    xNormalized = []
    for i in range(nrows):
        rowNormalized = [(xList[i][j] - xMeans[j]) / xSD[j] for j in range(ncols)]
        xNormalized.append(rowNormalized)
    return xNormalized  # the original omitted this return, so callers got None
def data_normalized(wine):
    """Normalize every column of a DataFrame using the mean and std from describe()."""
    summary = wine.describe()
    nrows, ncols = wine.shape
    wineNormalized = wine.copy()  # copy so the caller's DataFrame is not modified in place
    for i in range(ncols):
        mean = summary.iloc[1, i]  # row 1 of describe() is the mean
        sd = summary.iloc[2, i]    # row 2 of describe() is the standard deviation
        wineNormalized.iloc[:, i] = (wineNormalized.iloc[:, i] - mean) / sd
    return wineNormalized
def calculate_betaMat(nSteps, stepSize, wineNormalized):
    """Forward stagewise regression: nSteps small coefficient updates of size stepSize."""
    nrows, ncols = wineNormalized.shape
    # initialize a vector of coefficients beta
    beta = [0.0] * (ncols - 1)
    # initialize matrix of betas at each step
    betaMat = []
    betaMat.append(list(beta))
    # initialize residuals list
    residuals = [0.0] * nrows
    for i in tqdm(range(nSteps)):
        # calculate residuals: label minus current prediction
        for j in range(nrows):
            residuals[j] = wineNormalized.iloc[j, ncols - 1]
            for k in range(ncols - 1):
                residuals[j] -= wineNormalized.iloc[j, k] * beta[k]
        # calculate correlation between each attribute column and the residual
        corr = [0.0] * (ncols - 1)
        for j in range(ncols - 1):
            for k in range(nrows):
                corr[j] += wineNormalized.iloc[k, j] * residuals[k] / nrows
        # pick the attribute with the largest absolute correlation
        iStar = 0
        corrStar = corr[0]
        for j in range(1, ncols - 1):
            if abs(corrStar) < abs(corr[j]):
                iStar = j
                corrStar = corr[j]
        # step the winning coefficient by stepSize, with the sign of the correlation
        beta[iStar] += stepSize * corrStar / abs(corrStar)
        betaMat.append(list(beta))
    return betaMat
def plot_betaMat1(betaMat):
    """Plot the coefficient history for each attribute against the step index."""
    nSteps = len(betaMat)
    ncols = len(betaMat[0])
    for i in range(ncols):
        # plot the curve of beta values for each attribute
        # (the original sliced rows here, betaMat[0:nSteps][i], instead of column i)
        coefCurve = [betaMat[k][i] for k in range(nSteps)]
        plt.plot(coefCurve)
    plt.xlabel("Steps Taken")
    plt.ylabel("Coefficient Values")
    plt.show()


def plot_betaMat2(nSteps, betaMat):
    """Plot the first nSteps points of each attribute's coefficient history."""
    ncols = len(betaMat[0])
    for i in range(ncols):
        # plot the curve of beta values for each attribute
        coefCurve = [betaMat[k][i] for k in range(nSteps)]
        xaxis = range(nSteps)
        plt.plot(xaxis, coefCurve)
    plt.xlabel("Steps Taken")
    plt.ylabel("Coefficient Values")
    plt.show()
def S(z, gamma):
    """Soft-thresholding operator: shrink z toward zero by gamma, clipping at zero."""
    if gamma >= abs(z):
        return 0.0
    return (z / abs(z)) * (abs(z) - gamma)
if __name__ == '__main__':
    target_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
    wine = pd.read_csv(target_url, header=0, sep=";")
    summary = wine.describe()
    print(summary)
    # normalize the wine data
    wineNormalized = data_normalized(wine)
    # number of steps to take and step size
    nSteps = 100
    stepSize = 0.1
    betaMat = calculate_betaMat(nSteps, stepSize, wineNormalized)
    plot_betaMat1(betaMat)
    # ----------------------------larsWine---------------------------------------------------
    # read data into lists of attributes and labels
    names = wine.columns
    xList = []
    labels = []
    for i in range(len(wine)):
        row = wine.iloc[i]
        # put labels (the last column) in a separate array
        labels.append(float(row.iloc[-1]))
        # convert the remaining columns to floats
        xList.append([float(v) for v in row.iloc[:-1]])
    # Normalize columns in x and labels
    nrows = len(xList)
    ncols = len(xList[0])
    # calculate means and standard deviations
    xMeans = []
    xSD = []
    for i in range(ncols):
        col = [xList[j][i] for j in range(nrows)]
        mean = sum(col) / nrows
        xMeans.append(mean)
        colDiff = [(xList[j][i] - mean) for j in range(nrows)]
        sumSq = sum([colDiff[j] * colDiff[j] for j in range(nrows)])
        stdDev = sqrt(sumSq / nrows)
        xSD.append(stdDev)
    # use the calculated means and standard deviations to normalize xList
    xNormalized = x_normalized(xList, xMeans, xSD)
    # normalize the labels the same way
    meanLabel = sum(labels) / nrows
    sdLabel = sqrt(sum([(labels[i] - meanLabel) * (labels[i] - meanLabel) for i in range(nrows)]) / nrows)
    labelNormalized = [(labels[i] - meanLabel) / sdLabel for i in range(nrows)]
    # initialize a vector of coefficients beta
    beta = [0.0] * ncols
    # initialize matrix of betas at each step
    betaMat = []
    betaMat.append(list(beta))
    # number of steps to take
    nSteps = 350
    stepSize = 0.004
    nzList = []
    for i in range(nSteps):
        # calculate residuals
        residuals = [0.0] * nrows
        for j in range(nrows):
            labelsHat = sum([xNormalized[j][k] * beta[k] for k in range(ncols)])
            residuals[j] = labelNormalized[j] - labelsHat
        # calculate correlation between attribute columns and the residual
        corr = [0.0] * ncols
        for j in range(ncols):
            corr[j] = sum([xNormalized[k][j] * residuals[k] for k in range(nrows)]) / nrows
        # find the attribute that currently does most to reduce the residual
        iStar = 0
        corrStar = corr[0]
        for j in range(1, ncols):
            if abs(corrStar) < abs(corr[j]):
                iStar = j
                corrStar = corr[j]
        # fixed-size increment: positive correlation gives a positive step, negative gives negative
        beta[iStar] += stepSize * corrStar / abs(corrStar)
        betaMat.append(list(beta))
        # at each step, record the indices of non-zero coefficients in order of appearance
        nzBeta = [index for index in range(ncols) if beta[index] != 0.0]
        for q in nzBeta:
            if q not in nzList:
                nzList.append(q)
    nameList = [names[nzList[i]] for i in range(len(nzList))]
    print(nameList)
    plot_betaMat2(nSteps, betaMat)  # plot the coefficient curves
    # -------------------------------larsWine, 10-fold cross-validation------------------------------
    # Build cross-validation loop to determine best coefficient values.
    # number of cross-validation folds
    nxval = 10
    # number of steps and step size
    nSteps = 350
    stepSize = 0.004
    # initialize list for storing the errors at each step
    errors = []
    for i in range(nSteps):
        b = []
        errors.append(b)
    for ixval in range(nxval):  # 10-fold cross-validation
        # define test and training index sets
        # (the original compared a % nxval with ixval * nxval, which leaves every
        # fold after the first with an empty test set)
        idxTrain = [a for a in range(nrows) if a % nxval != ixval]
        idxTest = [a for a in range(nrows) if a % nxval == ixval]
        # define test and training attribute and label sets
        xTrain = [xNormalized[r] for r in idxTrain]  # training set
        labelTrain = [labelNormalized[r] for r in idxTrain]
        xTest = [xNormalized[r] for r in idxTest]  # test set
        labelTest = [labelNormalized[r] for r in idxTest]
        # Train LARS regression on the training data
        nrowsTrain = len(idxTrain)
        nrowsTest = len(idxTest)
        # initialize a vector of coefficients beta
        beta = [0.0] * ncols
        # initialize matrix of betas at each step
        betaMat = []
        betaMat.append(list(beta))
        for iStep in range(nSteps):
            # calculate residuals on the training rows
            residuals = [0.0] * nrowsTrain
            for j in range(nrowsTrain):
                labelsHat = sum([xTrain[j][k] * beta[k] for k in range(ncols)])
                residuals[j] = labelTrain[j] - labelsHat
            # calculate correlation between attribute columns and the residual
            corr = [0.0] * ncols
            for j in range(ncols):
                corr[j] = sum([xTrain[k][j] * residuals[k] for k in range(nrowsTrain)]) / nrowsTrain
            iStar = 0
            corrStar = corr[0]
            for j in range(1, ncols):
                if abs(corrStar) < abs(corr[j]):
                    iStar = j
                    corrStar = corr[j]
            beta[iStar] += stepSize * corrStar / abs(corrStar)
            betaMat.append(list(beta))
            # use the beta just calculated to predict and accumulate out-of-sample error,
            # on data not used in the calculation of beta
            for j in range(nrowsTest):
                labelsHat = sum([xTest[j][k] * beta[k] for k in range(ncols)])
                err = labelTest[j] - labelsHat
                errors[iStep].append(err)
    cvCurve = []
    for errVect in errors:
        mse = sum([x * x for x in errVect]) / len(errVect)
        cvCurve.append(mse)
    minMse = min(cvCurve)
    minPt = [i for i in range(len(cvCurve)) if cvCurve[i] == minMse][0]
    print("Minimum Mean Square Error", minMse)
    print("Index of Minimum Mean Square Error", minPt)
    xaxis = range(len(cvCurve))
    plt.plot(xaxis, cvCurve)
    plt.xlabel("Steps Taken")
    plt.ylabel("Mean Square Error")
    plt.show()
    # -------------------------------glmnet larsWine2------------------------------------------------
    # select value for the alpha parameter (alpha = 1.0 gives the pure Lasso)
    alpha = 1.0
    # make a pass through the data to determine the value of lambda that
    # just suppresses all coefficients; start with betas all equal to zero
    xy = [0.0] * ncols
    for i in range(nrows):
        for j in range(ncols):
            xy[j] += xNormalized[i][j] * labelNormalized[i]
    maxXY = 0.0
    for i in range(ncols):
        val = abs(xy[i]) / nrows
        if val > maxXY:
            maxXY = val
    # calculate starting value for lambda;
    # this value of lambda corresponds to beta = list of 0's
    lam = maxXY / alpha
    # initialize a vector of coefficients beta
    beta = [0.0] * ncols
    # initialize matrix of betas at each step
    betaMat = []
    betaMat.append(list(beta))
    # begin iteration
    nSteps = 100
    lamMult = 0.93  # 100 steps give a reduction by a factor of roughly 1000 in
    # lambda (recommended by the glmnet authors)
    nzList = []
    for iStep in range(nSteps):
        # make lambda smaller so that some coefficients become non-zero
        lam = lam * lamMult
        deltaBeta = 100.0
        eps = 0.01
        iterStep = 0
        betaInner = list(beta)
        while deltaBeta > eps:
            iterStep += 1
            if iterStep > 100:
                break
            # cycle through the attributes and update one at a time;
            # record the starting value for comparison
            betaStart = list(betaInner)
            for iCol in range(ncols):
                xyj = 0.0
                for i in range(nrows):
                    # calculate residual with the current value of beta
                    labelHat = sum([xNormalized[i][k] * betaInner[k] for k in range(ncols)])
                    residual = labelNormalized[i] - labelHat
                    xyj += xNormalized[i][iCol] * residual
                # unconstrained coordinate update, then soft-threshold and shrink
                uncBeta = xyj / nrows + betaInner[iCol]
                betaInner[iCol] = S(uncBeta, lam * alpha) / (1 + lam * (1 - alpha))
            sumDiff = sum([abs(betaInner[n] - betaStart[n]) for n in range(ncols)])
            sumBeta = sum([abs(betaInner[n]) for n in range(ncols)])
            deltaBeta = sumDiff / sumBeta
        print(iStep, iterStep)
        beta = betaInner
        # add the newly determined beta to the list
        betaMat.append(list(beta))
        # keep track of the order in which the betas become non-zero
        nzBeta = [index for index in range(ncols) if beta[index] != 0.0]
        for q in nzBeta:
            if q not in nzList:
                nzList.append(q)
    # print out the ordered list of betas
    nameList = [names[nzList[i]] for i in range(len(nzList))]
    print(nameList)
    nPts = len(betaMat)
    plot_betaMat2(nPts, betaMat)  # plot the coefficient curves
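As a sanity check on the coordinate-descent code above (a suggestion of mine, not part of the original article), the same model can be fit with scikit-learn's ElasticNet. Note the mapping between parameterizations: scikit-learn's alpha plays the role of lam above, and l1_ratio plays the role of alpha:

import pandas as pd
from sklearn.linear_model import ElasticNet

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine = pd.read_csv(url, header=0, sep=";")
X = wine.iloc[:, :-1].to_numpy()
y = wine.iloc[:, -1].to_numpy()
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = (y - y.mean()) / y.std()

# fit_intercept=False because X and y are already centered;
# alpha here corresponds to lam in the script, l1_ratio to its alpha
enet = ElasticNet(alpha=0.01, l1_ratio=1.0, fit_intercept=False)
enet.fit(X, y)
print(enet.coef_)  # should resemble a point on the coefficient curves plotted above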
