当前位置:   article > 正文

(13-3-02)服装推荐系统:数据集处理——数据清洗_推荐系统coat数据集

推荐系统coat数据集

12.5.2  数据清洗

在经过前面的初步分析之后,可能会发现数据集中存在噪声、缺失值、异常值等问题。为了确保数据的质量和一致性,你需要进行数据清洗。编写文件data_cleaning.ipynb实现数据清洗功能,具体实现流程如下所示。

1加载数据集文件,使用给定的文件路径配置信息构建完整的文件路径,以便后续对相应文件进行读取和处理操作。对应的实现代码如下所示:

  1. categorical_column = lambda x: ('NONE' if pd.isna(x) or len(x) == 0 else 'NONE' if x =='None' else x)
  2. df_customer = pd.read_csv(CUSTOMER_MASTER_TABLE,
  3. converters =
  4. {
  5. 'fashion_news_frequency': categorical_column
  6. },
  7. usecols = ['customer_id', 'FN', 'Active', 'club_member_status','fashion_news_frequency', 'age'],
  8. dtype =
  9. {
  10. 'club_member_status': 'category'
  11. }
  12. )
  13. df_customer.FN.fillna(0, inplace = True)
  14. df_customer.FN = df_customer.FN.astype(bool)
  15. df_customer.Active.fillna(0, inplace = True)
  16. df_customer.Active = df_customer.Active.astype(bool)
  17. df_customer.rename(columns = {'FN':'subscribe_fashion_newsletter','Active':'active'}, inplace = True)
  18. df_customer['fashion_news_frequency'] = df_customer['fashion_news_frequency'].astype('category')
  19. df_customer.customer_id = df_customer.customer_id.apply(lambda x: int(x[-16:],16) ).astype('int64')
  20. df_customer.head()

2读取名为CUSTOMER_MASTER_TABLE的CSV文件,并将特定列进行数据类型转换、重命名和填充缺失值。最后,打印输出df_customer的前几行数据。对应的实现代码如下所示:

  1. categorical_column = lambda x: ('NONE' if pd.isna(x) or len(x) == 0 else 'NONE' if x =='None' else x)
  2. df_customer = pd.read_csv(CUSTOMER_MASTER_TABLE,
  3.                           converters =
  4.                                       {                                      
  5.                                        'fashion_news_frequency': categorical_column   
  6.                                       },
  7.                          usecols = ['customer_id', 'FN', 'Active', 'club_member_status','fashion_news_frequency', 'age'],
  8.                           dtype =
  9.                                   {
  10.                                     'club_member_status': 'category'                                   
  11.                                   }                          
  12.                          )
  13. df_customer.FN.fillna(0, inplace = True)
  14. df_customer.FN = df_customer.FN.astype(bool)
  15. df_customer.Active.fillna(0, inplace = True)
  16. df_customer.Active = df_customer.Active.astype(bool)
  17. df_customer.rename(columns = {'FN':'subscribe_fashion_newsletter','Active':'active'}, inplace = True)
  18. df_customer['fashion_news_frequency'] = df_customer['fashion_news_frequency'].astype('category')
  19. df_customer.customer_id = df_customer.customer_id.apply(lambda x: int(x[-16:],16) ).astype('int64')
  20. df_customer.head()

执行后输出:

  1. customer_id subscribe_fashion_newsletter active club_member_status fashion_news_frequency age
  2. 0 6883939031699146327 False False ACTIVE NONE 49.0
  3. 1 -7200416642310594310 False False ACTIVE NONE 25.0
  4. 2 -6846340800584936 False False ACTIVE NONE 24.0
  5. 3 -94071612138601410 False False ACTIVE NONE 54.0
  6. 4 -283965518499174310 True True ACTIVE Regularly 52.0

3对df_customer的age列进行描述性统计分析,并绘制箱线图和直方图。对应的实现代码如下所示:

  1. df_customer.age.describe()
  2. sns.boxplot(x = df_customer.age)
  3. sns.displot(df_customer, x = "age",  binwidth = 5)
  4. plt.show()

执行后会显示绘制的箱线图和直方图,如图12-3所示。

图12-3  绘制的箱线图和直方图

4)接下来开始移除异常值,通过如下代码计算年龄数据的四分位数(Q1, Q3)和四分位距(IQR),然后根据上限(upper limit)标记年龄超过上限的客户为异常值。同时,计算年龄的中位数,并将缺失值填充为中位数,并将年龄数据转换为无符号8位整数类型。最后,将生成的数据框的前几行打印出来。

  1. Q1, Q3 = np.nanpercentile(df_customer.age, [25, 75])
  2. print('Q1 25 percentile of the given data is, ', Q1)
  3. print('Q1 75 percentile of the given data is, ', Q3)
  4. IQR = Q3 - Q1
  5. print('Interquartile range is', IQR)
  6. up_lim = Q3 + 1.5 * IQR
  7. print('Upper limit is', up_lim)
  8. df_customer['is_outlier'] = df_customer.age > up_lim
  9. age_median = df_customer.age.median(skipna = True)
  10. print(f'Median age: {age_median}')
  11. df_customer.age = df_customer.age.fillna(value = age_median)
  12. df_customer.age = np.uint8(df_customer.age)
  13. df_customer.head()

执行后输出:

  1. customer_id subscribe_fashion_newsletter active club_member_status fashion_news_frequency age is_outlier
  2. 0 6883939031699146327 False False ACTIVE NONE 49 False
  3. 1 -7200416642310594310 False False ACTIVE NONE 25 False
  4. 2 -6846340800584936 False False ACTIVE NONE 24 False
  5. 3 -94071612138601410 False False ACTIVE NONE 54 False
  6. 4 -283965518499174310 True True ACTIVE Regularly 52 False

(5)接下来开始处理交易数据,首先从pickle文件中读取交易数据到DataFrame,然后对日期时间列进行格式转换,然后进行数据类型转换和列重命名,以减少内存使用和提高数据的可读性。对应的实现代码如下所示:

  1. df_transaction = pd.read_pickle(TRANSACTION_DATA,
  2. compression = 'gzip'
  3. )
  4. df_transaction.t_dat = pd.to_datetime(df_transaction.t_dat)
  5. df_transaction.article_id = pd.to_numeric(df_transaction.article_id, downcast = 'unsigned')
  6. df_transaction.price = df_transaction.price.astype('float16')
  7. df_transaction.customer_id = df_transaction.customer_id.apply(lambda x: int(x[-16:],16) ).astype('int64')
  8. df_transaction.sales_channel_id = df_transaction.sales_channel_id.apply(lambda x: False if (x == 1) else True)
  9. df_transaction.rename(columns = {'sales_channel_id':'online_sale'}, inplace = True)
  10. print("Shape of transaction records:" + str(df_transaction.shape))
  11. df_transaction.head()

执行后会输出:

  1. t_dat customer_id article_id price online_sale
  2. 0 2018-09-20 -6846340800584936 663713001 0.050842 True
  3. 1 2018-09-20 -6846340800584936 541518023 0.030487 True
  4. 2 2018-09-20 -8334631767138808638 505221004 0.015236 True
  5. 3 2018-09-20 -8334631767138808638 685687003 0.016937 True
  6. 4 2018-09-20 -8334631767138808638 685687004 0.016937 True

6)编写如下所示的代码,首先从交易数据中删除重复记录,并保留第一条记录。然后打印删除重复记录后的交易数据形状,并对交易数据按照"customer_id"和"article_id"进行分组,并统计每个组中"price"的计数。最后使用lambda函数筛选出计数大于1的组,并重新命名计数列为"count",并将结果重置为新的DataFrame。

  1. df_transaction.drop_duplicates(keep = 'first', inplace = True)
  2. print("Shape of transaction records after droping duplicate:" + str(df_transaction.shape))
  3. df_group = df_transaction.groupby(['customer_id','article_id'])['price'].count().pipe(lambda  dfx: dfx.loc[dfx > 1]).reset_index(name = 'count')
  4. df_group.head()

执行后会输出:

  1. customer_id article_id count
  2. 0 -9223352921020755230 706016001 2
  3. 1 -9223343869995384291 519583013 2
  4. 2 -9223343869995384291 583534014 3
  5. 3 -9223343869995384291 649671009 2
  6. 4 -9223343869995384291 655784014 2

7)编写如下所示的代码,在交易数据中添加一个新的列"index_col",该列的值为数据的索引。然后将"df_group" DataFrame中的"customer_id"和"article_id"列与"df_transaction" DataFrame进行内连接(inner join)操作,根据这两列进行数据的合并。最后返回合并后的DataFrame的前几行数据。

  1. df_transaction['index_col'] = df_transaction.index
  2. df_group = (df_group[['customer_id','article_id']]
  3. .merge(df_transaction,
  4. on = ['customer_id','article_id'],
  5. how = 'inner')
  6. )
  7. df_group.head()

执行后会输出:

  1. customer_id article_id t_dat price online_sale index_col
  2. 0 -9223352921020755230 706016001 2019-10-12 0.033875 False 17854648
  3. 1 -9223352921020755230 706016001 2019-10-26 0.033875 False 18327866
  4. 2 -9223343869995384291 519583013 2019-03-16 0.010155 True 7428712
  5. 3 -9223343869995384291 519583013 2019-03-20 0.010155 True 7584492
  6. 4 -9223343869995384291 583534014 2019-02-21 0.014389 True 6435020

8)编写如下代码对DataFrame按照"t_dat"列进行排序,然后计算每个购买记录的上一次购买时间,并添加到DataFrame中。最后计算每个购买记录与上一次购买时间之间的时间间隔,并将结果添加到DataFrame中。

  1. df_group.sort_values(by = 't_dat',  inplace = True)
  2. df_group['previous_purchase'] = df_group.groupby(['customer_id','article_id'])['t_dat'].shift(periods = 1)
  3. df_group['previous_purchase'] = pd.to_datetime(df_group['previous_purchase'], errors = 'coerce')
  4. df_group['time_pass_before_last_purchase'] = ((df_group['t_dat'] - df_group['previous_purchase']).dt.days).fillna(0).astype(int)
  5. df_group.head()

执行后会输出:

  1. customer_id article_id t_dat price online_sale index_col previous_purchase time_pass_before_last_purchase
  2. 1845563 2818613237059253259 562245058 2018-09-20 0.033875 False 2103 NaT 0
  3. 39824 -8971660418312476665 571436010 2018-09-20 0.012184 False 3697 NaT 0
  4. 39825 -8971660418312476665 571436010 2018-09-20 0.013542 False 3698 2018-09-20 0
  5. 1667034 1648773112888926075 651327001 2018-09-20 0.018677 True 544 NaT 0
  6. 2740283 8630231611017253359 700833004 2018-09-20 0.064392 True 14252 NaT 0

(9)绘制顾客在最后一次购买之前经过的天数的直方图,对应的实现代码如下所示:

  1. usecol = ['article_id','product_type_name']
  2. df_articles = pd.read_csv(ARTICLES_MASTER_TABLE,
  3. usecols = usecol
  4. )
  5. df_subset = df_group[df_group.time_pass_before_last_purchase != 0]
  6. df_subset = df_articles.merge(df_subset,
  7. on = 'article_id',
  8. how ='inner')
  9. df_subset.head()
  10. del [df_articles]
  11. gc.collect()
  12. plt.figure(figsize = (15,5),
  13. dpi = 100,
  14. clear = True,
  15. constrained_layout = True
  16. )
  17. ax = sns.histplot(data = df_subset, x = 'time_pass_before_last_purchase', bins = 74)
  18. ax.set_xticks(list(range(0,740,10)))
  19. ax.set_xlim(0,370)
  20. ax.set_xlabel("Days elapsed before last purchase")
  21. ax.set_ylabel("Count")
  22. plt.show()
  1. 首先,从"ARTICLES_MASTER_TABLE"中读取"article_id"和"product_type_name"两列的数据,存储在DataFrame "df_articles"中。
  2. 从"df_group"中筛选出"time_pass_before_last_purchase"不等于0的记录,存储在DataFrame "df_subset"中。
  3. 将"df_articles"和"df_subset"根据"article_id"列进行内连接,得到新的DataFrame "df_subset"。
  4. 绘制直方图,展示"time_pass_before_last_purchase"的分布情况。
  5. 对内存中的变量进行删除和垃圾回收操作。
  6. 最后,显示绘制的直方图。

绘制的直方图效果如图12-4所示。

12-4  顾客在最后一次购买之前经过的天数的直方图

10)通过如下代码绘制产品类型的计数柱状图,首先创建一个图形对象(Figure)并设置其属性,然后使用Seaborn库的countplot函数绘制柱状图,其中X轴表示产品类型,Y轴表示计数。通过设置旋转角度使X轴标签垂直显示。最后,显示绘制的图形并清除图形对象以释放资源。

  1. fig_WS = plt.figure(num = 1,
  2. figsize = (15,5),
  3. dpi = 100,
  4. clear = True,
  5. constrained_layout = True
  6. )
  7. ax = sns.countplot(data = df_subset,
  8. x = 'product_type_name',
  9. order = df_subset['product_type_name'].value_counts().index
  10. )
  11. ax.set_xticklabels(ax.get_xticklabels(),
  12. rotation = 90
  13. )
  14. ax.set_xlabel("Product Type")
  15. ax.set_ylabel("Count")
  16. plt.show()
  17. fig_WS.clf()

执行后会绘制产品类型的计数柱状图,如图12-5所示。

12-5  产品类型的计数柱状图

在众多商品中,属于配件类别的物品,例如内衣、帽子、发夹、钱包、腰带等,用户可能会多次购买。几乎85%的服装重新购买的物品属于服装类别,如裤子、连衣裙、毛衣、T恤等。很可能用户不会购买相同的物品多次,因为不同颜色的物品会有不同的article_id。几乎90%的重新购买的物品在30天内完成,而H&M有30天的退货政策。因此,我们可以假设这些是退货的物品,用户以不同尺码购买了相同的物品,而H&M将其视为新交易,并可能在维护一个单独的退货交易表。

由于我们希望向用户推荐与他们过去购买的物品类似的新物品,所以我们可以删除这些重复的物品,其中他们再次购买了相同的物品,只保留用户对该物品最后一次交易的记录。

(11首先通过如下代码对数据框(DataFrame)进行排序,按照customer_id、article_id和t_dat进行降序排序。这样做的目的是删除重复的购买记录,并保留每个顾客对每个物品的第一次购买记录。

  1. df_group.sort_values(by = ['customer_id','article_id','t_dat'], inplace = True, ascending = False)
  2. df_group.head()

执行后会输出:

  1. customer_id article_id t_dat price online_sale index_col previous_purchase time_pass_before_last_purchase
  2. 2831516 9223333063893176977 607834005 2018-11-18 0.017776 False 2620551 2018-11-08 10
  3. 2831515 9223333063893176977 607834005 2018-11-08 0.025406 True 2265236 NaT 0
  4. 2831514 9223148401910457466 830770001 2019-12-14 0.029358 False 20126479 2019-11-08 36
  5. 2831513 9223148401910457466 830770001 2019-11-08 0.033875 False 18773300 NaT 0
  6. 2831512 9223148401910457466 767377003 2019-06-09 0.050842 False 11590886 2019-06-06 3

(12)编写如下代码,从主交易数据表中删除所有重复的行。从df_group数据表中删除重复的顾客和物品组合,只保留每个组合的第一次购买记录。选择特定的列(article_id、t_dat、customer_id、price、online_sale)从df_group数据表中提取出来。将提取的数据表与主交易数据表进行连接,添加到主交易数据表的末尾。然后,从主交易数据表中删除列"index_col"。最后,显示更新后的主交易数据表的前几行。

  1. df_transaction.drop(index = df_group.index_col, inplace = True)
  2. df_group.drop_duplicates(subset = ['customer_id','article_id'], keep = 'first', inplace = True)
  3. usecol = ['article_id', 't_dat', 'customer_id', 'price', 'online_sale']
  4. df_transaction = pd.concat([df_transaction,
  5. df_group[usecol]
  6. ],
  7. axis = 0)
  8. df_transaction.drop(columns = ['index_col'],
  9. axis = 1,
  10. inplace = True)
  11. df_transaction.head()

执行后会输出:

  1. t_dat customer_id article_id price online_sale
  2. 1 2018-09-20 -6846340800584936 541518023 0.030487 True
  3. 2 2018-09-20 -8334631767138808638 505221004 0.015236 True
  4. 3 2018-09-20 -8334631767138808638 685687003 0.016937 True
  5. 4 2018-09-20 -8334631767138808638 685687004 0.016937 True
  6. 5 2018-09-20 -8334631767138808638 685687001 0.016937 True

13)接下来开始删除异常值记录,为什么要这样做呢?因为每种商品都是不同的,售价也不同。因此,不能简单地对定价列应用异常值处理,而不考虑每种商品的价格。我们将按商品分组,找出该商品组类型中价格的异常值。

注意:由于动态定价的存在,某些商品的价格可能比其他商品高得多。

编写如下代码绘制在线销售与离线销售价格散点图,首先将数据中在线销售和离线销售的价格分别提取出来,然后进行描述性统计分析。接下来打印在线销售价格的描述统计信息和离线销售价格的描述统计信息。然后创建一个图表,并将在线销售和离线销售的价格分别以散点图的形式绘制出来。最后,设置图表的标题、坐标轴标签和刻度,并显示绘制的图表。

  1. online_sales = df_transaction[df_transaction.online_sale == 1]['price']
  2. offline_sales = df_transaction[df_transaction.online_sale == 0]['price']
  3. print("="*20 + "Online Sales" + "="*20)
  4. print(online_sales.describe())
  5. print("="*20 + "Offline Sales" + "="*20)
  6. print(offline_sales.describe())
  7. fig = plt.figure(figsize = (15,10))
  8. y_index = range(online_sales.shape[0])
  9. ax = sns.scatterplot(y = online_sales, x = y_index, color = ['red'], label = "Online")
  10. y_index = range(offline_sales.shape[0])
  11. ax = sns.scatterplot(y = offline_sales, x = y_index, color = ['blue'] , label = "Offline")
  12. ax.set_title("Sales Price")
  13. ax.set_ylabel("Price")
  14. ax.set_xlabel("Index")
  15. ax.set_yticks(np.linspace(0.01, 0.5, num = 40))
  16. #ax.grid()
  17. ax.legend(title = 'Sales Channel')
  18. plt.show()

执行效果如图12-6所示。

图12-6  在线销售与离线销售价格散点图

14)编写如下代码,使用散点图绘制在线销售和离线销售的价格分布情况。

  1. fig = plt.figure(figsize = (15,10))
  2. #online_sales = df_transaction[df_transaction.online_sale == 1]['price'].values
  3. sorted_online_price = np.sort(online_sales)[::-1]
  4. y_index = range(online_sales.shape[0])
  5. ax = sns.scatterplot(y = sorted_online_price, x = y_index, color = ['red'], label = "Online")
  6. #offline_sales = df_transaction[df_transaction.online_sale == 0]['price'].values
  7. sorted_offline_price = np.sort(offline_sales)[::-1]
  8. y_index = range(offline_sales.shape[0])
  9. ax = sns.scatterplot(y = sorted_offline_price, x = y_index, color = ['blue'] , label = "Offline")
  10. ax.set_title("Sorted Sales Price")
  11. ax.set_ylabel("Price")
  12. ax.set_xlabel("Index")
  13. ax.set_yticks(np.linspace(0.01, 0.5, num = 50))
  14. ax.grid()
  15. ax.legend(title = 'Sales Channel')
  16. plt.show()

执行效果如图12-7所示。

图12-7  在线销售和离线销售的价格分布情况散点图

15)请看下面的两段代码,其中第1段代码用于统计价格在区间[0.01, 0.17]内的数据所占的比例,并输出结果。2段代码用于统计价格大于0.17的数据所占的比例,并输出结果。

  1. per = df_transaction[(df_transaction.price >=0.01) & (df_transaction.price<=0.17)].shape[0]
  2. per = per/df_transaction.shape[0]
  3. print(f'Percentage of data between [0.01, 0.17]: {per}')
  4. per = df_transaction[(df_transaction.price > 0.17)].shape[0]
  5. per = per/df_transaction.shape[0]
  6. print(f'Percentage of data more then 0.17: {per}')
  7. fig, axis = plt.subplots(nrows = 1, ncols = 2, figsize = (20,10))
  8. sns.distplot(online_sales, ax = axis[0])
  9. sns.distplot(offline_sales, ax = axis[1])
  10. plt.suptitle("Distribution of the sale price")
  11. axis[0].set_title("Online Sale Price")
  12. axis[1].set_title("Offline Sale Price")
  13. plt.show()

执行后会创建一个大小为20x10的图形,并分为1行2列的子图。左侧子图显示在线销售价格的分布情况,右侧子图显示离线销售价格的分布情况。图形的标题为"Distribution of the sale price",左侧子图标题为"Online Sale Price",右侧子图标题为"Offline Sale Price"。如图12-8所示。

图12-8  执行效果

16)对相似的product_type_name进行分组为了减少唯一product type name的基数,采取以下步骤实现:

  1. 创建一个新的列"product_type_name_org",复制原始值并更新"product_type_name"。
  2. 根据它们所属的类别对产品进行分组。

对应的实现代码如下所示:

  1. unique_product_type = df_article.product_type_name.unique()
  2. print(f'Number of unique product type before: {len(unique_product_type)}')
  3. colname = 'product_type_name_org'
  4. colindex = ['product_type_name']
  5. # 将原始值复制到新列“_org”
  6. df_article[colname] = df_article.product_type_name
  7. # 合并所有“hat”,“cap”,“beanie”产品类型为“Hat Cap”
  8. # Beanie看起来像某种帽子
  9. filters = ["Hat", "Cap", "Beanie"]
  10. items = hlpeda.filter_types(filters, unique_product_type)
  11. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  12. df_article.loc[rowindex, colindex] = "Hat Cap"
  13. # 将所有“Wood balls”,“Side table”等产品类型合并为“Cloths Care”
  14. filters = ["Wood balls", "Side table", "Stain remover spray", "Sewing kit", "Clothing mist", "Zipper head"]
  15. items = hlpeda.filter_types(filters, unique_product_type)
  16. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  17. df_article.loc[rowindex, colindex] = "Cloths Care"
  18. # 将所有“Ring”,“Necklace”,“Bracelet”,“Earring”产品类型合并为“Jewellery”
  19. filters = ["Ring", "Necklace", "Bracelet", "Earring"]
  20. items = hlpeda.filter_types(filters, unique_product_type)
  21. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  22. df_article.loc[rowindex, colindex] = "Jewellery"
  23. # 将所有“Polo shirt”产品类型合并为“Shirt”
  24. filters = ["Polo shirt"]
  25. items = hlpeda.filter_types(filters, unique_product_type)
  26. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  27. df_article.loc[rowindex, colindex] = "Shirt"
  28. # 将所有“Bag”产品类型合并为“Bag”
  29. filters = ["Bag", 'Backpack']
  30. items = hlpeda.filter_types(filters, unique_product_type)
  31. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  32. df_article.loc[rowindex, colindex] = "Bag"
  33. # 将所有“Swimwear bottom”,“Swimsuit”,“Bikini top”产品类型合并为“Swimwear”
  34. filters = ["Swimwear"]
  35. items = hlpeda.filter_types(filters, unique_product_type)
  36. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  37. df_article.loc[rowindex, colindex] = "Swimwear"
  38. # 将所有“Underwear bottom”,“Underwear Tights”,“Underwear body”,“Long John”产品类型合并为“Underwear”
  39. filters = ["Underwear", "Long John"]
  40. items = hlpeda.filter_types(filters, unique_product_type)
  41. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  42. df_article.loc[rowindex, colindex] = "Underwear"
  43. # 将所有“Hair”,“Alice band”,“Hairband”产品类型合并为“Hair Accessories”
  44. filters = ["Hair", "Alice band", "Hairband"]
  45. items = hlpeda.filter_types(filters, unique_product_type)
  46. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  47. df_article.loc[rowindex, colindex] = "Hair Accessories"
  48. # 将所有“Sandals”,“Heeled sandals”,“Wedge”,“Pumps”,“Slippers”产品类型合并为“Sandals”
  49. # Ballerinas是一种用于芭蕾舞演出的凉鞋类型
  50. filters = ["Sandals", "Heels", "Flip flop", "Wedge", "Pump", "Slipper", "Ballerinas"]
  51. items = hlpeda.filter_types(filters, unique_product_type)
  52. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  53. df_article.loc[rowindex, colindex] = "Sandals"
  54. # 将所有“Toy”,“Soft Toys”产品类型合并为“Toys”
  55. filters = ["Toy"]
  56. items = hlpeda.filter_types(filters, unique_product_type)
  57. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  58. df_article.loc[rowindex, colindex] = "Toy"
  59. # 将所有“Flat Shoe”,“Other Shoe”,“Boots”,“Sneakers”产品类型合并为“Shoe”
  60. filters = ["shoe", "Boot", "Leg warmer", "Sneakers"]
  61. items = hlpeda.filter_types(filters, unique_product_type)
  62. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  63. df_article.loc[rowindex, colindex] = "Shoe"
  64. # 将所有“Dog wear”产品类型合并为“Dog wear”
  65. filters = ["Dog wear"]
  66. items = hlpeda.filter_types(filters, unique_product_type)
  67. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  68. df_article.loc[rowindex, colindex] = "Dog wear"
  69. # 将所有“Sunglasses”,“EyeGlasses”产品类型合并为“Glasses”
  70. filters = ["glasses"]
  71. items = hlpeda.filter_types(filters, unique_product_type)
  72. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  73. df_article.loc[rowindex, colindex] = "Glasses"
  74. # 将所有“Cosmetics”,“Chem. Cosmetics”产品类型合并为“Cosmetics”
  75. filters = ["Cosmetics"]
  76. items = hlpeda.filter_types(filters, unique_product_type)
  77. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  78. df_article.loc[rowindex, colindex] = "Cosmetics"
  79. # 将所有“earphone case”,“case”产品类型合并为“Mobile Accesories”
  80. filters = ["earphone case", "case"]
  81. items = hlpeda.filter_types(filters, unique_product_type)
  82. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  83. df_article.loc[rowindex, colindex] = "Mobile Accesories"
  84. # 将所有“Bra extender”产品类型合并为“Bra”
  85. filters = ["Bra extender"]
  86. items = hlpeda.filter_types(filters, unique_product_type)
  87. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  88. df_article.loc[rowindex, colindex] = "Bra"
  89. # 将所有“Nipple covers”产品类型(不属于Nightwear部门名称)合并为“Bar”
  90. filters = ["Nipple covers"]
  91. rowindex = df_article[(df_article.department_name != 'Nightwear') &
  92. (df_article.product_type_name.isin(filters))].index.tolist()
  93. df_article.loc[rowindex, colindex] = "Bra"
  94. # 将所有“Tailored Waistcoat”,“Outdoor Waistcoat”产品类型合并为“Waistcoat”
  95. filters = ["Waistcoat"]
  96. items = hlpeda.filter_types(filters, unique_product_type)
  97. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  98. df_article.loc[rowindex, colindex] = "Waistcoat"
  99. # 将“Outdoor trousers”产品类型合并为“Trousers”
  100. filters = ["Outdoor trousers"]
  101. items = hlpeda.filter_types(filters, unique_product_type)
  102. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  103. df_article.loc[rowindex, colindex] = "Trousers"
  104. # UnderDress是与某些裙子一起穿的。
  105. # 因此,让我们将它们与Dress部分分组,因为如果用户以前购买了裙子,他们可能需要搭配着穿的UnderDress
  106. filters = ["Underdress"]
  107. items = hlpeda.filter_types(filters, unique_product_type)
  108. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  109. df_article.loc[rowindex, colindex] = "Dress"
  110. # 将“Towel”,“Blanket”,“Cushion”,“Waterbottle”,“Robe”产品类型合并为“House Accesories”
  111. filters = ["Towel", "Blanket", "Cushion", "Waterbottle", "Robe"]
  112. items = hlpeda.filter_types(filters, unique_product_type)
  113. rowindex = df_article[df_article.product_type_name.isin(items)].index.tolist()
  114. df_article.loc[rowindex, colindex] = "House Accesories"
  115. print(f'Number of new unique product type after: {df_article.product_type_name.nunique()}')

执行后会输出显示数据框中的新唯一产品类型的数量:

  1. Number of unique product type before: 130
  2. Number of new unique product type after: 66

17)绘制一个水平条形图,显示每个产品类型的数量统计。对应的实现代码如下所示:

  1. class_label_cnt = df_article.product_type_name.value_counts()
  2. fig = plt.figure(figsize = (10,20))
  3. class_label_cnt.plot.barh()

执行效果如图12-9所示。

图12-9  产品类型数量统计条形图

注意:探索性数据分析和特征工程

在数据清洗之后,你可以进行更深入的探索性数据分析(EDA)和特征工程。EDA可以帮助你理解数据集中的关系、趋势和分布,为后续的建模和预测准备好数据。特征工程则涉及从原始数据中提取、转换和创建新特征,以便用于机器学习模型的训练。编写文件eda_feature_eng.ipynb实现探索性数据分析和特征工程功能,为节省本书篇幅,将不再讲解这个文件的具体实现过程。

未完待续

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/喵喵爱编程/article/detail/789302
推荐阅读
相关标签
  

闽ICP备14008679号