How to Work with BIG Datasets on Kaggle Kernels (16G RAM)

Author: 很楠不爱3 | 2024-03-20 02:22:13

Loading datasets of tens of GB on Kaggle

https://www.kaggle.com/yuliagm/how-to-work-with-big-datasets-on-16g-ram-dask

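This particular competition requires us to look at and analyze some really big datasets. In its given form, the data won't even load into pandas on Kaggle kernels. And if you don't have a very fancy computer, it probably won't load on yours either. At least it doesn't on my laptop (MacBook, 16G RAM).

I assume I'm not the only one with limited computing power and a limited budget for paid cloud setups, so I've been looking for different ways to handle big data on limited resources.

Below are some tips I collected while learning to work mostly within Kaggle kernels without overloading their allotted RAM.

These tips are probably most useful for beginner and intermediate users, but if you are an expert and know a better way, please share it and I will happily update the notebook!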

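Outline
(I plan to link these up at some point...)

Tip 1 - Delete unused variables and gc.collect()
Tip 2 - Preset the datatypes
Tip 3 - Import selected rows of a file (including generating your own subsamples)
Tip 4 - Import in batches and process each individually
Tip 5 - Import only selected columns
Tip 6 - Creative data processing
Tip 7 - Use Dask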


import numpy as np 
import pandas as pd 
import datetime
import os
import time
import matplotlib.pyplot as plt
import seaborn as sns
import gc
%matplotlib inline


#make wider graphs
sns.set(rc={'figure.figsize':(12,5)});
plt.figure(figsize=(12,5));

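Tip #1 Delete unused variables and gc.collect()
The thing about Python is that once it loads something into RAM, it keeps it there for as long as anything still refers to it. So if you load a large dataframe into pandas, make a copy of it and never use the original again, the original dataframe will still sit in your RAM, eating up your memory. The same goes for any other variables you create.

So if you are done with a dataframe (or other variable), get into the habit of deleting it.

For example, if you create a dataframe temp, extract some features and merge the results into your main training set, temp will still be taking up space. You need to delete it explicitly by stating del temp. You also need to make sure nothing else refers to temp (you haven't bound any other variable to it).

Even after doing that, there may still be residual memory usage.

This is where the garbage collection module comes in. Import gc at the beginning of your project, and then call gc.collect() each time you want to free up space.

It also helps to run gc.collect() after multiple transformations/functions/copying etc., as all the little references/values accumulate.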


#import some file
temp = pd.read_csv('../input/train_sample.csv')
 
#do something to the file
temp['os'] = temp['os'].astype('str')


#delete when no longer needed
del temp
#collect residual garbage
gc.collect()
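The point about other names still referring to temp can be checked with a small experiment. This is only a sketch: the dataframe is a made-up stand-in for a large file, and the weakref watcher is an illustration device, not part of the original notebook.

```python
import gc
import weakref

import numpy as np
import pandas as pd

# stand-in for a large frame (hypothetical data, not the competition file)
temp = pd.DataFrame({'os': np.arange(1_000_000)})
watcher = weakref.ref(temp)   # observes the object without keeping it alive

alias = temp                  # a second name bound to the same dataframe
del temp                      # removes only the name 'temp'
still_alive = watcher() is not None   # 'alias' still refers to the frame

del alias                     # now nothing refers to the frame
gc.collect()                  # also clears objects held only by reference cycles
freed = watcher() is None

print(still_alive, freed)
```

As long as alias exists, del temp frees nothing; the memory can only be reclaimed once the last reference is gone.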

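Tip #2 Preset the datatypes
If you import data from a CSV, pandas will do its best to guess the datatypes, but it tends to err on the side of allocating more space than necessary. So if you know in advance that your numbers are integers and won't exceed certain values, set the datatypes to the minimum requirements before importing.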


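#smallest unsigned types for the columns whose ranges are known (same mapping as in the column-selection tip below)
dtypes = {'ip': 'uint32', 'is_attributed': 'uint8'}
train = pd.read_csv('../input/train_sample.csv', dtype=dtypes)
 
#check datatypes:
train.info()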
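To get a feel for how much the preset datatypes save, here is a minimal sketch on a small in-memory CSV. The data is made up for illustration; only the column names ip and is_attributed and their dtypes come from this notebook.

```python
import io

import pandas as pd

# tiny made-up CSV with two integer columns (1000 data rows)
csv_text = "ip,is_attributed\n" + "\n".join(f"{i},{i % 2}" for i in range(1000))

# let pandas guess: both columns become int64 (8 bytes per value)
default = pd.read_csv(io.StringIO(csv_text))

# preset the smallest types that can hold the values
compact = pd.read_csv(io.StringIO(csv_text),
                      dtype={'ip': 'uint32', 'is_attributed': 'uint8'})

default_bytes = default.memory_usage(index=False).sum()
compact_bytes = compact.memory_usage(index=False).sum()
print(default_bytes, compact_bytes)
```

Here the two int64 columns take 16000 bytes, while uint32 plus uint8 take 5000 bytes for the same values.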

Tip #3 Import selected rows of a file

a) Select number of rows to import

Instead of the default pd.read_csv('filename') you can use the nrows parameter to specify the number of rows to import. For example: train = pd.read_csv('../input/train.csv', nrows=10000) will only read the first 10000 data rows (the header line is read in addition and does not count towards nrows).

train = pd.read_csv('../input/train.csv', nrows=10000, dtype=dtypes)

train.head()

b) Simple row skip (with or without headings)

You can also specify a number of rows to skip (skiprows) if, for example, you want 1 million rows after the first 5 million: train = pd.read_csv('../input/train.csv', skiprows=5000000, nrows=1000000). This, however, also skips the first line with the headers, so the first line that is read gets used as the header instead. To keep the headings, pass in a range of rows to skip that does not include the first row (indexed [0]).


#but if you want to import the headings from the original file
#skip the first 5mil data rows (lines 1 to 5000000), but use the first line (line 0) for the heading:
train = pd.read_csv('../input/train.csv', skiprows=range(1, 5000001), nrows=1000000, dtype=dtypes)
train.head()


#plain skipping loses heading info.  It's OK for files that don't have headings, 
#or dataframes you'll be linking together, or where you make your own custom headings...
train = pd.read_csv('../input/train.csv', skiprows=5000000, nrows=1000000, header = None, dtype=dtypes)
train.head()
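The outline also mentions generating your own subsample, which this section does not show. One possible way, sketched here as an assumption rather than the author's method, is to pass a function to skiprows that keeps the header line and randomly skips most data lines. The in-memory CSV, keep_fraction and the seed are all made up for illustration.

```python
import io
import random

import pandas as pd

# made-up CSV standing in for the big training file
csv_text = "ip,is_attributed\n" + "\n".join(f"{i},{i % 2}" for i in range(10000))

rng = random.Random(0)    # seeded so the sample is reproducible
keep_fraction = 0.01      # keep roughly 1% of the data lines

# the function receives the line index; returning True skips that line,
# line 0 (the header) is never skipped
sample = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda i: i > 0 and rng.random() > keep_fraction,
)
print(len(sample), list(sample.columns))
```

Because the decision is made line by line while reading, the full file never has to fit in memory.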

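Tip #4 Import in batches and process each individually
We know that the proportion of clicks that were attributed is very low. So let's say we want to look at all of them at the same time. We don't know which rows they are, and we can't load the whole data and filter. But we can load it in chunks, extract what we need from each chunk and get rid of everything else!

The idea is simple. You specify the chunk size (number of rows) you want pandas to import at a time. Then you do some kind of processing on it. Then pandas imports the next chunk, until there are no more rows left.

So below I import one million rows at a time, extract only the rows that have 'is_attributed'==1 (i.e. the app was downloaded) and then merge these results into a common dataframe for further inspection.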


#collect the filtered pieces of each chunk here
converted_parts = []
 
#we are going to work with chunks of size 1 million rows
chunksize = 10 ** 6
 
#in each chunk, filter for values that have 'is_attributed'==1
for chunk in pd.read_csv('../input/train.csv', chunksize=chunksize, dtype=dtypes):
    converted_parts.append(chunk[chunk['is_attributed'] == 1])
 
#merge the filtered pieces into one dataframe (a single concat avoids re-copying the growing result on every chunk)
df_converted = pd.concat(converted_parts, ignore_index=True)
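The same chunked pattern also works when only a summary is needed and no rows have to be kept at all. A minimal sketch, assuming a small made-up CSV in place of train.csv and a chunk size chosen only for the demo:

```python
import io

import pandas as pd

# made-up data: 1000 clicks, every 100th one is attributed
csv_text = "ip,is_attributed\n" + "\n".join(
    f"{i % 50},{1 if i % 100 == 0 else 0}" for i in range(1000)
)

total_rows = 0
total_attributed = 0

# only one chunk of 200 rows is held in memory at any time
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=200):
    total_rows += len(chunk)
    total_attributed += int(chunk['is_attributed'].sum())

rate = total_attributed / total_rows
print(total_rows, total_attributed, rate)
```

Running totals like these stay tiny no matter how large the file is, which is why this approach scales to files far bigger than RAM.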

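Tip #5 Import only selected columns
If you want to analyze just some specific features, you can import only the selected columns.

For example, let's say we want to analyze clicks by IP, or conversions by IP.

Importing just a few fields instead of the full table might fit in our RAM.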


#wanted columns
columns = ['ip', 'click_time', 'is_attributed']
dtypes = {
        'ip'            : 'uint32',
        'is_attributed' : 'uint8',
        }
 
ips_df = pd.read_csv('../input/train.csv', usecols=columns, dtype=dtypes)
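To see the effect of usecols on a small scale, the sketch below reads the same made-up CSV twice, once in full and once with only the wanted columns. The columns extra_a and extra_b and the timestamp format are hypothetical stand-ins for whatever else the real file contains.

```python
import io

import pandas as pd

# made-up file with more columns than we need
rows = "\n".join(
    f"{i},{i % 7},{i * 2},{i * 3},2017-11-07 09:30:38,{i % 2}" for i in range(1000)
)
csv_text = "ip,os,extra_a,extra_b,click_time,is_attributed\n" + rows

dtypes = {'ip': 'uint32', 'is_attributed': 'uint8'}
columns = ['ip', 'click_time', 'is_attributed']

full = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
selected = pd.read_csv(io.StringIO(csv_text), usecols=columns, dtype=dtypes)

# deep=True also counts the memory of string columns
full_bytes = full.memory_usage(deep=True).sum()
selected_bytes = selected.memory_usage(deep=True).sum()
print(list(selected.columns), selected_bytes < full_bytes)
```

The skipped columns are never materialized, so the saving applies during loading as well as afterwards.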

Tip #6 Creative data processing
The kernel cannot handle a groupby on the whole dataframe. But it can do it in sections. For example:


#processing part of the table is not a problem
ips_df[0:100][['ip', 'is_attributed']].groupby('ip', as_index=False).count()[:10]


size=100000
all_rows = len(ips_df)
 
#generate the first batch
ip_sums = ips_df[0:size][['ip', 'is_attributed']].groupby('ip', as_index=False).sum()
 
#add remaining batches; stepping by size up to all_rows also covers the last, shorter batch
for start in range(size, all_rows, size):
    #a slice that runs past the end simply stops at the last row
    group = ips_df[start:start+size][['ip', 'is_attributed']].groupby('ip', as_index=False).sum()
    ip_sums = ip_sums.merge(group, on='ip', how='outer')
    ip_sums.columns = ['ip', 'sum1','sum2']
    #nansum treats an ip that is missing from one side as 0
    ip_sums['conversions_per_ip'] = np.nansum((ip_sums['sum1'], ip_sums['sum2']), axis = 0)
    ip_sums.drop(columns=['sum1', 'sum2'], inplace=True)

ip_sums.head(10)
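A possibly simpler variant of the same idea, offered as a sketch and not as the author's approach: compute a partial sum per batch, then combine the partial sums with one more groupby. The demo frame below is randomly generated; on the real data each partial result has at most one row per ip, so the list of partials is far smaller than the table itself.

```python
import numpy as np
import pandas as pd

# made-up stand-in for ips_df, with a row count that is not a multiple of the batch size
rng = np.random.default_rng(0)
ips_demo = pd.DataFrame({
    'ip': rng.integers(0, 100, size=1050),
    'is_attributed': rng.integers(0, 2, size=1050),
})

size = 100
partials = []
for start in range(0, len(ips_demo), size):
    batch = ips_demo[start:start + size]
    # per-batch sums, indexed by ip
    partials.append(batch.groupby('ip')['is_attributed'].sum())

# an ip can appear in several batches, so sum the partial sums per ip
ip_sums_demo = pd.concat(partials).groupby(level=0).sum()

# reference result computed in one go (only feasible because the demo is tiny)
expected = ips_demo.groupby('ip')['is_attributed'].sum()
print(ip_sums_demo.to_dict() == expected.to_dict())
```

This works because a sum of sums equals the overall sum; the same trick applies to counts, minima and maxima, but not directly to means or medians.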

Tip #7 Use Dask


import dask
import dask.dataframe as dd

There are different sections to Dask, but for this case you'll likely just use Dask DataFrames.

Here are some basics from the developers:

A Dask DataFrame is a large parallel dataframe composed of many smaller Pandas dataframes, split along the index. These pandas dataframes may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. One Dask dataframe operation triggers many operations on the constituent Pandas dataframes.

(https://dask.pydata.org/en/latest/dataframe.html)

For convenience, dask.dataframe copies the pandas API, so commands look and feel familiar.

What can Dask DataFrames do? They are very fast on the most commonly used set of the pandas API.

below is taken directly from: https://dask.pydata.org/en/latest/dataframe.html

Trivially parallelizable operations (fast):
