
pandas Notes
# Data objects in Pandas

## The Series object

Series is the most basic data object in pandas. It implements the `__array__()` interface of NumPy's ndarray, so NumPy's array-processing functions can be applied to a Series directly. **Besides positional subscripts, a Series also accepts index labels as subscripts. Every Series object is in fact made up of two arrays:**

- index: an Index object (derived from ndarray) holding the label information. If no index is given when the Series is created, a positional index is generated automatically.
- values: the ndarray holding the element values; NumPy functions operate on this array.
```python
import pandas as pd
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s.index
```
```
Index(['a', 'b', 'c', 'd'], dtype='object')
```
```python
s.values
```
```
array([1, 2, 3, 4], dtype=int64)
```
```python
s[1]      # by position
```
```
2
```
```python
s['a']    # by label
```
```
1
```
```python
s[[1, 2, 3]]
```
```
b    2
c    3
d    4
dtype: int64
```

**A Series combines the behavior of an array and a dictionary, so it also supports some dictionary methods.**
```python
list(s.iteritems())   # renamed to s.items() in newer pandas
```
```
[('a', 1), ('b', 2), ('c', 3), ('d', 4)]
```

When an operator is applied to two Series objects, pandas aligns the elements by label; that is, the operation is carried out on pairs of elements that share the same label.
```python
b = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])
b - s
```
```
a    0
b    1
c    2
d    3
dtype: int64
```

## The DataFrame object

### Components of a DataFrame

![](http://ww1.sinaimg.cn/large/699db930ly1fs0gfr21vzj21bp0u00wf.jpg)
```python
s = pd.read_csv('data.txt', parse_dates=[0], header=None)
s.dtypes
```
```
0    datetime64[ns]
1           float64
dtype: object
```
```python
s.shape
```
```
(14, 2)
```
```python
s.columns
```
```
Int64Index([0, 1], dtype='int64')
```
```python
s[0]
```
```
0    2016-01-01
1    2015-01-01
2    2014-01-01
3    2013-01-01
4    2012-01-01
5    2011-01-01
6    2009-01-01
7    2008-01-01
8    2007-01-01
9    2006-01-01
10   2005-01-01
11   2004-01-01
12   2003-01-01
13   2002-01-01
Name: 0, dtype: datetime64[ns]
```
```python
s.loc[1:3]
```

|   | 0 | 1 |
|---|---|---|
| 1 | 2015-01-01 | 13.5 |
| 2 | 2014-01-01 | 10.5 |
| 3 | 2013-01-01 | 10.5 |

The [] operator selects columns by column label or column number; .loc[] selects rows by row label or row number.

### Converting in-memory data to a DataFrame

Calling DataFrame() converts data in many formats into a DataFrame object. Its three parameters data, index and columns give the data, the row index and the column index. The data parameter can be:

- a 2-D array, or a nested list that can be converted to one
- a dictionary: each key/value pair becomes one column of the DataFrame; the values may be 1-D arrays, lists or Series objects
```python
import numpy as np
pd.DataFrame(np.random.randint(1, 10, (4, 2)), index=['a', 'b', 'c', 'd'], columns=['键', 'value'])
```

|   | 键 | value |
|---|---|---|
| a | 3 | 5 |
| b | 7 | 9 |
| c | 1 | 5 |
| d | 3 | 5 |

You can also call the classmethods whose names start with from_ to convert data in specific formats into a DataFrame. from_dict() converts a dictionary; its orient parameter specifies the axis that the dictionary keys map to:

- the default is 'columns': the dictionary keys become the column index, i.e. each value in the dictionary corresponds to one column
- with 'index', each value in the dictionary corresponds to one row
- for a nested dictionary, the index along the other axis is determined by the keys of the inner dictionaries

```python
dict1 = {'a': [1, 2, 3], 'b': [4, 5, 6]}
dict2 = {'a': {'l': 1, 'o': 2, 'v': 3}, 'b': {'y': 4, 'o': 5, 'u': 4}}
pd.DataFrame.from_dict(dict1, orient='index')
```

|   | 0 | 1 | 2 |
|---|---|---|---|
| a | 1 | 2 | 3 |
| b | 4 | 5 | 6 |
```python
pd.DataFrame.from_dict(dict1, orient='columns')
```

|   | a | b |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |
```python
pd.DataFrame.from_dict(dict2, orient='index')
```

|   | l | o | v | y | u |
|---|---|---|---|---|---|
| a | 1.0 | 2 | 3.0 | NaN | NaN |
| b | NaN | 5 | NaN | 4.0 | 4.0 |
```python
pd.DataFrame.from_dict(dict2, orient='columns')
```

|   | a | b |
|---|---|---|
| l | 1.0 | NaN |
| o | 2.0 | 5.0 |
| u | NaN | 4.0 |
| v | 3.0 | NaN |
| y | NaN | 4.0 |

from_items() converts a sequence of (key, value) pairs into a DataFrame, where each value is a list, array or Series of 1-D data; the orient parameter works the same way as in from_dict().

```python
items = dict1.items()
pd.DataFrame.from_items(items, orient='columns')
```

|   | a | b |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |
```python
pd.DataFrame.from_items(items, orient="index", columns=['1', '2', '3'])  # when orient='index', the columns argument is required
```

|   | 1 | 2 | 3 |
|---|---|---|---|
| a | 1 | 2 | 3 |
| b | 4 | 5 | 6 |
### Converting a DataFrame to other formats

The to_dict() method converts a DataFrame into a dictionary; its orient parameter determines the type of the dictionary's elements.
```python
df = pd.DataFrame.from_dict(dict2, orient='index')
df.to_dict(orient='records')  # each row becomes a dictionary, giving a list of dictionaries
```
```
[{'l': 1.0, 'o': 2.0, 'v': 3.0, 'y': nan, 'u': nan},
 {'l': nan, 'o': 5.0, 'v': nan, 'y': 4.0, 'u': 4.0}]
```
```python
df.to_dict(orient='list')  # dictionary of lists: each column becomes a list keyed by its column label
```
```
{'l': [1.0, nan], 'o': [2, 5], 'v': [3.0, nan], 'y': [nan, 4.0], 'u': [nan, 4.0]}
```
```python
df.to_dict(orient='dict')  # nested dictionary: outer keys are column labels, inner keys are row labels
```
```
{'l': {'a': 1.0, 'b': nan}, 'o': {'a': 2, 'b': 5}, 'v': {'a': 3.0, 'b': nan},
 'y': {'a': nan, 'b': 4.0}, 'u': {'a': nan, 'b': 4.0}}
```

to_records() converts the DataFrame into a structured array; if its index parameter is True, the returned array also contains the row-index data:
```python
df.to_records()
```
```
rec.array([('a', 1., 2, 3., nan, nan), ('b', nan, 5, nan, 4., 4.)],
          dtype=[('index', 'O'), ('l', '…
```
```python
df.to_records(index=False)
```
```
rec.array([( 1., 2, 3., nan, nan), (nan, 5, nan, 4., 4.)], dtype=[('l', '…
```

## The Index object

An Index object stores index labels and can quickly map a label to its integer position, much like a Python dictionary. Its values attribute returns the array of labels; as with Series, strings are stored in an object-dtype array.
```python
index = df.columns
index.values
```
```
array(['l', 'o', 'v', 'y', 'u'], dtype=object)
```

**An Index object can be treated as a 1-D array: the same subscript operations as on NumPy arrays yield a new Index object. However, an Index object is read-only and cannot be modified once created.**
```python
index[index > 'o']
```
```
Index(['v', 'y', 'u'], dtype='object')
```
```python
index[[1, 3]]
```
```
Index(['o', 'y'], dtype='object')
```
```python
index[1::2]
```
```
Index(['o', 'y'], dtype='object')
```

**An Index object also provides dictionary-style mapping from values to positions:**

- Index.get_loc(value): get the position of the single value `value`
- Index.get_indexer(values): get the positions of a group of values; values that are absent map to -1
```python
index.get_loc('v')
```
```
2
```
```python
index.get_indexer(['v', 'o'])
```
```
array([2, 1], dtype=int64)
```

## The MultiIndex object

MultiIndex represents a multi-level index. It derives from Index, and its multi-level labels are presented as tuples. Internally a MultiIndex does not store tuple objects directly; instead it uses one Index object per level to store that level's labels.
```python
mindex = pd.Index([('a', 'x'), ('a', 'y'), ('b', 'x'), ('b', 'y')])
mindex.levels[0]
```
```
Index(['a', 'b'], dtype='object')
```
```python
mindex.levshape
```
```
(2, 2)
```

When a list of tuples is passed to Index(), a MultiIndex object is created automatically; to instead create an Index whose elements are tuples, set the tupleize_cols parameter to False. **In addition, the classmethods starting with from_ create a MultiIndex from specific data structures** (a sketch of one of them follows, then the from_arrays() example).
```python
class1 = ['a', 'b', 'c', 'd']
class2 = ['1', '2', '3', '4']
pd.MultiIndex.from_arrays([class1, class2])
```
```
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['1', '2', '3', '4']],
           labels=[[0, 1, 2, 3], [0, 1, 2, 3]])
```

## Commonly used function parameters

![](http://ww1.sinaimg.cn/large/699db930ly1fs2f5yifmsj23281qye82.jpg)

# Subscript access

![](http://ww1.sinaimg.cn/large/699db930ly1fs2f8ib6rsj23282aohdu.jpg)

## The [] operator

When accessing a DataFrame through the [] operator, the following subscript objects are supported:

- a single index label: selects the corresponding column and returns a Series
- multiple index labels (a list or array of labels): selects the corresponding columns and returns a DataFrame
- an integer slice: selects the corresponding rows by position
- a label slice: selects rows by label; the end value is included
- a boolean array: selects the rows where the value is True
- a boolean DataFrame: sets the elements corresponding to False to NaN
```python
df = pd.DataFrame(np.random.randint(0, 10, (4, 3)), index=['a', 'b', 'c', 'd'], columns=['1', '2', '3'])
df[2:4]  # select rows by integer slice
```
|   | 1 | 2 | 3 |
|---|---|---|---|
| c | 5 | 8 | 7 |
| d | 8 | 0 | 0 |
```python
df['1']
```
```
a    2
b    8
c    5
d    8
Name: 1, dtype: int32
```
```python
df[['1', '2']]
```

|   | 1 | 2 |
|---|---|---|
| a | 2 | 2 |
| b | 8 | 0 |
| c | 5 | 8 |
| d | 8 | 0 |
```python
df[df['1'] > 2]  # boolean array
```

|   | 1 | 2 | 3 |
|---|---|---|---|
| b | 8 | 0 | 6 |
| c | 5 | 8 | 7 |
| d | 8 | 0 | 0 |
```python
df[df > 2]  # elements that fail the condition are set to NaN
```

|   | 1 | 2 | 3 |
|---|---|---|---|
| a | NaN | NaN | NaN |
| b | 8.0 | NaN | 6.0 |
| c | 5.0 | 8.0 | 7.0 |
| d | 8.0 | NaN | NaN |
## The .loc[] and .iloc[] accessors

The subscript of .loc[] is a tuple whose two elements correspond to the two axes of the DataFrame. If the subscript is not a tuple, it applies to axis 0, with : implied for axis 1. The subscript for each axis supports a single label, a list of labels, a label slice and a boolean array.
```python
df.loc['a']
```
```
1    2
2    2
3    0
Name: a, dtype: int32
```
```python
df.loc['a', '1']
```
```
2
```

.iloc[] works like .loc[], except that it uses integer positions.
```python
df.iloc[0]  # select a row
```
```
1    2
2    2
3    0
Name: a, dtype: int32
```
```python
df.iloc[[1, 3]]
```

|   | 1 | 2 | 3 |
|---|---|---|---|
| b | 8 | 0 | 6 |
| d | 8 | 0 | 0 |

```python
df.iloc[[1, 3], 2]
```
```
b    6
d    0
Name: 3, dtype: int32
```
```python
df.iloc[:, 1]
```
```
a    2
b    0
c    8
d    0
Name: 2, dtype: int32
```

In addition, .ix[] can mix labels and positional subscripts.
```python
df.ix[1:3, '1']
```
```
DeprecationWarning: .ix is deprecated. Please use .loc for label based indexing
or .iloc for positional indexing.
b    8
c    5
Name: 1, dtype: int32
```

## Getting a single value

.at[] and .iat[] fetch a single value by label and by integer position respectively. In addition, get_value() works like .at[], but runs a bit faster.
```python
df.at['a', '1']
```
```
2
```
```python
df.iat[1, 2]
```
```
6
```

## Access with multi-level labels

The subscripts of .loc[] and .at[] can specify a label for each level of a multi-level index. In that case the subscript for the multi-indexed axis is itself a tuple, each element of which corresponds to one level of the index. A non-tuple subscript is converted into a tuple of length 1, and if the tuple has fewer elements than the index has levels, it is padded with slice(None).
```python
df.loc[['a', 'b'], ['1', '3']]
```

|   | 1 | 3 |
|---|---|---|
| a | 2 | 0 |
| b | 8 | 6 |
## The query() method

When rows have to be filtered by some condition, the usual approach is to first build a boolean array and then use it to select the rows where it is True; query() lets you write the condition as an expression string instead.
```python
df.query("'1' > 2")
```
```
TypeError: data type "" not understood
```

The call fails because the quoted '1' is parsed as a string literal, not as the column named '1'; column names that are not valid Python identifiers cannot be referenced inside a query() expression. Renaming the columns first, e.g. df.rename(columns={'1': 'a'}), makes df.query('a > 2') behave like df[df['a'] > 2].

# File input and output

![](http://ww1.sinaimg.cn/large/699db930ly1fs2llr0kmrj232811gb29.jpg)

## CSV files

read_csv() reads data from a text file. It has a great many optional parameters; some commonly used ones:

- sep specifies the field separator and may be a regular expression; the default is a comma. Some CSV files add spaces after the separator to align the columns for readability; to ignore those spaces, set skipinitialspace to True.
- If the data are separated by whitespace or tabs, you can leave sep unset and set delim_whitespace to True instead.
- By default the first line of text is taken as the column labels; if the file has no header row, set header to None.
- If the file starts with some explanatory lines, use skiprows to give the row number where the data begin.
- na_values, true_values and false_values specify the lists of strings that stand for NaN, True and False.
- To read data from a string, wrap it into an input stream with io.BytesIO(string).
- **To convert strings to timestamps, use parse_dates to name the columns to be parsed as dates.**
- **If the file contains Chinese text, use encoding to specify the file's encoding.**
- **usecols selects the columns to read.**
- When the file is large, chunksize sets how many rows are read at a time; with chunksize, read_csv() returns an iterator (see the sketch after the example below).
- When the file name contains Chinese characters, pass it as a Unicode string.
```python
import pandas as pd
df = pd.read_csv('data.txt', encoding='utf-8', parse_dates=[0], na_values=['-', ' '], header=0)
df.columns = ['time', 'ratio']
df
```
|   | time | ratio |
|---|---|---|
| 0 | 2015-01-01 | 13.5 |
| 1 | 2014-01-01 | 10.5 |
| 2 | 2013-01-01 | 10.5 |
| 3 | 2012-01-01 | 10.5 |
| 4 | 2011-01-01 | 10.7 |
| 5 | 2009-01-01 | 12.6 |
| 6 | 2008-01-01 | 12.9 |
| 7 | 2007-01-01 | 12.7 |
| 8 | 2006-01-01 | 14.3 |
| 9 | 2005-01-01 | 13.7 |
| 10 | 2004-01-01 | 14.1 |
| 11 | 2003-01-01 | 14.3 |
| 12 | 2002-01-01 | 13.9 |
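As promised in the parameter list above, a minimal sketch of chunked reading (assuming the same data.txt as in the example):

```python
# with chunksize, read_csv() yields DataFrames of at most `chunksize` rows each
for chunk in pd.read_csv('data.txt', header=None, chunksize=5):
    print(chunk.shape)   # e.g. (5, 2), (5, 2), (4, 2) for a 14-row file
```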
## Reading and writing databases

to_sql() writes data into an SQL database. Its first argument is the table name; the second is an Engine object describing the database connection, as defined in the sqlalchemy library.

**pandas can read a table from a MySQL database directly via pymysql and write it out to Excel (this overwrites the existing Excel file!). When the query contains Chinese text, add use_unicode=True, charset="utf8".**

**sqlalchemy can connect to MySQL in several ways:**

- option 1: **engine = create_engine('mysql+pymysql://user:password@127.0.0.1:3306/dbname?charset=utf8')**
- option 2: **engine = create_engine('mysql+mysqldb://user:password@127.0.0.1:3306/dbname?charset=utf8')**

**df.to_sql(target_table, con=engine, schema=dbname, index=False, index_label=False, if_exists='append', chunksize=1000)**
```python
import tushare as ts
import pandas as pd
from sqlalchemy import create_engine

df = ts.get_hist_data('000875')  # fetch the data as a DataFrame
engine = create_engine('mysql://root:luozong@localhost/luozong?charset=utf8')  # create the engine with sqlalchemy
df.to_sql('tick_data', engine, if_exists='append')  # write to the database; occasionally fails on the first run but succeeds on the second, for reasons unknown
df1 = pd.read_sql('tick_data', engine)  # read the table back from the database as a DataFrame
```
```
Warning: (1366, "Incorrect string value: '\\xD6\\xD0\\xB9\\xFA\\xB1\\xEA...' for column 'VARIABLE_VALUE' at row 497")
Warning: (1287, "'@@tx_isolation' is deprecated and will be removed in a future release. Please use '@@transaction_isolation' instead")
```

**Using pymysql to connect to a MySQL database and read data:**
```python
import pandas as pd
import pymysql

a = pymysql.connect(host='localhost', port=3306, user='root', passwd='luozong',
                    db='world', use_unicode=True, charset="utf8")

sql = 'select %s from %s where CountryCode=%s ' % ('*', 'city', '"CHN"')

d = pd.read_sql(sql, con=a)

a.close()

with pd.ExcelWriter(r'C:\Users\Administrator\Desktop\1.xls') as writer:
    d.to_excel(writer)
```
```python
import matplotlib.pyplot as pl

df1 = df1[600:]
y1 = df1['high']
y2 = df1['close']
y3 = df1['low']
pl.figure(figsize=(16, 9))
pl.plot(df1['date'], y1, label='high price')
pl.plot(df1['date'], y2, label='close price', color='red')
pl.plot(df1['date'], y3, label='low price')
pl.legend()
pl.show()
```

# Numerical functions

Series and DataFrame objects support NumPy's data interface, so NumPy's ufuncs can be applied to them directly. They also provide computation methods such as max(), min(), mean() and std(), which share three common parameters:

- axis: the axis along which to compute
- level: the index level on which to operate
- skipna: whether to skip NaN values

Besides the arithmetic operators, pandas provides add(), sub(), mul(), div(), mod() and other methods corresponding to the binary operators. Their behavior can be controlled through the axis, level and fill_value parameters; fill_value supplies a default value for elements that are missing or NaN, as in the sketch below.
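A minimal sketch of fill_value (not from the original notes):

```python
# without fill_value, labels present in only one operand produce NaN
sa = pd.Series([1, 2], index=['a', 'b'])
sb = pd.Series([10, 20], index=['b', 'c'])
sa + sb                   # a  NaN,  b  12.0,  c  NaN
sa.add(sb, fill_value=0)  # a  1.0,  b  12.0,  c  20.0
```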
```python
df2 = df1[:50]
df2.mean()
```
```
open                  9.76140
high                 10.02300
close                 9.79260
low                   9.55260
volume           871823.83340
price_change          0.01260
p_change              0.18680
ma5                   9.75968
ma10                  9.73068
ma20                  9.65476
v_ma5            871689.65920
v_ma10           885206.66500
v_ma20           908107.45520
dtype: float64
```
```python
df2.max()
```
```
date            2015-12-17
open                 10.71
high                 10.95
close                10.71
low                  10.28
volume         2.06881e+06
price_change          0.96
p_change              9.99
ma5                 10.398
ma10                10.156
ma20                 9.936
v_ma5          1.62152e+06
v_ma10         1.34618e+06
v_ma20         1.06599e+06
dtype: object
```

pandas also provides the rolling_*() functions, which run moving-window computations over N adjacent elements of a sequence.
```python
import sys
import numpy as np

t = np.linspace(0, 10, 40)
x = np.sin(1 * np.pi * t)
x[np.random.randint(0, len(t), 40)] += np.random.normal(0, 0.3, 40)
s = pd.Series(x, index=t)
s_mean = pd.rolling_mean(s, 5, center=True)
s_median = pd.rolling_median(s, 5, center=True)
pl.plot(t, x, t, s_mean, t, s_median)
```
```
FutureWarning: pd.rolling_mean is deprecated for Series and will be removed in a
future version, replace with Series.rolling(window=5,center=True).mean()
FutureWarning: pd.rolling_median is deprecated for Series and will be removed in a
future version, replace with Series.rolling(window=5,center=True).median()

[<matplotlib.lines.Line2D at 0x2d212b26828>,
 <matplotlib.lines.Line2D at 0x2d212b26cc0>,
 <matplotlib.lines.Line2D at 0x2d212b26e10>]
```
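Following the warnings above, the modern spelling of the same computation:

```python
# Series.rolling() replaces the deprecated pd.rolling_* functions
s_mean = s.rolling(window=5, center=True).mean()
s_median = s.rolling(window=5, center=True).median()
```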

The expanding_*() functions run expanding-window computations over a sequence; for example, expanding_max() returns, at each element, the historical maximum up to that point.

```python
x = np.cumsum(np.random.randn(400))
x_max = pd.expanding_max(x)
x_min = pd.expanding_min(x)
x_mean = pd.expanding_mean(x)
pl.plot(x)
pl.plot(x_min)
pl.plot(x_max)
pl.plot(x_mean)
pl.show()
```
```
FutureWarning: pd.expanding_max is deprecated for ndarrays and will be removed in a future version
FutureWarning: pd.expanding_min is deprecated for ndarrays and will be removed in a future version
FutureWarning: pd.expanding_mean is deprecated for ndarrays and will be removed in a future version
```
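As with the rolling functions, the expanding computations have a method-style spelling; this sketch assumes wrapping the array in a Series first:

```python
# Series.expanding() replaces the deprecated pd.expanding_* functions
xs = pd.Series(x)
x_max = xs.expanding().max()
x_min = xs.expanding().min()
x_mean = xs.expanding().mean()
```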

# String handling

Series objects also provide a large number of string-processing functions. Because there are so many, pandas wraps these string-related methods in a namespace-like accessor, str.
```python
df_abc = df2.columns
df_abc.str.upper()
```
```
Index(['DATE', 'OPEN', 'HIGH', 'CLOSE', 'LOW', 'VOLUME', 'PRICE_CHANGE',
       'P_CHANGE', 'MA5', 'MA10', 'MA20', 'V_MA5', 'V_MA10', 'V_MA20'],
      dtype='object')
```

Python has two kinds of strings: byte strings and Unicode strings. str.decode() decodes a byte string into a Unicode string using the specified encoding.
```python
s_utf8 = pd.Series(['北京', '北京市', '北京地区'])
s_unicode = s_utf8.str.decode('utf-8')
s_gb2312 = s_utf8.str.encode('gb2312')
s_utf8.str.len()
```
```
0    2
1    3
2    4
dtype: int64
```

A sequence of strings supports addition and multiplication, just like individual strings.
```python
s_abc = pd.Series(['a', 'b', 'c'])
s_utf8 + '-' + s_abc * 2
```
```
0      北京-aa
1     北京市-bb
2    北京地区-cc
dtype: object
```

![](http://ww1.sinaimg.cn/large/699db930ly1fs2wwzoi98j20u01400u9.jpg)
![](http://ww1.sinaimg.cn/large/699db930ly1fs2wydgyzhj20u0140766.jpg)
![](http://ww1.sinaimg.cn/large/699db930ly1fs2wyrbnstj20u0140gnb.jpg)
![](http://ww1.sinaimg.cn/large/699db930ly1fs2wyxrijkj21400u0dgm.jpg)

# Time series

pandas provides three time-related types, representing time points, time periods and time intervals, along with index objects whose elements are of these types, and many functions for working with time series.

## Time points, periods and intervals

### Time points

The Timestamp object derives from the datetime class in the Python standard library and represents a moment on the time axis. It offers convenient time-zone conversion.
```python
now = pd.Timestamp.now()
now_shanghai = now.tz_localize('Asia/Shanghai')
now_shanghai
```
```
Timestamp('2018-06-08 08:36:33.781159+0800', tz='Asia/Shanghai')
```
```python
now_shanghai.tz_convert('Asia/Tokyo')
```
```
Timestamp('2018-06-08 09:36:33.781159+0900', tz='Asia/Tokyo')
```

**The common_timezones list in the pytz module gives the usual strings for naming time zones.**
```python
import pytz
pytz.common_timezones
```
```
['Africa/Abidjan', 'Africa/Accra', 'Africa/Addis_Ababa', 'Africa/Algiers',
 'Africa/Asmara', 'Africa/Bamako', 'Africa/Bangui', 'Africa/Banjul',
 ...
 'US/Alaska', 'US/Arizona', 'US/Central', 'US/Eastern', 'US/Hawaii',
 'US/Mountain', 'US/Pacific', 'UTC']
```

### Time periods

A Period object represents a standard period of time, such as a year, a month, a day or an hour. The length of the period is determined by its freq attribute.
```python
pd.Period.now(freq='D')
```
```
Period('2018-06-08', 'D')
```
```python
pd.Period.now(freq='H')
```
```
Period('2018-06-08 08:00', 'H')
```

The freq attribute is a string describing the period; its possible values can be obtained with the following code:
```python
from pandas.tseries import frequencies
frequencies._period_code_map.keys()
```
```
dict_keys(['A-DEC', 'A-JAN', 'A-FEB', 'A-MAR', 'A-APR', 'A-MAY', 'A-JUN', 'A-JUL',
 'A-AUG', 'A-SEP', 'A-OCT', 'A-NOV', 'Q-DEC', 'Q-JAN', 'Q-FEB', 'Q-MAR', 'Q-APR',
 'Q-MAY', 'Q-JUN', 'Q-JUL', 'Q-AUG', 'Q-SEP', 'Q-OCT', 'Q-NOV', 'M', 'W-SUN',
 'W-MON', 'W-TUE', 'W-WED', 'W-THU', 'W-FRI', 'W-SAT', 'B', 'D', 'H', 'T', 'S',
 'L', 'U', 'N', 'Q', 'A', 'W', 'C'])
```

**The start and end of a period are available through the start_time and end_time attributes; both are Timestamp objects representing time points.**
```python
now_day = pd.Period.now(freq='D')
now_day.start_time
```
```
Timestamp('2018-06-08 00:00:00')
```
```python
now_day.end_time
```
```
Timestamp('2018-06-08 23:59:59.999999999')
```

**Calling a Timestamp object's to_period() method converts the time point into the period containing it. Periods carry no time-zone information.**
```python
now_shanghai.to_period('H')
```
```
Period('2018-06-08 08:00', 'H')
```
```python
now_shanghai.to_datetime()
```
```
FutureWarning: to_datetime is deprecated. Use self.to_pydatetime()
datetime.datetime(2018, 6, 8, 8, 36, 33, 781159, tzinfo=…
```
```python
now.day
```
```
8
```
```python
now.year
```
```
2018
```
```python
now.month
```
```
6
```
```python
now.dayofweek
```
```
4
```
```python
now.dayofyear
```
```
159
```

### Time intervals

Subtracting two time points gives a Timedelta object representing the interval between them.
```python
nation_day = pd.Timestamp('2015-10-1')
td = now - nation_day
```

**Time points and time intervals can be added and subtracted.**

```python
nation_day + td
```
```
Timestamp('2018-06-08 08:36:33.781159')
```

**A Timedelta object's days, seconds, microseconds and nanoseconds attributes give the days, seconds, microseconds and nanoseconds it contains.**
```python
td.days
```
```
981
```
```python
td.seconds
```
```
30993
```
```python
td.components
```
```
Components(days=981, hours=8, minutes=36, seconds=33, milliseconds=781, microseconds=159, nanoseconds=0)
```
```python
tuple(td.components)
```
```
(981, 8, 36, 33, 781, 159, 0)
```

## Time series

The date_range() call inside random_timestamps() below creates a DatetimeIndex that starts at start, ends at end, and has period freq.
```python
def random_timestamps(start, end, freq, count):
    index = pd.date_range(start, end, freq=freq)  # all candidate times from start to end
    locations = np.random.choice(np.arange(len(index)), size=count, replace=False)  # pick `count` of them at random
    locations.sort()
    return index[locations]

np.random.seed(42)
ts_index = random_timestamps("2015-01-01", "2015-10-01", freq="Min", count=5)
pd_index = ts_index.to_period("M")
td_index = pd.TimedeltaIndex(np.diff(ts_index))

print(ts_index, "\n")
print(pd_index, "\n")
print(td_index, "\n")
```
```
DatetimeIndex(['2015-01-15 16:12:00', '2015-02-15 08:04:00', '2015-02-28 12:30:00',
               '2015-08-06 02:40:00', '2015-08-18 13:13:00'],
              dtype='datetime64[ns]', freq=None)

PeriodIndex(['2015-01', '2015-02', '2015-02', '2015-08', '2015-08'],
            dtype='period[M]', freq='M')

TimedeltaIndex(['30 days 15:52:00', '13 days 04:26:00', '158 days 14:10:00',
                '12 days 10:33:00'],
               dtype='timedelta64[ns]', freq=None)
```

All three index types provide many time-related attributes.
```python
ts_index.weekday
```
```
Int64Index([3, 6, 5, 3, 1], dtype='int64')
```
```python
pd_index.month
```
```
Int64Index([1, 2, 2, 8, 8], dtype='int64')
```
```python
td_index.seconds
```
```
Int64Index([57120, 15960, 51000, 37980], dtype='int64')
```

**DatetimeIndex.shift(n, freq) moves the time points by n units of freq. For exact units such as days or hours, this is equivalent to adding the corresponding interval.**
```python
ts_index.shift(1, freq='H')
```
```
DatetimeIndex(['2015-01-15 17:12:00', '2015-02-15 09:04:00', '2015-02-28 13:30:00',
               '2015-08-06 03:40:00', '2015-08-18 14:13:00'],
              dtype='datetime64[ns]', freq=None)
```

**For inexact units such as months, shifting by one unit moves to the beginning or the end of the month.**

```python
ts_index.shift(1, freq='M')
```
```
DatetimeIndex(['2015-01-31 16:12:00', '2015-02-28 08:04:00', '2015-03-31 12:30:00',
               '2015-08-31 02:40:00', '2015-08-31 13:13:00'],
              dtype='datetime64[ns]', freq=None)
```

DatetimeIndex.normalize() sets each time to midnight of its day; think of it as rounding down to the date.

```python
ts_index.normalize()
```
```
DatetimeIndex(['2015-01-15', '2015-02-15', '2015-02-28', '2015-08-06', '2015-08-18'],
              dtype='datetime64[ns]', freq=None)
```

**To round to an arbitrary period, first convert to a PeriodIndex with to_period() and then back to a DatetimeIndex with to_timestamp(). The how parameter of to_timestamp() decides whether the start or the end of each period becomes the time point; the default is 'start'.**
```python
ts_index.to_period("H").to_timestamp()
```
```
DatetimeIndex(['2015-01-15 16:00:00', '2015-02-15 08:00:00', '2015-02-28 12:00:00',
               '2015-08-06 02:00:00', '2015-08-18 13:00:00'],
              dtype='datetime64[ns]', freq=None)
```

A Series indexed by time is called a time series.
```python
ts = pd.Series(range(0, 5), index=ts_index)
ts
```
```
2015-01-15 16:12:00    0
2015-02-15 08:04:00    1
2015-02-28 12:30:00    2
2015-08-06 02:40:00    3
2015-08-18 13:13:00    4
dtype: int32
```

**Time series provide some methods dedicated to handling time. between_time() returns the data that fall within a given time-of-day range:**

```python
ts.between_time('9:00', '18:00')
```
```
2015-01-15 16:12:00    0
2015-02-28 12:30:00    2
2015-08-18 13:13:00    4
dtype: int32
```

**tshift() shifts the index by the specified amount of time:**
```python
ts.tshift(1, freq='M')
```
```
2015-01-31 16:12:00    0
2015-02-28 08:04:00    1
2015-03-31 12:30:00    2
2015-08-31 02:40:00    3
2015-08-31 13:13:00    4
dtype: int32
```

**Series indexed by a PeriodIndex or a TimedeltaIndex can likewise have their index shifted, using shift():**
```python
pd_series = pd.Series(range(0, 5), index=pd_index)
td_series = pd.Series(range(0, 4), index=td_index)
pd_series.shift(1, freq='M')
```
```
2015-02    0
2015-03    1
2015-03    2
2015-09    3
2015-09    4
Freq: M, dtype: int32
```
```python
td_series.shift(1, freq='H')
```
```
30 days 16:52:00     0
13 days 05:26:00     1
158 days 15:10:00    2
12 days 11:33:00     3
dtype: int32
```

## NaN-related functions

Pandas represents missing data with NaN. Since integer columns cannot hold NaN, an integer column that acquires missing data is automatically converted to floating point. Below, a boolean DataFrame is passed to an integer DataFrame's where() method, which sets the elements corresponding to False to NaN.
```python
data = np.random.randint(0, 10, (10, 3))
df_int = pd.DataFrame(data, columns=['A', 'B', 'C'])
dd = df_int.where(df_int > 2)
```

**isnull() and notnull() test whether each element is NaN; both return DataFrames of booleans.**
```python
dd.isnull()
```

|   | A | B | C |
|---|---|---|---|
| 0 | False | True | True |
| 1 | False | True | False |
| 2 | False | True | True |
| 3 | False | False | False |
| 4 | False | False | False |
| 5 | False | False | False |
| 6 | False | True | True |
| 7 | False | True | False |
| 8 | False | True | False |
| 9 | False | True | False |
```python
dd.notnull()
```

|   | A | B | C |
|---|---|---|---|
| 0 | True | False | False |
| 1 | True | False | True |
| 2 | True | False | False |
| 3 | True | True | True |
| 4 | True | True | True |
| 5 | True | True | True |
| 6 | True | False | False |
| 7 | True | False | True |
| 8 | True | False | True |
| 9 | True | False | True |

count() returns the number of non-NaN elements in each row or column.

```python
dd.count()
```
```
A    10
B     3
C     7
dtype: int64
```
```python
dd.count(axis=1)
```
```
0    1
1    2
2    1
3    3
4    3
5    3
6    1
7    2
8    2
9    2
dtype: int64
```

**For columns containing NaN, the simplest remedy is dropna(), which deletes the rows or columns that contain NaN. With all-default parameters it drops every row containing a NaN. The thresh parameter sets a threshold: only rows or columns with at least thresh non-NaN values are kept.**
```python
dd.dropna()
```

|   | A | B | C |
|---|---|---|---|
| 3 | 3 | 3.0 | 3.0 |
| 4 | 9 | 6.0 | 7.0 |
| 5 | 5 | 4.0 | 3.0 |
```python
dd.dropna(thresh=4, axis=1)  # keep only columns with at least 4 non-NaN values
```

|   | A | C |
|---|---|---|
| 0 | 8 | NaN |
| 1 | 8 | 3.0 |
| 2 | 6 | NaN |
| 3 | 3 | 3.0 |
| 4 | 9 | 7.0 |
| 5 | 5 | 3.0 |
| 6 | 5 | NaN |
| 7 | 3 | 9.0 |
| 8 | 6 | 4.0 |
| 9 | 8 | 7.0 |
```python
dd
```

|   | A | B | C |
|---|---|---|---|
| 0 | 8 | NaN | NaN |
| 1 | 8 | NaN | 3.0 |
| 2 | 6 | NaN | NaN |
| 3 | 3 | 3.0 | 3.0 |
| 4 | 9 | 6.0 | 7.0 |
| 5 | 5 | 4.0 | 3.0 |
| 6 | 5 | NaN | NaN |
| 7 | 3 | NaN | 9.0 |
| 8 | 6 | NaN | 4.0 |
| 9 | 8 | NaN | 7.0 |

When the rows are ordered by some physical quantity, missing values can be filled from the data around them. ffill() fills with the preceding data, bfill() fills with the following data, and interpolate() fills by interpolating between the surrounding data.

interpolate() defaults to equidistant linear interpolation; its method parameter selects the interpolation algorithm, chiefly: 'slinear', 'quadratic', 'cubic', 'barycentric', 'krogh', 'polynomial', 'spline', 'piecewise_polynomial', 'from_derivatives', 'pchip', 'akima' (a sketch of one of these appears after the examples below).


```python
dd.ffill()
```

|   | A | B | C |
|---|---|---|---|
| 0 | 8 | NaN | NaN |
| 1 | 8 | NaN | 3.0 |
| 2 | 6 | NaN | 3.0 |
| 3 | 3 | 3.0 | 3.0 |
| 4 | 9 | 6.0 | 7.0 |
| 5 | 5 | 4.0 | 3.0 |
| 6 | 5 | 4.0 | 3.0 |
| 7 | 3 | 4.0 | 9.0 |
| 8 | 6 | 4.0 | 4.0 |
| 9 | 8 | 4.0 | 7.0 |
```python
dd.bfill()
```

|   | A | B | C |
|---|---|---|---|
| 0 | 8 | 3.0 | 3.0 |
| 1 | 8 | 3.0 | 3.0 |
| 2 | 6 | 3.0 | 3.0 |
| 3 | 3 | 3.0 | 3.0 |
| 4 | 9 | 6.0 | 7.0 |
| 5 | 5 | 4.0 | 3.0 |
| 6 | 5 | NaN | 9.0 |
| 7 | 3 | NaN | 9.0 |
| 8 | 6 | NaN | 4.0 |
| 9 | 8 | NaN | 7.0 |
```python
dd.interpolate(method='index')  # interpolate against the index values
```

|   | A | B | C |
|---|---|---|---|
| 0 | 8 | NaN | NaN |
| 1 | 8 | NaN | 3.0 |
| 2 | 6 | NaN | 3.0 |
| 3 | 3 | 3.0 | 3.0 |
| 4 | 9 | 6.0 | 7.0 |
| 5 | 5 | 4.0 | 3.0 |
| 6 | 5 | 4.0 | 6.0 |
| 7 | 3 | 4.0 | 9.0 |
| 8 | 6 | 4.0 | 4.0 |
| 9 | 8 | 4.0 | 7.0 |
```python
dd.interpolate(method="linear")
```

|   | A | B | C |
|---|---|---|---|
| 0 | 8 | NaN | NaN |
| 1 | 8 | NaN | 3.0 |
| 2 | 6 | NaN | 3.0 |
| 3 | 3 | 3.0 | 3.0 |
| 4 | 9 | 6.0 | 7.0 |
| 5 | 5 | 4.0 | 3.0 |
| 6 | 5 | 4.0 | 6.0 |
| 7 | 3 | 4.0 | 9.0 |
| 8 | 6 | 4.0 | 4.0 |
| 9 | 8 | 4.0 | 7.0 |
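A hedged sketch of one of the non-linear methods listed earlier (assumption: most of them, including 'polynomial', require SciPy to be installed, and 'polynomial' needs an order argument):

```python
# second-order polynomial interpolation of the gaps
dd.interpolate(method='polynomial', order=2)
```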

In addition, fillna() can take a dictionary argument to fill the NaN values of different columns with different values.

```python
dd.fillna({"B": 5, "C": 6})
```

|   | A | B | C |
|---|---|---|---|
| 0 | 8 | 5.0 | 6.0 |
| 1 | 8 | 5.0 | 3.0 |
| 2 | 6 | 5.0 | 6.0 |
| 3 | 3 | 3.0 | 3.0 |
| 4 | 9 | 6.0 | 7.0 |
| 5 | 5 | 4.0 | 3.0 |
| 6 | 5 | 5.0 | 6.0 |
| 7 | 3 | 5.0 | 9.0 |
| 8 | 6 | 5.0 | 4.0 |
| 9 | 8 | 5.0 | 7.0 |

The skipna parameter of the various aggregation methods defaults to True, so NaN elements are ignored during computation. Note that each row or column is computed independently; to ignore any whole row that contains a NaN, call dropna() first.

```python
dd.sum()
```
```
A    61.0
B    13.0
C    36.0
dtype: float64
```
```python
dd.sum(skipna=False)
```
```
A    61.0
B     NaN
C     NaN
dtype: float64
```

combine_first(other) fills the NaN elements of df with other: each NaN element of df is replaced by the element of other at the corresponding label.

```python
other = pd.DataFrame(np.random.randint(0, 5, (10, 3)), columns=['A', 'B', 'C'])
dd.combine_first(other)
```

|   | A | B | C |
|---|---|---|---|
| 0 | 8 | 3.0 | 1.0 |
| 1 | 8 | 2.0 | 3.0 |
| 2 | 6 | 0.0 | 1.0 |
| 3 | 3 | 3.0 | 3.0 |
| 4 | 9 | 6.0 | 7.0 |
| 5 | 5 | 4.0 | 3.0 |
| 6 | 5 | 4.0 | 4.0 |
| 7 | 3 | 0.0 | 9.0 |
| 8 | 6 | 4.0 | 4.0 |
| 9 | 8 | 4.0 | 7.0 |

# Reshaping a DataFrame

A DataFrame's shape attribute works like that of a 2-D NumPy array: it is a tuple with two elements. Because both index and columns support MultiIndex, a DataFrame can also represent higher-dimensional data.

```python
d0 = df[:10]
d0 = d0[['open', 'high', 'close', 'low', 'volume']]
```
```python
d0.groupby(['open', 'volume']).mean()
```

The result is indexed by the (open, volume) MultiIndex; blank cells repeat the value above:

| open | volume | high | close | low |
|---|---|---|---|---|
| 3.10 | 94721.49 | 3.14 | 3.11 | 3.08 |
|  | 111745.67 | 3.13 | 3.12 | 3.09 |
| 3.11 | 71162.30 | 3.14 | 3.12 | 3.11 |
| 3.13 | 64214.61 | 3.15 | 3.14 | 3.12 |
|  | 98914.75 | 3.14 | 3.08 | 3.07 |
| 3.14 | 60980.01 | 3.14 | 3.12 | 3.11 |
| 3.22 | 195015.98 | 3.22 | 3.09 | 3.06 |
| 3.25 | 113690.00 | 3.27 | 3.25 | 3.22 |
| 3.27 | 99754.00 | 3.30 | 3.25 | 3.24 |
| 3.32 | 176011.23 | 3.32 | 3.28 | 3.26 |

## Adding rows or columns

Since a DataFrame can be viewed as a dictionary of Series, a new column can be added simply with DataFrame[colname] = values. When the new column is computed from existing columns, the eval() method can perform the computation.

```python
import numpy as np
df['new_col'] = df.eval('percent/100')
```
```python
df
```

|   | time | percent | new_col |
|---|---|---|---|
| 0 | 2015-01-01 | 13.5 | 0.135 |
| 1 | 2014-01-01 | 10.5 | 0.105 |
| 2 | 2013-01-01 | 10.5 | 0.105 |
| 3 | 2012-01-01 | 10.5 | 0.105 |
| 4 | 2011-01-01 | 10.7 | 0.107 |
| 5 | 2009-01-01 | 12.6 | 0.126 |
| 6 | 2008-01-01 | 12.9 | 0.129 |
| 7 | 2007-01-01 | 12.7 | 0.127 |
| 8 | 2006-01-01 | 14.3 | 0.143 |
| 9 | 2005-01-01 | 13.7 | 0.137 |
| 10 | 2004-01-01 | 14.1 | 0.141 |
| 11 | 2003-01-01 | 14.3 | 0.143 |
| 12 | 2002-01-01 | 13.9 | 0.139 |

The assign() method adds the columns given by its keyword arguments; it returns a new DataFrame and leaves the original data unchanged.

```python
df.assign(new_col2=100-df.percent)
```

|   | time | percent | new_col | new_col2 |
|---|---|---|---|---|
| 0 | 2015-01-01 | 13.5 | 0.135 | 86.5 |
| 1 | 2014-01-01 | 10.5 | 0.105 | 89.5 |
| 2 | 2013-01-01 | 10.5 | 0.105 | 89.5 |
| 3 | 2012-01-01 | 10.5 | 0.105 | 89.5 |
| 4 | 2011-01-01 | 10.7 | 0.107 | 89.3 |
| 5 | 2009-01-01 | 12.6 | 0.126 | 87.4 |
| 6 | 2008-01-01 | 12.9 | 0.129 | 87.1 |
| 7 | 2007-01-01 | 12.7 | 0.127 | 87.3 |
| 8 | 2006-01-01 | 14.3 | 0.143 | 85.7 |
| 9 | 2005-01-01 | 13.7 | 0.137 | 86.3 |
| 10 | 2004-01-01 | 14.1 | 0.141 | 85.9 |
| 11 | 2003-01-01 | 14.3 | 0.143 | 85.7 |
| 12 | 2002-01-01 | 13.9 | 0.139 | 86.1 |

append() adds rows. It has no inplace parameter and always returns a new object. Because every call to append() copies all the existing data, appending rows in a loop slows the program down; consider caching the chunks in a list instead and gluing them together along the desired axis with concat().

```python
def random_dataframe(n):
    columns = ["A", "B", "C"]
    for i in range(n):
        nrow = np.random.randint(10, 20)
        yield pd.DataFrame(np.random.randint(0, 100, size=(nrow, 3)), columns=columns)

df_list = list(random_dataframe(1000))
```
```python
%%time
df_res1 = pd.DataFrame([])
for df in df_list:
    df_res1 = df_res1.append(df)
```
```
Wall time: 574 ms
```
```python
%%time
df_res2 = pd.concat(df_list, axis=0)
```
```
Wall time: 129 ms
```

drop() deletes the rows or columns with the given labels.

```python
df.drop("A", axis=1)
```

|   | B | C |
|---|---|---|
| 0 | 29 | 88 |
| 1 | 45 | 15 |
| 2 | 50 | 81 |
| 3 | 86 | 81 |
| 4 | 88 | 25 |
| 5 | 21 | 42 |
| 6 | 32 | 82 |
| 7 | 2 | 59 |
| 8 | 24 | 62 |
| 9 | 49 | 14 |
| 10 | 80 | 95 |
| 11 | 67 | 18 |

## Converting between the row index and columns

reset_index() turns the index into columns; its level parameter chooses the levels to convert. To simply delete a level from the index instead, set the drop parameter to True.

```python
df = pd.read_csv('data.txt', header=None, index=0)
```
```
TypeError: parser_f() got an unexpected keyword argument 'index'
```

read_csv() has no index keyword, so the call above fails; dropping it reads the file correctly:

```python
df = pd.read_csv('data.txt', header=None)
```
```python
df.columns = ['time', 'percent']
df.index = df['time']
```
```python
df.reset_index(level='time', drop=True)  # deletes the existing index instead of turning it into a column
```

|   | time | percent |
|---|---|---|
| 0 | 2016 | 15.2 |
| 1 | 2015 | 13.5 |
| 2 | 2014 | 10.5 |
| 3 | 2013 | 10.5 |
| 4 | 2012 | 10.5 |
| 5 | 2011 | 10.7 |
| 6 | 2009 | 12.6 |
| 7 | 2008 | 12.9 |
| 8 | 2007 | 12.7 |
| 9 | 2006 | 14.3 |
| 10 | 2005 | 13.7 |
| 11 | 2004 | 14.1 |
| 12 | 2003 | 14.3 |
| 13 | 2002 | 13.9 |

set_index() converts a column into the row index. If its append parameter is False, the current index is discarded; if True, the column is added as a new level of the current index.

```python
df.set_index('time')
```

| time | percent |
|---|---|
| 2016 | 15.2 |
| 2015 | 13.5 |
| 2014 | 10.5 |
| 2013 | 10.5 |
| 2012 | 10.5 |
| 2011 | 10.7 |
| 2009 | 12.6 |
| 2008 | 12.9 |
| 2007 | 12.7 |
| 2006 | 14.3 |
| 2005 | 13.7 |
| 2004 | 14.1 |
| 2003 | 14.3 |
| 2002 | 13.9 |
```python
import matplotlib.pyplot as pl
pl.plot(df['percent'])
pl.show()
```

*(line plot of the percent column)*

## Converting between row and column indexes

The stack() method moves the specified level of the column index into the row index, while unstack() moves a row-index level into the column index.
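A hedged sketch (not from the original notes):

```python
# stack() moves the column labels into an inner row-index level; unstack() reverses it
df_su = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=['r1', 'r2'], columns=['c1', 'c2', 'c3'])
stacked = df_su.stack()   # a Series with a (row, column) MultiIndex
stacked.unstack()         # back to the original 2-D layout
```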

## Swapping index levels

reorder_levels() and swaplevel() exchange the index levels of the specified axis.
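Continuing the sketch above:

```python
# swap the two row-index levels of the stacked Series, then restore lexical order
stacked.swaplevel(0, 1).sort_index()
```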

## Pivot tables

pivot() takes three scattered columns of a DataFrame as the row index, the column index and the element values, and turns them into a 2-D table.

Its three parameters index, columns and values each accept a single column name. If values is not given, all remaining columns become value columns and the result has a multi-level column index.
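A hedged sketch (not from the original notes):

```python
# a long table with one row per (date, item) pair...
long_df = pd.DataFrame({'date': ['d1', 'd1', 'd2', 'd2'],
                        'item': ['x', 'y', 'x', 'y'],
                        'value': [1, 2, 3, 4]})
# ...becomes a 2-D table with dates as rows and items as columns
long_df.pivot(index='date', columns='item', values='value')
```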

# Group operations

A group operation splits the data into groups by some criterion, computes over each group, and then combines the results. In pandas, group operations are performed through the groupby() method of DataFrame and Series objects.

## The groupby() method

The axis parameter of groupby() defaults to 0, meaning the rows of the source data are grouped. Each row of the source corresponds to one element of the grouping data, and each unique value in the grouping data defines one group.

When the grouping data is a column of the source data, the column name can be passed directly:

```python
import pandas as pd
import numpy as np

dose = pd.read_csv('dose.csv')
tmt_groupby = dose.groupby('Tmt')
```
```python
print(tmt_groupby)
```
```
<pandas.core.groupby.DataFrameGroupBy object at 0x000001EB6DB94EF0>
```

A list can also be used to pass several grouping columns to groupby():

```python
double_dose = dose.groupby(['Tmt', 'Age'])
```

**When the grouping data is not part of the source data, it can be passed directly:**

```python
randvalue = np.random.randint(0, 10, dose.shape[0])
random_group = dose.groupby(randvalue)
```

When the grouping data can be computed from the source data's row index, the computing function can be passed to groupby():

```python
alternate_group = dose.groupby(lambda n: n % 3)
```

The three kinds of grouping data above can be combined freely:

```python
crazy_group = dose.groupby(['Gender', lambda n: n % 2, randvalue])
```

## The GroupBy object

*len() gives the number of groups:*

```python
len(tmt_groupby)
```
```
4
```

A GroupBy object supports iteration, much like a dictionary's iteritems() method: each iteration yields a group's key and its data. When several columns are used for grouping, each group's key is a tuple.

```python
for key, value in tmt_groupby:
    print(key, value.shape)
```
```
A (65, 6)
B (65, 6)
C (65, 6)
D (65, 6)
```

*get_group() retrieves the data corresponding to a given group key:*

```python
tmt_groupby.get_group('C')
```

|   | Dose | Response1 | Response2 | Tmt | Age | Gender |
|---|---|---|---|---|---|---|
| 0 | 50.0 | 9.872 | 10.032 | C | 60s | F |
| 2 | 25.0 | 0.626 | 0.803 | C | 50s | M |
| 3 | 25.0 | 1.372 | 1.557 | C | 60s | F |
| 4 | 15.0 | 0.010 | 0.020 | C | 60s | F |
| 7 | 20.0 | 0.038 | 0.033 | C | 60s | M |
| 14 | 100.0 | 9.710 | 11.246 | C | 50s | M |
| … | … | … | … | … | … | … |
| 249 | 30.0 | 4.350 | 3.873 | C | 40s | F |
| 259 | 100.0 | 10.169 | 10.201 | C | 40s | F |

65 rows × 6 columns

Subscripting a GroupBy object yields a new GroupBy object restricted to the given columns of the source data. This lets you group by some columns and then select other columns for the subsequent computation.

```python
tmt_groupby['Dose']
```
```
<pandas.core.groupby.SeriesGroupBy object at 0x000001EB6DEA3080>
```

GroupBy defines `__getattr__()`, so looking up an attribute that GroupBy itself does not define proceeds as follows:

- if the attribute name is the name of a column of the source data, it is equivalent to GroupBy[name], i.e. the GroupBy object for that column
- if the attribute name is a method of the source data object, it is equivalent to calling that method on each group via apply(). Note that pandas keeps a fixed set of methods that may be dispatched through apply(); only methods in that set are converted automatically

```python
tmt_groupby.Dose
```
```
<pandas.core.groupby.SeriesGroupBy object at 0x000001EB6E835BA8>
```

## Split-apply-combine

The agg(), transform(), filter() and apply() methods of a GroupBy object implement various group operations. The first parameter of each is a callback function that computes over each group's data and returns a result; each method then assembles the callbacks' results into the final outcome according to its own rule.

### agg(): aggregation

agg() performs an aggregation over each group's data. An aggregation turns a sequence of N values into a single value: sum, mean, median, even a random pick. The callback receives a Series holding one column of one group's data; if the callback cannot handle a Series, agg() next tries passing the whole group as a DataFrame. The callback aggregates its argument, turning a Series into a single value or a DataFrame into a Series. agg() returns a DataFrame whose row index consists of the group keys and whose column index is that of the source data.

```python
tmt_groupby.agg(np.mean)
```

| Tmt | Dose | Response1 | Response2 |
|---|---|---|---|
| A | 33.546154 | 6.728985 | 6.863185 |
| B | 33.546154 | 5.573354 | 5.456415 |
| C | 33.546154 | 4.040415 | 4.115323 |
| D | 33.546154 | 3.320646 | 3.188369 |
```python
tmt_groupby.agg(lambda df: df.loc[df.Response1.idxmax()])
```

| Tmt | Dose | Response1 | Response2 | Age | Gender |
|---|---|---|---|---|---|
| A | 80.0 | 11.226 | 10.132 | 60s | F |
| B | 100.0 | 10.824 | 10.158 | 50s | M |
| C | 60.0 | 10.490 | 11.218 | 50s | M |
| D | 80.0 | 10.911 | 9.854 | 60s | F |

### transform(): transformation

transform() transforms each group's data. As with agg(), it first tries passing each column's Series to the callback and falls back to passing the whole group as a DataFrame. The callback's result has the same shape as its argument, and transform() merges these results back in the order of the source data.

```python
tmt_groupby.transform(lambda s: s - s.mean()).head()
```

|   | Dose | Response1 | Response2 |
|---|---|---|---|
| 0 | 16.453846 | 5.831585 | 5.916677 |
| 1 | -18.546154 | -3.318646 | -3.184369 |
| 2 | -8.546154 | -3.414415 | -3.312323 |
| 3 | -8.546154 | -2.668415 | -2.558323 |
| 4 | -18.546154 | -4.030415 | -4.095323 |

### filter(): filtering

filter() evaluates a condition on each group. It passes each group's DataFrame to the callback, which returns True or False to decide whether the group is kept. The result is a DataFrame with some rows filtered out, in the same row order as the source data.

```python
tmt_groupby.filter(lambda df: df.Response1.max() < 11).head()
```

|   | Dose | Response1 | Response2 | Tmt | Age | Gender |
|---|---|---|---|---|---|---|
| 0 | 50.0 | 9.872 | 10.032 | C | 60s | F |
| 1 | 15.0 | 0.002 | 0.004 | D | 60s | F |
| 2 | 25.0 | 0.626 | 0.803 | C | 50s | M |
| 3 | 25.0 | 1.372 | 1.557 | C | 60s | F |
| 4 | 15.0 | 0.010 | 0.020 | C | 60s | F |

### apply(): application

apply() passes each group's DataFrame to the callback, collects the return values and merges them by some rule. apply() is very flexible and can reproduce the behavior of agg(), transform() and filter(). It chooses the merge strategy according to the type of the callback's return value, and this automatic choice occasionally produces puzzling results.

```python
tmt_groupby.apply(pd.DataFrame.max)
```

| Tmt | Dose | Response1 | Response2 | Tmt | Age | Gender |
|---|---|---|---|---|---|---|
| A | 100.0 | 11.226 | 10.745 | A | 60s | M |
| B | 100.0 | 10.824 | 10.340 | B | 60s | M |
| C | 100.0 | 10.490 | 11.246 | C | 60s | M |
| D | 100.0 | 10.911 | 9.863 | D | 60s | M |

```python
tmt_groupby.apply(pd.DataFrame.mean)
```

| Tmt | Dose | Response1 | Response2 |
|---|---|---|---|
| A | 33.546154 | 6.728985 | 6.863185 |
| B | 33.546154 | 5.573354 | 5.456415 |
| C | 33.546154 | 4.040415 | 4.115323 |
| D | 33.546154 | 3.320646 | 3.188369 |

*When the callback returns a DataFrame, the result differs depending on whether the returned frame's row labels are the same object as the argument's row labels:*

```python
group = tmt_groupby[['Response1', 'Response1']]
group.apply(lambda df: df - df.mean()).head()
```

|   | Response1 | Response1 |
|---|---|---|
| 0 | 5.831585 | 5.831585 |
| 1 | -3.318646 | -3.318646 |
| 2 | -3.414415 | -3.414415 |
| 3 | -2.668415 | -2.668415 |
| 4 | -4.030415 | -4.030415 |

```python
group.apply(lambda df: (df - df.mean())[:]).head()
```

| Tmt |   | Response1 | Response1 |
|---|---|---|---|
| A | 6 | -6.728985 | -6.728985 |
|  | 10 | -1.503985 | -1.503985 |
|  | 12 | -6.728985 | -6.728985 |
|  | 17 | -6.728985 | -6.728985 |
|  | 32 | 2.566015 | 2.566015 |

When the callback returns None, that return value is ignored, so apply() can also act as a filter:

```python
tmt_groupby.apply(lambda df: None if df.Response1.mean() < 5 else df.sample(2))
```

| Tmt |   | Dose | Response1 | Response2 | Tmt | Age | Gender |
|---|---|---|---|---|---|---|---|
| A | 93 | 40.0 | 9.787 | 10.103 | A | 60s | F |
|  | 157 | 50.0 | 10.317 | 10.465 | A | 50s | M |
| B | 170 | 0.1 | 0.000 | 0.000 | B | 40s | F |
|  | 89 | 20.0 | 4.199 | 4.733 | B | 60s | F |

# Data processing and visualization

## Analyzing the commit history of the pandas project

### Reading the data from the file

```python
def read_git_log(log_fn):
    """Parse `git log` output, yielding (author, datetime, message) tuples."""
    import io
    with io.open(log_fn, "r", encoding="utf8") as f:

        author = datetime = None
        message = []
        message_start = False
        for line in f:
            line = line.strip()
            if not line:
                continue

            if line.startswith("commit"):
                # a new commit begins: emit the previous one, if any
                if author is not None:
                    yield author, datetime, u"\n".join(message)
                    del message[:]
                message_start = False
            elif line.startswith("Author:"):
                author = line[line.index(":")+1 : line.index("<")].strip()
            elif line.startswith("Date:"):
                datetime = line[line.index(":")+1 :].strip()
                message_start = True
            elif message_start:
                message.append(line)
```

### Converting the generator output to a DataFrame

```python
df_commit = pd.DataFrame(read_git_log('pandas.log'), columns=['Author', 'DateString', 'Message'])
df_commit.shape
```
```
(17260, 3)
```

### Transforming the data

```python
df_commit.DateString.head()
```
```
0    Sat Jun 9 02:40:03 2018 +0300
1    Fri Jun 8 18:32:20 2018 -0500
2    Fri Jun 8 19:44:17 2018 +0200
3    Fri Jun 8 18:39:11 2018 +0100
4    Fri Jun 8 11:54:36 2018 -0500
Name: DateString, dtype: object
```
```python
df_commit['Date'] = pd.to_datetime(df_commit.DateString)
print(df_commit.head())
```
```
                  Author                     DateString  \
0           Pyry Kovanen  Sat Jun 9 02:40:03 2018 +0300
1               chris-b1  Fri Jun 8 18:32:20 2018 -0500
2  Joris Van den Bossche  Fri Jun 8 19:44:17 2018 +0200
3           Fábio Rosado  Fri Jun 8 18:39:11 2018 +0100
4         Tom Augspurger  Fri Jun 8 11:54:36 2018 -0500

                                             Message                Date
0  BUG: Fix empty Data frames to JSON round-tripp... 2018-06-08 23:40:03
1  BLD: include dll in package_data on Windows (#... 2018-06-08 23:32:20
2  REGR: allow merging on object boolean columns ... 2018-06-08 17:44:17
3  DOC: update multi-index term with MultiIndex (... 2018-06-08 17:39:11
4  Revert change to comparison op with datetime.d... 2018-06-08 16:54:36
```
```python
df_commit['TimeZone'] = df_commit.DateString.str[-5:]
import re
df_commit["Type"] = df_commit.Message.str.extract(r"^([A-Z/]{2,12})", flags=re.MULTILINE)
```
```
FutureWarning: currently extract(expand=None) means expand=False (return
Index/Series/DataFrame) but in a future version of pandas this will be changed
to expand=True (return DataFrame)
```
```python
df_commit.head(10)
```

|   | Author | DateString | Message | Date | TimeZone | Type |
|---|---|---|---|---|---|---|
| 0 | Pyry Kovanen | Sat Jun 9 02:40:03 2018 +0300 | BUG: Fix empty Data frames to JSON round-tripp… | 2018-06-08 23:40:03 | +0300 | BUG |
| 1 | chris-b1 | Fri Jun 8 18:32:20 2018 -0500 | BLD: include dll in package_data on Windows (#… | 2018-06-08 23:32:20 | -0500 | BLD |
| 2 | Joris Van den Bossche | Fri Jun 8 19:44:17 2018 +0200 | REGR: allow merging on object boolean columns … | 2018-06-08 17:44:17 | +0200 | REGR |
| 3 | Fábio Rosado | Fri Jun 8 18:39:11 2018 +0100 | DOC: update multi-index term with MultiIndex (… | 2018-06-08 17:39:11 | +0100 | DOC |
| 4 | Tom Augspurger | Fri Jun 8 11:54:36 2018 -0500 | Revert change to comparison op with datetime.d… | 2018-06-08 16:54:36 | -0500 | NaN |
| 5 | Damini Satya | Fri Jun 8 09:50:20 2018 -0700 | Fix #21356: JSON nested_to_record Silently Dro… | 2018-06-08 16:50:20 | -0700 | NaN |
| 6 | Joris Van den Bossche | Fri Jun 8 18:41:49 2018 +0200 | DOC: update whatsnew 0.23.1 (#21387) | 2018-06-08 16:41:49 | +0200 | DOC |
| 7 | Tom Augspurger | Fri Jun 8 11:27:13 2018 -0500 | REGR: NA-values in ctors with string dtype (#2… | 2018-06-08 16:27:13 | -0500 | REGR |
| 8 | Uddeshya Singh | Fri Jun 8 21:55:51 2018 +0530 | BUG: invalid rolling window on empty input (#2… | 2018-06-08 16:25:51 | +0530 | BUG |
| 9 | Dr. Irv | Fri Jun 8 07:34:33 2018 -0400 | BUG: Series.combine() fails with ExtensionArra… | 2018-06-08 11:34:33 | -0400 | BUG |
```python
df_commit.set_index('Date', drop=False, inplace=True)
df_commit.sort_index(inplace=True)
df_commit.head()
```




| Date | Author | DateString | Message | Date | TimeZone | Type |
|------|--------|------------|---------|------|----------|------|
| 2009-08-05 02:32:49 | Wes McKinney | Wed Aug 5 02:32:49 2009 +0000 | adding trunk\ngit-svn-id: http://pandas.google… | 2009-08-05 02:32:49 | +0000 | NaN |
| 2009-08-05 02:33:13 | Wes McKinney | Wed Aug 5 02:33:13 2009 +0000 | oops\ngit-svn-id: http://pandas.googlecode.com… | 2009-08-05 02:33:13 | +0000 | NaN |
| 2009-08-05 03:17:29 | Wes McKinney | Wed Aug 5 03:17:29 2009 +0000 | added svn:ignore\ngit-svn-id: http://pandas.go… | 2009-08-05 03:17:29 | +0000 | NaN |
| 2009-08-05 03:30:16 | Wes McKinney | Wed Aug 5 03:30:16 2009 +0000 | first commit with cleaned up code\ngit-svn-id:… | 2009-08-05 03:30:16 | +0000 | NaN |
| 2009-08-05 03:40:05 | Wes McKinney | Wed Aug 5 03:40:05 2009 +0000 | minor edit\ngit-svn-id: http://pandas.googleco… | 2009-08-05 03:40:05 | +0000 | NaN |

### Plotting the distribution of commit intervals

```python
timedelta = df_commit.Date.diff(1).dropna()
```




```python
hour_delta = timedelta.dt.days * 24 + timedelta.dt.seconds / 3600.0
ax = hour_delta.plot(kind='hist', figsize=(16, 9), bins=100, histtype='step', lw=4, range=(0, 5))
ax.set_xlabel('Hours')
```




Text(0.5,0,'Hours')

*(figure: histogram of intervals between commits, in hours)*

### Counting commits per week

```python
# resample(..., how='count') is deprecated; use .resample(...).count()
ax = df_commit.Author.resample('W').count().plot(kind='area', figsize=(16, 9), color='red', lw=4)
ax.set_xlabel('date')
```









Text(0.5,0,'date')

*(figure: weekly commit counts, area plot)*


### Plotting commits per month

```python
ax = df_commit.Author.resample('M').count().plot(kind='area', figsize=(16, 9), color='red', lw=4)
ax.set_xlabel('Month')
```









Text(0.5,0,'Month')

*(figure: monthly commit counts, area plot)*

The same plot can also be produced with groupby:

```python
df_commit['Period'] = df_commit.index.to_period('M')
```




```python
df_commit.groupby('Period').count().plot(kind='area', figsize=(16, 9), color='red', lw=4)
```




Text(0.5,17.2,'Month')

*(figure: monthly commit counts via groupby)*

### Counting commits per author

```python
df_commit.groupby('Author').count()['Message']
```




Author
3553x                   1
A. Flaxman              1
ARF                     1
Aaron Barber            1
Aaron Critchley         9
Aaron Schumacher        1
Aaron Staple            2
Aaron Toth              1
AbdealiJK               1
Abraham Flaxman        19
Acanthostega            1
Adam Chainz             1
Adam Gleave             1
Adam Greenhall          4
Adam Hooper             1
Adam J. Stewart         3
Adam Kim                1
Adam Klein            327
Adam Marcus             2
Adam Obeng              5
Adam Smith              1
AdamShamlian            1
Adrian                  1
Adrian Liaw             1
Adrien Emery            1
Agustín Herranz         1
Ajay Saxena             5
Akash Tandon            1
Alan Du                 1
Alan Velasco            5
                     ... 
tv3141                  1
tworec                  1
tzinckgraf              2
unknown                 3
unutbu                 18
verakai                 1
vytas                   1
waitingkuo              4
wandersoncferreira      1
watercrossing           1
wavedatalab             1
wcwagner                4
westurner               2
willayd                 1
xgdgsc                  1
xpvpc                   2
y-p                   943
yelite                  1
ysau                    1
yui-knk                 5
zach powers             3
zachcp                  1
zertrin                 1
zhanghui                1
zhangjinjie             1
znmean                  1
zzgao                   1
Åsmund Hjulstad         1
Øystein S. Haaland      1
颜发才(Yan Facai)          1
Name: Message, Length: 1379, dtype: int64
```python
s_Author = df_commit.Author.value_counts()  # an alternative way to get the counts
```





Below, crosstab() counts each author's commits per month; in the resulting df_counts, the row index is the month and the column index is the author.

```python
df_counts = pd.crosstab(df_commit.index.to_period('M'), df_commit.Author)
df_counts.index.name = 'Month'
df_counts.shape
```




(107, 1379)
Plot the monthly commit counts of the five most active authors:

```python
df_counts[s_Author.head(5).index].plot(kind='area', subplots=True, figsize=(8, 6), sharex=True, sharey=True)
```




array([<matplotlib.axes._subplots.AxesSubplot object at 0x000001EB75238898>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000001EB7556CB00>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000001EB75596B70>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000001EB755C1BE0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000001EB755EBC50>],
      dtype=object)

*(figure: monthly commit counts of the five most active authors)*

### Activity chart

```python
daily_commit = df_commit.index.to_period("D").value_counts()  # ❶ commits per day
daily_commit.index = pd.MultiIndex.from_arrays(
    [daily_commit.index.asfreq("W"),   # ❷ outer level: the week each day belongs to
     daily_commit.index.weekday])      #    inner level: the day of the week
daily_commit = daily_commit.sort_index()
active_data = daily_commit.unstack(0).iloc[:, -60:].fillna(0)  # ❸ weekday × week grid, last 60 weeks
```









```python
#%fig=activity chart of the pandas project
import matplotlib.pyplot as pl
fig, ax = pl.subplots(figsize=(15, 4))
ax.set_aspect("equal")
ax.pcolormesh(active_data.values, cmap="Greens",
              vmin=0, vmax=active_data.values.max() * 0.75)  # ❹ cap vmax so the busiest days don't wash out the rest

tick_locs = np.arange(3, 60, 10)
ax.set_xticks(tick_locs + 0.5)
ax.set_xticklabels(active_data.columns[tick_locs].to_timestamp(how="start").format())
ax.set_yticks(np.arange(7) + 0.5)

from pandas.tseries.frequencies import DAYS
ax.set_yticklabels(DAYS);
```





*(figure: activity chart of the pandas project)*

## Analyzing air-quality data

### Reading and transforming the data

```python
store = pd.HDFStore('aqi.hdf5')
df_aqi = store.select('aqi')
```




```python
df_aqi.head()
```




|   | Time | City | Position | AQI | Level | PM2_5 | PM10 | CO | NO2 | O3 | SO2 |
|---|------|------|----------|-----|-------|-------|------|----|-----|----|-----|
| 1 | 2014-04-11 15:00:00 | 上海 | 普陀 | 76.0 | | 49.0 | 101.0 | 0.000 | 0.0 | 0.0 | 0.0 |
| 2 | 2014-04-11 15:00:00 | 上海 | 十五厂 | 72.0 | | 52.0 | 94.0 | 0.479 | 53.0 | 124.0 | 9.0 |
| 3 | 2014-04-11 15:00:00 | 上海 | 虹口 | 80.0 | | 59.0 | 98.0 | 0.612 | 52.0 | 115.0 | 11.0 |
| 4 | 2014-04-11 15:00:00 | 上海 | 徐汇上师大 | 74.0 | | 54.0 | 87.0 | 0.706 | 43.0 | 113.0 | 14.0 |
| 5 | 2014-04-11 15:00:00 | 上海 | 杨浦四漂 | 84.0 | | 62.0 | 99.0 | 0.456 | 43.0 | 82.0 | 9.0 |
```python
df_aqi['Date'] = pd.to_datetime(df_aqi.Time)
```




```python
# build the PeriodIndex from the Date column (the frame still has a plain integer index);
# inplace=True is required so that df_aqi itself gets the new index
df_aqi.set_index(df_aqi.Date.dt.to_period('D'), drop=False, inplace=True)
```




```python
df_aqi.head()
```




| Date | Time | City | Position | AQI | Level | PM2_5 | PM10 | CO | NO2 | O3 | SO2 | Date |
|------|------|------|----------|-----|-------|-------|------|----|-----|----|-----|------|
| 2014-04-11 | 2014-04-11 15:00:00 | 上海 | 普陀 | 76.0 | | 49.0 | 101.0 | 0.000 | 0.0 | 0.0 | 0.0 | 2014-04-11 15:00:00 |
| 2014-04-11 | 2014-04-11 15:00:00 | 上海 | 十五厂 | 72.0 | | 52.0 | 94.0 | 0.479 | 53.0 | 124.0 | 9.0 | 2014-04-11 15:00:00 |
| 2014-04-11 | 2014-04-11 15:00:00 | 上海 | 虹口 | 80.0 | | 59.0 | 98.0 | 0.612 | 52.0 | 115.0 | 11.0 | 2014-04-11 15:00:00 |
| 2014-04-11 | 2014-04-11 15:00:00 | 上海 | 徐汇上师大 | 74.0 | | 54.0 | 87.0 | 0.706 | 43.0 | 113.0 | 14.0 | 2014-04-11 15:00:00 |
| 2014-04-11 | 2014-04-11 15:00:00 | 上海 | 杨浦四漂 | 84.0 | | 62.0 | 99.0 | 0.456 | 43.0 | 82.0 | 9.0 | 2014-04-11 15:00:00 |
```python
df_aqi.City.value_counts()
```




天津     134471
北京     109999
上海      92745
天津市        13
北京市        12
上海市        10
Name: City, dtype: int64
```python
df_aqi['City'] = df_aqi.City.str.replace('市', '')  # normalize city names: strip the trailing '市'
```




```python
df_aqi.City.value_counts()
```




天津    134484
北京    110011
上海     92755
Name: City, dtype: int64

### Correlation between AQI and the other pollutants

```python
df_aqi.corr()
```




|       | AQI | PM2_5 | PM10 | CO | NO2 | O3 | SO2 |
|-------|-----|-------|------|----|-----|----|-----|
| AQI   | 1.000000 | 0.943671 | 0.694076 | 0.610657 | 0.533769 | -0.136260 | 0.419722 |
| PM2_5 | 0.943671 | 1.000000 | 0.569205 | 0.632692 | 0.555882 | -0.168933 | 0.425680 |
| PM10  | 0.694076 | 0.569205 | 1.000000 | 0.460423 | 0.471571 | -0.136218 | 0.414261 |
| CO    | 0.610657 | 0.632692 | 0.460423 | 1.000000 | 0.565252 | -0.233338 | 0.537967 |
| NO2   | 0.533769 | 0.555882 | 0.471571 | 0.565252 | 1.000000 | -0.439186 | 0.448082 |
| O3    | -0.136260 | -0.168933 | -0.136218 | -0.233338 | -0.439186 | 1.000000 | -0.197512 |
| SO2   | 0.419722 | 0.425680 | 0.414261 | 0.537967 | 0.448082 | -0.197512 | 1.000000 |
```python
fig, ax = pl.subplots(figsize=(16, 9))
ax.set_aspect('equal')
ax.pcolormesh(df_aqi.corr(), cmap='Greens', vmin=-0.9, vmax=1)
```





<matplotlib.collections.QuadMesh at 0x1eb1af6ff60>

*(figure: correlation matrix drawn with pcolormesh)*

The same matrix drawn with the plot_dataframe_as_colormesh() helper defined below:

```python
fig, ax = pl.subplots(figsize=(16, 9))
plot_dataframe_as_colormesh(df_aqi.corr(), ax=ax, colorbar=True, xtick_rot=90)
```




<matplotlib.axes._subplots.AxesSubplot at 0x1eb1b0c44a8>

*(figure: correlation matrix drawn with plot_dataframe_as_colormesh)*

```python
ax.pcolormesh?
```





### A heat-map helper function

```python
import pylab as pl
import numpy as np


def plot_dataframe_as_colormesh(df, ax=None, inverse_yaxis=False, colorbar=False, xtick_rot=0,
                                xtick_start=0, xtick_step=1, ytick_start=0, ytick_step=1,
                                xtick_format=None, ytick_format=None,
                                **kw):
    """Draw a DataFrame as a pcolormesh, labelling the ticks with its index and columns."""
    nrow, ncol = df.shape
    if ax is None:
        fig_width = 10.0
        fig_height = fig_width / ncol * nrow
        fig, ax = pl.subplots(figsize=(fig_width, fig_height))

    ax.set_aspect("equal")
    if inverse_yaxis:
        ax.invert_yaxis()
    mesh = ax.pcolormesh(df.values, **kw)
    if colorbar:
        pl.colorbar(ax=ax, mappable=mesh)

    xticks_loc = np.arange(xtick_start, ncol, xtick_step)
    yticks_loc = np.arange(ytick_start, nrow, ytick_step)

    xlabels = df.columns.tolist()
    if xtick_format is not None:
        xlabels = [xtick_format(label) for label in xlabels]
    ylabels = df.index.tolist()
    if ytick_format is not None:
        ylabels = [ytick_format(label) for label in ylabels]

    ax.set_xticks(xticks_loc + 0.5)
    ax.set_xticklabels([xlabels[idx] for idx in xticks_loc], rotation=xtick_rot)
    ax.set_yticks(yticks_loc + 0.5)
    ax.set_yticklabels([ylabels[idx] for idx in yticks_loc])
    return ax
```





### Distribution of each city's daily mean PM2.5

```python
daily_city_groupby = df_aqi.groupby(['Date', 'City'])
```




C:\Users\Administrator\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: 'Date' is both a column name and an index level.
Defaulting to column but this will raise an ambiguity error in a future version
  """Entry point for launching an IPython kernel.
```python
mean_PM = daily_city_groupby.PM2_5.mean().unstack(1)
# 'normed' is deprecated in matplotlib; 'density' is the replacement
mean_PM.plot(kind='hist', histtype='step', bins=20, density=True, lw=2)
pl.grid()
pl.rcParams['font.family'] = 'STSong'  # a font with CJK glyphs so the city names render
```





*(figure: histograms of daily mean PM2.5 per city)*

```python
mean_PM.plot(kind='kde', lw=2)
pl.xlim(0, 800)
```




(0, 800)

*(figure: kernel density estimate of daily mean PM2.5 per city)*

```python
mean_PM.corr()
```




| City | 上海 | 北京 | 天津 |
|------|------|------|------|
| 上海 | 1.000000 | -0.131391 | 0.032613 |
| 北京 | -0.131391 | 1.000000 | 0.583861 |
| 天津 | 0.032613 | 0.583861 | 1.000000 |
```python
fig, ax = pl.subplots(figsize=(4, 3))
plot_dataframe_as_colormesh(mean_PM.corr(), ax=ax, colorbar=True, xtick_rot=90)
```




<matplotlib.axes._subplots.AxesSubplot at 0x1eb1b6894e0>

*(figure: correlation of daily mean PM2.5 between the three cities)*

### Mean PM2.5 for each day of the week

```python
week_mean = df_aqi.groupby([df_aqi.Time.dt.dayofweek, 'City']).PM2_5.mean()
ax = week_mean.unstack(1).plot(kind='bar')
from pandas.tseries.frequencies import DAYS
ax.set_xticklabels(DAYS)
ax.legend(loc='best', mode='expand', bbox_to_anchor=(0, 1.2, 1, 0.102))
ax.grid(True)
```





*(figure: mean PM2.5 by day of week, per city)*

```python
hour_mean = df_aqi.groupby([df_aqi.Time.dt.hour, 'City']).PM2_5.mean()
ax = hour_mean.unstack(1).plot(kind='bar', figsize=(16, 9))
ax.legend(loc='best', mode='expand', bbox_to_anchor=(0, 1., 1, 0.102), ncol=3)
ax.grid(True)
```






*(figure: mean PM2.5 by hour of day, per city)*

### A look at Beijing's air quality

```python
bj_weather = df_aqi.query('City=="北京"')
month_place_PM = bj_weather.groupby([bj_weather.Time.dt.to_period('M'), 'Position']).PM2_5.mean()
```




```python
place_mean = month_place_PM.mean(level=1).sort_values()
place_mean.plot(kind='bar', color='green', figsize=(16, 9))
```




<matplotlib.axes._subplots.AxesSubplot at 0x1eb219d69b0>

![](http://ww1.sinaimg.cn/large/699db930ly1fsu5dbwiyxj20pr0g2jrg.jpg)

```python
place = place_mean.iloc[[0, 1, -2, -1]].index
ax = month_place_PM.unstack(1).loc[:, place].plot(kind='bar', figsize=(16, 9), width=0.8)
ax.grid()
```

*(figure: monthly mean PM2.5 for the cleanest and dirtiest stations)*


## Converting between row and column indexes

The stack() method turns the specified level of the column index into a row index level, while unstack() turns a row index level into the column index.
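
A minimal round-trip sketch (the frame and its labels here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]],
                  index=['r1', 'r2'], columns=['c1', 'c2'])
s = df.stack()       # the column labels become the innermost row index level
print(s)
print(s.unstack())   # unstack() moves them back, restoring the original frame
```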

## Swapping index levels

reorder_levels() and swaplevel() rearrange the index levels of the specified axis.
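
A small sketch with an invented two-level index; note that both methods only relabel the levels, so the rows keep their original order and a sort_index() call usually follows:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
                                 names=['outer', 'inner'])
s = pd.Series(range(4), index=idx)
print(s.swaplevel('outer', 'inner'))          # the two levels trade places
print(s.reorder_levels(['inner', 'outer']))   # same result, written as a target order
```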

## Pivot tables

pivot() takes three scattered columns of a DataFrame, uses them as the row index, the column index, and the element values, and turns them into a two-dimensional table.

pivot()'s three parameters index, columns, and values each accept a single column name; if values is not given, all remaining columns are treated as value columns, yielding a multi-level column index.
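
A minimal sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'date': ['d1', 'd1', 'd2', 'd2'],
                   'city': ['BJ', 'SH', 'BJ', 'SH'],
                   'aqi':  [80, 60, 90, 70]})
# each unique 'date' becomes a row and each unique 'city' a column
print(df.pivot(index='date', columns='city', values='aqi'))
```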

## Group operations

A group operation splits the data into groups by some condition, applies a computation to each group, and finally combines the results. In pandas, group operations are performed through the groupby() method of DataFrame and Series objects.

### The groupby() method

groupby()'s axis parameter defaults to 0, which groups the rows of the source data. Each row corresponds to one element of the grouping data, and each unique value in the grouping data defines one group.

When the grouping data is itself a column of the source data, it can be specified directly by column name:

```python
import pandas as pd
import numpy as np

dose = pd.read_csv('dose.csv')
tmt_groupby = dose.groupby('Tmt')
```
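
dose.csv is not shown here, so the following is only a sketch of typical GroupBy usage; it assumes the file contains numeric measurement columns alongside Tmt:

```python
print(tmt_groupby.size())        # number of rows in each Tmt group
print(tmt_groupby.mean())        # per-group mean of every numeric column
for name, group in tmt_groupby:  # iterate over (group key, sub-DataFrame) pairs
    print(name, group.shape)
```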
