数据整合

行列操作

import pandas as pd
import numpy as np
sample = pd.DataFrame(np.random.randn(4,5),columns=['a','b','c','d','e'])
sample

Out[2]: 
          a         b         c         d         e
0 -0.776807  2.355071 -0.940921  0.164487 -1.025772
1  0.596704  0.962625  1.848441 -1.122676 -0.359290
2 -0.092755 -0.124250 -0.259899 -0.111997 -1.816197
3  0.372941  0.297850 -0.409256  0.485376 -2.790929

选择单列

1	sample['a']

Out[4]: 
0   -0.776807
1    0.596704
2   -0.092755
3    0.372941
Name: a, dtype: float64

数据框的ix，iloc，ioc方法都可以选择行，列，iloc方法只能使用数值作为索引来选择行列，loc方法在选择时能使用字符串索引，ix方法则可以使用两种索引

1	sample.ix[:,'a']

Out[5]: 
0   -0.776807
1    0.596704
2   -0.092755
3    0.372941
Name: a, dtype: float64

或者单选列

sample[['a']]
Out[6]: 
          a
0 -0.776807
1  0.596704
2 -0.092755
3  0.372941

选择多行多列

1	sample.ix[0:2,0:2]

Out[7]: 
          a         b
0 -0.776807  2.355071
1  0.596704  0.962625
2 -0.092755 -0.124250

1	sample.iloc[0:2,0:2]

创建，删除列

第一种方式

1 2	sample['new_col1']=sample['a']-sample['b'] sample

Out[10]: 
          a         b         c         d         e  new_col1
0 -0.776807  2.355071 -0.940921  0.164487 -1.025772 -3.131877
1  0.596704  0.962625  1.848441 -1.122676 -0.359290 -0.365921
2 -0.092755 -0.124250 -0.259899 -0.111997 -1.816197  0.031495
3  0.372941  0.297850 -0.409256  0.485376 -2.790929  0.075091

第二种方式

1	sample.assign(new_col2=sample['a']-sample['b'],new_col3=sample['a']+sample['b'])

Out[11]: 
          a         b         c    ...     new_col1  new_col2  new_col3
0 -0.776807  2.355071 -0.940921    ...    -3.131877 -3.131877  1.578264
1  0.596704  0.962625  1.848441    ...    -0.365921 -0.365921  1.559329
2 -0.092755 -0.124250 -0.259899    ...     0.031495  0.031495 -0.217004
3  0.372941  0.297850 -0.409256    ...     0.075091  0.075091  0.670792
[4 rows x 8 columns]

删除列，第一种方式

1	sample.drop('a',axis=1)

Out[12]: 
          b         c         d         e  new_col1
0  2.355071 -0.940921  0.164487 -1.025772 -3.131877
1  0.962625  1.848441 -1.122676 -0.359290 -0.365921
2 -0.124250 -0.259899 -0.111997 -1.816197  0.031495
3  0.297850 -0.409256  0.485376 -2.790929  0.075091

第二种方式，

1 2	# In sample.drop(['a','b'],axis=1)

      c         d         e  new_col1

0 -0.940921  0.164487 -1.025772 -3.131877
1  1.848441 -1.122676 -0.359290 -0.365921

条件查询

生成示例数据

# In[]
sample = pd.DataFrame({'name':['Bob','Lindy','Mark',"Miki",'Sully','Rose'],
                       'score':[98,78,88,77,69,69],
                       'group':[1,1,1,2,1,2]})
sample

Out[14]: 
    name  score  group
0    Bob     98      1
1  Lindy     78      1
2   Mark     88      1
3   Miki     77      2
4  Sully     69      1
5   Rose     69      2

单条件查询

涉及单条件查询时，一般会使用比较运算符，产生布尔类型的索引可用于条件查询。

1	sample.score >66

Out[15]: 
0    True
1    True
2    True
3    True
4    True
5    True

再通过指定的索引进行条件查询，返回bool值为True的数据：

sample[sample.score >66]

Out[16]: 
    name  score  group
0    Bob     98      1
1  Lindy     78      1
2   Mark     88      1
3   Miki     77      2
4  Sully     69      1
5   Rose     69      2

多条件查询

sample[(sample.score >66) & (sample.group==1)]
Out[17]: 
    name  score  group
0    Bob     98      1
1  Lindy     78      1
2   Mark     88      1
4  Sully     69      1

使用 qurey

sample.query('score > 90')
Out[20]: 
  name  score  group
0  Bob     98      1

其他查询

查询sample中70到80之间的记录，并且将边界包含进来（inclusive=True）

# In[]
sample[sample['score'].between(70,80,inclusive=True)]
Out[21]: 
    name  score  group
1  Lindy     78      1
3   Miki     77      2

对于字符串列来说，可以使用isin 方法进行查询：

sample[sample['name'].isin(['Bob','Lindy'])]
Out[24]: 
    name  score  group
0    Bob     98      1
1  Lindy     78      1

使用正则表达式匹配进行查询，例如查询姓名以M开头的人的所有记录：

sample[sample['name'].str.contains('[M]+')]
Out[26]: 
   name  score  group
2  Mark     88      1
3  Miki     77      2

横向连接

Pandas Data Frame 提供 merge 方法以完成各种表格的横向连接操作，这种连接操作跟SQL语句的连接操作类似。

df1 =pd.DataFrame({'id':[1,2,3],
                   'col1':['a','b','c']})
df2 = pd.DataFrame({'id':[4,3],
                    'col2':['d','e']})
df1
Out[29]: 
   id col1
0   1    a
1   2    b
2   3    c
df2
Out[30]: 
   id col2
0   4    d
1   3    e

内连接使用merge函数示例，根据公共字段保留两表的共有信息，how = 'innner'参数表示使用内连接，on表示两表的公共字段，若公共字段再两表名称不一致时，可以通过 left_on和right_on指定：

df1.merge(df2,how='inner',on='id')
Out[32]: 
   id col1 col2
0   3    c    e
df1.merge(df2,how='inner',left_on='id',right_on='id')
Out[33]: 
   id col1 col2
0   3    c    e

外连接

外连接包括左连接，全连接，右连接

左连接

左连接通过公共字段，保留坐标的全部信息，右表在左表缺失的信息会以NaN补全：

df1.merge(df2,how='left',on='id')
Out[34]: 
   id col1 col2
0   1    a  NaN
1   2    b  NaN
2   3    c    e

右连接

右连接与左连接相对，右连接通过公共字段，保留右表的全部信息，左表在右表缺失的信息会以 NaN 补全。

df1.merge(df2,how='right',on='id')
Out[35]: 
   id col1 col2
0   3    c    e
1   4  NaN    d

全连接

全连接通过公共字段，保留右表的全部信息，两表相互缺失的信息会以 NaN 补全。

df1.merge(df2,how='outer',on='id')
Out[36]: 
   id col1 col2
0   1    a  NaN
1   2    b  NaN
2   3    c    e
3   4  NaN    d

行索引连接

pd.concat可以完成横向和纵向的合并，这通过 ’axis=‘ 来控制，当参数axis= 1时表示进行横向合并。

df1 = pd.DataFrame({'id':[1,2,3],
                    'col1':['a','b','c']},
                   index=[1,2,3])
df2 =pd.DataFrame({'id':[1,2,3],
                   'col2':['aa','bb','cc']},
                  index=[1,3,2])
pd.concat([df1,df2],axis=1)
Out[37]: 
   id col1  id col2
1   1    a   1   aa
2   2    b   3   cc
3   3    c   2   bb

纵向合并

当参数 axis = 0 时，表示纵向合并。ignore_index= True 表示忽略df1 和 df2 的原先的行索引，合并后重新排列索引。

pd.concat([df1,df2],ignore_index=True,axis=0)
Out[43]: 
  col1 col2  id
0    a  NaN   1
1    b  NaN   2
2    c  NaN   3
3  NaN   aa   1
4  NaN   bb   2
5  NaN   cc   3

去除重复行

pd.concat([df1,df2],ignore_index=True,axis=0).drop_duplicates()

Out[44]: 
  col1 col2  id
0    a  NaN   1
1    b  NaN   2
2    c  NaN   3
3  NaN   aa   1
4  NaN   bb   2
5  NaN   cc   3

排序

按照学生成绩降序排列数据，第一个参数表示排序的依据，ascending = False 代表降序排列，na_position=’last’表示缺失值数据排列在数据的最后位置。

sample = pd.DataFrame({'name':['Bob','Lindy','Mark',"Miki",'Sully','Rose'],
                       'score':[98,78,88,77,69,np.nan],
                       'group':[1,1,1,2,1,2]})
sample
###
sample.sort_values('score',ascending= False,na_position='last')
Out[46]: 
    name  score  group
0    Bob   98.0      1
2   Mark   88.0      1
1  Lindy   78.0      1
3   Miki   77.0      2
4  Sully   69.0      1
5   Rose    NaN      2

分组汇总

数据准备

sample = pd.read_csv('./sample.csv',encoding='utf-8')
sample
Out[58]: 
   chinese   class  grade  math   name
0        88      1      1  98.0    Bob
1        78      1      1  78.0  Lindy
2        68      1      1  78.0   Miki
3        56      2      2  77.0   Mark
4        77      1      2  77.0  Sully
5        56      2      2   NaN   Rose

分组汇总操作中，会涉及分组变量，度量变量和汇总统计量。pandas 提供了 groupby 方法进行分组汇总。

在sample数据中，grade为分组变量，math 为度量变量，现需要查询grade 为1，2中数学成绩最高。

分组变量

在进行分组汇总时，分组变量可以有多个。

sample.groupby(['grade','class'])['math'].max()
Out[65]: 
grade  class
1      1        98.0
2      1        77.0
       2        77.0
Name: math, dtype: float64

汇总变量

在进行分组汇总时，汇总变量也可以多个。

sample.groupby('grade',)['math','chinese'].mean()
Out[75]: 
            math  chinese
grade                    
1      84.666667       78
2      77.000000       63

汇总统计量

方法	解释	方法	解释
mean	均值	mad	平均绝对偏差
max	最大值	count	计数
min	最小值	skew	偏度
median	中位数	quantile	指定分位数
std	标准差

以上统计量方法可以直接接 groupby 对象使用，agg方法提供了一次汇总多个统计量的方法，例如，汇总各个班级的数学成绩的均值，最大值，最小值。

sample.groupby('class')['math'].agg(['mean','min','max'])
Out[78]: 
        mean   min   max
class                   
1      82.75  77.0  98.0
2      77.00  77.0  77.0

多重索引

以年级，班级对学生的数学，语文成绩进行分组汇总，汇总统计量为均值。此时df中有两个行索引和两个列索引。

df=sample.groupby(['class','grade'])['math','chinese'].agg(['mean','min','max'])
Out[80]: 
                  math             chinese        
                  mean   min   max    mean min max
class grade                                       
1     1      84.666667  78.0  98.0      78  68  88
      2      77.000000  77.0  77.0      77  77  77
2     2      77.000000  77.0  77.0      56  56  56

查询各个年级、班级的数学成绩的最小值。

df['math']['min']
Out[84]: 
class  grade
1      1        78.0
       2        77.0
2      2        77.0
Name: min, dtype: float64