09 Python's Pandas Library: Data Processing and Wrangling
Published: 2021-06-29 15:44:24
Data Processing and Wrangling with Pandas
```python
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import datetime

# Fetch the data
df = web.DataReader("AAPL", "yahoo", datetime.datetime(2019, 4, 1), datetime.date.today())
# Display it
df
```
| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
| 2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
| 2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
| 2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
| 2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
| 2019-04-08 | 200.229996 | 196.339996 | 196.419998 | 200.100006 | 25881700 | 200.100006 |
| 2019-04-09 | 202.850006 | 199.229996 | 200.320007 | 199.500000 | 35768200 | 199.500000 |
| 2019-04-10 | 200.740005 | 198.179993 | 198.679993 | 200.619995 | 21695300 | 200.619995 |
| 2019-04-11 | 201.000000 | 198.440002 | 200.850006 | 198.949997 | 20900800 | 198.949997 |
| 2019-04-12 | 200.139999 | 196.210007 | 199.199997 | 198.869995 | 27744300 | 198.869995 |
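The Yahoo endpoint behind pandas-datareader is not always reachable. For everything that follows, a small locally built frame of the same shape works just as well; here is a minimal sketch (values copied from the first two rows above, and `df_local` is a hypothetical name):

```python
import pandas as pd

# Local stand-in for the fetched AAPL frame, in case the network fetch fails
df_local = pd.DataFrame(
    {"High": [191.679993, 194.460007],
     "Low": [188.380005, 191.050003],
     "Open": [191.639999, 191.089996],
     "Close": [191.240005, 194.020004],
     "Volume": [27862000, 22765700],
     "Adj Close": [191.240005, 194.020004]},
    index=pd.to_datetime(["2019-04-01", "2019-04-02"]),
)
df_local.index.name = "Date"
```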
1 Handling Missing Data

1.1 Dropping rows that contain missing values

```python
df_drop = df.dropna()
df_drop.head()
```
| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
| 2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
| 2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
| 2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
| 2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
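The AAPL frame above happens to contain no NaN, so `dropna()` leaves it unchanged. On a tiny frame with one missing value (hypothetical numbers) the effect is easier to see:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"Open": [191.64, np.nan, 193.25],
                     "Close": [191.24, 194.02, 195.35]})
dropped = demo.dropna()  # the row whose Open is NaN is removed
```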
1.2 Filling in missing values

```python
df_fillna = df.fillna(value=0)
df_fillna.head()
```
| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
| 2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
| 2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
| 2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
| 2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
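Again the effect is invisible on a frame without NaN; a small made-up frame shows the substitution:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"Open": [191.64, np.nan],
                     "Close": [191.24, 194.02]})
filled = demo.fillna(value=0)  # every NaN becomes 0; other values are untouched
```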
1.3 Testing for NaN and producing a boolean mask

```python
df_isnull = pd.isnull(df)
df_isnull.head()
```
| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2019-04-01 | False | False | False | False | False | False |
| 2019-04-02 | False | False | False | False | False | False |
| 2019-04-03 | False | False | False | False | False | False |
| 2019-04-04 | False | False | False | False | False | False |
| 2019-04-05 | False | False | False | False | False | False |
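With an actual NaN present, the mask contains True at exactly those positions. A minimal sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"Open": [191.64, np.nan]})
mask = pd.isnull(demo)  # element-wise: True where the value is NaN
```

`demo.isnull()` is equivalent to `pd.isnull(demo)` here.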
2 Function Application and Mapping

```python
# Column-wise mean
df.mean()
```

```
High         1.981070e+02
Low          1.950050e+02
Open         1.962690e+02
Close        1.971340e+02
Volume       2.435307e+07
Adj Close    1.971340e+02
dtype: float64
```

```python
# Row-wise mean
df.mean(axis=1)
```

```
Date
2019-04-01    4.643826e+06
2019-04-02    3.794444e+06
2019-04-03    3.878796e+06
2019-04-04    3.185879e+06
2019-04-05    3.087931e+06
2019-04-08    4.313782e+06
2019-04-09    5.961534e+06
2019-04-10    3.616050e+06
2019-04-11    3.483633e+06
2019-04-12    4.624216e+06
dtype: float64
```

```python
# The skipna parameter defaults to True, which excludes missing values
df.mean(axis=1, skipna=False)
```

```
Date
2019-04-01    4.643826e+06
2019-04-02    3.794444e+06
2019-04-03    3.878796e+06
2019-04-04    3.185879e+06
2019-04-05    3.087931e+06
2019-04-08    4.313782e+06
2019-04-09    5.961534e+06
2019-04-10    3.616050e+06
2019-04-11    3.483633e+06
2019-04-12    4.624216e+06
dtype: float64
```
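The two outputs match because the AAPL frame has no NaN. On a frame that does, `skipna` makes the difference visible; a sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                     "b": [4.0, 5.0, 6.0]})
col_means = demo.mean()                   # skipna=True by default: NaN is ignored
strict = demo.mean(axis=1, skipna=False)  # a row containing NaN yields NaN
```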
```python
# Sort by row labels
sorted_row_df = df.sort_index()
sorted_row_df.head()
```
| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
| 2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
| 2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
| 2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
| 2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
```python
# Sort by column labels
sorted_col_df = df.sort_index(axis=1)
sorted_col_df.head()
```
| Date | Adj Close | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|
| 2019-04-01 | 191.240005 | 191.240005 | 191.679993 | 188.380005 | 191.639999 | 27862000 |
| 2019-04-02 | 194.020004 | 194.020004 | 194.460007 | 191.050003 | 191.089996 | 22765700 |
| 2019-04-03 | 195.350006 | 195.350006 | 196.500000 | 193.149994 | 193.250000 | 23271800 |
| 2019-04-04 | 195.690002 | 195.690002 | 196.369995 | 193.139999 | 194.789993 | 19114300 |
| 2019-04-05 | 197.000000 | 197.000000 | 197.100006 | 195.929993 | 196.449997 | 18526600 |
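The same label-sorting behaviour on a tiny unsorted frame (hypothetical labels and values):

```python
import pandas as pd

demo = pd.DataFrame({"b": [1, 2], "a": [3, 4]}, index=["r2", "r1"])
by_row = demo.sort_index()        # orders rows by their labels
by_col = demo.sort_index(axis=1)  # orders columns by their names
```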
The methods shown above are the most common, but there are many others. A selection is listed below for reference:

count — number of non-NA values
describe — summary statistics for a Series or for each DataFrame column
min, max — minimum and maximum values
argmin, argmax — integer positions of the minimum and maximum values
idxmin, idxmax — index labels of the minimum and maximum values
quantile — sample quantile (between 0 and 1)
sum — sum of values
mean — arithmetic mean of values
median — median of values
mad — mean absolute deviation from the mean
var — sample variance
std — sample standard deviation
skew — sample skewness (third moment)
kurt — sample kurtosis (fourth moment)
cumsum — cumulative sum
cummin, cummax — cumulative minimum and cumulative maximum
cumprod — cumulative product
diff — first-order difference
pct_change — percent change
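Two of these are especially handy on price series: `cumsum` for running totals and `pct_change` for period-over-period returns. A sketch on a made-up series:

```python
import pandas as pd

close = pd.Series([100.0, 102.0, 101.0, 104.0])
running_total = close.cumsum()  # cumulative sum at each position
returns = close.pct_change()    # percent change vs the previous value; first is NaN
```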
3 Data Wrangling

Pandas provides many methods that make it easy to combine Series and DataFrame objects under a variety of join logics:

concat — stack multiple objects along an axis
append — append rows to a DataFrame (deprecated in pandas 1.4 and removed in 2.0; use pd.concat instead)
duplicated / drop_duplicates — flag and remove duplicate rows

3.1 Stacking data with concat

```python
df1 = web.DataReader("AAPL", "yahoo", datetime.datetime(2019, 4, 1), datetime.datetime(2019, 4, 3))
df1
```
| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
| 2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
| 2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
```python
df2 = web.DataReader("AAPL", "yahoo", datetime.datetime(2019, 4, 5), datetime.datetime(2019, 4, 8))
df2
```
| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
| 2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
| 2019-04-08 | 200.229996 | 196.339996 | 196.419998 | 200.100006 | 25881700 | 200.100006 |
Vertical (row-wise) concatenation, the default:

```python
pd.concat([df1, df2], axis=0)
```
| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
| 2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
| 2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
| 2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
| 2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
| 2019-04-08 | 200.229996 | 196.339996 | 196.419998 | 200.100006 | 25881700 | 200.100006 |
Horizontal (column-wise) concatenation; positions whose index labels do not match are filled with NaN:

```python
pd.concat([df1, df2], axis=1)
```
| Date | High | Low | Open | Close | Volume | Adj Close | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000.0 | 191.240005 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700.0 | 194.020004 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800.0 | 195.350006 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2019-04-04 | NaN | NaN | NaN | NaN | NaN | NaN | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300.0 | 195.690002 |
| 2019-04-05 | NaN | NaN | NaN | NaN | NaN | NaN | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600.0 | 197.000000 |
| 2019-04-08 | NaN | NaN | NaN | NaN | NaN | NaN | 200.229996 | 196.339996 | 196.419998 | 200.100006 | 25881700.0 | 200.100006 |
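The alignment rules are easiest to check on two tiny frames with partially overlapping labels (hypothetical names and values):

```python
import pandas as pd

df_a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
df_b = pd.DataFrame({"y": [3, 4]}, index=["r2", "r3"])
stacked = pd.concat([df_a, df_b], axis=0)  # rows stacked; the union of columns, gaps become NaN
side = pd.concat([df_a, df_b], axis=1)     # columns side by side; union of index labels, gaps become NaN
```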
3.2 Appending data with append

```python
df1
```
| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
| 2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
| 2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
```python
s = df1.iloc[0]
s
```

```
High         1.916800e+02
Low          1.883800e+02
Open         1.916400e+02
Close        1.912400e+02
Volume       2.786200e+07
Adj Close    1.912400e+02
Name: 2019-04-01 00:00:00, dtype: float64
```
```python
# ignore_index=False keeps the original index labels
df1.append(s, ignore_index=False)
```
| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000.0 | 191.240005 |
| 2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700.0 | 194.020004 |
| 2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800.0 | 195.350006 |
| 2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000.0 | 191.240005 |
```python
# ignore_index=True resets the index to 0, 1, 2, ...
df1.append(s, ignore_index=True)
```
|   | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 0 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000.0 | 191.240005 |
| 1 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700.0 | 194.020004 |
| 2 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800.0 | 195.350006 |
| 3 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000.0 | 191.240005 |
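Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so in current pandas the same operation has to go through `pd.concat`. A sketch on a small hypothetical frame:

```python
import pandas as pd

df_small = pd.DataFrame({"Open": [191.64, 191.09]},
                        index=["2019-04-01", "2019-04-02"])
row = df_small.iloc[0]
# Modern equivalent of df_small.append(row, ignore_index=False):
# transpose the one-column frame produced by to_frame() so the Series
# becomes a single row whose label is the Series' name.
appended = pd.concat([df_small, row.to_frame().T])
```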
3.3 Removing duplicates with duplicated

```python
z = df1.append(s, ignore_index=False)
z
```
| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000.0 | 191.240005 |
| 2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700.0 | 194.020004 |
| 2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800.0 | 195.350006 |
| 2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000.0 | 191.240005 |
Flag the duplicate rows:

```python
z.duplicated()
```

```
Date
2019-04-01    False
2019-04-02    False
2019-04-03    False
2019-04-01     True
dtype: bool
```
Remove the duplicate rows:

```python
z.drop_duplicates()
```
| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000.0 | 191.240005 |
| 2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700.0 | 194.020004 |
| 2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800.0 | 195.350006 |
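The two methods divide the work: `duplicated()` returns a boolean mask marking later repeats of an earlier row, and `drop_duplicates()` keeps only first occurrences. A self-contained sketch with hypothetical values:

```python
import pandas as pd

demo = pd.DataFrame({"Open": [191.64, 191.09, 191.64],
                     "Close": [191.24, 194.02, 191.24]})
flags = demo.duplicated()         # True marks a repeat of an earlier row
deduped = demo.drop_duplicates()  # first occurrence of each row is kept
```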
4 Grouping

```python
z.groupby("Open").sum()
```
| Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|
| 191.089996 | 194.460007 | 191.050003 | 194.020004 | 22765700.0 | 194.020004 |
| 191.639999 | 383.359985 | 376.760010 | 382.480011 | 55724000.0 | 382.480011 |
| 193.250000 | 196.500000 | 193.149994 | 195.350006 | 23271800.0 | 195.350006 |
```python
z.groupby(["Open", "Close"]).sum()
```
| Open | Close | High | Low | Volume | Adj Close |
|---|---|---|---|---|---|
| 191.089996 | 194.020004 | 194.460007 | 191.050003 | 22765700.0 | 194.020004 |
| 191.639999 | 191.240005 | 383.359985 | 376.760010 | 55724000.0 | 382.480011 |
| 193.250000 | 195.350006 | 196.500000 | 193.149994 | 23271800.0 | 195.350006 |
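In the output above, the duplicated 2019-04-01 rows (both with Open 191.639999) have been summed into one group, which is why High shows 383.359985 (2 × 191.679993). The same aggregation on a tiny hypothetical frame:

```python
import pandas as pd

demo = pd.DataFrame({"Open": [191.64, 191.09, 191.64],
                     "Volume": [100, 200, 300]})
per_open = demo.groupby("Open").sum()  # rows sharing an Open value are summed together
```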
Reprinted from: https://codingchaozhang.blog.csdn.net/article/details/89349197. If this infringes your copyright, please leave a comment with the address of the original article and this post will be removed; apologies for any inconvenience.