[Python Data Analysis] Chapter 6: Loading, Storage, and File Formats - 2

Loading data in chunks:
piece = pd.read_csv('py/ex5.csv', header=None, chunksize=10)

# piece is an iterable object; take one chunk out
next(piece)

# It can also be iterated over
for p in piece:
    print(p)
    break
# Fetch the next chunk
piece.get_chunk()

# Read all the remaining chunks into a single DataFrame object
piece.read()



The source file ex5.csv contains 1000 rows; chunksize=10 means the data is read 10 rows at a time.
piece then yields 100 chunks; each iteration or next() call returns one chunk.
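The chunked-reading calls above can be combined freely, since they all advance the same reader. A minimal sketch (ex5.csv is not available here, so a small 30-row stand-in CSV is built in memory):

```python
import io

import pandas as pd

# Stand-in for ex5.csv: 30 rows, 2 columns, no header.
csv_text = "\n".join(f"{i},{i * 10}" for i in range(30))
reader = pd.read_csv(io.StringIO(csv_text), header=None, chunksize=10)

first = next(reader)         # one chunk of 10 rows
second = reader.get_chunk()  # the next chunk of 10 rows
rest = reader.read()         # everything left, as a single DataFrame

print(len(first), len(second), len(rest))  # 10 10 10
```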
================================================================================
Saving files:
--------------------------------------------------------------------------------
mydata = pd.read_csv('py/ex5.csv', names=['c1', 'c2'])
mydata.to_csv('py/save_1.csv')

The saved file includes both the index and the names; note the leading comma before the column names:
,c1,c2
0,0 ,10000.0
1,1 ,10001.0

Custom delimiter:
mydata.to_csv('py/save_2.csv', sep='|')

How NaN values are written when saving:
na_rep='NULL'          e.g. write NaN out as NULL; the default is ''
index=False            whether to save the index
header=False           whether to save the header row
columns=['c2', 'c1']   which columns to save; this also lets you reorder them by hand


import sys
c4 = pd.read_csv('py/ex4.csv')
c4.to_csv(sys.stdout, index=False, columns=['c1', 'c2'])  # redirect the output to stdout; the target need not be a file
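Those save options can be exercised together. A sketch with a hypothetical two-row frame containing one missing value, written to an in-memory buffer instead of a file:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical frame with one NaN, to show na_rep / index / columns at work.
frame = pd.DataFrame({"c1": [1.0, np.nan], "c2": [3.0, 4.0]})

buf = io.StringIO()
frame.to_csv(buf, na_rep="NULL", index=False, columns=["c2", "c1"])
print(buf.getvalue())
# c2,c1
# 3.0,1.0
# 4.0,NULL
```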


Series also has a to_csv method:
--------------------------------------------------------------------------------
Signature of the to_csv method:
dataframe.to_csv(
    path_or_buf: Union[str, pathlib.Path, IO[~AnyStr], NoneType] = None,
    sep: str = ',',
    na_rep: str = '',
    float_format: Union[str, NoneType] = None,
    columns: Union[Sequence[Union[Hashable, NoneType]], NoneType] = None,
    header: Union[bool, List[str]] = True,
    index: bool = True,
    index_label: Union[bool, str, Sequence[Union[Hashable, NoneType]], NoneType] = None,
    mode: str = 'w',
    encoding: Union[str, NoneType] = None,
    compression: Union[str, Mapping[str, str], NoneType] = 'infer',
    quoting: Union[int, NoneType] = None,
    quotechar: str = '"',
    line_terminator: Union[str, NoneType] = None,
    chunksize: Union[int, NoneType] = None,
    date_format: Union[str, NoneType] = None,
    doublequote: bool = True,
    escapechar: Union[str, NoneType] = None,
    decimal: Union[str, NoneType] = '.',
) -> Union[str, NoneType]
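A quick check that Series.to_csv works the same way as the DataFrame version. A minimal sketch with a hypothetical series, writing to an in-memory buffer:

```python
import io

import pandas as pd

# A small named Series written to a buffer instead of a file.
s = pd.Series([10, 20, 30], name="score")
buf = io.StringIO()
s.to_csv(buf, header=True)  # header=True writes the series name as the column label
print(buf.getvalue())
# ,score
# 0,10
# 1,20
# 2,30
```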
================================================================================
Using Python's built-in csv module directly (not sure what the author is driving at here, so skimming past it for now):
import csv
f = open('examples/ex7.csv')
reader = csv.reader(f)
for line in reader:
    pass
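One plausible reading of the author's intent: use the raw rows from csv.reader to build a dict of columns by hand, splitting off the header row and transposing the rest. A sketch with a stand-in for ex7.csv:

```python
import csv
import io

# Stand-in for ex7.csv: one header row plus two data rows.
raw = io.StringIO("a,b,c\n1,2,3\n1,2,3\n")
lines = list(csv.reader(raw))

header, values = lines[0], lines[1:]
# zip(*values) transposes the rows into columns, one tuple per header name.
data = {h: v for h, v in zip(header, zip(*values))}
print(data)  # {'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
```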
================================================================================
JSON data:
--------------------------------------------------------------------------------
A JSON string:
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
} """

obj after pretty-printing:
{
    "name": "Wes",
    "places_lived": [
        "United States",
        "Spain",
        "Germany"
    ],
    "pet": null,
    "siblings": [
        {
            "name": "Scott",
            "age": 30,
            "pets": [
                "Zeus",
                "Zuko"
            ]
        },
        {
            "name": "Katie",
            "age": 38,
            "pets": [
                "Sixes",
                "Stache",
                "Cisco"
            ]
        }
    ]
}

import json
res = json.loads(obj)     # res is now a dict object
asjson = json.dumps(res)  # turn the object back into a JSON string; asjson is a str

# Turn a list of dicts into a DataFrame object
pd.DataFrame(res['siblings'], columns=['name', 'age', 'pets'])
--------------------------------------------------------------------------------
JSON-format files:
Contents of the json file:
[{"a": 1, "b": 2, "c": 3},
{"a": 4, "b": 5, "c": 6},
{"a": 7, "b": 8, "c": 9}]

pd.read_json('py/ex6.json')

Output:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
--------------------------------------------------------------------------------
d = pd.read_json('py/ex6.json')
d.to_json() # '{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}'
The format has become column-oriented:
{
    "a": {
        "0": 1,
        "1": 4,
        "2": 7
    },
    "b": {
        "0": 2,
        "1": 5,
        "2": 8
    },
    "c": {
        "0": 3,
        "1": 6,
        "2": 9
    }
}
The original data is record-oriented, convenient for looking records up:
[
    student1: {name, age, id, ...},
    student2: {name, age, id, ...},
    ...
]
Now it is column-oriented, convenient for statistics, analysis, and comparison:
{
    name: [student1's, student2's, ...],
    age:  [student1's, student2's, ...],
    id:   [student1's, student2's, ...],
}

So convert it back like this:
d.to_json(orient='records')
Output: [{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
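Since orient='records' matches the shape of the original file (one JSON object per row), the round trip through read_json recovers the same frame. A sketch:

```python
import io

import pandas as pd

d = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

# One JSON object per row, same layout as ex6.json.
js = d.to_json(orient="records")
print(js)  # [{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]

restored = pd.read_json(io.StringIO(js), orient="records")
print(restored.equals(d))  # True
```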
================================================================================
XML and HTML data: skipped
================================================================================
Binary data formats:

frame = pd.read_csv('examples/ex1.csv')
frame.to_pickle('examples/frame_pickle')
pd.read_pickle('examples/frame_pickle')

Note:
pickle is recommended only as a short-term storage format. The problem is that it is hard
to guarantee the format will stay stable over time: an object pickled today may fail to
unpickle with a later version of the library. The pandas maintainers try hard to prevent
this, but at some point in the future it may still be necessary to "break" the pickle format.
--------------------------------------------------------------------------------
Using the HDF5 format (one kind of binary format):
For very large datasets that cannot fit in memory, HDF5 is a good choice because it supports efficient chunked reads and writes.

Storing data here is like putting it into a dict; you can specify which pieces to fetch.
--------------------------------------------------------------------------------
mydata = pd.read_csv('py/ex5.csv', names=['c1', 'c2'])
store = pd.HDFStore('mydata.h5')
store['obj1'] = mydata
store['obj1_c1'] = mydata.c1
store  # prints: <class 'pandas.io.pytables.HDFStore'> File path: mydata.h5

store['obj1_c1']  # fetches the mydata.c1 saved above
--------------------------------------------------------------------------------
HDFStore supports two storage schemas, 'fixed' and 'table'. The latter is generally slower, but it supports query operations with a special syntax:
store.put('obj3', mydata, format='table')
store.select('obj3', where=['index > 10 and index < 20'])

Close the store when you're done:
store.close()

mydata.to_hdf('mydata.h5', 'has_1000_data', format='table')  # the key 'has_1000_data' must follow identifier naming rules
# mydata.h5 is still the same file as store = pd.HDFStore('mydata.h5') above; it feels just like a database
pd.read_hdf('mydata.h5', 'has_1000_data', where=['index < 10'])  # read the data back
================================================================================
Excel data (requires extra dependency packages):
xlsx = pd.ExcelFile('examples/ex1.xlsx')
pd.read_excel(xlsx, 'Sheet1')
--------------------------------------------------------------------------------
Saving with the help of ExcelWriter:
writer = pd.ExcelWriter('examples/ex2.xlsx')
frame.to_excel(writer, 'Sheet1')
writer.save()
--------------------------------------------------------------------------------
Saving directly:
frame.to_excel('examples/ex2.xlsx')
================================================================================
Working with database data
--------------------------------------------------------------------------------
1. Fetch the rows first, then build a DataFrame object directly
--------------------------------------------------------------------------------
2. Connect directly:
import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///mydata.sqlite')
pd.read_sql('select * from test', db)
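read_sql also accepts a plain DB-API connection (for SQLite), not only a SQLAlchemy engine. A sketch using an in-memory SQLite database as a stand-in for mydata.sqlite, with the table name 'test' matching the query above:

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for mydata.sqlite.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (a INTEGER, b TEXT)")
con.executemany("INSERT INTO test VALUES (?, ?)", [(1, "x"), (2, "y")])
con.commit()

# Query straight into a DataFrame.
df = pd.read_sql("select * from test", con)
print(df)
con.close()
```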
--------------------------------------------------------------------------------


