Programming Basics --- Learning Different Programming Languages --- File Access Operations in Different Programming Languages

Published: 2021-07-24 12:00:32 Category: Technical articles


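When switching between languages, it is easy to mix up operations that look similar. Here I record how file access differs from one language to another. This article will never really be finished; I add a note whenever I run into something.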

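File operations in Python

File paths

When typing a Windows file path in an ordinary string literal, note that every backslash must be doubled, for example 'D:\\tobacco\\dataformat\\orginal\\'.

Python lets you write a path in several ways; here are the three most common.

Method 1: escape the backslashes

'd:\\a.txt'

Method 2: use a raw string, which needs no escaping

r'd:\a.txt'

Method 3: use Linux-style forward slashes

'd:/a.txt'

I strongly recommend the third form, which works on both Linux and Windows.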
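As a quick check of the three spellings, the sketch below compares them. It assumes only the standard library; pathlib.PureWindowsPath is used so that the comparison follows Windows path rules on any operating system, and the variable names are illustrative.

```python
from pathlib import PureWindowsPath

escaped = 'd:\\a.txt'   # method 1: escaped backslash
raw = r'd:\a.txt'       # method 2: raw string, backslash taken literally
forward = 'd:/a.txt'    # method 3: forward slash

# Methods 1 and 2 produce exactly the same string.
same_literal = (escaped == raw)

# PureWindowsPath treats "/" and "\" as the same separator,
# so method 3 names the same Windows path as the other two.
same_path = (PureWindowsPath(forward) == PureWindowsPath(raw))

print(same_literal, same_path)
```

Both comparisons come out true, which is why the forward-slash form is a safe default.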

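The os module provides two functions for listing files, os.walk and os.listdir:

import os
import pandas as pd

# filepath is the directory to scan (defined elsewhere)
df = pd.DataFrame()
for root, dirs, files in os.walk(filepath):
    for file in files:
        # root is the directory that actually contains file
        df_temp = pd.read_csv(os.path.join(root, file), sep='\t', encoding='utf-16')
        df = pd.concat([df, df_temp], axis=0)

or

for file in os.listdir(filepath):
    df_temp = pd.read_csv(os.path.join(filepath, file), sep='\t', encoding='utf-16')
    df = pd.concat([df, df_temp], axis=0)

Calling os.listdir in a loop on each subdirectory also lets you pull in files from sub-subdirectories.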

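os.walk yields a series of 3-tuples (dirpath, dirnames, filenames) as it visits every directory in the tree. Each tuple is structured as follows.

The first item is the directory currently being visited, beginning with the starting path,

the second lists the folders in that directory,

the third lists the files in that directory.

dirpath is a string, the path of the directory,

dirnames is a list containing the names of all subdirectories of dirpath,

filenames is a list containing the names of the non-directory files. These names carry no path information; to get a full path you need to join the name onto dirpath.

Here is an example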

import os

for i in os.walk('c:' + os.sep + 'ant'):
    print(i)

Output:

('c:\\ant', ['bin', 'docs', 'etc', 'lib', 'Project'], ['fetch.xml', 'get-m2.xml', 'INSTALL', 'KEYS', 'LICENSE', 'NOTICE', 'README', 'WHATSNEW'])
('c:\\ant\\bin', [], ['ant', 'ant.bat', 'ant.cmd', 'antenv.cmd', 'antRun', 'antRun.bat', 'antRun.pl', 'complete-ant-cmd.pl', 'envset.cmd', 'lcp.bat', 'runant.pl', 'runant.py', 'runrc.cmd'])
('c:\\ant\\docs', ['ant2', 'antlibs', 'images', 'manual', 'projects', 'slides', 'webtest'], ['antnews.html', 'ant_in_anger.html', 'ant_task_guidelines.html', 'appendix_e.pdf', 'breadcrumbs.js', 'bugs.html', 'bylaws.html', 'contributors.html', 'external.html', 'faq.html', 'favicon.ico', 'index.html', 'legal.html', 'LICENSE', 'license.html', 'mail.html', 'mission.html', 'nightlies.html', 'page.css', 'problems.html', 'projects.html', 'resources.html', 'svn.html'])
('c:\\ant\\docs\\ant2', [], ['actionlist.html', 'features.html', 'FunctionalRequirements.html', 'original-specification.html', 'requested-features.html', 'requested-features.txt', 'VFS.txt'])
('c:\\ant\\docs\\antlibs', ['antunit', 'compress', 'dotnet', 'props', 'svn'], ['bindownload.cgi', 'bindownload.html', 'charter.html', 'index.html', 'proper.html', 'sandbox.html', 'srcdownload.cgi', 'srcdownload.html'])
('c:\\ant\\docs\\antlibs\\antunit', [], ['index.html'])
('c:\\ant\\docs\\antlibs\\compress', [], ['index.html'])
('c:\\ant\\docs\\antlibs\\dotnet', [], ['index.html'])
('c:\\ant\\docs\\antlibs\\props', [], ['index.html'])
...

Use this approach when you need every subfolder together with its files.
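The traversal described above can be sketched as a small helper that turns the bare names from os.walk into full paths. The helper name list_all_files and the demo file names are made up for illustration; the demo builds a throwaway tree with tempfile so it does not depend on any real directory.

```python
import os
import tempfile


def list_all_files(top):
    """Return the full path of every file under top, recursively."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            # filenames holds bare names, so join them onto dirpath
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


with tempfile.TemporaryDirectory() as top:
    # Build a small tree: top/a.txt, top/sub/b.txt, top/sub/deeper/c.txt
    os.makedirs(os.path.join(top, 'sub', 'deeper'))
    for rel in ('a.txt',
                os.path.join('sub', 'b.txt'),
                os.path.join('sub', 'deeper', 'c.txt')):
        with open(os.path.join(top, rel), 'w') as f:
            f.write('x')

    found = [os.path.relpath(p, top) for p in list_all_files(top)]
    print(found)
```

Joining onto dirpath rather than onto the starting path is what makes files in nested subdirectories resolve correctly.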

Python's shutil module


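Python provides the functions and methods needed for basic file operations out of the box. Most file operations can be done through a file object.

You must first open a file with Python's built-in open() function, which creates a file object; you can then call that object's methods to read and write.

Syntax:

file object = open(file_name [, access_mode][, buffering])

For example: fo = open("XXX.txt", "r+")

The parameters in detail:

file_name: a string containing the name of the file you want to access.
access_mode: the mode in which the file is opened: read-only, write, append, and so on. The full list of values is given below. This parameter is optional; the default mode is read-only (r).
buffering: if set to 0, no buffering takes place. If set to 1, the file is line-buffered. If set to an integer greater than 1, that value is the buffer size. If negative, the system default buffer size is used.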
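A minimal sketch of the buffering argument follows, assuming Python 3, where one detail differs from the description above: buffering=0 is accepted only in binary mode, and requesting it in text mode raises ValueError. The file name demo.bin is illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.bin')

    # buffering=0 (unbuffered) is allowed in binary mode.
    with open(path, 'wb', buffering=0) as f:
        f.write(b'abc')

    # In Python 3, unbuffered text I/O is rejected with ValueError.
    try:
        open(path, 'w', buffering=0)
        text_unbuffered_ok = True
    except ValueError:
        text_unbuffered_ok = False

    # buffering=1 requests line buffering and applies to text mode.
    with open(path, 'w', buffering=1) as f:
        f.write('line\n')

    with open(path) as f:
        content = f.read()

print(text_unbuffered_ok, repr(content))
```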

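The full list of modes for opening a file:

Mode	Description	Example
r	Opens the file read-only. The file pointer is placed at the beginning of the file. This is the default mode.	Read a text file: input = open('data', 'r'); the second argument defaults to r: input = open('data')
rb	Opens the file read-only in binary format. The file pointer is placed at the beginning of the file.	Read a binary file: input = open('data', 'rb')
r+	Opens the file for reading and writing. The file pointer is placed at the beginning of the file.
rb+	Opens the file for reading and writing in binary format. The file pointer is placed at the beginning of the file.
w	Opens the file for writing only. If the file already exists it is overwritten. If it does not exist, a new file is created.	output = open('data', 'w')
wb	Opens the file for writing only in binary format. If the file already exists it is overwritten. If it does not exist, a new file is created.	output = open('data', 'wb')
w+	Opens the file for reading and writing. If the file already exists it is overwritten. If it does not exist, a new file is created.	output = open('data', 'w+')
wb+	Opens the file for reading and writing in binary format. If the file already exists it is overwritten. If it does not exist, a new file is created.
a	Opens the file for appending. If the file already exists, the file pointer is placed at the end of the file, so new content is written after the existing content. If the file does not exist, a new file is created for writing.
ab	Opens the file for appending in binary format. If the file already exists, the file pointer is placed at the end of the file, so new content is written after the existing content. If the file does not exist, a new file is created for writing.
a+	Opens the file for reading and writing. If the file already exists, the file pointer is placed at the end of the file and the file is opened in append mode. If the file does not exist, a new file is created for reading and writing.
ab+	Opens the file for reading and appending in binary format. If the file already exists, the file pointer is placed at the end of the file. If the file does not exist, a new file is created for reading and writing.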
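To make the difference between the most common modes concrete, here is a small sketch that runs w, a, r+ and w again on the same temporary file. The file name data.txt is illustrative, and the comments show what each step is expected to leave in the file.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.txt')

    with open(path, 'w') as f:      # create (or truncate) and write
        f.write('first\n')
    with open(path, 'a') as f:      # append after the existing content
        f.write('second\n')
    with open(path) as f:           # 'r' is the default mode
        after_append = f.read()     # 'first\nsecond\n'

    with open(path, 'r+') as f:     # read/write, pointer at the start
        f.write('FIRST')            # overwrites in place, no truncation
    with open(path) as f:
        after_rplus = f.read()      # 'FIRST\nsecond\n'

    with open(path, 'w') as f:      # 'w' on an existing file empties it
        pass
    size_after_w = os.path.getsize(path)   # 0

print(repr(after_append), repr(after_rplus), size_after_w)
```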

Read the entire content

file_object = open('thefile.txt')
try:
    all_the_text = file_object.read()
finally:
    file_object.close()

Read a fixed number of bytes

file_object = open('abinfile', 'rb')
try:
    while True:
        chunk = file_object.read(100)
        if not chunk:
            break
        do_something_with(chunk)
finally:
    file_object.close()

Read every line

list_of_all_the_lines = file_object.readlines()

If the file is a text file, you can also iterate over the file object directly to get each line:

for line in file_object:
    process(line)

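Write data

file_object = open('thefile.txt', 'w')
file_object.write(all_the_text)
file_object.close()

Write multiple lines

file_object.writelines(list_of_text_strings)

Note that writelines() writes the strings exactly as given and does not add line separators. It saves you an explicit loop of write() calls, but it should not be expected to be faster than a single write() of the already joined text.

When processing log files you often run into this situation: the log file is huge and cannot be read into memory in one go. For example, to process a 2 GB log file on a machine with 2 GB of physical memory, you might want to handle only 200 MB at a time. In Python, the built-in file object offers readlines(sizehint) for exactly this. Take the following code as an example:

file = open('test.log', 'r')
sizehint = 209715200   # 200 MB
lines = file.readlines(sizehint)
while lines:           # an empty list means end of file
    # process the current batch of lines here
    lines = file.readlines(sizehint)
file.close()

Each call to readlines(sizehint) returns roughly 200 MB of data, always made up of complete lines. In most cases the amount returned is slightly larger than sizehint (except on the last call). In Python 2, the value of sizehint may also be rounded up to an internal buffer size.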
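The batch-reading idea can be wrapped in a generator so that the caller only ever holds one batch in memory. This is a sketch under a few assumptions: the helper name iter_line_batches is hypothetical, the demo file is tiny, and the hint of 20 stands in for the 200 MB used in the text.

```python
import os
import tempfile


def iter_line_batches(path, size_hint, encoding='utf-8'):
    """Yield lists of complete lines, each list roughly size_hint in size."""
    with open(path, 'r', encoding=encoding) as f:
        while True:
            lines = f.readlines(size_hint)
            if not lines:      # an empty list means end of file
                break
            yield lines


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.log')
    original = ['line %d\n' % i for i in range(10)]
    with open(path, 'w', encoding='utf-8') as f:
        f.writelines(original)

    # A tiny hint so that the small demo file is split into several batches.
    batches = list(iter_line_batches(path, 20))

for batch in batches:
    print(len(batch), 'lines')
```

Because every batch consists of whole lines, concatenating the batches reproduces the file exactly.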

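Performing file operations

In Python 2, file is a special type used to operate on external files from a Python program. Everything in Python is an object, and file is no exception: it has its own methods and attributes. First, here is how to create a file object:

file(name[, mode[, buffering]])

The file() function creates a file object. It has an alias, open(), which is perhaps more descriptive. Both are built-in functions in Python 2; Python 3 keeps only open(). Now for its parameters. name and mode are passed as strings. name is the name of the file.

mode is the opening mode. The possible values are r, w, a and U, meaning read (the default), write, append, and universal-newline mode respectively. If a file opened with w or a does not exist, it is created automatically. Opening an existing file with w empties its original content, because the file is truncated when it is opened and writing starts from the beginning. For historical reasons newline conventions differ between systems: Unix uses '\n' while Windows uses '\r\n'. Opening a file in U mode accepts every convention, so '\r', '\n' and '\r\n' all count as a line ending, and a tuple records the line endings that were found in the file. Whatever the convention in the file, line endings are read into Python uniformly as '\n'. After the mode character you can also append the three flags +, b and t, which mean, respectively, that the file can be both read and written, that it is opened in binary mode, and that it is opened in text mode (the default).

buffering: 0 means no buffering; 1 means line buffering; a number greater than 1 gives the buffer size, presumably in bytes.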
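The newline handling described above can be observed directly. The sketch below assumes Python 3, where universal newlines are the default behaviour of text mode rather than a separate U flag, and where passing newline='' switches the translation off. The file name mixed.txt is illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'mixed.txt')

    # Write raw bytes that mix Unix, Windows and old Mac line endings.
    with open(path, 'wb') as f:
        f.write(b'unix\nwindows\r\nold-mac\rend')

    # Default text mode: every line ending is translated to '\n'.
    with open(path, 'r', encoding='ascii') as f:
        translated = f.read()
        seen = f.newlines      # the line endings found while reading

    # newline='' disables the translation and returns the endings as stored.
    with open(path, 'r', encoding='ascii', newline='') as f:
        untranslated = f.read()

print(repr(translated))
print(repr(untranslated))
```

The newlines attribute plays the role of the tuple mentioned in the text: it records which line-ending styles were encountered.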

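The file object has its own attributes and methods.

Using them, the main things you can do with a file are: get information about the file, read it (get its content), write to it, and close the file object.

The following table lists attributes and methods of the file object:

Attribute/method	Description	Example	Output
file.mode	Returns the access mode of the opened file.	fo = open("Python读写文件.txt", 'r'); print(fo.mode)	r
file.name	Returns the name of the file.	fo = open("Python读写文件.txt", 'r'); print(fo.name)	Python读写文件.txt
file.tell()	Returns the current position of the pointer in the file; the next read starts from there. In text mode the value is an opaque, byte-based position, so it can differ from the number of characters read, for example when the file starts with a BOM or contains multi-byte characters.	fo = open("Python读写文件.txt", 'r', encoding='UTF-8'); content = fo.read(10); print(content); print(fo.tell())	try somet 12
file.seek(offset[, from])	Moves the pointer; from is the reference position and offset is the distance from it.	fo = open("Python读写文件.txt", 'r', encoding='UTF-8'); content = fo.read(10); print(content); print(fo.tell()); position = fo.seek(0, 0); content = fo.read(10); print(content); print(fo.tell())	try somet 12 try somet 12
file.read()	Reads the file content by character count (by byte count in binary mode); the content can be binary data, not only text.	fo = open("Python读写文件.txt", 'r', encoding='UTF-8'); content = fo.read(10); print(content)	try somet
file.readline()	Reads one whole line.	fo = open("Python读写文件.txt", 'r', encoding='UTF-8'); content = fo.readline(); print(content)	……
file.readlines()	Reads the whole file in one go; watch the file size to avoid running out of memory.	fo = open("Python读写文件.txt", 'r', encoding='UTF-8'); content = fo.readlines(); print(content)	……
file.write()	Writes content to the file; the file object must be opened for writing or appending.	fo = open("Python读写文件.txt", 'a', encoding='UTF-8'); content = fo.write("我只是试试")
file.close()	Closes the file object.	fo = open("Python读写文件.txt", 'a', encoding='UTF-8'); content = fo.write("我只是试试"); fo.close()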
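Since tell() and seek() work on byte positions, they are easiest to reason about in binary mode. The following sketch assumes a small UTF-8 file whose first two characters take three bytes each; the file name and content are made up for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('你好, file')          # 12 bytes in UTF-8

    with open(path, 'rb') as f:
        first = f.read(6)              # 6 bytes = two 3-byte characters
        pos_after_read = f.tell()      # byte offset after the read
        f.seek(0)                      # back to the start of the file
        again = f.read(6)
        f.seek(-4, os.SEEK_END)        # 4 bytes before the end of the file
        tail = f.read()

    with open(path, 'r', encoding='utf-8') as f:
        chars = f.read(2)              # text mode counts characters

print(pos_after_read, first.decode('utf-8'), tail, chars)
```

Reading two characters in text mode consumes six bytes here, which illustrates why a position reported by tell() need not match the character count.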

Three ways to read a file line by line in Python

Method 1:

f = open("foo.txt")             # returns a file object
line = f.readline()             # call the file's readline() method
while line:
    print(line, end='')         # end='' avoids printing a second newline
    line = f.readline()
f.close()

Method 2:

for line in open("foo.txt"):
    print(line, end='')

Method 3:

f = open("c:\\1.txt", "r")
lines = f.readlines()           # read the whole content
for line in lines:
    print(line)
f.close()

# -*- coding: utf8 -*-
alphaList = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
             'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
replaceList = ['aa', 'bb', 'cc', 'dd', 'ee', 'ff', 'gg', 'hh', 'ii', 'jj',
               'kk', 'll', 'mm', 'nn', 'oo', 'pp', 'qq', 'rr', 'ss', 'tt', 'uu', 'vv', 'ww', 'xx', 'yy', 'zz']

with open(r"AAA.py", encoding='utf8') as f:
    lines = f.read()  # read the whole content
    for i in range(len(alphaList)):
        lines = lines.replace(alphaList[i], replaceList[i])

with open(r"BBB.txt", mode='w', encoding='utf8') as f:
    content = f.write(lines)

with open(r"BBB.txt", encoding='utf8') as f:
    lines = f.read()  # read the whole content
    for i in range(len(replaceList)):
        lines = lines.replace(replaceList[i], alphaList[i])
# print(lines)

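Getting paths in Python

After import os, sys, you can use sys.path[0], sys.argv[0], os.getcwd(), os.path.abspath(__file__) and os.path.realpath(__file__).

sys.path is the list of directories Python searches for modules. sys.path[0] and sys.argv[0] are closely related: when a script is run, Python automatically inserts the directory containing the script named by sys.argv[0] at the front of sys.path.

If you run python getpath\getpath.py from the C:\test directory, os.getcwd() outputs "C:\test" and sys.path[0] outputs "C:\test\getpath".

If you compile the Python script into an executable with the py2exe module, the output of sys.path[0] changes again:

If the dependencies are packaged into a zip file in the default way, sys.path[0] outputs "C:\test\getpath\library.zip";

If you specify zipfile=None in setup.py, the dependencies are packaged into the exe file itself, and sys.path[0] outputs "C:\test\getpath\getpath.exe".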

#!/bin/env python
# -*- coding: utf8 -*-
import os, sys

if __name__ == "__main__":
    print("__file__=%s" % __file__)
    print("os.path.realpath(__file__)=%s" % os.path.realpath(__file__))
    print("os.path.dirname(os.path.realpath(__file__))=%s" % os.path.dirname(os.path.realpath(__file__)))
    print("os.path.split(os.path.realpath(__file__))=%s" % os.path.split(os.path.realpath(__file__))[0])
    print("os.path.abspath(__file__)=%s" % os.path.abspath(__file__))
    print("os.getcwd()=%s" % os.getcwd())
    print("sys.path[0]=%s" % sys.path[0])
    print("sys.argv[0]=%s" % sys.argv[0])

Output:

D:\>python ./python_test/test_path.py
__file__=./python_test/test_path.py
os.path.realpath(__file__)=D:\python_test\test_path.py
os.path.dirname(os.path.realpath(__file__))=D:\python_test
os.path.split(os.path.realpath(__file__))=D:\python_test
os.path.abspath(__file__)=D:\python_test\test_path.py
os.getcwd()=D:\
sys.path[0]=D:\python_test
sys.argv[0]=./python_test/test_path.py

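os.getcwd() gives "D:\", the directory from which execution was started.

sys.path[0] gives "D:\python_test", the directory containing the script that was initially executed; sys.argv[0] is the script path as it was typed on the command line ("./python_test/test_path.py").

os.path.split(os.path.realpath(__file__))[0] gives "D:\python_test", the directory containing test_path.py, the file that __file__ refers to.

The correct way to get the current path:

__file__ is the path of the file currently being executed.

# Get the path of the current file __file__
print("os.path.realpath(__file__)=%s" % os.path.realpath(__file__))
# Get the directory containing the current file __file__
print("os.path.dirname(os.path.realpath(__file__))=%s" % os.path.dirname(os.path.realpath(__file__)))
# Get the directory containing the current file __file__ (alternative spelling)
print("os.path.split(os.path.realpath(__file__))=%s" % os.path.split(os.path.realpath(__file__))[0])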
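The same lookup can also be written with pathlib, which the original text does not use; this is offered only as an alternative sketch. The helper name containing_dir and the file names in the comments are hypothetical.

```python
from pathlib import Path


def containing_dir(file_path):
    """Return the absolute directory that contains file_path."""
    # resolve() makes the path absolute and follows symlinks,
    # much like os.path.realpath(); parent is the directory part.
    return Path(file_path).resolve().parent


# In a script you would typically pass __file__:
#     here = containing_dir(__file__)
#     config = here / 'settings.ini'   # 'settings.ini' is a made-up name

# A relative path is resolved against the current working directory.
demo_dir = containing_dir('some_script.py')
print(demo_dir)
```

Taking the path as a parameter keeps the helper usable from an interactive session, where __file__ is not defined.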

Reading Parquet data in Python

Reading on a single machine with Python

import pandas as pd
import pyarrow.parquet as pq

temp_table = pq.read_table('C:/Users/资料/test/fcebb413-0dd8-4dde-9e30-29f2bbee2545.parquet')

# Names of all columns in the table
column_names = temp_table.column_names

temp_df = pd.DataFrame()
for column_name in column_names:
    if column_name in ['XXX', 'BBBB']:
        # Store these columns as strings
        temp_df[column_name] = temp_table.column(column_name).to_pandas().astype('str')
    else:
        temp_df[column_name] = temp_table.column(column_name).to_pandas()

temp_df.to_csv('./temp_B_df.csv', header=True)

Reading with PySpark

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
hdfsPath = "hdfs://ndhdfs/user/test/AAA-DATA2/day=20181202/*/*.parquet"
BBB = spark.read.parquet(hdfsPath)
BBB.filter(BBB.sample_ts > '2018-12-02 07:58:26').filter(BBB.sample_ts < '2018-12-02 07:59:00').show()
BBB.filter(BBB.sample_ts > '2018-12-02 07:58:26').filter(BBB.sample_ts < '2018-12-02 07:59:00').count()

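Other notes

Reading or writing a file can raise IOError, and once an error occurs a later f.close() is never called. To make sure the file is closed correctly whether or not an error occurs, we can use try ... finally:

file_object = open("Python读写文件.txt", 'r', encoding='UTF-8')
try:
    all_the_text = file_object.read()
finally:
    file_object.close()

Note: do not put the open statement inside the try block. If opening the file raises an exception, file_object is never assigned, so the close() call in finally cannot run.

A more concise way

with open('/path/to/file', 'r') as f:
    print(f.read())

The with statement calls close() for us automatically.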
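A short sketch can confirm that with closes the file even when the body fails, and that a failed open() raises before any file object exists. The simulated RuntimeError and the file names are assumptions made for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'hello.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('hello')

    # An exception inside the with block still closes the file.
    try:
        with open(path, 'r', encoding='utf-8') as f:
            text = f.read()
            raise RuntimeError('simulated failure while processing')
    except RuntimeError:
        pass
    closed_after_error = f.closed

    # If open() itself fails, there is no file object to close.
    try:
        open(os.path.join(tmp, 'missing.txt'))
        open_failed = False
    except FileNotFoundError:
        open_failed = True

print(text, closed_after_error, open_failed)
```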

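Garbled Chinese text (encoding problems)

To write a text file in a particular encoding, pass the encoding argument to open(), and strings are converted to that encoding automatically. If reading a UTF-8 encoded txt file fails with the error

"builtins.UnicodeDecodeError: 'gbk' codec can't decode byte 0xa0 in position 76: illegal multibyte sequence"

you can open it like this: fo = open("Python读写文件.txt", 'r', encoding='UTF-8'). However, when the file is then read with readlines, the data begins with "\ufeff", for the following reason:

On Windows, when a text file created in a text editor is saved as UTF-8 or another Unicode format, a BOM marker may be added at the start of the file (as the first character).

What is a BOM? BOM = Byte Order Mark. The BOM is the method recommended by the Unicode standard for marking byte order. For UTF-16, for example, if the receiver gets the BOM as FE FF, the byte stream is big-endian; if it gets FF FE, the byte stream is little-endian.

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to signal "I am UTF-8 encoded". The UTF-8 encoding of the BOM is EF BB BF (you can see it by opening the text in UltraEdit and switching to hex view). So if the receiver gets a byte stream beginning with EF BB BF, it knows the stream is UTF-8.

Once decoded, the BOM is usually seen in C/C++/Java as the character "\uFEFF" (although this does not seem to hold in every case).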


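This marker is not removed when Python or Java reads the file. In Java, if you read the first line into a String with readLine(), the String's length is 1 more than what you see, and its first character is the BOM.

Fortunately Python consistently turns the BOM into "\ufeff", so you can handle it yourself: test for it, then strip it. In Java this can be done with substring() or replace():

if (line.startsWith("\ufeff")) {
    // line = line.substring(1);
    line = line.replace("\ufeff", "");
}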
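On the Python side, one option the text does not mention is the standard 'utf-8-sig' codec, which drops a leading BOM while decoding. The sketch below compares it with plain 'utf-8' and with manual stripping; the file name bom.txt is illustrative.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bom.txt')

    # Create a UTF-8 file that starts with a BOM (bytes EF BB BF).
    with open(path, 'wb') as f:
        f.write(codecs.BOM_UTF8 + 'first line\n'.encode('utf-8'))

    # Plain 'utf-8' keeps the BOM as the character U+FEFF.
    with open(path, encoding='utf-8') as f:
        with_bom = f.readline()

    # 'utf-8-sig' removes a leading BOM if one is present.
    with open(path, encoding='utf-8-sig') as f:
        without_bom = f.readline()

    # Manual alternative, mirroring the startsWith/replace approach.
    stripped = with_bom.lstrip('\ufeff')

print(repr(with_bom), repr(without_bom), repr(stripped))
```

Reading with 'utf-8-sig' also works for files that have no BOM, so it is a reasonable default when the origin of a file is unknown.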

Reading ANSI-encoded data in Python

path = r'E:\users\huanghe\Desktop\abnormal\20190503_143605_0018-1.json'
with open(path, encoding='gb2312') as f:
    listlist = f.readlines()
print('ANSI file opened successfully')
# Print the first 10 lines (fewer if the file is shorter)
for line in listlist[:10]:
    print(line)

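Reading file data with the pandas library

pandas has many functions for reading data, such as read_table() for delimited text and read_csv() for CSV files.

Methods for reading files or getting data in Spark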

textFile(name, minPartitions=None, use_unicode=True)

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)

wholeTextFiles(path, minPartitions=None, use_unicode=True)

Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.

If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)

Reading files and getting data in Spark SQL

In the following, spark always denotes an existing SparkSession.

spark.read.json

# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+

spark.read.load

df = spark.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")

df = spark.read.load("examples/src/main/resources/people.json", format="json")
df.select("name", "age").write.save("namesAndAges.parquet", format="parquet")

spark.read.parquet

df = spark.read.parquet("examples/src/main/resources/users.parquet")

JDBC To Other Databases

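File operations in R

Getting file access permissions

Performing file operations

Other notes

Garbled Chinese text (encoding problems)

File operations in Java

How to write file paths in Java

First, double backslashes: "C:\\mydoc\\aa.doc"

Second, single forward slashes: "C:/mydoc/aa.doc"

Third, File.separator:

String path = "C:" + File.separator + "my.doc";
System.out.println(path);

File.separator is the default file separator of the system you are running on.

Getting file access permissions

Performing file operations

Other notes

Garbled Chinese text (encoding problems)

Carriage return and line feed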

Reposted from: https://blog.csdn.net/qingqing7/article/details/78445362
