Python 3: unzipping large files and basic operations

  • 2020-06-19 10:34:11
  • OfStack

First of all: "large file" here does not refer to the size of the compressed archive. A zip of a few tens of megabytes can expand to several hundred megabytes once decompressed, and that is where the problem appears: unzipping small files succeeds, but unzipping and reading the large ones fails.


import zipfile


def unzip_to_txt_plus(zipfilename):
  zfile = zipfile.ZipFile(zipfilename, 'r')
  for filename in zfile.namelist():
    data = zfile.read(filename)
    # data = data.decode('gbk').encode('utf-8')   # strict decoding fails on invalid bytes
    data = data.decode('gbk', 'ignore').encode('utf-8')
    with open(filename, 'w+b') as outfile:
      outfile.write(data)
  zfile.close()


if __name__ == '__main__':
  zipfilename = "E:\\share\\python_excel\\zip_to_database\\20171025.zip"
  unzip_to_txt_plus(zipfilename)
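
For archive members that are too big to hold in memory comfortably, zfile.read() itself can become the bottleneck, because it returns the whole decompressed file as one bytes object. Below is a minimal sketch of a streaming variant, assuming the same gbk-encoded, top-level members as above; the function name unzip_stream_to_utf8 is mine, not from the original.

import io
import shutil
import zipfile


def unzip_stream_to_utf8(zipfilename):
  # Re-encode each member from gbk to utf-8 without holding the whole
  # decompressed file in memory at once.
  with zipfile.ZipFile(zipfilename, 'r') as zfile:
    for filename in zfile.namelist():
      with zfile.open(filename) as src:
        # TextIOWrapper decodes incrementally, so multi-byte gbk characters
        # split across read chunks are still handled correctly.
        reader = io.TextIOWrapper(src, encoding='gbk', errors='ignore')
        with open(filename, 'w', encoding='utf-8') as dst:
          shutil.copyfileobj(reader, dst)
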


Note the 'ignore' parameter: the default error handling for decode() is 'strict', so without it a UnicodeDecodeError is raised as soon as an invalid byte sequence is encountered.
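
A quick way to see the difference (the byte string below is a made-up example: two gbk characters followed by a stray 0xff byte):

raw = b'\xd6\xd0\xce\xc4\xff'

try:
  raw.decode('gbk')                 # errors='strict' is the default
except UnicodeDecodeError as exc:
  print('strict decoding failed:', exc)

print(raw.decode('gbk', 'ignore'))  # invalid bytes are silently dropped
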
Since unzip_to_txt_plus has already re-encoded the files to utf-8, reading them afterwards succeeds. Here is the code for reading the large file (the database part is ignored).


# -*- coding: utf-8 -*-
import csv
import linecache
import xlrd
import MySQLdb
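# xlrd and MySQLdb are only needed for the database part, which is skipped here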


def txt_todatabase(filename, linenum):
  # Count the rows with the csv module (commented out, kept for reference):
  # with open(filename, "r", encoding="gbk") as csvfile:
  #   reader = csv.reader(csvfile)
  #   count = 0
  #   for row in reader:
  #     count += 1
  #   print(count)
  # Fetch a single line by number; linecache caches the whole file in memory.
  line = linecache.getline(filename, linenum)
  print(line)
  # Write a csv file (commented out, kept for reference):
  # with open("new20171028.csv", "w", newline="") as datacsv:
  #   # dialect="excel" is the default; delimiter sets the separator used when writing.
  #   csvwriter = csv.writer(datacsv, dialect="excel")
  #   # writerow inserts one row; each list element goes into its own cell
  #   # (multiple rows can be written in a loop).
  #   csvwriter.writerow(["A", "B", "C", "D"])


def bigtxt_read(filename):
  # Read the file line by line; only one line is held in memory at a time.
  with open(filename, 'r', encoding='utf-8') as data:
    count = 0
    while True:
      line = data.readline()
      if not line:
        break
      count += 1
      if count == 1000000:
        print(line)
    print(count)


if __name__ == '__main__':
  filename = '20171025.txt'
  txt_todatabase(filename, 1000000)
  bigtxt_read(filename)

After comparison, the two approaches turn out to be roughly equally fast; two million rows are handled without any strain.
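
One caveat, not from the original article: linecache.getline() reads and caches the entire file, so for a one-off lookup in a very large file a forward scan can be gentler on memory. A minimal sketch using itertools.islice (the function name get_line is mine):

from itertools import islice


def get_line(filename, linenum):
  # Scan forward to the requested 1-indexed line without loading the whole file;
  # returns '' if the file has fewer lines, matching linecache.getline().
  with open(filename, 'r', encoding='utf-8') as data:
    return next(islice(data, linenum - 1, linenum), '')


if __name__ == '__main__':
  print(get_line('20171025.txt', 1000000))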

