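Three lines of Python to permanently fix garbled file names when unzipping on Linux - 六虎

Background

I wrote some code that needs to extract a zip archive using shutil from the Python standard library. The code is as follows: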

import shutil


def unzip(self, src_path: str, dst_path: str):
  # shutil.unpack_archive("../README.md.zip", "../")
  # shutil.unpack_archive("../docxx.zip", "../")
  shutil.unpack_archive(src_path, dst_path)

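After extraction, the file names turned out to be garbled, although the file contents were intact.

The Python code runs on Ubuntu, and extracting the same archive with the unzip command shows the same problem: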

$ unzip HEAP.zip
Archive:  HEAP.zip
  inflating: 20230329-SOC│╡╘╞╥╗╠╗▒-1.pptx 
  inflating: 20230330-EEA-╒√│╡╡╫╙╡╞╝▄╣╣ v1.pptx 
  inflating: 20230412-╬╩╠╨▐╕─.md

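On Windows the problem does not occur and the same code works as expected. The archive was also created on Windows, so it is fairly clear that the cause is the different default encodings of the two operating systems.

Solution

To extract a zip archive, shutil calls the zipfile standard library. The relevant code is: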
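The garbling can be reproduced without any archive at all. The snippet below is a minimal sketch that assumes the names were written as GBK bytes and then read back as cp437; the sample file name is made up for illustration.

```python
# Reproduce the garbling: GBK bytes read back as cp437.
original = "测试文档.txt"                # hypothetical file name
raw_bytes = original.encode("gbk")       # bytes a Chinese-locale Windows tool would store
garbled = raw_bytes.decode("cp437")      # the same bytes interpreted as cp437
restored = garbled.encode("cp437").decode("gbk")  # undo the wrong decoding

print(garbled)
print(restored == original)
```

Because cp437 assigns a character to every byte value, the wrong decoding loses no information, which is why it can be reversed.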

def _unpack_zipfile(filename, extract_dir):
  """Unpack zip `filename` to `extract_dir`
   """
  import zipfile # late import for breaking circular dependency

  if not zipfile.is_zipfile(filename):
    raise ReadError("%s is not a zip file" % filename)

  zip = zipfile.ZipFile(filename)
  try:
    for info in zip.infolist():
      name = info.filename

      # don't extract absolute paths or ones with .. in them
      if name.startswith('/') or '..' in name:
        continue

      target = os.path.join(extract_dir, *name.split('/'))
      if not target:
        continue

      _ensure_directory(target)
      if not name.endswith('/'):
        # file
        data = zip.read(info.filename)
        f = open(target, 'wb')
        try:
          f.write(data)
        finally:
          f.close()
          del data
  finally:
    zip.close()

The problem lies in this line:

      name = info.filename

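The file name is read, but it has not been decoded with the correct encoding. Handling it as follows fixes the problem:

      if info.flag_bits & 0x800:  # UTF-8 flag set: the name is already correct
        name = info.filename
      else:
        try:
          # Without the UTF-8 flag, zipfile decodes names as cp437
          name = info.filename.encode('cp437').decode('gbk')  # GBK is ASCII-compatible
        except UnicodeDecodeError:
          name = info.filename

The cause is simple: when the UTF-8 flag is not set, zipfile decodes the file name as cp437 instead of the encoding that was used on Windows (GBK in this case), so the name comes out garbled.

info.flag_bits is a set of flag bits, one of which (0x800) indicates whether the file name is encoded in UTF-8. See the analysis section below for details.

Some tutorials show how to modify the source of the Python standard library to fix this. That is a risky thing to do and I do not recommend it.

The approach I chose is to use shutil.unregister_unpack_format() and shutil.register_unpack_format() to replace, at runtime, the function that unpacks zip archives.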
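To see the flag in practice, here is a small sketch that writes two entries to an in-memory archive with zipfile and reads their flag_bits back. The entry names and the UTF8_FLAG constant are illustrative, not part of the original code.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # the bit tested by the fix above

# Write one ASCII-named and one Chinese-named entry to an in-memory archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("plain.txt", b"ascii name")
    archive.writestr("中文.txt", b"non-ascii name")

# Read the archive back and record whether each entry carries the UTF-8 flag.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    flags = {info.filename: bool(info.flag_bits & UTF8_FLAG)
             for info in archive.infolist()}

print(flags)
```

Python's own zipfile sets the flag whenever a name is not pure ASCII, so archives it writes do not hit this problem. Archives from tools that store local code page bytes without the flag do.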

show me code

The complete code is as follows:

import os
import shutil


def _unpack_zipfile(filename, extract_dir):
  """Unpack zip `filename` to `extract_dir`
   """
  import zipfile # late import for breaking circular dependency

  if not zipfile.is_zipfile(filename):
    raise shutil.ReadError("%s is not a zip file" % filename)

  zip = zipfile.ZipFile(filename)
  try:
    for info in zip.infolist():
      # name = info.filename

      # Support zip archives created on Windows so names are not garbled ====
      if info.flag_bits & 0x800:  # UTF-8 flag set: the name is already correct
        name = info.filename
      else:
        try:
          # Without the UTF-8 flag, zipfile decodes names as cp437
          name = info.filename.encode('cp437').decode('gbk')  # GBK is ASCII-compatible
        except UnicodeDecodeError:
          name = info.filename
      # ========================================================

      # don't extract absolute paths or ones with .. in them
      if name.startswith('/') or '..' in name:
        continue

      target = os.path.join(extract_dir, *name.split('/'))
      if not target:
        continue

      os.makedirs(os.path.dirname(target), exist_ok=True)  # create the parent directory if needed
      if not name.endswith('/'):
        # file
        data = zip.read(info.filename)
        f = open(target, 'wb')
        try:
          f.write(data)
        finally:
          f.close()
          del data
  finally:
    zip.close()


shutil.unregister_unpack_format('zip')
shutil.register_unpack_format('zip', ['.zip'], _unpack_zipfile, [], "ZIP file")
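One way to check that the replacement takes effect is to build a legacy-style archive and unpack it through shutil.unpack_archive. The sketch below is self-contained and makes several assumptions: make_legacy_gbk_zip and unpack_zip_gbk are hypothetical helpers, the unpacker is a compact stand-in for the function above, and the archive is forged by patching raw bytes because zipfile itself always writes non-ASCII names as UTF-8.

```python
import io
import os
import shutil
import tempfile
import zipfile


def make_legacy_gbk_zip(path, name, payload):
    """Write a zip whose entry name is raw GBK bytes with the UTF-8 flag unset."""
    raw_name = name.encode("gbk")
    placeholder = "X" * len(raw_name)  # ASCII stand-in with the same byte length
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(placeholder, payload)  # payload must not contain the stand-in
    # Swap the stand-in for the GBK bytes in the local and central headers.
    with open(path, "wb") as fh:
        fh.write(buffer.getvalue().replace(placeholder.encode("ascii"), raw_name))


def unpack_zip_gbk(filename, extract_dir):
    """Compact stand-in unpacker: re-decode unflagged names as GBK."""
    with zipfile.ZipFile(filename) as archive:
        for info in archive.infolist():
            name = info.filename
            if not info.flag_bits & 0x800:
                try:
                    name = name.encode("cp437").decode("gbk")
                except UnicodeError:
                    pass  # keep zipfile's name when it is not valid GBK
            if name.startswith("/") or ".." in name:
                continue  # same path check as in the code above
            target = os.path.join(extract_dir, *name.split("/"))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            if not name.endswith("/"):
                with archive.open(info) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)


# Replace the registered zip unpacker for this process.
shutil.unregister_unpack_format("zip")
shutil.register_unpack_format("zip", [".zip"], unpack_zip_gbk, [], "ZIP file (GBK names)")

with tempfile.TemporaryDirectory() as workdir:
    archive_path = os.path.join(workdir, "legacy.zip")
    make_legacy_gbk_zip(archive_path, "测试.txt", b"hello")
    out_dir = os.path.join(workdir, "out")
    shutil.unpack_archive(archive_path, out_dir)
    extracted = os.listdir(out_dir)

print(extracted)
```

The registration is global to the running interpreter, so every later shutil.unpack_archive call on a .zip file goes through the replacement.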

Root cause analysis

It is not enough to know that the fix works; we should also understand why.

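zip (the archive file format) is an old standard that first appeared on DOS for IBM PCs, and it is still one of the mainstream archive formats today. DOS at the time could not support Unicode and UTF-8 the way systems do now. Computers in different countries had to install different code pages and could only handle the local (national or regional) script. Under those conditions zip, like DOS, was not designed with a unified Unicode encoding in mind, so file names are stored in whatever default encoding the archiving operating system uses.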

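Today, with the rise of Unicode and UTF-8, more and more systems support UTF-8, an encoding that can represent every writing system in the world. The zip format has also gained a flag bit that indicates whether the file names in an archive are encoded in UTF-8. However, the zip code in mainstream operating systems has gone unmaintained for years and much of it does not follow the latest zip specification, and the file systems of different operating systems do not support the same encodings.

For example, Linux does not use GBK by default. The default encoding for Chinese on Windows is GBK, and even Windows 10 still relies on code pages to determine the system language, so zip compression on Windows writes file names in the local code page (GBK by default) and does not set the UTF-8 flag. It may instead use a special zip feature, the extra field for extended file names, and store a UTF-8 copy of the name there. macOS uses UTF-8 as its default encoding for Chinese, and since UTF-8 is in effect its code page, it writes file names in UTF-8 but still does not set the UTF-8 flag. Operating systems also differ in how they treat upper and lower case in file names: Linux is case-sensitive, while macOS and Windows are case-insensitive by default.

Because the extracting system cannot tell which system encoded the zip file it is about to decode, it cannot choose the decoding that matches that file.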
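Since the encoding cannot be known for certain, one pragmatic option is to try a few likely encodings in order. The function below is only a heuristic sketch: decode_legacy_name and its candidate list are assumptions made here, and a byte sequence can occasionally be valid in more than one encoding, so the order of the candidates matters.

```python
def decode_legacy_name(cp437_name, candidates=("utf-8", "gbk")):
    """Re-decode a name that zipfile read as cp437, trying likely encodings in order."""
    try:
        raw = cp437_name.encode("cp437")  # recover the bytes stored in the archive
    except UnicodeEncodeError:
        return cp437_name  # not the result of a cp437 decode; leave it untouched
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding; try the next
    return cp437_name  # nothing matched; keep what zipfile reported


# Names as zipfile would report them for unflagged GBK and UTF-8 archives.
from_windows = "测试.txt".encode("gbk").decode("cp437")
from_macos = "测试.txt".encode("utf-8").decode("cp437")
print(decode_legacy_name(from_windows), decode_legacy_name(from_macos))
```

UTF-8 is tried first because it is the stricter of the two: random GBK bytes rarely form valid UTF-8, while UTF-8 bytes often decode as GBK without raising an error. The function should only be applied to entries whose UTF-8 flag is not set.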
