iconv

My first idea was to do the conversion with a Linux tool such as iconv. I picked a file at random, 1006.txt, and checked its encoding:

zsh >> file 1006.txt 
1006.txt: ISO-8859 text, with very long lines, with CRLF line terminators

Opening the same file in Notepad++ on Windows, the detected encoding is GB2312.

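Passing iso-8859 as the source encoding makes iconv fail immediately, since that is the name of a family of encodings (ISO-8859-1, ISO-8859-2, ...) rather than one specific encoding: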

zsh >> iconv  -f iso-8859 -t utf-8 1006.txt 
iconv: conversion from `iso-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.

Passing gb2312 as the source encoding gives:

zsh >> iconv  -f gb2312 -t utf-8 1006.txt  

&nbsp;&nbsp;　　2000多万年前欧洲、美洲大陆板块分离的时候，在北大西洋深处扯出一道裂缝，岩浆从裂缝中喷射而出，形成了冰岛――“地球上最美的一道伤痕”　　

　　乡村教堂是冰岛少见的人文景观。岛民的先祖在830年前跟随海盗船到达冰岛

　　冰岛夏季的极昼现象：午夜时分，太阳在地平线上徘徊不落
　　□曼陀罗　文／图
　　在地广人稀的冰岛不停地走，仿佛走到了世界的尽头。
　　1iconv: illegal input sequence at position 374

iconv converted only part of the file before failing.

I have not found a fix for this yet, so I decided to write a converter myself.
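Before writing the converter, it can help to see exactly which bytes a strict decoder rejects. The following is a minimal sketch for Python 3 using only the standard library (the rest of this post uses Python 2). The helper name `first_decode_error` and the sample bytes are made up for illustration; they are not taken from 1006.txt.

```python
def first_decode_error(data, encoding):
    """Return (start, end, offending_bytes) for the first byte sequence that
    cannot be decoded, or None if the data decodes cleanly."""
    try:
        data.decode(encoding)  # strict: fail on the first invalid sequence
    except UnicodeDecodeError as exc:
        return exc.start, exc.end, data[exc.start:exc.end]
    return None


# Made-up sample: valid GB2312 text with one stray 0xFF byte in the middle
sample = "冰岛".encode("gb2312") + b"\xff" + "太阳".encode("gb2312")
print(first_decode_error(sample, "gb2312"))
```

Running the same helper on the raw bytes of a real file shows where a strict decoder stops in Python, which should correspond to the position that iconv complains about.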

Detecting the file encoding with chardet

#!/usr/bin/python
# -*- coding: utf-8 -*-

import chardet

def detect_file_encoding(file_path):
    ''' Return the encoding that chardet detects for the file '''
    # Read raw bytes: chardet.detect() works on bytes, not decoded text
    with open(file_path, 'rb') as f:
        data = f.read()
    predict = chardet.detect(data)
    return predict['encoding']

if __name__ == '__main__':
    file_path = './1006.txt'
    print(detect_file_encoding(file_path))

Output:

GB2312

Reading the file content with the codecs library

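#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs

# detect_file_encoding() is the function defined in the chardet snippet above

def get_file_content(file_path):
    ''' Return the file content, decoded so it can be written out as UTF-8 '''
    file_encoding = detect_file_encoding(file_path)
    # errors="ignore" drops bytes that cannot be decoded instead of raising
    with codecs.open(file_path, 'r', file_encoding, errors="ignore") as f:
        return f.read()

if __name__ == '__main__':
    file_path = './1006.txt'
    print(get_file_content(file_path))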

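Note that errors="ignore" must be passed to codecs.open(); it silently skips bytes that cannot be decoded. Without it, processing is easily interrupted by an error like this: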

UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 374-375: illegal multibyte sequence

The output of the code above matches what Notepad++ displays.
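Because errors="ignore" hides how much text was discarded, it may be worth measuring the loss before relying on it. The sketch below (Python 3, standard library only) decodes with errors="replace" instead and counts the U+FFFD replacement characters. The helper name `count_undecodable` and the sample bytes are assumptions made for illustration.

```python
def count_undecodable(data, encoding):
    """Count the spots in data that the codec cannot decode."""
    # errors="replace" substitutes U+FFFD for each undecodable sequence
    return data.decode(encoding, errors="replace").count("\ufffd")


# Made-up sample: GB2312 text with one stray byte in the middle
raw = "太阳".encode("gb2312") + b"\xff" + "冰岛".encode("gb2312")

ignored = raw.decode("gb2312", errors="ignore")    # bad byte dropped silently
replaced = raw.decode("gb2312", errors="replace")  # bad byte marked as U+FFFD

print(len(ignored), len(replaced), count_undecodable(raw, "gb2312"))
```

A count of zero means errors="ignore" lost nothing; a large count suggests the detected encoding is wrong for that file.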

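The remaining task is to convert all the files. Along the way, chardet returned None as the encoding for some files. Here is what the file command reports for them: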

zsh >> file ./C000008/1789.txt
./C000008/1789.txt: data

zsh >> file ./C000023/1170.txt
./C000023/1170.txt: ISO-8859 text, with very long lines, with CRLF line terminators

I am leaving these unresolved for now. There are 17,912 documents in total, so doing without these two is not a problem.
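One possible way to handle such files, which this post does not pursue, is to try a short list of likely encodings in order and keep the first one that decodes without errors. The sketch below is Python 3 with the standard library only. The function name `decode_with_fallback` and the candidate list are assumptions; gb18030 is listed because it is a superset of GB2312 and GBK.

```python
def decode_with_fallback(data, candidates=("utf-8", "gb18030")):
    """Return (text, encoding) for the first candidate that decodes cleanly,
    or (None, None) if none of them does."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # strict decode failed, try the next candidate
    return None, None


# Made-up sample: GBK bytes are not valid UTF-8, so gb18030 is picked
sample = "冰岛".encode("gbk")
text, used = decode_with_fallback(sample)
print(used, len(text))
```

A file that is genuinely binary, which is what the "data" verdict from the file command suggests, may still fail every candidate, in which case the helper returns (None, None) and the file can be skipped.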

Final code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import chardet, codecs, os

def detect_file_encoding(file_path):
    ''' Return the encoding that chardet detects for the file (may be None) '''
    # Read raw bytes: chardet.detect() works on bytes, not decoded text
    with open(file_path, 'rb') as f:
        data = f.read()
    predict = chardet.detect(data)
    return predict['encoding']

def get_file_content(file_path):
    ''' Return the decoded file content, or None if the encoding is unknown '''
    file_encoding = detect_file_encoding(file_path)
    if file_encoding is None:
        return None
    # errors="ignore" drops bytes that cannot be decoded instead of raising
    with codecs.open(file_path, 'r', file_encoding, errors="ignore") as f:
        return f.read()

def get_all_file(dir_path):
    ''' Return the paths of all files under dir_path '''
    dir_list = [dir_path]
    file_list = []

    while len(dir_list) != 0:
        # print(dir_list)
        curr_dir = dir_list.pop(0)
        for path_name in os.listdir(curr_dir):
            full_path = os.path.join(curr_dir, path_name)
            if os.path.isdir(full_path):
                dir_list.append(full_path)
            else:
                file_list.append(full_path)
    return file_list

def write2file(content, file_path):
    ''' Write content to file_path, encoded as UTF-8 '''
    with codecs.open(file_path, 'w', 'utf-8', errors='ignore') as f:
        f.write(content)

def del_file(file_path):
    ''' Delete a file '''
    os.remove(file_path)


def translate_dir(dir_path):
    ''' Convert every file under dir_path to UTF-8 '''
    for file_path in get_all_file(dir_path):
        # print(file_path)
        content = get_file_content(file_path)
        if content is None:
            # Encoding could not be detected: leave the file untouched
            # instead of overwriting it with empty content
            continue
        del_file(file_path)
        write2file(content, file_path)

def count_encoding_none(dir_path):
    ''' Find out which files have a detected encoding of None '''
    all_num = 0
    none_num = 0
    none_files = []
    for file_path in get_all_file(dir_path):
        print(file_path)
        encoding = detect_file_encoding(file_path)
        all_num += 1
        if encoding is None:
            none_num += 1
            none_files.append(file_path)
    return all_num, none_num, none_files


if __name__ == '__main__':
    dir_path = './'
    translate_dir(dir_path)
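The script above deletes each file before writing the converted content, so a failure between the two steps would lose the file. A more defensive variant, sketched here for Python 3 with the standard library only, writes to a temporary file in the same directory and then renames it over the original. The function name `rewrite_as_utf8` is hypothetical, and the source encoding is assumed to be known already (for example from chardet).

```python
import os
import tempfile


def rewrite_as_utf8(file_path, source_encoding):
    """Re-encode file_path to UTF-8, keeping the original if anything fails."""
    with open(file_path, "rb") as f:
        text = f.read().decode(source_encoding, errors="ignore")

    # Write to a temporary file in the same directory, then swap it in
    dir_name = os.path.dirname(os.path.abspath(file_path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(text.encode("utf-8"))
        os.replace(tmp_path, file_path)  # rename over the original
    except BaseException:
        os.remove(tmp_path)  # clean up the partial temporary file
        raise
```

One trade-off to be aware of: the replacement file gets the permissions of the temporary file rather than those of the original, which may matter outside a throwaway corpus directory.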
