python – How to determine the encoding of text?

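I received some text that is encoded, but I don't know which charset was used. Is there a way to determine the encoding of a text file using Python? The question "How can I detect the encoding/codepage of a text file" covers the same problem for C#.
Best answer
Correctly detecting the encoding every time is impossible.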

(From the chardet FAQ:)

However, some encodings are optimized
for specific languages, and languages
are not random. Some character
sequences pop up all the time, while
other sequences make no sense. A
person fluent in English who opens a
newspaper and finds “txzqJv 2!dasd0a
QqdKjvz” will instantly recognize that
that isn’t English (even though it is
composed entirely of English letters).
By studying lots of “typical” text, a
computer algorithm can simulate this
kind of fluency and make an educated
guess about a text’s language.

The chardet library uses that kind of study to try to detect the encoding. chardet is a port of the auto-detection code in Mozilla.
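As a rough illustration, a minimal sketch of calling chardet might look like the following. It assumes the third-party chardet package is installed (pip install chardet); guess_encoding is a hypothetical helper name, not part of chardet, and the KOI8-R sample text is made up for the example.

```python
# chardet is a third-party package; this sketch degrades gracefully without it.
try:
    import chardet
except ImportError:
    chardet = None


def guess_encoding(raw: bytes):
    """Hypothetical helper: return chardet's guess for raw, or None if chardet is missing."""
    if chardet is None:
        return None
    # chardet.detect() returns a dict that includes 'encoding' and 'confidence'.
    return chardet.detect(raw)


# Made-up sample: Russian text encoded as KOI8-R.
sample = "Привет, мир! Это пример текста на русском языке.".encode("koi8-r")
guess = guess_encoding(sample)
if guess is not None:
    print(guess["encoding"], guess["confidence"])
```

The result is only a statistical guess, so the reported confidence is worth checking before trusting the encoding, especially for short inputs.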

You can also use UnicodeDammit. It will try the following methods:

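> An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
> An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
> An encoding sniffed by the chardet library, if you have it installed.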
> UTF-8
> Windows-1252
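The byte-sniffing and last-resort steps in the list above can be approximated with the standard library alone. The sketch below is not Beautiful Soup's implementation: decode_with_fallback is a hypothetical helper that only checks for a byte order mark, then tries UTF-8, then falls back to Windows-1252. It skips the in-document declaration and chardet steps entirely.

```python
import codecs

# BOMs are checked with UTF-32 first, because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_fallback(raw: bytes):
    """Hypothetical helper: return (text, codec_name) using BOM, UTF-8, then cp1252."""
    # Step 1: sniff the first few bytes for a byte order mark.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            try:
                return raw.decode(name), name
            except UnicodeDecodeError:
                break  # the BOM was misleading; fall through to the next steps
    # Step 2: strict UTF-8 rejects most text written in legacy encodings.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 3: last resort. Bytes undefined in cp1252 become U+FFFD.
    return raw.decode("cp1252", errors="replace"), "cp1252"


text, used = decode_with_fallback(b"caf\xe9")
print(text, used)
```

Because the last step accepts almost any byte sequence, a "cp1252" result means only that nothing better matched, not that the file really is Windows-1252.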
