inputFileHandle = open(inputFileName, 'r')
row = 0
for line in inputFileHandle:
    row = row + 1
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)
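I checked the line endings in the source file, and they check out as line feeds (ASCII char 10). Pulling out the problem rows and parsing them on their own works as expected. Am I hitting some Python limitation here? The first anomaly occurs at around the 4GB mark in the file.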
Now, the explanation of the bug; it’s not easy to reproduce because it depends both on the internal FILE buffer size and the number of chars passed to fread().
In the Microsoft CRT source code, in open.c, there is a block starting with this encouraging comment “This is the hard part. We found a CR at end of buffer. We must peek ahead to see if next char is an LF.”
Oddly, there is an almost exact copy of this function in Perl source code:
The problem is in the call to SetFilePointer(), used to step back one position after the lookahead; past the 4GB mark it will fail because it is unable to return the current position in a 32-bit DWORD. [The fix is easy; do you see it?]
At this point, the function thinks that the next read() will return the LF, but it won’t because the file pointer was not moved back.
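To make the 4GB threshold concrete, here is a small arithmetic sketch. It is plain Python, not the CRT code, and the helper names (fits_in_dword, low_dword) are made up for illustration; it only shows which file offsets can be represented in a 32-bit unsigned value.

```python
DWORD_MAX = 0xFFFFFFFF  # largest value a 32-bit unsigned DWORD can hold


def fits_in_dword(offset):
    """Return True if a file offset can be represented in 32 bits."""
    return 0 <= offset <= DWORD_MAX


def low_dword(offset):
    """Return the low 32 bits of an offset (what survives truncation)."""
    return offset & DWORD_MAX


four_gb = 4 * 1024 ** 3  # 2**32 bytes

print(fits_in_dword(four_gb - 1))  # True: last offset that still fits
print(fits_in_dword(four_gb))      # False: one byte past the limit
print(low_dword(four_gb + 10))     # 10: the upper bits are lost
```

Any position at or beyond 2**32 bytes needs more than one DWORD, which matches the first anomaly showing up around the 4GB mark.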
Note that Python 3.x is not affected (raw files are always opened in binary mode and CRLF translation is done by Python itself); with 2.7, you can use io.open() instead of the built-in open().
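As a rough sketch of that workaround, the loop from the question could be rewritten on top of io.open(). The function name filter_lines, the keep callback (standing in for line_meets_condition), and the utf-8 encoding are assumptions for illustration, not part of the original code; adjust the encoding to match the real file.

```python
import io


def filter_lines(input_path, output_path, keep):
    """Copy lines for which keep(line) is true; return the skipped row numbers.

    io.open() does its newline handling in Python rather than in the C
    runtime. newline='' leaves line endings untouched on read and on write.
    """
    ignored_rows = []
    # encoding='utf-8' is an assumption; use whatever the file really contains.
    with io.open(input_path, 'r', encoding='utf-8', newline='') as src, \
         io.open(output_path, 'w', encoding='utf-8', newline='') as dst:
        for row, line in enumerate(src, start=1):
            if keep(line):
                dst.write(line)
            else:
                ignored_rows.append(row)
    return ignored_rows
```

Under Python 2.7, io.open() in text mode yields unicode strings, so the keep callback has to work on unicode rather than on byte strings.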