
Opening a file with universal newlines in binary mode in python 3

Ask Time: 2019-05-30T08:24:24    Author: craigds


We're (finally) upgrading an application to Python 3.

One thing we have to port is code that rewrites a CSV file with normalized newlines.

The original (python 2) code looks like this:


import csv

IN_PATH = 'in.csv'
OUT_PATH = 'out.csv'

# Opens the original file in 'text mode' (which has no effect on Python 2)
# and with 'universal newlines',
# meaning \r, \n, and \r\n all get treated as line separators.
with open(IN_PATH, 'rU') as in_csv:
    with open(OUT_PATH, 'w') as out_csv:
        csv_reader = csv.reader(in_csv)
        csv_writer = csv.writer(out_csv)

        for tupl in csv_reader:
            csv_writer.writerow(tupl)

These CSV files are user-provided. This means:

  • we have no control over what newline characters they use, so we need to handle all of them.
  • we don't know the encoding of the file at this stage in the process.

Because we don't know the encoding, we cannot decode the bytestrings into text.
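To see why guessing an encoding is unsafe, here is a minimal sketch (the byte string is just an illustrative example): the same bytes can be perfectly valid in one codec and fail outright in another.

import io

# Hypothetical example: these bytes are valid Latin-1 but invalid UTF-8.
data = 'à,b,c\n'.encode('latin-1')  # b'\xe0,b,c\n'

try:
    data.decode('utf-8')
    decoded = True
except UnicodeDecodeError:
    decoded = False  # 0xE0 starts a multi-byte UTF-8 sequence; ',' can't follow it

print(decoded)  # False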

To make this work on Python 3, first we changed it to use io.open(), which is mostly compatible with py3's open(). Now we can't use 'text mode' any more, because on Python 3 that requires decoding the bytestrings, and we don't know the encoding.

However, using 'binary mode' means we can no longer use universal-newlines, since that's only available in text mode.


# Opens the original file in 'binary mode'
# (because we don't know the encoding, so we can't decode it)
# FIXME: How to get universal newline support?
with io.open(IN_PATH, 'rb') as in_csv:
    with io.open(OUT_PATH, 'wb') as out_csv:

Note that, although the U mode character is no longer supported in Python 3, text mode does use universal newlines by default. There doesn't appear to be any way to use universal newlines in binary mode.
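The default text-mode behaviour can be seen with an in-memory stream (a sketch using io.TextIOWrapper, the layer open() uses for text mode): with newline=None, all three conventions come out as '\n' on read.

import io

# Sketch: newline=None (the default) enables universal-newline translation.
raw = io.BytesIO(b'1,2\r3,4\r\n5,6\n')
text = io.TextIOWrapper(raw, encoding='ascii', newline=None)
content = text.read()
print(content)  # \r, \r\n, and \n all arrive as '\n'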

How can we make this code work in Python 3?

Author: craigds, reproduced under the CC BY-SA 4.0 license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/56370136/opening-a-file-with-universal-newlines-in-binary-mode-in-python-3
MisterMiyagi :

TLDR: Use ASCII with surrogate escapes on Python 3:

def text_open(*args, **kwargs):
    return open(*args, encoding='ascii', errors='surrogateescape', **kwargs)

The recommended approach if you know only a partial encoding (e.g. ASCII \r and \n) is to use surrogate escapes for unknown code points:

    What can you do if you need to make a change to a file, but don't know the file's encoding? If you know the encoding is ASCII-compatible and only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler:

This uses reserved placeholders to embed the unknown bytes in your text stream. For example, the byte b'\x99' becomes the "unicode" code point '\udc99'. This works for both reading and writing, allowing you to preserve arbitrary embedded data.

The common line endings (\n, \r, \r\n) are all well-defined in ASCII. It is thus sufficient to use ASCII encoding with surrogate escapes.

For compatibility code, it is easiest to provide separate Python 2 and Python 3 versions of the divergent functionality. open is sufficiently similar that for most use cases you just need to insert the surrogate-escape handling:

import sys

if sys.version_info[0] == 3:
    def text_open(*args, **kwargs):
        return open(*args, encoding='ascii', errors='surrogateescape', **kwargs)
else:
    text_open = open

This allows using universal newlines without knowing the exact encoding. You can use it to directly read or transcribe files:

# 'U' applies only to reading; the output file just needs plain 'w'
with text_open(IN_PATH, 'rU') as in_csv:
    with text_open(OUT_PATH, 'w') as out_csv:
        for line in in_csv:
            out_csv.write(line)

If you need the formatting of the csv module, the text stream provided by text_open is sufficient as well. To handle non-ASCII delimiters/padding/quotes, translate them from a bytestring to the appropriate surrogate:

if sys.version_info[0] == 3:
    def surrogate_escape(symbol):
        return symbol.decode(encoding='ascii', errors='surrogateescape')
else:
    surrogate_escape = lambda x: x

Dezimeter = surrogate_escape(b'\xA9\x87')
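A quick round-trip check of the surrogate-escape mechanism described in this answer (editor's sketch; the byte string is just an example):

# Editor's sketch: round-tripping an unknown byte through surrogateescape.
raw = b'caf\x99'
text = raw.decode('ascii', errors='surrogateescape')
print(repr(text))  # 'caf\udc99' - the stray byte surfaces as a placeholder
restored = text.encode('ascii', errors='surrogateescape')
print(restored == raw)  # True - the original bytes are preserved exactly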
2019-06-02T10:01:53
Peter Henry :

I don't think there is a built-in way to do what you want in Python 3. Without knowing the encoding, you only know for sure that you have a bunch of bytes - you can't be sure which of them mean the characters \r or \n.

Your Python 2 code was probably using the system default encoding, according to sys.getdefaultencoding(), to inform the built-in universal-newline normalizer (don't quote me, I haven't looked at the implementation), and if your system is like mine, that was probably ascii.

Fortunately, most encodings (including utf-8) only differ in the mappings of their higher-order characters (above the ascii range). So it's not a terrible assumption that the byte 10 means \n and 13 means \r in all common encodings - meaning you could do the replacement yourself by reading the input byte by byte (or rather, using a sliding two-byte window).

Warning: I haven't exhaustively tested the following code for behavior around repeated sequences like \r\r\r or weird things like \n\r, so while it may handle those sanely it also may not. Please do test on your own data.

from __future__ import print_function

import io
import six  # optional (but hugely helpful for a 2-to-3 port)


def normalize(prev, curr):
    '''Given the current and previous bytes, get the tuple of bytes to write.

    :param prev: The byte just before the read-head
    :type prev: six.binary_type
    :param curr: The byte at the read-head
    :type curr: six.binary_type
    :returns: A tuple containing 0, 1, or 2 bytes that should be written
    :rtype: Tuple[six.binary_type]
    '''
    R = six.binary_type(b'\r')
    N = six.binary_type(b'\n')
    if curr == R:
        # can't emit \n yet: this \r might be the start of an \r\n sequence
        return ()
    elif curr == N:
        # \n is always emitted, whatever the previous byte was
        return (N,)
    elif prev == R:
        # the previous byte was a lone \r: emit \n, then the current byte
        return (N, curr)
    else:
        # an ordinary byte after an ordinary byte
        return (curr,)


if __name__ == '__main__':

    IN_PATH = 'in.csv'
    OUT_PATH = 'out.csv'

    with io.open(IN_PATH, mode='rb') as in_csv:
        with io.open(OUT_PATH, mode='wb') as out_csv:
            prev = None  # at start, there is no previous byte
            curr = six.binary_type(in_csv.read(1))  # the first byte of the input
            while curr:  # loop over all bytes in the input file
                for byte in normalize(prev, curr):
                    print(repr(byte))  # debugging
                    out_csv.write(byte)  # write each normalized byte
                prev = curr  # update the previous byte
                curr = six.binary_type(in_csv.read(1))  # read the next byte

This works for me on both Python 2.7.16 and 3.7.3, using an input file I created (using Python 3) like this:

import io

# text mode with newline='' so the literal \r and \n are written unchanged
# (the original had mode='wb' with an encoding, which raises ValueError)
with io.open('in.csv', mode='w', encoding='latin-1', newline='') as fp:
    fp.write('à,b,c\n')
    fp.write('1,2,3\r')
    fp.write('4,5,6\r\n')
    fp.write('7,8,9\r')
    fp.write('10,11,12\n')
    fp.write('13,14,15')

It also works using encoding='UTF-8' (as it should).

It's not necessary to use six.binary_type() like I did, but I find it a helpful reminder of the semantics of the data I'm working with, especially when writing cross-version code.

I spent a while trying to figure out whether there was a nicer way to do this than manually examining all the bytes, but was unsuccessful. If anyone else finds a way, I'm interested in seeing it!
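The two-byte-window logic in this answer can be exercised on its own (editor's sketch; normalize is restated here without the six wrapper so the snippet is self-contained):

# Editor's sketch: sliding-window newline normalization over a byte string.
def normalize(prev, curr):
    if curr == b'\r':      # hold back: might be the start of \r\n
        return ()
    elif curr == b'\n':    # \n always comes out as \n
        return (b'\n',)
    elif prev == b'\r':    # a lone \r becomes \n, then the current byte
        return (b'\n', curr)
    else:
        return (curr,)

data = b'1,2\r3,4\r\n5,6\n'
out, prev = b'', None
for i in range(len(data)):
    curr = data[i:i + 1]  # slice, not index, to get a bytes object on Python 3
    out += b''.join(normalize(prev, curr))
    prev = curr
print(out)  # b'1,2\n3,4\n5,6\n'
# note: input ending in a bare \r would need one final flush after the loop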
2019-05-30T19:19:57