
Split complicated strings in Python dynamically

Ask Time:2016-06-23T02:43:09         Author:msleevi


I have been having difficulty writing a function that handles strings the way I want. I have looked through a handful of previous questions (1, 2, 3, among others). Here is the setup: I have well-structured but variable data that needs to be split from a string read from a file into an array of strings. The following are some examples of the data I am dealing with:

('Vdfbr76','gsdf','gsfd','',NULL),
('Vkdfb23l','gsfd','gsfg','[email protected]',NULL),
('4asg0124e','Lead Actor/SFX MUA/Prop designer','John Smith','[email protected]',NULL),
('asdguIux','Director, Camera Operator, Editor, VFX','John Smith','',NULL),
...
(492,'E1asegaZ1ox','Nysdag_5YmD','145872325372620',1,'long, string, with, commas'),

I want to split these strings on commas, but commas occasionally appear inside the quoted strings, which causes problems. In addition, developing an accurate re.split(regex, line) is difficult because the number of items in each line changes throughout the read.
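One way to handle commas inside quoted fields, sketched here under assumptions rather than taken from the question, is to let Python's standard csv module do the splitting with the single quote as the quote character. The helper name parse_row is hypothetical. The sketch assumes each row is wrapped in parentheses, ends with a trailing comma or semicolon, and escapes an embedded quote by doubling it (the csv default); backslash-escaped quotes would need different handling.

```python
import csv

def parse_row(line, quotechar="'"):
    """Split one "(...)," row into fields, keeping commas inside quotes."""
    # Assumption: rows look like "(a,'b',NULL)," with optional whitespace.
    text = line.strip().rstrip(",;")
    if text.startswith("(") and text.endswith(")"):
        text = text[1:-1]
    # csv.reader treats quotechar-delimited fields as single values.
    return next(csv.reader([text], quotechar=quotechar))

rows = [
    "('asdguIux','Director, Camera Operator, Editor, VFX','John Smith','',NULL),",
    "(492,'E1asegaZ1ox','Nysdag_5YmD','145872325372620',1,'long, string, with, commas'),",
]
for row in rows:
    print(parse_row(row))
```

Because the number of fields is never specified, this works for rows of any length. Every value comes back as a string with its quotes removed, so an empty quoted field is '' while an unquoted NULL is the string 'NULL'.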

Some solutions that I have tried up to this point.

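import re

def splitLine(text, fields, delimiter):
    # Build a pattern with one capture group per field:
    # a lazy first group, then greedy groups joined by the delimiter.
    regex_string = "(.*?),"

    for i in range(0, len(fields) - 1):
        regex_string += "(.*)"

        # No delimiter after the final group.
        if i < len(fields) - 2:
            regex_string += delimiter

    # re.split returns the text around the match plus every captured group.
    return_line = re.split(regex_string, text)

    return return_line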

This gives the following values for regex_string and return_line:

(.*?),(.*),(.*),(.*),(.*),(.*)
['', '\t(222', "'Vy1asdfnuJkA','Ndfbyz3_YMD'", "'14541242640005471'", '2', "'Hello World!')", '', '\n']

However, the main problem with this is that it occasionally lumps two fields together, in this case in the third value of the array.

Where the ideal result would look like:

['', '\t(222', "'Vy1asdfnuJkA'", "'Ndfbyz3_YMD'", "'14541242640005471'", '2', "'Hello World!')", '', '\n']

It is a small change, but it has a huge influence on the result. I tried manipulating the regex string to better suit what I was trying to do, but with each case I solved, another broke it unfortunately.
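A likely reason for the lumping, assuming the input line ended with a trailing comma as the sample rows do, is that the line then contains one more comma than the pattern and the first greedy (.*) group absorbs the extra one. An alternative that avoids counting fields is to match the fields themselves rather than the separators. The sketch below is an assumption-laden illustration, not code from the question: the names FIELD and tokenize are hypothetical, and it treats every field as either a single-quoted string or a bare token such as a number or NULL.

```python
import re

# Either a single-quoted string (backslash escapes allowed inside it)
# or a bare token containing no comma, parenthesis, quote, or whitespace.
FIELD = re.compile(r"'(?:[^'\\]|\\.)*'|[^,()'\s]+")

def tokenize(line):
    """Return every field found in the line, quotes left in place."""
    return FIELD.findall(line)

line = "\t(222,'Vy1asdfnuJkA','Ndfbyz3_YMD','14541242640005471',2,'Hello World!'),\n"
print(tokenize(line))
```

One limitation of this sketch is that a completely empty unquoted field (two adjacent commas) is skipped rather than returned as an empty string.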

Another approach I tried came from user Aaron Cronin in this post 4, and looks like this:

def split_at(text, delimiter, opens='<([', closes='>)]', quotes='"\''):
    result = []
    buff = ""
    level = 0
    is_quoted = False

    for char in text:
        if char in delimiter and level == 0 and not is_quoted:
            result.append(buff)
            buff = ""
        else:
            buff += char

            if char in opens:
                level += 1
            if char in closes:
                level -= 1
            if char in quotes:
                is_quoted = not is_quoted

    if not buff == "":
        result.append(buff)

    return result

The results of this look like so:

["\t('Vk3NIasef366l','gsdasdf','gsfasfd','',NULL),\n"]

The main problem is that the whole line comes back as a single string, the same as the input, which puts me in a feedback loop.

The ideal result would look like:

[\t('Vk3NIasef366l','gsdasdf','gsfasfd','',NULL),\n]
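In split_at, the opening parenthesis is one of the opens characters, so every comma inside the row is seen at level 1 and is never treated as a delimiter. A minimal sketch of a variant that strips the outer parentheses first and then tracks only quotes is shown below. The name split_fields is hypothetical, and the sketch assumes single-quoted fields with no escaped quotes inside them.

```python
def split_fields(line, delimiter=",", quote="'"):
    """Split a "(...)," row on delimiters that fall outside quotes."""
    # Assumption: the row is wrapped in parentheses with a trailing comma.
    text = line.strip().rstrip(",")
    if text.startswith("(") and text.endswith(")"):
        text = text[1:-1]

    fields, buff, in_quotes = [], [], False
    for char in text:
        if char == quote:
            # Toggle quoted state and keep the quote in the field.
            in_quotes = not in_quotes
            buff.append(char)
        elif char == delimiter and not in_quotes:
            fields.append("".join(buff))
            buff = []
        else:
            buff.append(char)
    fields.append("".join(buff))
    return fields

print(split_fields("\t('Vk3NIasef366l','gsdasdf','gsfasfd','',NULL),\n"))
```

Unlike the csv-style approach, this keeps the surrounding quotes on each field, which makes it easy to tell a quoted empty string from an unquoted NULL.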

Any help is appreciated. I am not sure what the best approach is in this scenario, and I am happy to clarify any questions that arise. I tried to be as complete as possible.

Author: msleevi. Reproduced under the CC BY-SA 4.0 license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/37975964/split-complicated-strings-in-python-dynamically