Python 2 tokenization and add to dictionary

I have some texts that I need to split into tokens on whitespace. I also need to remove all punctuation, as well as everything inside double brackets [[...]] (including the brackets themselves).

Each token will become a key in a dictionary, mapped to a list of values.

I have tried using a regex to remove these double-bracket patterns, as well as if-else chains, but I can't find a solution that works. For the moment I have:

tokenDic = dict()
splittedWords =  re.findall(r'\[\[\s*([^][]*?)]]',  docs[doc], re.IGNORECASE) 
tokenStr = splittedWords.split()

for token in tokenStr:
    tokenDic[token].append(value);


Solution 1:[1]

To remove everything inside [[...]], you can use re.sub. You already have the correct regex, so just do this:

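 import re
 x = "[[hello]]w&o%r*ld^$"
 # remove every [[...]] span, brackets included
 y = re.sub(r"\[\[\s*([^][]*?)]]", "", x)
 # drop everything that is not a letter or whitespace
 z = re.sub(r"[^a-zA-Z\s]", "", y)
 print(z)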

This prints "world".
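
Combining the two substitutions with a whitespace split gives the token list the question asks for. This is a minimal sketch: the helper name `tokenize` is hypothetical, and it assumes that only ASCII letters should survive, as in the regex above (so digits are dropped too).

```python
import re

# Matches one [[...]] span, brackets included (same pattern as above).
BRACKETED = re.compile(r"\[\[\s*([^][]*?)]]")
# Matches anything that is not an ASCII letter or whitespace.
NON_LETTER = re.compile(r"[^a-zA-Z\s]")


def tokenize(text):
    """Hypothetical helper: strip [[...]] spans and punctuation, then split."""
    without_brackets = BRACKETED.sub("", text)
    letters_only = NON_LETTER.sub("", without_brackets)
    # split() with no argument collapses runs of whitespace,
    # so no empty tokens are produced.
    return letters_only.split()


print(tokenize("[[hello]]w&o%r*ld^$ foo,  bar! [[skip me]] baz"))
# ['world', 'foo', 'bar', 'baz']
```

Because each [[...]] span is replaced with an empty string, the text on either side of it is joined ("a[[x]]b" becomes "ab"). If those should stay separate tokens, substituting a single space instead would do it.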

Solution 2:[2]

Is this what you're looking for?

import re
value_list = []
inp_str = 'blahblah[[blahblah]]thi ng1[[junk]]hmm'
tokenDic = dict()
#remove everything in double brackets
bracket_stuff_removed = re.sub(r'\[\[[^]]*\]\]', '', inp_str)

#function to keep only letters and digits
clean_func = lambda x: 97 <= ord(x.lower()) <= 122 or 48 <= ord(x) <= 57

for token in bracket_stuff_removed.split(' '):
    cleaned_token = ''.join(filter(clean_func, token))
    tokenDic[cleaned_token] = list(value_list)

print(tokenDic)

Output:

{'blahblahthi': [], 'ng1hmm': []}

As for appending to the list, I don't have enough info right now to tell you the best way in your situation.

If you want to set the value when you're adding the key, do this:

tokenDic[cleaned_token] = [val1, val2, val3]

If you want to set the values after the key has been added, do this:

val_to_add = "something"
if cleaned_token not in tokenDic:
    print('ERROR', cleaned_token, 'does not exist in dict')
else:
    tokenDic[cleaned_token].append(val_to_add)

If you want to append directly in both cases, whether or not the key already exists, use defaultdict(list) (from the collections module) instead of dict. Then, if the key does not exist in the dict, it is created with an empty list as its value before your value is appended.
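
A minimal sketch of the defaultdict approach. The `pairs` list is made up for illustration and is not data from the question.

```python
from collections import defaultdict

# Missing keys are created on first access, with an empty list as the value.
tokenDic = defaultdict(list)

# Hypothetical (token, value) pairs, for illustration only.
pairs = [("world", 1), ("foo", 2), ("world", 3)]

for token, value in pairs:
    # No KeyError here, even the first time a token is seen.
    tokenDic[token].append(value)

print(tokenDic["world"])  # [1, 3]
print(tokenDic["foo"])    # [2]
```

With a plain dict, `tokenDic.setdefault(token, []).append(value)` has the same effect. Note that merely reading a missing key from a defaultdict also creates it, so use `token in tokenDic` when you only want to test for membership.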

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2