Using SymSpell lookup_compound with parallel processing
I am new to NLP-related tasks and I'm doing this with pandas (Python). Each row of the dataframe holds a text that I'm trying to run a spelling corrector on (sentence length varies). The dataframe currently has slightly over 1 million records and is likely to grow.
Initially, I tried applying symspell's lookup_compound directly via pandas' apply, but after more than 12 hours it still had not produced any results.
```python
def symspell_compound(input_term, max_edit_distance=2):
    # sym_spell is assumed to be an already-initialised SymSpell instance
    # with a frequency dictionary loaded.
    suggestions = sym_spell.lookup_compound(input_term, max_edit_distance)
    # Suggestions come back ranked best-first, so return the top term.
    for suggestion in suggestions:
        return suggestion.term
    return input_term  # fall back to the original text if nothing was suggested

df['text_data'].apply(symspell_compound)
```
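For completeness, the snippet above assumes `sym_spell` already exists. The question doesn't show how it was built; the sketch below is a typical symspellpy setup, using the English frequency dictionary that ships with the package and commonly documented parameter values (both are assumptions, not taken from the question):

```python
from symspellpy import SymSpell

# Build the index once, outside any per-row function: loading the
# dictionary is expensive and must not be repeated per call.
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
sym_spell.load_dictionary(
    "frequency_dictionary_en_82_765.txt",  # dictionary bundled with symspellpy
    term_index=0,   # column of the term in the dictionary file
    count_index=1,  # column of the frequency count
)
```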
Then I came across joblib's Parallel. I wasn't able to find many examples of it, but it seems to work on lists. So after extracting text_data into a list, I applied Parallel() together with the symspell_compound function, yet the processing was still slow (see the verbose printout below).
```python
text_list = df['text_data'].to_list()
test_parallel = Parallel(n_jobs=4, verbose=10)(
    delayed(symspell_compound)(i) for i in text_list[:1000]
)
```
Below is the verbose printout from a run on a sample of 1,000 records:
```
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 tasks | elapsed:    52.2s
[Parallel(n_jobs=4)]: Done  10 tasks | elapsed:   1.7min
[Parallel(n_jobs=4)]: Done  17 tasks | elapsed:   2.8min
[Parallel(n_jobs=4)]: Done  24 tasks | elapsed:   3.9min
[Parallel(n_jobs=4)]: Done  33 tasks | elapsed:   5.2min
[Parallel(n_jobs=4)]: Done  42 tasks | elapsed:   6.6min
[Parallel(n_jobs=4)]: Done  53 tasks | elapsed:   8.2min
[Parallel(n_jobs=4)]: Done  64 tasks | elapsed:   9.9min
[Parallel(n_jobs=4)]: Done  77 tasks | elapsed:  11.9min
```
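Editor's note on the timings: roughly 52 s for the first 5 tasks is about 10 s per sentence, which suggests lookup_compound itself is the dominant cost rather than parallel dispatch. Still, since each delayed(...) call above wraps a single sentence, Parallel pays scheduling and pickling overhead per sentence; joblib's batch_size argument groups several tasks per dispatch and removes that one source of overhead. A runnable sketch with a placeholder worker (fake_correct is hypothetical, standing in for symspell_compound so the example needs no dictionary):

```python
from joblib import Parallel, delayed

def fake_correct(sentence):
    # Placeholder for symspell_compound: just normalises whitespace.
    return " ".join(sentence.split())

texts = ["helo   wrld", "gooda  morning"] * 50
# batch_size groups many short tasks into one inter-process dispatch
# (the default 'auto' adapts; an explicit value is easier to reason about).
results = Parallel(n_jobs=2, batch_size=25)(
    delayed(fake_correct)(t) for t in texts
)
```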
Any ideas on what has gone wrong (e.g. a function parameter), or how I can do this more efficiently? Thanks in advance.
Side note: I'm running this in a CDSW workbench with 4 CPUs and 8 GB of memory (the maximum allowed so far).
Solution 1:
Performance-wise, Python probably isn't the best choice. The C# implementation's LookupCompound can reach 5,000 words/s single-core on a 2012 MacBook (see https://seekstorm.com/blog/sub-millisecond-compound-aware-automatic.spelling-correction/ ). One of the following ports with Python bindings might improve performance by orders of magnitude:
- Rust port: https://github.com/reneklacan/symspell
- Python bindings for the Rust port: https://github.com/zoho-labs/symspell
- Python bindings for the C++ port
- Original C# version: https://github.com/wolfgarbe/symspell
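Editor's addition, not part of the original answer: orthogonal to switching implementations, if the million-row corpus contains repeated sentences, caching corrections means lookup_compound is paid only once per distinct input. A stdlib sketch with functools.lru_cache, where slow_correct is a hypothetical stand-in for the SymSpell call:

```python
from functools import lru_cache

def slow_correct(sentence):
    # Hypothetical stand-in for sym_spell.lookup_compound(...):
    # here it only normalises whitespace.
    return " ".join(sentence.split())

@lru_cache(maxsize=None)
def cached_correct(sentence):
    # Identical sentences are corrected once; repeats hit the cache.
    return slow_correct(sentence)

texts = ["helo  wrld", "helo  wrld", "gooda morning"]
corrected = [cached_correct(t) for t in texts]
# The second "helo  wrld" is served from the cache.
```

The same idea works without a cache decorator: drop duplicates first, correct the distinct sentences, then map the results back onto the full column.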
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Wolf Garbe |