MemoryError: Unable to allocate 1.88 GiB for an array with shape (2549150, 99) and data type object
I have a problem. I want to normalize a list of nested dicts with pd.json_normalize(...), but I get a MemoryError. Is there a way to work around this error? It works with pd.json_normalize(my_data[:2000000], sep="_"), but not with the complete data (2549150 entries).
I have already looked at "MemoryError: Unable to allocate MiB for an array with shape and data type, when using anymodel.fit() in sklearn" and "Unable to allocate array with shape and data type".
import pandas as pd

my_data = [
{'_id': 'orders/213123',
'contactEditor': {'name': 'Max Power',
'phone': '1234567',
'email': '[email protected]'},
'contactSoldToParty': {'name': 'Max Not',
'phone': '123456789',
'email': '[email protected]'},
'isCompleteDelivery': False,
'metaData': {'dataOriginSystem': 'Goods',
'dataOriginWasCreatedTime': '10:12:12',},
'orderDate': '2021-02-22',
'orderDateBuyer': '2021-02-22',
},
{'_id': 'orders/12323',
'contactEditor': {'name': 'Max Power2',
'phone': '1234567',
'email': '[email protected]'},
'contactSoldToParty': {'name': 'Max Not',
'phone': '123456789',
'email': '[email protected]'},
'isCompleteDelivery': False,
'metaData': {'dataOriginSystem': 'Goods',
'dataOriginWasCreatedTime': '10:12:12',},
'orderDate': '2021-02-22',
'orderDateBuyer': '2021-02-22',
},
]
df = pd.json_normalize(my_data, sep="_")
[OUT]
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11136/3519902863.py in <module>
----> 1 df= pd.json_normalize(my_data, sep='_')
MemoryError: Unable to allocate 1.88 GiB for an array with shape (2549150, 99) and data type object
What I want
id contactEditor_name contactEditor_phone contactEditor_email ...
orders/213123 Max Power ... ... ...
orders/12323 Max Power2 ... ... ...
The length of my_data (len(my_data)) is 2549150.
Solution 1:[1]
I have faced similar issues before. You should split the input, as you already mentioned in your question: normalize the first X rows, then the next X, and so on, and finally combine the results into a single DataFrame.
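A minimal sketch of that loop, assuming pandas is imported as pd and a hypothetical chunk_size that you would tune to your memory budget:
chunk_size = 500_000  # assumed value; adjust to your available memory
frames = []
for start in range(0, len(my_data), chunk_size):
    # normalize one slice at a time instead of the whole list at once
    frames.append(pd.json_normalize(my_data[start:start + chunk_size], sep="_"))
df = pd.concat(frames, ignore_index=True)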
Solution 2:[2]
You can try this simple workaround.
split_index = 1500000
df = pd.concat([
pd.json_normalize(my_data[:split_index], sep="_"),
pd.json_normalize(my_data[split_index:], sep="_"),
], ignore_index=True)
If the memory restriction is very tight, you may need to split the data into more than two segments.
Solution 3:[3]
As @ken has pointed out, splitting the data into smaller sections may solve the problem.
However, the example data you posted produces around 12 columns, not 99. Could it be that one of the entries is corrupted and messing up your parsing? What do your loaded 2M rows look like if you call df.info()?
You could also do a kind of "lazy loading" using max_level=1 and then expand items only when needed. That may help if the problem comes from many nested levels.
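A sketch of that deferred expansion, not the exact approach from the answer: since the sample data only nests one level deep, it uses max_level=0 so the nested dicts stay unexpanded, and contactEditor is a column name taken from the sample data.
# Keep each nested dict in a single object column instead of exploding it.
df = pd.json_normalize(my_data, max_level=0)

# Later, expand only the nested column you actually need and join it back
# under prefixed column names.
expanded = pd.json_normalize(df["contactEditor"].tolist()).add_prefix("contactEditor_")
df = df.drop(columns=["contactEditor"]).join(expanded)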
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Haakh |
| Solution 2 | |
| Solution 3 | Zaero Divide |