MemoryError: Unable to allocate 1.88 GiB for an array with shape (2549150, 99) and data type object

I have a problem: I want to normalize a list of dicts with pd.json_normalize(...), but unfortunately I get a MemoryError. Is there a way to work around this error? It did work with pd.json_normalize(my_data[:2000000], sep="_"), but not with the complete data (2549150 records).

I already looked at "MemoryError: Unable to allocate MiB for an array with shape and data type, when using anymodel.fit() in sklearn" and "Unable to allocate array with shape and data type".

import pandas as pd

my_data = [
{'_id': 'orders/213123',
 'contactEditor': {'name': 'Max Power',
  'phone': '1234567',
  'email': '[email protected]'},
 'contactSoldToParty': {'name': 'Max Not',
  'phone': '123456789',
  'email': '[email protected]'},
 'isCompleteDelivery': False,
 'metaData': {'dataOriginSystem': 'Goods',
  'dataOriginWasCreatedTime': '10:12:12',},
 'orderDate': '2021-02-22',
 'orderDateBuyer': '2021-02-22',
},
{'_id': 'orders/12323',
 'contactEditor': {'name': 'Max Power2',
  'phone': '1234567',
  'email': '[email protected]'},
 'contactSoldToParty': {'name': 'Max Not',
  'phone': '123456789',
  'email': '[email protected]'},
 'isCompleteDelivery': False,
 'metaData': {'dataOriginSystem': 'Goods',
  'dataOriginWasCreatedTime': '10:12:12',},
 'orderDate': '2021-02-22',
 'orderDateBuyer': '2021-02-22',
 },
]

df = pd.json_normalize(my_data, sep="_")
[OUT]
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11136/3519902863.py in <module>
----> 1 df= pd.json_normalize(my_data, sep='_')
MemoryError: Unable to allocate 1.88 GiB for an array with shape (2549150, 99) and data type object

What I want

id             contactEditor_name contactEditor_phone contactEditor_email ...
orders/213123  Max Power          ...                 ...                 ...
orders/12323   Max Power2         ...                 ...                 ...

len(my_data) is 2549150.



Solution 1:[1]

I have faced similar issues previously. You should split the input data, as you already mentioned in your question: normalize the first X rows, then the next X, and so on. Finally, combine the partial results into a single DataFrame.
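
A minimal sketch of that approach, assuming the normalized result still fits in memory and using an illustrative chunk_size of 500,000 rows (not a value from the original answer):

import pandas as pd

chunk_size = 500_000  # illustrative; tune to your available memory

# Normalize the list in chunks and collect the partial DataFrames.
frames = []
for start in range(0, len(my_data), chunk_size):
    frames.append(pd.json_normalize(my_data[start:start + chunk_size], sep="_"))

# Combine the partial results into a single DataFrame.
df = pd.concat(frames, ignore_index=True)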

Solution 2:[2]

You can try this simple workaround.

split_index = 1500000
df = pd.concat([
    pd.json_normalize(my_data[:split_index], sep="_"),
    pd.json_normalize(my_data[split_index:], sep="_"),
], ignore_index=True)

If the memory restriction is very tight, you may need to split the data into three or more segments.
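
The same split generalizes to any number of parts; here is a sketch with an assumed n_parts value (not from the original answer):

import math

n_parts = 4  # assumed; increase further if the MemoryError persists
part_size = math.ceil(len(my_data) / n_parts)

df = pd.concat(
    [pd.json_normalize(my_data[i:i + part_size], sep="_")
     for i in range(0, len(my_data), part_size)],
    ignore_index=True,
)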

Solution 3:[3]

As @ken has pointed out, splitting the data into smaller sections may solve the problem.

However, in the example data that you posted, there are around 12 columns, not 99. Could it be that one of the entries is corrupted and messing up your parsing?

What do the 2M rows you did manage to load look like if you call df.info()?

You could also do a kind of "lazy loading" using max_level=1 and then expand items only when needed. It may help if the problem comes from many nested levels.
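
A minimal sketch of that idea (the deeper-nested column name in the comment is hypothetical, since only the real data shows which columns stay nested):

# Flatten only the first level of nesting; deeper dicts remain as single
# object columns instead of being expanded into many columns at once.
df = pd.json_normalize(my_data, sep="_", max_level=1)

# Expand one specific nested column later, only when needed, e.g.:
# expanded = pd.json_normalize(df["some_nested_column"].tolist(), sep="_")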

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Haakh
Solution 2: (unattributed)
Solution 3: Zaero Divide