Data cleaning before, during or after data ingestion?

I am building a self-contained data analytics project in Python. Because the project needs to be scalable, it requires a fairly solid pipeline for data processing and analytics.

So far I'm planning to use Singer (https://www.singer.io/) to ingest the data from multiple sources, with a PostgreSQL target.

The pipeline currently looks a bit like this: data sources --> ingest --> store in PostgreSQL DB --> data processing layer --> analytics environment.

I have already written Pandas code to clean data in the data processing layer, but I'm not sure whether cleaning data as it is pulled from the database into the analytics environment is best practice, especially as the processing will then be repeated each time the data is pulled. Should I process the data in the ingestion layer instead? How would I do that with a Singer pipeline?



Solution 1:[1]

As always, it depends.

Cleaning data before ingest

Pros

  • It lowers network traffic / data volume
  • It requires less storage

Cons

  • It requires extra steps at each data source
  • These per-source steps are hard to orchestrate and monitor

Cleaning data during ingest

Pros

  • Preliminary checks are located in a single place
  • You are able to report metrics
    • For example: records ingested, records dropped, the ratio of ingested to dropped, etc.
  • This step can be orchestrated and monitored more easily

Cons

  • It is just a preliminary check
    • During data modelling you might need to do further cleaning
  • Maintaining these rules is the responsibility of the data pipeline engineer
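
As an illustration of a preliminary check during ingest, here is a minimal sketch of a filter that could sit between a Singer tap and a target. It relies only on the fact that Singer messages are JSON lines with a "type" field (SCHEMA, RECORD, STATE). The rule in is_valid, the function names and the sample data are all hypothetical, so treat this as a starting point rather than the Singer way of doing it.

```python
import json
from typing import Dict, Iterable, Iterator


def is_valid(record: dict) -> bool:
    """Hypothetical preliminary check: require a non-null "id" field."""
    return record.get("id") is not None


def clean_messages(lines: Iterable[str], metrics: Dict[str, int]) -> Iterator[str]:
    """Pass Singer messages through, dropping RECORD messages that fail the check."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if message.get("type") != "RECORD":
            # SCHEMA and STATE messages are forwarded untouched.
            yield line
        elif is_valid(message["record"]):
            metrics["ingested"] += 1
            yield line
        else:
            metrics["dropped"] += 1


# Made-up sample messages standing in for a tap's output.
sample = [
    json.dumps({"type": "SCHEMA", "stream": "users",
                "schema": {"properties": {}}, "key_properties": ["id"]}),
    json.dumps({"type": "RECORD", "stream": "users",
                "record": {"id": 1, "name": "Ada"}}),
    json.dumps({"type": "RECORD", "stream": "users",
                "record": {"id": None, "name": "?"}}),
    json.dumps({"type": "STATE", "value": {"users": 2}}),
]

metrics = {"ingested": 0, "dropped": 0}
cleaned = list(clean_messages(sample, metrics))
print(metrics)
```

In a real pipeline the same function could read from sys.stdin and write to sys.stdout, so that it runs as a step piped between the tap and the target, with the metrics sent to a log or monitoring system.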

Cleaning data after ingest

Pros

  • It can be used not just for preliminary checks
    • For example: deduplication, filtering out unwanted outliers, etc.
  • Different cleansing steps can be defined on a data model basis

Cons

  • It requires more storage
  • Each data scientist has to implement their own cleansing steps
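
One way to avoid repeating the cleaning on every pull is to run it once after ingest and store the result next to the raw data. The sketch below is only an assumption about how that might look: it uses the standard-library sqlite3 module as a stand-in for PostgreSQL so that it runs without a server, and the table names, columns and outlier thresholds are made up.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(1, 10.0), (1, 10.0), (2, 25.5), (3, -5.0), (4, 1_000_000.0)],
)


def build_clean_orders(conn, min_amount=0.0, max_amount=10_000.0):
    """Rebuild the cleaned table from the raw table.

    The raw table is kept as-is, which is where the extra storage goes.
    """
    with conn:  # commit on success, roll back on error
        conn.execute("DROP TABLE IF EXISTS clean_orders")
        conn.execute("CREATE TABLE clean_orders (order_id INTEGER, amount REAL)")
        conn.execute(
            """
            INSERT INTO clean_orders
            SELECT DISTINCT order_id, amount      -- deduplication
            FROM raw_orders
            WHERE amount BETWEEN ? AND ?          -- drop unwanted outliers
            """,
            (min_amount, max_amount),
        )


build_clean_orders(conn)
rows = conn.execute(
    "SELECT order_id, amount FROM clean_orders ORDER BY order_id"
).fetchall()
print(rows)
```

The analytics environment would then read from clean_orders, while a different data model could build its own cleaned table from the same raw data with different rules.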

This is not an exhaustive list, but I hope it shows how you should start thinking about this problem.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Peter Csala