The dirty truth of data lakes

Reimagining the data warehouse

Posted by steve on May 23, 2024

The architecture principles that drove the creation of the data-lake paradigm were:

  1. you cannot determine all the future use-cases for data at the point of capture
  2. the opportunity cost from analysis delays is higher than the cost of storing the data
  3. the operational cost of archiving data is higher than the cost of retaining it
  4. the operational cost of updating data is higher than the processing cost of aggregating it.
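
The fourth principle can be illustrated with a small sketch: records are only ever appended, and the current state is derived by aggregating them on read. The account names, amounts and the `current_balances` helper are hypothetical, chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical append-only event log: each record is (account, amount delta).
events = [
    ("acct-1", 100),
    ("acct-2", 50),
    ("acct-1", -30),
]


def current_balances(log):
    """Derive current state by aggregating the log; nothing is updated in place."""
    balances = defaultdict(int)
    for account, delta in log:
        balances[account] += delta
    return dict(balances)


# A correction is recorded as a new event, not as an update to an old record.
events.append(("acct-2", -50))
print(current_balances(events))
```

The trade-off is that every read pays the aggregation cost, which is exactly the cost the principle assumes to be lower than that of maintaining updates.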

These principles gave rise to the data-lake pattern, in which web-scale companies {Google, Amazon, Alibaba, Facebook, etc.} invest considerably to keep accruing new insights from the click-stream of their users; this in turn led to the widespread adoption of Hadoop and its many derivatives as a data-storage pattern.

The metaphor of a lake is appealing because lakes store water for later use, but it also implies an effortless natural process rather than the effort and cost of building a reservoir. The data-sewer is the anti-pattern of the data-lake: you really do not know whether you’ve built a lake or a sewer until you try to accrue value from it.

The reason for this post is to highlight three traits that lead to data-sewers rather than data-lakes:

  1. missing the opportunity to undertake rudimentary analysis and normalisation up-front, either by missing commonality (treating {facility, loan, mortgage, repo, ...} as exceptions) or by misclassifying (treating a derivative contract as a legal agreement rather than an instrument)
  2. missing the opportunity to map lifecycle (hedging, securitisation, late-booking)
  3. missing the opportunity to move to a real-time event-model by keeping the focus on batch cycles.
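
As a rough sketch of the first trait, the snippet below maps several product types onto one common shape at the point of capture, rather than treating each as an exception. It assumes that these products share a counterparty, a principal and a maturity; the `LendingAgreement` class, the `FIELD_MAPS` table and every source field name are hypothetical and would differ in any real feed.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LendingAgreement:
    """Hypothetical common shape for loan, mortgage and repo records."""
    product: str
    counterparty: str
    principal: float
    maturity: date


# Hypothetical per-source field names; real feeds will differ.
FIELD_MAPS = {
    "loan": {"counterparty": "borrower", "principal": "amount", "maturity": "end_date"},
    "mortgage": {"counterparty": "mortgagor", "principal": "balance", "maturity": "term_end"},
    "repo": {"counterparty": "cpty", "principal": "cash_leg", "maturity": "repurchase_date"},
}


def normalise(product, raw):
    """Map a raw source record onto the common shape at ingest."""
    fields = FIELD_MAPS[product]  # a KeyError flags an unmapped product type
    return LendingAgreement(
        product=product,
        counterparty=raw[fields["counterparty"]],
        principal=float(raw[fields["principal"]]),
        maturity=date.fromisoformat(raw[fields["maturity"]]),
    )


record = normalise(
    "repo",
    {"cpty": "Bank A", "cash_leg": "1000000", "repurchase_date": "2024-06-28"},
)
print(record)
```

Failing loudly on an unmapped product type is a deliberate choice here: it forces the classification question to be answered at ingest, rather than being discovered later when someone tries to query the lake.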

The architecture mistake is to see Hadoop as a paradigm shift in technology rather than as a (potentially) cheaper data-warehouse. When cloud providers offer hybrid solutions that combine traditional MPP databases (SQL Server PDW, Oracle Exadata, etc.) with Hadoop/Spark/Kafka integration, and with block-storage replacing HDFS, it is not unreasonable for business sponsors to question whether all the effort was a waste of time.

Microsoft Synapse is one example of a technology advance making chief data offices obsolete.