Data Lake as an answer — The evolution, standards and future driving force

Aleksander Bircakovic, Data Tech Lead @ Levi9


Designing a Data Lake: cloud or on-prem system?

Efficiency and scalability

Assessing current needs and predicting potential growth can be a challenging task. With an on-prem system, both have to be estimated up front for the upcoming period in order to put together a business justification for securing the funds.
Cloud platforms, on the other hand, usually charge for services based on used or reserved processing power and used storage. This billing model enables a quick start on the journey towards an MVP solution. As the complexity of the requirements and the amount of data increase, a cloud-based system can be scaled up easily. Blob storage is usually very cheap, and its capacity is practically unlimited. Database servers can be scaled as needed by allocating stronger instances. Processing that runs as code packaged in containers, or on distributed systems that are terminated once the work is done, is charged according to the processing power and other resources actually used. Tools like AWS Glue, Google Cloud Dataflow, AWS Lambda, and Google Cloud Functions are just some of the options that offer those capabilities.
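The pay-per-use model described above comes down to simple arithmetic, which the following minimal sketch illustrates. The `Rates` class, the `monthly_cost` function, and every price in it are hypothetical placeholders chosen for readability, not actual prices of any provider. Real bills also include items such as requests, network egress, and reserved-capacity discounts, which are left out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical unit prices; real ones vary by provider, region, and tier."""
    storage_per_gb_month: float  # blob storage, per GB per month
    compute_per_hour: float      # processing, per hour actually used


def monthly_cost(storage_gb: float, compute_hours: float, rates: Rates) -> float:
    """Pay-per-use bill: stored volume plus compute time actually consumed."""
    return (storage_gb * rates.storage_per_gb_month
            + compute_hours * rates.compute_per_hour)


# Placeholder numbers, assumed only to keep the arithmetic easy to follow.
rates = Rates(storage_per_gb_month=0.02, compute_per_hour=0.50)

# A small MVP workload and the same workload after tenfold growth.
mvp = monthly_cost(storage_gb=500, compute_hours=40, rates=rates)
grown = monthly_cost(storage_gb=5000, compute_hours=400, rates=rates)

print(f"MVP:   {mvp:.2f} per month")
print(f"Grown: {grown:.2f} per month")
```

Under these assumed rates the bill grows in step with usage, so the MVP phase needs no up-front capacity purchase. An on-prem estimate, by contrast, would have to budget for the grown workload from day one.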

Data Catalog and service integration

Reliability, maintenance, and security

  • free disk space,
  • processor and memory allocation,
  • sharing of hardware resources with other applications (shared & noisy hardware) and users,
  • failure of one node in the cluster and redistribution of the topics to another, or
  • in a slightly more extreme case, the termination of the master node.

Cost optimization

AWS, GCP or Azure? Similar concepts, different skin

Data lake and lake-house

Data lake or data mesh? Technological or organizational dilemma?

Data lake layers

Solution as a service — Databricks

Conclusion