Build your bigdata platform with Galeo Data platform solution

Galeo, driven by the growing demand for cloud data platforms, has designed a solution with Azure services, easy to manage, scalable and open source software for those services not widespread among cloud providers. The architecture is designed to support any future challenges or use cases.

In this post we will detail the different components of the architecture, their functionalities and the technologies used.

Build your bigdata platform with Galeo Data platform solution,

1. EXTRACTION from different sources

Within a data platform, the sources can be very diverse depending on each use case.

We can find events, which may be streaming from physical devices or from a proprietary database through a CDC. An event-driven architecture is based on the publish-subscribe principle. Messages are sent once to a topic and we can have N subscribers consuming that same message. Producers and consumers are totally decoupled.

We can have data sources that would be more in the batch or micro-batch world, such as a massive extraction from a database, a CRM, SAP, etc. or any kind of periodic query to APIs.

2. BATCH INGESTION AND STREAMING

2.1 Streaming Ingestion

Here the casuistry would depend mainly on the data source and the specific use case. For example, we can have streaming data entries from a CDC such as Debezium, which is an open-source project that allows us to detect and transmit in the form of events the changes in a database, as well as any other connector developed by Galeo or any other provider.

Azure Kubernetes Services is chosen for this, as it provides us with:

  • Scalability
  • Application isolation
  • Deployment facilities
  • High availability
  • Ability to test new technologies outside the Azure environment

For events produced by IoT devices, our starting point would be the Azure IoT Hub thanks to its easy integration with other Azure resources such as Azure Event Hub, Azure Event Grid or Azure Logic Apps.

Once we have the initial streaming upload done, our events would be passed to a theme-based messaging manager such as Azure Event Hub:

  • Scalable, depending on performance and usage requirements
  • Secure, as it protects data in real time.
  • With the possibility of partitioning messages by a key into partitions

2.2 Batch Ingestion

For the batch world, Databricks has been chosen for the vast majority of cases motivated by:
The ease with which it can be managed

  • The ease with which it can be managed
  • Acceleration of Spark development thanks to notebooks as a support tool
  • Its SQL engine, as well as the subsequent possibility of working in Delta tables, solves the eternal problem of intra-partition data updates in data lakes.
  • Schedule workloads (Jobs) that run on dedicated clusters for this purpose, which have a much lower cost than interactive clusters for development.
  • Reuse of the resource for different profiles (Data Engineers, SQL Analysts, ML Engineers).

There are in cases, such as native Office 365 extractions or one-off data copy activities, where Azure Data Factory could also be used.

3. LAKE HOUSE

It is the cornerstone of a data platform, this is where all the information converges, our inputs deposit raw data to be read, transformed and processed into a final solution.

The Lake House allows us to store the following information:

  • Structured data (relational, SQL).
  • Semi-structured data (No-SQL).
  • Unstructured or binary data (documents, video, images).

Our choice within the Azure ecosystem has been the Azure Data Lake Storage Gen2 service. This resource provides us with many functionalities:

  • Hierarchical namespace: this difference compared to the classic Blob Storage allows us to significantly optimize the work with big data in tools such as Hive, Spark, etc… In turn, the reduction in latency is equivalent to a reduction in cost.
  • Hadoop-compatible access: this allows us to mount a Hadoop distributed file system (HDFS).
  • POSIX permissions: gives us the possibility to define a security model compatible with ACL and POSIX permissions at directory or file level.
  • Volume: it is capable of storing and processing large amounts of data, with low latency.
Build your bigdata platform with Galeo Data platform solution,

Azure Data Lake Storage Gen2 together with Databricks, gives us the possibility to have our Lake House with tables in Delta format. What advantages does this give us?

  • ACID (Atomic, Consistent, Isolated, Durable) transactions: this guarantees consistency with parties reading or writing data at the same time.
  • Schema Evolution: supports schema evolution and compliance, supporting DW schema architectures such as star/snowflake.
  • Time Travel: versioning of tables. As a code repository, users can have versions of their data whenever the data set changes.
  • Support for a variety of data types, from unstructured to structured data.
  • Support for BI tools: allows us to use tools directly on the source data.

Since we do not need a shared metastore, we will use Databricks’ own Hive Metastore to persist the metadata from the Delta tables.

We will divide our data into three layers:

  • Bronze area: this is where the raw data from both the streaming and batch parts converge.
  • Silver zone: the tables are cleaned and processed to make them searchable (normalization process).
  • Golden zone: here we deposit the aggregated tables containing the calculation of the KPIs for each use case.

In addition to this, we will use the DBT tool, deployed on our AKS. DBT is a tool to organize and document the transformations performed on our Lake House. The way it operates is as follows:

  • Each consultation is a model. For example “SELECT A, B FROM LAKE_DATA”.
  • This model can be materialized in a table or a view.
  • To this table or view, there is a documentation block to associate the description of the query, the description of each field and tests on the data.

In DBT, with Spark, we can have incremental models on the tables, either by inserting new records (append) or overwriting them(insert_overwrite), either on a partition or on the entire table.

Using Delta tables, we have one more type of model, which can update old records and insert new ones at the same time(merge), based on a unique key(unique_key).

4. DATA WAREHOUSE

A Datawarehouse is a unified repository for data collected from a company’s various systems.

In our architecture we use Snowflake as an auxiliary tool to the Lake House to load the data to be read from Power BI.

Databricks has a native connector for Power BI with Delta tables, but it is not sufficiently operational if you do not have the cluster up constantly, since every time you need to update the data, you will have to wait for the cluster up time and set a low auto-off time so as not to be generating overhead for each update you make.

There is also the possibility of setting up a SQL Endpoint, with the same problems as mentioned above. We would still need a serverless data warehouse or have an always-on clustered endpoint, unless a serverless version of this feature is developed in Azure.

Snowflake, on the other hand, has the advantage of being serverless and data from our Delta tables can be easily transformed and loaded from Databricks.

5. DATA STORAGE

5.1 Low Latency & operational

For operations that require very low latency, with a large number of queries of specific data in real time, we use No-SQL databases.
Depending on the use case, we can use Redis Cache or CosmosDB.

What is Azure Cosmos DB?

It is a fully managed NoSQL database service built for fast and predictable performance, high availability, elastic scalability, global distribution and ease of development.

What is Redis?

It is an in-memory database that persists on disk. It is an advanced open source key-value store. It is often referred to as a data structure server, since keys can contain strings, hashes, lists, etc.

As mentioned, it would depend on the specific use case. In terms of cost, Redis is usually cheaper than CosmosDB, especially when there are many transactions, since Redis bills by cache size and number of nodes, while CosmosDB bills by throughput (Request Unit per second).

6. Visualization (Web App)

Depending on our use case, we can have a web service and a reporting tool such as Power BI. Although these two resources can coexist perfectly, having a web service that is only accessible by certain users, where they can perform, for example, activities on other Azure resources through API calls and also have a section for viewing reports that come out of Power BI.

Our solution would choose to deploy the web part over our AKS, taking advantage of:

  • Hyper Scalability.
  • Full control over virtual machines.
  • Use of tools such as Apache Kudu, which is an open source tool that provides low latency reads along with efficient analytical access patterns on structured data. In addition, it can be used in conjunction with Apache Spark to access data.
  • Low cost compared to Azure App Service.

To store the metadata and tokens of any of our applications, either from the web or any resource that needs it, we will use a small PostgreSQL.

7. CATALOG and data quality

It is becoming increasingly important to have a data catalog within a large platform. Data offers increasing opportunities for business strategies. Often, however, poor knowledge of the data and its lack of availability prevent users from taking full advantage of its value. A data catalog is intended to fill this gap.
We have chosen DataHub for this because of:

  • End-to-end search: ability to integrate with databases, datalakes, BI platforms, ETLs, etc.
  • Easy understanding of the data path from one end to the other thanks to its dashboards.
  • It provides dataset profiles to understand how that dataset has evolved over time.
  • Data governance and access controls.
  • Platform usage analysis.

In addition to this, we will use Great Expectations to validate, document and profile our data. Great Expectations gives us the possibility to automate tests to detect data problems quickly, basically unit tests. In addition, we can also create documentation and quality reports based on these expectations. It is quite useful for monitoring ETLs that acquire data in a Datalake or Datawarehouse.

These tools would be deployed in our AKS along with the input connectors, DBT, the PostgreSQL for metadata, the web and all other open source tools needed to be deployed in the future.