Big data architectures - Azure Architecture Center (2024)

A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional database systems. The threshold at which organizations enter into the big data realm differs, depending on the capabilities of the users and their tools. For some, it can mean hundreds of gigabytes of data, while for others it means hundreds of terabytes. As tools for working with big datasets advance, so does the meaning of big data. More and more, this term relates to the value you can extract from your data sets through advanced analytics, rather than strictly the size of the data, although in these cases they tend to be quite large.

Over the years, the data landscape has changed. What you can do, or are expected to do, with data has changed. The cost of storage has fallen dramatically, while the means by which data is collected keeps growing. Some data arrives at a rapid pace, constantly demanding to be collected and observed. Other data arrives more slowly, but in very large chunks, often in the form of decades of historical data. You might be facing an advanced analytics problem, or one that requires machine learning. These are challenges that big data architectures seek to solve.

Big data solutions typically involve one or more of the following types of workload:

Batch processing of big data sources at rest.
Real-time processing of big data in motion.
Interactive exploration of big data.
Predictive analytics and machine learning.

Consider big data architectures when you need to:

Store and process data in volumes too large for a traditional database.
Transform unstructured data for analysis and reporting.
Capture, process, and analyze unbounded streams of data in real time, or with low latency.

Components of a big data architecture

The following diagram shows the logical components that fit into a big data architecture. Individual solutions may not contain every item in this diagram.

Most big data architectures include some or all of the following components:

Data sources. All big data solutions start with one or more data sources. Examples include:
- Application data stores, such as relational databases.
- Static files produced by applications, such as web server log files.
- Real-time data sources, such as IoT devices.
Data storage. Data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats. This kind of store is often called a data lake. Options for implementing this storage include Azure Data Lake Store or blob containers in Azure Storage.
Batch processing. Because the data sets are so large, often a big data solution must process data files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these jobs involve reading source files, processing them, and writing the output to new files. Options include running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.
Real-time message ingestion. If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing. This might be a simple data store, where incoming messages are dropped into a folder for processing. However, many solutions need a message ingestion store to act as a buffer for messages, and to support scale-out processing, reliable delivery, and other message queuing semantics. This portion of a streaming architecture is often referred to as stream buffering. Options include Azure Event Hubs, Azure IoT Hub, and Kafka.
Stream processing. After capturing real-time messages, the solution must process them by filtering, aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an output sink. Azure Stream Analytics provides a managed stream processing service based on perpetually running SQL queries that operate on unbounded streams. You can also use open source Apache streaming technologies like Spark Streaming in an HDInsight cluster.
See Also
When Should I Use Heat or Ice for Pain?The Differences Between Cold, Warm, and Hot Storage How Do Clouds Affect Earth’s Climate?What is Cold Data? Cold Data Definition – Komprise
Machine learning. Reading the prepared data for analysis (from batch or stream processing), machine learning algorithms can be used to build models that can predict outcomes or classify data. These models can be trained on large datasets, and the resulting models can be used to analyze new data and make predictions. This can be done using Azure Machine Learning
Analytical data store. Many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools. The analytical data store used to serve these queries can be a Kimball-style relational data warehouse, as seen in most traditional business intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency NoSQL technology such as HBase, or an interactive Hive database that provides a metadata abstraction over data files in the distributed data store. Azure Synapse Analytics provides a managed service for large-scale, cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which can also be used to serve data for analysis.
Analysis and reporting. The goal of most big data solutions is to provide insights into the data through analysis and reporting. To empower users to analyze the data, the architecture may include a data modeling layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services. It might also support self-service BI, using the modeling and visualization technologies in Microsoft Power BI or Microsoft Excel. Analysis and reporting can also take the form of interactive data exploration by data scientists or data analysts. For these scenarios, many Azure services support analytical notebooks, such as Jupyter, enabling these users to use their existing skills with Python or Microsoft R. For large-scale data exploration, you can use Microsoft R Server, either standalone or with Spark.
Orchestration. Most big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report or dashboard. To automate these workflows, you can use an orchestration technology such Azure Data Factory or Apache Oozie and Sqoop.

Lambda architecture

When working with very large data sets, it can take a long time to run the sort of queries that clients need. These queries can't be performed in real time, and often require algorithms such as MapReduce that operate in parallel across the entire data set. The results are then stored separately from the raw data and used for querying.

One drawback to this approach is that it introduces latency — if processing takes a few hours, a query may return results that are several hours old. Ideally, you would like to get some results in real time (perhaps with some loss of accuracy), and combine these results with the results from the batch analytics.

The lambda architecture, first proposed by Nathan Marz, addresses this problem by creating two paths for data flow. All data coming into the system goes through these two paths:

A batch layer (cold path) stores all of the incoming data in its raw form and performs batch processing on the data. The result of this processing is stored as a batch view.
A speed layer (hot path) analyzes data in real time. This layer is designed for low latency, at the expense of accuracy.

The batch layer feeds into a serving layer that indexes the batch view for efficient querying. The speed layer updates the serving layer with incremental updates based on the most recent data.

Data that flows into the hot path is constrained by latency requirements imposed by the speed layer, so that it can be processed as quickly as possible. Often, this requires a tradeoff of some level of accuracy in favor of data that is ready as quickly as possible. For example, consider an IoT scenario where a large number of temperature sensors are sending telemetry data. The speed layer may be used to process a sliding time window of the incoming data.

Kappa architecture

A drawback to the lambda architecture is its complexity. Processing logic appears in two different places — the cold and hot paths — using different frameworks. This leads to duplicate computation logic and the complexity of managing the architecture for both paths.

The kappa architecture was proposed by Jay Kreps as an alternative to the lambda architecture. It has the same basic goals as the lambda architecture, but with an important distinction: All data flows through a single path, using a stream processing system.

There are some similarities to the lambda architecture's batch layer, in that the event data is immutable and all of it is collected, instead of a subset. The data is ingested as a stream of events into a distributed and fault tolerant unified log. These events are ordered, and the current state of an event is changed only by a new event being appended. Similar to a lambda architecture's speed layer, all event processing is performed on the input stream and persisted as a real-time view.

If you need to recompute the entire data set (equivalent to what the batch layer does in lambda), you simply replay the stream, typically using parallelism to complete the computation in a timely fashion.

Internet of Things (IoT)

From a practical viewpoint, Internet of Things (IoT) represents any device that is connected to the Internet. This includes your PC, mobile phone, smart watch, smart thermostat, smart refrigerator, connected automobile, heart monitoring implants, and anything else that connects to the Internet and sends or receives data. The number of connected devices grows every day, as does the amount of data collected from them. Often this data is being collected in highly constrained, sometimes high-latency environments. In other cases, data is sent from low-latency environments by thousands or millions of devices, requiring the ability to rapidly ingest the data and process accordingly. Therefore, proper planning is required to handle these constraints and unique requirements.

Event-driven architectures are central to IoT solutions. The following diagram shows a possible logical architecture for IoT. The diagram emphasizes the event-streaming components of the architecture.

The cloud gateway ingests device events at the cloud boundary, using a reliable, low latency messaging system.

Devices might send events directly to the cloud gateway, or through a field gateway. A field gateway is a specialized device or software, usually collocated with the devices, that receives events and forwards them to the cloud gateway. The field gateway might also preprocess the raw device events, performing functions such as filtering, aggregation, or protocol transformation.

After ingestion, events go through one or more stream processors that can route the data (for example, to storage) or perform analytics and other processing.

The following are some common types of processing. (This list is certainly not exhaustive.)

Writing event data to cold storage, for archiving or batch analytics.
Hot path analytics, analyzing the event stream in (near) real time, to detect anomalies, recognize patterns over rolling time windows, or trigger alerts when a specific condition occurs in the stream.
Handling special types of nontelemetry messages from devices, such as notifications and alarms.
Machine learning.

The boxes that are shaded gray show components of an IoT system that are not directly related to event streaming, but are included here for completeness.

The device registry is a database of the provisioned devices, including the device IDs and usually device metadata, such as location.
The provisioning API is a common external interface for provisioning and registering new devices.
Some IoT solutions allow command and control messages to be sent to devices.

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.

Principal author:

Zoiner Tejada | CEO and Architect

Next steps

See the following relevant Azure services:

Azure IoT Hub
Azure Event Hubs
Azure Stream Analytics
Azure Data Explorer

Azure IoT reference architecture
IoT architecture design
Big data architecture style

Big data architectures - Azure Architecture Center (2024)

FAQs

What are the 5 pillars of architecture Azure? ›

The five pillars of the Azure Well-Architected Framework are reliability, cost optimization, operational excellence, performance efficiency, and security. While each pillar is important, the pillars can be prioritized based on your specific workload.

Read On ›

What is 3 tier architecture in Azure? ›

Three-tier architecture is a well-established software application architecture that organizes applications into three logical and physical computing tiers: the presentation tier, or user interface; the application tier, where data is processed; and the data tier, where the data associated with the application is ...

Discover More Details ›

What is Azure datacenter architecture? ›

An Azure data center is a unique physical building that contains thousands of physical servers with it's own power, cooling and networking infrastructure. These data ceneters are located all over the globe. As of this course recording, there are over 160+ Azure datacenters.

What architecture is used in Azure data warehouse? ›

Azure SQL Data Warehouse Architecture

The Massively Parallel Processing (MPP) engine of Azure SQL Data Warehouse is at the core of these operations. It divides complex queries into coordinates and parallel processes that support processing over several compute nodes.

See Details ›

What are the 6 principles of Azure AI? ›

Microsoft developed a Responsible AI Standard. It's a framework for building AI systems according to six principles: fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability.

Find Out More ›

What are the 4 primary elements of architecture? ›

Point, line, plane and volume are the basic geometric elements that define architectural form. A point marks a position in space and has no dimensions.

Tell Me More ›

Is MVC same as 3-tier architecture? ›

Three-tier architecture vs MVC pattern

The MVC pattern is only concerned with organizing the logic in the user interface (presentation layer). Three-tier architecture has a broader concern. It's about organizing the code in the whole application.

Show Me More ›

Is 3 schema and 3-tier architecture same? ›

The three schema architecture is also called ANSI/SPARC architecture or three-level architecture. This framework is used to describe the structure of a specific database system. The three schema architecture is also used to separate the user applications and physical database.

Explore More ›

What is the difference between N tier and 3-tier architecture? ›

N tier architecture also referred to as distributor architecture, where n stands for the number of tiers. The difference between the 3 tier and n tier is that there is more than one application server intermediating between the user interface layer and the database layer.

What is Azure data architecture? ›

Azure Data Architecture is a set of services and tools that allow us to gather and process the data with the use of cloud computing. It consists of components for data storing (data bases, data warehouses) and components that enable data processing (e.g. Virtual Machines).

Show Me More ›

How do you explain Azure architecture? ›

Microsoft Azure architecture runs on a massive collection of servers and networking hardware, which, in turn, hosts a complex collection of applications that control the operation and configuration of the software and virtualized hardware on these servers. This complex orchestration is what makes Azure so powerful.

Read The Full Story ›

How many data centers are there in Azure? ›

How Many Azure Data Centers Are There? As of March 2023, there are 160 active Azure data centers in 60 regions around the world. Azure regions are grouped into Azure geographies, which are defined by data residency and compliance requirements.

See Details ›

What are the 3 data warehouse architectures? ›

There are three main data warehouse architecture types: single-tier, two-tier and three-tier data warehouses. Every data warehouse has the same vital components within its architecture, namely: ETL tools, databases, metadata, bus & data marts and access tools.

Get More Info Here ›

What is data lake architecture in Azure? ›

A data lake is a storage repository that holds a large amount of data in its native, raw format. Data lake stores are optimized for scaling their size to terabytes and petabytes of data. The data typically comes from multiple diverse sources and can include structured, semi-structured, or unstructured data.

Is Azure a data warehouse or data lake? ›

That's because organizations rely on comprehensive data lakes platforms, such as Azure Data Lake, to keep raw data consolidated, integrated, secure, and accessible. Scalable storage tools like Azure Data Lake Storage can hold and protect data in one central place, eliminating silos at an optimal cost.

What are the 5 pillars of cloud architecture? ›

The Five Pillars of Cloud

Operational Excellence.
Security.
Reliability.
Performance Efficiency.
Cost Optimization.

View Details ›

What are the 5 elements of architecture? ›

From ancient Greek statues to modern graphic design, every art form can be broken down into basic design fundamentals—line, shape, form, color, value, shape, and texture. Architecture is no exception and can also be dissected using these basic design elements.

What are the 5 pillars of solution architect? ›

This will allow you to focus on the other aspects of design, such as functional requirements.

Operational Excellence.
Security.
Reliability.
Performance Efficiency.
Cost Optimization.

Learn More ›

What are the 5 pillars of modern architecture? ›

In the course of his work as an architect, Le Corbusier developed a series of architectural principles, which he used as the basis of his designs. The design principles include the following five points by Le Corbusier: Pilotis (pillars), roof garden, open floor plan, long windows and open facades.

Discover More Details ›