Real-time data streaming is a technology for continuously transferring data from a source to a recipient with minimal delay. It is used across many fields to enable instant analysis and response. In financial markets, for example, it helps monitor and analyze trading activity as it happens; in social media, it processes and analyzes posts and user interactions on the fly. Systems for real-time data stream management (RTDSM) are built on this technology. They improve the efficiency and responsiveness of business processes, enhancing overall productivity and decision-making. One of the most popular tools in the field of real-time data streaming is the Apache Kafka platform. In this article, we will tell you what it is, how it works, and where it is used.
What is Apache Kafka
Apache Kafka is a distributed open-source platform for collecting, storing, and processing streaming data. Written in Java and Scala, it was originally developed at LinkedIn and open-sourced in 2011, after which it became a top-level project of the Apache Software Foundation.
Kafka receives information as real-time event streams from thousands of sources: databases, cloud services, applications, mobile devices, sensors, and more. It processes these streams sequentially and writes them to storage, where they can be retrieved, manipulated, and responded to. Streaming means that data about current or past events is received and interpreted continuously.
The service has three main functions:
- Storing streams of records in chronological order.
- Processing streams in real time.
- Publishing and subscribing to streams of records.
Among the main advantages of the platform are:
- Scalability. Apache Kafka's partitioned log model distributes data across multiple servers, so capacity and throughput can scale far beyond what a single server could provide. The platform supports production clusters scaling to hundreds of thousands of partitions, petabytes of data, and trillions of messages per day.
- Reliability. Partitions are distributed and replicated across multiple servers, and all data is written to physical disk. This protects against individual server failures and keeps the system running stably over long periods.
- Speed. Kafka partitions data and spreads ingestion and processing across a cluster of machines, keeping latencies as low as 2 ms and significantly speeding up every workload built on it.
- Permanent storage. Data streams can be stored safely on distributed, durable, fault-tolerant servers.
- Availability. Clusters can be stretched across availability zones or placed in specific geographic regions, keeping data close to where it is needed.
Key Components of Apache Kafka
Apache Kafka architecture is characterized by high scalability, fault tolerance, and the ability to process large volumes of data in real time. It consists of the following components:
- Producers. These are applications or services that send data (messages) to Kafka, publishing them to one or more topics (a minimal producer sketch follows this list).
- Consumers. These are applications or services that read data from Kafka. They can read messages from one or more topics.
- Brokers. These are Kafka servers that receive messages from producers, store them, and serve requests from consumers. Each broker manages one or more topic partitions.
- Topics. These are categories or channels for messages. Each message in Kafka belongs to a specific topic. Topics can be divided into multiple partitions for parallel processing and increased scalability.
- Partitions. Each topic in Kafka is divided into one or more partitions. Partitions distribute data across multiple brokers and enable parallel processing. Each partition is an ordered log of messages.
- Replication. To ensure reliability and fault tolerance, the data in each partition can be replicated to multiple brokers. This means that if one of the brokers fails, the data will be available on the others.
- Log. Within each partition, data is stored as an ordered, append-only log. Each message has an offset: a unique identifier of its position within the partition.
- ZooKeeper. Kafka has traditionally used Apache ZooKeeper to coordinate distributed brokers and manage metadata, tracking the status of brokers, topics, and partitions. (Newer Kafka releases can instead run in KRaft mode, which replaces ZooKeeper with a built-in consensus mechanism.)
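To make the producer and topic concepts concrete, here is a minimal sketch in Java using the official kafka-clients library. The broker address `localhost:9092` and the topic name `events` are assumptions for illustration, not part of the original article:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class MinimalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster's bootstrap servers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // try-with-resources flushes and closes the producer on exit.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key determines the partition: records with the same key
            // always land in the same partition, preserving their order.
            producer.send(new ProducerRecord<>("events", "user-42", "page_view"));
        }
    }
}
```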
The system supports two types of topics: regular and compacted. Regular topics are configured with size or time retention limits; when those limits are exceeded, old data is automatically deleted, freeing up space for new records. The default retention period for messages in regular topics is 7 days but can be extended indefinitely if necessary. Compacted topics impose no size or time limits: Kafka retains the latest record for each key, and users can permanently delete all records for a specific key by writing a special tombstone message with a null value for that key.
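As a sketch of how both topic types can be created programmatically with the Admin client, assuming the same local broker and illustrative topic names and settings:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            // Regular topic: keep records for 7 days (Kafka's default),
            // expressed here explicitly in milliseconds.
            NewTopic regular = new NewTopic("clicks", 3, (short) 1)
                    .configs(Map.of("cleanup.policy", "delete",
                                    "retention.ms", "604800000"));

            // Compacted topic: retain only the latest record per key;
            // a record with a null value (a tombstone) deletes that key.
            NewTopic compacted = new NewTopic("user-profiles", 3, (short) 1)
                    .configs(Map.of("cleanup.policy", "compact"));

            admin.createTopics(List.of(regular, compacted)).all().get();
        }
    }
}
```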
In addition, the Apache Kafka platform provides:
- Real-time data streaming. The system provides a set of tools for streaming event processing: joins, aggregations, filters, transformations, and more, with support for both event-time and exactly-once processing (see the Kafka Streams sketch after this list).
- Integrations. The Kafka Connect interface integrates Kafka with hundreds of ready-made sources and sinks, including PostgreSQL, JMS, Elasticsearch, AWS S3, and more.
- Client libraries. These support reading, writing, and processing event streams in a variety of programming languages.
- An ecosystem of open-source tools created by community members.
- Variety of resources. The platform gives access to training materials, technical documentation, videos, example projects, online training, Stack Overflow, and so on.
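As an illustration of the stream-processing tools mentioned above, here is a minimal Kafka Streams sketch that filters a stream and counts events per key. The topic names and the application id are assumptions for illustration:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class ClickCounter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-counter"); // assumed id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Filter, then aggregate: count "click" events per user key.
        KStream<String, String> events = builder.stream("clicks");
        KTable<String, Long> counts = events
                .filter((user, event) -> event.equals("click"))
                .groupByKey()
                .count();
        // Write the running counts back out to another topic.
        counts.toStream().to("click-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```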
In addition to all of the above, Kafka has five APIs:
- Admin API – for checking and managing brokers, topics, and other objects.
- Producer API – for publishing (recording) a stream of events to one or more topics.
- Consumer API – for subscribing to (reading from) topics and processing their event streams (a minimal consumer sketch follows this list).
- Streams API – for implementing stream processing applications and microservices.
- Connect API – for building and running reusable data import and export connectors, which read and write event streams to and from external systems so they can be integrated with Kafka.
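And a minimal Consumer API sketch to pair with the producer shown earlier, again assuming a local broker and the illustrative `events` topic. Consumers that share a `group.id` split the topic's partitions among themselves:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MinimalConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address
        props.put("group.id", "analytics"); // consumers in one group share the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                // poll() fetches the next batch; offsets track our position in each partition.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```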
How Does Apache Kafka Work
Kafka is a distributed system of servers and clients that communicate over TCP. It can be deployed on virtual machines, physical hardware, and containers in on-premises and cloud environments. The platform operates as a cluster of one or more servers located in data centers or cloud regions.
Some servers form the storage layer; these are called brokers. Other servers run Kafka Connect to continuously import and export data as event streams, integrating Kafka with other clusters or third-party systems (for example, relational databases). Each cluster is highly fault-tolerant and scalable: if any server fails, the others take over its work, ensuring stable operation without data loss.
Kafka clients let you build distributed applications and microservices that read, write, and process event streams in parallel. Some clients ship with Kafka itself; dozens more are maintained by the community, including clients for Java and Scala (among them the high-level Kafka Streams library), Go, Python, C/C++, and other languages.
The platform combines two messaging models: queuing and publish-subscribe. Queuing distributes data processing across multiple consumer instances, providing high scalability, but a classic queue is not multi-subscriber: once a message is read, it is gone. Publish-subscribe delivers every message to every subscriber, but cannot distribute work among multiple worker processes. Apache Kafka combines the two with its partitioned log model: a log is an ordered sequence of records divided into partitions, each assigned to exactly one consumer within a subscribing group. A topic can have many consumer groups, and each group splits the partitions among its members, which preserves the queue model's scalability while remaining multi-subscriber. The replay feature lets different applications read the same data stream independently, each at its own pace.
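To illustrate replay, here is a hedged sketch of a consumer that rewinds to the beginning of its assigned partitions and re-reads the whole log. The topic name `events` and the group id `replay-analytics` are assumptions; a distinct group id keeps this application's offsets separate from other consumers:

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address
        props.put("group.id", "replay-analytics");        // assumed, separate group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("events"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Rewind to the oldest retained record in every assigned partition.
                consumer.seekToBeginning(partitions);
            }
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
        });
        while (true) {
            consumer.poll(Duration.ofMillis(500))
                    .forEach(r -> System.out.println(r.offset() + ": " + r.value()));
        }
    }
}
```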
End-to-end monitoring is necessary to measure the performance of producers, consumers, and brokers, and to track ZooKeeper, which coordinates the cluster. Several dedicated platforms exist for monitoring Kafka, and because Kafka publishes its metrics via JMX, they can also be collected with standard Java tools such as JConsole.
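Client metrics can also be read programmatically. Below is a minimal sketch, assuming a broker at `localhost:9092`: it opens a producer and dumps its built-in metrics, the same values JConsole would show over JMX:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.serialization.StringSerializer;

public class MetricsDump {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Print every producer metric, e.g. record-send-rate, request-latency-avg.
            for (Map.Entry<MetricName, ? extends Metric> e : producer.metrics().entrySet()) {
                System.out.println(e.getKey().name() + " = " + e.getValue().metricValue());
            }
        }
    }
}
```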
Use Cases of Apache Kafka
To conclude our review, we will tell you what Apache Kafka is used for. According to the platform's developers, it has more than a thousand use cases. Among the most common:
- Processing payments, transfers, and other financial transactions in real time for banks, stock exchanges, and insurance companies.
- Tracking cars, trucks, and other vehicles in real time for carriers and logistics companies.
- Continuous collection and analysis of data from Internet of Things devices and other equipment used in industry and other areas.
- Monitoring patients in medical institutions and predicting the dynamics of their condition.
- Collecting and storing data and making it available to various departments within a company.
- Development of data platforms, event-driven architectures, and microservices.
Today, Kafka is used by many well-known companies, including Box, Goldman Sachs, Target, Cisco, Intuit, and others. It gives them powerful tools for building data products and innovating, and enterprises use its event streaming architecture to optimize their data strategies.
The most famous Apache Kafka use cases:
- Adidas. The sports brand uses this system as the core of its fast streaming platform. Integration of source systems enables real-time event processing for monitoring, analytics, and reporting solutions.
- Agoda. The hotel booking service uses Kafka for its data pipeline. This enables trillions of events to be streamed daily across multiple data centers.
- Amadeus. The travel aggregator leverages the platform for real-time data processing, batch processing, operational tasks, and event logging for streaming applications.
- Box. The cloud storage service leverages Kafka for its production analytics pipeline and real-time monitoring infrastructure.
- Cisco. The network equipment vendor maintains its OpenSOC security operations center infrastructure based on the platform.
- Cloudflare. The network provider leverages Kafka for its log processing and analytics pipeline, collecting hundreds of billions of events per day and data from thousands of servers.
- Coursera. The online education platform scales its processes using Kafka, which acts as a data pipeline for analytics and real-time learning dashboards.
- Foursquare. The social platform uses Kafka for online messaging, integrating it with monitoring and production systems and with its stand-alone Hadoop infrastructure.
- LinkedIn. The professional network uses the system to process activity stream data and operational metrics, powering products such as LinkedIn Newsfeed and LinkedIn Today as well as offline analytics.
- Oracle. One of the top IT corporations provides native connectivity to Kafka from its Enterprise Service Bus product called OSB (Oracle Service Bus). This allows developers to use the built-in capabilities of OSB to implement staged data pipelines.
- PayPal. The payment system follows Kafka best practices to track, stream, and aggregate application health metrics. Kafka is also used there for database synchronization, application log aggregation, batch processing, and other tasks.
Conclusion
Apache Kafka is in high demand as a platform for collecting, storing, and processing streaming data. It provides an impressive set of tools for managing these processes, a solid set of client libraries, a large selection of community resources, and five APIs for integration with external software. Kafka is characterized by high scalability, reliability, speed of data ingestion and processing, availability, and a number of other advantages. The platform is popular with many companies that use it to streamline data-related tasks and processes.