Optimizing Word Frequency Statistics with Kafka and Spark Streaming

Testing Tech Hub 2023-10-26 12:15 390

In this article, we present an optimized solution using Kafka and Spark Streaming, which significantly reduces the cost of querying large numbers of indexes and improves the timeliness of word frequency updates.

Background of the problem

The problem of word frequency statistics is to calculate the frequency of each keyword in a customized vocabulary, in each channel, and in different time periods

Original offline solution

Periodically pull the index data of each word in each channel during the incremental time period, and then perform time-sharing frequency statistics. Complexity: If the vocabulary size increases to 10w and the number of channels reaches 5000, 500 million index queries will be required per round. Under this scheme, the data update timeliness of the word frequency statistics-related module is very low, usually at the day level.

Optimization Goals:

1. Reduce the cost of constantly querying a large number of indexes (affecting online services)

2. Improve the timeliness of word frequency updates and align it with the timeliness of data crawling (minute-level)

3. Even out the pressure of intensive writing to the database during peak periods

Solutions:

1. Query index -> Raw data

For computational demands like word frequency statistics, there is actually no need to use index resources for individual document positioning. Instead, raw text can be used directly for statistics.

2. Offline computation -> Real-time computation

When comparing offline (map-reduce) and real-time (spark, storm) solutions, the main consideration is that word frequency calculation has independence and does not require operations like join or global calculations. Map-reduce requires loading data from various nodes, resulting in high IO and network overhead. After the raw data is collected, it needs to be written into HBase. It is entirely possible to utilize its cache to directly perform various calculations.

When adopting real-time computation, it should also ensure that: 1. The collection module does not need to block and wait for the calculation to complete, 2. New data must be guaranteed to complete the calculation, and only be calculated once, 3. The peak arrival times of data can be staggered, allowing calculations to be performed at an even pace, and results to be stored in the database.

Problem Expansion:

Starting from the word frequency statistics problem, after some thought, it is realized that what is needed is a highly available and efficient streaming computation engine. This engine can also perform other non-blocking real-time computation tasks, including data statistical analysis, business log statistics, and real-time monitoring of background logs.

Technology Selection:

When researching technology options, the following aspects are given priority:

1. High write throughput: As our web crawling data collection capabilities have significantly improved, the amount of data written during peak periods has also greatly increased. It is necessary to ensure O(1) write latency and high concurrency capabilities.

2. Scalability: As data channels continue to increase, the demand for handling capacity will also increase. Sufficient scalability is required in terms of access capabilities, computational capabilities, and storage capabilities.

3. Availability: Ideally, the technology should have a mature application background.

4. Simplicity: Comprehensive development documentation is needed, with low learning costs, deployment costs, and development costs.

Based on these principles, data collection tools such as Scribe, Flume, Chukwa, Kafka, and other MQ technologies were compared. For data computation, Spark and Storm technologies were mainly compared. In the end, the Kafka + Spark Streaming streaming computation architecture, which has mature applications at LinkedIn, was chosen. The C++ librdkafka interface was used on the producer side, while Python was used for development on the consumer side.

Actual Solution Description:

As shown in the diagram, producer data from various Spiders, business logs, and background logs are directly pushed to Kafka for message persistence with O(1) time. Spark Streaming is responsible for subscribing to messages in Kafka and subsequently removing messages in batches to execute consumer tasks. The calculation results of each batch are directly written into the database or file system.

Kafka is responsible for reliable fault-tolerant copying of messages and maintains an at least-processed-once primitive (i.e., each piece of data is guaranteed to be processed at least once) between itself and Spark Streaming. We can implement an exactly-processed-once primitive in our business logic to ensure that data is not duplicated in calculations.

By utilizing the rich map-reduce primitives in Spark Streaming, we can efficiently perform multi-dimensional groupings on data and improve computation throughput through parallelization. For example, in scenarios where we have many statistics by channel and by word, we can group channels and keywords for parallel computation.

The logic schematic of word frequency statistics is as follows:

1. The raw data crawled by the spider, including channel, content, and time information, is pushed to Kafka in real-time.

2. Spark Streaming subscribes to the data with a 5-minute cycle (one batch) (time granularity can be configured) and aggregates the data of each batch by channel: <channel1, [content1, content 2, ...]>, <channel2, [content 1, content 2, ...]>

3. The data of each channel is further aggregated by each keyword and counted: <word1, count1>, <word2, count2>

4. The updated word frequency of each channel is stored for query.

With this solution, the timeliness of word frequency can reach N+TC(s), where N is the number of batches, and TC is the statistical overhead for each time. If N is set to 5s, the timeliness of the statistical results can reach the data collection timeliness of 5 minutes.

Performance:

On a 24-core Intel(R) Xeon(R) CPU E5-26400@2.5GHz, 64GB system, using the C++ librdkafka producer to write messages serially, the performance is 100,000 times/s.

A single machine can easily support 8,000 TPS for statistical business.

Conclusion:

In fact, it only took two weeks from research to development for this streaming computation solution, but it brought significant business improvement with good cost-performance. The key to choosing an open-source technology framework is to conduct sufficient research and comparison of various mature solutions in the early stages, and to consider the existing business language and framework to choose the best practice.

The biggest challenge in using Spark Streaming is understanding the programming paradigm of Spark's distributed computation. It is necessary to be clear about the computation context of each transfer or action, reasonably utilize data parallelization and persistence capabilities to improve efficiency, and fully adopt resource pool technology to reduce overhead.

Currently, the understanding of Spark and Streaming is still at a preliminary stage, and further learning and in-depth understanding are needed to make it more valuable in various business scenarios.

Read Previous Post >>

Explore the Role of Data Access in the Lifecycle of Game Data Services