
REAL TIME DATA STREAM PROCESSING ENGINE

Ashish Kumar Gupta1 and Abhinav Rastogi2


1 Nibble Computer Society, JSS Academy of Technical Education, Noida 201301, India, ashish.theone@gmail.com
2 Quanta Electronics Society, JSS Academy of Technical Education, Noida 201301, India, abhihackz@gmail.com

Abstract: The rapid growth of information science and technology in general, and of the complexity and volume of data in particular, has introduced new challenges for the research community. Databases are growing incessantly and many sources produce data continuously. In many cases, we need to extract some sort of knowledge from this continuous stream of data. Examples include customer click streams, telephone records, large sets of web pages, multimedia data, and sets of retail chain transactions. These sources are called data streams. If the underlying process is not strictly stationary (as in most real-world applications), the target concept may change gradually over time.

Keywords: Internet traffic monitoring, on-line stream analysis, sliding windows, frequent item queries

1. Introduction
A Stream Processing Engine is a computing platform for capturing, integrating, understanding, and reacting to business events as they occur. Data streaming systems are increasingly used as infrastructure for critical monitoring applications such as financial alerts and network intrusion detection. These monitoring applications often have many concurrent users asking similar but different queries over a common data stream. For example, a system that monitors stock market trades might have multiple users interested in the total value of trades in a sliding window. While some of these users might care about stocks of a particular sector, or only about high-volume trades, others might compute complex user-defined predicates on fluctuating quantities like stock price. Similarly, the aggregation window that different users are interested in can vary widely. Money managers in financial institutions who run algorithmic trading systems might want aggregates over 5-10 minute windows reported every 60-90 seconds, depending on the specific financial models they use. In contrast, day traders with individual investing strategies might only need these results every 5-10 minutes. Clearly such a system will have to support hundreds of queries. Hence the need for a system that can process all of these queries with minimal delay.

Recent years have witnessed an increasing interest in designing algorithms for querying and analyzing streaming data (i.e., data that is seen only once, in a fixed order) with only limited memory. Providing (perhaps approximate) answers to queries over such continuous data streams is a crucial requirement for many application environments; examples include large telecom and IP network installations, where performance data from different parts of the network needs to be continuously collected and analyzed.

In India, more systems are being automated every day, and traditional database systems are not sufficient for data that arrives on the fly. A stream processing engine can improve performance in this setting. Financial services, telecommunication, the stock market, the medical sector, the military and industrial process control are some of the areas that will benefit from stream processing. These sectors receive large amounts of data in very short periods, so storing the data first and querying it afterwards would take too long and would leave a backlog of data queuing. A stream processing engine instead processes and queries the data as it arrives.

2. Objective
The emerging real-time information environment is being fueled by an unprecedented increase in the amount of live data that needs to be understood and reacted to instantaneously. The traditional store-and-query model cannot address the needs of a world where, in many cases, the value of information may exist for only a moment. A Stream Processing Engine provides infrastructure able to support this growing class of problems.

Description
The major issues are: What will this engine do? Why is it required at the moment? What is the underlying technique? The answers are not hard to find. As already stated, streams of data have nowadays gained relevance over the conventional model, in which data was first stored in databases and then processed. In that model the data grew redundant over time and came to be considered obsolete. This motivated the idea of a mechanism that processes the data in the stream itself and stores only the data that is relevant.

[Figure: stream processing engine architecture. Labels: Real-time feeds, Remote, Alerts, Actions, Embedded local storage, Data store.]

Techniques
The challenges faced by Stream Processing Engines are manifold. They relate to the size of the data set and, in some cases, to the size of the sliding windows and of the intermediate results of queries. Aggregate queries often do not need exact answers. In real life, data streams are not continuous but often arrive in bursts (e.g. network traffic). Processing these bursts without compromising system performance is a key challenge. Implementing join processing over data streams with minimal resources is also a major challenge.

Recent research on these problems has given birth to the following major contemporary techniques:
- The performance of aggregate queries benefits greatly from the computation of "Sketch Summaries" (i.e. summaries which are representative of the overall data) to provide approximate answers. [2]
- Adaptive load-aware scheduling of query operators can be used to minimize resource consumption during peak loads. [3]
- Join processing can be made approximate for sliding windows of data. [4]
- Exploiting similarities between incoming queries can lead to better resource utilization.

3. Algorithms used or proposed


3.1 Chain: Operator Scheduling for Memory Minimization in Data Stream Systems
Theorem: Consider a system with k queries in which every operator has selectivity at most 1. Let C(t) be the number of blocks of memory used by Chain at time t. Then, at every time t, any algorithm must use at least C(t) - k blocks of memory.

[Figure: progress chart of block size against time for operators Opt1, Opt2 and Opt3, with the lower envelope marked.]

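Chain scheduling proceeds as follows:
- Calculate the lower envelope of the progress chart.
- Set the priority of each operator to the slope of its lower envelope segment.
- Always schedule the highest-priority available operator.
- Break ties using operator order in the pipeline, favoring later operators.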
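The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published implementation: the function names are hypothetical, each operator is given as an assumed (execution time, selectivity) pair with a positive time, and the priority is taken as the steepness (absolute slope) of the lower envelope segment covering the operator.

```python
def progress_chart(operators):
    """Cumulative (time, size) points for a pipeline of (time, selectivity) operators."""
    points = [(0.0, 1.0)]  # one unit of data has received no processing yet
    for cost, selectivity in operators:
        t, size = points[-1]
        points.append((t + cost, size * selectivity))
    return points


def chain_priorities(operators):
    """Priority of each operator = steepness of its lower envelope segment."""
    points = progress_chart(operators)
    priorities = [0.0] * len(operators)
    start = 0
    while start < len(operators):
        t0, s0 = points[start]
        best, best_slope = start + 1, None
        # Find the later point reached by the steepest descent from `start`;
        # on an exact tie the segment is extended to the later point.
        for j in range(start + 1, len(points)):
            t1, s1 = points[j]
            slope = (s0 - s1) / (t1 - t0)
            if best_slope is None or slope >= best_slope:
                best, best_slope = j, slope
        for i in range(start, best):
            priorities[i] = best_slope
        start = best
    return priorities


def pick_operator(priorities, queued):
    """Index of the highest-priority operator with queued input (ties: later operator)."""
    ready = [i for i, n in enumerate(queued) if n > 0]
    if not ready:
        return None
    return max(ready, key=lambda i: (priorities[i], i))


# Assumed example: three operators whose chart passes through
# (0, 1), (1, 0.5), (4, 0.25) and (6, 0).
example = [(1.0, 0.5), (3.0, 0.5), (2.0, 0.0)]
print(chain_priorities(example))
```

In this assumed example the first operator forms its own envelope segment, while the second and third share one segment and therefore the same priority, so a tie between them is resolved in favor of the later operator.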

In many applications involving continuous data streams, data arrival is bursty and the data rate fluctuates over time. Systems that seek to give rapid or real-time query responses in such an environment must be prepared to deal gracefully with bursts in data arrival without compromising system performance. We discuss one strategy for processing bursty streams: adaptive, load-aware scheduling of query operators to minimize resource consumption during times of peak load. We show that the choice of an operator scheduling strategy can have a significant impact on run-time system memory usage. We then present Chain scheduling, an operator scheduling strategy for data stream systems that is near-optimal in minimizing run-time memory usage for any collection of single-stream queries involving selections, projections and foreign-key joins with stored relations. Chain scheduling also performs well for queries with sliding-window joins over multiple streams, and for multiple queries of the above types.

Run-time resource allocation and optimization are important components of systems research that have received less attention to date. In this paper we focus on one aspect of run-time resource allocation, namely operator scheduling. Because data arrival rates fluctuate as described above, adaptivity is more critical to a data stream system than to a traditional DBMS.

Various approaches to adaptive query processing are possible, given that the data may exhibit different types of variability. For example, a system could modify the structure of query plans, dynamically reallocate memory among query operators in response to changing conditions, or take a holistic approach to adaptivity and do away with fixed query plans altogether, as in the Eddies architecture. While these approaches focus primarily on adapting to changing characteristics of the data itself (e.g., changing selectivities), we focus on adaptivity towards changing arrival characteristics of the data. As mentioned earlier, most data streams exhibit considerable burstiness and arrival-rate variation. It is crucial for any stream system to adapt gracefully to such variations in data arrival, making sure that it does not run out of critical resources such as main memory during the bursts. The focus of this paper is to design techniques for such adaptivity.

Query execution can be captured by a data flow diagram, where every tuple passes through a unique operator path. Thus queries can be represented as rooted trees. Every operator is a filter that operates on a tuple and produces s tuples, where s is the operator selectivity. Obviously, the selectivity assumption does not hold at the granularity of a single tuple; it is merely a convenient abstraction that captures the average behavior of the operator. For example, we assume that a select operator with selectivity 0.5 will select about 500 of every 1000 tuples that it processes. Henceforth, a tuple should not be thought of as an individual tuple, but as a convenient abstraction of a memory unit, such as a page, that contains multiple tuples. Over adequately large memory units, we can assume that if an operator with selectivity s operates on inputs that require one unit of memory, its output will require s units of memory.
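The memory model just described amounts to multiplying the input size by the selectivity of each operator on the path. A minimal sketch follows; the function name `output_units` is hypothetical and the numbers are only the example from the text.

```python
def output_units(input_units, selectivities):
    """Memory needed after passing `input_units` through operators in path order."""
    size = input_units
    for s in selectivities:
        size *= s  # an operator with selectivity s turns one unit into s units
    return size


# A select operator with selectivity 0.5 keeps about 500 of every 1000 tuples.
print(output_units(1000, [0.5]))
```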
How it works
Inputs:
- Data flow path(s) consisting of sequences of operators
- For each operator we know:
  - Execution time (per block)
  - Selectivity

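[Figure: two queries (Query #1 and Query #2) reading from streams, built from four operators with execution times t1 to t4 and selectivities s1 to s4.]

[Figure: progress chart of block size against time, passing through the points (0,1), (1,0.5), (4,0.25) and (6,0), with operators Opt1, Opt2 and Opt3 between consecutive points.]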

Greedy algorithm:
- Operator priority = reduction in data size per unit time, (1 - si)/ti
- Always schedule the highest-priority available operator

[Figure: Memory Usage. Block size (0 to 3) against time (0 to 18) for FIFO and Chain scheduling.]

3.2 Sliding Window Algorithms
Many infinite-stream algorithms do not have obvious counterparts in the sliding window model. For example, one counter suffices to maintain the minimum element in an infinite stream, but keeping track of the minimum element in a sliding window of size N takes Ω(N) space: consider an increasing sequence of values, in which the oldest item in any window is the minimum and must be replaced whenever the window moves forward. The fundamental problem is that as new items arrive, old items must be simultaneously evicted from the window, meaning that we need to store some information about the order of the packets in the window.

Zhu and Shasha introduce Basic Windows to incrementally compute simple windowed aggregates in [5]. The window is divided into equally sized Basic Windows, and only a synopsis and a timestamp are stored for each Basic Window. When the timestamp of the oldest Basic Window expires, that window is dropped and a fresh Basic Window is added. This method does not require the storage of the entire sliding window, but results are refreshed only after the stream fills the current Basic Window. If the available memory is small, then the number of synopses that may be stored is small and hence the refresh interval is large.

Exponential Histograms (EH) were introduced by Datar et al. [6] and recently expanded in [7] to provide approximate answers to simple window aggregates at all times. The idea is to build Basic Windows of various sizes and maintain a bound on the error caused by counting those elements in the oldest Basic Window which may have already expired. The algorithm guarantees an error of at most ε while using O((1/ε) log^2 N) space.
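The Basic Window idea can be illustrated with a windowed sum. The sketch below is an illustration under assumptions rather than the method of [5]: the class name is hypothetical, the synopsis is a single partial sum, and expiry is driven by item counts instead of timestamps.

```python
from collections import deque


class BasicWindowSum:
    """Sliding-window sum kept as one partial sum per Basic Window."""

    def __init__(self, window_size, basic_size):
        self.num_basic = window_size // basic_size  # Basic Windows per sliding window
        self.basic_size = basic_size
        self.synopses = deque()  # partial sums of completed Basic Windows
        self.current_sum = 0     # partial sum of the Basic Window being filled
        self.current_count = 0
        self.total = 0

    def add(self, value):
        """Consume one item; return the refreshed sum when a Basic Window fills."""
        self.current_sum += value
        self.current_count += 1
        if self.current_count < self.basic_size:
            return None  # results are refreshed only at Basic Window boundaries
        self.synopses.append(self.current_sum)
        self.total += self.current_sum
        self.current_sum = 0
        self.current_count = 0
        if len(self.synopses) > self.num_basic:
            self.total -= self.synopses.popleft()  # drop the oldest Basic Window
        return self.total


window = BasicWindowSum(window_size=4, basic_size=2)
print([window.add(v) for v in [1, 2, 3, 4, 5, 6]])
```

Only `window_size / basic_size` partial sums are stored rather than every item, which is the memory saving the text describes; the price is that an answer is available only once per Basic Window.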

3.3 Proposed algorithm
We propose the following simple algorithm, Frequent, which employs the Basic Window approach (i.e. the jumping window model) and stores a top-k synopsis in each Basic Window. We fix an integer k and, for each Basic Window, maintain a list of the k most frequent items in this window. We assume that a single Basic Window fits in main memory, within which we may count item frequencies exactly. Let δ_i be the frequency of the kth most frequent item in the ith Basic Window. Then δ = Σ_i δ_i is the upper limit on the frequency of an item type that does not appear on any of the top-k lists. Now, we sum the reported frequencies for each item present in at least one top-k synopsis, and if there exists a category whose reported frequency exceeds δ, we are certain that this category has a true frequency of at least δ. The pseudocode is given below, assuming that N is the sliding window size, b is the number of elements per Basic Window, and N/b is the total number of Basic Windows. An updated answer is generated whenever the window slides forward by b packets.

Implementation
Repeat:
1. For each element e in the next b elements:
   If a local counter exists for the type of element e: increment the local counter.
   Otherwise: create a new local counter for this element type and set it equal to 1.
2. Add a summary S containing the identities and counts of the k most frequent items to the back of queue Q.
3. Delete all local counters.
4. For each type named in S:
   If a global counter exists for this type: add to it the count recorded in S.
   Otherwise: create a new global counter for this element type and set it equal to the count recorded in S.
5. Add the count of the kth largest type in S to δ.
6. If sizeOf(Q) > N/b:
   (a) Remove the summary S' from the front of Q and subtract the count of the kth largest type in S' from δ.
   (b) For all element types named in S': subtract from their global counters the counts recorded in S'. If a counter is decremented to zero, delete it.
   (c) Output the identity and value of each global counter > δ.
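The pseudocode can be turned into a short runnable sketch. The class and method names are hypothetical, and one detail is an assumption not stated in the text: when a Basic Window holds fewer than k distinct types, the kth count is taken as 0, since every type then appears in the summary.

```python
from collections import Counter, deque


class Frequent:
    def __init__(self, window_size, basic_size, k):
        self.max_summaries = window_size // basic_size  # N/b
        self.k = k
        self.queue = deque()            # Q: (summary, kth count) per Basic Window
        self.global_counts = Counter()  # global counters
        self.delta = 0                  # sum of kth counts of the queued summaries

    def process_basic_window(self, elements):
        """Consume the next b elements; return the reported items, or None."""
        local = Counter(elements)                  # step 1: exact local counts
        top = local.most_common(self.k)            # step 2: top-k summary S
        summary = dict(top)
        kth = top[-1][1] if len(top) == self.k else 0  # assumption, see lead-in
        self.queue.append((summary, kth))
        # Step 3: the local counters are simply discarded on return.
        for item, count in summary.items():        # step 4: update global counters
            self.global_counts[item] += count
        self.delta += kth                          # step 5
        if len(self.queue) <= self.max_summaries:  # step 6 applies only when Q overflows
            return None
        old_summary, old_kth = self.queue.popleft()    # 6(a)
        self.delta -= old_kth
        for item, count in old_summary.items():        # 6(b)
            self.global_counts[item] -= count
            if self.global_counts[item] == 0:
                del self.global_counts[item]
        return {item: count                             # 6(c)
                for item, count in self.global_counts.items()
                if count > self.delta}


freq = Frequent(window_size=8, basic_size=4, k=2)
for chunk in ("aaab", "aaac", "aaad"):
    print(freq.process_basic_window(list(chunk)))
```

As in the pseudocode, nothing is reported until the queue first exceeds N/b summaries. Note also that with k = 1 an item's reported frequency can never exceed δ, so k should be at least 2 for any item to be reported.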

3.4 Processing Complex Aggregate Queries over Data Streams
Continuous queries are defined once and run until the user terminates them.

For a continuous query Q with answer set A, several cases arise:
- Q is a selection: size(A) may be unbounded, so we cannot guarantee that we can store it.
- Q is a self-join: if we want to provide only NEW results, then we need unlimited storage to guarantee that no duplicates exist in the result.
- Q contains aggregation: tuples in A might be deleted by newly observed tuples. For example:

    Select A, sum(B)
    From Stream X
    Group by A
    Having sum(B) > 100

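What if B < 0? A group whose sum has exceeded 100 can then fall back below the threshold, so a tuple already reported in the answer has to be removed again.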
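The effect of negative values of B on the Having query above can be seen in a small simulation. This is only an illustrative sketch: the function name and the sample stream are assumptions, and the whole answer is recomputed after each tuple purely for clarity.

```python
from collections import defaultdict


def having_sum_over(stream, threshold=100):
    """Return the answer set (qualifying groups) after each arriving (A, B) tuple."""
    sums = defaultdict(int)
    snapshots = []
    for a, b in stream:
        sums[a] += b
        snapshots.append(sorted(g for g, total in sums.items() if total > threshold))
    return snapshots


# Group "x" enters the answer at 110 and leaves it again when -20 arrives.
print(having_sum_over([("x", 60), ("x", 50), ("x", -20)]))
```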

Further questions:
- What if we can delete tuples in the Stream?
- What if Q contains a blocking operator near the top (for example, aggregation)? Online aggregation techniques are useful here.

Work on Self-Maintenance: important to limit the size of Scratch. If a view is self-maintainable, any auxiliary storage must occupy bounded space.
Work on Data Expiration: important for knowing when to move elements from Scratch to Throw.

Goal: Group similar queries over data sources, to eliminate common processing and to minimize the response time and storage needed. Systems with this goal include:
- Niagara (joint work of Wisconsin and Oregon)
- Tukwila (Washington)
- Telegraph (Berkeley)

Eddy: Knowing the State of Tuples
The Eddy passes tuples by reference to operators (this avoids copying). When the Eddy does not have any more input tuples, it polls the sources for more input. Tuples need to be augmented with additional information:
- Ready Bits: which operators need to be applied
- Done Bits: which operators have been applied
- Queries Completed: signals whether the tuple has been output or rejected by the query
- Completion Mask (per query): to know when a tuple can be output for a query (completion mask & done bits = mask)
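The bit bookkeeping above can be sketched with plain integers used as bit sets. This is a minimal illustration, not TelegraphCQ code: the class, the function names and the three-operator layout are all assumptions.

```python
from dataclasses import dataclass

# Assumed layout: one bit per operator.
OP_A, OP_B, OP_C = 0b001, 0b010, 0b100


@dataclass
class EddyTuple:
    values: dict
    ready: int      # Ready Bits: operators that still need to be applied
    done: int = 0   # Done Bits: operators that have been applied


def apply_operator(tup, op_bit):
    """Record that the operator identified by `op_bit` has processed the tuple."""
    tup.ready &= ~op_bit
    tup.done |= op_bit


def is_complete(tup, completion_mask):
    """A tuple can be output for a query once (mask & done bits) == mask."""
    return (completion_mask & tup.done) == completion_mask


query1_mask = OP_A | OP_B  # query 1 needs operators A and B
query2_mask = OP_A | OP_C  # query 2 needs operators A and C

tup = EddyTuple({"price": 10}, ready=OP_A | OP_B | OP_C)
apply_operator(tup, OP_A)
apply_operator(tup, OP_B)
print(is_complete(tup, query1_mask), is_complete(tup, query2_mask))
```

Because operator A is shared by both assumed queries, its done bit is set once and counts towards both completion masks, which is the sharing the per-query mask is meant to enable.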

Queries with no joins are partitioned per data source (to save space in the bits required). Queries with disjunctions (ORs) are transformed into conjunctive normal form (an AND of ORs). Range and exact-match predicates are handled by a grouped filter.

SteMs and Joins
SteMs: Multiway-Pipelined Joins
Double-pipelined joins maintain a hash index on each relation. When n relations are joined, at least n-2 in-flight indices are needed for intermediate results, even for left-deep trees. The previous approach cannot change the query plan without re-computing the intermediate indices.
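A double-pipelined (symmetric) hash join over two inputs can be sketched as follows. The class and method names are hypothetical, and the sketch covers only the two-relation case with one hash index per relation, not the multiway SteM design.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Each arriving row is indexed on its own side and probed against the other."""

    def __init__(self):
        self.index = {"left": defaultdict(list), "right": defaultdict(list)}

    def insert(self, side, key, row):
        """Add `row` to `side` ("left" or "right"); return the new join results."""
        other = "right" if side == "left" else "left"
        self.index[side][key].append(row)          # build into own hash index
        matches = self.index[other].get(key, [])   # probe the other index
        if side == "left":
            return [(row, match) for match in matches]
        return [(match, row) for match in matches]


join = SymmetricHashJoin()
print(join.insert("left", 1, "L1"))
print(join.insert("right", 1, "R1"))
print(join.insert("left", 1, "L2"))
```

Since every row is stored before it is probed, results are produced as soon as both matching rows have arrived, regardless of which input delivers first.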

4. Conclusion and Future Work


In this paper we have tried to assess the practical implementation of a real-time data stream processing engine by analysing the algorithms and their feasibility, and we found them to be applicable to the test scenario. Next, we will develop our own engine, which captures data packets over a socket, interprets them into meaningful data, processes them, stores the relevant information in the database, and drops the rest of the data once it has been processed and is no longer required.

Real-time stream data processing engines are the need of the hour, since data mining is becoming more critical day by day for the development of individuals, the economic growth of society and the advancement of technology.

5. References
Research Projects:
- Aurora (supports continuous queries, ad-hoc queries, and materialized views): aims to better support monitoring applications.
- Borealis (distributed SPE, QoS-based techniques): a distributed stream processing engine based on Aurora and Medusa.

Shivnath Babu and Rajeev Motwani, Department of Computer Science, Stanford University, Stanford, CA 94305
Cornell University, dobra@cs.cornell.edu
Minos Garofalakis, Bell Labs, Lucent, minos@bell-labs.com
Johannes Gehrke, Cornell University, johannes@cs.cornell.edu
Rajeev Rastogi, Bell Labs, Lucent, rastogi@bell-labs.com

[1] A. Arasu et al. Resource sharing in continuous sliding-window aggregates. In VLDB, 2004.
[2] A. Arasu et al. The CQL continuous query language: Semantic foundations and query execution. VLDB Journal (to appear).
[3] F. Bancilhon et al. FAD, a powerful and simple database language. In VLDB, 1987.
[4] D. Carney et al. Monitoring streams - a new class of data management applications. In VLDB, 2002.
[5] S. Chandrasekaran et al. TelegraphCQ: Continuous dataflow processing for an uncertain world. In CIDR, 2003.
[6] S. Chandrasekaran et al. Streaming queries over streaming data. In VLDB, 2002.
[7] J. Chen et al. NiagaraCQ: a scalable continuous query system for Internet databases. In SIGMOD, 2000.
[8] C. D. Cranor et al. Gigascope: A stream database for network applications. In SIGMOD, 2003.
[9] M. Denny et al. Predicate result range caching for continuous queries. In SIGMOD, 2005.
[10] P. M. Deshpande et al. Caching multidimensional queries using chunks. In SIGMOD, 1998.
[11] C. L. Forgy. Rete: A fast algorithm for the many pattern/many object match problem. Artificial Intelligence, 19(1):17-37, September 1982.
[12] M. J. Franklin et al. Design considerations for high fan-in systems: The HiFi approach. In CIDR, 2005.
[13] L. Golab et al. Update-pattern-aware modeling and processing of continuous queries. In SIGMOD, 2005.
[14] G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73-170, June 1993.
[15] J. Gray et al. Data Cube: a relational aggregation operator generalizing group-by, cross-tab and sub-total. In ICDE, February 1996.
