Time-out Bloom Filter: A New Sampling Method for
每天喝茶
Recording More Flows
啼怎么组词
Shijin Kong1, Tao He2, Xiaoxin Shao3, Xing Li4
{ksj001, sxx033}@mails.tsinghua.edu
ppt的背景Department of Electronic Engineering, Tsinghua University, Beijing, P.R.China, 100084春节庙会
{hetao2, xing4}@cernet.edu
China Education and Rearch Network, Beijing, P.R.China, 100084 Abstract. Packet sampling is widely deployed to generate flow records on high
speed links. However, random sampling in which 1 in N packets is chon suf-
fers from omitting majority of flows, most of which are short flows (within N
packets). Although usage-bad applications work well by recording long flows
and neglecting short ones, there are many other applications which depend on
nearly per-flow information other than flow lengths and sizes. In this paper, a
novel sampling method is propod to remedy the flow loss flaw for tho ap-
plications. We u a Time-out Bloom Filter to alleviate the sampling bias to-
wards long flows. Compared with random sampling, in our solution, short flows
have a much greater probability to be sampled while long flows are always
sampled, but with much fewer sampled packets. Experimental results show that,
with the same sampling rate, our solution records veral times more short
flows than random sampling. Particularly, up to 99% original flows can be re-
trieved from sampled data. Besides, we also propo an adaptive TBF system in
老婆过生日fast SRAM to perform online sampling.
1 Introduction
With the increasing network traffic, most measurement systems employ packet sampling to reduce resource consumption. First, there is not much resource remained for data collection on routers and switches becau of their heavy workload. Forming flow statistics on a sampled substream of the original traffic reduces memory con-sumption and frequency of flow lookups. Second, transmitting sampled data to collec-tors, which is common for many applications, can greatly save bandwidth for collec-tion as well as processing and storage cost at collectors. Moreover, some time-consuming process such as flow records generation can be executed on collectors. The volume of sampled data is small to transmit and the burden on routers and switches is alleviated at the same time.
However, random sampling in which 1 in N packets is chon caus great loss of flow information and it is difficult to recover. A flow is defined as a t of packets with a same key which consists of some fields in packet header. When a packet ar-rives at the router with a flow key different from tho of all current active flows, a new flow record is generated with the new key. A flow is terminated when the time
2 Shijin Kong, Tao He, Xiaoxin Shao, Xing Li
毛竹since the arrival of its latest packet exceeds a time-out threshold. Flow length is de-fined as the num
ber of its packets, and flow size is the total bytes in it. If any packet of a flow is sampled, we call this flow is sampled. In random sampling, a short flow (within N packets) is easily lost if none of its packets is sampled, and a long flow has a much greater probability to be sampled. The bias towards longer flows caus ma-jority of short flows lost, and thus brings a great loss in total sampled flows since most flows in traffic are short flows (e.g. 82.3% in the trace for our experiments).
Few studies have been addresd to the flow loss flaw in random sampling and no sampling method has been designed to meliorate the great loss of flows. Experiments have shown that the distribution of flow lengths is heavy-tailed (e, e.g., [1]), that is, most traffic is carried by a small proportion of long flows. For this reason, sampling methods of many usage-bad applications focus on long flows and neglect short ones [13]. However, there are still a lot of applications which depend on nearly per-flow information other than flow lengths and sizes. Here are some examples.
Attacks Detection: information of short flows is very important to detect network intrusions such as port scanning and SYN flooding. The attacks usually consist of numerous short flows with only veral packets. It is hard to discover the attacks unless nearly per-flow state is maintained [16], including flow key and correct count of SYN/FIN flags. Moreover, identifying a victim requires enough flow records.
Traffic Identification: in particular, P2P traffic can be effectively identified by us-ing connection patterns [2] instead of checking payloads of packets. This method counts <IP, Port> pairs retrieved from the flow information. Any obvious loss of flow records will cau fallacious counting results and hurt identification accuracy.
Network Deployment Characterizing: the diversity of flow records reflects the spatial distribution of flows in the network. Workloads of network devices (e.g. the size of route tables) can be balanced according to the distribution, which helps to de-ploy and manage network more efficiently.
In order to meet the needs of tho applications, we prent a novel packet sam-pling method in this paper to meliorate the flow loss flaw. Little attention has been paid to exact values of flow lengths and sizes since tho applications do not care about them. In our solution, the number of total sampled flows is incread by re-cording more short flows which are lost in random sampling. Packets are lected by a data structure called Time-out Bloom Filter (TBF). In TBF, some packets can have a great probability to be sampled, and others are definitely discarded. For short flows, considerable proportions of their packets have such a probability. But for long flows, the proportions are very small. Thus, compared with random sampling, our solution has a smaller sampling bias towards long flows. Experimental results show that our method can sample veral times more short
flows than random sampling with same sampling rate (the ratio of sampled packets count to original packets count). Particu-larly, up to 99% of total original flows can be retrieved from sampled data while ran-dom sampling only records 37%.
The rest of this paper is organized as follows. Section 2 reviews related work. Sec-tion 3 propos our measurement method and ction 4 makes the comparison with random sampling. Results of measurement on experimental traces are prented in Section 5 and ction 6 propos an adaptive TBF system. Finally, ction 7 con-cludes the whole paper and discuss the future work.
Time-out Bloom Filter: A New Sampling Method for Recording More Flows 3 2 Related Work
In this ction, we review previous work on flow-related sampling, and hash-bad applications which are similar to ours.
Classical uniform sampling methods like random sampling are reviewed in Duf-field’s paper [3]. Flow records on randomly sampled packets are commonly generated by Cisco NetFlow [4] with configurable sampling period N. In [5], an adaptive Net-Flow with dynamic sampling rate is devid, and one of its primary contributions is to give accurate numbers of non-TCP flows. Sampling methods for long flows such as smart sampling [6, 7] can give an accurate estimation of total usage f
or each flow size. Although per-flow detailed information can not be told by the methods, re-arch has been done to estimate the distribution of flow lengths [8, 9].
Hash-bad sampling methods have also been applied for veral purpos. Trajec-tory sampling [10, 11] puts particular flow keys in hash tables. Later packets with tho keys are lected at each node to detect spatial distribution of flows. Space-Code Filter [12] us a group of Bloom Filters with different resolutions to estimate flow length of any given key. Multistage Filter [13] lects packets bad on veral hash functions to identify large flows. And Partial Completion Filters in [14] propo
a scalable solution to detect network attacks.
3 Our Sampling Method
In this ction, we introduce Time-out Bloom Filter for packets sampling. TBF is derived from standard Bloom Filter [15]. A BF is a hash table with m bits, denoted as b[0], b[1], …, b[m-1], each of which can be 0 or 1. There are d independent hash functions, h1(x), h2(x), …, h d(x), attached, and each of them maps a given key to one of the m bits. To inrt a key c into the table, all the d bits b[h1(c)], b[h2(c)], …, b[h d(c)] are t to 1. Initially there are n keys in the table. Later, the table is us
ed to check whether a given key c’ has been inrted. If b[h1(c’)], b[h2(c’)], …, b[h d(c’)] are all t to 1, c’ is recognized as one of the n initial keys, otherwi it is not. On aver-age, BF has a complexity of O(1) for querying a given key, which is much faster than traditional hash method with complexity O(m). However, since different keys can have same values calculated by hash functions, each bit can be t by veral keys. If all the d bits of a non-initial key have already been t to 1 by initial keys, it will be mistaken as one of them. This mistake is called a fal positive error.
TBF is ud to tell whether an incoming packet can be sampled. Fast query of BF is retained and the fal-positive error is reconsidered in TBF. Section 3.1 shows the principles of TBF and in Section 3.2 we explain the motivation for designing TBF. 3.1 Time-out Bloom Filter
TBF is similar to BF except that it does not have m bits, but m buckets instead, each of which contains a timestamp. The m timestamps are denoted as t[0], t[1], …, t[m-1]. Besides, it has a bucket time-out value t0.
4 Shijin Kong, Tao He, Xiaoxin Shao, Xing Li
Our algorithm is described as follows. When a new packet with key c and time-stamp t comes, the d timestamps, t[h1(c)], t[h2(c)], …, t[h d(c)], are compared with t. If any of the d timestamps, the i th for
example, follows t - t[h i(c)] > t0 (or we say b[h i(c)] gets time-out), the packet is sampled, otherwi it is discarded. After comparison, all tho d timestamps are updated to t even if the packet is not sampled. Figure 1 illus-trates this process with d=3.
Fig. 1. A Time-out Bloom Filter with d=3
TBF differs from BF in two aspects: (1) TBF ts a timestamp for each bucket in-stead of simply tting it to 1, and updates the timestamps after every packet lec-tion; (2) TBF samples a packet as long as any of the d buckets gets time-out while BF requires that all the d bits are t to 1.
3.2 Why Time-out Bloom Filter
Our initial motivation is to sample more flows, so ideally for each flow at least one packet should be sampled. A simple solution is to lect the first packet of each flow and discard all the rest packets of it. In this process, we do not have to know any more information of the flow (e.g. IP address, ports). The only thing we want to know is whether an incoming packet belongs to an active flow or it is the first packet of a new flow. BF can tell the result quickly. We put all the keys of existed flows in BF and query it every time when a packet comes. According to the query result, an incoming packet is discarded if it is already in BF, or otherwi it is sampled as the first packet of a new flow. Hence a packet is sampled if any of the m bits is not t to 1. This is the origin of the cond difference between BF and TBF mentioned in Section 3.1.
Unfortunately, there are two fatal drawbacks using BF.
(1). Not all first packets can be sampled becau fal positive error happens to mistake the first packet of a new flow for a packet of an existed flow. When this error occurs, the first packet will not be sampled and this flow is lost.
(2). More riously, the longer the sampling is performed, the more existed flows are put into BF. Finally all m bits will be t to 1. A full-filled BF caus fal posi-tive error in every query, rejecting p
先进后出
ackets of newly generated flows to be sampled. A solution to this drawback is to clear the BF periodically, but the filter will be full-filled quickly on high-speed links (usually less than one cond). There is not enough time for any practical sampling implementation to clear BF every cond.
Time-out Bloom Filter: A New Sampling Method for Recording More Flows 5 Thus TBF is applied to meliorate tho drawbacks. In TBF, a bucket getting time-out can be viewed as a bit t to “0” and a bucket not getting time-out as a bit t to “1”. When an incoming packet arrives at time t, only flows that have packets updated within [t-t0, t] are still kept in the filter. All other buckets are logically t to “0”. As time elapsing, the “1” to ”0” transforming process is automatically executed. If t0 is small enough, TBF is never full-filled with “1”. Hence drawback (2) is avoided.
Each flow can have multiple packets sampled in TBF rather than exactly one. When the time interval between two continuous packets of a flow is smaller than t0, the latter one is definitely discarded since none of its d buckets gets time-out. When the interval is greater than t0, the latter one may not be discarded. As long as any of tho buckets is not updated by other flows during the interval, the packet is sampled, or otherwi a fal positive error occurs. A flow is lost only if all its packets encoun-ter fal positive errors, which greatly meliorates drawback (1). But as a result, re-dundant sampled traffic is generated becau flows unnecessarily have multiple pack-ets sampled.
4 Comparing Sampling Methods
In this ction, we compare our sampling method bad on TBF with random sam-pling, and then explain why our sampling method can sample more flows.
4.1 Random Sampling: Sampling Bias towards Long Flows
Assuming that 1 in every N packets is lected by random sampling, it is easy to figure out the probability of a packet to be sampled is
P s=1/N(1) Let F(k) denote all the flows with length k and M(k) denote the number of F(k). On average, k M(k)/N of packets of F(k) are sampled. So the proportion of sampled flows of F(k) has a minimum 1/N when M(k)/N of F(k) have one packet sampled for each, and a maximum k/N when k/N of F(k) have all their packets sampled.
交通标志有哪些Usually, N is greater than 10 so that for k much smaller than N, the proportion of sampled flows is very small. For longer flows with k much greater than N, almost all of them are sampled. The discrepancy of sampling probabilities reprents the sampling bias towards long flows.
4.2 TBF: Greater Packet Sampling Probability and Less Biad Sampling
First, we u the conclusion of BF in [15]. The probability that a bucket gets “1” is
P1=1 – (1 – 1/m)Ld(2) where L is the number of S(t0), the t of flows sampled during previous t0 interval. L can be measured by placing an empty TBF on the link for t0 time and counting the flows in it. For example, t m=65,536, d=3, t0=0.2s, then L for the trace in Section 5
6 Shijin Kong, Tao He, Xiaoxin Shao, Xing Li
is 5,882 on average (we divide the trace into thousands of t0 periods, and measure L
on each t0 interval to get the average). Thus, a typical value of P1 is 0.23. If L does not
vary much whenever it is measured, P1 is always around 0.23.
d 1 2 3 4
P1’/P10.14/0.09 0.18/0.16 0.27/0.23 0.32/0.30
Table 1.P1’ and P1 with m=65,536, t0=0.2s and veral d
Now, we suppo a packet with a key c, which does not belong to S(t0), comes. For each i (1≤i≤d), o
ne would easily expect the probability that b[h i(c)] gets “1” (denoted
as P1’) is P1. However, that’s not the ca. P1’ is usually greater than P1. We think this inconsistency appears due to the correlation among packets. For example, some appli-cations start veral flows with same source IP address at nearly the same time. Any hash function only us source IP address for calculation will map the packets of all the flows to a same bucket. As long as any of the flows is in S(t0), P1’ for packets
of other flows is definitely 1 rather than P1. Besides, there are many other reasons that cau P1’ to be 1. So on average, P1’ is greater than P1.
To minimize the correlation, hash functions must be carefully chon. A hash func-
tion should u all fields of flow keys for calculation. Thus, even if two flow keys, c1
and c2, are only different in one field, h i(c1) and h i(c2) are not the same for each i
(1≤i≤d). We u such a group of all-fields-dependent hash functions in Section 5, and
P1’ is measured and compared with P1 in Table 1. As we e, P1’ is very clo to P1, especially when d >1. The correlation is effectively avoided. Thus, if a packet does
not belong to S(t0), the probability that any of its d buckets gets “0” (time-out) is
P s=1 - (P’1)d≈1 - P1d(3)
P s is small if d is t to 3 or larger. Again with the example mentioned above, P1 is
0.23 and P s is 0.98. It is much greater than that of random sampling (in Equation 1).
Fig. 2. (a) Average inter-packet interval (b) The proportion of “candidates” for t0=0.2s
We define a “candidate” as a packet that has a probability P s to be sampled. A packet that is definitely not sampled is not a “candidate”. In random sampling, all the packets are “candidates”. In TBF, however, only if the interval between two continu-
ous packets in a flow is greater than t0, the latter one can be a “candidate”. In Figure
2(a), we can e that the longer the flow, the shorter the average inter-packet interval