Big Data Mining and Analytics  2018, Vol. 1 Issue (1): 47-56    DOI: 10.26599/BDMA.2018.9020005
    
Online Internet Traffic Monitoring System Using Spark Streaming
Baojun Zhou, Jie Li*, Xiaoyan Wang, Yu Gu, Li Xu, Yongqiang Hu, Lihua Zhu
Baojun Zhou and Jie Li are with the Department of Computer Science, University of Tsukuba, Tsukuba 305-8577, Japan. E-mail: zhoubaojun@osdp.cs.tsukuba.ac.jp.
Xiaoyan Wang is with the College of Engineering, Ibaraki University, Hitachi 316-8511, Japan. E-mail: xiaoyan.wang.shawn@vc.ibaraki.ac.jp.
Yu Gu is with the School of Computer and Information, Hefei University of Technology, Hefei 230601, China. E-mail: yugu.bruce@ieee.org.
Li Xu is with the College of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350007, China. E-mail: xuli@fjnu.edu.cn.
Yongqiang Hu and Lihua Zhu are with the Institute of Scientific and Technical Information of Qinghai, Xining 810008, China. E-mail: yqhua@163.com; zlh97330@163.com.

Abstract  

Owing to the explosive growth of Internet traffic, network operators must be able to monitor the entire network situation and efficiently manage their network resources. Traditional network analysis methods, which usually run on a single machine, are no longer suitable for huge volumes of traffic data because of their limited processing capacity. Big data frameworks such as Hadoop and Spark can handle such analysis jobs even for large amounts of network traffic. However, Hadoop and Spark are inherently designed for offline data analysis. To cope with streaming data, various stream-processing frameworks have been proposed, such as Storm, Flink, and Spark Streaming. In this study, we propose an online Internet traffic monitoring system based on Spark Streaming. The system comprises three parts: the collector, the messaging system, and the stream processor. We use TCP performance monitoring as a use case to show how network monitoring can be performed with the proposed system. Experiments on a cluster in standalone mode show that the system performs well for large-scale Internet traffic measurement and monitoring.



Key words: Spark Streaming; network monitoring; big data; TCP performance monitoring
Received: 11 August 2017      Published: 08 January 2020
Corresponding Authors: Jie Li   
Cite this article:

Baojun Zhou, Jie Li, Xiaoyan Wang, Yu Gu, Li Xu, Yongqiang Hu, Lihua Zhu. Online Internet Traffic Monitoring System Using Spark Streaming. Big Data Mining and Analytics, 2018, 1(1): 47-56.

URL:

http://bigdata.tsinghuajournals.com/10.26599/BDMA.2018.9020005     OR     http://bigdata.tsinghuajournals.com/Y2018/V1/I1/47

Fig. 1 Architecture of the proposed online monitoring system.
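
The three-part architecture named in the abstract (collector, messaging system, stream processor) can be illustrated with a minimal in-process sketch. This is not the authors' implementation: the record format, the `MessageQueue` class, and the per-source byte count are all hypothetical, and batches are cut by record count rather than by a time interval so that the example stays deterministic.

```python
from collections import Counter, deque


def collector(packets):
    """Hypothetical collector: serialize captured packets into text records."""
    for ts, src, dst, size in packets:
        yield f"{ts},{src},{dst},{size}"


class MessageQueue:
    """In-memory FIFO buffer standing in for the messaging system."""

    def __init__(self):
        self._buf = deque()

    def publish(self, record):
        self._buf.append(record)

    def poll(self, max_records):
        # Hand out at most max_records records, oldest first.
        out = []
        while self._buf and len(out) < max_records:
            out.append(self._buf.popleft())
        return out


def process_batch(records):
    """Stream-processor step: total bytes per source address in one batch."""
    totals = Counter()
    for rec in records:
        _ts, src, _dst, size = rec.split(",")
        totals[src] += int(size)
    return dict(totals)


# Made-up packets: (timestamp, source, destination, size in bytes).
packets = [
    (0.1, "10.0.0.1", "10.0.0.9", 100),
    (0.2, "10.0.0.2", "10.0.0.9", 50),
    (0.3, "10.0.0.1", "10.0.0.9", 25),
    (1.1, "10.0.0.2", "10.0.0.9", 75),
]

queue = MessageQueue()
for record in collector(packets):
    queue.publish(record)

# Assumption: a batch is "up to 3 records"; a real stream processor
# would cut batches on a fixed time interval instead.
first_batch = process_batch(queue.poll(max_records=3))
second_batch = process_batch(queue.poll(max_records=3))
```

The queue decouples the collector from the processor, so each batch is computed only from records that have already been buffered.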
| Transformation | Meaning |
|---|---|
| map() | Map each element in the source stream to a new value. |
| flatMap() | Similar to map(), but each element can be mapped to 0 or more output items. |
| mapValues() | Map the value of each key-value pair without changing the key. |
| reduce() | Repeatedly combine 2 elements of the source stream into 1 new element. |
| reduceByKey() | Repeatedly combine 2 key-value pairs with the same key into 1 new key-value pair. |
| groupByKey() | Group all key-value pairs with the same key together. |
| countByValue() | Count the frequency of each element and return a key-value pair stream whose key is the element and whose value is the count. |
| join() | Join two key-value pair streams (K, V) and (K, W), returning a new stream of (K, (V, W)) pairs with all pairs of elements for each key. |

Table 1 Useful Spark Streaming transformation APIs.
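
The semantics of several transformations in Table 1 can be illustrated on a single batch with plain Python. This sketch is an emulation only and does not call the Spark Streaming API: the helper functions (`reduce_by_key`, `group_by_key`, `count_by_value`, `join`) and the sample records are hypothetical. The `join` helper nests matched values as (K, (V, W)), which is how Spark pairs them.

```python
from collections import Counter, defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Combine values that share a key, two at a time, with func."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc


def group_by_key(pairs):
    """Collect all values that share a key into one list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def count_by_value(items):
    """Map each distinct element to its frequency."""
    return dict(Counter(items))


def join(left, right):
    """Inner join of (K, V) and (K, W) pairs into (K, (V, W)) pairs."""
    right_groups = group_by_key(right)
    return [(k, (v, w)) for k, v in left for w in right_groups.get(k, [])]


# One batch of made-up "source_ip destination_port" records.
batch = ["10.0.0.1 80", "10.0.0.2 443", "10.0.0.1 443"]

pairs = [tuple(line.split()) for line in batch]            # like map()
tokens = [tok for line in batch for tok in line.split()]   # like flatMap()
ones = [(ip, 1) for ip, _port in pairs]                    # like map()

per_ip = reduce_by_key(ones, lambda a, b: a + b)           # like reduceByKey()
ports = group_by_key(pairs)                                # like groupByKey()
port_counts = count_by_value(port for _ip, port in pairs)  # like countByValue()
total = reduce(lambda a, b: a + b, (n for _ip, n in ones)) # like reduce()

# Hypothetical lookup table of host roles to join against.
roles = [("10.0.0.1", "web"), ("10.0.0.2", "db")]
joined = join(pairs, roles)                                # like join()
```

In Spark Streaming the same operations are applied to every batch of the stream and run distributed across the cluster; here they run once on an in-memory list.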
Fig. 2 Typical TCP keep-alive.
Fig. 3 System structure.
| Component | Configuration |
|---|---|
| Collector | 2 machines; model t2.micro; 1 GB memory |
| Messaging system | 1 machine; model r4.large; high-frequency Intel Xeon E5-2686 v4 (Broadwell) processors; 15.25 GB memory; bandwidth up to 10 Gbps |
| Stream processor | 5 machines; model c4.large; high-frequency Intel Xeon E5-2666 v3 (Haswell) processors; 3.75 GB memory; bandwidth of 500 Mbps |

Table 2 Configurations for each component.
Fig. 4 Network performance measured by our system.
Fig. 5 Performance statistics of stream processor in Spark UI, where processing time is the time taken to process all jobs for a batch, and scheduling delay is the time to ship the jobs from scheduler to executor.
Fig. 6 System performance changes when a slave crashes.