Big Data Mining and Analytics  2021, Vol. 4 Issue (4): 279-297    DOI: 10.26599/BDMA.2021.9020012
    
Intelligent and Adaptive Web Data Extraction System Using Convolutional and Long Short-Term Memory Deep Learning Networks
Sudhir Kumar Patnaik*, C. Narendra Babu, Mukul Bhave
Department of Computer Science and Engineering, M. S. Ramaiah University of Applied Sciences, Bangalore 560054, India
Gibraltar India Solutions LLP, Bangalore 560103, India

Abstract  

Data, collected using advanced web scraping technologies, are crucial to the growth of e-commerce in today’s world of highly demanding, hyper-personalized consumer experiences. However, core data extraction engines fail because they cannot adapt to dynamic changes in website content. This study investigates an intelligent and adaptive web data extraction system built on convolutional and Long Short-Term Memory (LSTM) networks: the You Only Look Once (Yolo) algorithm automatically detects product details on web pages as image regions, and Tesseract LSTM extracts the text from those regions. The system does not need a core data extraction engine and can therefore adapt to dynamic changes in website layout. Experiments conducted on real-world retail cases demonstrate an image detection precision of 97% and a character extraction precision of 99%. In addition, a mean average precision of 74% is obtained with an input dataset of 45 objects or images.



Key words: adaptive web scraping; deep learning; Long Short-Term Memory (LSTM); web data extraction; You Only Look Once (Yolo)
Received: 18 April 2021      Published: 30 August 2021
Corresponding Authors: Sudhir Kumar Patnaik     E-mail: skpatnaik9@gmail.com;narendrababu.c@gmail.com;mukulbhave@gmail.com
About author: Sudhir Kumar Patnaik received the MEng degree in electronics and communication from National Institute of Technology, Rourkela, India in 1995. He is currently a PhD candidate in computer science (machine learning) at M. S. Ramaiah University of Applied Sciences, Bangalore, India. He is working as the vice president of engineering and site leader at Gibraltar India Solutions LLP, Bangalore, India. Prior to Gibraltar India, he was the VP of platform engineering at Intuit India for 13 years. His research interests are in the areas of data extraction, deep learning, and machine learning. He is a member of the Industry Advisory Board at International Institute of Information Technology, Bangalore, and a member of the Board of Studies for Computer Science at Vellore Institute of Technology, Andhra Pradesh. He is also a senior member of IEEE, a fellow at Institution of Engineers (India), and a member of CSI, ACM, and ISTE.

C. Narendra Babu received the BEng degree in CSE from Adichunchanagiri Institute of Technology, India in 2000, the MEng degree in CSE from M. S. Ramaiah Institute of Technology, India in 2004, and the PhD degree from Jawaharlal Nehru Technological University Anantapur, India in 2015. He is currently an associate professor at the Department of Computer Science and Engineering, M. S. Ramaiah University of Applied Sciences, Bangalore, India. His research interests include artificial intelligence, machine learning, data analytics, social media analytics, and time series and spatio-temporal data modeling. He is a senior member of IEEE, a member of the IEEE Education Society, and a member of IAENG. He has published a book chapter, over twelve refereed journal papers, and eleven refereed conference proceeding papers.

Mukul Bhave received the MS degree in mathematics from Bundelkhand University, Jhansi, India in 1997, and the MS degree in business administration and management from Pt. Ravishankar Shukla University, Raipur, India in 1999. He is working as a software engineer at Gibraltar India Solutions LLP, Bangalore, India. He was previously employed at Intuit, where he worked on the data aggregation platform. With over 16 years of software development experience, he has worked with Digital Insight, SoftwareAG (webMethods), and MindTree. His research interests are building application servers and platforms, deep learning, and web data extraction.
Cite this article:

Sudhir Kumar Patnaik, C. Narendra Babu, Mukul Bhave. Intelligent and Adaptive Web Data Extraction System Using Convolutional and Long Short-Term Memory Deep Learning Networks. Big Data Mining and Analytics, 2021, 4(4): 279-297.

URL:

http://bigdata.tsinghuajournals.com/10.26599/BDMA.2021.9020012     OR     http://bigdata.tsinghuajournals.com/Y2021/V4/I4/279

Model | Algorithm/Technique
R-CNN | Dataset: ImageNet; Classification: Binary SVM
Fast R-CNN | RoI pooling
Faster R-CNN | RPN
Mask R-CNN | RoI align (pixel level segmentation)
RetinaNet | ResNet, Feature Pyramid Network (FPN)
Single Shot Detector (SSD) | Single deep neural network, feed forward convolutional network
Histogram of Oriented Gradients (HOG) | Detection window, RoI
Region-Fully Convolutional Network (R-FCN) | Convolutional + RoI
Spatial Pyramid Pooling (SPP) | Pyramid pooling
Yolo | Classification and regression, Deformable Parts Model (DPM) and R-CNN
Table 1 Deep learning models and techniques.
Tool | Extraction rule | Technique | Precision (%) | Self-healing
TSIMMIS [29] | Wrapper-based | Traditional/statistical | Not available | No
WebOQL [30] | Tag tree | Traditional/statistical | Not available | No
WHISK [31] | Regular expression | Supervised learning | 69 | No
RAPIER [32] | Logic rules | Supervised learning | 89 | No
SRV [33] | Logic rules | Supervised learning | 58 | No
SoftMealy [34] | Regular expression | Supervised learning | 58 | No
DEPTA [35] | Tag tree | Unsupervised learning | 98 | No
Trinity [36] | Regular expression | Unsupervised learning | 96 | No
DeLA [37] | Regular expression | Unsupervised learning | 80 | No
OLERA [38] | Regular expression | Semi-supervised learning | 99 | No
Proposed system | Object detection | Deep learning | To be determined | Yes
Table 2 Comparison of traditional, machine learning, and deep learning based web data extraction tools.
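The "Self-healing: No" entries for the rule-based tools in Table 2 can be illustrated with a small sketch. It is not taken from any of the listed tools: the rule `PRICE_RULE`, the function `extract_price`, and both page snippets are hypothetical, chosen only to show how a regular-expression extraction rule tied to page markup stops matching after a layout change.

```python
import re

# Hypothetical wrapper rule: assumes the price sits in <span class="price">...</span>.
PRICE_RULE = re.compile(r'<span class="price">\s*\$([0-9]+\.[0-9]{2})\s*</span>')


def extract_price(html):
    """Return the price as a string, or None when the rule no longer matches."""
    match = PRICE_RULE.search(html)
    return match.group(1) if match else None


# Two hypothetical versions of the same product page, before and after a redesign.
page_before = '<div><h1>Desk lamp</h1><span class="price">$24.99</span></div>'
page_after = '<div><h1>Desk lamp</h1><p data-role="amount">USD 24.99</p></div>'

price_before = extract_price(page_before)  # rule matches the original markup
price_after = extract_price(page_after)    # rule misses after the markup changes

print(price_before, price_after)
```

The price is still visible to a human reader on the redesigned page, but the markup-bound rule returns nothing until someone rewrites it. This is the kind of failure that a detector working on the rendered page image is meant to avoid.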
Fig. 1 Traditional, machine learning, and deep learning based web data extraction system.
Fig. 2 Traditional, machine learning, and deep learning based web data extraction system with core data extraction engine.
Fig. 3 Failure in core data extraction engine using traditional, machine learning, and deep learning technique for automated web data extraction.
Fig. 4 End-to-end automated web data extraction system architecture using Yolo and Tesseract.
Fig. 5 Yolo architecture for object detection in the proposed web data extraction system.
Tool | Detail
Programming language | Python 3.8
Object detection ML library | Yolo
Text extraction ML library | Tesseract (4.1.1)
Data storage | HDF5
GPU computing capacity | Yolo trained on Intel I7-10750H CPU @ 2.5 GHz
Installation and package management | Anaconda 4.9.0
Source code repository | GitHub
Table 3 Development environment, tools, and technologies.
Fig. 6 Tesseract LSTM architecture for image-to-text extraction in the proposed web data extraction system.
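The detect-then-read flow named in the captions of Figs. 4 to 6 can be sketched as follows. This is a minimal illustration of the control flow only, not the authors' implementation: the `Detection` class, `extract_fields`, the label names, and the 0.5 confidence cut-off are all assumptions. The detector and the text reader are passed in as plain callables, so a real setup could plug in a trained Yolo model and a Tesseract wrapper (for example `pytesseract.image_to_string` applied to a cropped region), while the demo below uses stubs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Detection:
    label: str         # hypothetical class name such as "title" or "price"
    box: Box
    confidence: float


def extract_fields(page_image, detect: Callable, read_text: Callable,
                   min_confidence: float = 0.5) -> Dict[str, str]:
    """Detect labeled regions, keep the best box per label, then read its text."""
    best: Dict[str, Detection] = {}
    for det in detect(page_image):
        if det.confidence < min_confidence:
            continue  # drop weak detections (the threshold is an assumption)
        current = best.get(det.label)
        if current is None or det.confidence > current.confidence:
            best[det.label] = det
    return {label: read_text(page_image, det.box).strip()
            for label, det in best.items()}


# Stubs standing in for the object detector and the OCR engine.
fake_page = {(10, 10, 200, 40): "Desk lamp\n", (10, 50, 90, 80): " $24.99 "}


def fake_detect(page) -> List[Detection]:
    return [
        Detection("title", (10, 10, 200, 40), 0.93),
        Detection("price", (10, 50, 90, 80), 0.88),
        Detection("price", (300, 50, 380, 80), 0.41),  # filtered out below
    ]


def fake_read_text(page, box: Box) -> str:
    return page[box]


fields = extract_fields(fake_page, fake_detect, fake_read_text)
print(fields)
```

Because nothing in `extract_fields` refers to HTML tags or URLs, a layout change only matters to the extent that it changes what the detector sees on the rendered page.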
Experiment | Product | Domain | Error | Self-correction
1 | Single | Retail | No | No
2 | Single | Retail | Yes | Yes
3 | Multiple | Retail | No | No
4 | Multiple | Retail | Yes | Yes
5 | Single | Nonretail | No | No
6 | Single | Nonretail | Yes | Yes
7 | Multiple | Nonretail | No | No
Table 4 Experiment structure.
Fig. 7 Object detection with bounding boxes around single product detail without changes in the website layout or location (URL) of the product page.
Fig. 8 Object detection with bounding boxes around single product detail with changes in the website layout or location (URL) of the product page.
Fig. 9 Object detection with bounding boxes around single product detail and data extracted without changes in the website layout or location (URL) of the product page.
Fig. 10 Object detection with bounding boxes around single product detail and data extracted with changes in the website layout or location (URL) of the product page.
Fig. 11 Object detection with bounding boxes around multiple product detail without changes in the website layout or location (URL) of the product page.
Fig. 12 Object detection with bounding boxes around multiple product detail with changes in the website layout or location (URL) of the product page.
Fig. 13 Object detection with bounding boxes around multiple product detail and data extracted without changes in the website layout or location (URL) of the product page.
Fig. 14 Object detection with bounding boxes around multiple product detail and data extracted with changes in the website layout or location (URL) of the product page.
Fig. 15 Object detection with bounding boxes around single product detail in a nonretail website without changes in the website layout or location (URL) of the product page.
Fig. 16 Object detection with bounding boxes around single product detail in a nonretail website with changes in the website layout or location (URL) of the product page.
Fig. 17 Object detection with bounding boxes around product detail in a nonretail website without changes in the website layout or location (URL) of the product page.
Fig. 18 Object detection with bounding boxes around product detail and data extracted from a nonretail website with changes in the website layout or location (URL) of the product page.
Fig. 19 Object detection with bounding boxes around multiple product detail in a nonretail website without changes in the website layout or location (URL) of the product page.
Fig. 20 Object detection with bounding boxes around multiple product detail and data extracted from a nonretail website with changes in the website layout or location (URL) of the product page.
Parameter | Object/character
mAP | 74%
Object extraction accuracy (precision and recall) | 97% and 48.8%
Character extraction accuracy (precision and recall) | 99% and 49.5%
Evaluation time | 1.02 s (Avg.)
Total loss | 6% (Avg.)
Validation loss | 6% (Avg.)
Table 5 Performance metrics of the proposed web data extraction system.
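Table 5 pairs high precision with much lower recall. A short sketch of how such figures are commonly computed for bounding boxes may help in reading them. The evaluation protocol used in the paper is not given here, so the details are assumptions: boxes are `(left, top, right, bottom)` tuples, a prediction counts as correct when its intersection over union (IoU) with an unmatched ground-truth box is at least 0.5, and matching is greedy. The boxes in the example are made up, and the sketch does not compute mAP.

```python
def iou(a, b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def precision_recall(predicted, truth, threshold=0.5):
    """Greedy one-to-one matching of predicted boxes to ground-truth boxes."""
    unmatched = list(truth)
    true_pos = 0
    for box in predicted:
        match = next((t for t in unmatched if iou(box, t) >= threshold), None)
        if match is not None:
            unmatched.remove(match)  # each ground-truth box is matched once
            true_pos += 1
    false_pos = len(predicted) - true_pos
    false_neg = len(unmatched)
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    return precision, recall


# Hypothetical boxes: four labeled regions, of which the detector finds two.
truth = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50), (60, 60, 70, 70)]
predicted = [(1, 1, 10, 10), (20, 20, 30, 31)]

precision, recall = precision_recall(predicted, truth)
print(precision, recall)
```

In this made-up case every predicted box is correct, but half of the labeled regions are never detected, so precision is 1.0 and recall is 0.5. The same pattern, few false detections but many missed regions, is one way to read a result such as 97% precision alongside 48.8% recall.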
Fig. 21 Evaluation time and total loss of the proposed web data extraction system.
Fig. 22 Validation loss of the proposed web data extraction model.
References
[1] Zhang Y. B., Image feature extraction algorithm in big data environment, Journal of Intelligent and Fuzzy Systems, vol. 39, no. 4, pp. 5109-5118, 2020.
[2] Xie L., Tao J. L., Zhang Q. N., and Zhou H. Y., CNN and KPCA-based automated feature extraction for real time driving pattern recognition, IEEE Access, vol. 7, pp. 123765-123775, 2019.
[3] Tao J., Wang H. B., Zhang X. Y., Li X. Y., and Yang H. W., An object detection system based on YOLO in traffic scene, in Proc. of 2017 6th Int. Conf. Computer Science and Network Technology (ICCSNT), Dalian, China, 2017, pp. 315-319.
[4] Ali F., Ali A., Imran M., Naqvi R. A., Siddiqi M. H., and Kwak K. S., Traffic accident detection and condition analysis based on social networking data, Accident Analysis & Prevention, vol. 151, p. 105973, 2021.
[5] Islam N., Islam Z., and Noor N., A survey on optical character recognition system, Journal of Information & Communication Technology-JICT, vol. 10, no. 2, pp. 1-4, 2016.
[6] Rao H. and Sashikumar D. R. M., A survey on automated web data extraction techniques for product specification from e-commerce web sites, International Journal of Advanced Research in Computer Science and Software Engineering, vol. 6, no. 8, pp. 310-316, 2016.
[7] Uzun E., A novel web scraping approach using the additional information obtained from web pages, IEEE Access, vol. 8, pp. 61726-61740, 2020.
[8] Salah M., Al Okush B., and Al Rifaee M., A comparison of web data extraction techniques, in Proc. of 2019 IEEE Jordan Int. Joint Conf. Electrical Engineering and Information Technology (JEEIT), Amman, Jordan, 2019, pp. 785-789.
[9] Li S. L., Chen C., Luo K. W., and Song B., Review of deep web data extraction, in Proc. of 2019 IEEE Symp. Series on Computational Intelligence (SSCI), Xiamen, China, 2019, pp. 1068-1070.
[10] Nadee W. and Prutsachainimmit K., Towards data extraction of dynamic content from JavaScript web applications, in Proc. of 2018 Int. Conf. Information Networking (ICOIN), Chiang Mai, Thailand, 2018, pp. 750-754.
[11] Ujwal B. V. S., Gaind B., Kundu A., Holla A., and Rungta M., Classification-based adaptive web scraper, in Proc. of 16th IEEE Int. Conf. Machine Learning and Applications, Cancun, Mexico, 2017, pp. 125-132.
[12] Park J. and Barbosa D., Adaptive record extraction from web pages, in Proc. of WWW 2007, Banff, Canada, 2007, pp. 1335-1336.
[13] Liu C. J., Tao Y. F., Liang J. W., Li K., and Chen Y. H., Object detection based on YOLO network, in Proc. of 2018 IEEE 4th Information Technology and Mechatronics Engineering Conf. (ITOEC), Chongqing, China, 2018, pp. 799-803.
[14] Hong J. L., Deep web data extraction, in Proc. of 2010 IEEE Int. Conf. Systems, Man and Cybernetics, Istanbul, Turkey, 2010, pp. 3420-3427.
[15] Ali F., Khan P., Riaz K., Kwak D., Abuhmed T., Park D., and Kwak K. S., A fuzzy ontology and SVM-based web content classification system, IEEE Access, vol. 5, pp. 25781-25797, 2017.
[16] Li W., Shao W., Ji S. X., and Cambria E., BiERU: Bidirectional emotional recurrent unit for conversational sentiment analysis, arXiv preprint arXiv: 2006.00492, 2021.
[17] He K. M., Gkioxari G., Dollár P., and Girshick R., Mask R-CNN, in Proc. of 2017 IEEE Int. Conf. Computer Vision (ICCV), Venice, Italy, 2017, pp. 2980-2988.
[18] Nagarajan S. and Perumal K., A deep neural network for information extraction from web pages, in Proc. of 2017 IEEE Int. Conf. Power, Control, Signals and Instrumentation Engineering (ICPCSI), Chennai, India, 2017, pp. 918-922.
[19] Gogar T., Hubacek O., and Sedivy J., Deep neural networks for web page information extraction, in Artificial Intelligence Applications and Innovations. IFIP Advances in Information and Communication Technology, vol. 475, Iliadis L. and Maglogiannis I., eds. Thessaloniki, Greece: Springer, 2016, pp. 154-163.
[20] Baumgartner R., Ceresna M., and Ledermuller G., DeepWeb navigation in web data extraction, in Proc. of Int. Conf. Computational Intelligence for Modelling, Control and Automation and Int. Conf. Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC’06), Vienna, Austria, 2005, pp. 698-703.
[21] Liu D., Ma L., and Liu X., Research on adaptive wrapper in deep web data extraction, in Internet of Vehicles-Safe and Intelligent Mobility. IOV 2015. Lecture Notes in Computer Science, vol. 9502, Hsu C. H., Xia F., Liu X., and Wang S., eds. Chengdu, China: Springer, 2015, pp. 409-423.
[22] Girshick R., Donahue J., Darrell T., and Malik J., Rich feature hierarchies for accurate object detection and semantic segmentation, arXiv preprint arXiv: 1311.2524v5, 2014.
[23] Basiri M. E., Nemati S., Abdar M., Cambria E., and Acharya U. R., ABCDM: An attention-based bidirectional CNN-RNN deep model for sentiment analysis, Future Generation Computer Systems, vol. 115, pp. 279-294, 2021.
[24] Redmon J. and Farhadi A., YOLO9000: Better, faster, stronger, in Proc. of 2017 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 6517-6525.
[25] Girshick R., Fast R-CNN, in Proc. of 2015 IEEE Int. Conf. Computer Vision (ICCV), Santiago, Chile, 2015, pp. 1440-1448.
[26] Ren S. Q., He K. M., Girshick R., and Sun J., Faster R-CNN: Towards real-time object detection with region proposal networks, arXiv preprint arXiv: 1506.01497v3, 2016.
[27] Huang R., Pedoeem J., and Chen C. X., YOLO-LITE: A real-time object detection algorithm optimized for Non-GPU computers, in Proc. of 2018 IEEE Int. Conf. Big Data (Big Data), Seattle, WA, USA, 2018, pp. 2503-2510.
[28] Redmon J., Divvala S., Girshick R., and Farhadi A., You only look once: Unified, real-time object detection, in Proc. of 2016 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 779-788.
[29] Hammer J., McHugh J., and Garcia-Molina H., Semistructured data: The TSIMMIS experience, in Proc. of 1st East-European Symp. Advances in Databases and Information Systems (ADBIS), St. Petersburg, Russia, 1997, pp. 1-13.
[30] Arocena G. O. and Mendelzon A. O., WebOQL: Restructuring documents, databases and webs, in Proc. of 14th IEEE Int. Conf. Data Engineering, Orlando, FL, USA, 1998, pp. 24-33.
[31] Soderland S., Learning information extraction rules for semi-structured and free text, Machine Learning, vol. 34, nos. 1-3, pp. 233-272, 1999.
[32] Califf M. E. and Mooney R. J., Bottom-up relational learning of pattern matching rules for information extraction, The Journal of Machine Learning Research, vol. 4, pp. 177-210, 2003.
[33] Freitag D., Information extraction from HTML: Application of a general machine learning approach, in Proc. of 15th National/Tenth Conf. Artificial Intelligence/Innovative Applications of Artificial Intelligence, Madison, WI, USA, 1998, pp. 517-523.
[34] Hsu C. N. and Dung M. T., Generating finite-state transducers for semi-structured data extraction from the web, Information Systems, vol. 23, no. 8, pp. 521-538, 1998.
[35] Manjaramkar A. and Lokhande R. L., DEPTA: An efficient technique for web data extraction and alignment, in Proc. of Int. Conf. Advances in Computing, Communications and Informatics, Jaipur, India, 2016, pp. 2307-2310.
[36] Sleiman H. A. and Corchuelo R., Trinity: On using Trinary trees for unsupervised web data extraction, IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 6, pp. 1544-1556, 2014.
[37] Wang J. Y. and Lochovsky F. H., Data extraction and label assignment for web databases, in Proc. of the 12th Int. Conf. World Wide Web, Budapest, Hungary, 2003, pp. 187-196.
[38] Chang C. H. and Kuo S. C., OLERA: Semisupervised web-data extraction with visual support, IEEE Intell. Syst., vol. 19, no. 6, pp. 56-64, 2004.
[39] Wang Y., A new concept using LSTM Neural Networks for dynamic system identification, in Proc. of 2017 American Control Conf. (ACC), Seattle, WA, USA, 2017, pp. 5324-5329.
[40] Ferrara E., De Meo P., Fiumara G., and Baumgartner R., Web data extraction, applications and techniques: A survey, Knowledge-Based Systems, vol. 70, pp. 301-323, 2014.
[41] Zhai Y. H. and Liu B., Web data extraction based on partial tree alignment, in Proc. 14th Int. Conf. World Wide Web, Chiba, Japan, 2005, pp. 76-85.
[42] Kumari S. and Babu C. N., Real time analysis of social media data to understand people emotions towards national parties, in Proc. of 8th Int. Conf. Computing, Communication and Networking Technologies (ICCCNT), Delhi, India, 2017, pp. 1-6.
[43] Gregg D. G. and Walczak S., Adaptive web information extraction, Communications of the ACM, vol. 49, no. 5, pp. 78-84, 2006.