Design process data storage and organize data scraping

Falentino Sembiring, Dian Permata Sari

Abstract


This study describes a web scraping process that retrieves URLs from similar sites and stores the URL data in daily, weekly, monthly, and annual databases, so that valid URLs are kept and invalid URLs are filtered out. Filtering is done to simplify the processes that move data into the database. The next step distinguishes URLs by the content data available for them, namely title, tags, and keywords, as in SEO. The result of each step is stored in the data warehouse to build a URL data center. This stage is intended as data collection for big data. The scope of the problem is limited to designing a web crawler that searches for similar sites and stores the results in the database. From the database, the data is forwarded to the data warehouse. Once in the data warehouse, the data is processed and presented to the user through an interface, divided by classification.
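The filtering, content extraction, and storage steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `is_valid_url`, `MetaParser`, `store_page`, and the `urls` table are hypothetical, validity is reduced to a syntactic check (the study may also test whether a URL is reachable), the HTML is a hard-coded sample instead of a crawled page, and an in-memory SQLite database stands in for the periodic databases and the data warehouse.

```python
import sqlite3
from datetime import date
from html.parser import HTMLParser
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Syntactic check only (assumption): http(s) scheme and a non-empty host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


class MetaParser(HTMLParser):
    """Collects the <title> text and the content of <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            a = dict(attrs)
            if (a.get("name") or "").lower() == "keywords":
                content = a.get("content") or ""
                self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


def store_page(conn, url, html, scraped_on):
    """Insert one row for a valid URL; return False if the URL is filtered out."""
    if not is_valid_url(url):
        return False
    parser = MetaParser()
    parser.feed(html)
    conn.execute(
        "INSERT INTO urls (url, title, keywords, scraped_on) VALUES (?, ?, ?, ?)",
        (url, parser.title, ",".join(parser.keywords), scraped_on.isoformat()),
    )
    return True


# Hypothetical single table; the scrape date allows daily to annual grouping.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT, title TEXT, keywords TEXT, scraped_on TEXT)")

sample_html = (
    "<html><head><title>Hydroponics guide</title>"
    '<meta name="keywords" content="hydroponics, tutorial"></head></html>'
)
stored = [
    store_page(conn, u, sample_html, date(2019, 1, 15))
    for u in ("https://example.com/a", "not a url", "ftp://example.com/b")
]

# Monthly counts are derived from the stored date; other periods work the same way.
per_month = conn.execute(
    "SELECT substr(scraped_on, 1, 7) AS month, COUNT(*) FROM urls GROUP BY month"
).fetchall()
```

Keeping one table with a scrape date, rather than separate daily, weekly, monthly, and annual tables, is a simplification chosen for this sketch; the periodic views can then be produced by grouping on prefixes of the date column.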


Keywords


Data Warehouse; Similar Site; Storing Data; Web Scraping



DOI: https://doi.org/10.17509/integrated.v1i1.16837


Copyright (c) 2019 INTEGRATED (Journal of Information Technology and Vocational Education)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
