
Current challenges and possible big data solutions for the use of web data as a source for official statistics

  • Piet Daas
  • Jacek Maślankowski (Corresponding author)

Research output: Contribution to journal, Article, Academic, peer-reviewed


Abstract

Web scraping has become popular in scientific research, especially in statistics. Preparing an appropriate IT environment for web scraping is currently not difficult and can be done relatively quickly, and extracting data in this way requires only basic IT skills. This has resulted in the increased use of this type of data, widely referred to as big data, in official statistics. Over the past decade, much work has been done in this area, both at the national level within the national statistical institutes and at the international level by Eurostat. The aim of this paper is to present and discuss current problems related to accessing, extracting, and using information from websites, along with suggested potential solutions.
A case study of large-scale web scraping, performed in 2022 by means of big data tools, is presented in the paper. The results of the case study, conducted on a total population of approximately 503,700 websites, demonstrate that it is not possible to provide reliable data on the basis of such a large sample, as typically up to 20% of the websites might not be accessible at the time of the survey. Moreover, it is not possible to know the exact number of active websites in particular countries, due to the dynamic nature of the Internet, which causes websites to change continuously.
Original language: English
Pages (from-to): 49-64
Number of pages: 16
Journal: Wiadomości Statystyczne
Volume: 68
Issue number: 12
Status: Published - 29 Dec 2023
