Model degradation in web derived text-based models

Piet Daas, Jelmer Jansen

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Professional

Abstract

Getting an overview of the innovative companies in a country is a challenging task. Traditionally, this is done by sending a questionnaire to a sample of large companies. An alternative approach has been developed: determining whether a company is innovative by studying the text on the main page of its website. The resulting text-based model is able to reproduce the results of the survey and can also detect small innovative companies, such as startups. However, model stability was found to be a serious problem. The model suffered from degradation, which resulted in a gradual decrease in the detection of innovative companies: its accuracy dropped from 93% to 63% over a period of one year. This paper describes the phenomenon and studies the underlying data in detail. The effect was found to be produced by a combination of the inactivity of a subset of websites and changes over time in the composition of the words on company websites. A solution for dealing with this phenomenon is presented and future research is discussed.
Original language: English
Title of host publication: 3rd International Conference on Advanced Research Methods and Analytics
Pages: 77-84
Number of pages: 8
Publication status: Published - 1 Jul 2020
Event: 3rd International Conference on Advanced Research Methods and Analytics - Valencia, Spain
Duration: 8 Jul 2020 to 9 Jul 2020

Conference

Conference: 3rd International Conference on Advanced Research Methods and Analytics
Abbreviated title: CARMA 2020
Country/Territory: Spain
City: Valencia
Period: 8/07/20 to 9/07/20
