Identifying different types of companies via their website text

Piet Daas, Nick de Wolf

Research output: Contribution to conference › Paper › Academic

Abstract

In this poster, we describe the findings of our work on identifying different types of companies based on the text on their websites. So far, we have focused on identifying innovative, platform economy and artificial intelligence companies in the Netherlands. For each case, at least 2000 company websites were used and split into an 80% training set and a 20% test set. For innovation, the results of the Community Innovation Survey were used. This is a biennial, standardized European survey that detects various forms of innovation; we focused on product innovation. For the other cases, positive examples of company websites were provided by experts. An equal number of negative examples was added to these by taking a random sample of company websites from the Business Register (which is maintained at our office). After preprocessing, various classification algorithms included in the scikit-learn library for Python were applied to determine which was best able to distinguish between the two classes in each case: innovative vs. non-innovative, platform vs. non-platform economy, and artificial intelligence vs. non-artificial intelligence. The effect of adding word embeddings was also tested. We found that logistic regression with word embeddings worked best to detect innovation (accuracy 88%), a linear SVM worked best for platform economy websites (accuracy 82%), and logistic regression worked best to detect artificial intelligence companies (accuracy 92%). In the first case, only the text on the main page of the website could be used, whereas the other cases required the text on all scraped pages. Including word-embedding-based vectors did not improve the findings for platform economy and artificial intelligence. When the probability-based classification results of the models were checked, clear U-shaped distributions were found for the test set in all cases, meaning that most predicted probabilities lay close to 0 or 1. This demonstrates that the models developed are well able to separate the classes in each application.
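The workflow described in the abstract can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the toy corpus, the TF-IDF features and all variable names are hypothetical, and the abstract does not say which features or hyperparameters were used. It shows an 80/20 split, a comparison of logistic regression with a linear SVM, and a histogram of predicted probabilities of the kind one would inspect for a U shape.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for scraped website text.
positive = [f"we develop a new innovative product prototype number {i}" for i in range(20)]
negative = [f"we sell traditional goods at our local shop number {i}" for i in range(20)]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)  # balanced classes

# 80% training set, 20% test set, stratified so both classes stay balanced.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

# Two of the classifier families named in the abstract.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
}

accuracies = {}
fitted = {}
for name, clf in candidates.items():
    # TF-IDF is an assumption; the abstract only mentions "preprocessing".
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    fitted[name] = model

# Predicted probability of the positive class on the test set.
# (LinearSVC has no predict_proba, so only logistic regression is used here.)
proba = fitted["logistic_regression"].predict_proba(X_test)[:, 1]

# Ten equal-width bins over [0, 1]; a U shape has most mass in the outer bins.
hist, bin_edges = np.histogram(proba, bins=10, range=(0.0, 1.0))

print(accuracies)
print(hist)
```

On real data, the histogram counts would be compared across bins: a model that separates the classes well puts few test websites in the middle bins around 0.5.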
Original language: English
Status: Published - 3 Jun 2021
Event: Symposium on Data Science & Statistics: Beyond Big Data - Shaping the Future
Duration: 2 Jun 2021 to 4 Jun 2021
https://ww2.amstat.org/meetings/sdss/2021/onlineprogram/index.cfm

Conference

Conference: Symposium on Data Science & Statistics
Short title: SDSS 2021
Period: 2/06/21 to 4/06/21