TY - JOUR
T1 - Investigating the need for preprocessing of near-infrared spectroscopic data as a function of sample size
AU - Schoot, Mark
AU - Kapper, Christiaan
AU - van Kollenburg, Geert H.
AU - Postma, Geert J.
AU - van Kessel, Gijs
AU - Buydens, Lutgarde M. C.
AU - Jansen, Jeroen J.
PY - 2020/9/15
Y1 - 2020/9/15
AB - Preprocessing of near-infrared (NIR) spectra is an essential part of multivariate calibration. It mainly aims to remove artefacts introduced during measurement in order to improve prediction performance or interpretation. However, preprocessing can have undesired side-effects. Additionally, calibration algorithms can learn to deal with artefacts by themselves when enough samples are available. This may change the effect preprocessing has on prediction performance as the calibration dataset grows. In this paper we investigate the interaction between the size of the calibration data and preprocessing for NIR calibrations on several datasets. Results show that extending the calibration data with more samples improves prediction performance, regardless of the preprocessing strategy. Although prediction performance almost always benefits from preprocessing, extending the calibration data can reduce the effect of preprocessing on prediction performance. This means the optimal preprocessing strategy may change as a function of the number of samples. It is demonstrated that using a Design of Experiments (DoE) approach to determine the optimal preprocessing strategy leads to equal or better prediction performance for all calibration set sizes compared to not preprocessing at all. Preprocessing is most valuable for small calibration sets, but it can become obsolete or even harmful as the calibration set grows. Therefore, we recommend always evaluating the effect of a preprocessing strategy before making or updating calibration models.
KW - Calibration modelling
KW - Preprocessing
KW - Design of experiments
KW - NIR
KW - Spectroscopy
KW - Model maintenance
U2 - 10.1016/j.chemolab.2020.104105
DO - 10.1016/j.chemolab.2020.104105
M3 - Article
SN - 0169-7439
VL - 204
JO - Chemometrics and Intelligent Laboratory Systems
JF - Chemometrics and Intelligent Laboratory Systems
M1 - 104105
ER -