TY - JOUR
T1 - A hitchhiker's guide to deep chemical language processing for bioactivity prediction
AU - Özçelik, Rıza
AU - Grisoni, Francesca
N1 - Publisher Copyright: © 2025 RSC.
PY - 2024/12/16
Y1 - 2024/12/16
AB - Deep learning has significantly accelerated drug discovery, with ‘chemical language’ processing (CLP) emerging as a prominent approach. CLP approaches learn from molecular string representations (e.g., Simplified Molecular Input Line Entry Systems [SMILES] and Self-Referencing Embedded Strings [SELFIES]) with methods akin to natural language processing. Despite their growing importance, training predictive CLP models is far from trivial, as it involves many ‘bells and whistles’. Here, we analyze the key elements of CLP and provide guidelines for newcomers and experts. Our study spans three neural network architectures, two string representations, three embedding strategies, across ten bioactivity datasets, for both classification and regression purposes. This ‘hitchhiker's guide’ not only underscores the importance of certain methodological decisions, but it also equips researchers with practical recommendations on ideal choices, e.g., in terms of neural network architectures, molecular representations, and hyperparameter optimization.
UR - http://www.scopus.com/inward/record.url?scp=85213060676&partnerID=8YFLogxK
U2 - 10.1039/d4dd00311j
DO - 10.1039/d4dd00311j
M3 - Article
C2 - 39726698
AN - SCOPUS:85213060676
SN - 2635-098X
VL - XX
JO - Digital Discovery
JF - Digital Discovery
IS - X
ER -