期刊论文详细信息
Artificial Intelligence in the Life Sciences
Dealing with a data-limited regime: Combining transfer learning and transformer attention mechanism to increase aqueous solubility prediction performance
Johannes Kirchmair1  Magdalena Wiercioch2 
[1] Department of Applied Computer Science, Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, Łojasiewicza 11, Kraków 30-348, Poland;Corresponding author at: Department of Applied Computer Science, Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, Łojasiewicza 11, 30-348 Kraków, Poland.;
关键词: Aqueous solubility;    Deep Learning;    Cheminformatics;    Transformer model;    Drug discovery;    Regression;   
DOI  :  
来源: DOAJ
【 摘 要 】

Aqueous solubility is a key chemical property that drives various processes in chemistry and biology. Its computational prediction is challenging, as evidenced by the fact that it has been a subject of considerable interest for several decades. Recent work has explored fingerprint-based, feature-based and graph-based representations with different machine learning and deep learning methodologies. In general, many traditional methods have been proposed, but they rely heavily on the quality of the rule-based, hand-crafted features. On the other hand, limitations in the quality of aqueous solubility data become a handicap when training deep models. In this study, we have developed a novel structure-aware method for the prediction of aqueous solubility by introducing a new deep network architecture and then employing a transfer learning approach. The model was proven to be competitive, obtaining an RMSE of 0.587 during both cross-validation and a test on an independent dataset. To be more precise, the method is evaluated on molecules downloaded from the Online Chemical Database and Modeling Environment (OCHEM). Beyond aqueous solubility prediction, the strategy presented in this work may be useful for modeling any kind of (chemical or biological) properties for which there is a limited amount of data available for model training.

【 授权许可】

Unknown   

  文献评价指标  
  下载次数:0次 浏览次数:1次