Journal article details
BMC Bioinformatics
Deep clustering of small molecules at large-scale via variational autoencoder embedding and K-means
Research
Rebecca Davis1  Hamid Hadipour2  Pingzhao Hu3  Chengyou Liu4  Silvia T. Cardona5 
[1] Department of Chemistry, University of Manitoba, Winnipeg, MB, Canada; Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada; Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada; Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, MB, Canada; Department of Biochemistry and Medical Genetics, University of Manitoba, Room 308 - Basic Medical Sciences Building, 745 Bannatyne Avenue, R3E 0J9, Winnipeg, MB, Canada; CancerCare Manitoba Research Institute, CancerCare Manitoba, Winnipeg, MB, Canada; Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, MB, Canada; Department of Microbiology, University of Manitoba, Winnipeg, MB, Canada; Department of Medical Microbiology and Infectious Diseases, University of Manitoba, Winnipeg, MB, Canada
Keywords: Unsupervised deep clustering; K-means; Embedding; Variational autoencoders; Internal clustering measurements; Chemical diversity
DOI: 10.1186/s12859-022-04667-1
Received 2022-03-23, accepted 2022-04-04, published 2022
Source: Springer
【 Abstract 】

Background: Converting molecules into computer-interpretable features with rich molecular information is a core problem of data-driven machine learning applications in chemical and drug-related tasks. Generally speaking, a given molecule can be represented by global and local features. Because most algorithms have been developed around one type of feature, a remaining bottleneck is combining both feature sets for advanced molecule-based machine learning analysis. Here, we explored a novel analytical framework that generates embeddings of the molecular features and applies them to the clustering of a large number of small molecules.

Results: In this novel framework, we first introduced a principal component analysis method to encode the molecule-specific atom and bond information. We then used a variational autoencoder (AE)-based method to generate embeddings of the global chemical properties and the local atom and bond features. Next, using the embeddings from the encoded local and global features, we implemented and compared several unsupervised clustering algorithms to group the molecule-specific embeddings. The number of clusters was treated as a hyper-parameter and determined by the Silhouette method. Finally, we evaluated the corresponding results using three internal indices. Applying the analysis framework to a large chemical library of more than 47,000 molecules, we successfully identified 50 molecular clusters using the K-means method with 32 embeddings based on the AE method. We visualized the clustering result via t-SNE for the overall distribution of molecules and via similarity maps for the structural analysis of randomly selected cluster-specific molecules.

Conclusions: This study developed a novel analytical framework that comprises a feature engineering scheme for molecule-specific atomic and bonding features and a deep learning-based embedding strategy for different molecular features. By applying the identified embeddings, we show their usefulness for clustering a large molecule dataset. Our novel analytic algorithms can be applied to any virtual library of chemical compounds with diverse molecular structures. Hence, these tools have the potential to optimize drug discovery, as they can decrease the number of compounds to be screened in any drug screening campaign.
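The clustering step in the Results (K-means on embeddings, with the number of clusters chosen by the Silhouette method) can be illustrated with a small sketch. This is not the authors' implementation: the 2-D toy points stand in for the learned embeddings, and the helper names (`farthest_first_centers`, `kmeans`, `mean_silhouette`, `choose_k`, `make_blobs`) and the farthest-first initialization are assumptions made here to keep the example deterministic and dependency-free.

```python
import math
import random


def farthest_first_centers(points, k):
    """Deterministic init (an assumption of this sketch): start at points[0],
    then repeatedly add the point farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; returns one integer cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:  # assignments stable, so centers are stable too
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (higher is better)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def choose_k(points, candidates):
    """Treat k as a hyper-parameter: keep the k with the highest mean silhouette."""
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores


def make_blobs(seed=0, per_blob=30):
    """Toy stand-in for molecule embeddings: three well-separated 2-D blobs."""
    rng = random.Random(seed)
    means = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [(rng.gauss(mx, 0.5), rng.gauss(my, 0.5))
            for mx, my in means for _ in range(per_blob)]


points = make_blobs()
best_k, scores = choose_k(points, range(2, 7))
labels = kmeans(points, best_k)
```

The pairwise silhouette computation here is quadratic in the number of points, so it suits only small inputs. For a library on the scale described in the abstract, an optimized library is the more practical route; scikit-learn, for instance, provides `KMeans` together with internal indices such as `silhouette_score`, `calinski_harabasz_score`, and `davies_bouldin_score`.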

【 License 】

CC BY
© The Author(s) 2022
