Chinese Journal of Pharmacovigilance ›› 2022, Vol. 19 ›› Issue (4): 390-394.
DOI: 10.19803/j.1672-8629.2022.04.10

• Basic and Clinical Research •

Clustering analysis of natural products derived from Polygonum multiflorum Thunb. based on spectral clustering algorithm

HU Xiaowen, YANG Jianbo, WEI Feng, MA Shuangcheng*

  1. National Institutes for Food and Drug Control, Beijing 102629, China
  • Received: 2021-07-28 Online: 2022-04-15 Published: 2022-04-15
  • Corresponding author: *MA Shuangcheng, male, PhD, research professor; quality control and evaluation of traditional Chinese medicine and ethnic medicine. E-mail: masc@nifdc.org.cn. # denotes co-corresponding authors.
  • About the author: HU Xiaowen, male, postdoctoral fellow; traditional Chinese medicine informatics.
  • Funding:
    National Science and Technology Major Project for Major New Drug Development, 2018 (2018ZX09735006); National Natural Science Foundation of China (81773874, 81973476)

Abstract: Objective To perform a clustering analysis of the natural products in Polygonum multiflorum Thunb. and to establish a sound method for clustering natural products, providing technical guidance for subsequent compound selection and pharmacological screening. Methods Natural products of Polygonum multiflorum Thunb. were collected from the literature, and compounds from the major categories, including stilbenes and anthraquinones, were selected for clustering and converted into the simplified molecular input line entry specification (SMILES). Extended-connectivity fingerprints and physicochemical properties were extracted with rdkit as features and filtered by variance to retain the valid ones. The compounds were then clustered with the spectral clustering algorithm, using the Calinski-Harabasz (CH) index as the evaluation metric for parameter optimization. The optimal parameters were applied to cluster the compounds, and the characteristics of each cluster were analyzed. Principal component analysis of the three main categories was carried out to visualize their spatial distribution. Finally, the lipid-water partition coefficient (LogP) and topological polar surface area (TPSA) of the compounds in the main categories were calculated, and the distributions of these properties were analyzed to verify that the clustering was reasonable. Results A total of 123 natural products in thirteen categories were selected from the literature. After feature calculation and removal of features with near-zero variance, 207 valid features were obtained. The CH index indicated that clustering was best with the number of clusters set at 10 and γ set at 0.004. Principal component analysis showed that the three main categories each formed a separate cluster in three-dimensional space, with no overlap. After clustering, the distributions of LogP and TPSA tended to be more concentrated. Conclusion The spectral clustering algorithm can not only distinguish compounds with markedly different structures in Polygonum multiflorum Thunb., but also cluster complex compounds well. The clustering results are reasonable and offer new ideas for traditional pharmacological screening.

Key words: Polygonum multiflorum Thunb., natural products, unsupervised learning, clustering algorithm, spectral clustering
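The two numerical steps named in the abstract, variance-based feature filtering and scoring a clustering with the CH index, can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the function names `variance_filter` and `calinski_harabasz`, the zero variance threshold, and the toy feature matrix are all assumptions made for illustration, and the rdkit feature extraction and the spectral clustering itself are not reproduced here.

```python
from statistics import pvariance


def variance_filter(rows, threshold=0.0):
    """Keep only the columns whose population variance exceeds `threshold`.

    The threshold value is an assumption; the abstract does not state one.
    """
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols)
            if pvariance([row[j] for row in rows]) > threshold]
    return [[row[j] for j in keep] for row in rows], keep


def calinski_harabasz(rows, labels):
    """CH index: (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))."""
    n, dim = len(rows), len(rows[0])
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("CH index needs 2 <= k < n clusters")
    overall = [sum(row[j] for row in rows) / n for j in range(dim)]
    between = within = 0.0
    for members in clusters.values():
        centroid = [sum(r[j] for r in members) / len(members)
                    for j in range(dim)]
        # Between-cluster dispersion: cluster size times squared distance
        # from the cluster centroid to the overall mean.
        between += len(members) * sum(
            (c - o) ** 2 for c, o in zip(centroid, overall))
        # Within-cluster dispersion: squared distance of each member
        # to its own cluster centroid.
        within += sum((x - c) ** 2
                      for r in members for x, c in zip(r, centroid))
    if within == 0.0:
        raise ValueError("within-cluster dispersion is zero; CH is undefined")
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical toy feature matrix: 6 "compounds" x 3 features.
# The middle column is constant, so the variance filter drops it.
features = [
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [5.0, 1.0, 5.0],
    [5.0, 1.0, 6.0],
    [6.0, 1.0, 5.0],
]
filtered, kept = variance_filter(features)

# Two candidate labelings of the same points: one matches the two visible
# groups, the other mixes them.
compact_labels = [0, 0, 0, 1, 1, 1]
mixed_labels = [0, 1, 0, 1, 0, 1]
score_compact = calinski_harabasz(filtered, compact_labels)
score_mixed = calinski_harabasz(filtered, mixed_labels)
```

A higher CH value means the clusters are tighter and further apart, so the compact labeling scores well above the mixed one. Parameter optimization of the kind described in the abstract would repeat this scoring for each candidate cluster count and γ and keep the best-scoring combination.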

CLC Number: