KMS Chongqing Institute of Green and Intelligent Technology, CAS
A Top-Down Binary Hierarchical Topic Model for Biomedical Literature | |
Lin, Xiaoguang1,2,3; Liu, Mingxuan2,3; Zhang, Ju2,3 | |
2020 | |
Abstract | Over the past two decades, a number of advances in topic modeling have produced sophisticated models that are capable of generating topic hierarchies. In particular, hierarchical Latent Dirichlet Allocation (hLDA) builds a topic tree based on the nested Chinese Restaurant Process (nCRP) or other sampling processes to generate a topic hierarchy that allows arbitrarily large branch structures and adaptive dataset growth. In addition, hierarchical topic models based on the latent tree model, such as Hierarchical Latent Tree Analysis (HLTA), have been developed over the last five years. However, these models do not work well in cases with millions of documents and hundreds of thousands of terms. In addition, the topic trees generated by these models are always poorly interpretable, and the relationships among topics in different levels are relatively simple. The biomedical literature, including Medline abstracts, has large-scale documents in two major categories: biological laboratory research and medical clinical research. We propose a top-down binary hierarchical topic model (biHTM) for biomedical literature by iteratively applying a flat topic model and adaptively processing subtrees of the hierarchy. The biHTM topic hierarchy of complete Medline abstracts with more than 14 topic node levels shows good bimodality and interpretability. Compared to hLDA and HLTA, biHTM shows promising results in experiments assessed in terms of runtime and quality. |
Keywords | Topic model; topic hierarchy; binary modality; biomedical literature; text mining |
DOI | 10.1109/ACCESS.2020.2983265 |
Journal | IEEE ACCESS |
ISSN | 2169-3536 |
Volume | 8; Pages: 59870-59882 |
Corresponding author | Zhang, Ju (zhangju@cigit.ac.cn) |
Indexed in | SCI |
WOS accession number | WOS:000527413100019 |
Language | English |