Network Methods for Biomedical Applications
Generating and analysing network representations of biomedical data is complicated by several common confounding features: missing data, values of different types (Likert, ordinal, ranked, continuous, binary, textual), imbalance, and feature heterogeneity. This research topic has developed systematic ways to work with these data in a generative framework that allows us to measure the relative validity of network models against controls under different construction regimens, such as network fusion. The framework includes methods to partition the networks and supports exhaustive feature-importance analysis and partition comparison to gain biological insight. It has been used, both theoretically and in implemented instances, to study networks describing ASD and PD patients built from combinations of their clinical and genetic data. This work revealed key diagnostic features for ASD patients and allowed us to explore early biomarkers of PD progression.
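To illustrate how mixed value types and missing data can be handled when building a patient network, the following is a minimal sketch of a Gower-style pairwise similarity. It is an assumption chosen for illustration, not the construction used in the framework itself: the function name `gower_similarity`, the `is_numeric` flag, the toy data, and the 0.5 edge threshold are all hypothetical. Each feature contributes to a pair's similarity only when both samples have it observed, so missing values shrink the comparison rather than forcing imputation.

```python
import numpy as np

def gower_similarity(X, is_numeric):
    """Gower-style pairwise similarity for mixed-type data with missing values.

    X          : (n_samples, n_features) float array; categorical and binary
                 features are integer-coded, missing values are np.nan.
    is_numeric : boolean sequence, True where a feature is continuous.
    """
    X = np.asarray(X, dtype=float)
    is_numeric = np.asarray(is_numeric, dtype=bool)
    n, p = X.shape
    score_sum = np.zeros((n, n))   # summed per-feature similarities
    n_compared = np.zeros((n, n))  # number of features compared per pair

    for j in range(p):
        col = X[:, j]
        observed = ~np.isnan(col)
        if not observed.any():
            continue  # feature missing for every sample: nothing to compare
        # A pair uses this feature only if both samples have it observed.
        both = observed[:, None] & observed[None, :]
        if is_numeric[j]:
            value_range = np.nanmax(col) - np.nanmin(col)
            if value_range > 0:
                s = 1.0 - np.abs(col[:, None] - col[None, :]) / value_range
            else:
                s = np.ones((n, n))  # constant feature: all observed pairs agree
        else:
            s = (col[:, None] == col[None, :]).astype(float)
        score_sum += np.where(both, s, 0.0)
        n_compared += both

    # Pairs with no shared observed feature are assigned similarity 0.
    return np.divide(score_sum, n_compared,
                     out=np.zeros((n, n)), where=n_compared > 0)

# Toy data (hypothetical): two binary features and one continuous feature.
X = np.array([[1.0, 0.0, 10.0],
              [1.0, 1.0, np.nan],
              [0.0, 0.0, 20.0]])
S = gower_similarity(X, is_numeric=[False, False, True])

# Threshold the similarities into an unweighted adjacency matrix (no self-loops).
adjacency = (S >= 0.5) & ~np.eye(len(X), dtype=bool)
```

The resulting similarity matrix can be thresholded, as above, or used directly as edge weights. Ordinal and ranked features would need their own treatment (for example rank coding), which this sketch leaves out.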
Biomedical Natural Language Processing (bioNLP) for Genetic Disease
Much biomedical data is unstructured. Significant progress has been made in methods that extract domain-specific concepts such as gene, protein, and drug names, but these commonly comprise unique lexicons that are relatively trivial to find in text. More challenging is the isolation of descriptive terms and phrases that encapsulate important biomedical concepts, including quantitative traits and phenotypes. Over the last 3 years I have developed a novel biomedical NLP research topic that combines the power of community semantic resources, such as bio-ontologies and the NLM Unified Medical Language System (UMLS), with state-of-the-art text-mining and language modelling approaches to structure data at scale from biomedical literature. These methods include topic modelling steps to define the literature (variations of LDA, RNNs, CorEx, TopicBERT), entity classification (MetaMap, scispaCy, BioBERT), and novel domain-adapted models trained and refined on more comprehensive data and tuned for novel classification tasks such as phenotype and variant extraction. This project has generated novel text corpora for ASDs, genetic developmental disease (GDD), and paediatric COVID; text-mining and analytic software (Cadmus, PyMetaMap); and predictive models for identifying biomedical entities from text.