Ian Simpson

Professor of Biomedical Informatics (2023)
Reader in Biological Informatics (2019)
Lecturer in Biological Informatics (2018)
D. Phil Genetics (Oxon)
B.A., M.A. Biochemistry (Hons. Oxon.)

Overview

My main research interest is in the development of novel Informatics approaches to retrieve, extract, structure, and jointly analyse multi-modal biomedical data to improve our understanding of human disease. The integration of multi-modal data is proving to be pivotal in our exploration of the many subtle and complex factors that predispose us to disease, and in revealing their underlying molecular and environmental causes. This has profound implications for the diagnosis, characterisation, prognosis, and treatment of disease. To address the challenges posed by these aims my research interests necessarily span many areas of Informatics: database systems, algorithms, high-performance computation, software engineering, agent-based modelling, computational statistics, applied machine learning, natural language processing, and knowledge representation.

My motivation for research in Biomedical Informatics stems from the opportunities that arose from the digitization of biology that began over a decade ago. This has enabled the combination of experimental, reference, and meta-data from thousands of resources which, when analysed jointly, can reveal previously undiscoverable signatures. This increased inferential power is the consequence of data integration increasing the signal-to-noise ratio in what are inherently noisy, sparse, and high-dimensional data. The emergence of novel molecular surveying technologies, such as those for high-throughput DNA sequencing, has made it possible to ask questions about the molecular machines that drive biological processes in a way not previously possible. These new methods are being used to generate an unprecedented quantity of molecular data that reflect the scale and complexity of the underlying systems and require powerful numerical approaches to analyse them. Efforts to augment molecular data with medical images, sensor data, clinical records, patient questionnaires, and even social media entries relating to human diseases are now commonplace. They present an incredible opportunity for us to use data to improve people’s quality of life.

Research Areas

Structuring Disease Knowledge

As my research has developed there has been an increasing need to take analyses beyond indications of the involvement of genes, proteins, and pathways in particular biological systems, processes, and diseases to more comprehensive, coherent downstream interpretations that place those findings in context. This motivated a new strand of research to extract biomedical concepts from the wealth of text data available in community-created databases, published literature, and clinical records. We created the OntoSuite and topOnto computational tools to identify concept terms from biomedical corpora, map them to biomedical ontologies, and quantify their significance. The first application of this approach created the Human Disease Gene Database (HDGDb), which mapped concepts extracted from 4 major text corpora onto 7 biomedical ontologies by combining mappings from the three most widely used bio-NLP tools, achieving greater coverage, sensitivity, and specificity than other comparable resources. The result was a Disease Environment Network (i.e. a knowledge graph) linking genes to biological concepts for major neurological diseases including Alzheimer’s, Parkinson’s, Autism Spectrum Disorders, and Epilepsy, and revealing new functional associations and candidate disease genes (He, PhD 2017).

Diagnostic and Predictive Modeling for Rare Genetic Disease

The UK has sequenced 100,000 genomes from patients with rare genetic disease, a group of more than 5,000 conditions that will affect 1 in 17 people in their lifetime and that affects 3 million people in the UK alone. Despite genomic sequencing the majority (c.60%) of these patients remain undiagnosed. We are using our biomedical natural language processing and network methods in a £5.5m Wellcome Trust-funded collaboration with Caroline Wright (Exeter), Fiona Cunningham & Matt Hurles (EBI & Sanger Institute), James Ware (ICL), and Helen Firth (Cambridge) to address the shortfall in diagnostic success from whole genome sequencing. We are using domain-adapted language models, trained to identify phenotypic concepts and descriptions from the biomedical literature, in combination with genome sequence variants to map variants to rich, probabilistic representations of their associated phenotypic concepts. Initial work in collaboration with Jaewoo Kang at Korea University in South Korea has involved full-text literature retrieval using our Cadmus, parallelPyMetaMap, and BioBERT-based methods for over 250k papers describing c.2000 rare genetic diseases, and extraction of their associated variants and phenotypes. We are using these data to build predictive models for i) differential disease diagnosis by phenotype, ii) candidate gene prediction, and iii) variant effect prediction. Our models and predictions will be incorporated into internationally leading tools for genomic exploration and disease curation: DECIPHER, Gene2Phenotype, and the Variant Effect Predictor. They will also generate novel open-source datasets, software, and models that will better inform our understanding of rare genetic disease, supporting scientists and clinicians to shorten time to diagnosis and improve outcomes for patients.

Multi-modal Data Modeling for Disease

Since 2018 I have been involved in a translational project funded by the Simons Foundation Autism Research Initiative (SFARI), in collaboration with David Fitzpatrick, Douglas Armstrong, Richard Chin, and Andrew Stanfield at the University of Edinburgh, and LeeAnne Snyder and Natalia Volfovsky at SFARI in New York. In this project we are using genetic and clinical data from the Developmental Disorders Genotype to Phenotype (DDG2P) and SFARI-SPARK databases for patients with intellectual disability (ID; 13,500 probands) and autism spectrum disorders (ASDs; c.100,000 patients) respectively to build predictive gene-phenotype models and similarity networks. Our aims are to i) understand the relationship between rare gene mutations in ID and ASDs and the emerging phenotypes and behaviours that patients present with, ii) stratify patients by shared features to improve diagnosis and treatment and to aid discovery of the shared molecular aetiology of the disease, and iii) build predictive models to prioritise candidate disease genes based on patient features.

We are developing systematic computational approaches for the generation and validation of biomolecular and patient-centric graphs that include up to several hundred features per node, together with analytical methods that operate directly on these graphs, either through classical network science approaches or using graph neural networks (GNNs) to create accurate embeddings on which we can base prediction tasks. We have developed patient similarity networks for ASDs in which we embed patient clinical features, allowing us to partition the patient population and identify the important features that drive stratification. Initial work has identified critical diagnostic features for ASDs that may reduce the complexity of diagnosis from >200 clinical questions to barely a dozen. Useful graph embeddings for disease have the potential to allow for informative feature prediction including, for example, the placement of new patients within the patient knowledge space (disease prediction and stratification) and the inference of incomplete data in challenging patient cases. We are already extending this analysis to new domains such as early biomarker identification in longitudinal Parkinson’s disease data as part of the Michael J. Fox Foundation’s Parkinson’s Progression Markers Initiative (PPMI).
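
To make the idea of a patient similarity network concrete, the sketch below links patients whose clinical feature vectors are similar and reads off strata as the connected components of the resulting graph. It is an illustration only, under simplifying assumptions: the four patients, their binary features, the cosine similarity measure, the 0.8 threshold, and the function names are all hypothetical, and it does not reproduce the embedding or GNN methods described above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_network(features, threshold):
    """Return {(patient_a, patient_b): similarity} for pairs at or above threshold."""
    ids = sorted(features)
    edges = {}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            s = cosine(features[a], features[b])
            if s >= threshold:
                edges[(a, b)] = s
    return edges

def strata(ids, edges):
    """Group patients into connected components of the similarity network."""
    neighbours = {i: set() for i in ids}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for start in sorted(ids):
        if start in seen:
            continue
        stack, group = [start], []
        while stack:  # depth-first walk of one component
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.append(node)
            stack.extend(neighbours[node] - seen)
        groups.append(sorted(group))
    return groups

# Hypothetical patients described by four binary clinical features.
patients = {
    "p1": (1, 1, 0, 0),
    "p2": (1, 1, 1, 0),
    "p3": (0, 0, 1, 1),
    "p4": (0, 0, 1, 1),
}
network = similarity_network(patients, threshold=0.8)
groups = strata(patients, network)
print(groups)
```

In this toy example the patients fall into two strata, and the features shared within each stratum are the ones that drive the partition.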

Biomolecular Analysis of Disease Genes & Proteins

My early work focused on elucidating the complex gene expression profiles required to specify the development of animal nervous systems. These initial studies were limited to small numbers of key transcriptional regulators, but subsequent studies used RNA microarrays that quantify gene expression levels for tens of thousands of genes in each sample simultaneously using quantitative fluorescence hybridisation. I developed a novel bootstrap consensus clustering method, clusterCons, to extract groups of temporally co-regulated genes from microarray data from the developing fly peripheral nervous system and subsequently used this approach to describe, for the first time, the role of early onset ciliogenic genes and the specification of sensory neurons. Linking knowledge from research in such experimental model organisms to humans is essential to help validate their use as proxies for human brain development. We further developed an orthology projection method that maps genes and their interaction partners between model organisms allowing comparison of the relationship between fly and human neural development.
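
The general idea behind resampling-based consensus clustering can be sketched as follows: cluster many random subsets of the genes, then record how often each pair of genes lands in the same cluster when both are sampled. This is a minimal illustration and not the clusterCons implementation. The tiny k-means routine, the subsampling fraction, the gene names, and the expression profiles are all assumptions made for the example.

```python
import random
from itertools import combinations

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Tiny k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster empties
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def consensus_matrix(profiles, k, n_resamples=50, frac=0.8, seed=0):
    """Fraction of resamples in which each co-sampled gene pair shares a cluster."""
    rng = random.Random(seed)
    names = sorted(profiles)
    together = {pair: 0 for pair in combinations(names, 2)}
    sampled = dict(together)
    size = max(k, round(frac * len(names)))
    for _ in range(n_resamples):
        subset = sorted(rng.sample(names, size))
        labels = dict(zip(subset, kmeans([profiles[g] for g in subset], k)))
        for a, b in combinations(subset, 2):
            sampled[(a, b)] += 1
            together[(a, b)] += labels[a] == labels[b]
    return {p: together[p] / sampled[p] for p in together if sampled[p]}

# Hypothetical expression profiles over three time points: early vs late genes.
profiles = {
    "g1": (1.0, 0.1, 0.0), "g2": (0.9, 0.2, 0.0),
    "g3": (1.1, 0.0, 0.1), "g4": (1.0, 0.1, 0.1),
    "g5": (0.0, 0.1, 1.0), "g6": (0.1, 0.2, 0.9),
    "g7": (0.0, 0.0, 1.1), "g8": (0.1, 0.1, 1.0),
}
consensus = consensus_matrix(profiles, k=2)
```

Pairs with a consensus value near 1 are robustly co-clustered across resamples, while values near 0 indicate genes that are consistently separated; intermediate values flag unstable assignments.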

Subsequent work moved on to studies of neural function to complement our understanding of neural development. With Clive Bramham at the University of Bergen I studied the role of small non-coding RNAs (miRNAs) in the regulation of NMDA-dependent synaptic plasticity, including development of a novel algorithm, miRNA-TAP, for predicting gene targets of these regulatory miRNAs. In this collaboration, I developed novel methods to analyse protein-protein interaction graphs and conducted a high-resolution analysis of protein sequence evolution at the synapse to identify regions of synaptic proteins that are undergoing neo-functionalisation and are therefore candidates for the fine-tuning of synaptic plasticity. With Kobi Rosenblum at the University of Haifa I developed gene profiling methods to analyse the response of neural genes to learning signals in taste-conditioning experiments, revealing some of the early changes in gene expression associated with memory consolidation.

With Giles Hardingham at the University of Edinburgh, I have developed analytical pipelines to process next-generation sequencing (NGS) data from mixed-species neuronal cultures. In these cultures, cell-autonomous changes in gene expression profiles can be detected between different cell types within the same culture. In order to do this, we developed sequence separation algorithms that split NGS short reads (c.150 nucleotides) based on their differential alignment properties between up to three different mammalian genomes. This requires rapid processing of huge sequence libraries (c.2-4 x 10^8 reads/sample) across genomes of the order of 10^9 bp in size, which we achieve in parallel on high-memory compute servers (0.5-1TB RAM) using algorithms that can operate efficiently on compressed data. This completely novel approach to dissecting the molecular signature of in vitro neural systems development and function has resulted in four high-profile papers (in Nature Communications, Nature Protocols, eLife and PLoS One) and two software packages: Sargasso and Piquant.
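
The core decision in separating reads by differential alignment can be illustrated with a simple rule: assign a read to a species only when its best alignment score beats the runner-up by a clear margin. The sketch below is an assumption-laden toy and not the logic used in Sargasso. The score values, the `min_margin` parameter, and the bin names "ambiguous" and "unaligned" are hypothetical.

```python
from collections import defaultdict

def separate_reads(alignment_scores, min_margin=1):
    """Bin reads by species using per-genome alignment scores.

    alignment_scores maps read_id -> {species: score or None}, where None
    means the read did not align to that genome (hypothetical input format).
    """
    bins = defaultdict(list)
    for read_id, per_species in alignment_scores.items():
        ranked = sorted(
            ((score, species) for species, score in per_species.items()
             if score is not None),
            reverse=True,
        )
        if not ranked:
            bins["unaligned"].append(read_id)
        elif len(ranked) == 1 or ranked[0][0] - ranked[1][0] >= min_margin:
            bins[ranked[0][1]].append(read_id)  # clear winner
        else:
            bins["ambiguous"].append(read_id)  # too close to call
    return dict(bins)

# Hypothetical scores for four reads aligned against three genomes.
scores = {
    "r1": {"mouse": 150, "rat": 140, "human": None},
    "r2": {"mouse": 148, "rat": 148, "human": 90},
    "r3": {"mouse": None, "rat": None, "human": 150},
    "r4": {"mouse": None, "rat": None, "human": None},
}
separated = separate_reads(scores, min_margin=2)
print(separated)
```

A stricter margin trades sensitivity for specificity: fewer reads are assigned, but those that are assigned are less likely to have come from the wrong species.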

With Douglas Armstrong and Giusy Pennetta at the University of Edinburgh I developed novel methods for the construction and analysis of biological graphs, including protein-protein and genetic interaction graphs, to better understand the functional relationships between proteins in neurological systems. In these studies, we developed novel graph clustering procedures that scale efficiently for large biological networks, offer a range of different clustering approaches, and are capable of deployment on HPC systems. I have further extended this work using entropy-based methods to assess biological process association with weighted gene correlation graphs (Heron, PhD 2019), an approach that promises to integrate powerfully with the network methods I have developed in recent years.

As sequencing at scale has become more cost-effective, RNA-seq-based gene expression data for large groups of patients suffering from genetic disease have become available. I have used these data to predict novel disease genes for autism spectrum disorders (Navarro, PhD 2022), to search for early biomarkers in Parkinson’s disease (Ryan, MSc. 2022), and to stratify patients for prognostic biomarker discovery in breast cancer (Moir, PhD in-progress). In the coming years both bulk RNA-seq and single-cell RNA-seq data for large cohorts of disease patients will become prevalent, facilitating the development of integrative models that can identify new disease genes, lead to insight into disease mechanism, and provide both biomarkers for diagnosis and targets for therapeutic intervention.

Dynamic Molecular Modeling

Another aspect of my more recent work to improve the downstream interpretation of neurological data completes the virtuous circle linking basic research to clinical application by taking what I have learned about biological systems and building mechanistic models of them. This mechanistic detail is what my industrial collaborators at UCB Celltech need when making R&D decisions in their drug development pipeline, especially in prioritising the components of a system most likely to effect the required therapeutic change upon pharmaceutical intervention. In this proof-of-principle research (Wysocka, PhD 2019), we developed a rule-based (Kappa Language) model of the DARPP-32 signalling pathway and validated it against experimental data and an equivalent ordinary differential equation (ODE) implementation. Our model is more expressive, flexible, and modular than its ODE equivalent and forms the basis for a Kappa model development approach that is being translated into the UCB Informatics R&D pipeline. As part of this research, we developed novel global and local sensitivity analysis methods to identify the critical agents in model simulations and the most sensitive parameters for model optimisation and drug-target prioritisation via in silico perturbation.
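
As a small illustration of local sensitivity analysis, the sketch below perturbs each parameter of a toy ODE model by 1% and reports the normalised change in the simulated output. It is not the DARPP-32 model, it is not written in Kappa, and it is not the sensitivity method developed in this research. The two-state reaction, the rate names `k_on` and `k_off`, their values, and the Euler integrator are assumptions chosen so that the result can be checked by hand.

```python
def simulate(k_on, k_off, t_end=10.0, dt=0.01):
    """Euler integration of dx/dt = k_on * (1 - x) - k_off * x from x(0) = 0.

    x is the active fraction of a hypothetical two-state molecule; the value
    returned is x at t_end, which is close to the steady state k_on / (k_on + k_off).
    """
    x = 0.0
    for _ in range(int(t_end / dt)):
        x += dt * (k_on * (1.0 - x) - k_off * x)
    return x

def local_sensitivity(model, params, name, rel_step=0.01):
    """Normalised sensitivity d ln(output) / d ln(parameter), central differences."""
    base = model(**params)
    h = rel_step * params[name]
    up = model(**{**params, name: params[name] + h})
    down = model(**{**params, name: params[name] - h})
    return (up - down) / (2 * h) * params[name] / base

# Hypothetical rate constants for the toy model.
params = {"k_on": 1.0, "k_off": 3.0}
sensitivities = {name: local_sensitivity(simulate, params, name) for name in params}
print(sensitivities)
```

For this toy system the analytical steady-state sensitivities are k_off / (k_on + k_off) = 0.75 for `k_on` and -0.75 for `k_off`, so the numerical estimates can be compared against a known answer. Ranking parameters by the magnitude of such values is one simple way to decide which components of a model to perturb first.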