Author: Rohan Khatri
Khatri, Rohan, 2025 A Workflow for Automated Pairwise Differential Gene Analysis with Real Time Hierarchical Visualization in Shiny R, Flinders University, College of Medicine and Public Health
Terms of Use: This electronic version is (or will be) made publicly available by Flinders University in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. You may use this material for uses permitted under the Copyright Act 1968. If you are the owner of any included third party copyright material and/or you believe that any material has been made available without permission of the copyright owner please contact copyright@flinders.edu.au with the details.
Advances in sequencing technologies have caused the volume of transcriptomics and metagenomics data to grow exponentially. As a result, advanced automated bioinformatics workflows are needed for data analysis and interpretation. Dedicated tools for differential gene expression analysis exist, but they are often insufficient for interpreting hierarchically annotated datasets, especially datasets generated from the SEED subsystems. To address this gap, this thesis designs and implements a new analytical bioinformatics pipeline for exploring and understanding differentially abundant and/or differentially expressed features in a hierarchical dataset, such as data generated from microbial taxonomy analysis (i.e., a taxonomic hierarchy) or shotgun DNA/RNA function counts (i.e., a functional hierarchy). The primary goal of the research was to build an analysis workflow that performs differential gene expression/abundance analysis automatically. The pipeline begins with raw hierarchical gene-annotated data derived from the SEED subsystem hierarchy (level 1 to level 4), along with sample grouping metadata. A custom R script automatically generates all possible pairwise comparisons between the sample groups defined in the metadata. DESeq2 is then used to perform differential gene expression analysis for every generated pairwise comparison, and the results are stored in a dedicated output directory. These results contain long lists of differentially expressed genes, including log2 fold change values, p values, and adjusted p values, which are difficult to interpret. To address this challenge, the pipeline provides a dynamic Shiny dashboard for hierarchical data exploration and generates multiple visualisation outputs, such as bar plots and volcano plots, with real-time filtering options.
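The pairwise-comparison step described above can be illustrated with a minimal sketch. This is not the thesis's R script: it is written in Python purely for illustration, and the sample IDs, group labels, and the function names `pairwise_comparisons` and `samples_for` are all hypothetical. It only shows how every unordered pair of groups might be enumerated from sample metadata before each pair is handed to a differential analysis tool.

```python
from itertools import combinations

# Hypothetical sample metadata: sample ID -> group label (illustrative names only).
metadata = {
    "S1": "GroupA", "S2": "GroupA", "S3": "GroupA",
    "S4": "GroupB", "S5": "GroupB", "S6": "GroupB",
    "S7": "GroupC", "S8": "GroupC", "S9": "GroupC",
}

def pairwise_comparisons(metadata):
    """Return every unordered pair of distinct group labels, in sorted order."""
    groups = sorted(set(metadata.values()))
    return list(combinations(groups, 2))

def samples_for(metadata, pair):
    """Return the sample IDs that belong to either group of a comparison."""
    return [sample for sample, group in metadata.items() if group in pair]

comparisons = pairwise_comparisons(metadata)
for a, b in comparisons:
    # Each pair would be analysed separately on the subset of its own samples.
    print(f"{a}_vs_{b}: {samples_for(metadata, (a, b))}")
```

For n groups this enumeration yields n(n-1)/2 comparisons, so three groups give three comparisons and six groups give fifteen.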
Users can filter genes across levels 1 to 4 and view volcano plots and an interactive data table that update according to their selection. Eighteen test DNA samples (six groups, three replicates each) were used for pipeline testing. The pipeline generated fifteen different pairwise comparisons, and the DESeq2 results were stored in a dedicated output directory. Real-time filtering across the hierarchical levels revealed a few essential patterns of gene expression: categories such as “Amino Acids and Derivatives”, “Carbohydrates” and “Cofactors, Vitamins and Pigments” are highly expressed in all pairwise comparisons. For validation, the pipeline was tested against the halitosis-associated tongue biofilm metatranscriptome dataset from a study published in npj Biofilms and Microbiomes, and its results were 90% similar to those of the existing study. In conclusion, this project provides a scalable, reusable and novel bioinformatics pipeline for the exploration and interpretation of transcriptomics and metagenomics data, with a user-friendly Shiny dashboard for dynamic filtering and visualisation across hierarchical levels.
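The hierarchical filtering idea can be sketched as follows. This is an illustrative Python sketch, not the thesis's Shiny implementation. The result rows, gene names, and values are invented, the column names follow the usual DESeq2 output (`log2FoldChange`, `padj`), and the function `filter_results` and its default cutoffs (adjusted p value below 0.05, absolute log2 fold change of at least 1) are assumptions chosen for the example.

```python
# Hypothetical DESeq2-style result rows; every value is made up for illustration.
results = [
    {"level1": "Carbohydrates", "level2": "Monosaccharides", "gene": "geneA",
     "log2FoldChange": 2.4, "padj": 0.001},
    {"level1": "Carbohydrates", "level2": "Fermentation", "gene": "geneB",
     "log2FoldChange": -0.3, "padj": 0.40},
    {"level1": "Amino Acids and Derivatives", "level2": "Lysine", "gene": "geneC",
     "log2FoldChange": -1.8, "padj": 0.01},
    {"level1": "Carbohydrates", "level2": "Monosaccharides", "gene": "geneD",
     "log2FoldChange": 1.5, "padj": None},  # padj can be missing (NA) in DESeq2 output
]

def filter_results(rows, selection=None, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Keep rows that match the selected hierarchy levels and pass both cutoffs.

    `selection` maps a level name (e.g. "level1") to the chosen category.
    The cutoffs are assumed defaults, not values taken from the thesis.
    """
    selection = selection or {}
    kept = []
    for row in rows:
        # Drop rows outside the selected branch of the hierarchy.
        if any(row.get(level) != value for level, value in selection.items()):
            continue
        # Drop rows with a missing or non-significant adjusted p value.
        if row["padj"] is None or row["padj"] >= padj_cutoff:
            continue
        # Drop rows whose effect size is below the fold-change cutoff.
        if abs(row["log2FoldChange"]) < lfc_cutoff:
            continue
        kept.append(row)
    return kept

carbohydrate_hits = filter_results(results, {"level1": "Carbohydrates"})
all_hits = filter_results(results)
print([row["gene"] for row in carbohydrate_hits])
print([row["gene"] for row in all_hits])
```

In a dashboard, the same filter would be re-run whenever the user changes a level selection or a threshold, and the table and volcano plot would be redrawn from the filtered rows.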
Keywords: Bioinformatics, Bioinformatics pipelines or workflow, Hierarchical, Differential gene expression/abundance, DESeq2, Visualisation, Real-time filtering
Subject: Biotechnology thesis
Thesis type: Masters
Completed: 2025
School: College of Medicine and Public Health
Supervisor: Jim Mitchell