Bioinformatics in Bacteriophages: Leveraging the Levenshtein Distance for Enhanced Bacteriophage Genome Analysis

Author: Nathini Sion

Sion, Nathini, 2024 Bioinformatics in Bacteriophages: Leveraging the Levenshtein Distance for Enhanced Bacteriophage Genome Analysis, Flinders University, College of Medicine and Public Health

Terms of Use: This electronic version is (or will be) made publicly available by Flinders University in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. You may use this material for uses permitted under the Copyright Act 1968. If you are the owner of any included third party copyright material and/or you believe that any material has been made available without permission of the copyright owner please contact copyright@flinders.edu.au with the details.

Abstract

Viruses that infect and eliminate bacteria are known as bacteriophages, or ‘phages’. They are remarkably abundant and diverse across ecosystems such as the human microbiome, the animal gut, and marine and soil environments. Phages have mosaic genomes composed of modules with distinct evolutionary histories. This mosaicism contributes to the incredible diversity observed among phages and poses challenges for their classification. In recent years, bacteriophage taxonomy has moved from a morphology-based to a genome-based classification principle, reflecting the view that genome-based classification provides a more comprehensive and accurate basis for understanding the evolutionary relationships among phages.

This study utilises the Levenshtein distance as a tool for assessing similarity among phage genomes. The distance is the minimum number of single-element edits (insertions, deletions, and substitutions) required to transform one string into another. It has the potential to distinguish effectively between phage genomes on the basis of their evolutionary relationships and functional diversity.
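
As an illustration of the definition above, the following is a minimal Python sketch of the standard dynamic-programming computation of the Levenshtein distance. It is not taken from the thesis; the function name `levenshtein` and the toy inputs are assumptions chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of insertions, deletions, and
    substitutions needed to transform string a into string b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b; initially the prefix of a is empty.
    previous = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current

    return previous[-1]


if __name__ == "__main__":
    # Toy nucleotide strings (hypothetical, not real phage sequences).
    print(levenshtein("ACGT", "AGT"))
```

The sketch keeps only two rows of the dynamic-programming table, so memory grows with the length of the shorter string, while running time still grows with the product of the two lengths.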

The International Committee on Taxonomy of Viruses (ICTV) has recently implemented a genome-based taxonomy to enhance the classification of viruses, including phages. This transition highlights the importance of bioinformatics, which increases our understanding of viruses and enables researchers to analyse genomes using computational algorithms. Such analysis facilitates the identification of phage functions and the comparison of phage genomes to understand their relationships and potential applications.

The number of phage genomes in the National Center for Biotechnology Information (NCBI) database has increased significantly as a result of advancements in molecular techniques since the late 20th century. Phage taxonomy has been substantially enhanced by bioinformatics algorithms. Nevertheless, the current methodologies continue to have their limitations. In particular, the approaches for comparing phage genomes may not fully convey the complexity of genome arrangement and synteny. This serves to highlight the importance of conducting further research on algorithms that would accurately and comprehensively represent the entire spectrum of phage diversity.

The objective of this thesis is to evaluate the efficacy of genome similarity analysis by utilising the Levenshtein distance and generating phylogenetic trees. The approach serves both as an alternative method for phage classification and as an investigation of the extent to which the gene arrangement within genomes is consistent with the current taxonomic classification. Moreover, its efficacy is evaluated in comparison to the current classification principle. The analysis has the potential to offer valuable insights into phage classification, which could be instrumental in the comprehension of phage biology, the prediction of phage-host interactions, and the development of precise classification systems for the effective use of phages in therapy and other applications.

The phage genome datasets were compiled from the NCBI GenBank database. Genome similarity was then computed using the Levenshtein and Mash distances. The phylogenies were constructed from the Levenshtein distance and visualised using the Interactive Tree Of Life (iTOL) online tool. Numerous phage characteristics, including genome length, bacterial host, and viral taxonomy, were employed to analyse these trees. To evaluate the correlation between the two distance metrics, tanglegrams were generated. The results of this study illustrate the potential of this method to investigate the relationship between phage gene arrangement and phage taxonomy. Future research should expand the phage dataset and investigate additional algorithms.
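
The pairwise comparison step described above can be sketched as follows. This is an illustrative assumption rather than the thesis pipeline: the abstract does not state how genomes were encoded, so each genome is represented here as an ordered list of hypothetical gene labels, and the names `levenshtein`, `distance_matrix`, and the toy phages are invented for the example. A symmetric matrix of this kind is the usual input to distance-based tree-building methods.

```python
from itertools import combinations


def levenshtein(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def distance_matrix(genomes):
    """Return sorted genome names and a symmetric pairwise distance matrix."""
    names = sorted(genomes)
    size = len(names)
    matrix = [[0] * size for _ in range(size)]
    for i, j in combinations(range(size), 2):
        distance = levenshtein(genomes[names[i]], genomes[names[j]])
        matrix[i][j] = matrix[j][i] = distance
    return names, matrix


# Hypothetical gene orders; the labels are illustrative, not real annotations.
toy_genomes = {
    "phage_A": ["terminase", "portal", "capsid", "tail", "lysin"],
    "phage_B": ["terminase", "portal", "capsid", "lysin"],
    "phage_C": ["integrase", "terminase", "portal", "tail", "lysin"],
}

if __name__ == "__main__":
    names, matrix = distance_matrix(toy_genomes)
    for name, row in zip(names, matrix):
        print(name, row)
```

Treating each gene as a single symbol means one edit corresponds to one gene gained, lost, or replaced, which is one possible way to make the distance sensitive to gene arrangement.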

Keywords: Bioinformatics, Bacteriophages, Levenshtein Distance, Phage Taxonomy, Phage Classification

Subject: Medical Biotechnology thesis

Thesis type: Masters
Completed: 2024
School: College of Medicine and Public Health
Supervisor: Professor Robert Edwards