Discipline: Computer Sciences and Information Management
Subcategory: Computer Science & Information Systems
Samuel Liburd Jr. - University of the Virgin Islands
Viruses serve as one of the most efficient vectors for death and disease, killing millions worldwide and mutating uncontrollably. To identify and understand viruses, a classification system was created based on features such as virus size, shape, genome structure, and mode of replication. To better understand this system, I hypothesized that viruses can be classified into their biological families using genomic features and machine learning techniques. To test this, I analysed 511 (+) ssRNA virus genomes for the genetic characteristics that identify them. The six virus families to be classified were Flaviviridae, Potyviridae, Betaflexiviridae, Virgaviridae, Picornaviridae, and Tombusviridae. Based on my literature review, I wrote a Python script that extracted 12 different features for the classification task: genome length; adenine, guanine, cytosine, and thymine counts; the number of start codons; G-C and A-T percentages; host organisms; the number of proteins encoded; and the number of genome segments, if any. The relevance of these attributes was then ranked using the Correlation-based Feature Subset Eval and Best First algorithms available in the data mining package Weka. The most relevant subset of attributes (genome length, A, C, and G counts, G-C percentage, host organism, and number of proteins encoded) was selected and used with the C4.5 classification algorithm. A decision tree model was trained on 66% of the genomes and tested on the remaining genomes; 99.4% of the test viruses were correctly classified. This accuracy supports my initial hypothesis that it is possible to classify viruses using machine learning techniques and genomic-based features.
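The sequence-derived features listed above (genome length, base counts, start codon count, and G-C and A-T percentages) could be computed along the following lines. This is a minimal sketch and not the script used in the study: the function names, the FASTA input, the toy record, and the choice to count every ATG occurrence in any reading frame as a start codon are all assumptions made for illustration. Host organism, number of encoded proteins, and number of segments come from genome annotations rather than the raw sequence, so they are left out here.

```python
from collections import Counter

# Assumption: a "start codon" is any occurrence of ATG, in any reading frame.
START_CODON = "ATG"


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def count_start_codons(seq):
    """Count ATG occurrences, including those in overlapping frames."""
    return sum(1 for i in range(len(seq) - 2) if seq[i:i + 3] == START_CODON)


def extract_features(seq):
    """Return the sequence-derived features for one genome as a dict."""
    counts = Counter(seq)
    length = len(seq)
    a, c, g, t = (counts[base] for base in "ACGT")
    return {
        "length": length,
        "A": a,
        "C": c,
        "G": g,
        "T": t,
        "start_codons": count_start_codons(seq),
        "gc_percent": 100.0 * (g + c) / length if length else 0.0,
        "at_percent": 100.0 * (a + t) / length if length else 0.0,
    }


# Hypothetical toy record, used only to show the call pattern.
example = ">toy_genome hypothetical record\nATGGCATGCA\nTTAG\n"
features = {header: extract_features(seq) for header, seq in parse_fasta(example)}
```

Each resulting dictionary corresponds to one row of a feature table; such a table, joined with the annotation-derived attributes and the family label, could then be exported for attribute selection and classification in a tool such as Weka.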
In the future, I plan to extend this approach with other machine learning techniques, such as support vector machines and artificial neural networks, which could serve as powerful tools for monitoring changes in viral genomes.
Funder Acknowledgement(s): UVI NSF/HBCU-UP SURE grant #1137472
Faculty Advisor: Marc Boumedine, mboumedine@gmail.com
Role: I conducted all of the research for this project.