ERN: Emerging Researchers National Conference in STEM

A Bioinformatics Approach to Classify Viruses Using a Decision Tree Model

Undergraduate #43
Discipline: Computer Sciences and Information Management
Subcategory: Computer Science & Information Systems

Samuel Liburd Jr. - University of the Virgin Islands


Viruses are among the most efficient agents of death and disease, killing millions worldwide and mutating uncontrollably. To identify and understand viruses, a classification system was created based on features such as virus size, shape, genome structure, and mode of replication. To better understand this system, I hypothesized that viruses can be classified biologically using genomic features and machine learning techniques. To test this, I analysed 511 (+) ssRNA virus genomes for the genetic characteristics that distinguish them. The six virus families to be classified were Flaviviridae, Potyviridae, Betaflexiviridae, Virgaviridae, Picornaviridae, and Tombusviridae. Based on my literature review, I wrote a Python script that extracted 12 different features for the classification task: genome length; adenine, guanine, cytosine, and thymine counts; the number of start codons; G-C and A-T percentages; host organism; the number of proteins encoded; and the number, if any, of genome segments. The relevance of these attributes was then evaluated using the Correlation-based Feature Subset Eval and Best First algorithms available in the data mining package Weka. The most relevant subset of attributes (genome length, A, C, and G counts, G-C percentage, host organism, and number of proteins encoded) was selected for use with the C4.5 classification algorithm. A decision tree model was trained on 66% of the genomes and tested on the remaining 34%; the results showed that 99.4% of the held-out viruses were classified correctly. This accuracy supports my initial hypothesis that viruses can be classified using machine learning techniques and genome-based features.
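The sequence-derived features named in the abstract can be illustrated with a minimal sketch. This is not the author's script: the function names `parse_fasta` and `extract_features` are hypothetical, the code uses only the Python standard library rather than Biopython, and it assumes that start codons are counted as every occurrence of ATG in any reading frame. Host organism, protein count, and segment count come from genome annotations and are not computed here.

```python
from typing import Dict, Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_features(sequence: str) -> Dict[str, float]:
    """Compute composition features for one genome sequence.

    Assumption: RNA 'U' is treated as 'T', and every 'ATG' in any
    reading frame counts as a start codon.
    """
    seq = sequence.upper().replace("U", "T")
    length = len(seq)
    counts = {base: seq.count(base) for base in "ACGT"}
    gc = counts["G"] + counts["C"]
    at = counts["A"] + counts["T"]
    return {
        "genome_length": length,
        "a_count": counts["A"],
        "c_count": counts["C"],
        "g_count": counts["G"],
        "t_count": counts["T"],
        "start_codons": seq.count("ATG"),
        "gc_percent": 100.0 * gc / length if length else 0.0,
        "at_percent": 100.0 * at / length if length else 0.0,
    }


if __name__ == "__main__":
    # Toy record for illustration only; not a real viral genome.
    toy = ">toy_virus\nATGGCC\nATGTAA\n"
    for name, seq in parse_fasta(toy):
        print(name, extract_features(seq))
```

Each resulting dictionary corresponds to one row of the attribute table that a tool such as Weka would take as input, with the virus family as the class label.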
In the future, I plan to expand this approach using machine learning techniques such as support vector machines and artificial neural networks that could serve as powerful tools to monitor and update changes to viral genomes.
References: Cock, Peter. ‘Using FASTA Nucleotide Files In Biopython’. www2.warwick.ac.uk. N.p., 2016. Web. 3 Oct. 2016.
Gelderblom HR. Structure and Classification of Viruses. In: Baron S, editor. Medical Microbiology. 4th edition. Galveston (TX): University of Texas Medical Branch at Galveston; 1996. Chapter 41.

Funder Acknowledgement(s): UVI NSF/HBCU-UP SURE grant #1137472

Faculty Advisor: Marc Boumedine, mboumedine@gmail.com

Role: I conducted all of the research for this project.


This material is based upon work supported by the National Science Foundation (NSF) under Grant No. DUE-1930047. Any opinions, findings, interpretations, conclusions or recommendations expressed in this material are those of its authors and do not represent the views of the AAAS Board of Directors, the Council of AAAS, AAAS’ membership or the National Science Foundation.
