Genome Classification with scikit-learn

Building and tuning a classifier for genome identification

Idea

Can a classifier discern between different species of bacteria?

Setup

  • Python 3
  • scikit-learn

Process

Pipeline

  1. Get the data from NCBI
  2. Read it in and write it out as annotated JSON
  3. Read the data back in, split each sequence into k-mers, and build a bag-of-words representation
  4. Train the classifier on a random subset
  5. Test the classifier on an unseen set
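
Step 3 can be illustrated with a minimal standard-library sketch. The function names `kmerise` and `bag_of_words` are hypothetical, the toy sequence is made up rather than taken from NCBI, and the default `k` and `step` are only placeholders; the actual pipeline may differ.

```python
from collections import Counter


def kmerise(sequence, k=6, step=3):
    """Split a sequence into k-mers of length k, advancing `step` bases each time."""
    # Stop early enough that every k-mer has the full length k.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


def bag_of_words(kmers):
    """Count how often each k-mer occurs, discarding positional information."""
    return Counter(kmers)


# Made-up toy sequence for illustration only (not real genome data).
toy_sequence = "ATGCGTACGTTAGC"
kmers = kmerise(toy_sequence, k=6, step=3)
counts = bag_of_words(kmers)

print(kmers)
print(counts)
```

When the step is smaller than k, neighbouring k-mers overlap; when it is larger, some bases are skipped entirely.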

Using Genomic Data for ML

  • How to select?
  • How to pass to a classifier?
  • How much data to use?

Classifier Choice

Tuning

  • Kmer Size: 6
  • Step Size: 3
  • Max Features: 5000
  • Genome Count per Organism: 100
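
As a rough sketch of where these four settings could plug into scikit-learn, the example below feeds k-mer "sentences" to `CountVectorizer` and trains a classifier. Everything beyond the four parameter values is an assumption: the genomes are synthetic random strings, the organism names and the helpers `kmer_sentence` and `synthetic_genome` are hypothetical, and `MultinomialNB` is only a stand-in because the slides do not name the classifier used.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Tuning values taken from the list above.
KMER_SIZE = 6
STEP_SIZE = 3
MAX_FEATURES = 5000
GENOMES_PER_ORGANISM = 100


def kmer_sentence(sequence, k=KMER_SIZE, step=STEP_SIZE):
    """Turn a sequence into space-separated k-mers so it can be tokenised like text."""
    return " ".join(sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step))


def synthetic_genome(rng, weights, length=300):
    """Assumption: random bases with a per-organism bias stand in for real genomes."""
    return "".join(rng.choices("ACGT", weights=weights, k=length))


rng = random.Random(0)
# Hypothetical organisms: one AT-rich, one GC-rich.
organisms = {"organism_a": [4, 1, 1, 4], "organism_b": [1, 4, 4, 1]}

documents, labels = [], []
for name, weights in organisms.items():
    for _ in range(GENOMES_PER_ORGANISM):
        documents.append(kmer_sentence(synthetic_genome(rng, weights)))
        labels.append(name)

# Hold out a quarter of the genomes as the unseen test set.
train_docs, test_docs, train_labels, test_labels = train_test_split(
    documents, labels, test_size=0.25, random_state=0, stratify=labels
)

# max_features keeps only the most frequent k-mers across the training set.
vectorizer = CountVectorizer(lowercase=False, max_features=MAX_FEATURES)
train_matrix = vectorizer.fit_transform(train_docs)
test_matrix = vectorizer.transform(test_docs)

classifier = MultinomialNB()  # stand-in classifier, not necessarily the one used
classifier.fit(train_matrix, train_labels)
accuracy = classifier.score(test_matrix, test_labels)
print(f"features: {train_matrix.shape[1]}, test accuracy: {accuracy:.2f}")
```

The accuracy printed here says nothing about real genomes; the synthetic data is deliberately easy to separate.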

K = 3, Step = 6, Genomes per Organism = 100, Max Features = 1500

K = 3, Step = 6, Genomes per Organism = 100, Max Features = 5000

Testing other classifiers
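
One way such a comparison could be set up is to run several scikit-learn classifiers over the same k-mer features with cross-validation. This is a sketch under stated assumptions: the three classifiers below are illustrative picks rather than the ones evaluated in this talk, the data is synthetic, and the helper names are hypothetical.

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def kmer_sentence(sequence, k=6, step=3):
    """Space-separated k-mers, so the sequence can be tokenised like text."""
    return " ".join(sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step))


# Assumption: biased random strings stand in for real genomes.
rng = random.Random(0)
organisms = {"organism_a": [4, 1, 1, 4], "organism_b": [1, 4, 4, 1]}
documents, labels = [], []
for name, weights in organisms.items():
    for _ in range(30):
        sequence = "".join(rng.choices("ACGT", weights=weights, k=200))
        documents.append(kmer_sentence(sequence))
        labels.append(name)

# Illustrative candidates only; the talk's actual classifiers are not listed here.
candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, classifier in candidates.items():
    # Vectoriser and classifier are refit on each fold to avoid leaking test data.
    pipeline = make_pipeline(
        CountVectorizer(lowercase=False, max_features=5000), classifier
    )
    scores = cross_val_score(pipeline, documents, labels, cv=5)
    results[name] = scores.mean()

for name, mean_accuracy in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{name}: {mean_accuracy:.2f}")
```

Wrapping the vectoriser and classifier in one pipeline keeps the comparison fair, since every candidate sees features built only from its own training folds.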

The Good

The OK

The Ugly

Conclusions & Future Work

  1. The idea holds up: results are encouraging across diverse inputs
  2. Tuning is important for getting the best results, but it is difficult
  3. Some classifiers are more accurate than others

Questions?