Genome Classification with scikit-learn
Building and tuning a classifier for genome identification
Idea
Can a classifier discern between different species of bacteria?
Setup
Pipeline
- Get the data from NCBI
- Read it in, and output as annotated JSON
- Read in the data, split it into k-mers, and build a bag-of-words representation
- Train the classifier on a random subset
- Test the classifier on an unseen subset
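The k-mer, bag-of-words, train, and test steps above can be sketched as follows. This is a minimal illustration, not the original implementation: the synthetic sequences stand in for the NCBI data, the organism labels and the `random_genome` and `kmer_sentence` helpers are hypothetical, and `MultinomialNB` is an assumed choice of classifier, since the slides do not name one.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def kmer_sentence(seq, k=6, step=3):
    """Turn a sequence into a space-separated 'sentence' of k-mers."""
    kmers = [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]
    return " ".join(kmers)


def random_genome(rng, length, gc_fraction):
    """Hypothetical stand-in for a real genome: random bases with a GC bias."""
    at = 1.0 - gc_fraction
    weights = [at, gc_fraction, gc_fraction, at]  # order matches "ACGT"
    return "".join(rng.choices("ACGT", weights=weights, k=length))


# Build a toy data set: two made-up organisms with different base composition.
rng = random.Random(0)
sequences, labels = [], []
for label, gc_fraction in [("organism_a", 0.2), ("organism_b", 0.8)]:
    for _ in range(20):
        sequences.append(random_genome(rng, 600, gc_fraction))
        labels.append(label)

# K-mer-ise every sequence so a text vectoriser can treat k-mers as words.
docs = [kmer_sentence(seq) for seq in sequences]

# Hold out an unseen test set, keeping both organisms represented.
train_docs, test_docs, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, random_state=0, stratify=labels
)

# Bag of words: count k-mer occurrences, keeping at most 5000 features.
vectorizer = CountVectorizer(max_features=5000, lowercase=False)
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# Train on the training set and score on the held-out set.
clf = MultinomialNB()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

The vectoriser is fitted on the training documents only, so k-mers that appear only in the test set are ignored at prediction time and the test set stays unseen.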
Using Genomic Data for ML
- How to select the data?
- How to pass it to a classifier?
- How much data to use?
Tuning
- K-mer Size: 6
- Step Size: 3
- Max Features: 5000
- Genome Count per Organism: 100
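To make the first two parameters concrete, here is a small sketch of how k-mer size and step size interact. The `kmerise` function is a hypothetical helper written for illustration, not code from the project.

```python
def kmerise(sequence, k=6, step=3):
    """Split a sequence into k-mers of length k, advancing `step` bases each time.

    With step < k the k-mers overlap; with step > k some bases are skipped.
    Trailing bases that do not fill a whole k-mer are dropped.
    """
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


example = "ATGCGTACGTTA"

# K-mer size 6, step size 3: consecutive k-mers share three bases.
overlapping = kmerise(example, k=6, step=3)

# K-mer size 3, step size 6: three bases are skipped between k-mers.
skipping = kmerise(example, k=3, step=6)

print(overlapping)
print(skipping)
```

A smaller step produces more k-mers per genome, and so more data for the bag of words, at the cost of longer processing; Max Features then caps how many distinct k-mers the classifier sees.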
K = 3, Step = 6, Genomes per Organism = 100, Max Features = 1500
K = 3, Step = 6, Genomes per Organism = 100, Max Features = 5000
Testing other classifiers
The OK
Conclusions & Future Work
- Idea: Encouraging results with diverse inputs
- Tuning is important for getting the best results, but it is difficult
- Some classifiers are more accurate than others