Tuberculosis (TB) is one of the top 10 causes of death worldwide and the leading cause of death from a single infectious agent, Mycobacterium tuberculosis var. tuberculosis (MTB). In 2018, about 10 million people fell ill with TB and around 1.2 million died. Drug-resistant TB poses a major threat to the World Health Organization’s “End TB” strategy, which sets 2035 as its target year. In 2018, there were about 0.5 million cases of drug-resistant TB, of which 78% were resistant to multiple TB drugs. The traditional culture-based drug susceptibility test (the gold standard) often takes several weeks, and the necessary laboratory facilities are not readily available in low-income countries.
Predicting drug resistance by applying Machine Learning (ML) to whole genome sequencing (WGS) data can pave the way to earlier diagnosis and more efficient treatment than is possible with the gold-standard culture-based phenotypic drug susceptibility testing.
This project aims to explore:
- Exploratory data analysis, to understand the variables in the dataset.
- Feature engineering approaches, to understand whether Single Nucleotide Polymorphisms (SNPs) provide a good foundation for prediction.
- A random forest approach to Machine Learning, which combines multiple decision trees into an overall ensemble model.
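To make the feature engineering idea concrete, here is a minimal sketch of one possible SNP encoding: a binary presence/absence matrix with one row per isolate and one column per SNP. It is not the project's actual pipeline. The isolate IDs, SNP names, labels, and the helper functions `build_snp_matrix` and `snp_counts_by_label` are all hypothetical and invented for illustration.

```python
from collections import Counter

# Hypothetical toy input: for each isolate, the set of SNPs called against a
# reference genome and a resistance label (1 = resistant, 0 = susceptible).
# Every value here is invented for illustration.
isolates = {
    "iso1": ({"snp_b", "snp_c"}, 1),
    "iso2": ({"snp_c"}, 1),
    "iso3": ({"snp_a"}, 0),
    "iso4": (set(), 0),
}


def build_snp_matrix(isolates):
    """Return (isolate_ids, snp_names, X, y), where X is a binary
    presence/absence matrix (rows = isolates, columns = SNPs)."""
    isolate_ids = sorted(isolates)
    # Collect every SNP seen in any isolate; sort for a stable column order.
    snp_names = sorted(set().union(*(snps for snps, _ in isolates.values())))
    X = [
        [1 if snp in isolates[iso][0] else 0 for snp in snp_names]
        for iso in isolate_ids
    ]
    y = [isolates[iso][1] for iso in isolate_ids]
    return isolate_ids, snp_names, X, y


def snp_counts_by_label(snp_names, X, y):
    """For each SNP, count how many isolates of each label carry it."""
    counts = {snp: Counter() for snp in snp_names}
    for row, label in zip(X, y):
        for snp, present in zip(snp_names, row):
            if present:
                counts[snp][label] += 1
    return counts


isolate_ids, snp_names, X, y = build_snp_matrix(isolates)
counts = snp_counts_by_label(snp_names, X, y)
for snp in snp_names:
    print(snp, "resistant:", counts[snp][1], "susceptible:", counts[snp][0])
```

The per-SNP counts are a simple exploratory check on which SNPs co-occur with resistance. A matrix `X` and label list `y` in this shape could then be passed to a random forest implementation, for example `RandomForestClassifier` in scikit-learn, assuming that library is used.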
Desired skill level
This project requires some knowledge of Python.
Beginner: If you’re curious about the topic, you can learn by reading the code and contribute by doing code reviews, helping us to structure the project better, improve documentation, fix variable names etc. Feel free to dip your toes in, the water’s fine!
Intermediate: If you have experience with Data Visualization, there’s good scope for that in this project 🙂
Advanced: Some familiarity with Machine Learning and Feature Engineering would be great.