GEA (Gene Expression Algorithm)

I want to develop an ML tool for learning about and exploring gene expression through development, using generalized functions for querying data on any gene across multiple developmental stages of the brain in humans and mice…

Turning to resources: the Allen Institute for Brain Science is a research facility dedicated to providing the scientific community with tools and resources for exploring the brain. Over the last 10 years, in collaboration with research labs across the world, the Institute has openly released numerous datasets, including data from developing brains, diseased brains, and mouse, primate and human brains.

These datasets, available in the public domain, provide an open playground for neuroscientists of all experience levels, from high-school students to academic researchers, to explore the brain and “see for themselves” the discoveries reported in the scientific literature. Much in the same way that the ImageNet and MNIST datasets are used as both a benchmark and an educational resource in the machine learning community, the Allen Brain datasets can be used as a reference to verify findings from the neuroscience literature and offer an interactive platform to learn more about the brain.

In this project, I use the Allen Brain datasets, and in particular BrainSpan, the developing human brain dataset, to describe important concepts and findings in developmental neurobiology. It has been written to accommodate readers of all backgrounds and is designed to encourage interactivity. 

Developing Human Brain – BrainSpan

BrainSpan contains gene expression data within regions of the developing human brain. Gene expression data is collected through transcriptome profiling, microarray analysis and in-situ hybridisation, which are explained below. The dataset also provides reference atlases: images of the developing human brain annotated by brain structure.

Transcriptome

Advantages:
  • Complete list of genes
  • Close separation in donor ages

Disadvantages:
  • Brain structures less specific than microarray data

Microarray

Advantages:
  • Fine spatial resolution – samples from very specific brain regions.

Disadvantages:
  • Only 4 donors (so only 4 developmental stages: 15pcw, 16pcw or 21pcw)

ISH

The BrainSpan dataset also contains in-situ hybridisation data…

Advantages:
  • Can visually see expression in brain slice

Disadvantages:
  • Not quantitative
Developing Mouse Brain Atlas

Like BrainSpan, this dataset contains gene expression across development stages of the developing mouse brain.

Documentation/Useful Links

  • Allen Brain Atlases Home Page – Central hub for exploring all Allen datasets.
  • AllenSDK GitHub repository – Useful for seeing explicitly how queries are parsed by the SDK.
  • API Online Documentation – Links to documentation for different API subclasses (e.g. ImageDownloadApi, OntologiesApi).
  • AllenSDK Examples – Sample of Jupyter notebooks guiding through using aspects of the AllenSDK.
  • API Class Lists – Lists all classes used by the Allen API. Useful for finding class names, attributes and associations needed for constructing RMA queries.
  • BrainSpan Data Documentation – Links to white papers detailing methodology used to create BrainSpan data.
  • BrainSpan API Documentation – Explanations and examples of accessing BrainSpan data using the API.
  • RMA Guide – Explains components of RESTful Model Access queries used to access Allen data.
  • Services Documentation – Useful for constructing service RMA queries. Lists parameters and example queries.
  • Developing Mouse API Documentation – Explanations and examples of accessing the developing mouse brain atlas data.
  • Developing Mouse Annotation Volumes – Links to 3D volumetric data files for the developing mouse brain. These files can be used to load a coordinate system for 2D/3D gene expression plots.
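As a rough illustration of the "generalized functions for querying data on any gene" idea, the sketch below builds (but does not send) an RMA query URL that looks up a gene by its acronym. The endpoint and the criteria syntax are written from memory of the RMA Guide and should be checked against it; the function names and the example acronym are hypothetical choices made for this illustration.

```python
from urllib.parse import urlencode

# Assumed RMA endpoint of the Allen Brain Atlas API; verify against the RMA Guide.
RMA_ENDPOINT = "http://api.brain-map.org/api/v2/data/query.json"


def build_gene_criteria(acronym):
    """Return an RMA criteria string selecting Gene records by acronym."""
    return f"model::Gene,rma::criteria,[acronym$eq'{acronym}']"


def build_gene_query_url(acronym):
    """Build a URL-encoded query URL for one gene acronym (no request is made)."""
    criteria = build_gene_criteria(acronym)
    return f"{RMA_ENDPOINT}?{urlencode({'criteria': criteria})}"


if __name__ == "__main__":
    # "PAX6" is only an example acronym.
    print(build_gene_query_url("PAX6"))
```

Keeping URL construction separate from the network call makes the query easy to inspect and test before any data is downloaded.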
Developmental Neurobiology

“It has long been recognized that the life history of a neuron can be characterized by a temporal progression of transitions that may be conveniently (albeit somewhat arbitrarily) viewed as a discrete series of neurogenetic steps or stages through which virtually all cells must pass. Included among these steps are induction, proliferation, migration, restriction and determination, differentiation (expression), the formation of axonal pathways and synaptic connections, and the onset of physiological function.” – Oppenheim 1991

It is an incredible feat of biology to transform a single-celled zygote into a fully-developed adult human being – an end-product that has the ability to perceive its own environment and engage in conscious thought. To create such a complex organism, the developing embryo must undergo a series of precisely timed and controlled developmental processes.

A natural starting point for the development of the nervous system is gastrulation, during which cells in the early embryo start to differentiate to form neural tissue. However, in the interest of using Allen Brain data, this article begins a few developmental stages after this point, when the embryo has folded to form the neural tube.
Neurogenesis

[NEUROGENESIS INTRODUCTION]

The human brain is thought to consist of over 80 billion neurons.

Neurogenesis is an intricately controlled process. Too few neurons produced may result in microcephaly; too many can lead to. Too many excitatory neurons relative to inhibitory neurons can also lead to autism. To avoid these defects, evolution has developed genetic machinery to balance these factors.

Some of the genes responsible for controlling the process of creating excitatory neurons in the cortex were identified and investigated by Englund et al. in a study published in 2005.
The aim is thus to create visualisations and an educational resource for learning about developmental neurobiology using real data.

Drug Resistance Prediction using WGS data


Tuberculosis (TB), caused by Mycobacterium tuberculosis var. tuberculosis (MTB), is one of the top 10 causes of death worldwide and the leading cause of death from a single infectious agent. In 2018, about 10 million people fell ill with TB and around 1.2 million died. Drug resistant TB poses a major threat to the World Health Organization’s “End TB” strategy, which has set 2035 as its target year. In 2018, there were about 0.5 million cases of drug resistant TB, of which 78% were resistant to multiple TB drugs. The traditional culture-based drug susceptibility test (the gold standard) often takes multiple weeks, and the necessary laboratory facilities are not readily available in low-income countries.

Predicting drug resistance by applying Machine Learning (ML) to whole genome sequencing (WGS) data will pave the way to earlier diagnosis and more efficient treatment than is possible with the gold-standard culture-based phenotypic drug susceptibility testing.



This project aims to explore:

  1. Exploratory data analysis, to understand the various variables in the dataset.
  2. Feature engineering approaches to understand whether Single Nucleotide Polymorphism (SNP) provides a good foundation for prediction.
  3. Random forest approach for Machine Learning, which combines multiple trees to create an overall ensemble model.
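The feature engineering step in item 2 could start from a simple presence/absence encoding of SNPs per isolate. The sketch below is one possible encoding, not the project's actual pipeline: the isolate names, SNP identifiers and labels are invented toy data, and the per-SNP summary is only a quick exploratory check. A binary matrix of this shape is the kind of input a random forest implementation (for example scikit-learn's RandomForestClassifier) could then be trained on.

```python
def encode_snps(isolate_snps):
    """Encode {isolate: set of SNP ids} as a binary presence/absence matrix.

    Returns (isolates, snp_ids, rows), where rows[i][j] is 1 if
    isolates[i] carries snp_ids[j] and 0 otherwise.
    """
    isolates = sorted(isolate_snps)
    snp_ids = sorted({snp for snps in isolate_snps.values() for snp in snps})
    rows = [
        [1 if snp in isolate_snps[iso] else 0 for snp in snp_ids]
        for iso in isolates
    ]
    return isolates, snp_ids, rows


def resistant_fraction_by_snp(isolates, snp_ids, rows, labels):
    """For each SNP, return the fraction of carrier isolates labelled resistant (1)."""
    fractions = {}
    for j, snp in enumerate(snp_ids):
        carriers = [iso for iso, row in zip(isolates, rows) if row[j] == 1]
        if carriers:
            fractions[snp] = sum(labels[iso] for iso in carriers) / len(carriers)
    return fractions


if __name__ == "__main__":
    # Toy data: all names and labels are hypothetical.
    toy_snps = {
        "iso1": {"snp_A", "snp_B"},
        "iso2": {"snp_B"},
        "iso3": {"snp_A", "snp_C"},
        "iso4": set(),
    }
    toy_labels = {"iso1": 1, "iso2": 0, "iso3": 1, "iso4": 0}  # 1 = resistant
    isolates, snp_ids, rows = encode_snps(toy_snps)
    print(resistant_fraction_by_snp(isolates, snp_ids, rows, toy_labels))
```

Sorting isolates and SNP identifiers keeps the row and column order reproducible between runs, which matters when the matrix and the label vector are built separately.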

Desired skill level

This project requires some knowledge of Python.

Beginner: If you’re curious about the topic, you can learn by reading the code and contribute by doing code reviews, helping us to structure the project better, improving documentation, fixing variable names etc. Feel free to dip your toes in, the water’s fine!

Intermediate: If you’ve experience with Data Visualization, there’s good scope for that in this project 🙂

Advanced: Some familiarity with Machine Learning and Feature Engineering would be great.

Functional annotation of lncRNA based on their cis and trans interactions


Long non-coding RNAs (lncRNAs) are an emerging class of transcripts that are gaining relevance due to their involvement in biological processes. Functionally, lncRNAs are known to perform regulatory tasks through (i) signal transduction, (ii) sponge formation with microRNA, (iii) protein translocation and guidance, (iv) scaffolding for molecule assembly and, more recently, (v) triplex formation (Antonov et al. 2019; Hon et al. 2017; Liu et al. 2019).

To achieve functional relevance, lncRNAs may either function at their site of transcription (cis) or leave it and exert regulatory roles at distant sites (trans) (Latos et al., 2012; Rinn et al., 2007). In both cases, they interact with genetic elements to either activate or repress their expression. For example, HOX antisense intergenic RNA (HOTAIR), a ~2.2 kb spliced and polyadenylated mammalian transcript expressed from the HOXC locus, can influence local HOXC expression as well as the distant HOXD locus via its repressive activity (Kopp et al., 2018; Rinn et al., 2007).


Deriving the functional significance of multiple lncRNAs from experiments remains a bottleneck when prioritizing lncRNA function. Based on available datasets that have resolved genome-wide lncRNA interactions in different cells, such as iMARGI, GRID-seq and RED-C, we propose a workflow for the functional annotation of lncRNAs that accounts for their cis and trans interactions.


The aim of this experiment is to develop and validate a workflow for the functional annotation of lncRNA expression using information from their cis and trans interactions. We will validate it by testing the workflow on lncRNA expression profiles from total RNA expression datasets.
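One early step in such a workflow is deciding whether a given RNA-DNA contact counts as cis or trans. The sketch below shows one simple rule, assuming a contact is cis when the target lies on the same chromosome within a fixed window of the lncRNA locus. The 1 Mb window, the Locus type and all coordinates are illustrative assumptions, not values taken from the cited datasets.

```python
from collections import namedtuple

# Minimal genomic interval; coordinates are in base pairs.
Locus = namedtuple("Locus", ["chrom", "start", "end"])

# Assumed cutoff: same-chromosome contacts within this distance count as cis.
CIS_WINDOW_BP = 1_000_000


def classify_interaction(lncrna, target, cis_window=CIS_WINDOW_BP):
    """Label a contact 'cis' or 'trans' relative to the lncRNA's own locus."""
    if lncrna.chrom != target.chrom:
        return "trans"
    # Gap between the two intervals; 0 if they overlap.
    gap = max(lncrna.start - target.end, target.start - lncrna.end, 0)
    return "cis" if gap <= cis_window else "trans"


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    lncrna = Locus("chr12", 1_000, 3_200)
    print(classify_interaction(lncrna, Locus("chr12", 50_000, 51_000)))
    print(classify_interaction(lncrna, Locus("chr2", 50_000, 51_000)))
```

Under this rule a distant contact on the same chromosome is treated as trans; whether that is appropriate depends on how the workflow defines cis, so the window is left as a parameter.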

Desired Skills 

Understanding of molecular biology and epigenetics 

R, Bash and Python 3. 

Pathogen Prediction Using Neural Networks Trained On Chest X-Ray Images

Jon Ambler, Ayush Garg, Sylvia Lee, Ishaan Narang and Volunteers


Pathogens that affect the respiratory system often damage a person’s lungs. Comparing the radiograph of an affected person with that of a healthy person reveals traces that can help identify the pathogen in question and thus streamline the treatment process.


In this project we set out to train a neural network on chest radiograph images. The training will be done using Google’s AutoML Vision, which automatically determines the best model to use depending on the dataset provided. This model will be connected to a front-end application built using Flask.


We welcome individuals who are adept in one or more of the following skills:

  • Python
  • OpenCV, TensorFlow and Keras
  • HTML, CSS and JavaScript for the front end
  • Data Collection
  • Radiograph analysis
  • Flask
  • Google Cloud Platform

Our top priority will be to produce a finely curated dataset and a user-friendly front end that make the project readily accessible to the public.

To help contribute and join the project please email at

Predicting gene expression from chromatin state and 3D chromosomal conformation data: beyond the ABC?

Mikhail Spivakov and lab (MRC LMS; Imperial College London) 


DNA regulatory elements such as enhancers are often located large genomic distances away from the genes they control (up to millions of base pairs). It is generally accepted however that these elements typically exert their effects by coming into physical proximity with their target genes through 3D chromosomal contacts. 

The recently proposed ABC model (Fulco et al., Nat Genet 2019) predicts gene expression using data on the activity of enhancers and their contacts with gene promoters in 3D. 

As a source of 3D chromosomal contacts, the ABC model uses data from Hi-C (Lieberman-Aiden et al., Nature 2009), a high-throughput biochemical technique that is based on proximity ligation of DNA fragments that co-localise in 3D, followed by detection by sequencing. This technique can theoretically profile the contacts between all pairs of DNA fragments in a cell’s nucleus, but it suffers from limited sensitivity because enormous amounts of sequencing are required to detect these contacts robustly. Various techniques have been developed to enrich the material for contacts of interest, including Capture Hi-C (Schoenfelder et al., Genome Res 2015; Mifsud et al., Nat Genet 2015) and HiChIP (Mumbach et al., Nat Meth 2016).


The aim of this project is to test whether using Capture Hi-C data processed in different ways instead of low-resolution Hi-C improves the predictive power of the ABC model against gold-standard CRISPR-mediated repression datasets.
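For orientation, the ABC score of an element for a gene is its activity multiplied by its contact frequency with the gene's promoter, divided by the sum of the same product over all candidate elements near that gene (within 5 Mb in Fulco et al.). The sketch below computes this for one gene from pre-selected elements. The input format, element names and numbers are hypothetical, and selecting elements within the window is assumed to have been done beforehand. Swapping Hi-C for Capture Hi-C would change only the contact values passed in.

```python
def abc_scores(elements):
    """Compute ABC scores for one gene.

    `elements` is a list of (name, activity, contact) tuples for the
    candidate elements near the gene. Returns {name: score}; the scores
    sum to 1 whenever at least one activity * contact product is non-zero.
    """
    total = sum(activity * contact for _, activity, contact in elements)
    if total == 0:
        return {name: 0.0 for name, _, _ in elements}
    return {
        name: activity * contact / total
        for name, activity, contact in elements
    }


if __name__ == "__main__":
    # Hypothetical activities and contact frequencies for three elements.
    candidates = [("e1", 4.0, 0.5), ("e2", 2.0, 0.5), ("e3", 4.0, 0.25)]
    print(abc_scores(candidates))  # {'e1': 0.5, 'e2': 0.25, 'e3': 0.25}
```

Because the score is a ratio, only the relative contact values across elements of the same gene matter, which is one reason differently processed Capture Hi-C data could rank elements differently.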

Desired skill level

This project requires an intermediate level of R or another scripting language, experience of working with large datasets and epigenomic analysis tools. Basic understanding of gene regulation and genomic assays is desirable. 

Creating a database for the base composition of published sequencing libraries

The base composition of sequencing reads depends on the library type (RNA, genomic, bisulfite, ChIP, etc.) and the species, and can often be characteristic for a particular sequencing application. For a while we’ve been thinking about a quality control tool that checks if a given base composition matches the expected base composition for the application. In other words, does my library look like it is supposed to? Some of the code from my hackathon project last year (Charades) could easily be adapted to put a given base composition into the wider context, but what’s missing is a collection of base compositions for a variety of sequencing libraries. The immediate task would be to think about how best to collect library base compositions and match them up with metadata about library type for a variety of published applications. I will most likely be working on a different project this year, but I’d be happy to join in with discussions and provide ideas.
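To make the idea concrete, the sketch below computes the overall base composition of a set of reads and finds the closest entry in a small reference collection. It is not based on the Charades code: it uses overall rather than per-position composition, Euclidean distance is an arbitrary choice of similarity measure, and the reference compositions are invented numbers standing in for the collection this project would build.

```python
import math
from collections import Counter

BASES = "ACGT"


def base_composition(reads):
    """Return the fraction of A, C, G and T over all reads (N etc. are ignored)."""
    counts = Counter()
    for read in reads:
        counts.update(base for base in read.upper() if base in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/C/G/T bases found in reads")
    return {base: counts[base] / total for base in BASES}


def closest_library_type(composition, references):
    """Return the reference library type nearest to `composition`."""
    def distance(name):
        ref = references[name]
        return math.sqrt(sum((composition[b] - ref[b]) ** 2 for b in BASES))
    return min(references, key=distance)


if __name__ == "__main__":
    # Invented reference compositions, for illustration only.
    references = {
        "genomic": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "bisulfite": {"A": 0.30, "C": 0.02, "G": 0.20, "T": 0.48},
    }
    composition = base_composition(["ACGT", "AACC", "GGTT"])
    print(composition, closest_library_type(composition, references))
```

In a real tool the reference dictionary would be replaced by compositions collected from published libraries together with their metadata.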

Compiling Life Sciences Training Datasets

The Babraham Bioinformatics group has compiled numerous life sciences datasets with a view to using these in future training courses.  These datasets have also been made publicly available to assist other researchers and students with learning data analysis.

The training materials are stored in a GitHub repository at:

There is also a web interface for these projects at:

The aim of this project is to:

  • Write simple R scripts to summarise datasets used in our training courses.  Following the hackathon, we aim to incorporate many of these extra plots in our GitHub repository.
  • Search for new data to incorporate into our repository.

Desired Skill Level
This project is aimed at people who have a basic understanding of R and can use the language to generate simple plots.

Possible Extensions
It would be nice to replicate what has been achieved in R using Python/Matplotlib.  This extension project is open to people who have these Python/Matplotlib skills.

Cambridge-India openVirus

Peter Murray-Rust, (Chemistry), Gita Yadav (Plant Sciences) and interns in India

openVirus is a team of Young Indian Scientists who have built tools to mine the scientific literature for new insights into Viral Epidemics.
Solutions to COVID may be lying in the literature of previous epidemics or the vast new output of COVID papers. The project has many facets and is very suitable for anyone interested in extracting and analysing masses of scientific articles.

Our facets (X in “viral epidemics and X”) include:

  • what countries are epidemics reported in?
  • what drugs are used?
  • what comorbidities occur?
  • who funds research into viruses?
  • what viruses are involved?
  • what is the role of zoonosis (animal hosts)?
  • who reports Test and Trace strategies?
  • what non-pharma interventions are used (quarantine, social distancing, masks)?

We build “minicorpora” for all of these using EuropePMC, and ontologies using Wikidata.

Among the skills that delegates can learn without previous programming experience are:

  • repositories (EuropePMC) and searching (including REST)
  • creation of ontologies (dictionaries) using Wikidata and SPARQL
  • Dockerised containers
  • Jupyter notebooks
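As a small illustration of REST searching, the sketch below builds (but does not send) a Europe PMC search URL for one facet of “viral epidemics and X”. The way the query is phrased, the helper names and the example terms are assumptions made for this sketch, not the openVirus tooling; the endpoint and parameter names should be checked against the Europe PMC REST documentation.

```python
from urllib.parse import urlencode

# Europe PMC RESTful search endpoint.
EUROPEPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_query(facet_terms):
    """Combine facet terms into a query: "viral epidemics" AND (t1 OR t2 ...)."""
    facet = " OR ".join(f'"{term}"' for term in facet_terms)
    return f'"viral epidemics" AND ({facet})'


def build_search_url(facet_terms, page_size=25):
    """Return a URL-encoded search URL asking for JSON results."""
    params = {
        "query": build_query(facet_terms),
        "format": "json",
        "pageSize": page_size,
    }
    return f"{EUROPEPMC_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Example facet: non-pharma interventions.
    print(build_search_url(["quarantine", "social distancing", "masks"]))
```

The terms for each facet could come from a Wikidata-derived dictionary instead of a hand-written list.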

A mini-review can be carried out in 2-3 hours.

If you’re interested in developing technology (probably scripting – R, Python, KNIME) we’d love contributions on

  • text-based search (Lucene)
  • Natural Language Processing (nltk, OpenNLP)
  • data display (e.g. matplotlib, D3.js)
  • Machine Learning (Keras, word2vec)
  • multilingual documents (Hindi, Urdu, Tamil, and Portuguese / Spanish – we have a collaboration with Redalyc repository in Latin America)

There is extensive documentation and there will be project members available for the working day (up to say 1700 BST, 2130 India Standard Time and PMR till later in UK).

overview slides at: