Brain Learning Unicorn Project

By Elise Douard
Published on June 12, 2020

"Can a model predict the genetic profile of an individual based on brain regions volumes? There is growing evidence suggesting that genetic variations formally associated to neurodevelopmental disorders have significant effects on brain structures. In this project, the performance of three classifiers will be compared when predicting the genetic status of individuals from brain region volumes in a highly imbalanced dataset (UK BioBank cohort)."

BLUP: Brain Learning Unicorn Project

Click HERE to see this readme as a website.

Can a model predict the genetic profile of an individual based on brain regions volumes?

Trying to deal with new stuff learned at the amazing Brainhack school

Author: Elise Alix Douard

Other project contributors: BHS, Hannah Kiesow @hannahmaykiesow (also working on UKBiobank), Kuldeep Kumar @meetkd007 (extracting data from UKBB servers)

Personal presentation

Welcome to this project dear unicorn student !


I am Elise, Ph.D. student in neurosciences at the UdeM since near to 4 years, and working on the contribution of genetic to neurodevelopmental disorders as autism. I don't really fit in a specific domain (genetic/cognitive neurosciences/psychology). I guess it is what we call a unicorn student? Currently, I am working with genetic data (Copy Number Variants), clinical phenotypes and doing a lot of statistics and graphs on R. But my initial formation was in cognitive neurosciences where I started to work with multimodal data (combination of Arterial Spin Labelling MRI data and Eye-tracking data).

Since I started my Ph.D., I never used MRI data nor python, and I am here to take a revenge on that.


  • Data management (feed me with multimodal data plz)
  • Statistics
  • Debugging codes

Project definition


Source: Illustration inspired from content and adapted on adobe illustrator

Copy number variants (CNVs) are a family of structural variation of the chromosomes. They can be either a loss or a gain of a chromosome portion in comparison to a genome of reference. Sometimes, CNVs can be pathogenic, meaning that they are formally associated to neurodevelopmental or psychiatric disorders, such as autism spectrum disorders (ASD), Schizophrenia (SZ) or intellectual disability (ID). Such pathogenic CNVs have been associated to significant alterations of brain volume (Modenato et al., 2020 ; Martin-Brevet et al., 2018 ; Maillard et al., 2015) or connectivity (Moreau et al., 2019). Notably, there were common alterations of the insula volume when comparing structural brain alterations due to pathogenic CNVs and due to a neurodevelopmental disorder (e.g. ASD or SZ) (Cauda et al., 2017 ; Goodkind et al., 2015).


Can a model predict the genetic profile of an individual based on brain regions volumes?


  • Main goal:

This project aims to feed a machine learning model with brain volumes to predict if an individual is carrier of a potentially pathogenic CNV.

Source: Illustration inspired from content and adapted on adobe illustrator

  • Subgoals:

  • Making minimal change to the distribution shape of the volumes

  • Dealing with imbalanced dataset reflecting the reality of the prevalence of pathogenic CNVs in the general population

  • Personal goals:

  • Learn how to properly share scientific content

  • Learn how to python instead of R

  • Learn how to interactive plot

  • Learn how to machine learning



Raw data

For this project, 35,759 individuals from UK Biobank with genetic and derived brain volume data were available.

  • MRI data (features): All individuals have derived-brain volumes preprocessed on Freesurfer using Desikan parcellation (68 regions) [cf. documentation for processing pipeline details].
  • Genetic data (labels): 1,265 individuals (3.5% of the cohort) were carriers of at least one of 93 potentially pathogenic CNV identified by Kendal et al. (2019). These individuals will constitute the carriers class, all the other will be in the controls class.
  • Confounders: age, sex, total intracranial volume (TIV) and site of MRI acquisition were available for all individuals.

Data transformation

For all individuals, the 68 region volumes were adjusted for potential confounder effects.

Table 1: Description of the confounders

GoupNMean age (sd)Mean TIV (sd)N FemaleN MaleN Site 1N Site 2N Site 3
Carriers126563.8 (7.4)1540824.3 (150493.6)671594781320164
Controls3449464.1 (7.6)1549091.7 (151512.6)18280162142141186074476

Final dataset: training and test sets

A subgoal of the project was to make minimal changes to the features used in the machine learning models. The final volumes used as features were the ones adjusted for the confounder effects without z-scoring. Another subgoal was to deal with imbalanced dataset which reflect the reality of carriers prevalence in the general population. As an alternative, the imbalance was reduced by pseudo-randomly resampling the controls. The final dataset included the 1,265 carriers and twice more controls.

Click on the following images to open interactive pie-charts:

Figure 1: Proportion of controls and carriers in the training and test sets used for the machine learning models.


Machine learning model used

  • Random Forest Classifier using sklearn.ensemble.RandomForestClassifier [documentation]
  • Random Forest with Random Undersampling using imblearn.ensemble.BalancedRandomForestClassifier [documentation]
  • Gradient-Boosted Classification Tree using sklearn.ensemble.GradientBoostingClassifier [documentation]

Step followed for the three machine learning models

  • Step 1: Training the model on the training set
  • Step 2: Test the model on a validation set
  • Step 3: Cross validation (n fold = 10)
  • Step 4: Playing with the hyperparameters max_depth and n_estimators (and learning_rate for the gradient-boosted tree classifier)
  • Step 5: Testing the model on the test set

Tools (used and learned)

The project rely on the following technologies:

  • Visual Studio Code
  • Git/Github
  • Python (libraries: pandas, numpy, nilearn, nibabel, sklearn, scipy, random, seaborn, plotly, matplotlib, ipywidgets, itertools)
  • Jupyter notebook (Including Markdown, Python code and Slides using the Rise extension)


In this GitHub repository:

  • README: Readme file for this project repository
  • Welcoming_vid.mp4: A non-relevant video of welcoming
  • Prep_visualization_ML_allinone.ipynb: Jupyter notebook with all the instructions and code lines for the data preparation and visualization, the machine learning models training and testing and the results visualization
  • requirements.txt: List of all the package used for this project
  • Slides_Final_week.ipynb: Jupyter notebook using Rise extension for the final presentation of the project
  • Data_for_plots/: Data used for two interactive figures (3D Desikan parcellation map and pie-cart of the sex and site distribution in the carriers and controls)
  • Interactive_plots/: Folder with the html object of the interactive figures
  • Slide_content/: Images used for the Slides_Final_week.ipynb file

Week 3 deliverable: data visualization

link to the blog created for the assignment:


Results of the models

Figure 2: Results after testing the models (step 5)


As you can see in the Figure 2, none of the models were better than chance to classify the carriers and the controls. BUT it is not a problem because… I learned a lot doing this project!

Progress overview

Week one:

  • Learning tons of new concepts

Week two:

  • 18 may 2020 Creating the README file
  • 20 May 2020 5 min presentation of the project draft using README file
  • 21 may 2020 Starting to play with data on python and creating the main jupyter notebook

Week three:

  • 29 may 2020 Finalization of data preparation (removing confounders) and visualization, starting ML part

Week four:

  • 3 may 2020 Starting slides on jupyter notebook using Rise extension and creating project illustrations on adobe illustrator
  • 5 may 2020 Final presentation of the project with UKBioBank data and the results of the machine learning models

Week five:

  • 8 may 2020 Final push of the blog for the visualization assignment
  • 12 may 2020 Final push of the project on Github and the website

Conclusion and acknowledgement

This BHS project allows me to learn a lot of concepts and tools concerning the open science. It was also a nice introduction to machine learning models. Hopefully, I will spread the word and I will surely include all these new tools in my practice. I am more than grateful toward all the instructors, mentors and students, who shared their knowledge.

Thanks for this enriching experiment!



Cauda F. et al., “Are Schizophrenia, Autistic, and Obsessive Spectrum Disorders Dissociable on the Basis of Neuroimaging Morphological Findings?: A Voxel-Based Meta-Analysis.” Autism Research 10, no. 6 (2017): 1079–95.

Goodkind M. et al., “Identification of a Common Neurobiological Substrate for Mental Illness.” JAMA Psychiatry 72, no. 4 (April 2015): 305–15.

Kendall K. M. et al., “Cognitive Performance and Functional Outcomes of Carriers of Pathogenic Copy Number Variants: Analysis of the UK Biobank.” The British Journal of Psychiatry 214, no. 5 (May 2019): 297–304.

Maillard A. M. et al., “The 16p11.2 Locus Modulates Brain Structures Common to Autism, Schizophrenia and Obesity.” Molecular Psychiatry 20, no. 1 (February 2015): 140–47.

Martin-Brevet S. et al., “Quantifying the Effects of 16p11.2 Copy Number Variants on Brain Structure: A Multisite Genetic-First Study.” Biological Psychiatry, 84, no. 4 (2018): 253–64.

Modenato C. et al., “Neuropsychiatric Copy Number Variants Exert Shared Effects on Human Brain Structure.” MedRxiv, April 17, 2020, 2020.04.15.20056531.

Moreau C. et al., “Neuropsychiatric Mutations Delineate Functional Brain Connectivity Dimensions Contributing to Autism and Schizophrenia.” BioRxiv, (2019), 862615.

See also these similar projects

Using ALE algorithm and machine learning to classify need and desire states

The goal of this project is to make a classification of the needs and desires states from studies fMRI data

Discover this project

Introduction to data visualization in Python

In this module, we will introduce the basics of plotting in python with some of most commonly used packages such as matplotlib and seaborn.

Discover this project