The molecular characterisation of complex behaviours is challenging because a range of different factors often combine to produce the observed phenotype. An established approach, known as ‘neurogenomics’, is to examine the overall expression levels of brain genes and select the best candidates that associate with patterns of interest. However, traditional neurogenomic analyses have some well-known limitations; above all, the number of biological replicates is usually small compared to the number of genes tested, a problem known as the “curse of dimensionality”. In this study we implemented a Machine Learning (ML) approach that can be used as a complement to more established methods of transcriptomic analysis. We tested three supervised learning algorithms (Random Forests, Lasso and Elastic net Regularized Generalized Linear Model, and Support Vector Machine) for their performance in characterising transcriptomic patterns and identifying genes associated with the honeybee waggle dance. We then intersected the results of these analyses with the outputs of traditional differential gene expression analyses and identified two promising candidates for the neural regulation of the waggle dance: boss and hnRNP A1. Overall, our study demonstrates the application of Machine Learning to analyse transcriptomic data and identify candidate genes underlying social behaviour. This approach has great potential for application to a wide range of scenarios in evolutionary ecology when investigating the genomic basis of complex phenotypic traits. It can also offer some clear advantages over the established tools of gene expression analysis, making it a valuable complement for future studies.
Bibliographical note
We thank Dr Georgios Leontidis (The School of Natural and Computing Science, University of Aberdeen) for his valuable support during the selection and implementation of the ML models, and the two anonymous reviewers for providing useful feedback that helped improve the clarity and soundness of the manuscript. We are also grateful to NERC (Natural Environment Research Council) for funding this project and supporting MV’s salary over 10 weeks through their Research Experience Placement programme (DTG reference: NE/S007377/1). The honeybee work that was performed to obtain the sequencing data used in this study was funded by the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant no. 638873 to EL). This funding also supported FM during the execution of the field and molecular work.
Data Availability Statement
All code used in the analyses reported here is available in a GitHub repository associated with this project: https://github.com/Vejni/WaggleDance_MachineLearning. The raw sequencing data that represent the starting material for the analyses described here have been deposited in the NCBI SRA (BioProject PRJNA756776).
- feature selection
- gene structure and function
- social evolution