Mining whole genome sequence data to efficiently attribute individuals to source populations

Francisco J. Pérez-Reche; Ovidiu Rotariu; Bruno S. Lopes; Ken J. Forbes; Norval J.C. Strachan

doi:10.1038/s41598-020-68740-6

Corresponding author: Francisco J. Pérez-Reche

University of Aberdeen

Research output: Contribution to journal › Article › peer-review

Abstract

Whole genome sequence (WGS) data could transform our ability to attribute individuals to source populations. However, methods that efficiently mine these data are yet to be developed. We present a minimal multilocus distance (MMD) method which rapidly deals with these large data sets, as well as methods for optimally selecting loci. These methods were applied to WGS data to determine the source of human campylobacteriosis and the geographical origin of diverse biological species, including humans, and to proteomic data to classify breast cancer tumours. The MMD method provides a highly accurate attribution which is computationally efficient for extended genotypes. These methods are generic, easy to implement for WGS and proteomic data, and have wide application.

Original language	English
Article number	12124
Pages (from-to)	12124
Number of pages	16
Journal	Scientific Reports
Volume	10
Issue number	1
DOIs	https://doi.org/10.1038/s41598-020-68740-6
Publication status	Published - 22 Jul 2020

Bibliographical note

Acknowledgements:
The Campylobacter work in this project was supported by Food Standards Scotland project FSS00017 and the Scottish Government (Rural and Environment Science and Analytical Services Division) project A13559368.

Keywords

Bacterial evolution
Evolutionary genetics
Population genetics
Scientific data
MULTILOCUS GENOTYPES
INFECTIONS
LOCI
INFERENCE
ADMIXTURE
ANCESTRY
DIVERSITY
SELECTION
ASSIGNMENT TESTS
ENTROPY

Access to Document

10.1038/s41598-020-68740-6 (Licence: CC BY)

Pérez_et_al_SR_MiningWholeGenome_VoR
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
Final published version, 3.57 MB (Licence: CC BY)

Cite this

@article{5048ac72b0764a79900419945753dccf,
title = "Mining whole genome sequence data to efficiently attribute individuals to source populations",
abstract = "Whole genome sequence (WGS) data could transform our ability to attribute individuals to source populations. However, methods that efficiently mine these data are yet to be developed. We present a minimal multilocus distance (MMD) method which rapidly deals with these large data sets as well as methods for optimally selecting loci. This was applied on WGS data to determine the source of human campylobacteriosis, the geographical origin of diverse biological species including humans and proteomic data to classify breast cancer tumours. The MMD method provides a highly accurate attribution which is computationally efficient for extended genotypes. These methods are generic, easy to implement for WGS and proteomic data and have wide application.",
keywords = "Bacterial evolution, Evolutionary genetics, Population genetics, Scientific data, MULTILOCUS GENOTYPES, INFECTIONS, LOCI, INFERENCE, ADMIXTURE, ANCESTRY, DIVERSITY, SELECTION, ASSIGNMENT TESTS, ENTROPY",
author = "P{\'e}rez-Reche, {Francisco J.} and Ovidiu Rotariu and Lopes, {Bruno S.} and Forbes, {Ken J.} and Strachan, {Norval J.C.}",
note = "Acknowledgements: The Campylobacter work in this project was supported by Food Standards Scotland project FSS00017 and the Scottish Government (Rural and Environment Science and Analytical Services Division) project A13559368.",
year = "2020",
month = jul,
day = "22",
doi = "10.1038/s41598-020-68740-6",
language = "English",
volume = "10",
pages = "12124",
journal = "Scientific Reports",
issn = "2045-2322",
publisher = "Nature Publishing Group",
number = "1",
}

TY - JOUR
T1 - Mining whole genome sequence data to efficiently attribute individuals to source populations
AU - Pérez-Reche, Francisco J.
AU - Rotariu, Ovidiu
AU - Lopes, Bruno S.
AU - Forbes, Ken J.
AU - Strachan, Norval J.C.
N1 - Acknowledgements: The Campylobacter work in this project was supported by Food Standards Scotland project FSS00017 and the Scottish Government (Rural and Environment Science and Analytical Services Division) project A13559368.
PY - 2020/7/22
Y1 - 2020/7/22
N2 - Whole genome sequence (WGS) data could transform our ability to attribute individuals to source populations. However, methods that efficiently mine these data are yet to be developed. We present a minimal multilocus distance (MMD) method which rapidly deals with these large data sets as well as methods for optimally selecting loci. This was applied on WGS data to determine the source of human campylobacteriosis, the geographical origin of diverse biological species including humans and proteomic data to classify breast cancer tumours. The MMD method provides a highly accurate attribution which is computationally efficient for extended genotypes. These methods are generic, easy to implement for WGS and proteomic data and have wide application.
AB - Whole genome sequence (WGS) data could transform our ability to attribute individuals to source populations. However, methods that efficiently mine these data are yet to be developed. We present a minimal multilocus distance (MMD) method which rapidly deals with these large data sets as well as methods for optimally selecting loci. This was applied on WGS data to determine the source of human campylobacteriosis, the geographical origin of diverse biological species including humans and proteomic data to classify breast cancer tumours. The MMD method provides a highly accurate attribution which is computationally efficient for extended genotypes. These methods are generic, easy to implement for WGS and proteomic data and have wide application.
KW - Bacterial evolution
KW - Evolutionary genetics
KW - Population genetics
KW - Scientific data
KW - MULTILOCUS GENOTYPES
KW - INFECTIONS
KW - LOCI
KW - INFERENCE
KW - ADMIXTURE
KW - ANCESTRY
KW - DIVERSITY
KW - SELECTION
KW - ASSIGNMENT TESTS
KW - ENTROPY
UR - http://www.scopus.com/inward/record.url?scp=85088381352&partnerID=8YFLogxK
U2 - 10.1038/s41598-020-68740-6
DO - 10.1038/s41598-020-68740-6
M3 - Article
C2 - 32699222
AN - SCOPUS:85088381352
SN - 2045-2322
VL - 10
SP - 12124
JO - Scientific Reports
JF - Scientific Reports
IS - 1
M1 - 12124
ER -
