Functional filter for whole genome sequencing data identifies HHT and stress-associated non-coding SMAD4 polyadenylation site variants >5kb from coding DNA

Sihao Xiao* (Corresponding Author), Zhentian Kai, Daniel Murphy, Dongyang Li, Dilip Patel, Adrianna Bielowka, Maria E. Bernabeu-Herrero, Awatif Abdulmogith, Andrew D. Mumford, Sarah Westbury, Micheala A. Aldred, Neil Vargesson, Mark J. Caulfield, Genomics England Research Consortium, Claire L Shovlin* (Corresponding Author)

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Despite whole genome sequencing (WGS), many single gene disorder cases remain unsolved, impeding diagnosis and preventative care for people whose disease-causing variants escape detection. Since early WGS data analytic steps prioritize protein-coding sequences, to simultaneously prioritize variants in non-coding regions rich in transcribed and critical regulatory sequences, we developed GROFFFY, an analytic tool which integrates coordinates for regions with experimental evidence of functionality. Applied to WGS data from solved and unsolved hereditary hemorrhagic telangiectasia (HHT) recruits to the 100,000 Genomes Project, GROFFFY-based filtration reduced the mean number of variants per DNA from 4,867,167 to 21,486, without deleting disease-causal variants. In three unsolved cases (two related), GROFFFY identified ultra-rare deletions within the 3’ untranslated region (UTR) of the proto-oncogene SMAD4, where germline loss-of-function alleles cause combined HHT and colonic polyposis (MIM: 175050). Sited >5.4kb distal to coding DNA, the deletions did not modify or generate microRNA binding sites, but instead disrupted the sequence context of the final cleavage and polyadenylation site necessary for protein production: by iFoldRNA, an AAUAAA-adjacent 16-nucleotide deletion brought the cleavage site into inaccessible neighboring secondary structures, while a 4-nucleotide deletion unfolded the downstream RNA polymerase II roadblock. SMAD4 RNA expression differed from control-derived RNA in resting and cycloheximide-stressed peripheral blood mononuclear cells. Patterns predicted the mutational site for an unrelated HHT/polyposis-affected individual, where a complex insertion was subsequently identified. In conclusion, we describe a functional rare variant type that impacts regulatory systems based on RNA polyadenylation. Extension of coding sequence-focused gene panels is required to capture these variants.
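
The filtering idea summarized in the abstract, keeping only variants that fall inside regions with experimental evidence of functionality, can be illustrated with a minimal Python sketch. This is not the GROFFFY implementation: the function names, the half-open coordinate convention, and the example coordinates are all hypothetical and chosen only for illustration. The final line computes the proportion of variants removed using the two mean counts reported in the abstract.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(regions):
    """Merge overlapping (chrom, start, end) half-open intervals per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        merged = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= merged[-1][1]:  # overlapping or touching: extend
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps(index, chrom, start, end):
    """True if the half-open span [start, end) overlaps any merged region."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged region that begins at or before the variant's final base.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start


def filter_variants(variants, index):
    """Keep only variants overlapping a region in the index."""
    return [v for v in variants if overlaps(index, v[0], v[1], v[2])]


# Hypothetical regions and variants (coordinates are made up).
regions = [("chr18", 100, 200), ("chr18", 150, 300), ("chr1", 10, 20)]
variants = [
    ("chr18", 250, 266),  # inside a merged region: kept
    ("chr18", 300, 304),  # starts where the region ends: dropped
    ("chr1", 19, 25),     # partially overlaps: kept
    ("chr2", 5, 6),       # chromosome has no regions: dropped
]
kept = filter_variants(variants, build_index(regions))

# Proportion of variants removed, from the mean counts in the abstract.
percent_removed = round((1 - 21486 / 4867167) * 100, 2)
```

Merging the regions first leaves a sorted, non-overlapping list, so each variant needs only one binary search rather than a scan of every region.
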
Original language: English
Pages (from-to): 1903-1918
Number of pages: 16
Journal: American Journal of Human Genetics
Volume: 110
Issue number: 11
Early online date: 9 Oct 2023
Publication status: Published - 2 Nov 2023

Bibliographical note

Acknowledgments:
This research was made possible through access to the data and findings generated by the 100,000 Genomes Project. The work was co-funded by the National Institute for Health Research Imperial Biomedical Research Centre, the D’Almeida Charitable Trust, and Imperial College Healthcare NHS Trust. AA was supported by Prince Sultan Military Medical City, Saudi Arabia. MAA was supported by the National Institutes of Health (grant R35HL140019). The 100,000 Genomes Project is managed by Genomics England Limited (a wholly owned company of the Department of Health and Social Care). The 100,000 Genomes Project uses data provided by patients and collected by the National Health Service as part of their care and support. We thank the National Health Service staff of the UK Genomic Medicine Centres and the participants for their willing participation; the Genomics England Clinical Research Interface team, specifically Susan Walker, for separately reviewing BAM file variant sequences; Charlotte Bevan, Michael Hubank and Santiago Vernia for helpful discussions and manuscript review; and our academic and public partners within the NIHR Imperial BRC’s Social Genetic and Environmental Determinants of Health (SGE) theme. We specifically thank the presented families for confirmation of their clinical phenotypes and consent to share in this manuscript. The views expressed are those of the authors and not necessarily those of funders, the NHS, the NIHR, or the Department of Health and Social Care.

Data Availability Statement

The publicly available file accession numbers used to generate the code are provided in full in supplemental methods Tables S3 and S4 and are available at the NCBI BioProject database (https://www.ncbi.nlm.nih.gov/bioproject/) under accession number PRJNA596860. Primary WGS data from the 100,000 Genomes Project, which are held in a secure research environment, are available to registered users. Please see https://www.genomicsengland.co.uk/about-gecip/for-gecip-members/data-and-data-access for further information.
