Quantifying the contribution of recessive coding variation to developmental disorders

Short title: Recessive coding causes of developmental disorders

One Sentence Summary: Recessive coding variants explain a low fraction of undiagnosed developmental disorder patients.

Hilary C. Martin1,*, Wendy D. Jones1,2, Rebecca McIntyre1, Gabriela Sanchez-Andrade1, Mark Sanderson1, James D. Stephenson1,3, Carla P. Jones1, Juliet Handsaker1, Giuseppe Gallone1, Michaela Bruntraeger1, Jeremy F. McRae1, Elena Prigmore1, Patrick Short1, Mari Niemi1, Joanna Kaplanis1, Elizabeth J. Radford1,4, Nadia Akawi5, Meena Balasubramanian6, John Dean7, Rachel Horton8, Alice Hulbert9, Diana S. Johnson6, Katie Johnson10, Dhavendra Kumar11, Sally Ann Lynch12, Sarju G. Mehta13, Jenny Morton14, Michael J. Parker15, Miranda Splitt16, Peter D Turnpenny17, Pradeep C. Vasudevan18, Michael Wright16, Andrew Bassett1, Sebastian S. Gerety1, Caroline F. Wright19, David R. FitzPatrick20, Helen V. Firth1,13, Matthew E. Hurles1, Jeffrey C. Barrett1,* on behalf of the DDD Study

1. Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, U.K.

2. Great Ormond Street Hospital for Children, NHS Foundation Trust, Great Ormond Street Hospital, Great Ormond Street, London WC1N 3JH, UK.

3. European Molecular Biology Laboratory–European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK.

4. Department of Paediatrics, Cambridge University Hospitals NHS Foundation Trust, Cambridge, U.K.

5. Division of Cardiovascular Medicine, Radcliffe Department of Medicine, University of Oxford, Oxford, U.K.

6. Sheffield Clinical Genetics Service, Sheffield Children's NHS Foundation Trust, OPD2, Northern General Hospital, Herries Rd, Sheffield, S5 7AU, U.K.

7. Department of Genetics, Aberdeen Royal Infirmary, Aberdeen, U.K.

8. Wessex Clinical Genetics Service, G Level, Princess Anne Hospital, Coxford Road, Southampton, SO16 5YA.

9. Cheshire and Merseyside Clinical Genetic Service, Liverpool Women's NHS Foundation Trust, Crown Street, Liverpool, L8 7SS, U.K.

10. Department of Clinical Genetics, City Hospital Campus, Hucknall Road, Nottingham, NG5 1PB, U.K.

11. Institute of Cancer and Genetics, University Hospital of Wales, Cardiff, U.K.

12. Temple Street Children’s Hospital, Dublin, Ireland.

13. Department of Clinical Genetics, Cambridge University Hospitals NHS Foundation Trust, Cambridge, U.K.

14. Clinical Genetics Unit, Birmingham Women's Hospital, Edgbaston, Birmingham, B15 2TG, U.K.

15. Sheffield Clinical Genetics Service, Sheffield Children's Hospital, Western Bank, Sheffield, S10 2TH, U.K.

16. Northern Genetics Service, Newcastle upon Tyne Hospitals, NHS Foundation Trust

17. Clinical Genetics, Royal Devon & Exeter NHS Foundation Trust, Exeter, U.K.

18. Department of Clinical Genetics, University Hospitals of Leicester NHS Trust, Leicester Royal Infirmary, Leicester, LE1 5WW

19. University of Exeter Medical School, Institute of Biomedical and Clinical Science, RILD, Royal Devon & Exeter Hospital, Barrack Road, Exeter, EX2 5DW, U.K.

20. MRC Human Genetics Unit, MRC IGMM, University of Edinburgh, Western General Hospital, Edinburgh EH4 2XU, U.K.

* Corresponding authors [email protected] and [email protected]

We estimated the genome-wide contribution of recessive coding variation from 6,040 families from the Deciphering Developmental Disorders study. The proportion of cases attributable to recessive coding variants was 3.6% in patients of European ancestry, compared to 50% explained by de novo coding mutations. It was higher (31%) in patients with Pakistani ancestry, due to elevated autozygosity. Half of this recessive burden is attributable to known genes. We identified two genes not previously associated with recessive developmental disorders, KDM5B and EIF3F, and functionally validated them with mouse and cellular models. Our results suggest that recessive coding variants account for a small fraction of currently undiagnosed non-consanguineous individuals, and that the roles of noncoding variants, incomplete penetrance, and polygenic mechanisms need further exploration.

Large-scale sequencing studies of phenotypically heterogeneous rare disease patients can discover new disease genes (1–3) and characterise the genetic architecture of such disorders. In the Deciphering Developmental Disorders (DDD) study, we previously estimated the fraction of patients with a causal de novo coding mutation in both known and as-yet-undiscovered disease genes to be 40-45% (4), and here we extend this approach to recessive variants. It has been posited that there are thousands of as-yet-undiscovered recessive intellectual disability (ID) genes (5, 6), which could imply that recessive variants explain a large fraction of undiagnosed rare disease cases. However, attempts to estimate the prevalence of recessive disorders have been restricted to known disorders (7) or known pathogenic alleles (8). Here, we quantify the total autosomal recessive coding burden using a robust and unbiased statistical framework in 6,040 exome-sequenced DDD trios from the British Isles. Our approach provides a better-calibrated estimate of the exome-wide burden of recessive disease than previously published methods (3, 9).

We analysed 5,684 European and 356 Pakistani probands (EABI, PABI - European or Pakistani Ancestry from the British Isles; Fig. S1, S2) with developmental disorders (DDs). The clinical features are heterogeneous and representative of genetically undiagnosed DD patients from British and Irish clinical genetics services: 88% have an abnormality of the nervous system, and 88% have multiple affected organ systems (Fig. 1, Fig. S3, Table S1). Clinical features are largely similar between EABI and PABI (Fig. 1, Table S1).

To assess the genome-wide recessive burden, we compared the number of rare (minor allele frequency, MAF, <1%) biallelic genotypes observed in our cohort to the number expected by chance (10). We used the phased haplotypes from unaffected DDD parents to estimate the expected number of biallelic genotypes. Reassuringly, the number of observed biallelic synonymous genotypes matched the expectation (Fig. S4). We observed no significant burden of biallelic genotypes of any consequence class in 1,389 probands with a likely diagnostic de novo, inherited dominant or X-linked variant. We therefore evaluated the recessive coding burden in the remaining 4,318 EABI and 333 PABI probands. This “undiagnosed” cohort was more likely to have a recessive cause because these probands did not have a likely dominant or X-linked diagnosis (11), had at least one affected sibling, or had >2% autozygosity (Fig. 2A). As expected due to their higher autozygosity (Fig. S5), PABI individuals had more rare biallelic genotypes than EABI individuals (Fig. 2A); 92% of these were homozygous (rather than compound heterozygous), versus only 28% for the EABI samples. We observed a significant enrichment of biallelic loss-of-function (LoF) genotypes in both undiagnosed ancestry groups (Poisson p=3.5×10⁻⁵ in EABI, p=9.7×10⁻⁷ in PABI), and, in the EABI group, a nominally significant enrichment of biallelic damaging missense genotypes (p=0.025) and a significant enrichment of compound heterozygous LoF/damaging missense genotypes (p=6×10⁻⁷) (Fig. 2A).
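As an illustration of why autozygosity raises the expected number of rare biallelic genotypes, the following minimal sketch applies the textbook inbreeding-adjusted genotype frequency, (1 − a)f² + af, where f is the cumulative frequency of rare variants of a given class on a haplotype and a is the probability that the two haplotypes are identical by descent at the gene. This is not the estimation procedure used in this study, which relies on phased parental haplotypes (10); the function name and the example values of f and a are assumptions chosen only for illustration.

```python
def expected_biallelic_probability(f: float, a: float) -> float:
    """Chance that both haplotypes of a gene carry a rare variant.

    f: cumulative rare-variant frequency per haplotype (hypothetical input)
    a: probability that the two haplotypes are identical by descent
    """
    if not (0.0 <= f <= 1.0 and 0.0 <= a <= 1.0):
        raise ValueError("f and a must both lie in [0, 1]")
    # Non-autozygous case: two independent draws, each rare with probability f.
    # Autozygous case: one draw, copied onto both haplotypes.
    return (1.0 - a) * f * f + a * f


# Hypothetical gene in which 0.1% of haplotypes carry a rare variant.
outbred = expected_biallelic_probability(f=0.001, a=0.0)
# 1/16 is the expected autozygous fraction for offspring of first cousins.
cousin = expected_biallelic_probability(f=0.001, a=1 / 16)
fold_increase = cousin / outbred

print(f"outbred: {outbred:.3g}, first-cousin offspring: {cousin:.3g}")
print(f"fold increase: {fold_increase:.1f}")
```

Under these assumed values the autozygous term af dominates, which is the same qualitative reason that most rare biallelic genotypes in a highly autozygous group are homozygous rather than compound heterozygous.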

Amongst the 4,651 EABI+PABI undiagnosed probands, a set of 903 clinically-curated DD-associated recessive genes showed a higher recessive burden (Fig. S6; 1.7-fold; Poisson p=6×10⁻¹⁸) than average (1.1-fold for all genes). Indeed, 48% of the observed excess of biallelic genotypes lay in these known genes. By contrast, we did not observe any recessive burden in 243 DD-associated genes with a dominant LoF mechanism, nor in any gene sets tested in the 1,389 diagnosed probands (Poisson p>0.05).

We developed a method to estimate the proportion of probands with a causal variant in a particular genotype class (10) in both known and as-yet-undiscovered genes. Unlike our previously published approach (4), this method accounts for the fact that some fraction of the variants expected by chance are actually causal (Fig. S7). We estimated that 3.6% (~205) of the 5,684 EABI probands have a recessive coding diagnosis, compared to 49.9% (~2,836) with a de novo coding diagnosis. Recessive coding genotypes explain 30.9% (~110) of the 356 PABI individuals, compared to 29.8% (~106) for de novos. The contribution from recessive variants was higher in EABI probands with affected siblings than those without (12.0% of 117 versus 3.2% of 5,098), and highest in PABI probands with high autozygosity (47.1% of 241) (Fig. 2B; Table S2). In contrast, it did not differ between 115 PABI probands with low autozygosity and all 5,684 EABI probands.
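The approximate proband counts quoted above follow directly from the estimated proportions and the cohort sizes. The short sketch below reproduces that arithmetic and pools the two ancestry groups; the dictionary layout and variable names are illustrative assumptions, and the inputs are the rounded percentages given in the text rather than the underlying estimates in Table S2.

```python
# Cohort sizes and estimated diagnostic proportions, as quoted in the text.
cohorts = {
    "EABI": {"n": 5684, "recessive": 0.036, "de_novo": 0.499},
    "PABI": {"n": 356, "recessive": 0.309, "de_novo": 0.298},
}

# Approximate number of probands explained by each class in each group.
counts = {
    name: {cls: round(c["n"] * c[cls]) for cls in ("recessive", "de_novo")}
    for name, c in cohorts.items()
}

# Pool both groups to get the recessive fraction across the whole cohort.
total_probands = sum(c["n"] for c in cohorts.values())
total_recessive = sum(group["recessive"] for group in counts.values())
pooled_recessive_percent = round(100 * total_recessive / total_probands, 1)

for name, group in counts.items():
    print(name, group)
print(f"pooled recessive fraction: {pooled_recessive_percent}% of {total_probands}")
```

Pooling the rounded counts in this way gives a recessive fraction of roughly 5% across the 6,040 probands, in line with the combined EABI+PABI figure reported in the concluding paragraph.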

We caution that the PABI results may be less reliable due to modest sample size (note the wide confidence intervals in Table S2), exacerbated by consistent overestimation of rare variant frequencies in our limited sample of parents. Reassuringly, our estimated recessive contribution in PABI is close to the 31.5% reported in Kuwait (12), which has a similar level of consanguinity (13). Our results are consistent with previous reports of a low fraction of recessive diagnoses in European cohorts (3, 11, 14), but unlike those studies, our estimates further show that the recessive contribution in as-yet-undiscovered genes is also small. While it has been hypothesised that there are thousands of undiscovered recessive DD-associated genes (5, 6), our analyses suggest that the cumulative impact of these discoveries on diagnostic yield will be modest in non-consanguineous populations.

We next tested each gene for an excess of biallelic genotypes in the undiagnosed probands (Table S3) (10). Three genes passed stringent Bonferroni correction (p<3.4×10⁻⁷) (10): THOC6 (previously reported (15)), EIF3F, and KDM5B. Thirteen additional genes had p<10⁻⁴ (Table S4), of which eleven are known recessive DD-associated genes, and known genes were enriched for lower p-values (Fig. S8).

We observed five probands with an identical homozygous missense variant in EIF3F (binomial p=1.2×10⁻¹⁰) (ENSP00000310040.4:p.Phe232Val), plus four additional homozygous probands who had been excluded from our discovery analysis for various reasons (Table S5). The variant (rs141976414) has a frequency of 0.12% in non-Finnish Europeans (one of the most common protein-altering variants in the gene), and no homozygotes were observed in gnomAD (16).

All nine individuals homozygous for Phe232Val had intellectual disability (ID) and a subset also had seizures (6/9), behavioral difficulties (3/9) and sensorineural hearing loss (3/9) (Table S5). There was no obvious distinctive facial appearance (Fig. S9). EIF3F encodes a subunit of the mammalian eIF3 (eukaryotic initiation factor) complex, which negatively regulates translation. The genes encoding eIF2B subunits have been implicated in severe autosomal recessive neurodegenerative disorders (17). We edited iPSC lines with CRISPR-Cas9 to be heterozygous or homozygous for the Phe232Val variant, and Western blots showed that EIF3F protein levels were ~27% lower in homozygous cells relative to heterozygous and wild-type cells (Fig. S10), which may be due to reduced protein stability (Fig. S11). The Phe232Val variant significantly reduced translation rate (Fig. 3A, Fig. S12). Proliferation rates were also reduced in the homozygous but not heterozygous cells (Fig. 3B, Fig. S13), although the viability of the cells was unchanged (Fig. S14).

Another recessive gene we identified was KDM5B (binomial p=1.1×10⁻⁷) (Fig. 4), encoding a histone H3K4 demethylase. Three probands had biallelic LoFs passing our filters, and a fourth was compound heterozygous for a splice site variant and a large gene-disrupting deletion. Several of these patients were recently reported with less compelling statistical evidence (18). Interestingly, KDM5B is also enriched for de novo mutations in our cohort (4) (binomial p=5.1×10⁻⁷). We saw nominally significant over-transmission of LoFs from the mostly unaffected parents (p=0.002, transmission-disequilibrium test; Table S6), but no parent-of-origin bias. Theoretically, all the KDM5B LoFs observed in probands might be acting recessively and heterozygous probands may have a second (missed) coding or regulatory hit or modifying epimutation. However, we found no evidence supporting this (see (10); Fig. S15, S16), nor of potentially modifying coding variants in likely interactor genes, nor that some LoFs avoid nonsense-mediated decay (Fig. 4B). Genome-wide levels of DNA methylation in whole blood did not differ between probands with different types of KDM5B mutations or between these and controls (Fig. S17).

These lines of evidence, along with previous observations of KDM5B de novos in both autism patients and unaffected siblings (19), suggest that heterozygous LoFs in KDM5B are pathogenic with incomplete penetrance, while homozygous LoFs are likely fully penetrant. Several microdeletions (20) and LoFs in other dominant ID genes are incompletely penetrant (21). Other H3K4 methylases and demethylases also cause neurodevelopmental disorders (22). KDM5B is atypical; the others are mostly dominant (22), typically with pLI scores >0.99 and very low pRec scores, whereas KDM5B has pLI=5×10⁻⁵ and pRec>0.999 (23).

KDM5B is the only gene that showed significant enrichment for both biallelic variants and de novo mutations in our study. We saw significant enrichment of de novo missense (373 observed versus 305 expected; ratio=1.25, upper-tailed Poisson p=1×10⁻⁴) but not de novo LoF mutations across all known recessive DD genes (excluding those known to also show dominant inheritance). One hypothesis is that the de novo missense mutations are acting as a “second hit” on the opposite haplotype from an inherited variant in the same gene. However, we saw only two instances of this in the cohort, and besides, if it were driving the signal, we would expect to see a burden of de novo LoFs in recessive genes too, which we do not. A better explanation is that recessive DD genes are also enriched for dominant activating mutations. There are known examples of this; e.g. in NALCN (24, 25) and MAB21L2 (26), heterozygous missense variants are activating or dominant-negative, whereas the biallelic mechanism is loss-of-function. In contrast, the six de novo LoFs in KDM5B suggest it follows a different pattern. Of the twenty-one recessive genes with nominally significant de novo missense enrichment in our data, only one showed evidence of mutation clustering using our previously published method (1) (CTC1; p=0.03), which could suggest an activating/dominant-negative mechanism. Larger sample sizes will be needed to establish which of these genes also act dominantly, and by which mechanism.
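For readers who wish to see how an upper-tailed Poisson enrichment test of this kind can be evaluated, the following is a minimal sketch using only the Python standard library. The function name is an assumption, and the inputs are the rounded observed and expected de novo missense counts quoted above, so the result should only be expected to fall near the reported p-value rather than reproduce it exactly.

```python
import math


def poisson_upper_tail(observed: int, expected: float) -> float:
    """Return P(X >= observed) for X ~ Poisson(expected).

    Terms are computed in log space and summed upwards from `observed`,
    which avoids the cancellation that 1 - CDF suffers for small tails.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    if observed <= 0:
        return 1.0
    total = 0.0
    k = observed
    while True:
        log_term = -expected + k * math.log(expected) - math.lgamma(k + 1)
        term = math.exp(log_term)
        total += term
        # Beyond the mean the terms shrink monotonically, so stop once
        # a further term can no longer change the running sum.
        if k > expected and term <= total * 1e-17:
            break
        k += 1
    return min(total, 1.0)


# Rounded counts from the text: 373 observed versus 305 expected.
p_value = poisson_upper_tail(373, 305.0)
print(f"upper-tailed Poisson p = {p_value:.2g}")
```

Summing the tail directly keeps the sketch dependency-free; a statistics library's Poisson survival function would be the more usual choice in practice.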

All four individuals with biallelic KDM5B variants have ID, variable congenital abnormalities (Table S7) and a distinctive facial appearance (Fig. S18). Other than ID, there were no consistent phenotypes or distinctive features shared between the biallelic and monoallelic individuals, or within the monoallelic group (Table S7).

We created a mouse loss-of-function model for Kdm5b. Heterozygous knockout mice appear normal and fertile, while homozygous Kdm5b-null mice are subviable (44% of expected, from heterozygous in-crosses). This partially penetrant lethality, in addition to a fully penetrant vertebral patterning defect (Fig. S19), is consistent with previously published work (27). We additionally identified numerous behavioral abnormalities in homozygous Kdm5b-null mice: increased anxiety, less sociability, and reduced long-term memory compared to wild-types (Fig. 4).

We have quantified the contribution of recessive coding variants in both known and as-yet-undiscovered genes to a large UK cohort of DD patients, and found that overall they explain a small fraction. Our methodology allowed us to carry out an unbiased burden analysis not possible with previous methods (Fig. S4). We identified two new recessive DD genes that are less likely to be found by typical studies because they result in heterogeneous and nonspecific phenotypes, and presented strong functional evidence supporting their pathogenicity.

Our results can be used to improve recurrence risk estimates for undiagnosed families with a particular ancestry and pattern of inheritance. Extrapolating our results more widely requires some care: our study is slightly depleted of recessive diagnoses since some recessive DDs (e.g. metabolic disorders) are relatively easily diagnosed through current clinical practice in the UK and less likely to have been recruited. Furthermore, country-specific diagnostic practices and levels of consanguinity may make the exact estimates less applicable outside the UK.

Overall, we estimated that identifying all recessive DD genes would allow us to diagnose 5.2% of the EABI+PABI subset of DDD, whereas identifying all dominant DD genes would yield diagnoses for 48.6%. The high proportion of unexplained patients even amongst those with affected siblings or high consanguinity suggests that future studies should investigate a wide range of modes of inheritance including oligogenic and polygenic inheritance as well as noncoding recessive variants.

References and Notes

1. Deciphering Developmental Disorders Study, Large-scale discovery of novel genetic causes of developmental disorders. Nature. 519, 223–228 (2015).

2. R. K. C. Yuen et al., Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder. Nat. Neurosci. 20, 602–611 (2017).

3. S. C. Jin et al., Contribution of rare inherited and de novo variants in 2,871 congenital heart disease probands. Nat. Genet. (2017), doi:10.1038/ng.3970.

4. Deciphering Developmental Disorders Study, Prevalence and architecture of de novo mutations in developmental disorders. Nature. 542, 433–438 (2017).

5. H. H. Ropers, Genetics of early onset cognitive impairment. Annu. Rev. Genomics Hum. Genet. 11, 161–187 (2010).

6. L. E. L. M. Vissers, C. Gilissen, J. A. Veltman, Genetic studies in intellectual disability and related disorders. Nat. Rev. Genet. 17, 9–18 (2016).

7. P. A. Baird, T. W. Anderson, H. B. Newcombe, R. B. Lowry, Genetic disorders in children and young adults: a population study. Am. J. Hum. Genet. 42, 677–693 (1988).

8. S. J. Schrodi et al., Prevalence estimation for monogenic autosomal recessive diseases using population-based genetic data. Hum. Genet. 134, 659–669 (2015).

9. N. Akawi et al., Discovery of four recessive developmental disorders using probabilistic genotype and phenotype matching among 4,125 families. Nat. Genet. 47, 1363–1369 (2015).

10. Materials and methods are available as supplementary materials at the Science website.

11. C. F. Wright et al., Genetic diagnosis of developmental disorders in the DDD study: a scalable analysis of genome-wide research data. Lancet. 385, 1305–1314 (2015).

12. A. S. Teebi, Autosomal recessive disorders among Arabs: an overview from Kuwait. J. Med. Genet. 31, 224–233 (1994).

13. G. O. Tadmouri et al., Consanguinity and reproductive health among Arabs. Reprod. Health. 6, 17 (2009).

14. C. Gilissen et al., Genome sequencing identifies major causes of severe intellectual disability. Nature. 511, 344–347 (2014).

15. C. L. Beaulieu et al., Intellectual disability associated with a homozygous missense mutation in THOC6. Orphanet J. Rare Dis. 8, 62 (2013).

16. gnomAD (available at http://gnomad.broadinstitute.org/).

17. A. Fogli, O. Boespflug-Tanguy, The large spectrum of eIF2B-related diseases. Biochem. Soc. Trans. 34, 22–29 (2006).

18. V. Faundes et al., Histone Lysine Methylases and Demethylases in the Landscape of Human Developmental Disorders. Am. J. Hum. Genet. 102, 175–187 (2018).

19. I. Iossifov et al., The contribution of de novo coding mutations to autism spectrum disorder. Nature. 515, 216–221 (2014).

20. G. L. Carvill, H. C. Mefford, Microdeletion syndromes. Curr. Opin. Genet. Dev. 23, 232–239 (2013).

21. H. H. Ropers, T. Wienker, Penetrance of pathogenic mutations in haploinsufficient genes for intellectual disability and related disorders. Eur. J. Med. Genet. 58, 715–718 (2015).

22. C. N. Vallianatos, S. Iwase, Disrupted intricacy of histone H3K4 methylation in neurodevelopmental disorders. Epigenomics. 7, 503–519 (2015).

23. M. Lek et al., Analysis of protein-coding genetic variation in 60,706 humans. Nature. 536, 285–291 (2016).

24. J. X. Chong et al., De novo mutations in NALCN cause a syndrome characterized by congenital contractures of the limbs and face, hypotonia, and developmental delay. Am. J. Hum. Genet. 96, 462–473 (2015).

25. M. D. Al-Sayed et al., Mutations in NALCN cause an autosomal-recessive syndrome with severe hypotonia, speech impairment, and cognitive delay. Am. J. Hum. Genet. 93, 721–726 (2013).

26. J. Rainger et al., Monoallelic and biallelic mutations in MAB21L2 cause a spectrum of major eye malformations. Am. J. Hum. Genet. 94, 915–923 (2014).

27. M. Albert et al., The histone demethylase Jarid1b ensures faithful mouse development by protecting developmental genes from aberrant H3K4me3. PLoS Genet. 9, e1003461 (2013).

28. GTEx Portal, (available at https://gtexportal.org/home/).

29. S. Kohler et al., Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am. J. Hum. Genet. 85, 457–464 (2009).

30. E. Bragin et al., DECIPHER: database for the interpretation of phenotype-linked plausibly pathogenic sequence and copy-number variation. Nucleic Acids Res. 42, D993–D1000 (2014).

31. E. Sheridan et al., Risk factors for congenital anomaly in a multiethnic birth cohort: an analysis of the Born in Bradford study. Lancet. 382, 1350–1359 (2013).

32. W. McLaren et al., The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).

33. 1000 Genomes Project Consortium et al., A global reference for human genetic variation. Nature. 526, 68–74 (2015).

34. A. L. Price et al., Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).

35. V. Narasimhan et al., BCFtools/RoH: a hidden Markov model approach for detecting autozygosity from next-generation sequencing data. Bioinformatics. 32, 1749–1751 (2016).

36. M. P. Conomos, A. P. Reiner, B. S. Weir, T. A. Thornton, Model-free Estimation of Recent Genetic Relatedness. Am. J. Hum. Genet. 98, 127–148 (2016).

37. K. E. Samocha et al., A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).

38. E. Krissinel, K. Henrick, Multiple alignment of protein structures in three dimensions. CompLife. 3695, 67–78 (2005).

39. E. Krissinel, K. Henrick, Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr. D Biol. Crystallogr. 60, 2256–2268 (2004).

40. S. J. Hubbard, J. M. Thornton, NACCESS-Computer Program. 1993. Department of Biochemistry and Molecular Biology, University College London.

41. W. S. J. Valdar, Scoring residue conservation. Proteins. 48, 227–241 (2002).

42. G. E. Crooks, G. Hon, J.-M. Chandonia, S. E. Brenner, WebLogo: a sequence logo generator. Genome Res. 14, 1188–1190 (2004).

43. H. Kilpinen et al., Common genetic variation drives molecular heterogeneity in human iPSCs. Nature. 546, 370–375 (2017).

44. M. Knapp, The transmission/disequilibrium test and parental-genotype reconstruction: the reconstruction-combined transmission/ disequilibrium test. Am. J. Hum. Genet. 64, 861–870 (1999).

45. D. Szklarczyk et al., The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic Acids Res. 45, D362–D368 (2017).

46. B. J. Klein et al., The histone-H3K4-specific demethylase KDM5B binds to its substrate and product through distinct PHD fingers. Cell Rep. 6, 325–335 (2014).

47. O. Delaneau, J.-F. Zagury, J. Marchini, Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods. 10, 5–6 (2013).

48. P. Mali et al., RNA-guided human genome engineering via Cas9. Science. 339, 823–826 (2013).

49. K. Gapp et al., Potential of Environmental Enrichment to Prevent Transgenerational Effects of Paternal Trauma. Neuropsychopharmacology. 41, 2749–2758 (2016).

50. C. Dias et al., BCL11A Haploinsufficiency Causes an Intellectual Disability Syndrome and Dysregulates Transcription. Am. J. Hum. Genet. 99, 253–274 (2016).

51. A. Csibi et al., The translation regulatory subunit eIF3f controls the kinase-dependent mTOR signaling required for muscle differentiation and hypertrophy in mouse. PLoS One. 5, e8994 (2010).

52. M. C. Frank, M. Braginsky, D. Yurovsky, V. A. Marchman, Wordbank: an open repository for developmental vocabulary data. J. Child Lang. 44, 677–694 (2017).

53. M. E. Smith, G. Lecker, J. W. Dunlap, E. E. Cureton, The Effects of Race, Sex, and Environment on the Age at Which Children Walk. The Pedagogical Seminary and Journal of Genetic Psychology. 38, 489–498 (1930).

54. G. L. Yamamoto et al., Rare variants in SOS2 and LZTR1 are associated with Noonan syndrome. J. Med. Genet. 52, 413–421 (2015).

55. N. A. Akawi, F. Al-Jasmi, A. M. Al-Shamsi, B. R. Ali, L. Al-Gazali, LINS, a modulator of the WNT signaling pathway, is involved in human cognition. Orphanet J. Rare Dis. 8, 87 (2013).

56. H. Najmabadi et al., Deep sequencing reveals 50 novel genes for recessive cognitive disorders. Nature. 478, 57–63 (2011).

Acknowledgements

We thank the DDD families, the Sanger Human Genome Informatics team, P. Danecek for help with bcftools/roh, K. de Lange for help with figures, K. Samocha for mutability estimates, J. Matte and G. Turner for help with experiments, and A. Sakar for patient review.

Families gave informed consent to participate, and the study was approved by the UK Research Ethics Committee (10/H0305/83, granted by the Cambridge South Research Ethics Committee and GEN/284/12, granted by the Republic of Ireland Research Ethics Committee).

Funding: The DDD study presents independent research commissioned by the Health Innovation Challenge Fund [grant number HICF-1009-003]. See Supplement for details.

Author contributions: Data analysis: H.C.M., J.F.M., J.H., P.S., N.A.; Clinical interpretation: W.D.J.; EIF3F experiments: R.M., C.P.J., M.B.; Mouse phenotyping: G.S.-A., M.S.; Protein structure modelling: J.D.S.; Data processing: G.G., M.N., J.K., C.F.W., E.R.; Experimental validation: E.P.; Patient recruitment: M. Balasubramanian, J.D., R.H., A.H., D.S.J., K.J., D.K., S.A.L., S.G.M., J.M., M.J.P., M.S., P.D.T., P.C.V., M.W.; Experimental and analytical supervision: A.B., S.S.G., C.F.W., D.R.F., H.V.F., M.E.H., J.C.B.; Writing: H.C.M., W.D.J., R.M., G.S.-A., J.S., M.B., A.B., J.H., M.E.H., J.C.B.

Competing interests: M.E.H. is a co-founder of, consultant to, and holds shares in, Congenica Ltd, a genetics diagnostic company.

Data and materials availability: Exome sequencing and phenotype data are accessible via the European Genome-phenome Archive (EGA) (Datafreeze 2016-10-03) (https://www.ebi.ac.uk/ega/studies/EGAS00001000775).

Supplementary Materials

Funding statement

Materials and Methods

Figures S1 – S20

Tables S1 – S7

References (27-54)

Figure legends

Fig. 1: Clinical features of DDD probands analysed here. Proportion of probands in different groups with clinical features indicated, extracted from HPO terms. Asterisks indicate nominally significant differences between indicated groups (Fisher’s exact test).

Fig. 2: Contribution of recessive coding variants to genetic architecture in this study. (A) Number of observed and expected biallelic genotypes per individual across all genes. Nominally significant p-values from a Poisson test of enrichment are shown. (B) Left: number of probands grouped by diagnostic category. The inherited dominant and X-linked diagnoses (narrow pink bar) include only those in known genes, whereas the proportion of probands with de novo and recessive coding diagnoses was inferred as described in (10), including those in as-yet-undiscovered genes. Right: the proportion of probands in various patient subsets inferred to have diagnostic variants in the indicated classes.

Fig. 3: Functional consequences of the pathogenic EIF3F recessive missense variant. A) The Phe232Val variant impairs translation. Plot shows median fluorescence intensity (MFI) in iPSC lines heterozygous or homozygous for or without the Phe232Val variant (correcting for replicate effects), measured using a Click-iT protein synthesis assay (10). MFI correlates with methionine analogue incorporation in nascent proteins. The p-value indicates a non-zero effect of genotype from a linear regression of MFI on genotype and replicate. Red lines: means. B) The Phe232Val variant impairs iPSC proliferation in the homozygous but not heterozygous form. Results from a cell trace violet (CTV) proliferation assay, in which CTV concentration reduces on each division. The population of cells that have been through zero, one or multiple divisions is labelled.

Fig. 4: KDM5B is a recessive DD gene in which heterozygous LoFs are incompletely penetrant. A) Summary of damaging variants found in KDM5B. B) Positions of likely damaging variants found in this and previous studies in KDM5B (ENST00000367264.2; introns not to scale), omitting two large deletions. Colors correspond to those shown in (A). There are no differences in the spatial distribution of LoFs by inheritance mode, nor in their likelihood of escaping nonsense-mediated decay by alternative splicing in GTEx (28). C-E) Behavioral defects of homozygous Kdm5b-null versus wild-type mice (n=14-16). C) Knockout mice displayed increased anxiety, spending significantly less time in the light compartment of the Light-Dark box. D) Reduced sociability in the three-chamber sociability test: knockout mice spent less time investigating a novel mouse. E) 24h memory impairment. While wild-type mice preferentially investigated an unfamiliar mouse over a familiar one, homozygous knockout mice showed no discrimination.

Supplementary Materials for

Quantifying the contribution of recessive coding variation to developmental disorders

Hilary C. Martin, Wendy D. Jones, Rebecca McIntyre, Gabriela Sanchez-Andrade, Mark Sanderson, James D. Stephenson, Carla P. Jones, Juliet Handsaker, Giuseppe Gallone, Michaela Bruntraeger, Jeremy F. McRae, Elena Prigmore, Patrick Short, Mari Niemi, Joanna Kaplanis, Elizabeth J. Radford, Nadia Akawi, Meena Balasubramanian, John Dean, Rachel Horton, Alice Hulbert, Diana S. Johnson, Katie Johnson, Dhavendra Kumar, Sally Ann Lynch, Sarju G. Mehta, Jenny Morton, Michael J. Parker, Miranda Splitt, Peter D Turnpenny, Pradeep C. Vasudevan, Michael Wright, Andrew Bassett, Sebastian S. Gerety, Caroline F. Wright, David R. FitzPatrick, Helen V. Firth, Matthew E. Hurles, Jeffrey C. Barrett on behalf of the DDD Study

correspondence to: [email protected], [email protected]

This PDF file includes:

Funding statement

Materials and Methods

Figures S1 to S20

Captions for Tables S1 to S7

Funding statement

The DDD study presents independent research commissioned by the Health Innovation Challenge Fund (grant HICF-1009-003), a parallel funding partnership between the Wellcome Trust and the UK Department of Health, and the Wellcome Trust Sanger Institute (grant WT098051). The views expressed in this publication are those of the author(s) and not necessarily those of the Wellcome Trust or the UK Department of Health. The study has UK Research Ethics Committee approval (10/H0305/83, granted by the Cambridge South Research Ethics Committee and GEN/284/12, granted by the Republic of Ireland Research Ethics Committee). The research team acknowledges the support of the National Institute for Health Research, through the Comprehensive Clinical Research Network. This study makes use of DECIPHER (http://decipher.sanger.ac.uk), which is funded by the Wellcome Trust.

Materials and Methods

Family recruitment

The DDD project recruited individuals with severe, undiagnosed developmental disorders from 24 clinical genetics centres within the United Kingdom National Health Service and the Republic of Ireland as described previously (11), using specific criteria (https://decipher.sanger.ac.uk/files/ddd/documents/policy/ddd_project_referral_guide_clinical_genetics.pdf). Families gave informed consent to participate, and the study was approved by the UK Research Ethics Committee (10/H0305/83, granted by the Cambridge South Research Ethics Committee and GEN/284/12, granted by the Republic of Ireland Research Ethics Committee). DNA was collected from saliva samples obtained from the probands and their parents, and from blood obtained from the probands, then samples were processed as previously described (1). The analyses in this paper are based on a data freeze that includes 7,831 trios from 7,446 families and 1,791 patients without parental samples (Fig. S2). These individuals include those analysed in previous publications (4, 9).

Clinical features

The patients were systematically phenotyped: detailed developmental phenotypes were recorded using Human Phenotype Ontology (HPO) terms (29), and growth measurements, family history, developmental milestones etc. were collected using a standard restricted-term questionnaire within DECIPHER (30). Some HPO terms fall under multiple organ systems (e.g. microcephaly falls under "nervous system", "head or neck" and "skeletal system"), and for Fig. 1 and Table S1, we wanted to avoid counting multiple organ systems that represented a single HPO term multiple times. Thus, we first ranked organ systems according to the number of raw counts of individuals with at least one term under that system in the full DDD cohort. We then took individuals with at least one HPO under the organ system most commonly affected, and assigned these individuals a count of one organ system. We then removed these HPOs from the patients’ lists, and identified individuals with at least one HPO in the organ system ranked next most commonly affected. We continued to count organ systems and remove HPOs for 19 non-overlapping systems. The organ systems in Fig. 1A and Table S1 were determined using this procedure (e.g. a patient with microcephaly is only included under “nervous system”, not under “head or neck” or “skeletal system”), and Fig. 1B shows the distribution of the non-redundant counts.
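The non-redundant counting procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used in the study: the function name, the input dictionaries and the alphabetical tie-breaking between equally common organ systems are all hypothetical.

```python
from collections import Counter

def nonredundant_organ_counts(patient_terms, term_to_systems):
    """patient_terms: dict patient -> set of HPO terms (hypothetical input).
    term_to_systems: dict HPO term -> set of organ systems it falls under.
    Returns the ranked organ systems and, per patient, the systems counted."""
    # Rank systems by the raw number of patients with >= 1 term under them.
    raw = Counter()
    for terms in patient_terms.values():
        systems = set()
        for term in terms:
            systems |= term_to_systems.get(term, set())
        raw.update(systems)
    # Ties are broken alphabetically (an assumption made for determinism).
    ranked = sorted(raw, key=lambda s: (-raw[s], s))

    remaining = {p: set(terms) for p, terms in patient_terms.items()}
    assigned = {p: [] for p in patient_terms}
    for system in ranked:
        for patient, terms in remaining.items():
            hits = {t for t in terms if system in term_to_systems.get(t, set())}
            if hits:
                assigned[patient].append(system)
                terms -= hits  # these terms cannot be counted under a later system
    return ranked, assigned

term_to_systems = {
    "microcephaly": {"nervous system", "head or neck", "skeletal system"},
    "seizures": {"nervous system"},
    "scoliosis": {"skeletal system"},
}
patients = {
    "p1": {"microcephaly"},
    "p2": {"seizures"},
    "p3": {"microcephaly", "scoliosis"},
}
ranked, assigned = nonredundant_organ_counts(patients, term_to_systems)
```

In this toy example, the patient with only microcephaly is counted once, under "nervous system", while the patient with microcephaly and scoliosis is counted under "nervous system" and "skeletal system".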

We note that certain DDD clinicians tend to be more diligent about reporting multiple HPO terms, and some tend to report certain organ systems more thoroughly. Thus, the small differences in clinical features between EABI and PABI (Fig. 1; Fig. S3; Table S1) could be partly because clinicians who work in areas with a high South Asian population have different reporting styles from others. The slightly greater severity of the PABI group might also partly be because, in communities enriched for cousin marriages, there is a higher number of children with genetic disorders (31), and clinical geneticists are only able to see the more severe cases.

Exome sequencing and variant quality control

Exome sequencing, alignment and calling of single-nucleotide variants and small insertions and deletions was carried out as previously described (4), as was the filtering of de novo mutations. For the analysis of biallelic genotypes, we chose thresholds for genotype and site filters to balance sensitivity (number of retained variants) and specificity (as assessed by Mendelian error rate and transition/transversion ratio). We removed sites with a strand bias test p-value < 0.001. We then set individual genotypes to missing if they had genotype quality < 20, depth < 7 or, for heterozygous calls, a p-value from a binomial test for allele balance < 0.001. Since the samples had undergone DNA capture with either the Agilent SureSelect Human All Exon V3 or V5 kit, we subsequently only retained sites that passed a missingness cutoff in both the V3 and the V5 samples. We found that, after setting a depth filter, the proportion of missing genotypes allowed had a more substantial effect on the number of Mendelian errors than genotype quality and allele balance cutoffs (Fig. S20). Thus, we ran the biallelic burden analysis on two different callsets, using a 10% (strict) or a 50% (lenient) missingness filter, and found that the results were very similar. We report results from the more lenient filter in this paper, since it allowed us to include more variants. Genotypes were set to missing for a trio if there was a Mendelian error, and variants were removed if more than one trio had a Mendelian error and if the ratio of trios with Mendelian errors to trios carrying the variant without a Mendelian error was greater than 0.1. If any of the individuals in a trio had a missing genotype at a variant, all three individuals were set to missing for that variant.
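As an illustration of the genotype-level and variant-level filters described above, a minimal Python sketch is given here. It is not the study's pipeline: the function names are hypothetical, the allele balance test is assumed to be a two-sided exact binomial test against an expected alternate-allele fraction of 0.5 with depth taken as the total read count, and a variant with Mendelian errors but no error-free carrier trios is assumed to be removed.

```python
from math import exp, lgamma, log

def _log_binom_pmf(k, n, p):
    # Log of the binomial probability mass, computed in log space for stability.
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def allele_balance_p(alt_reads, depth, p=0.5):
    """Two-sided exact binomial p-value for alt_reads out of depth reads."""
    pmf = [exp(_log_binom_pmf(k, depth, p)) for k in range(depth + 1)]
    observed = pmf[alt_reads]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-7)))

def keep_genotype(gq, depth, is_het=False, alt_reads=0):
    """Apply the thresholds from the text: GQ >= 20, depth >= 7 and, for
    heterozygous calls, allele balance p-value >= 0.001."""
    if gq < 20 or depth < 7:
        return False
    if is_het and allele_balance_p(alt_reads, depth) < 0.001:
        return False
    return True

def drop_variant(n_error_trios, n_clean_carrier_trios):
    """Remove a variant if more than one trio has a Mendelian error and the
    ratio of error trios to error-free carrier trios exceeds 0.1."""
    if n_error_trios <= 1:
        return False
    if n_clean_carrier_trios == 0:
        return True  # assumption: the ratio is treated as infinite
    return n_error_trios / n_clean_carrier_trios > 0.1
```

For example, a heterozygous call with 2 alternate reads out of 30 would be set to missing under these assumptions, whereas one with 15 out of 30 would be retained.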

Variants were annotated with Ensembl Variant Effect Predictor (32) based on Ensembl gene build 83, using the LOFTEE plugin. The transcript with the most severe consequence was selected. We analyzed three categories of variant based on the predicted consequence: (1) synonymous variants; (2) loss-of-function variants (LoFs) classed as “high confidence” by LOFTEE (including the annotations splice donor, splice acceptor, stop gained, frameshift, initiator codon and conserved exon terminus variant); (3) damaging missense variants (i.e. those not classed as “benign” by PolyPhen or SIFT, with CADD>25). Variants were also annotated with MAF data from four different populations of the 1000 Genomes Project (33) (American, Asian, African and European), two populations from the NHLBI GO Exome Sequencing Project (European Americans and African Americans) and six populations from the Exome Aggregation Consortium (ExAC) (African, East Asian, non-Finnish European, Finnish, South Asian, Latino), and an internal allele frequency generated using unaffected parents from the DDD.

Ancestry inference

We ran a principal components analysis in EIGENSOFT (34) on 5,853 common exonic SNPs defined by the ExAC project. We set genotypes with GL<20 to missing and excluded SNPs with >2% missingness. We calculated principal components in the 1000 Genomes Phase III samples and then projected the DDD samples onto them. We grouped samples into three broad ancestry groups (European, South Asian, and Other) as shown in Fig. S2 (right hand plots). By drawing ellipses around the densest clusters of DDD samples, we defined two narrower groups: European Ancestry from the British Isles (EABI) and Pakistani Ancestry from the British Isles (PABI).

For the burden and gene-based analysis, we primarily focused on these narrowly-defined EABI and PABI groups because it is difficult to accurately estimate population allele frequencies in more broadly defined groups. For example, in 4,942 European-ancestry probands, the number of observed biallelic synonymous variants was slightly higher than the number expected (ratio = 1.06; p = 2.7×10⁻⁴).

Calling autozygous regions

To call autozygous regions, we ran bcftools/roh (35) (bcftools version 1.5-4-gb0d640e) separately on the different broad ancestry groups. We LD pruned our data to avoid overcalling small runs of homozygosity as autozygous regions. Because rates of consanguinity differ dramatically between EABI and PABI, we chose r² cutoffs for each that brought the ratio of observed to expected biallelic synonymous variants with MAF<0.01 closest to 1 (see below for calculation of the number expected): PLINK options --indep-pairwise 50 5 0.4 for EABI and --indep-pairwise 50 5 0.8 for PABI.

Sample quality control and subsetting

Our strategy for filtering probands and defining different subsets is shown in Fig. S1. For our main analyses we excluded individuals without parental data (though we did incorporate them into some of the follow-up on EIF3F and KDM5B). Next, we removed 1,403 probands whose ancestry was not in the EABI or PABI groups described above. Finally we removed 388 probands that failed quality control filters. Specifically: 47 probands with >5% missingness at the SNPs used for PCA (since their ancestry assignment was likely uncertain), 7 probands with uniparental disomy, and one individual from every pair of probands who were related (kinship > 0.044, estimated by PCRelate (36), equivalent to third-degree relatives).

For estimating cumulative frequencies of rare variants, we removed 924 parents reported to be affected, since one might expect these to be enriched for damaging variants compared to the general population, and 9 European parents with an abnormally high number of rare (MAF<1%) synonymous genotypes (>834, compared to the 99.9th percentile of 223). However, we retained their offspring in the burden and gene-based analyses.

We stratified probands by high autozygosity (>2% of the genome classed as autozygous), whether or not they had an affected sibling, and whether or not they already had a likely diagnostic dominant or X-linked exonic mutation (a likely damaging de novo mutation or inherited damaging variant in a known monoallelic DDG2P gene (http://www.ebi.ac.uk/gene2phenotype/) [if the parent was affected] or a damaging X-linked variant in a known X-linked DDG2P gene). The 4,458 patients who had no such diagnostic variants were included in the “undiagnosed” set, along with 193 patients who had diagnostic variants in recessive DDG2P genes or potentially diagnostic variants in monoallelic or X-linked DDG2P genes but had high autozygosity or affected siblings. There were 1,366 EABI and 23 PABI probands in the diagnosed set, and 4,318 EABI and 333 PABI probands in the undiagnosed set. For the set of probands with affected siblings shown in Figures 1 and 2, we restricted to families from which more than one independent (i.e. non-MZ twin) child was included in DDD and in which the siblings’ phenotypes were more similar than expected by chance given the distribution of HPO terms in the full cohort (HPO similarity p-value < 0.05 (9)).

Burden analyses and gene-based tests

Variants were filtered on class (LoF, damaging missense or synonymous) and by different MAF cutoffs. Variants failing the MAF cutoff in any of the publicly available control populations, the full set of unaffected DDD parents, or the unaffected DDD parents in that population subset (PABI or EABI) were removed.

Following the approach we used previously (9), we calculated $B_{g,c}$, the expected number of rare biallelic genotypes of class c (LoF, damaging missense or synonymous) in each gene g, as follows:

$$B_{g,c} = N \lambda_{g,c}$$

where $N$ is the number of probands and $\lambda_{g,c}$ is the expected frequency of biallelic genotypes of class c in gene g, calculated as follows:

$$\lambda_{g,c} = (1 - a_g) f_{c,g}^2 + a_g f_{c,g}$$

where $f_{c,g}$ is the cumulative frequency of variants of class c in gene g with MAF less than the cutoff, and $a_g$ is the fraction of individuals autozygous at gene g. An individual was defined as being autozygous if he/she had a region of homozygosity with any overlap of gene g; in practice, autozygous regions almost always overlapped genes completely rather than partially.

The rate of LoF/damaging missense compound heterozygous genotypes is:

$$\lambda_{g,\mathrm{LoF/missense}} = 2 (1 - a_g) f_{\mathrm{LoF},g} f_{\mathrm{missense},g}$$

To calculate the cumulative frequency of variants of class c in gene g, $f_{c,g}$, we first phased the variants in the DDD parents based on the inheritance information. The cumulative frequency is then given by:

$$f_{c,g} = \frac{h_{c,g}}{H}$$

where $h_{c,g}$ is the number of parental haplotypes with at least one variant of class c in gene g, and $H$ is the total number of parental haplotypes.
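The per-gene quantities described above can be illustrated with a short Python sketch. It assumes that the expected biallelic frequency has the form (1 − a)f² + af (homozygosity by chance outside autozygous regions, plus homozygosity within them) and that the compound heterozygous rate is 2(1 − a) f_LoF f_missense; the function names and the numbers in the example are hypothetical.

```python
def cumulative_frequency(n_haplotypes_with_variant, n_haplotypes_total):
    """Fraction of parental haplotypes carrying >= 1 variant of the class."""
    return n_haplotypes_with_variant / n_haplotypes_total

def biallelic_rate(f, a):
    """Expected frequency of biallelic genotypes, given cumulative
    frequency f and the fraction a of individuals autozygous at the gene."""
    return (1 - a) * f ** 2 + a * f

def compound_het_rate(f_lof, f_missense, a):
    """Expected frequency of LoF/damaging missense compound heterozygotes."""
    return 2 * (1 - a) * f_lof * f_missense

def expected_count(n_probands, rate):
    """Expected number of biallelic genotypes among the probands."""
    return n_probands * rate

# Toy example: 20 of 10,000 parental haplotypes carry a rare LoF in a gene,
# and 0.5% of individuals are autozygous at that gene.
f_lof = cumulative_frequency(20, 10_000)
rate = biallelic_rate(f_lof, a=0.005)
expected = expected_count(5_000, rate)
```

With these toy values, most of the expected biallelic genotypes come from the autozygous term, which illustrates why the autozygosity rate matters even when it is small.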

For each gene, we calculated the binomial probability of the observed number of biallelic genotypes of class c, given the number of probands and the expected frequency of such genotypes in that gene. We did this for four consequence classes (biallelic LoF, biallelic LoF+LoF/damaging missense, biallelic damaging missense, and biallelic LoF+LoF/damaging missense+biallelic damaging missense). Our reasoning was that, in some genes, biallelic LoFs might be embryonic lethal and LoF/damaging missense compound heterozygotes might cause DD, but in other genes, including rare damaging missense variants in the analysis might drown out signal from truly pathogenic LoFs. We conducted the test on two sets of probands (EABI only, and EABI+PABI). We did not analyze PABI separately due to low power.

For the set of EABI only, we conducted a simple binomial test. For the combined EABI+PABI test, we took into account the different ways in which n or more probands with the relevant genotype could be distributed between the two groups and the probability of observing each combination using population-specific rates (e.g. two observed biallelic genotypes could be both seen in EABI, both in PABI, or one in each). We then summed these probabilities across all possible combinations to obtain an aggregate probability for sampling n or more probands by chance, as described in (9).
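A minimal Python sketch of the two tests described above is given here, assuming that the number of probands with the relevant genotype in each population is binomially distributed with a population-specific rate. The function names are hypothetical, and the sum over combinations is organised by the count in the first population, which is equivalent to enumerating every split of n or more probands between the two groups.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k probands with the genotype out of n."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_upper_tail(n_obs, n, p):
    """P(X >= n_obs) for X ~ Binomial(n, p): the simple single-population test."""
    return 1.0 - sum(binom_pmf(k, n, p) for k in range(n_obs))

def combined_upper_tail(n_obs, n1, p1, n2, p2):
    """P(X1 + X2 >= n_obs) with population-specific rates p1 and p2."""
    # Either the first population alone reaches n_obs ...
    total = binom_upper_tail(n_obs, n1, p1)
    # ... or it contributes i < n_obs and the second supplies the remainder.
    for i in range(n_obs):
        total += binom_pmf(i, n1, p1) * binom_upper_tail(n_obs - i, n2, p2)
    return total

# Toy example with hypothetical rates: two observed genotypes in total.
p_combined = combined_upper_tail(2, 4318, 1e-5, 333, 2e-4)
```

Because the p-value is obtained from the complement of a cumulative sum, very small probabilities lose some relative precision in floating point; a production implementation might sum the upper tail directly.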

For some genes, the expected frequency of biallelic genotypes was estimated to be 0 in one or both populations because there were no variants in the parents that passed filtering. The vast majority of these genes also had no observed biallelic genotypes in the probands. We dropped these genes from the tests, but still included them in our Bonferroni correction. We also excluded 715 genes either because they were in the HLA region or because they were classed as having suspiciously many or suspiciously few synonymous or synonymous+missense variants in ExAC, leaving 18,630 genes. We thus set a significance threshold of p < 0.05/(8 tests × 18,630 genes) = 3.4×10⁻⁷. For Fig. S8, we ordered the genes by their lowest p-value, randomized the order of genes with the same p-value, then tested for a difference in the distribution of ranks between recessive DDG2P genes and all other genes using a Kolmogorov-Smirnov test.
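The arithmetic behind the significance threshold can be checked with a few lines of Python. The split of the 8 tests into four consequence classes times two proband sets is our reading of the description above, and the variable names are hypothetical.

```python
alpha = 0.05
n_consequence_classes = 4   # LoF; LoF + LoF/missense; missense; all combined
n_proband_sets = 2          # EABI only; EABI + PABI
n_genes = 18_630

n_tests = n_consequence_classes * n_proband_sets * n_genes
threshold = alpha / n_tests
print(f"{n_tests} tests, threshold = {threshold:.2e}")
```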

For the burden analysis, we summed up the observed and expected number of biallelic genotypes across all genes to give $O_c$ and $B_c = \sum_g B_{g,c}$, then calculated their difference $O_c - B_c$ (the excess) and their ratio $O_c / B_c$. Under the null hypothesis, we expect $O_c$ to follow a Poisson distribution with rate $B_c$.
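The burden test described above can be sketched in Python as follows, assuming a one-sided test of enrichment in which the p-value is the Poisson probability of seeing at least the observed number of genotypes. The function names and the example counts are hypothetical.

```python
from math import exp, lgamma, log

def poisson_upper_tail(observed, rate):
    """P(X >= observed) for X ~ Poisson(rate), summed over the upper tail."""
    if observed <= 0:
        return 1.0
    if rate <= 0:
        return 0.0
    k = observed
    term = exp(-rate + k * log(rate) - lgamma(k + 1))
    total = 0.0
    while True:
        total += term
        k += 1
        term *= rate / k  # recurrence P(k) = P(k - 1) * rate / k
        # Stop once past the mode and the terms no longer change the sum.
        if k > rate and term <= 1e-17 * total:
            break
    return min(1.0, total)

def burden_summary(observed, expected):
    """Excess, ratio and enrichment p-value for exome-wide totals."""
    return {
        "excess": observed - expected,
        "ratio": observed / expected,
        "p": poisson_upper_tail(observed, expected),
    }

# Toy example: 130 observed biallelic genotypes against 100 expected.
summary = burden_summary(130, 100.0)
```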

Alternative methods for estimating the expected number of biallelic genotypes

For Fig. S4, we used two alternative methods to the one described above for estimating the expected number of biallelic genotypes. Firstly, we calculated the cumulative frequency based on ExAC by summing the frequencies of individual variants of class c in gene g, using non-Finnish Europeans (NFE) for EABI and South Asians (SAS) for PABI:

$$f_{c,g} = \sum_{i} p_i$$

where $p_i$ is the ExAC allele frequency, in the relevant population, of the i-th variant of class c in gene g.

For this, we restricted to ExAC variants within the intersection of the V3 and V5 Agilent probes used in DDD (including 100bp flanks on each side), and removed variants if they had >50% missingness in the relevant ExAC population.

We also used a modified version of the method described by Jin et al. (3) for approximating the expected number of biallelic genotypes by fitting a polynomial regression on the mutability:

$$O_g = \beta_0 + \beta_1 \mu_g + \beta_2 \mu_g^2 + \epsilon_g$$

where $O_g$ is the observed number of biallelic genotypes for gene g, $\mu_g$ is the mutability for gene g calculated using the method of Samocha et al. (37), $\beta_0$, $\beta_1$ and $\beta_2$ are coefficients, and $\epsilon_g$ is a random noise term. We fitted this model to the observed number of biallelic synonymous genotypes (MAF<0.01) for each gene to estimate $\beta_0$, $\beta_1$ and $\beta_2$, then calculated the expected number of biallelic genotypes for each gene using the fitted value:

$$\hat{B}_g = \hat{\beta}_0 + \hat{\beta}_1 \mu_g + \hat{\beta}_2 \mu_g^2$$

Note that we did not do the additional step described in Jin et al. in which they calculate the expected number of biallelic genotypes for gene g as $O \hat{B}_g / \sum_{g'} \hat{B}_{g'}$, where $O$ is the total observed number of biallelic genotypes, since this constrains the total expected number of biallelic genotypes to be equal to the number observed, making exome-wide burden analysis impossible.
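To illustrate the regression-based alternative, the following Python sketch fits a polynomial of the per-gene genotype count on mutability by ordinary least squares and uses the fitted values as per-gene expectations. The polynomial degree is assumed to be two (an intercept plus linear and quadratic terms), the function names are hypothetical, and the data are toy values rather than real mutabilities.

```python
def fit_quadratic(x, y):
    """Least-squares fit of y ~ b0 + b1*x + b2*x^2 via the normal equations."""
    s = [sum(xi ** p for xi in x) for p in range(5)]
    t = [sum(yi * xi ** p for xi, yi in zip(x, y)) for p in range(3)]
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the 3x4 system.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]

def fitted_expectations(x, coefficients):
    """Per-gene expected counts from the fitted curve (no renormalisation)."""
    b0, b1, b2 = coefficients
    return [b0 + b1 * xi + b2 * xi ** 2 for xi in x]

# Toy data: scaled mutabilities and observed synonymous biallelic counts.
mutability = [0.0, 1.0, 2.0, 3.0, 4.0]
observed = [1.0, 6.0, 17.0, 34.0, 57.0]
coefficients = fit_quadratic(mutability, observed)
expected_total = sum(fitted_expectations(mutability, coefficients))
```

Because the fitted values are not rescaled to match the observed total, the summed expectation can in general differ from the number observed, which is what allows an exome-wide comparison. Real mutabilities are very small numbers, so a numerically robust least-squares routine would be preferable to the hand-rolled solver used here.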

Estimating the proportion of cases with diagnostic biallelic coding variants or de novo mutations

We are interested in estimating the proportion of probands with diagnostic variants of consequence class c. Under the null hypothesis in which none of the genotypes of class c are pathogenic, the number of such genotypes we expect to see in probands is:

where is the total expected frequency of genotypes of class c across all genes.

However, under the alternative hypothesis, suppose that some fraction of genotypes in class c cause DD, and some fraction are lethal. Assuming complete penetrance, we can thus split into genotypes that are due to chance and those that are diagnostic:

+

The component due to chance is , and is the average number of pathogenic genotypes per individual, given that the individual has at least one such genotype.

In healthy parents, biallelic genotypes of class c are all due to chance from the portion of c that is not pathogenic:

where is the expected rate of biallelic genotypes of class c in the parents, given the cumulative frequencies estimated in the same set of people, and the autozygosity rates. We can thus obtain a maximum likelihood estimate for using , the observed number of biallelic genotypes of class c in the parents:

The 95% confidence interval for is ,. We show the estimates of from the EABI and PABI parents in Fig. S7. To estimate , we combined data from both populations for MAF<0.01 variants to estimate , and obtained the following maximum likelihood estimates and 95% confidence intervals: 0.141 (0.046,0.238), 0.083 (-0.009,0.175), and 0.007 (-0.028,0.042).

For biallelic genotypes, we can substitute into the expression above, substitute for and rearrange to obtain a maximum likelihood estimate of :

We cannot disentangle and with the available data, but we find that the ratio makes very little difference to the estimate of , so we make the assumption that . The 95% confidence interval is then .

De novo mutations were called as previously described (4), selecting the threshold on ppDNM (posterior probability of a de novo mutation) such that the observed number of synonymous de novos matched the number expected. Using Sanger validation data from an earlier dataset (1), we adjusted the observed number of mutations to account for specificity and sensitivity as follows:

where is the positive predicted value (the proportion of candidate mutations that are true positive at the chosen threshold) and is the sensitivity to true positives at the same threshold. The adjustment by 0.95 is due to exome sequencing being only about 95% sensitive. The overall de novo mutation rate was calculated in different sets of probands using the model from (37), adjusting for sex, as described previously (4) .

Since we cannot estimate for de novo mutations using the parents, as we did for recessive variants, we instead set to 0.099, the fraction of genes with pLI>0.99. This estimate is more speculative than the directly observed depletion of biallelic genotypes above, but we note that the estimate of for the full set of 7,832 trios only increases from ~0.129 to ~0.154 if we increase from 0.01 to 0.3. To estimate , we make use of this relationship:

where is the population of the British Isles, and is the probability that an individual is recruited to the DDD given he/she has a pathogenic mutation of class c. If we assume that this recruitment probability is the same for de novo missense mutations as for de novo LoFs, we can write:

=

We know and , will assume so we can estimate , and can thus write the number of de novo missense mutations we expect to see as:

+

We calculated for a range of values of and found that best matched the observed data, so we used this value for estimating .

Structural analysis of EIF3F

This is shown in Fig. S11. Human EIF3:f (pdb 3j8c:f) was submitted to the Protein structure comparison service PDBeFold at the European Bioinformatics Institute (38, 39). Of the close structural matches returned, the X-ray yeast structure pdb entry 4OCN was chosen to display the human variant position, as the structural resolution (2.25Å) was better than the human EIF3:f pdb 3j8c:f structure (11.6Å) and it was the most complete structure among the yeast models. In order to map the Phe232 variant onto the equivalent position on the yeast structure, the structural alignment from PDBeFold was used. Solvent accessibility was calculated using the Naccess software (40) using the standard parameters of a 1.4Å probe radius. Amino acid sequence conservation was calculated using the Scorecons server (41) and displayed using sequence logos (42).

Introducing the EIF3F Phe232Val variant with CRISPR/Cas9

The Phe232Val mutation in the human EIF3F gene was generated by a single base substitution (T>G) using CRISPR/Cas9-induced homology directed repair in the kolf2_C1 human iPSC line, a clonal derivative of the kolf2 line generated by the HipSci consortium (43). This was achieved by nucleofection (Lonza, P3 buffer, program CA137) of 10⁶ cells with Cas9-crRNA-tracrRNA ribonucleoprotein (RNP) complexes. Synthetic RNA oligonucleotides (target site: 5’-AGGCGTGAACATCACTCCCA-3’, 225 pmol crRNA/tracrRNA, IDT) were annealed by heating to 95°C for 2 min in duplex buffer (IDT) and cooling slowly, followed by addition of 122 pmol recombinant eSpCas9_1.1 protein (in 10 mM Tris-HCl, pH 7.4, 300 mM NaCl, 0.1 mM EDTA, 1 mM DTT, PMID: 26628643), and 500 pmol of a 100 nt ssDNA oligonucleotide (5’-CTCCGATGCGTTCAGTGTCGTAGTACGCGTATTTCACTGTCAGAGGCGTGACCATCACTCCCATGGTCCTCCCAGGGACTCCCATTAAAGTGCTGCGGAAGG-3’, IDT Ultramer) as a homology-directed repair template to introduce the desired base change. After recovery, plating at single cell density and colony picking into 96 well plates, 384 clones were screened for heterozygous and homozygous mutations by high throughput sequencing of amplicons spanning the target site using an Illumina MiSeq instrument.

Crude DNA lysates were prepared by incubation of cells in 100 µl of yolk sac lysis buffer (10 mM Tris-HCl pH 8.3, 50 mM KCl, 2 mM MgCl2, 0.45% IGEPAL CA-630, 0.45% Tween 20, 400 µg/ml proteinase K) at 60°C for 1 h and 95°C for 10 min, followed by dilution 10x in water. The region surrounding the mutation was amplified by nested PCR using 1 µl diluted lysate and KAPA HiFi hotstart polymerase (Kapa Biosystems) by 35 cycles of {98°C 20 s, 65°C 15 s, 72°C 30 s}, using primers eIF3F_500_F and eIF3F_500_R (see below). Subsequently, reactions were diluted 10x and re-amplified with primers eIF3F_200_F and eIF3F_200_R (see below) (Tm = 60°C) to ensure specificity for the EIF3F gene. Indexed Illumina sequencing adaptors were added by a third round of PCR to identify the location of positive clones. Final cell lines were further validated by Sanger sequencing. Since this region is highly similar to several other regions in the genome, we analysed the final clones for off-target mutagenesis at the homologous sites, and found no evidence of this at the two closest off-target sites on chr2:58252105-58252196 and chr12:5032866-5032941 in any of the lines obtained.

Primer sequences (5’-3’):
eIF3F_500_F: tctggggttcatttggtccc
eIF3F_500_R: ctgctcagtcacacttcctg
eIF3F_200_F: ACACTCTTTCCCTACACGACGCTCTTCCGATCTagacatggaaaccctctccc
eIF3F_200_R: TCGGCATTCCTGCTGAACCGCTCTTCCGATCTagtcccttttcaaaccaccc

Investigating EIF3F protein levels in iPSC lines containing the EIF3F variant

iPS cells were cultured on Vitronectin XF (07180, Stemcell Technologies)-coated plates in TeSR-E8 medium (05990, Stemcell Technologies) and incubated at 37°C in 5% CO2:95% air. At 75% confluence, iPS cells were incubated in Gentle Cell Dissociation Reagent (07174, Stemcell Technologies) for 5 min, collected into a tube, centrifuged at 300 g for 3 min and resuspended in ice-cold 20 mM Tris-HCl pH 7.4 containing protease inhibitor cocktail (1861281, Thermo Scientific). Cells were agitated at 4°C for 30 min before passing through a 26 gauge needle several times and centrifuging at 13000 g for 15 min at 4°C. Protein in the supernatant was quantified using the Qubit fluorometer (Invitrogen). We repeated this for two independent cultures of iPSC lines. Samples were reduced (NP0009, Invitrogen) and mixed with sample buffer (NP0007, Invitrogen) before an equal amount of total protein from each cell line (6 µg for culture 1, 4 µg for culture 2) was electrophoresed through a 4-12% Bis-Tris gel (NP0322, Invitrogen) and transferred onto a membrane (IPVH00010, Immobilon P) using the X-Cell SureLock system, according to the manufacturer’s protocol (E19051, Invitrogen). Blots were blocked in 5% non-fat milk (Marvel) in Tris-buffered saline containing 0.1% Tween 20 (TBST) for 1 h at room temperature. Primary antibodies were as follows: 1:500 anti-rabbit EIF3F (638202, Biolegend) or 1:1000 beta III tubulin (EP1569Y; AB52623, Abcam), both diluted in 5% non-fat milk in TBST and incubated overnight at 4°C. Rabbit-HRP-linked secondary antibody (7074, Cell Signalling Technologies) was diluted 1:2000 in 5% non-fat milk in TBST and incubated at room temperature for 2 h. Blots were incubated for 2 min with Western Bright ECL Spray (Advansta) followed by detection with the ImageQuant LAS 4000 (GE Healthcare). Band intensities were measured using ImageJ 1.48v and EIF3F expression values were normalised to beta-tubulin.

We jointly analysed the data from the two independent replicates, having normalised the relative expression values by dividing by the mean of the wild-type lines for each replicate. We tested for lower relative EIF3F expression in the homozygous cells compared to heterozygous and wild-type cells combined using a one-sided t-test (Fig. S10).

Investigating protein synthesis in iPSC lines containing the EIF3F variant

Protein synthesis was assayed using the Click-iT L-homopropargylglycine (HPG) Alexa Fluor 488 Protein Synthesis Assay Kit (C10429, Molecular Probes). Briefly, in this assay, we took iPSCs with or without the EIF3F variant and depleted them of methionine to stop translation, then added a methionine analogue that contains an alkyne moiety which is incorporated into nascent proteins. We then added Alexa Fluor 488 azide, which leads to a chemoselective "click" reaction between the fluorescent azide and the alkyne, allowing the modified proteins to be detected by image-based analysis as a proxy for the amount of protein synthesis.

We optimised the assay for use with iPSCs, using the translational elongation inhibitor cycloheximide as a control to confirm the sensitivity and reproducibility of the assay (Fig. S12). iPS cells were cultured on Vitronectin XF (07180, Stemcell Technologies)-coated plates in TeSR-E8 medium (05990, Stemcell Technologies) and incubated at 37°C in 5% CO2:95% air. Cells were washed in PBS (D8537, Sigma), dissociated with Accutase (A11105-01, Gibco), then 30,000 cells were seeded per well of a 96 well plate, in triplicate in TeSR-E8 containing 10 µM ROCK inhibitor (Y-27632 dihydrochloride; Ab120129, Abcam). The following day, cells were depleted of methionine for 1 h by incubation in methionine-free DMEM (21013-024, Gibco), which was supplemented with insulin, transferrin and selenium (41400045, Gibco), 20 ng/ml human FGF-basic protein (100-18C, PeproTech), sodium pyruvate (11360-039, Gibco), L-glutamine (25030-81, Gibco) and N-acetyl cysteine (A7250, Sigma), at 37°C in 5% CO2:95% air. The media was then changed to methionine-free DMEM plus supplements and 100 µM Click-iT HPG (L-homopropargylglycine; C10202, Molecular Probes), a methionine analogue, or vehicle (DMSO; D2650, Sigma), and cells were incubated for 2 h at 37°C in 5% CO2:95% air. As a control, cycloheximide was added to control wells at the same time, at different concentrations (Fig. S12B).

At the end of the incubation, cells were washed briefly with PBS, trypsinised for 5 min (25300-054, Gibco) to dissociate single cells, diluted in DMEM containing 20% serum to quench the trypsin, transferred to a V-bottom 96-well plate and triplicate reactions were pooled. Cells were then processed using standard immunocytochemical methods through a series of incubations, centrifugations (all performed at 300 g for 5 min) and washes in PBS, all at room temperature, in the following order: incubation in 3.7% formaldehyde in PBS (28906, Thermo Scientific) for 15 min to fix the cells; incubation in 0.5% Triton X-100 (X100, Sigma) for 20 min to permeabilise the cells; incubation in 3% bovine serum albumin (A9543, Sigma) in PBS for 10 min to block non-specific binding sites; incubation in 100 µl Click-iT reaction cocktail, which was prepared according to the manufacturer’s recommendation, for 1 h, followed by a wash with Click-iT reaction rinse buffer, before resuspending in PBS and analysing fluorescence at 488 nm on a BD LSRFortessa (BD Biosciences).

This experiment was repeated three times for each of the two independent cell lines, for each of the three genotypes (wild-type, heterozygous, homozygous). Data were analysed using FlowJo (v10.4.2). After gating on single cells, we obtained a bimodal distribution of Alexa Fluor 488-positive cells (Fig. S12A), due to the different rates of protein synthesis during the cell cycle (G1/G2 have high rates of protein synthesis and S/M phase have lower rates). We gated on the Alexa Fluor 488-positive population (fluorescence > 10⁴), which comprised ~80% of the cells in the absence of cycloheximide (Fig. S12A). This proportion did not significantly depend on genotype, replicate or cell line (ANOVA p>0.05). After gating, there were an average of 139 cells per replicate (standard deviation = 76), on which we determined the median fluorescence intensity. We then tested for the effect of genotype on median fluorescence intensity using linear regression, controlling for replicate. As a summary, we averaged the median fluorescence for the two cell lines of the same genotype for each replicate, calculated the ratio of homozygote to wild-type, and then averaged this across replicates.
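The summary statistic described in the last sentence above can be written as a short Python sketch. The data layout, the genotype labels and the fluorescence values are hypothetical and serve only to show the order of the averaging steps.

```python
from statistics import mean

def homozygote_to_wildtype_ratio(mfi):
    """mfi: dict (replicate, genotype) -> list of median fluorescence
    intensities, one per cell line. Returns the mean across replicates of
    the per-replicate ratio of homozygous to wild-type mean MFI."""
    replicates = sorted({replicate for replicate, _ in mfi})
    ratios = []
    for replicate in replicates:
        hom = mean(mfi[(replicate, "hom")])  # average the two cell lines
        wt = mean(mfi[(replicate, "wt")])
        ratios.append(hom / wt)
    return mean(ratios)

# Toy values for two replicates with two cell lines per genotype.
toy_mfi = {
    (1, "wt"): [100.0, 120.0],
    (1, "hom"): [50.0, 60.0],
    (2, "wt"): [200.0, 200.0],
    (2, "hom"): [120.0, 100.0],
}
summary_ratio = homozygote_to_wildtype_ratio(toy_mfi)
```

Taking the ratio within each replicate before averaging keeps replicate-level differences in overall fluorescence from dominating the summary.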

Investigating proliferation and apoptosis in iPSC lines containing the EIF3F variant

We investigated proliferation of iPS cell lines using two methods. In the first assay, 50,000 single cells were plated and four days later total cells were counted (Fig. S13A). In the second assay, the membrane of single cells was stained with the fluorescent dye CellTrace Violet™, the fluorescence of which reduces with each cell division and was assessed using flow cytometry (Fig. 3B; Fig. S13B).

For these assays, iPS cells were cultured on Vitronectin XF (07180, Stemcell Technologies)-coated plates in TeSR-E8 medium (05990, Stemcell Technologies) and incubated at 37°C in 5% CO2:95% air. At 65% confluence, iPS cells were washed with PBS (D8537, Sigma), dissociated with Accutase (A11105-01, Gibco), resuspended in 10X volume of TeSR-E8 medium (05990, Stemcell Technologies) containing 10 μM ROCK inhibitor (Y-27632 dihydrochloride; Ab120129, Abcam), passed through a 10 μm cell strainer (pluriStrainer, pluriSelect; 43-10010-40) and centrifuged at 300 g for 3 min.

In the first assay, cells were resuspended in TeSR-E8 medium and 50,000 live cells (assessed using Trypan blue exclusion; Invitrogen, 15250061) were plated per well of a 6-well plate containing TeSR-E8 medium and 10 μM ROCK inhibitor. Cells were cultured for the next four days in TeSR-E8 medium, before trypsinising for 5 min (25300-054, Gibco), and counting the total number of live cells (Figure S13A).

In the second assay, cells were resuspended in PBS containing 4 μM CellTrace Violet™ (Molecular Probes, C34557) and incubated at room temperature in the dark for 20 min. Cells were diluted in TeSR-E8 medium containing 10 μM ROCK inhibitor and centrifuged at 300 g for 3 min, and then 40,000 cells were plated per well of a 48-well plate. Cells were cultured for the next two days in TeSR-E8 medium, washed in PBS, dissociated to single cells with Accutase, transferred to a V-bottom plate and centrifuged at 400 g for 4 min before being fixed in 100 μl fixation buffer (eBioscience™ IC Fixation Buffer, 00-8222-49) and resuspended with staining buffer (eBioscience™ Flow Cytometry Staining Buffer, 00-4222-26). Cells stained and fixed at day zero were used as a ‘no division’ control. Note that Fig. 3B shows the distributions from one representative replicate per genotype. The data were consistent across cell lines and replicates (Fig. S13B) and clearly showed that the homozygous lines had a higher fraction of cells that had undergone only one division than both the wild-type and heterozygous cells. To test this formally, we fitted a linear regression of the number of cells in division 1 against the genotype (0=wild-type or heterozygous; 1=homozygous) and the total number of cells.
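The regression described above can be sketched as an ordinary least-squares fit with an intercept, a genotype indicator and the total cell count as predictors. The sketch below is a minimal pure-Python illustration, not the code used in the study; the helper name `ols` and all cell counts are hypothetical, and it returns coefficients only (no standard errors or p-values).

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y.

    X is a list of rows, each already including a leading 1 for the intercept.
    Solved by Gauss-Jordan elimination with partial pivoting.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(k)
    ]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Made-up data: (genotype indicator, total cells, cells in division 1).
# Genotype is coded 0 = wild-type or heterozygous, 1 = homozygous.
rows = [(0, 100, 20.0), (0, 200, 30.0), (1, 100, 25.0),
        (1, 300, 45.0), (0, 150, 25.0), (1, 250, 40.0)]
X = [[1.0, g, total] for g, total, _ in rows]
y = [div1 for _, _, div1 in rows]
intercept, genotype_effect, total_effect = ols(X, y)
```

In this coding, `genotype_effect` is the estimated excess of division-1 cells in homozygous lines after adjusting for the total number of cells.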

We assessed apoptosis by staining with annexin V conjugated to a fluorescent dye and detected dead cells with propidium iodide (PI). Briefly, phosphatidylserine is externalised in apoptotic cells and is bound by recombinant annexin V-FITC (green fluorescence), while PI is excluded by the membrane of live cells and only stains necrotic cells or those in late apoptosis with red fluorescence. After incubation with both probes, early apoptotic cells show green fluorescence, and late apoptotic cells show both red and green fluorescence. Necrotic cells show red only and live cells show little or no fluorescence. We used a topoisomerase inhibitor (topotecan) as a positive control in this assay. Cells were harvested at 65% confluence (including the positive control pre-incubated for 2 h in 10 μM topotecan) by dissociation with Accutase, and processed according to the manufacturer’s protocol (Dead Cell Apoptosis Kit; Invitrogen, V13242).

Flow cytometry data were acquired on an LSRFortessa (Becton Dickinson) instrument. Doublet cells were excluded and CTV, Annexin V or PI fluorescence was analyzed using the FlowJo 10.4.2 software package (LLC). Fig. 3B and Fig. S13B summarise the results from the CTV proliferation assay, and Fig. S14 shows the results from the apoptosis assay.

Validation of KDM5B variants by targeted re-sequencing

We re-sequenced all KDM5B de novo mutations and inherited LoF variants, with the exception of two large deletions. PCR primers were designed using Primer3 to amplify the site of interest, generating a product of approximately 230 bp centred on the site. PCR amplification of the targeted regions was carried out using JumpStart™ AccuTaq™ LA DNA Polymerase (Sigma-Aldrich), using 40 ng of input DNA from the proband and their parents. Unique identifying tag sequences were introduced into the PCR amplicons in a second round of PCR using the KAPA HiFi HotStart ReadyMix PCR Kit (KapaBiosystems). PCR amplicons were pooled and 96 products were sequenced in one MiSeq lane using 250 bp paired-end reads. Reference and alternate read counts were extracted from the resulting BAM files and used to determine the presence of the variant in question. In addition, read data were visualised using IGV.

Transmission-disequilibrium test on KDM5B LoFs

We observed 15 trios in which one parent transmitted a LoF to the child, 5 trios in which one parent had a LoF that was not transmitted, 2 quartets in which one parent had a LoF that was transmitted to one out of two affected children, and 4 trios in which both parents transmitted a LoF to the child. We tested for significant over-transmission using the transmission-disequilibrium test as described by Knapp (44). There were 7 LoFs (including one large deletion) observed in probands whose parents were not originally sequenced, which we excluded from the TDT. Of the six for which we attempted validation and segregation analysis, one was found to be de novo and five inherited.
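For orientation, the basic idea of a transmission-disequilibrium test can be sketched as an exact binomial test on transmissions from heterozygous parents. This is a simplified illustration only: it is not the specific procedure of Knapp (44), the function name is hypothetical, and the counts in the usage line are made up rather than derived from the families described above.

```python
from math import comb

def tdt_exact_one_sided(transmitted, untransmitted):
    """One-sided exact p-value for over-transmission.

    Under the null hypothesis, each heterozygous parent transmits the
    variant with probability 0.5, so the number of transmissions is
    Binomial(n, 0.5) with n = transmitted + untransmitted.
    """
    n = transmitted + untransmitted
    tail = sum(comb(n, k) for k in range(transmitted, n + 1))
    return tail / 2 ** n

# Hypothetical counts for illustration only.
p_value = tdt_exact_one_sided(8, 2)
```

Each informative parent contributes one trial, so how families with two carrier parents or two affected children are counted has to be decided before applying a test of this kind.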

Searching for coding and regulatory modifiers of KDM5B

We defined a set of genes that might modify KDM5B function as: interactors of KDM5B obtained from the STRING database of protein-protein interactions (45) (HIST2H3A, MYC, TFAP2C, CDKN1A, TFAP2A, SETD1A, SETD1B, KDM1A, KDM2B, PAX9) plus those mentioned by Klein et al. (46) (RBBP4, HDAC1, HDAC4, MTA2, CHD4, FOXG1, FOXC9), as well as all lysine demethylases, lysine methyltransferases, histone deacetylases, and SET domain-containing genes from http://www.genecards.org/. The final list contained 95 genes. We looked for LoF or rare missense variants in these genes in the monoallelic KDM5B LoF carriers that might have a modifying effect, but found none that were shared by more than two of the de novo carriers.

We also looked for indirect evidence of a regulatory “second hit” near KDM5B by examining the haplotypes of common SNPs in the region (Fig. S15). DDD probands and a subset of their parents were genotyped on either the Illumina OmniExpress chip or the Illumina CoreExome chip. We performed variant and sample quality control for each dataset separately. Briefly, we removed variants and samples with high data missingness (>=0.03), samples with high or low heterozygosity, sample duplicates, individuals of African and East Asian ancestry, and SNPs with MAF<0.005. We then ran SHAPEIT2 (47) to phase the SNPs within 2Mb either side of KDM5B. To make Fig. S15, we used the heatmap() function in R to cluster the phased haplotypes using the default hierarchical clustering method (based on Euclidean distance).
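As a small illustration of the distance underlying the clustering, the sketch below computes pairwise Euclidean distances between phased haplotypes coded as 0/1 alleles. It is written in Python for illustration and is not the R code used to make Fig. S15; the function names and the toy haplotypes are hypothetical.

```python
from math import sqrt

def euclidean(h1, h2):
    """Euclidean distance between two equal-length haplotype vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def distance_matrix(haplotypes):
    """Symmetric matrix of pairwise distances, the input to hierarchical clustering."""
    return [[euclidean(a, b) for b in haplotypes] for a in haplotypes]

# Toy haplotypes (0 = one allele, 1 = the other) at four SNPs.
haps = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
]
dist = distance_matrix(haps)
```

With 0/1 coding, the squared Euclidean distance equals the number of SNPs at which two haplotypes differ, so haplotypes that cluster together share alleles at most sites in the region.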

Analysis of DNA methylation in KDM5B probands

DNA from 64 DDD whole blood samples comprising 41 probands with a KDM5B variant and 23 negative controls was run on an Illumina EPIC 850K methylation array. Negative controls were selected from DDD probands with de novo mutations in genes not expressed in whole blood (SCN2A, KCNQ2, SLC6A1, and FOXG1), since we would not expect these to significantly impact the methylation phenotype in that tissue. Samples were randomised on the array to reduce batch effects, and were QCed using a combination of data from control probes and numbers of CpGs that failed to meet the standard detection p-value of 0.05. Based on these criteria, two samples failed and were excluded from further analysis (one of the negative controls and one of the inherited KDM5B LoF carriers).

We looked at methylation levels in the KDM5B LoF carriers to search for an “epimutation” (hypermethylation on or around the promoter) that might be acting as a second hit. We analyzed a subset of CpGs in and around the KDM5B promoter region: the CpG island in the KDM5B promoter itself, and a CpG island in the promoter of KDM5B-AS1, a lncRNA not specifically associated with KDM5B, but also highly expressed in the testis. We also extended the analysis 5kb on either side of the start and stop sites of the KDM5B promoter. We examined the distribution of the beta values (the ratio of methylated to unmethylated alleles) at each of the CpGs in the 10kb region (Fig. S16).

The final post-QC set of samples and 754,273 methylation probes was analysed for whole-genome methylation changes using principal component analysis (Fig. S17). One of the samples with a de novo mutation was removed from the PCA as it was a significant outlier. PCA was run on the post-QC set of beta values using the standard prcomp() function in R (version 3.4.1).

Generation of the KDM5B knockout mice

A mouse Kdm5b loss-of-function allele (MGI:6153378) was generated by CRISPR/CAS9-mediated deletion of coding exon 7 (ENSMUSE00001331577), leading to a premature translational termination due to a downstream frameshift. CAS9 mRNA (48) and two pairs of guide RNAs flanking exon 7 (targeting CCTAGTAACACTAGGTGTTAATA, GTGTTTGGTTGTCAGTTAGAGGG, CCTCTCGTACATACATCCTAGGC, CCTAGGCTCGAACTTCACCATGT) were injected into 1-cell stage C57BL/6NJ zygotes, which were then implanted into host mothers. Mice born from these transfers were tested for germ line transmission, and identified founders were bred on a C57BL/6NJ background to establish colonies for further testing.

Breeding and housing of mice and all experimental procedures were assessed by the Animal Welfare and Ethical Review Body of the Wellcome Sanger Institute and conducted under the regulation of UK Home Office license (P6320B89B), and in accordance with institutional guidelines. From in-crosses of heterozygous Kdm5b mutant mice, we recovered 350 pups at weaning, of which 39 were homozygous (11.1% versus 25% expected). Kdm5b-/- pups were raised and weaned together with Kdm5b+/+ pups. Mice were housed in mixed genotype cages (2-5 mice) with food and water ad libitum, under controlled temperature and humidity and a 12h light cycle (light on at 7am) at the Research Support Facility of the Wellcome Sanger Institute.
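The recovery of homozygous pups reported above (39 of 350) can be checked against the Mendelian expectation of 25% with a short calculation. The sketch below is illustrative only: no formal test of this ratio is described in the text, and the exact binomial tail used here is an assumption of this example, as is the function name.

```python
from math import comb

def binom_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

n_pups = 350
n_homozygous = 39
observed_fraction = n_homozygous / n_pups   # fraction of pups that were homozygous
expected_count = 0.25 * n_pups              # 87.5 expected from a het x het cross
p_deficit = binom_lower_tail(n_homozygous, n_pups, 0.25)
```

Here `observed_fraction` reproduces the 11.1% quoted in the text, and `p_deficit` is the probability of recovering 39 or fewer homozygotes if the true proportion were 25%.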

Phenotyping of wild-type and KDM5B knockout mice

At 10 weeks of age, a cohort of 16 wild-type and 16 homozygous knockout male mice underwent a series of behavioral tests. These were carried out between 9am – 5pm, after 1h of habituation to the testing room. Experimenters were blind to genotype; mouse movements were recorded with an overhead infrared video camera and tracked by automated video tracking (Ethovision XT 11.5, Noldus Information Technology).

Light-dark box (Fig. 4C): This test was adapted from Gapp et al. (49). Mice were housed in pairs for 20 minutes before introducing them individually into the light-dark box, a plastic box (40 × 42 × 26 cm) divided into two compartments. One is smaller, closed and dark (1/3 of the total surface area) and is connected through a door (5 cm) to a larger, brightly lit (370 lux with an overhead lamp) compartment (2/3 of the box). Each mouse was placed in the dark compartment, the door was then opened and the animal was left to explore for 10 min. The time spent in the light compartment was tracked and the difference between genotypes was estimated using a t-test.

Three-Chamber Sociability Assay (Fig. 4D): The protocol was adapted from Dias et al. (50). In brief, the test arena was divided into 3 equally sized chambers (25 x 37.5 cm) connected by small doors (5 x 5 cm). Mice were first individually habituated to the arena for 5 minutes, where the central chamber was empty and each of the side chambers had an empty cylindrical corral (8 cm diameter, 15 cm tall, 1 cm gap between bars). The mouse was then returned to the central chamber and doors were closed, while a stimulus mouse of a different strain (129Sv) was introduced into one of the corrals. Doors were opened and the test mouse was allowed to explore all three chambers: one side chamber with an empty corral and the other with a corral containing a freely moving novel mouse, for 5 minutes. The time spent investigating both corrals was recorded and measured as total investigation. The time spent investigating the mouse was assessed as a percentage of the total investigation time, i.e. [time investigating corral with mouse]/[time investigating corral with mouse + time investigating empty corral]. A t-test was used to evaluate the difference between genotypes of the time spent investigating the mouse.
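The sociability measure defined above can be written as a one-line calculation. The sketch below is a minimal Python illustration; the function name and the example times are hypothetical.

```python
def mouse_investigation_percent(time_mouse, time_empty):
    """Time investigating the corral with the mouse, as a percentage of
    total investigation time (mouse corral + empty corral)."""
    total = time_mouse + time_empty
    if total <= 0:
        raise ValueError("total investigation time must be positive")
    return 100.0 * time_mouse / total

# Made-up example: 60 s at the corral with the mouse, 40 s at the empty corral.
preference = mouse_investigation_percent(60.0, 40.0)
```

A value of 50% corresponds to no preference between the two corrals, and values above 50% indicate a preference for the novel mouse.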

Social Recognition Assay (Fig. 4E): This paradigm assesses memory and relies on the natural preference of mice to investigate a novel mouse over a familiar one. The assay consists of two days, as reported by Dias et al. (50). On both days, mice were habituated for 10 minutes to the test box, an empty, clean cage. Briefly, on the first day, mice were tested on a habituation paradigm to assess their olfaction and social investigation. An anesthetized stimulus mouse was placed in the center of the arena for 1 minute, repeated 4 times at 10 minute intervals. On the 5th trial, a novel mouse was presented. Both Kdm5b wild-type and homozygous knockout mice showed a similar decreasing investigation of the repeatedly-presented stimulus mouse, and an increased investigation on trial 5 of the novel mouse. On day 2, after 24 hours, the discrimination test was performed. Mice were simultaneously presented with the familiar mouse from Day 1 and a completely novel mouse. The amount of time the test animal spent investigating each stimulus mouse by close-proximity (sniffing, oronasal contact, or approaching within 1–2 cm) was recorded. Using predefined exclusion criteria, one wild-type mouse was excluded on day 1, and a second wild-type was excluded on day 2, both for insufficient overall investigation. A two-way ANOVA was used to assess the discrimination difference between genotypes, followed by Bonferroni's multiple comparisons test (two comparisons, familiar versus unfamiliar mouse for wild type and mutant).

Barnes Maze probe trial (Fig. S19A,B): This assay is a test of visuo-spatial learning and memory on a circular maze (120 cm diameter table) with 20 holes around the perimeter. One of the holes leads to a small dark box (Target) where the mice can escape from the brightly lit maze. Mice were trained for three days, 10 trials (4 min maximum each), to find the target location. On the probe trial, 72h after the last training day, the escape box was removed. Each mouse was given 4 minutes to explore the maze. The mouse's movements were tracked, and the amount of time spent around each of the holes was analyzed. Data are expressed as the percentage of time spent dwelling around each hole, relative to the total amount of time spent investigating all holes. With 20 holes to investigate, the amount of time spent by chance around each hole would be 5%. The difference between genotypes in remembering the target location was evaluated using a two-way ANOVA with Bonferroni's multiple comparisons test.
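The normalisation used for the probe trial can be sketched as follows. This is a minimal Python illustration with made-up dwell times; the function name is hypothetical, and placing the target at index 0 is an arbitrary choice for the example.

```python
def percent_time_per_hole(dwell_seconds):
    """Express dwell time at each hole as a percentage of the total time
    spent investigating all holes."""
    total = sum(dwell_seconds)
    if total <= 0:
        raise ValueError("no investigation time recorded")
    return [100.0 * t / total for t in dwell_seconds]

N_HOLES = 20
chance_level = 100.0 / N_HOLES  # 5% per hole if time were spread evenly

# Made-up dwell times: 24 s at the target hole (index 0), 4 s at each other hole.
dwell = [24.0] + [4.0] * (N_HOLES - 1)
percentages = percent_time_per_hole(dwell)
target_percent = percentages[0]
```

Comparing `target_percent` with `chance_level` shows whether a mouse spent more time near the former target location than expected from undirected exploration.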

Statistical analysis for all mouse behavior experiments was performed with GraphPad Prism 7 (GraphPad Software).

X-rays (Fig. S19C): Seven wild-type and seven homozygous Kdm5b knockout mice were anaesthetised with ketamine/xylazine (100 mg/10 mg per kg of body weight) and then placed in an MX-20 X-ray machine (Faxitron X-Ray LLC). Whole body radiographs were taken in dorso-ventral and lateral positions. Images were then analysed and morphological abnormalities assessed using Sante DICOM Viewer v7.2.1 (Santesoft LTD).

Supplementary Figures

Fig. S1: Flow diagram indicating the final number of samples used in the various analyses, and why samples were removed. Note that we use “diagnosed” and “undiagnosed” as shorthand in the manuscript, but these terms are not fully accurate descriptors of these groups (see Methods section on “Sample quality control and subsetting”). The “diagnosed” probands include only those with diagnostic dominant or X-linked exonic variants. The “undiagnosed” set includes 193 patients who had diagnostic variants in recessive DDG2P genes or potentially diagnostic variants in monoallelic or X-linked DDG2P genes but had high autozygosity or affected siblings.

Fig. S2: Principal components analysis of the 1000 Genomes Phase 3 samples (left) with DDD samples projected on top of them (right). The ellipses used to define the EABI and PABI populations in DDD are shown on the PC2 versus PC3 plot.

Fig. S3: Distribution of the number of organ systems affected per proband (10). P-values are from Wilcoxon rank-sum tests.

Fig. S4: Number of observed and expected biallelic synonymous genotypes per individual, estimated across 5684 EABI and 356 PABI probands. The grey lines indicate the observed number and the points with small black lines represent the expected number with a 95% confidence interval. Two-sided p-values are indicated. The expected numbers were calculated in different ways as described in Methods. Note that the estimate based on the DDD parental haplotypes (black points) best matches the observed data, and that use of the ExAC frequencies or the polynomial model of Jin et al. (3) gives estimates of the expected number that are significantly different from the observed number.

Fig. S5: Histograms of levels of autozygosity across EABI and PABI probands.

Fig. S6: Burden of biallelic damaging genotypes in different gene sets, for undiagnosed EABI and PABI probands combined (N=4,651). This includes biallelic LoFs, biallelic damaging missense and compound heterozygous LoF/damaging missense. The lines show 95% confidence intervals and the p-values are from a one-sided Poisson test. Note the strong enrichment in known recessive DD genes and genes predicted to be intolerant of biallelic LoFs based on the pRec score (23), and lack of enrichment in known dominant DD genes and genes predicted to be intolerant of monoallelic LoFs (pLI>0.9).

Fig. S7: Estimates of , the proportion of biallelic genotypes that are causal for DD or lethal. These were estimated from the parental data. See Methods for details. The points show maximum likelihood estimates and the lines show 95% confidence intervals. Points at the same MAF cutoff have been slightly scattered along the x-axis for ease of visualisation.

Fig. S8: Distribution of the ranks of minimum p-values per gene for known recessive genes versus all other genes. The order of genes with the same minimum p-value was randomised. A Kolmogorov-Smirnov (KS) test indicated that these distributions were significantly different. This, combined with the fact that fourteen of the sixteen genes with p<10⁻⁴ are known recessive DD genes, suggests our approach for identifying these genes is valid and Bonferroni correction is conservative.

Fig. S9: Anterior-posterior facial photographs of selected individuals with the homozygous Phe232Val variant in EIF3F. DECIPHER IDs are shown in the top right corner. Affected individuals did not have a distinctive facial appearance. Individual 265452 (leftmost) had muscle atrophy, as demonstrated in photographs of the anterior surface of the hands which show wasting of the thenar and hypothenar eminences. This is notable because it is only reported in one other proband in the DDD study and, in mice, Eif3f has been shown to play a role in regulating skeletal muscle size via interaction with the mTOR pathway (51). None of the other individuals were either assessed to have or previously recorded to have muscle atrophy.

Fig. S10: Results from Western blot showing lower levels of the EIF3F protein in iPSCs homozygous for the Phe232Val variant. WT: wild-type, HET and HOM: heterozygous and homozygous for Phe232Val. Note that two independent cell lines were used for each genotype, and that the two cultures represent independent replicates. Beta-tubulin was used as a normalising control. Band intensities were quantified using ImageJ. We jointly analysed the data from both independent cultures, having normalised the relative expression values by dividing by the mean of the wild-type lines for each replicate. There was significantly lower relative EIF3F expression in the homozygous cells compared to heterozygous and wild-type cells combined (one-sided t-test; p=0.003), with a mean reduction of 26.6%.

Fig. S11: Predicted structural effects of the pathogenic Phe232Val variant in EIF3F. The secondary structure, domain architecture and 3D fold of EIF3F are conserved between species but sequence similarity is low (29% between yeast and humans). A) Section of the amino acid sequence logo for EIF3F where the strength of conservation across species is indicated by the size of the letters. The sequence below represents the human EIF3F. Boxed characters are the aromatic residues conserved between humans and yeast and proximal in space to Phe232. B) Structure of the section of EIF3F containing the Phe232Val variant, highlighted in green. Amino acids conserved between yeast and human sequences as highlighted in panel A are shown in grey. The conserved Phe232 side chain is buried (solvent accessibility 0.7%) and likely plays a stabilizing role, so the Phe232Val variant may disrupt protein stability. See (10) for details of structure prediction.

Fig. S12: Optimising the sensitivity of the Alexa Fluor 488 Protein Synthesis Assay Kit for assessing protein synthesis in iPSCs with or without the EIF3F Phe232Val variant. A) Representative histogram of fluorescence intensities (Alexa Fluor 488), a proxy for the rate of nascent protein synthesis, across multiple wild-type iPSCs measured by flow cytometry. This plot shows a single replicate. We gated the major population (blue rectangle; i.e. fluorescence > 10⁴), and used its median (red dotted line) in downstream analysis. B) Dose-response curve for iPSCs treated with cycloheximide. Data are represented as mean and standard deviation of median fluorescence intensity of the major Alexa Fluor 488-positive population across four independent replicates, three of which were run in parallel with experiments presented in Figure 3C. Note that, for the leftmost point, the error bars are smaller than the dot and thus not visible.

Fig. S13: The EIF3F Phe232Val variant reduces proliferation of iPSCs in the homozygous but not heterozygous state. (A) Mean cell counts four days after plating 50,000 cells per line, with 95% confidence intervals, from three independent replicates. (B) Proportions of cells that have undergone zero, one or multiple divisions in a CellTrace Violet (CTV) proliferation assay. There are four replicates for each of the two independent cell lines for each genotype. The homozygous lines had significantly more cells in division 1 than wild-type and heterozygous lines (p=2×10⁻⁵; linear regression including total number of cells as covariate). See Fig. 3C for representative distributions of CTV intensity, one for each genotype.

Fig. S14: No difference in cell apoptosis of iPSCs containing the EIF3F variant versus wild-type cells. Graph shows mean and standard deviation of the percentage of live, apoptotic or necrotic cells, assayed using annexin V or propidium iodide staining (see (10)). Data derived from three replicates.

Fig. S15: Plot showing haplotypes of common SNPs around KDM5B in individuals with de novo missense or LoF mutations or with monoallelic or biallelic LoFs. There is no evidence for a local haplotype shared by multiple probands with monoallelic LoFs that was not also present in an unaffected parent with a monoallelic LoF. The region shown lies between two recombination hotspots. The rows represent phased haplotypes, with orange and green rectangles corresponding to the different alleles at the SNPs at the positions indicated along the bottom. Hierarchical clustering has been applied to the haplotypes, as indicated by the dendrogram on the left, and the labels on the right indicate which individual carries the haplotype, and whether the individual was a proband carrying a de novo (purple), a biallelic LoF (dark green), or an inherited heterozygous LoF (yellow), or a parent carrying a heterozygous LoF (pink).

Fig. S16: Violin plots of the beta values (the ratio of methylated to unmethylated alleles) at each of the CpGs in the 10kb region around the KDM5B promoter. The CpGs within the KDM5B and KDM5B-AS1 promoters are annotated below the plot, with coordinates relative to hg19. The bottom panel shows the negative controls (probands with likely causal de novo mutations in known DD genes not expressed in blood), and the other panels show probands with variants in KDM5B that are either biallelic (top panel), de novo (second panel) or monoallelic and inherited (third panel).

Fig. S17: Results from a principal components analysis of genome-wide DNA methylation in probands with de novo mutations or biallelic or inherited monoallelic LoFs in KDM5B. The lack of differences between groups suggests that LoFs in KDM5B do not manifest in DNA methylation changes, at least not in whole blood of children in this age range (age 18 months to 16 years).

Fig. S18: Anterior-posterior facial photographs of one of the individuals with biallelic KDM5B variants. Note the narrow palpebral fissures, arched or thick eyebrows, dark eyelashes, low hanging columella, smooth philtrum and thin upper vermilion border.

Fig. S19: Cognitive and skeletal defects of homozygous Kdm5b knockout mice. A) Impaired long-term spatial memory in the probe trial of the Barnes Maze assay. 72 hours after learning the location of the target hole, wild-type mice spent significantly more time around the target hole compared to knockout mice. Two-way ANOVA, p=0.026 for hole × genotype interaction; with Bonferroni’s multiple comparison test, p<0.001 for Target hole. B) Representative heatmap of wild-type mouse activity around the Barnes Maze during the 72 hour probe trial. C) Skeletal defects identified from X-ray analysis. All homozygous Kdm5b knockout mice analysed (n=7) had transitional vertebrae (one-tailed Fisher's exact test p=0.03) compared to none of the wild-types (n=7). This is an X-ray image of the dorsal view of wild-type and homozygous Kdm5b knockout mice. This case shows a transitional L