51
Phasing single DNA molecules with barcode linked sequencing David Redin Doctoral Thesis KTH Royal Institute of Technology

Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

Phasing single DNA molecules with barcode linked sequencing

David Redin

Doctoral Thesis KTH Royal Institute of Technology

Page 2: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

© David Redin

Stockholm 2018

KTH Royal Institute of Technology

School of Engineering Sciences in Chemistry, Biotechnology and Health

Department of Gene Technology

Science for Life Laboratory

SE-171 65 Solna

Sweden

Printed by Universitetsservice US-AB

ISBN 978-91-7729-939-4

TRITA-CBH-FOU-2018:41

Page 3: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

For science.

Page 4: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

PUBLIC DEFENSE

The public defense of this thesis will take place on October 19, 2018 at 10.00 AM in the Air & Fire auditorium, Science for Life Laboratory (Tomtebodavägen 23 in Solna), for the degree of Doctor of Philosophy (Ph.D.) in Biotechnology.

Respondent | David Redin, M.Sc. in Medical Biotechnology Dept. of Gene Technology, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH Royal Institute of Technology. Science for Life Laboratory, Solna, Sweden.

Chairman | Prof. Per-Åke Nygren Dept. of Protein Science, KTH Royal Institute of Technology, AlbaNova, Stockholm, Sweden. Faculty opponent | Prof. Ulf Gyllensten Dept. of Immunology, Genetics and Pathology, Uppsala Universitet, Rudbeck Laboratory, Uppsala, Sweden. Evaluation committee

Prof. Ingrid Kockum Dept. of Clinical Neuroscience, Karolinska Institutet, Solna, Sweden.

Prof. Stefan Bertilsson Dept. of Ecology and Genetics, Uppsala Universitet, Uppsala, Sweden.

Assoc.Prof. Carsten Daub Dept. of Biosciences and Nutrition, Karolinska Institutet, Solna, Sweden.

Respondent’s supervisor | Prof. Afshin Ahmadian Dept. of Gene Technology, KTH Royal Institute of Technology, Sweden. Respondent’s co-supervisor | Prof. Joakim Lundeberg Dept. of Gene Technology, KTH Royal Institute of Technology, Sweden.

Page 5: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

ABSTRACT

Elucidation of our genetic constituents has in the past decade predominately taken the form of short-read DNA sequencing. Revolutionary technology developments have enabled vast amounts of biological information to be obtained, but from a medical standpoint it has yet to live up to the promise of associating individual genotypes to phenotypic states of wide-spread clinical relevance. The mechanisms by which complex phenotypes arise have been difficult to ascertain and the value of short-read sequencing platforms have been limited in this regard. It has become evident that resolving the full spectrum of genetic heterogeneity requires accurate long range information of individual haplotypes to be distinguished. Long-range haplotyping information can be obtained experimentally by long-read sequencing platforms or through linkage of short sequencing reads by means of a common barcode. This thesis explores these solutions, primarily through the development of novel technologies to phase short sequences of single molecules using DNA barcoding. A new method for high-throughput phasing of single DNA molecules, achieved by the production and utilization of uniquely barcoded beads in emulsion droplets, is described in Paper I. The results confirm that complex libraries of beads featuring mutually exclusive barcodes can be generated through clonal PCR amplification, and that these beads can be used to phase variations of the 16s rRNA gene which reduces the ambiguity of classifying bacterial species for metagenomics. Paper II describes a second methodology (‘Droplet Barcode Sequencing’) which simplifies the concept of barcoding DNA fragments by omitting the need for beads and instead relying on clonal amplification of single barcoding oligonucleotides. This study also increases the amount of information that can be linked, which is showcased by phasing all exons of the HLA-A gene and successfully resolving all the alleles present in a sample pool of eight individuals. Paper III expands on this work and explores the use of a single molecule sequencing platform to provide full-length sequencing coverage of six genes of the HLA family. The results show that while genes shorter than 10 kb can be resolved with a high degree of accuracy, compensating for a relatively high error rate by means of increased coverage can be challenging for larger genomic loci. Finally, Paper IV introduces the use of barcode-linked reads on an unprecedented scale, with a new assay that enables low-cost haplotyping of whole genomes without the need for predetermined capture sequences. This technology is utilized to generate a haplotype-resolved human genome, call large-scale structural variants and perform reference-free assembly of bacterial and human genomes. At a cost of only $19 USD per sample, this technology makes the benefits of long-range haplotyping available to the vast majority of laboratories which currently rely solely on short-read sequencing platforms.

Keywords: Single molecule sequencing, DNA barcoding, whole genome haplotyping, linked-read sequencing, phasing, de novo genome assembly.

Page 6: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

SAMMANFATTNING

Kartläggningen av våra genetiska beståndsdelar har under det senaste decenniet i huvudsak kommit genom tekniker för DNA-sekvensering med kort läslängd. Revolutionära tekniker har möjliggjort storskaliga biologiska analyser, men från en medicinsk synpunkt har de ännu inte levt upp till löftet om att associera komplexa genotyper med fenotyper av stort samhällsintresse. De biologiska mekanismerna som lägger grunden till komplexa sjukdomar har varit svåra att utreda och värdet av sekvenseringsplattformar med kort läslängd har varit begränsat i detta avseende. Det är numera tydligt att en komplett kartläggning av den mänskliga genetiska variationen kräver sekvenseringsinformation som sträcker sig betydligt längre, för att på så sätt kunna särskilja de haplotyper som genomet består utav. Lång sekvensinformation kan erhållas experimentellt antingen genom speciella plattformar med lång läslängd, eller genom sekvensering av korta molekyler som kopplats ihop med en DNA-baserad streckkod. Denna avhandling undersöker dessa lösningar, främst genom utveckling av nya tekniker för att samordna korta delar av enskilda molekyler. En ny metod för storskalig analys av enskilda DNA-molekyler, baserad på framställning och utnyttjande av unikt DNA-kodade partiklar i emulsionsdroppar, beskrivs i papper I. Resultaten bekräftar att komplexa bibliotek av kodade partiklar som är täckta av unika DNA-streckkoder kan genereras genom klonal PCR-amplifiering. Dessa partiklar används för att samordna olika delar av 16s rRNA-genen, som i sin tur minskar tvetydigheten för klassificering av bakteriearter i miljöprover. I papper II beskrivs en andra metodik som förenklar konceptet av att koda enskilda DNA-fragment genom att utesluta behovet av partiklar utan istället förlita sig på klonal amplifiering av enskilda kodade oligonukleotider. Denna studie utökar också mängden information som kan länkas, vilket visas genom att samordna alla exoner av HLA-A-genen och genom att kunna kartlägga alla alleler närvarande i en provblanding av åtta individer. Papper III bygger vidare på detta arbete och undersöker användningen av en sekvenseringsplattform med lång läslängd för att tillhandahålla full täckning av sex olika gener tillhörande HLA-familjen. Resultaten visar att gener kortare än 10 kilobaser kan karakteriseras med hög noggrannhet, medan längre genomiska regioner är svårare att karakterisera på grund av begränsad täckning. Papper IV introducerar slutligen användningen av DNA-kodning på ännu större skala, med en ny metod som möjliggör haplotypning av hela genom till en låg reagenskostnad och utan behov av förutbestämda kopplingssekvenser. Denna teknik används för att generera ett haplotyp-upplöst humant genom, detektera storskaliga strukturella varianter och genomföra ihopsättning av bakteriella och humana genom utav behov av refererenssekvenser. Med en kostnad på endast $19 USD per prov möjliggör tekniken för laboratorier över hela världen att kombinera storskalig DNA sekvensering med värdefull haplotyp-karakterisering över långa avstånd.

Nyckelord: DNA-sekvensering, enskilda molekyler, haplotyp-karakterisering.

Page 7: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

LIST OF PUBLICATIONS

This thesis is based on the following papers, hereafter referred to as Papers I-IV. Articles have been reprinted with permission for the Appendix.

Paper I

Erik Borgström, David Redin, Sverker Lundin, Emelie Berglund, Anders F. Andersson & Afshin Ahmadian. Phasing of single DNA molecules by massively parallel barcoding. Nature Communications 6: 7173 (2015).

Paper II

David Redin, Erik Borgström, Mengxiao He, Hooman Aghelpasand, Max Käller & Afshin Ahmadian. Droplet Barcode Sequencing for targeted linked-read haplotyping of single DNA molecules. Nucleic Acids Research 45: e125 (2017).

Paper III

Sebastian Johansson, Szilveszter Juhos, David Redin, Afshin Ahmadian & Max Käller. Comprehensive haplotyping of the HLA gene family using nanopore sequencing. Manuscript.

Paper IV

David Redin, Tobias Frick, Hooman Aghelpasand, Jennifer Theland, Max Käller, Erik Borgström, Remi-Andre Olsen & Afshin Ahmadian. Efficient whole genome haplotyping and single molecule phasing with barcode-linked reads. Manuscript under review.

Page 8: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

TABLE OF CONTENTS

CHAPTER ONE | Coming in terms with ignorance PAGE 1

CHAPTER TWO | A tale of two haplotypes PAGE 3

CHAPTER THREE | How it’s made: DNA data PAGE 9

CHAPTER FOUR | The art of trial and error PAGE 19

CHAPTER FIVE | Connecting the dots PAGE 25

CHAPTER SIX | Present Investigations PAGE 27

CHAPTER SEVEN | Looking to the future PAGE 37

ACKNOWLEDGEMENTS PAGE 39

REFERENCES PAGE 41

APPENDIX | Papers I – IV

Page 9: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

1

CHAPTER ONE | Coming in terms with ignorance

Doctors are men who prescribe medicines of which they know little, to cure diseases of which they know less, in human beings of whom they know nothing. – Voltaire (1760).

Beyond a veil of molecular mechanisms lies the answer to mankind’s greatest question; unable to be revealed with the instruments humans have obtained through evolution. Every second, trillions of cells in the human body perform a vast array of synchronized molecular functions to serve an organism which remains largely oblivious to all but an overall state of equilibrium (or lack thereof). An inherent irony of molecular biology is that the neuronal cells which enable cognitive thought are the same cells utilized in investigating how they able to do so. Specialized cells work in unison to manage a dynamic, yet consistent system of biochemical reactions required for sustaining life. From a scientific perspective, the challenge to define life has been a matter of understanding this system, by studying the underlying aspect of cellular programming which governs it. Science has provided tools to aid in this quest, but we are faced with an ongoing conundrum – the more we learn, the more we realize how little we know.

In the field of genomics, the effort to understand the molecular

underpinnings of life boils down to studying genomes, or rather how variation in the DNA sequence or structure of a genome governs traits of certain cell types. Thanks to a multitude of technological developments, geneticists have been able to produce sequence-based maps of the human genome that are used to link observable characteristics (phenotypes) to specific features of the genome (genotypes). In the beginning of the ‘genomic age’, it was thought that most phenotypes (e.g. a common disease) could be associated with a set of sequence-based variations in the DNA (i.e.

Page 10: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

2

mutations). Population-scale studies of DNA mutations have proved useful in identifying molecular pathways involved in common diseases, but in most cases the measured variation in the across genomes has been unable to fully explain the observed traits (1). Today we know that the evolutionary nature of the genotype-phenotype relationship is more often than not affected by a complex network of molecular operations. For instance, phenotypic variation can also be caused by non-sequence based epigenetic modifications. Despite extensive efforts to characterize the human genome, we still do not know how most it works.

This body of work comprises technological developments for sequence-based genome analysis, to improve the throughput, resolution and sensitivity of molecular methods used to probe the genetic constitution of DNA samples. In the grand scheme of things, it represents a miniscule contribution to scientific progression, yet providing researchers with tools to gain biological insights with less ambiguity and resources than previously conceivable. The publications included within, describe developments that empower researchers to resolve the full spectrum of genetic variation, bringing us closer to realizing the goal of basing therapeutic decisions on the genetic constitution of individuals. The same technologies also enable researchers to analyze unknown genomes with an unparalleled level of ease and resources, paving the way towards new biological discoveries. For the inquisitive mind, these developments are put in context by an onset presentation of scientific discoveries and technological developments upon which this work is based.

Before an inevitable realization of ignorance sets in, it is worth celebrating the remarkable achievements of scientists who have shaped the world we live in. The pursuit of a doctorate has laid the foundation for this thesis, and in the spirit of disseminating scientific information one can only hope it inspires those who come after as much as it has been by the scientists who came before.

Page 11: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

3

CHAPTER TWO | A tale of two haplotypes

My scientific studies have afforded me great gratification; and I am convinced that it will not be long before the whole world acknowledges the results of my work. - Gregor Mendel

Once upon a time, there was a peasant boy named Johann who lived on a fruit orchard with his parents and two sisters in Austria. The boy’s father, Anton, cared for the land like his forefathers had before him, teaching his son the precise craft of grafting trees - a process by which shoots from a tree with desired traits are attached to the stem of an older tree. The family was fortunate to receive exotic shoots of exceptional quality from the local Countess, and Johann could at a young age see the fruits of his meticulous labor. Breaking with the family’s traditions, Johann was sent to attend school at a neighboring village; which despite an economic burden on his family saw him excel in the natural sciences (2). With two years remaining of his studies, Johann’s father was no longer able to provide for his education. Now faced with a dilemma, 16-year old Johann had to choose whether to get a job and continue his studies, or move home and resume life as a peasant. Johann choose the former, and despite struggling to put food on the table, he graduated with honors and proceeded to study philosophy, mathematics and physics in preparation for higher education. Later at university Johann developed a close relationship with his mentor, Dr. Franz, who as a physicist and monk demonstrated how controlled experiments and observations could disclose fundamental laws of nature (3). Dr. Franz inspired him to become an Augustinian monk, and the young man took on Gregor as a new religious name. Gregor proceeded with his training in both theology and the natural sciences, embarking on a series of experiences that would set him on the path to become a great experimental scientist. It was not until two decades after his death

Page 12: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

4

in 1884 that Gregor Johann Mendel was recognized for his astonishing work ‘Experiments on Plant Hybrids’ (4), that would eventually see him be considered the father of genetics.

The rudimentary properties of genetic inheritance observed by Mendel paved the way for scientists to search for the molecular basis of this phenomenon. In 1869, the chemical discovery of nucleic material in cells was made by Friedrich Miescher, who concluded that this new material, unlike proteins, contained a large amount of phosphorus and lacked sulfur (5). While Miescher hypothesized that the role of this ‘nuclein’ was responsible for hereditary traits and cellular reproduction, it seemed implausible that a single material could form the basis of such incredible diversity in the natural world. Like Mendel, Miescher would not live to see his discoveries become the foundation of molecular biology. It was not until the mid 20th century that scientists became more inclined to see this nucleic acid (DNA) as the hereditary molecule of life. By that time, Russian biochemist Phoebus Levene had discovered the three major components of nucleotides (phosphate-sugar-base) and that DNA is a polymer of such nucleotides containing one of four nitrogenous bases - adenine (A), cytosine (C), guanine (G) and thymine (T) (6). Levene’s work was continued by Erwin Chargaff, who made key discoveries that the nucleotide composition of DNA varies between species, and that the bases are present in fixed ratios (7, 8). In the early 1950’s the last remaining piece of the DNA puzzle was to identify its structure; which set off an unprecedented race within the scientific community to find the solution - a story upon which multiple books and motion pictures have subsequently been produced. As history unfolded, it was James Watson and Francis Crick, with help from radiologists Rosalind Franklin and Maurice Wilkins, who discovered the structure of DNA to be a double helix with complementary base-pairing of A’s to T’s and G’s to C’s. The structure of DNA was published (9) by Watson and Crick on April

Page 13: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

5

25th, 1953, marking a turning point in our history as humans beginning to understand life at the molecular level.

Understanding the structure of DNA meant scientists could begin to form a model of how the hereditary information is utilized with respect to RNA and the synthesis of proteins. Crick proposed the relationship between these molecules in 1957, forming the ‘central dogma’ of molecular biology and detailing it as an irreversible transfer of information from nucleic acids (DNA and RNA) to proteins (10, 11). The theory postulates three main cellular processes by which the flow of information is maintained; DNA replication (DNA → DNA), transcription (DNA → RNA) and translation (RNA → protein). Crick also suggested the probable existence of mechanisms for RNA replication (RNA → RNA) and reverse transcription (RNA → DNA), both of which have since then been proven to exist. Following this groundbreaking hypothesis, Crick continued his work on the mechanisms of protein synthesis by deciphering how nucleic acid residues are translated into amino acids chains (i.e. proteins). Crick et al. (12) proposed that each triplet of nucleotide bases (i.e. one of 64 possible combinations of A, T, G or C) would correspond to one of 20 previously determined amino acids. This remarkable paper also correctly concluded the degenerate, non-overlapping and unidirectional nature of how these triplets (named codons) are encoded (13).

The central dogma and the subsequent work to decipher the genetic code has laid the foundation for modern genetics and genomics, but discoveries over the past 50 years have painted a highly complex picture of cellular regulation that goes far beyond the reliance on solely proteins to carry out cellular functions. Proteins that are synthesized arise from messenger RNA (mRNA) molecules which only make up 1-5% of the RNAs (14) in a cell (depending on the cell type and state), and which in turn stems from only 1.5% of the DNA in the human genome that is ‘protein-

Page 14: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

6

coding’. The remaining 98.5% of the human genome consists of non-coding RNA genes, regulatory elements, and other non-coding forms of DNA which have little or no annotated functions. The notion that large parts of the human genome, a collection of 3.2 billion nucleotides inherited from each of one’s parents, could be ‘junk’ is one that has divided the field scientists over the past decade (15). There is a degree of humbling humor in the thought that we as Homo sapiens, a species which after 1.6 billion years of evolution as multicellular organisms (16), would be comprised of ~80% ‘non-functional’ and useless DNA (17). The presence of large repetitive regions, seemingly useless copies of genes and transposable elements seem to indicate that the process by which such features arose in our ancestors did not have a negative effect on the population. On the contrary, it can be speculated that an excess of genetic material would be an acceptable load on cellular functions such as replication, while offering protection from random mutations which organisms amass on a slow but regular basis (18).

Despite this, the level of complexity of the human genome and the processes by which it influences cellular functions is one that scientists have only begun to comprehend. The Human Genome Project initiated in 1990 was in many ways the start of large-scale genomics, its aim was to ‘determine the sequence of nucleotide base pairs for the human genome and to map all genes from a physical and functional standpoint’ (19). The first draft of the human genome was published in February 2001, and although ahead of schedule by a couple of years it took a large team of international researchers 13 years to complete for an estimated cost of 3 billion USD. Most of the sequence information was gathered in the last two years due to technological breakthroughs and a shift in sequencing strategy. The process of sequencing was also accompanied by the development of public database resources (20-22) that provided geneticists with invaluable tools for annotation which meant humans could finally begin to explore the molecular

Page 15: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

7

underpinnings of human health and disease. By 2008, further technological advancements allowed the sequencing of the first human diploid genome to be completed in a few months (23), and the 1000 Genomes project was launched aiming to provide a more detailed map of the human genome based on individuals from various ethnic backgrounds (24). Since then, the cost of sequencing the human genome has continued to fall drastically, to the point where performing whole genome sequencing on individuals can now be accomplished for less than $1000 USD in a few days.

There is a wide variety of genetic variation in the human genome, and although much emphasis has been put on analyzing patterns of single nucleotide polymorphisms (SNPs), it has done little to explain the causal effects of complex diseases. Researchers have long recognized that variants of more than one base pair, as well as repetitive regions, may be of more interest from a clinical perspective than SNPs. Such variations, which can be anything from short block substitutions to large structural variants of >100,000 bases, accounts for more than half of nucleotide differences in the human genome (25); yet the majority of these events have been difficult to resolve through sequencing until recently (26). Since the first draft, steady progress has been made by the Genome Reference Consortium to improve the publically available reference of the human genome, which is currently on build 38 (GrCh38). Several prominent researchers and key figures involved in the Human Genome Project genome have been forthcoming about the need to fill in the blanks by looking into the regions that are missing (27).

An aspect which has been commonly overlooked by large scale sequencing efforts of the past decade is that of analyzing the diploid nature of human genomes; whereby the sequence of both maternal and paternal chromosomes is determined separately. Variations of genes, typically identified through a predetermined set of SNP positions that are common within the human

Page 16: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

8

population, are called alleles. And the process of distinguishing the SNPs found within an individual's genome to either the maternal or paternal allele is known as haplotyping. Haplotyping is in other words the process of resolving both haplotypes of an individual, each of which correspond to the assortment of genes inherited from a single parent. Differences between an individual's alleles which arise from deleterious DNA mutations, can mean that one of the alleles when transcribed and translated results in a non-functional protein. Fortunately, having two copies of each gene to begin with can mean that a functional protein can still be expressed, and thus that a healthy phenotype can still be maintained in some cases. However, in the case of multiple deleterious mutations in the same gene, it is important to know whether they stem from the same allele or if both copies of the gene are inactivated. For instance, the ability to distinguish between such cases is fundamental for the diagnosis of recessive Mendelian diseases (i.e. diseases that are caused by mutations in a single gene). The importance of having haplotype-resolved genomes is further exemplified by the finding that differences between the two haplotypes from the same individual can differ by a substantial 0.5% (~16 million nucleotides) (25). Although the tools available today enable an incredibly detailed look at each of the 6.4 billion nucleotides in a person’s diploid genome, the classification of variant alleles on a genome-wide scale is in a sense not all that different from the work Mendel pioneered in the mid 19th century. The tale of resolving the haplotypes of individual genomes is however one which requires specific technological achievements, and it is such developments which are the foundation of this thesis.

Page 17: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

9

CHAPTER THREE | How it’s made: DNA data

Progress in science depends on new techniques, new discoveries and new ideas, probably in that order. - Sydney Brenner

The generation of genomic data has since its dawn been dependent on an ever-expanding molecular toolbox of novel techniques to make visible that which we cannot see. As molecular biology has matured and new techniques have come forth, scientists have gained a deeper understanding of the molecules which regulate our lives, and drastically improved the accessibility and ease by which genomic data can be gathered. Over the past decade, the field has seen massive improvements and reductions in cost when it comes to sequencing DNA; to the point where most analyses relating to genomics, epigenomics and transcriptomics are today based on such data. This chapter chronicles specific technologies of DNA sequencing which have been pertinent to the papers featured in this thesis.

It was the chemical synthesis of custom nucleic acid chains (oligonucleotides), as spearheaded by H. G. Khorana (28) and further developed by R. L. Letsinger (29, 30) and M. Caruthers (31), which has enabled much of the research relating to DNA we see today. DNA synthesis paved the way for revolutionary technologies like PCR (polymerase chain reaction) in 1988 which permits amplification of DNA molecules by cycling the processes of separating complementary DNA strands (denaturation by heating) and enzymatically incorporating nucleotides to make each complement double stranded (32). The enzymatic incorporation of free nucleotides (i.e. a mixture of A’s, T’s, G’s and C’s in solution) is based on having a partial stretch of double stranded DNA (typically a synthesized oligonucleotide) to prime the polymerization. The first enzymes (polymerases) used for this

Page 18: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

10

purpose could not withstand the heating required to denature the stands, but this was soon resolved by the introduction of thermostable polymerases (33) which meant that the reaction could proceed without human intervention and thus be automated in thermocyclers (PCR machines).

Biochemical pioneer and two-time recipient of the Nobel Prize in Chemistry, Frederick Sanger, needs no introduction to those who have dabbled in molecular biology. Sanger was first to determine the amino acid sequence of a protein in 1951 (34, 35), the first to sequence a nucleic acid in 1969 (36), and first to fully sequence a DNA-based genome in 1977 (37). He developed multiple techniques for sequencing DNA, and one method in particular that is based on chain-terminating dideoxynucleotides (38) has had a profound impact on the field. Introduced in 1977, ‘Sanger sequencing’ is not only the foundation of modern sequencing technologies but is also still used to this day. It was first commercialized by Applied Biosystems for their ABI 370A instrument - the first of its kind four-color fluorescence automated sequencing instrument. The later release of the ABI PRISM 3700 capillary electrophoresis instrument in 1999 featured a large improvement in sequencing throughput and is, along with the introduction of whole genome shotgun sequencing, accredited as the reasons why the Human Genome Project could be finished years ahead of schedule.

The era of ‘massively parallel sequencing’ is one which began in 2005 with the introduction of the GS20 and subsequent GS FLX instrument by 454 Life Sciences. The 454 system (41) was able to sequence ~500 consecutive bases for around one million molecules per run. The chemistry was based on pyrosequencing (39, 40) - a protocol for detection of light-signals resulting from a cascade of enzymatic reactions that follows the incorporation of each nucleotide (added one at a time). Accurate detection of photons required localized amplification of the signal, by sequencing

Page 19: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

11

groups of identical input molecules rather than single molecules. This amplification was carried out as a preparatory step before the actual sequencing, through a process known as emulsion PCR (emPCR); whereby millions of template molecules are encapsulated and amplified in discrete reaction chambers. Emulsions were created by shaking oil and an aqueous phase, forming millions of aqueous droplets that enabled high-throughput monoclonal amplification of single template molecules. Compared to previously used vector-based methods, clonal amplification enabled a massive improvement in the throughput and ease of preparing sequencing libraries. For the 454 platform, emulsion droplets were generated with oligonucleotide-coated beads (~25 um in diameter) that were added to the aqueous phase to enable PCR products to be covalently linked to the beads. Breaking the emulsion then enabled collection of DNA-coated beads, each one coated with ~10 million copies of a unique template molecule, and spreading those beads individually into micro-fabricated wells enabled massively parallel sequencing of the DNA. Although the 454 sequencing platform was formally discontinued in 2016 due to competing technologies, the idea of amplifying the detection signal through polymerase amplification of sequencing molecules - to increase the ease and accuracy by which a readout could be obtained - has lived on. The beads and emulsification protocol of the 454 platform was utilized in Paper I to perform massively parallel barcoding of single DNA molecules.

The strategy of clonally amplifying sequencing molecules on

beads has also inspired the Ion Torrent platform, which is based on the sequential detection of protons released upon incorporation of native (non-labelled) nucleotides. A consequence of using non-terminating nucleotides is however that signal linearity breaks down when incorporating continuous stretches of the same nucleotide, making it difficult to characterize homopolymers. Semiconductor chips are used to measure dynamic changes in hydrogen ion concentration (pH), enabling Ion Torrent to develop

Page 20: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

12

a system for rapid sequencing for small genomes or targeted gene panels. The latest product line of chips yield from 2 to 100 million reads of 200 base-pairs (bp) in a few hours, but the platform also requires an extensive preparation of sequencing samples beforehand.

Analogous to the bead-based amplification of sequencing

molecules, on-chip bridge amplification (42, 43) using covalently anchored primers is an elegant solution with seemingly limitless scalability. Short DNA molecules (200-1000 bases) with universal adaptor sequences can be seeded on to chip surfaces with the density of template molecule controlled by dilution. Clusters of clonally amplified molecules are then formed by polymerase extension which can subsequently by sequenced. The technology commercialized by Solexa (later obtained by Illumina) was based on bridge amplification and sequencing with chain-terminating nucleotides featuring reversibly fluorescent labels. The incorporation of nucleotides is detected optically from one of four fluorophores attached to each base, followed by cleavage of fluorophores from incorporated nucleotides and washing - a process which is cycled for each base that is read. After a specified number of sequencing cycles, the molecules are made single stranded and flipped to continue sequencing in the opposite direction, yielding paired-end sequencing reads. When Illumina launched its HiSeq 2000 platform in 2010 it offered 2x150 bp paired-end sequencing data with high accuracy and ultra-high throughput at an unmatched price point. This quickly lead to market dominance as the output was soon increased to 3 billion reads (~1 Terabase) per run. The subsequent release of the MiSeq system - a benchtop instrument which offered more convenient and rapid sequencing for small scale studies - further diminished the value of competing technologies. Continued development of the HiSeq product line has steadily brought down the cost of sequencing the human genome to $1000 USD, which was long proclaimed as the price-point where truly widespread utilization of

Page 21: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

13

DNA sequencing technologies would become reality. In addition to the four-color sequencing chemistry, Illumina has released systems based a two-fluorophore chemistry; in effect cutting run times in half albeit with some compromises on read length and output quality for non-heterogeneous libraries. Doubling down on the production-scale generation of short-length sequencing reads, Illumina introduced the NovaSeq system in 2018, which following upcoming releases is expected to output a maximum of 20 billion reads of 2x150 bp paired-end reads (~6 Terabases) per run, enabling parallel sequencing of ~48 human genomes at 30X coverage in two days.

The irony of this short-read sequencing revolution is that although it has made DNA sequencing available to many research laboratories, the actual comprehension of the biological data has to a large degree been overlooked by the industry. With a single NovaSeq instrument, production facilities will soon have the potential to sequence >8000 human genomes and generate ~1 petabyte of data per year. Such numbers pose serious issues for simply storing the data, nonetheless being able to extract biologically relevant information from it. Already back in 2008, the usefulness of short read sequencing data was questioned due to issues of it not being able to resolve repetitive regions and large structural variants found in the human genome (44). Since then, the merit of this concern has materialized with the release of long read sequencing platforms and the showcasing of medical relevance which such information offers. Long-range sequence information significantly simplifies the assembly of genome and enables large portions of a genome to be phased, i.e. variants are resolved to individual haplotypes. The technological developments presented in this thesis all aim to generate more valuable sequencing information; by utilizing the accuracy and throughput of short-read sequencing technologies while preserving the contiguity of large DNA molecules through linkage of reads originating from single molecules.

Page 22: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

14

Long read sequencing platforms, in contrast to previous technologies which rely on clonal amplification of template molecules, attempts to read the sequence information of single DNA molecules. Unsurprisingly, sequencing at the single molecule level is difficult due to a low signal-to-noise ratio and lack of allowance for stochastic errors; the consequence of which is reflected by relatively high error profiles. In theory, the effects of lower sequencing accuracy could be addressed by means of a higher sequencing coverage, but since the detection of single molecules require more technically advanced solutions it has proven difficult to scale such technologies. Currently, the two viable alternatives on the market is that of Pacific Biosciences (PacBio) and Oxford Nanopore Technologies. The PacBio technology (45) relies on tethering single polymerase molecules loaded with single DNA molecules at the bottom of zeptoliter-sized wells. Optical detection of incorporated nucleotides with fluorescent labels (attached to the phosphate group) is then achieved in real time by a CCD camera. The camera is focused on the bottom of the wells and is fine-tuned so that the prolonged presence and subsequent release of the fluorescently labelled phosphate group, as opposed to free nucleotides in the solution, can be classified as incorporated bases. The error rate is estimated around 13% (46), but with added preparatory steps the DNA molecules can be circularized to reduce the error rate through multiple passes. For their latest instrument, the Sequel system, PacBio claims average read lengths of 20 kilobases (kb) with 10 gigabases (Gb) output per run (47). Clearly the development of such a platform is an amazing technological accomplishment, but in terms of generating sequencing data the system is expensive to run and the throughout is insufficient for sequencing a whole human genome.

Oxford Nanopore Technologies (ONT) has brought a completely new approach to DNA sequencing (48), removing the need for bulky systems with optical detection and instead packing

Page 23: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

15

the whole system in a USB-powered device that fits in your hand. Their MinION device currently features a polymer membrane with up to 512 protein nanopores through which long DNA molecules can translocate. An electric potential is applied over the membrane and alterations in the current is analysed in real-time as the DNA is fed through the pore by a motor protein, at a rate of 450 bases per second (49). As the thickness of the pores are equivalent to ~5 nucleotides, the base calling is based on processing of k-mers rather than single nucleotides. Since its release in 2014, the technology has started to mature in terms of computationally being able predict the sequencing information of the molecule, but the process is still error prone with an estimated systematic error rate of ~15% (50). Neither ONT nor the community of users have settled on which base calling software works the best, and it is likely to be influenced by the sample. Unlike other sequencing devices, the run time is not predetermined but rather controlled by the user who is free to stop the sequencing when sufficient data has been obtained. The lifespan of the pores however, is limited as they lose activity over time by getting blocked or destroyed. ONT claims a yield of 10-20 Gb for the MinION device (at a cost of $1,000), yet a recent study (49) to assemble the human genome using only nanopore sequencing reads utilized 39 flow cells with an average yield of only 2.3 Gb per flow cell and an average read length of ~6,000 bp. To obtain sufficient coverage to attempt assembly of the human genome, an additional 14 flow cells had to be utilized in the study. Their final assembly covered 95.2% of the GrCh38 human reference genome and featured reads with an average 92% identity to the reference after correction.

Oxford Nanopore have recently released the GridION instrument, a slightly larger box which houses five MinION flow cells and a dedicated computation unit, and is expected to release a benchtop solution named PromethION with 48 larger flow cells at the end of 2018. Although ONT have stated it as a goal to bring the sequencing cost of a human genome down to ~$1000 USD per

Page 24: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

16

human genome, the reality of such a claim has yet to materialize. The most promising aspect of the platform is its potential to be used as a portable device for sequencing in the field, and even though the preparation of samples for sequencing remains a hurdle for this to be viable in reality, ONT have recently announced the ongoing development of solutions to these issues. Also, since the detection signal of the nanopore technology is not dependent on DNA nucleotides per se, it has the potential to be used for direct RNA sequencing (51) and DNA-methylation (52) (epigenetic) studies as well, even if more development is needed before such uses become mainstream features of the technology. The use of the ONT platform has been explored in Paper III.

With current state of affairs, long read sequencing platforms are not well-suited for human genomics on a genome-wide scale, due low throughput and sequencing accuracy when compared to the Illumina platform. Instead, approaches which rely on barcode linked reads have emerged as promising alternatives (53-57), drawing upon the benefits of high accuracy and throughput of short read sequencing platforms while maintaining the ability to generate long-range sequence information. The foremost commercial platform for this purpose, is that from 10x Genomics (55). Their technology features a microfluidic device for generating emulsions on a disposable chip, where long DNA fragments and uniquely barcoded gel beads are loaded into droplets. In the latest release of this platform (10x Chromium Genome), the barcoded bead library features a complexity of 4 million unique barcodes. Inside the droplets, gel beads are dissolved and a proprietary chemistry enables mutually exclusive barcodes to be linked to multiple positions throughout the original molecule (typically 50-200 kb in size). The system loads ~1 ng of DNA (~10 million 100 kb fragments) which are separated into ~1 million droplets. Each droplet thus contains multiple genomic fragments (around ten), but the amount is limited to minimize the risk of compartmentalizing multiple copies of the same genomic loci. The

Page 25: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

17

biggest issue of the 10x platform is the cost of simply preparing libraries for sequencing. Although the upside of linked-read data is significant, implementation of the technology has been limited to large-scale laboratories where preparing 8 libraries in parallel at a cost of $2200 USD is acceptable. The cost of sequencing such samples on Illumina’s platform is currently below $1000 USD per library, and in contrast ordinary library preparation kits are priced around $15-$60 USD per sample (58).

Although barcode-linked read technologies do not generate continuous sequence information or reads with a specified order; it is clear that with sufficient coverage across the genome such data is useful in resolving large structural variants and assembling haplotype-resolved genomes without a reference. However, the high cost of generating libraries with the 10x Genomics platform means that it has not gained widespread use despite significant interest from the research community. Furthermore, as new competitors to Illumina in the field of DNA sequencing continue to emerge, the cost of sequencing the human genome is expected to drop to ~$100 USD in the coming years (59). The consequence of continued reductions in sequencing costs is thus likely to be one where the 10x Genomics remains an unattractive solution for whole genome haplotyping. The technology presented in Paper IV addresses this issue, by detailing a new library preparation method which enables the contiguity of short reads to be preserved on a genome-wide scale for only $19 USD per sample.

Page 26: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

19

CHAPTER FOUR | The art of trial and error

If you try to fail, and succeed, which have you done? - George Carlin

Sequencing of DNA requires the controlled manipulation of nucleic acids out of their natural state into something which can be read by instruments - a process which is referred to as ‘library preparation’. Much like actual libraries, the term comes from the fact that sequencing samples typically constitute a massive variety of DNA molecules, similar to books in that they in turn are made up of letters, albeit of the genetic code. Manipulation of DNA at the molecular level utilizes a diverse toolbox of enzymes; such as polymerases, ligases, transposases, restriction enzymes, kinases and nucleases. Although such variety of molecular functions opens up a myriad of technical possibilities, it requires an intricate understanding of these functions and how the enzymatic activity is affected by experimental conditions. In essence, educated guesses are made in regards to how nucleic acids will respond to given reaction conditions, typically influenced by the sequence constitution, physical interaction with enzyme(s), the reaction buffer (i.e. the effect of various salts and stabilizing additives), temperature and reaction times. A straightforward example of this is PCR, where even though most polymerases have already been optimized for use with a supplied reaction buffer, the outcome of the reaction is highly dependent on the design (length and sequence composition) of the oligonucleotides (primers) and the temperatures used in the thermal cycling protocol. The experimental work of developing and optimizing new assays is thus largely based on trial and error. As expected from such a strategy, learning to interpret the abundance of negative results is often the most useful skill set one can adopt.

Page 27: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

20

As previously cited, it is the progression of molecular techniques which provide the foundation for new discoveries and subsequent new ideas. The role of method development in molecular biology is of utmost importance even if in the short term it often does not lead to biological insights of medical value. As exemplified by the progression of techniques presented in this thesis, lots of small improvements have enabled an incremental development and advancement in terms of scale and scientific impact. Such is typically the way of a technologist. It is seldom that new technologies arise from novel discoveries, such as a new protein with functions that have yet to be exploited; although such cases are typically ones to have the largest impact. The beauty of immense complexity which embodies the natural world means that new discoveries can come from the most remote or exotic lifeforms on earth, as even though their genome may not resemble the human genome in the slightest, their evolutionary dependence on DNA means it can be applied to manipulate our own genome. Some prominent examples of this include the isolation of thermostable Taq-polymerase from Thermus aquaticus bacteria living in underwater thermal vents (33), or more recently the engineering of the prokaryotic CRISPR/Cas machinery to enable selective cleavage of DNA for genome editing (60).

The processing of nucleic acids in its natural state to a digital form of sequence information begins with DNA extraction. At its core, DNA extraction entails the mechanical or chemical degradation (lysis) of cell membranes, followed by the separation of DNA from proteins and RNAs using selective binding or degradation and aggregation of unwanted debris. The method by which DNA is extracted is of particular interest for applications aiming to sequence long stretches of DNA, as it requires more meticulous protocols to conserve the integrity of large DNA fragments. Once a pure sample of DNA has been obtained, the material is typically amplified in one of many ways using polymerases to enrich for specific regions of a genome.

Page 28: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

21

Polymerases and ligases are both frequently used to add sequencing adapters (short oligonucleotides of predetermined sequences), which are required for a sequencing readout to be obtained. Standard PCR is however not well suited for applications like whole genome amplification where sequence variety of the human genome makes it a burdensome endeavor. Specialty enzymes like the Φ29 polymerase have instead been utilized for such purposes as it features high processivity and strand displacement capability that enables isothermal random amplification with non-specific hexamer oligonucleotides that can bind (hybridize) to the genome at even intervals (61). Another benefit of the Φ29 polymerase is that it features an inherent proofreading activity with a consequent 100-fold reduction in DNA replication errors when compared to Taq-polymerases (62). However, for the amplification of genomic DNA extracted from single cells it is difficult to obtain consistent results since ultra-low input material (~7 pg) typically results in uneven amplification (with allelic dropout). This issue was addressed by MALBAC (63), a technique whereby the genomes of single cells are first amplified in a linear fashion before exponential amplification ensues. It enables the genome to be amplified more evenly by relying on a universal primer sequence to minimize the amplification bias. Reducing the volume of enzymatic reactions has been a popular strategy in molecular assays as it improves the efficiency and throughput while also lowering the cost of reagents (64). Miniaturization of reactions from microliter to nanoliter well plates, and then to picoliter scale emulsion droplets has been a clear trend over the past decade. An advantage of this is improved uniformity of enzymatic reactions as there is less competition between enzymes and its substrate (65). When taken to the extreme, emulsions enable the compartmentalization of single DNA molecules in millions of discrete droplets, all in the combined volume of an ordinary PCR tube. Emulsions are not without their issues though, as generating the droplets (most commonly done

Page 29: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

22

with microfluidic chips) is laborious compared to standard well-plate reactions. Microfluidic chip-based droplet generation works by pumping an aqueous phase (with reagents) though micrometer scale channels, to a T-junction where an oil phase is introduced at perpendicular angles. Since the two liquids are immiscible, when the flow rates are adjusted properly the aqueous phase will bud off sequentially, forming evenly sized droplets that can then be collected. An issue with this approach is that it requires an elaborate set up of highly precise pumps and the production of microfluidic chips. Such equipment can be purchased but tend to be expensive, and the alternative of producing such chips in-house requires expertise that goes beyond what is available in most laboratories (56). The alternative to microfluidics is simply to shake the two phases (water/oil), whereby one ends up with a non-uniform distribution of droplets, but by controlling the shaking frequency the size range can be controlled (as shown in Paper II). In either case, chemical stabilizers known as surfactants are added to the oil or aqueous phase, to prevent coalescence (i.e. merging) of aqueous droplets. A common problem is that droplet coalescence will occur during aggressive treatments such as heating the droplets over 90℃ during PCR (56), which depending on the assay design may or may not be acceptable. The development of techniques presented in this thesis have to large degree entailed optimization of reagents and conditions to yield droplets that remain stable throughout reaction protocols.

As the title of this thesis suggest, barcoding of DNA molecules have played a central part in the assays that have been developed. The premise of barcoding the contents of separate reaction volumes is that the products can be distinguished by sequencing, even after reactions have been pooled and mixed. Barcodes may consist of a sequence of predetermined nucleotides, or randomized sequences that are identified through defined adjacent sequences. In the case of haplotyping, having a shared barcode between PCR products that originate from the same genomic fragment means that the

Page 30: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

23

sequence information can be phased over hundreds of kilobases. High throughput barcoding through PCR requires the delivery of a population of barcoding molecules to each specific reaction compartment. However, synthesizing and manually pipetting a batch of 100,000+ different oligonucleotides into individual reaction wells would be very expensive and practically impossible to execute. A more appealing solution is to utilize barcoded beads, whereby each bead is covered in oligonucleotides that are sufficient for PCR amplification and barcodes that are unique to each bead. However, the production of bead populations of sufficient complexity (i.e. the amount of different barcodes) is far from trivial and has only recently become a common feature of molecular techniques (55, 66-68). The first peer-reviewed publication to utilize such a strategy of bead-based barcoding in emulsion droplets was the technique described in Paper I. This was closely followed by the joint publication of two independently developed technologies for single cell analysis using barcoded beads in droplets (67, 68). These approaches utilized barcoded beads produced by laborious processes of combinatorial extension cycling (68) and split-pool cycling of reverse-directed phosphoramidite synthesis (67). An alternative to barcoded beads is instead to generate barcode populations from oligonucleotides synthesized with randomized sequences. The first group to utilize such a strategy employed a microfluidic device to first generate droplets with single copies of barcodes for amplification by PCR, and then fusing these droplets (on-chip) with individual droplets containing template molecules that needed to be barcoded (56). Simplifying this process by removing the dependency on microfluidic chips, as well as the need to fuse barcode droplets with droplets containing template molecules, is the foundation of the strategy described in Paper II.

The notion of diluting genomic fragments in compartments

and then barcoding the contents of those compartments was first described in 2012 by the company Complete Genomics (53). Their

Page 31: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

24

technique was based on ligating barcoded adaptors to pools of ~500 genomic fragments in 384-well plates, with the premise that by diluting molecules the probability of having both alleles of a gene in the same well was minimal. The technology was initially offered as a service by the company which had also developed their own sequencing technology, but after the company was bought by BGI the system never reached the market (69). A similar technique named Moleculo (70) was obtained by Illumina in 2013 (71), also to offer long read phasing as a service but with limited success. Illumina has since then developed additional techniques based on ‘contiguity preserving transposition’ to yield long-range phasing information - by combining transposase reactions with combinatorial barcoding (54) and later through transposases linked to barcoded beads (57), but neither have yet to be adopted by the scientific community due to laborious and expensive protocols. With the only viable alternative being that of 10x Genomics, it has been a goal throughout my doctoral studies to develop a technique for genome-wide haplotyping that could easily be adapted by anyone in the scientific community. The technique presented in Paper IV is the result of those efforts.

Page 32: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

25

CHAPTER FIVE | Connecting the dots

I'm fascinated by the idea that genetics is digital. A gene is a long sequence of coded letters, like computer information. Modern biology is becoming very much a branch of information technology. - Richard Dawkins

The identification and interpretation of variant genotypes with medically relevant phenotypic properties requires intelligently designed studies and bioinformatic solutions. As previously discussed, the value of immense amounts of short read sequencing data has yet to live up to expectations in this regard. It is the informal policy of Illumina to rely on the greater scientific community (i.e. academia) to tackle these issues in unison. Although the field of bioinformatics has exploded over the past decade, many large-scale studies still focus on expanding our understanding of the true complexity of the human genome (72, 73). It is widely recognized that obtaining a fully resolved human genome reference will require the use of long-range sequencing technologies to minimize assembly gaps and accurately map repetitive regions (74). In particular, the investigation of large structural variants has gained traction with many studies showcasing their importance in complexes diseases (73, 75, 76). After two decades of work to sequence the human genome, the latest release of the human reference genome still contains over 800 gaps (where continuous sequence blocks cannot be conjoined) (74, 77). In an attempt to fill these gaps, a wide variety of technologies have often been combined to alleviate the biases and drawbacks inherent to each (73, 78). Although the bioinformatics process to make sense of such a rich variety of data types is far from trivial, progress is being made to resolve haplotype blocks spanning entire chromosomes (79). We are now at the point where tightly packed

Page 33: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

26

heterochromatic regions (e.g. centromeres and telomeres) of the genome make up the majority of the missing information (74).

As new technologies for haplotype-resolved genomes are becoming more readily available to researchers, the field ought to shift its focus towards assembly of diploid genomes. The wealth of information offered by diploid genomes as opposed to unresolved haplotypes is made evident by the prevalence of heterozygous structural variants and wide-spread allelic differences (80). It is also clear that with such level of detail, robust methods for genome annotation will be needed as the premise of personalized diploid genomes come closer to being reality (81).

The introduction of long read sequencing platforms has

required entirely new assembly methods (82, 83) to deal with the data. The same goes for barcode linked reads, which have only recently begun to gain traction in the field. Some computational solutions exist for dealing with linked reads but few have been widely adopted as of yet. 10x Genomics have developed software to accompany their platform; which are free to use but they are black boxes with little or no use for developers interested in non-standard applications. Fortunately, assembly pipelines which utilize linked reads for de novo assembly of complex human (84) and metagenomic (85) samples do exist and these have both been used in Paper IV.

Page 34: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

27

CHAPTER SIX | Present Investigations

I'm not saying there is no method to our madness, but there is a definite sense of madness to our method. - Self

Paper I | Phasing single DNA molecules by massively parallel barcoding

The year was 2013, in a time just before the sequencing market were to be monopolized by the short-read platform offered by Illumina. The National Genomics Infrastructure (NGI) in Stockholm was still utilizing the 454 GS FLX system, performing library preparations based on emulsion PCR with DNA-coated bead particles. Availability of this infrastructure facilitated the birth of a new idea that would not only come to shape my thesis as a whole, but one that would be frequently featured in the field over the years to come - droplet barcoding. The premise of the article was to use barcoded beads to attach unique barcodes to single template molecules, so that sequencing reads with the same barcode could be linked to a shared molecular origin.

In this study, we showcase an assay for phasing two variable

regions of the 16S ribosomal RNA (rRNA) gene in bacteria. The 16S rRNA gene is frequently targeted for phylogenetic studies as it consists of interspersed regions of highly conserved and highly variable regions that are unique to each sub-species of prokaryotes. The assertion being that with more comprehensive sequencing coverage of each DNA fragment, the characterization of species in metagenomic samples would be improved. Also, assigning the reads from each barcode (i.e. droplet) to single molecules, rather than groups of molecules, meant that we could avoid the risk of false positive phasing calls from multiple molecules corresponding to the same bacterial species.

Page 35: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

28

As illustrated in Figure I, this is achieved in two steps, where in the first step 454 beads undergo emulsion PCR with a custom protocol to produce a library of ~1 million barcoded beads; each one coated with ~10 million copies of a mutually exclusive randomized barcode. The oligonucleotides on the beads also feature custom sequences at the 3’ end; used to target predefined regions of an organism's genome (step two). Once produced, complex libraries of uniquely barcoded beads can thus be used for a wide range of applications. In the second step, a second emulsion PCR reaction is performed with single barcoded beads and single template molecules. In each droplet, the barcode is linked to multiple parts of a template molecule through amplification.

Figure I. Methodological overview

In the paper, we first validated that the generation of barcoded beads was based on monoclonal amplification (for 92.2% of sequenced beads), meaning that single copies of the barcoding oligonucleotide gave rise to the population of barcoded primers on

Page 36: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

29

each bead. In using these beads to phase DNA molecules, we generated sequencing data from 66,000 single DNA molecules in parallel, and 93.6% of the barcode clusters which featured both target regions were shown to be phased. As hypothesized, the results yielded an improvement in our ability to distinguish different sub-classes of bacteria. At the time it was accepted for publication, it was the first study to feature beads for high-throughput barcoding in emulsion droplets.

Paper II | Droplet Barcode Sequencing for targeted linked-read haplotyping of single DNA molecules

The announcement that the 454 sequencing platform would be discontinued in 2016 meant that our strategy for generating barcoded beads, as well as the proprietary reagents used for emulsification, needed to be replaced. The idea to remove the need for barcoded beads and instead perform barcoding with clonally amplified oligonucleotides in droplets was appealing for a number of reasons. At this time, several other groups had developed their own barcoded beads to analyze the transcriptome of single cells. However, the concept of producing barcoded beads ultimately seemed to be nothing but a laborious prerequisite for the intended assay reaction (in our case haplotyping). With a plan in mind, the quest to find an adequate solution for the emulsion reaction began, i.e. one where the droplets would be stable throughout a PCR protocol. Supply of well-functioning surfactants proved hard to come by at first, but once a promising formula had been devised we could begin optimizing the reaction for haplotyping. As a proof of concept, we chose to showcase the phasing of all exons in the HLA-A gene. The HLA gene family plays an integral part in our immune system, as it encodes for the major histocompatibility complexes (MHC) which regulate cellular responses to antigens in our bodies. These genes feature the highest degree of variance in the human genome, and are present over long stretches which

Page 37: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

30

makes them difficult to resolve with short-read sequencing data. HLA alleles are frequently characterized in a measure of histocompatibility to minimize the risk of patient rejection of donor transplants.

The basis of the method is encapsulation of single barcoding oligonucleotides, together with single template molecules, in emulsion droplets formed by shaking. Barcoding of the template molecules is achieved through a two-part emulsion PCR protocol (Figure II). Reaction conditions are set so that barcodes and template molecules are first amplified asymmetrically, yielding single stranded amplicons. The template molecules are simultaneously amplified using multiple pairs of primers, designed so that the clonally amplified barcodes can be coupled to the template amplicons in a secondary step of the same PCR protocol. The coupling reaction is performed at a lower annealing temperature (after template amplification), to ensure an even distribution of different template amplicons. For the HLA application, seven pairs of primers were used, designed to target each of the 8 highly heterogeneous exons of the HLA-A locus.

Figure II. Methodological overview

Similar to the method detailed in Paper I, reads are grouped according to a barcode following standard short read sequencing.

Page 38: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

31

Reads with the same barcode are assumed to stem from the same molecule of origin, enabling phasing of variants across distances spanning far beyond the sequenced read length. By looking at an extended family of related individuals we could resolve the alleles present in each person, profile the single nucleotide variations (SNVs) which defined those alleles, and reconstruct a pedigree chart showing which alleles had been passed on by respective parents of each generation. The assay was used to resolve both alleles from each of 8 different reference individuals in a single reaction. The classification of alleles was validated against external references which despite extensive short-read sequencing were not able to determine the alleles to the same level of specificity as our method.

The concept of droplet barcoding sequencing enables a wide range of biological applications; such as the analysis of single cells, DNA-assisted proteomics, and characterization of exosomes - all of which are projects which have been part of my doctorate studies but have for various reasons not yet materialized into peer-reviewed publications. Paper III | Comprehensive haplotyping of the HLA gene family using nanopore sequencing

With the launch of the Oxford Nanopore sequencing platform, NGI participated in the early access program to evaluate and help develop the technology. This lead to new research opportunities and applications of this technology. Following the haployping of HLA-A in Paper II, it was intriguing to explore nanopore sequencing for this purpose. Obtaining full length sequences would enable phased sequencing and HLA genotyping of the highest resolution, including details of not only exonic variations but those within intronic regions as well. Due to limited sequencing throughput, target genes would however need to be enriched beforehand.

Page 39: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

32

For clinical applications, the accuracy of HLA typing can be of utmost importance for the histocompatibility between transplantation recipients and their donors. A study was designed to expand the haplotyping to six of the most clinically relevant genes, which together make up over 96% of the 18,700 variant alleles currently listed in the IMGT/HLA database (86). These included Class I (HLA-A, HLA-B and HLA-C) and Class II (HLA-DRB1, HLA-DQB1 and HLA-DPB1) genes.

Figure III. Experimental and computational workflow

Samples from 8 reference individuals (from two well-characterized families) underwent long range PCR to amplify the 6 genes from each individual, and amplicons were assigned a sample-specific barcode (Figure III). Following sequencing of all samples with a single MinION flow cell, the data was processed to determine the alleles present in each individual. The results show that DNA fragments below 10 kb are frequently translocated through the pore, enabling the coverage of sequenced bases to be high enough to compensate for the relatively high error profile of the Oxford Nanofore technology. For genes which meet these criteria (Class I genes and HLA-DRB1), the genotyping results were excellent, with 97% of the identified alleles matching that of external references. In addition, the majority of haplotypes were characterized with the highest degree of accuracy (four-field resolution of HLA-typing).

Page 40: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

33

For genes longer than 10 kb (HLA-HQB1 and HLA-DBP1) we observed a significant reduction in sequencing reads spanning the full length of the amplicons. With limited coverage, it became more difficult to generate error-free consensus sequences to be matched against alleles available in the IMGT/HLA database. Still, most of the alleles could be identified, but often with three-field rather than four-field resolution. Overall however, these results support the idea that nanopore sequencing can be a viable clinical solution for fast and accurate HLA-typing. Molecular typing techniques have over the past few years become increasingly prevalent in clinical settings where serological tests have previously been the standard.

Paper IV | Efficient whole genome haplotyping and single molecule phasing with barcode-linked reads.

It is hard to describe the culmination of 5 years of developments and dreams in one paper. From the very beginning of my pursuit towards a doctorate degree, it has been my intention to devise a method capable of phasing reads in a non-targeted fashion on a genome-wide scale. Multiple systems of genome amplification strategies were proposed, designed and tested before the project finally started to take its final shape. Eventually, we would settle on the use of transposases, which offer the unique capability of fragmenting DNA and simultaneously inserting predetermined sequences of DNA at random positions throughout the genome – a process referred to as tagmentation (87).

Long-range phasing information is required to call structural variants in the human genome, which have been shown to have a greater effect on gene expression than SNVs. However, short-read sequencing technologies are not able to resolve such variants, and users of such technologies are thereby missing out on being able to investigate forms of genetic variation with proven clinical relevance. The technique presented in this paper builds upon the

Page 41: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

34

strategy for simultaneous amplification and coupling of barcodes to template molecules devised in Paper II. By using commercially available bead-linked transposases, long genomic fragments are tagmented at multiple positions throughout the length of the fragments. Covalent bonds are formed between the DNA and the beads; preserving the proximity of DNA fragments from the shared genomic origin (Figure IV). By introducing these DNA-coated beads in emulsion droplets for barcoding, the same barcode can thus be attached to each molecule. After emulsion breakage and sequencing by Illumina, the reads are grouped according to a shared genomic loci and long-range phasing information of the original fragments can be reconstructed. In effect, this enables the benefit of adding long-range phasing information to readily available short read sequencing data.

Figure IV. Methodological overview

Performing the assay without microfluidic devices or barcoded beads means that it can easily be reproduced in laboratories all over the world, for a cost of just $19 USD per library. Furthermore, the throughput and resolution of the method can be tuned by adjusting the amount of input material. Unlike other technologies, our assay is capable of generating libraries from

Page 42: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

35

picograms to nanograms of input material. As shown in the paper, this enables haplotyping of whole human genomes and de novo sequencing of human and bacterial genomes. From a library produced with 25 pg of Escherichia coli DNA we see that the reference-free assembly is improved from 130 scaffolds with standard short read sequences, to only 8 scaffolds when taking the barcode-links into account. Furthermore, the N50 length of reconstructed molecules were over 42 kb, and 97.8% of the 4.6 megabase (Mb) genome could be reconstructed with just two scaffolds.

For the human genome, we showcase phasing of up to 99% of the called SNVs, with an estimated accuracy of 95% as determined by comparison to external datasets. The N50 of phase block lengths was 2.8 Mb, and the longest continuous phase block was 12 Mb. Using external software, we could identify 29 large structural variants (>30 kb) of which 28 (96.6%) could be verified by comparing the results to 10x Genomics data based on the same individual. Finally, we performed a reference-free assembly of the human genome, which despite a modest sequencing coverage of 35X yielded an N50 length of scaffolds of 5.6 Mb, with the longest contig spanning around 48 Mb. Compared to the GRCh38 reference, this assembly covered 85.4% of the human genome with heterochromatic regions (estimated to constitute 5-10% of the human genome) not excluded.

In conclusion, this technology for the first time empowers researchers to explore the advantages of genome-wide haplotyping and de novo sequencing without a significant increase to costs associated with library preparation or sequencing.

Page 43: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

37

CHAPTER SEVEN | Looking to the future

In the year 2020 you will be able to go into the drug store, have your DNA sequence read in an hour or so, and given back to you on a compact disc so you can analyse it. - Walter Gilbert

This quote from 1980 by Gilbert, who shared the Nobel Prize in Chemistry with Sanger in the same year, has turned out to be remarkably accurate. Current technologies are not far from realizing this prediction. Although the ‘compact disc’ would more likely be in the form of a USB memory stick, it is not inconceivable that with improvements in the scale of technologies like nanopore sequencing a market for over-the-counter sequencing could emerge over the coming years. In fact, multiple companies are already offering to sequence your genome for ancestral tracking and evaluating rudimentary predispositions to common diseases. As previously discussed, the challenge is no longer a question of being able to sequence individual genomes, but rather being able to comprehend the data one gets out. It is therefore prudent to consider the impact of widely available sequencing data, given that knowledge of medical predispositions could if not regulated properly, could come to influence our society in negative manner.

Artificial intelligence is likely to be an integral tool for

bioinformatics in the foreseeable future. With massive datasets and wide variability between samples, methods for pattern recognition and machine learning will be key to streamline complex analytical workflows. By enabling new forms of analyses, artificial intelligence could help optimize the time and effort required for large-scale datasets. Such approaches are expected to accelerate biological discoveries of clinical relevance, in particular with regards to pharmacology and personalized medicine.

Page 44: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

38

The technology presented in Paper IV paves the way for applications of sequencing based on miniscule amounts of input material. With assay reactions based on a few picograms, it is not a far reach to envision the possibility of using this strategy for single cell analysis. Having haplotype-resolved genomes from single cells would enable cellular heterogeneity to be studied with previously unattained levels of detail, with the most obvious application being characterization of heterogeneous cancer cells from tumors. Current obstacles of such an assay are likely to be non-destructive and efficient DNA extraction, to avoid allelic dropout and preferably being able to open up heterochromatic regions of the genome. These are however aspects which have been showcased before, and are thus likely to be possible with optimizations of existing protocols. In regards to the emulsion reaction, it is possible that the procedure of cell lysis - to make the genome available for tagmentation - could influence the proposed assay chemistry. Since reagents cannot easily be added or removed from emulsion droplets, a feasible solution to potential issues would be to perform reactions in micro-fabricated wells instead. Fortunately, the chemistry described in Paper IV for amplification and coupling of barcodes to tagmented template molecules should be equally well suited for well plates.

Page 45: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

39

ACKNOWLEDGEMENTS

The wealth of knowledge I have gathered throughput my journey towards a doctorate would not have been reality if it was not for the support of my mentor and friend. Afshin, perhaps the biggest compliment I can give, is that you have successfully transformed me from being pessimistic to cautiously optimistic when it comes to testing new ideas in the lab. Thank you for always having an open door, and for showing such great interest in my results, no matter how miniscule or quick I may have been to disregard them. It has been a pleasure sharing your passion for technology development, and it brings me great satisfaction that we could end this journey with the completion of a project that was little more than an ambitious dream at the beginning.

I am also thankful to you Joakim, for giving me the opportunity and financial support to pursue this doctorate. Your enthusiasm and curiosity for new developments is truly inspirational.

The publications included in this thesis would not have been a reality without the efforts of my co-authors. The time spent discussing things, in and out of the lab, with or without a beer in hand; are memories that will be cherished. For this is I thank you Erik and Tobias. A special note of gratitude also goes out to my partners in the lab with whom I have not had the pleasure of authoring a peer-reviewed paper with. Given the ambitious nature of our projects, I do not think it has been due to a lack of ingenuity or effort. It has truly been a pleasure teaming up with my SciLife sisters in the lab, Mickan and Kim; your personalities are nothing short of remarkable, and whatever the future holds for you both I am confident that any business or organization will benefit greatly from having you in it.

Page 46: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

40

If one were to ask me at the start of my PhD, who of all my colleagues I would come to value the most, it would not have been you Sanja. It took me the better part of two years to get through that tough skin of yours, but to my surprise you have transformed into someone I’m proud to call a good friend. True integrity is hard to come by, and I appreciate you always saying what’s on your mind, despite of the consequences it may bring. I would offer you some words of encouragement for your continued career in science, but I think it’s clear to everyone you don’t need it. The fuse has already been lit and one can only hope you join you among the stars in the future.

The extension of my gratitude would not be complete without the mention of friends and colleagues who made SciLife such a great place to work at. For this I thank you Emelie, Maja, José, Fredrik, Rapolas, Stefania, Nemanja, Zaneta, Konstantin, Joseph, Ludvig B, Ludvig L, Anders, Linnea, Patrik, Jonas, Britta, Linda, Eva, Sami, Mahya, Hooman, Pontus, Annelie, Yue, Johannes, Yilin, Simon, Emanuela and Mengxiao. Your contribution to the social aspects of work cannot be overstated, thank you all for the poker evenings, team-building events, conference trips, beer club and fika breaks. As acknowledged in the papers, my work would not have been possible without the support from NGI Stockholm. I am grateful for you always taking time to help, in particular for bringing my samples to the front of the queue for sequencing. To Max, Phil, Remi-Andre, Christian, Bahram, Pär, Lina, Sebastian, Joel, Szilveszter, Mattias and Chuan; It has been pleasure working with you all.

Last but not least, this acknowledgement would not be complete without a mention of my family. To my parents, my loving wife Olga and daughter Patricia; thank you for your support. Patricia, at the age of 4 you were likely the youngest person to ever start a DNA sequencing instrument. This thesis will already be published by the time you are born Estelle, but I am sure we will find a record for you to break as well.

Page 47: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

41

REFERENCES

1. Visscher, P. M., Brown, M. A., McCarthy, M. I. & Yang, J. Five years of GWAS discovery. American Journal of Human Genetics 90, 7–24 (2012).

2. Mendel, G., Corcos, A. F. & Monaghan F. V. Gregor Mendel's Experiments on Plant Hybrids: A Guided Study. Rutgers University Press (1993).

3. Edelson, E. Gregor Mendel, and the roots of genetics. Oxford portraits in science (1999).

4. Mendel, G. Versuche uber Plflanzenhybriden. Verh. natf. ver. Brunn. abh. 3, S3 (1866).

5. Dahm, R. Friedrich Miescher and the discovery of DNA. Developmental Biology 278, 274–288 (2005).

6. Pray, L. A. Discovery of DNA Structure and Function: Watson and Crick. Nat. Educ. 1, 6 (2008).

7. Elson, D. & Chargaff, E. On the desoxyribonucleic acid content of sea urchin gametes. Experientia 8, 143–145 (1952).

8. Chargaff, E., Lipshitz, R. & Green, C. Composition of the desoxypentose nucleic acids of four genera of sea-urchin. J. Biol. Chem. 195, 155–160 (1952).

9. Watson, J. D. & Crick, F. H. C. Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid. Nature 171, 737–738 (1953).

10. Crick, F. On protein synthesis. Symp. Soc. Exp. Biol. 12, 138–163 (1958).

11. Crick, F. Central dogma of molecular biology. Nature 227, 561–563 (1970).

12. Crick, F. H. C., Barnett, L., Brenner, S. & Watts-Tobin, R. J. General nature of the genetic code for proteins. Nature 192, 1227–1232 (1961).

13. Yanofsky, C. Establishing the triplet nature of the genetic code. Cell 128, 815-818 (2007).

14. Rosenow, C., Saxena, R. M., Durst, M. & Gingeras, T. R. Prokaryotic RNA preparation methods useful for high density array analysis: comparison of two approaches. Nucleic Acids Res. 29 (2001).

15. Le Page, M. At least 75 per cent of our DNA really is useless junk after all. New Scientist Ltd. (2017). Available at <https://www.newscientist.com/article/ 2140926-at-least-75-per- cent-of-our-dna-really-is-useless-junk-after-all/>

16. Zhang, K. et al. Oxygenation of the Mesoproterozoic ocean and the evolution of complex eukaryotes. Nat. Geosci. 11, 345-250 (2018).

17. Graur, D. An upper limit on the functional fraction of the human genome. Genome Biol. Evol. 9, 1880-1885 (2017).

Page 48: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

42

18. Zimmer, C. Is Most of Our DNA Garbage? The New York Times Company (2015). Available at <https://www.nytimes.com/2015/03/08/magazine/is-most-of-our-dna-garbage.html>

19. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).

20. T. Hubbard et al. The Ensembl genome database project. Nucleic Acids Res. 30, 38–41 (2002).

21. Benson, D. A. et al. GenBank. Nucleic Acids Res. 30, 17–20 (2002).

22. The International HapMap Consortium, The International HapMap Project. Nature 426, 789–796 (2003).

23. Wheeler, D. A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872-876 (2008).

24. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).

25. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol. 5 (2007).

26. Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27, 677-685 (2017).

27. Begley, S. The Human Genome Was Never Completely Sequenced. Scientific American (2017). Available at <https://www.scientificamerican.com/article/the-human-genome-was-never- completely-sequenced/>

28. Gilham, P. T. & Khorana, H. G. Studies on Polynucleotides. I. A New and General Method for the Chemical Synthesis of the C5′-C3′ Internucleotidic Linkage. Syntheses of Deoxyribo-dinucleotides. J. Am. Chem. Soc. 80, 6212-6222 (1958).

29. Letsinger, R. L. & Mahadevan, V. Oligonucleotide Synthesis on a Polymer Support. J. Am. Chem. Soc. 87, 3526-3527 (1965).

30. Letsinger, R. L. & Ogilvie, K. K. Synthesis of Oligothymidylates via Phosphotriester Intermediates. J. Am. Chem. Soc. 91 (1969).

31. McBride, L. J. & Caruthers, M. H. An investigation of several deoxynucleoside phosphoramidites useful for synthesizing deoxyoligonucleotides. Tetrahedron Lett. 24 (1983).

32. Saiki, R. K. et al. Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science 239, 487-491 (1988).

33. Chien, A., Edgar, D. B. & Trela, J. M. Deoxyribonucleic acid polymerase from the extreme thermophile Thermus aquaticus. J. Bacteriol. 127, 1550-1557 (1976).

34. Sanger, F. & Tuppy, H. The amino-acid sequence in the phenylalanyl chain of insulin. I. The identification of lower peptides from partial hydrolysates. Biochem. J. 49, 463-481 (1951).

Page 49: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

43

35. Sanger, F. & Tuppy, H. The Amino-acid Sequence in the Phenylalanyl Chain of Insulin 2. The investigation of peptides from enzymic hydrolysates. Biochem. J. 49, 481-490 (1951).

36. Barrell, B. G. & Sanger, F. The sequence of phenylalanine tRNA from E. coli. FEBS Lett. 3, 275-278 (1969).

37. Sanger, F. et al. Nucleotide sequence of bacteriophage φX174 DNA. Nature 265, 687-695 (1977).

38. Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. 74, 5463-54-67 (1977).

39. Nyrén, P., Pettersson, B. & Uhlén, M. Solid phase DNA minisequencing by an enzymatic luminometric inorganic pyrophosphate detection assay. Anal. Biochem. 208, 171-175 (1993).

40. Ronaghi, M., Uhlén, M. & Nyrén, P. A sequencing method based on real-time pyrophosphate. Science 281 (1998).

41. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376-380 (2005).

42. Adessi, C. et al. Solid phase DNA amplification: characterisation of primer attachment and amplification mechanisms. Nucleic Acids Res. 28 (2000).

43. Fedurco, M., Romieu, A., Williams, S., Lawrence, I. & Turcatti, G. BTA, a novel reagent for DNA attachment on glass and efficient generation of solid-phase amplified DNA colonies. Nucleic Acids Res. 34 (2006).

44. Wadman, M. James Watson’s genome sequenced at high speed. Nature (2008). Available at <https://www.nature.com/news/2008/080416/full/452788b.html>

45. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science 323, 133–138 (2009).

46. Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 17, 333-351 (2016).

47. PacBio Sequel System. Pacific Biosciences 2018. Available at <https://www.pacb.com/products-and-services/sequel-system/>

48. Clarke, J. et al. Continuous base identification for single-molecule nanopore DNA sequencing. Nat. Nanotechnol. 4, 265–270 (2009).

49. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).

50. Laver, T. et al. Assessing the performance of the Oxford Nanopore Technologies MinION. Biomol. Detect. Quantif. 3, 1–8 (2015).

51. Garalde, D. R. et al. Highly parallel direct RNA sequencing on an array of nanopores. Nat. Methods 15 (2018).

52. Simpson, J. T. et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods 14, 407-410 (2017).

Page 50: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

44

53. Peters, B.A. et al. Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature 487, 190–195 (2012).

54. Amini, S. et al. Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing. Nat. Genet. 46, 1343–1349 (2014).

55. Zheng, G.X.Y. et al. Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat. Biotechnol. 34, 303–311 (2016).

56. Lan, F., Haliburton, J. R., Yuan, A. & Abate, A. R. Droplet barcoding for massively parallel single-molecule deep sequencing. Nat. Commun. 7 (2016).

57. Zhang, F. et al. Haplotype phasing of whole human genomes using bead-based barcode partitioning in a single tube. Nat. Biotechnol. 35, 852–857 (2017).

58. Hadfield, J. Choosing an NGS library prep kit provider. CoreGenomics (2014). Available at <http://core-genomics.blogspot.com/2014/04/choosing-ngs-library -prep-kit-provider.html>

59. Herper M. Illumina Promises To Sequence Human Genome For $100 -- But Not Quite Yet. Forbes (2017). Available at <https://www.forbes.com/sites/ matthewherper/2017/01/09/illumina-promises-to-sequence-human-genome-for-100-but-not-quite-yet/>

60. Cong, L. et al. Multiplex genome engineering using CRISPR/Cas systems. Science 339 (2013).

61. Dean F.B. et al. Comprehensive human genome amplification using multiple displacement amplification. Proc. Natl. Acad. Sci. 99, 5261–5266 (2002).

62. Motley, S. T. et al. Improved Multiple Displacement Amplification (iMDA) and Ultraclean Reagents. BMC Genomics 15, 443 (2014).

63. Zong, C., Lu, S., Chapman, A. R. & Xie, X. S. Genome-Wide Detection of Single-Nucleotide and Copy-Number Variations of a Single Human Cell. Science 338, 1622-1626 (2012).

64. Fu, Y. et al. Uniform and accurate single-cell sequencing based on emulsion whole-genome amplification. Proc. Natl. Acad. Sci. 112, 11923-11928 (2015).

65. De Bourcy, C. F. A. et al. A quantitative comparison of single-cell whole genome amplification methods. PLoS One 9 (2014).

66. Borgström, E. et al. Phasing of single DNA molecules by massively parallel barcoding. Nat. Commun. 6 (2015).

67. Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).

68. Klein, A. M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).

69. Karow, J. BGI Halts Revolocity Launch, Cuts Complete Genomics Staff as Part of Strategic Shift. GenomeWeb (2015). Available at <https://www.genomeweb.com/ sequencing-technology/bgi-halts-revolocity-launch-cuts-complete-genomics-staff-part-strategic-shift>

Page 51: Phasing single DNA molecules with barcode linked ...kth.diva-portal.org/smash/get/diva2:1249410/FULLTEXT01.pdftechnology is utilized to generate a haplotype-resolved human genome,

45

70. Voskoboynik, A. et al. The genome sequence of the colonial chordate, Botryllus schlosseri. Elife 2 (2013).

71. Press Release. Illumina Announces Phasing Analysis Service for Human Whole-Genome Sequencing. Illumina (2013). Available at <https://on.mktw.net/ 2Pr3Pw8>

72. Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27, 677–685 (2017).

73. Weissensteiner, M. H. et al. Combination of short-read, long-read, and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications. Genome Res. 27, 697-708 (2017).

74. Phillippy, A. M. New advances in sequence assembly. Genome Res. 27 (2017).

75. Feuk, L., Carson, A. & Scherer, S. Structural variation in the human genome. Nat Rev Genet 7, 85–97 (2006).

76. Moncunill, V. et al. Comprehensive characterization of complex structural variations in cancer by directly comparing genome sequence reads. Nat. Biotechnol. 32, 1106–1112 (2014).

77. Schneider V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849-864 (2017).

78. Seo J. S. et al. De novo assembly and phasing of a Korean human genome. Nature 538, 243–247 (2016).

79. Edge P., Bafna V. & Bansal V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801-812 (2017).

80. Aleman, F. The necessity of diploid genome sequencing to unravel the genetic component of complex phenotypes. Front. Genet. 8 (2017).

81. Fiddles, I. T. et al. Comparative Annotation Toolkit (CAT)—simultaneous clade and personal genome annotation. Genome Res. 28, 1029-1038 (2017).

82. Koren S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722-736 (2017).

83. Vaser R., Sović I., Nagarajan N. & Šikić M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737-746 (2017).

84. Weisenfeld, N. I., Kumar, V., Shah, P., Church, D. M. & Jaffe, D. B. Direct determination of diploid genome sequences. Genome Res. 27, 757–767 (2017).

85. Bishara, A. et al. Strain-resolved microbiome sequencing reveals mobile elements that drive bacterial competition on a clinical timescale. bioRxiv 125211 (2018). doi: 10.1101/125211

86. Robinson, J. et al. The IPD and IMGT/HLA database: allele variant databases. Nucleic Acids Res 43, 423-431 (2015).

87. Adey, A. & Shendure, J. Ultra-low-input, tagmentation-based whole-genome bisulfite sequencing. Genome Res. 22, 1139-1143 (2012).