Inferring Functional Groups from Microbial Gene Catalogue with Probabilistic Topic Models
Xin Chen1, TingTing He2, Xiaohua Hu1, Yuan An1, Xindong Wu3
1 College of Information Science and Technology, Drexel University, Philadelphia, PA 19104, USA
2 Dept. of Computer Science at Central China Normal University, Wuhan, China
3 Department of Computer Science, University of Vermont, Burlington, VT, USA

Backgrounds: Genomics
• Genomics refers to the analysis of genomes. A genome can be thought of as the complete set of DNA sequences that codes for the hereditary material that is passed on from generation to generation.
• These DNA sequences include all of the genes (the functional and physical unit of heredity passed from parent to offspring) and transcripts (the RNA copies that are the initial step in decoding the genetic information) included within the genome.
• Thus, genomics refers to the sequencing and analysis of all of these genomic entities, including genes and transcripts, in an organism.

Backgrounds: GenBank and NCBI
• In recent years, GenBank and NCBI have grown along with advances in gene sequencing technology.

Backgrounds: annotating algorithms
• With the growth of GenBank and NCBI, many annotation algorithms have been developed to match genomic sequences to GenBank/NCBI standard references and attach meta-information to the sequences.

Backgrounds: meta-information
• The annotated meta-information involves hierarchical data such as NCBI Taxonomy and Gene Ontology.

Challenges: Metagenomics
• With fast-advancing sequencing techniques, large amounts of sequenced genomes and meta-genomes from uncultured microbial samples (microbes) have become available.
• The goal of metagenomics is to study the genome-wide gene-expression data from uncultured environmental samples (such as the ocean, soil and human body) and understand the underlying biological processes.

Research Questions
What are the major research questions of our study?
• We use our data mining framework to investigate the following questions:
1) Given a large number of genome fragments from a microbial sample, what genomes are there?
• Answering this question requires mapping the meta-genomic reads to taxonomic units (usually by homology-based sequence alignment; this task is also known as taxonomic classification or taxonomic analysis).
2) What are the major functions of these genomes?
• Answering this question involves annotating the major functional units (such as signal transduction, metabolic capacity and gene regulation) at the genome level (a.k.a. functional analysis).
Our research objective:
• We aim to develop a new method that can analyze the genome-level composition of DNA sequences, in order to characterize a set of common genomic features shared by the same species and identify their functional roles.

Related topics in this presentation:
• Structural annotation and protein encoding regions
• Homology-based functional analysis
• Topic Models…

Structural annotation and protein encoding regions
• Structural annotation
– Annotating the regions of known open reading frames (ORFs), non-coding genes (rRNA, tRNA, miRNA), promoters and UTRs in the DNA sequences

Structural annotation and protein encoding regions (continued)
• NCBI standard reference sequences have detailed structural annotations of both non-protein encoding regions (such as tRNA) and protein encoding regions (CDS), as well as the corresponding gene names (if applicable). The GenBank accession number of each reference sequence is available in each NCBI online query.

Functional analysis - overview
• Functional analysis
– Uncover the major gene functions related to the genomic sequences
– Requires explaining the biochemical activity (a.k.a. molecular function) of the gene product and identifying the biological process to which the gene or gene product contributes (including information about enzymes, pathways and metabolic capabilities related to the gene).

Homology-based functional analysis (Richter and Huson, 2009)
• A homology-based approach has recently been introduced to achieve functional annotation for metagenomic reads (Richter and Huson, 2009).
• The framework begins with the homology-based BLASTX algorithm to match the metagenomic fragments against the reference sequences in the NCBI database.
• The BLASTX hits associate fragments with related protein IDs and gene names. The Gene Ontology (GO) database is then used to map the associated gene names to the corresponding GO terms, which provides an overview of gene functions and products for the metagenomic fragments.

Homology-based functional analysis (Richter and Huson, 2009)
Figure: GO terms obtained from database identifier mapping (Richter and Huson, 2009)

Limitations with Homology-based Functional Analysis Methods
1. Homology-based approaches rely heavily on the result of local sequence alignment (such as BLAST and BLASTX) to the known open reading frames (ORFs).
– A BLAST-like local alignment may return either hundreds of hits or no hits, depending on the E-value threshold used. In the latter case, the current methods are unable to provide any functional annotation. In the former case, they usually lack a proper tie-breaker to further reduce the hits, which makes the functional annotation somewhat ambiguous (with hundreds of probable explanations).
2. The homology-based functional annotation methods do not provide any insight into the “major” functional capabilities of genomes (such as which gene functions are more commonly shared by strains from the same species), as there is no priority among the annotated GO terms.

Topic Modeling - Intuitive
• Intuitive
– Assume the data we see is generated by some parameterized random process.
– Learn the parameters that best explain the data.
– Use the model to predict (infer) new data, based on data seen so far.
Example passage:
Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.
Key words: sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image, Hubel, Wiesel

Notations
• Word
– Basic unit.
– Item from a vocabulary indexed by {1, . . . ,V}.
• Document
– Sequence of N words, denoted by w = (w1,w2, . . . ,wN).
• Collection
– A total of D documents, denoted by C = {w1,w2, . . . ,wD}.
• Topic
– Denoted by z, the total number is K.
– Each topic has its unique word distribution p(w|z)

Background & Existing Techniques of Generative Latent Topic Models
• The Naïve Bayesian model
$z^* = \arg\max_z p(z \mid w)$, with $p(z \mid w) \propto p(z)\, p(w \mid z)$
– $z^*$: word-topic decision
– $p(z)$: prior probability of topic z
– $p(w \mid z)$: likelihood of word w given topic z
• The probabilistic latent semantic indexing (PLSI) model (Hofmann, 2001)
– Assumption: each document has a mixture of k topics.
– Fitting the model involves estimating the topic-specific word distributions p(wi|zk) and document-specific topic distributions p(zk|dj) from the corpus via maximum likelihood estimation (MLE).

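The naïve Bayes word-topic decision can be illustrated with a minimal sketch. The topic names, priors and word likelihoods below are invented for illustration and are not estimates from this study.

```python
# Toy naive Bayes word-topic decision: z* = argmax_z p(z) * p(w|z).
# All probabilities are made-up illustrative values.
prior = {"vision": 0.6, "genomics": 0.4}  # p(z)
likelihood = {                            # p(w|z)
    "vision": {"retina": 0.30, "gene": 0.01},
    "genomics": {"retina": 0.02, "gene": 0.25},
}


def best_topic(word):
    """Return the topic z that maximizes p(z) * p(word|z)."""
    return max(prior, key=lambda z: prior[z] * likelihood[z][word])


print(best_topic("retina"))  # vision
print(best_topic("gene"))    # genomics
```

Because the normalizing constant p(w) is the same for every topic, comparing the unnormalized products is enough to make the decision.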
Latent Dirichlet Allocation (LDA) Model (Blei, 2003)
• In the PLSI model, the topic mixture probabilities p(zk|dj) for documents are fixed once the model is estimated. For a new document, the model needs to be re-estimated, so PLSI does not scale to unseen documents.
• The LDA model treats the probability of latent topics for each document p(z|d) and the conditional probability of words for each latent topic p(w|z) as latent random variables, which are subject to change when a new document arrives.
Model:
$\theta_d \sim \mathrm{Dir}(\alpha)$
$p(z_j \mid d) \sim \mathrm{Multi}(\theta_d)$
$p(w_i \mid z_j) \sim \mathrm{Multi}(\varphi_j)$
$\varphi_j \sim \mathrm{Dir}(\beta)$
Word-topic posterior (W: vocabulary size, T: number of topics):
$p(z_{w_i} = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto \frac{n^{w_i}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{d_i}_{-i,j} + \alpha}{n^{d_i}_{-i,\cdot} + T\alpha}$

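The generative view of LDA can be sketched in a few lines. This is a toy illustration under stated assumptions: symmetric Dirichlet priors, word ids instead of real functional elements, and hypothetical sizes `T` and `W`; it is not the authors' implementation.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_document(n_words, phi, alpha, rng):
    """theta_d ~ Dir(alpha); for each word: z ~ Multi(theta_d), w ~ Multi(phi_z)."""
    n_topics = len(phi)
    vocab_size = len(phi[0])
    theta = sample_dirichlet(alpha, n_topics, rng)
    words = []
    for _ in range(n_words):
        z = rng.choices(range(n_topics), weights=theta)[0]
        w = rng.choices(range(vocab_size), weights=phi[z])[0]
        words.append(w)
    return theta, words


rng = random.Random(0)
T, W = 3, 8  # hypothetical number of topics and vocabulary size
phi = [sample_dirichlet(0.5, W, rng) for _ in range(T)]  # phi_j ~ Dir(beta), beta = 0.5
theta, doc = generate_document(20, phi, alpha=0.5, rng=rng)
```

Model estimation runs this process in reverse: given only the observed word ids, it infers plausible values for `theta` and `phi`.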
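LDA Model Estimation - Gibbs Sampling Monte Carlo process (Griffiths, 2004)
Probability of a topic being assigned to a word given other observations:
$p(z_{w_i} = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(w_i \mid z_{w_i} = j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \cdot p(z_{w_i} = j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i})$
with
$p(w_i \mid z_{w_i} = j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) = \int p(w_i \mid z_{w_i} = j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}, \varphi_j)\, p(\varphi_j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i})\, d\varphi_j = \frac{n^{w_i}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}$
$p(z_{w_i} = j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) = \int p(z_{w_i} = j \mid \theta_d)\, p(\theta_d \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i})\, d\theta_d = \frac{n^{d}_{-i,j} + \alpha}{n^{d}_{-i,\cdot} + T\alpha}$
in which
$p(w_i \mid z_{w_i} = j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}, \varphi_j) = \varphi_j^{(w_i)}$
Since
$p(\varphi_j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \varphi_j) \cdot p(\varphi_j)$
$p(\theta_d \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \theta_d) \cdot p(\theta_d)$
and
$p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \varphi_j) \sim \mathrm{Multi}(\varphi_j)$, $p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \theta_d) \sim \mathrm{Multi}(\theta_d)$, $p(\theta_d) \sim \mathrm{Dir}(\alpha)$ and $p(\varphi_j) \sim \mathrm{Dir}(\beta)$,
it follows that
$p(\varphi_j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \sim \mathrm{Dir}(\beta + n^{w_i}_{-i,j})$
$p(\theta_d \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \sim \mathrm{Dir}(\alpha + n^{d}_{-i,j})$
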
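The Gibbs update on this slide can be turned into a compact sampler. The sketch below is a plain collapsed Gibbs sampler for standard LDA (without the background topic introduced later), assuming documents are given as lists of integer word ids; the function name and default hyperparameters are illustrative choices, not values from the study.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA on docs given as lists of word ids."""
    rng = random.Random(seed)
    n_wt = [[0] * n_topics for _ in range(vocab_size)]  # word-topic counts
    n_dt = [[0] * n_topics for _ in docs]               # document-topic counts
    n_t = [0] * n_topics                                # tokens per topic
    z = []                                              # topic of every token
    for d, doc in enumerate(docs):                      # random initialization
        z.append([])
        for w in doc:
            t = rng.randrange(n_topics)
            z[d].append(t)
            n_wt[w][t] += 1
            n_dt[d][t] += 1
            n_t[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the current token so counts exclude position i.
                n_wt[w][t] -= 1
                n_dt[d][t] -= 1
                n_t[t] -= 1
                # Unnormalized posterior; the document-side denominator is
                # the same for every topic j and is therefore dropped.
                weights = [
                    (n_wt[w][j] + beta) / (n_t[j] + vocab_size * beta)
                    * (n_dt[d][j] + alpha)
                    for j in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                n_wt[w][t] += 1
                n_dt[d][t] += 1
                n_t[t] += 1
    return z, n_wt, n_dt


docs = [[0, 1, 0, 1], [2, 3, 2, 3]]  # two toy documents over a 4-word vocabulary
z, n_wt, n_dt = gibbs_lda(docs, n_topics=2, vocab_size=4)
```

After sampling, `n_dt` and `n_wt` can be normalized (with the priors added) to obtain estimates of the document-topic and topic-word distributions.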
Monte Carlo process
• Given the word-topic posterior probability, the Monte Carlo process becomes straightforward: it is similar to throwing a die (given the probability of each face appearing) to determine the assignment of topics to each word for the next round.
Given probability for each word:
$p(z_{w_i} = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}), \quad j = 1 \ldots K$
New topic assignment for each word.

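The "throwing a die" step amounts to drawing an index from a discrete distribution. A minimal sketch using inverse-CDF sampling follows; the helper name `draw_topic` is hypothetical.

```python
import random


def draw_topic(probabilities, rng):
    """Draw a topic index j with probability proportional to probabilities[j]."""
    threshold = rng.random() * sum(probabilities)
    cumulative = 0.0
    for j, p in enumerate(probabilities):
        cumulative += p
        if threshold < cumulative:
            return j
    return len(probabilities) - 1  # guard against floating-point round-off


rng = random.Random(1)
counts = [0, 0, 0]
for _ in range(1000):
    counts[draw_topic([0.2, 0.5, 0.3], rng)] += 1
```

The weights do not need to be normalized, which is convenient because the Gibbs posterior is only known up to a constant.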
Statistical relationships of words and topics

An example of topic assignment to words

Experiments
Experiment: Inferring Functional Groups from Microbial Gene Catalogue with Topic Models
• In our experiment, based on the functional elements derived from the non-redundant CDs catalogue, we show that the configuration of functional groups in meta-genome samples can be inferred by probabilistic topic modeling.
• Probabilistic topic modeling is a Bayesian method that can extract useful topical information from unlabeled data. When used to study microbial samples, the functional elements (including taxonomic levels, indicators of gene orthologous groups and KEGG pathway mappings) are analogous to ‘words’.
• Estimating the probabilistic topic model can uncover the configuration of functional groups (the latent topics) in each sample, which may be further used to study the genotype-phenotype connection of human disease.

Experimental Data Collection
• In our experiment, we apply probabilistic topic modeling to identify functional groups from the human gut microbial community data generated by [Qin, et al. 2010], which is openly accessible via http://gutmeta.genomics.org.cn/
• The human gut microbial samples from [Qin, et al. 2010] belong to both healthy subjects (HS) and patients with inflammatory bowel disease (IBD). Specifically, the IBD patients are from two different groups, one group with Crohn’s disease (CD), and the other group with ulcerative colitis (UC).
• In total, there are 85 healthy samples, 15 UC samples and 12 CD samples.

Experimental Data Collection (continued)
• According to [Qin, et al. 2010], the Illumina GA reads from human gut microbial samples were first assembled into longer contigs. After that, the Glimmer program was used to predict protein-encoding sequences (CDs) from the assembled contigs.
• The predicted CDs sequences were then aligned to each other to form a non-redundant CDs catalog (a.k.a. minimal gut genome). The non-redundant CDs catalog consists of 3,299,822 non-redundant CDs sequences with an average length of 704 bp.
Example catalog entry:
CDs_id: MH0001
Name: GL0006996_MH0001_[Lack_3'-end]_[mRNA]_locus=scaffold96_9:1:1206:-
Length: 1206
COG/KO: COG4799 K01966
Pathway mapping: map00280, map00640
Taxonomic level: species - Eubacterium eligens

Experimental Data Collection (continued)
• In our experiment, three types of functional elements are derived from the non-redundant CDs catalog: NCBI taxonomic level indicators, indicators of gene orthologous groups and KEGG pathway indicators.
• Given a non-redundant CDs sequence, its NCBI taxonomic level is obtained by BLASTP alignment against the NCBI NR database. The taxonomic level of each non-redundant CDs sequence is determined by a lowest common ancestor (LCA)-based algorithm. The taxonomic abundance data for each sample can then be computed by counting the indicators of NCBI taxonomic levels.
• The gene orthologous group indicators and KEGG pathway indicators are assigned by BLASTP alignment of the amino-acid sequences of the predicted CDs against the eggNOG and KEGG databases.

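The idea behind an LCA-based assignment can be sketched as follows. This is a simplified illustration, assuming each BLASTP hit comes with a root-to-leaf lineage; the lineages below are abbreviated, hypothetical examples, and real LCA tools add score thresholds and other filters that are not shown here.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    common = []
    for ranks in zip(*lineages):
        if all(r == ranks[0] for r in ranks):
            common.append(ranks[0])
        else:
            break
    return common[-1] if common else None


# Hypothetical, abbreviated lineages of three hits for one CDs sequence.
hits = [
    ["Bacteria", "Firmicutes", "Clostridia", "Clostridium"],
    ["Bacteria", "Firmicutes", "Clostridia", "Eubacterium"],
    ["Bacteria", "Firmicutes", "Bacilli", "Bacillus"],
]
print(lowest_common_ancestor(hits))  # Firmicutes
```

When the hits disagree at the genus or class level, the sequence is assigned to the higher rank they share, which is why indicators appear at mixed levels such as genus and phylum.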
Experimental Data Collection (continued)
NCBI Taxonomic Levels
– Genus Clostridium
– Genus Bacteroides
– Phylum Firmicutes
– Class Clostridia
– Genus Bacillus
Orthologous Group Indicators
– COG0463 : Glycosyltransferases involved in cell wall biogenesis
– COG0642 : Signal transduction histidine kinase
– COG1132 : "ABC-type multidrug transport system, ATPase and permease components"
– COG0438 : Glycosyltransferase
KEGG Pathway Indicators
– map00230 : Metabolism_Nucleotide Metabolism_Purine metabolism
– map00240 : Metabolism_Nucleotide Metabolism_Pyrimidine metabolism
– map00350 : Metabolism_Amino Acid Metabolism_Tyrosine metabolism
• The union of unique functional elements jointly defines a fixed word vocabulary. In total, there are 647,136 NCBI taxonomic level indicators, with a vocabulary size of 748; 1,293,764 gene orthologous group indicators, with a vocabulary size of 4667; and 953,493 KEGG pathway indicators, with a vocabulary size of 237.

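Treating functional elements as words means each sample becomes a bag-of-words document over a fixed vocabulary. A minimal sketch, with hypothetical sample names and a handful of made-up annotations:

```python
from collections import Counter

# Hypothetical functional elements annotated on the CDs sequences of two samples.
samples = {
    "sample_A": ["genus_Bacteroides", "COG0463", "map00230", "COG0463"],
    "sample_B": ["genus_Clostridium", "COG0642", "map00230"],
}

# The vocabulary is the union of unique functional elements.
vocabulary = sorted({e for elements in samples.values() for e in elements})
word_id = {element: i for i, element in enumerate(vocabulary)}

# Each sample is represented by the counts of its word ids.
counts = {
    name: Counter(word_id[e] for e in elements)
    for name, elements in samples.items()
}
```

The total number of indicators corresponds to the number of tokens, while the vocabulary size corresponds to the number of distinct indicators.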
Groups of functional elements in microbial community
Given the non-redundant CDs catalog and the derived functional elements, we are interested in identifying the frequent co-occurrence patterns of functional elements (a.k.a. functional groups).

Generative process of proposed model
• Commonly shared functional elements across samples may suggest functional similarity and biological relevance among samples. To capture such information, a genome-wide background distribution of functional elements needs to be estimated, which leads to the introduction of the background topic z0 in topic modeling.

Illustration of the background topic of gene OGs indicators
Background Topic - Indicator of Gene OGs
Gene OGs Indicator | Description | Probability
COG0463 | Glycosyltransferases involved in cell wall biogenesis | 0.00813
COG0642 | Signal transduction histidine kinase | 0.00708
COG0582 | Integrase | 0.00698
COG1132 | ABC-type multidrug transport system, ATPase and permease components | 0.00689
COG0438 | Glycosyltransferase | 0.00664
COG0745 | Response regulators consisting of a CheY-like receiver domain and a winged-helix DNA-binding domain | 0.00644
COG1396 | Predicted transcriptional regulators | 0.00595
COG0577 | ABC-type antimicrobial peptide transport system, permease component | 0.00594
COG2207 | AraC-type DNA-binding domain-containing proteins | 0.00389
COG3250 | Beta-galactosidase/beta-glucuronidase | 0.00344

Illustration of the background topic of KEGG Pathway Indicators
Background Topic - KEGG Pathway Indicator
Pathway Map ID | Description | Probability
map00230 | Metabolism_Nucleotide Metabolism_Purine metabolism | 0.0333
map00051 | Metabolism_Carbohydrate Metabolism_Fructose and mannose metabolism | 0.0264
map00500 | Metabolism_Carbohydrate Metabolism_Starch and sucrose metabolism | 0.0260
map00240 | Metabolism_Nucleotide Metabolism_Pyrimidine metabolism | 0.0222
map00350 | Metabolism_Amino Acid Metabolism_Tyrosine metabolism | 0.0221
map00260 | Metabolism_Amino Acid Metabolism_"Glycine, serine and threonine metabolism" | 0.0220
map00010 | Metabolism_Carbohydrate Metabolism_Glycolysis / Gluconeogenesis | 0.0190
map00620 | Metabolism_Carbohydrate Metabolism_Pyruvate metabolism | 0.0176
map00251 | Metabolism_Amino Acid Metabolism_Glutamate metabolism | 0.0169
map00550 | Metabolism_Glycan Biosynthesis and Metabolism_Peptidoglycan biosynthesis | 0.0168

Uncovered latent topics with respect to NCBI taxonomic indicators
Illustration of the most relevant latent topics with respect to different taxa
Taxon | Topic ID | MI Score | Topic ID | MI Score | Topic ID | MI Score
family_Enterobacteriaceae | Topic 48 | 0.02476 | Topic 121 | 0.00915 | Topic 31 | 0.00279
genus_Clostridium | Topic 50 | 0.01628 | Topic 153 | 0.01001 | Topic 95 | 0.00765
genus_Bacteroides | Topic 156 | 0.03030 | Topic 77 | 0.02018 | Topic 52 | 0.01661
phylum_Bacteroidetes | Topic 132 | 0.00476 | Topic 165 | 0.00260 | Topic 67 | 0.00257
phylum_Firmicutes | Topic 0 | 0.01256 | Topic 99 | 0.00550 | Topic 193 | 0.00212
Discoveries: For each taxon, latent topics are sorted with respect to the mutual information score (MI score). The MI score serves as a relevance measure between taxa and latent topics. It shows that phylum Firmicutes is most relevant to the background topic (Topic 0). Similarly, genus Clostridium is most relevant to Topics 50, 153 and 95, and genus Bacteroides is most relevant to Topics 156, 77 and 52.

Uncovered latent topics with respect to NCBI taxonomic indicators
Illustration of top-ranked latent topics with respect to different microbial samples
MH0001 | p(topic|sample) | O2.UC-1 | p(topic|sample) | V1.CD-1 | p(topic|sample) | …
Topic 0 | 0.475 | Topic 0 | 0.363 | Topic 0 | 0.286 | …
Topic 124 | 0.116 | Topic 95 | 0.101 | Topic 61 | 0.124 | …
Topic 181 | 0.103 | Topic 143 | 0.062 | Topic 12 | 0.116 | …
Topic 159 | 0.040 | Topic 83 | 0.059 | Topic 115 | 0.050 | …
Topic 86 | 0.027 | Topic 65 | 0.056 | Topic 52 | 0.048 | …
Topic 72 | 0.018 | Topic 139 | 0.034 | Topic 32 | 0.037 | …
Topic 19 | 0.017 | Topic 59 | 0.033 | Topic 50 | 0.036 | …
Discoveries: The probability of Topic 0 in the healthy and UC samples (0.475 in MH0001 and 0.363 in O2.UC-1) is higher than that in the CD sample (0.286 in V1.CD-1). This suggests that for CD samples, the proportion of bacteria belonging to phylum Firmicutes is significantly reduced. The prevalence of Topic 95 in sample O2.UC-1 and Topic 52 in sample V1.CD-1 may indicate the existence and possibly high abundance of genus Clostridium and genus Bacteroides, respectively.

Uncovered latent topics with respect to NCBI taxonomic indicators

Summary of Discoveries
• Our findings are supported by recent discoveries in fecal microbiota studies of inflammatory bowel disease (IBD) patients [Gerber, 2007], [Harry S. et al. 2006], [Manichanh C. et al., 2006], [Walker A. et al. 2011].
• It has been reported that there is a significant reduction in the proportion of bacteria belonging to phylum Firmicutes in CD samples, which is consistent with our results.
• This can be explained by the fact that mucosal microbial diversity is reduced in IBD, particularly in CD, which is associated with bacterial invasion of the mucosa. In UC, the inflammation is typically more superficial; therefore, the reduction of phylum Firmicutes in UC is not significant.

Conclusions
• Based on the functional elements derived from the non-redundant CDs catalogue, we have shown that the configuration of functional groups encoded in the gene-expression data of meta-genome samples can be inferred by applying probabilistic topic modeling.
• The latent topics estimated from human gut microbial samples are supported by recent discoveries in fecal microbiota studies, which demonstrates the effectiveness of the proposed method.

Future work
• In the proposed model, the number of functional groups has to be specified in advance, or iteratively tuned by criteria such as log-likelihood and perplexity.
• In future work, we propose to use nonparametric hierarchical Bayesian models (such as the HDP model) to handle the uncertainty in the number of functional groups, which provides the flexibility of modeling microbial sequences with an unknown number of functional groups.

Questions?

Backup Slides
Mutual Information
After estimating the topic model and assigning a latent topic to each functional element, the relevance between latent topics and functional element indicators (i.e. NCBI taxonomic level indicators, indicators of gene orthologous groups and KEGG pathway indicators) can be obtained by calculating the mutual information (MI) between functional element indicators and latent topics, based on the final latent topic assignments to functional elements.
$MI(R_g, Z_t) = p(R_g, Z_t) \log \frac{p(R_g, Z_t)}{p(R_g)\, p(Z_t)}$
in which $R_g$ and $Z_t$ are binary indicator variables corresponding to the functional element and the latent topic, respectively. The variable pair $(R_g, Z_t)$ indicates whether a latent topic has been assigned to a specific functional element.

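A score of this form can be computed directly from the final topic assignments. The sketch below assumes the probabilities are estimated as relative frequencies of assignment counts; that estimator, the function name and the argument names are assumptions, since the slide does not specify them.

```python
import math


def mi_score(n_joint, n_element, n_topic, n_total):
    """p(R,Z) * log(p(R,Z) / (p(R) * p(Z))) with relative-frequency estimates.

    n_joint:   tokens of the functional element assigned to the topic
    n_element: all tokens of the functional element
    n_topic:   all tokens assigned to the topic
    n_total:   all tokens
    """
    if n_joint == 0:
        return 0.0  # convention: a zero joint count contributes nothing
    p_joint = n_joint / n_total
    p_element = n_element / n_total
    p_topic = n_topic / n_total
    return p_joint * math.log(p_joint / (p_element * p_topic))


# Hypothetical counts: the element co-occurs with the topic more than chance.
score = mi_score(n_joint=50, n_element=100, n_topic=100, n_total=1000)
```

The score is close to zero when the element and topic are independent and grows as the topic is assigned to the element more often than chance would predict.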
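Likelihood Comparison
$p(\mathbf{w} \mid \mathbf{z}) = \prod_{t=1}^{T} \left[ \int_{\varphi_{z_t}} p(\mathbf{w} \mid z_t, \varphi_{z_t})\, p(\varphi_{z_t} \mid z_t)\, d\varphi_{z_t} \right]$
$= \left[ \frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}} \right]^{T} \prod_{t=1}^{T} \frac{\prod_{w_i} \Gamma(n_t^{(w_i)} + \beta)}{\Gamma(n_t^{(\cdot)} + W\beta)} \cdot \frac{\Gamma(W\eta)}{\Gamma(\eta)^{W}} \cdot \frac{\prod_{w_i} \Gamma(n_0^{(w_i)} + \eta)}{\Gamma(n_0^{(\cdot)} + W\eta)}$
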
Likelihood Comparison (continued)

Perplexity Comparison
The perplexity is calculated for held-out testing data. In our experiment, we use a 50% subset of the functional elements as training data and the other 50% as testing data.
On constructing the two subsets, we ensure that functional elements from the same sample are equally split between both subsets. In practice, the perplexity is the inverse predicted model likelihood of the held-out testing data, using parameters inferred from the trained topic model. Thus a smaller perplexity value indicates a better model fit.
$\mathrm{perplexity}(D_{test}) = \exp\left[ -\frac{\sum_{j=1}^{D_{test}} \log p(\mathbf{w}_j)}{\sum_{j=1}^{D_{test}} N_j} \right]$

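Given per-document log-likelihoods, the perplexity formula is a one-liner. The sketch below assumes the log-likelihoods `log p(w_j)` have already been computed from the trained model; how they are obtained is outside this example.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-sum_j log p(w_j) / sum_j N_j) over held-out documents."""
    return math.exp(-sum(doc_log_likelihoods) / sum(doc_lengths))


# Sanity check with a uniform model over a 4-word vocabulary:
# every word has probability 1/4, so the perplexity equals 4.
lengths = [3, 5]
log_likelihoods = [n * math.log(1 / 4) for n in lengths]
print(round(perplexity(log_likelihoods, lengths), 6))  # 4.0
```

The uniform case shows why lower is better: a model that predicts held-out words no better than chance has a perplexity equal to the vocabulary size.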
Perplexity Comparison (continued)

Dirichlet Process (DP) as a Non-Parametric Mixture Model
The Dirichlet Process (DP) is defined as a distribution over random probability measures, G0 ~ DP(γ, H), in which γ is a concentration parameter and H is a base measure defined on a sample space Θ. By its definition, for any finite measurable partition of Θ: {A1, …, Ar}, (G0(A1), …, G0(Ar)) ~ Dirichlet(γH(A1), …, γH(Ar)).
The Dirichlet Process can also be constructed by the stick-breaking construction as follows:
$G_0 = \sum_{k=1}^{\infty} \beta_k\, \delta(\theta_k)$
$\beta_k = \alpha_k \prod_{i=1}^{k-1} (1 - \alpha_i), \quad \alpha_k \sim \mathrm{Beta}(1, \gamma)$
in which the weights of the mixture components β = {βk} (k=1,…,∞) are also referred to as β ~ GEM(γ).
Figures: Dirichlet process by its definition; Dirichlet process constructed by stick-breaking construction (data sample xi drawn from a base distribution with associated parameters θk).

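The stick-breaking weights can be simulated with a truncated sketch. The truncation level `n_components` is an approximation introduced here for illustration, since the actual construction is infinite.

```python
import random


def stick_breaking(gamma, n_components, rng):
    """Truncated GEM(gamma): beta_k = alpha_k * prod_{i<k} (1 - alpha_i)."""
    weights = []
    remaining = 1.0  # length of the stick not yet broken off
    for _ in range(n_components):
        alpha_k = rng.betavariate(1.0, gamma)
        weights.append(alpha_k * remaining)
        remaining *= 1.0 - alpha_k
    return weights, remaining


weights, leftover = stick_breaking(gamma=2.0, n_components=50, rng=random.Random(0))
```

A smaller γ tends to concentrate the mass on the first few components, while a larger γ spreads it over many; `leftover` is the mass ignored by the truncation.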
Hierarchical Dirichlet Process (HDP)
The Hierarchical Dirichlet Process (HDP) considers G0 ~ DP(γ, H) as a global probability measure across the corpora and defines a set of child random probability measures Gj ~ DP(α0, G0) for each document j, which leads to different document-level distributions over semantic mixture components: (Gj(A1), …, Gj(Ar)) ~ Dirichlet(α0 G0(A1), …, α0 G0(Ar)).
Each Gj can also be constructed by the stick-breaking construction as:
$G_j = \sum_{k=1}^{\infty} \pi_{jk}\, \delta(\theta_k)$
in which πj = {πjk} (k=1,…,∞) specifies the weights of mixture component indicator k.
Substituting the stick-breaking construction of G0 and Gj, it follows that:
$\left( \sum_{k \in K_1} \pi_{jk}, \ldots, \sum_{k \in K_r} \pi_{jk} \right) \sim \mathrm{Dirichlet}\left( \alpha_0 \sum_{k \in K_1} \beta_k, \ldots, \alpha_0 \sum_{k \in K_r} \beta_k \right)$
Based on the aggregation properties of the Dirichlet distribution and its connection with the Beta distribution, it can be shown that:
$\pi_{jk} = \pi'_{jk} \prod_{l=1}^{k-1} (1 - \pi'_{jl}), \quad \pi'_{jk} \sim \mathrm{Beta}\left( \alpha_0 \beta_k,\ \alpha_0 \left( 1 - \sum_{l=1}^{k} \beta_l \right) \right)$
It then follows that πj ~ DP(α0, β).
Figure: Stick-breaking construction of hierarchical Dirichlet process
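The document-level stick-breaking step can be sketched for a truncated set of global weights. The global weights `beta` below are fixed, hypothetical values rather than draws from GEM(γ), and the small floor on the second Beta parameter is a numerical guard added for this example.

```python
import random


def document_weights(beta, alpha0, rng):
    """pi_jk = pi'_jk * prod_{l<k} (1 - pi'_jl), with
    pi'_jk ~ Beta(alpha0 * beta_k, alpha0 * (1 - sum_{l<=k} beta_l))."""
    weights = []
    remaining = 1.0
    cumulative = 0.0
    for beta_k in beta:
        cumulative += beta_k
        tail = max(1.0 - cumulative, 1e-12)  # keep the Beta parameter positive
        pi_prime = rng.betavariate(alpha0 * beta_k, alpha0 * tail)
        weights.append(pi_prime * remaining)
        remaining *= 1.0 - pi_prime
    return weights, remaining


beta = [0.5, 0.25, 0.125]  # truncated global weights; 0.125 of the mass is left over
pi_j, leftover = document_weights(beta, alpha0=5.0, rng=random.Random(0))
```

A larger α0 keeps each document's weights close to the global weights β, while a smaller α0 lets documents deviate more strongly from them.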