Quantnet Basics: Visualization, Similarity, Text Mining

Lukas Borke, Wolfgang Karl Härdle

Ladislaus von Bortkiewicz Chair of Statistics
C.A.S.E. – Center for Applied Statistics and Economics
Humboldt-Universität zu Berlin
http://lvb.wiwi.hu-berlin.de
http://www.case.hu-berlin.de
Motivation

Transparency and Reproducibility

- Quantnet – open access code-sharing platform
  - Quantlets: program codes (R, MATLAB, SAS), various authors
  - QuantNetXploRer
Popularity
Figure 1: Quantlet downloads by year and country
Visualization
Figure 2: Quantlets from Statistics of Financial Markets (SFE) and Applied Multivariate Statistical Analysis (MVA)
Research Goals
- Visualization
  - Quantlets
  - Clusters
  - Relationships
- Data Mining
  - Similarity
  - Semantic structure
  - Text Mining – Keyword Extracting
Outline
1. Motivation
2. Interactive Structure
3. Vector Space Model (VSM)
4. Empirical results
5. Keyword Extracting
6. Conclusion
Interactive Structure
- Searching parameters: Quantlet name, Description, Datafile, Author
- Data types: R, Matlab, SAS
Integrated exploring and navigating
Figure 3: Quantlet MVAreturns containing the search term “time series”
Figure 4: All Quantlets in QuantNetXploRer, search term “time series”
Vector Space Model (VSM)
- Model structure
  - Text to Vector: weighting scheme, similarity, distance
  - Basic VSM
  - Generalized VSM
  - LSI – Latent Semantic Indexing
Text to Vector
- $D = \{d_1, \ldots, d_n\}$ – set of documents
- $T = \{t_1, \ldots, t_m\}$ – dictionary, i.e., the set of all different terms occurring in Quantnet
- $\mathrm{tf}(d, t)$ – absolute frequency of term $t \in T$ in document $d \in D$
- $\mathrm{idf}(t) \stackrel{\mathrm{def}}{=} \log(|D| / n_t)$ – inverse document frequency, with $n_t = |\{d \in D \mid t \in d\}|$
- $w(d) = \{w(d, t_1), \ldots, w(d, t_m)\}$, $d \in D$ – documents as vectors in an $m$-dimensional space
- $w(d, t_i)$ – calculated by a weighting scheme
Weighting scheme, Similarity, Distance

- Salton et al. (1994): the tf-idf weighting scheme $w(d, t)$ for $t \in T$ in $d \in D$:

\[
w(d, t) = \frac{\mathrm{tf}(d, t)\,\mathrm{idf}(t)}{\sqrt{\sum_{j=1}^{m} \mathrm{tf}(d, t_j)^2\,\mathrm{idf}(t_j)^2}}, \qquad m = |T|
\]

- (normalized tf-idf) Similarity $S$ of two documents:

\[
S(d_1, d_2) = \sum_{k=1}^{m} w(d_1, t_k) \cdot w(d_2, t_k) = w(d_1)^{\top} w(d_2)
\]

- A frequently used distance measure is the Euclidean distance:

\[
\mathrm{dist}(d_1, d_2) \stackrel{\mathrm{def}}{=} \sqrt{\sum_{k=1}^{m} \{w(d_1, t_k) - w(d_2, t_k)\}^2}
\]
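The weighting and similarity formulas above can be sketched numerically (a minimal illustration with NumPy; the toy term-frequency matrix is invented, not from Quantnet):

```python
import numpy as np

def tfidf_weights(tf):
    """Normalized tf-idf vectors w(d) for a |D| x |T| term-frequency matrix."""
    n_docs = tf.shape[0]
    idf = np.log(n_docs / (tf > 0).sum(axis=0))          # idf(t) = log(|D| / n_t)
    w = tf * idf
    return w / np.linalg.norm(w, axis=1, keepdims=True)  # denominator of w(d, t)

# toy corpus: 3 documents over 4 terms
tf = np.array([[2, 1, 0, 0],
               [0, 1, 3, 0],
               [1, 0, 0, 2]], dtype=float)
W = tfidf_weights(tf)
S = W @ W.T   # S(d_i, d_j) = w(d_i)^T w(d_j); diagonal is 1 by normalization
```

Because the vectors are normalized, the similarity is exactly the cosine of the angle between the document vectors.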
Example 1: German children’s rhymes

Let $D = \{d_1, d_2, d_3\}$ be the set of documents/rhymes:

Rhyme 1: “Hänschen klein ging allein in die weite Welt hinein.”
(Little Hans went alone out into the wide world.)
d_1 = {hanschen, klein, ging, allein, in, die, weite, welt, hinein}

Rhyme 2: “Backe, backe Kuchen, der Bäcker hat gerufen.”
(Bake, bake a cake, the baker has called.)
d_2 = {backe, kuchen, der, backer, hat, gerufen}

Rhyme 3: “Die Affen rasen durch den Wald. Der eine macht den andern kalt.”
(The monkeys race through the forest; one knocks the other out.)
d_3 = {die, affen, rasen, durch, den, wald, der, eine, macht, andern, kalt}
This implies:
T = {hanschen, klein, ging, allein, in, die, weite, welt, hinein,
     backe, kuchen, der, backer, hat, gerufen,
     affen, rasen, durch, den, wald, eine, macht, andern, kalt}
  = {t_1, ..., t_24}

Hence, |D| = 3, |T| = 24.
Figure 5: Weighting vectors of the 3 rhymes in a radar chart
With the weighting vectors above we get the similarity matrix:

\[
M_S = \begin{pmatrix} 1 & 0 & 0.014 \\ 0 & 1 & 0.014 \\ 0.014 & 0.014 & 1 \end{pmatrix}
\]

And the distance matrix:

\[
M_D = \begin{pmatrix} 0 & \sqrt{2} & 1.405 \\ \sqrt{2} & 0 & 1.405 \\ 1.405 & 1.405 & 0 \end{pmatrix}
\]
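The structure of these matrices can be reproduced in Python (a sketch using the term sets as listed; the exact off-diagonal value depends on tokenization and logarithm base, so small deviations from the rounded slide entries 0.014 and 1.405 may occur):

```python
import numpy as np

docs = [  # term sets of the three rhymes, as listed above
    ["hanschen", "klein", "ging", "allein", "in", "die", "weite", "welt", "hinein"],
    ["backe", "kuchen", "der", "backer", "hat", "gerufen"],
    ["die", "affen", "rasen", "durch", "den", "wald", "der", "eine", "macht", "andern", "kalt"],
]
terms = sorted({t for d in docs for t in d})
tf = np.array([[d.count(t) for t in terms] for d in docs], dtype=float)

idf = np.log(len(docs) / (tf > 0).sum(axis=0))
w = tf * idf
w /= np.linalg.norm(w, axis=1, keepdims=True)   # normalized tf-idf vectors

MS = w @ w.T                                    # similarity matrix
MD = np.sqrt(np.maximum(2 - 2 * MS, 0))         # ||w1 - w2||^2 = 2 - 2 S for unit vectors
```

Rhymes 1 and 2 share no terms, so their similarity is exactly 0 and their distance exactly $\sqrt{2}$; the small positive entries come from the shared terms “die” and “der”.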
Basic VSM
- vertical vector $d$, indexed by terms – document representation
- matrix $D = [d_1, \ldots, d_n]$ – document corpus representation, also called “term by document” matrix
- considering linear transformations $P$ we get a general similarity:

\[
S(d_1, d_2) = (P d_1)^{\top} (P d_2) = d_1^{\top} P^{\top} P d_2
\]

- every mapping $P$ defines another VSM
- $M_S = D^{\top} (P^{\top} P) D$ – similarity matrix
Example 2: tf and tf-idf similarities in BVSM

- with $P = I_m$ and $d = \{\mathrm{tf}(d, t_1), \ldots, \mathrm{tf}(d, t_m)\}^{\top}$ we get the classical tf-similarity:

\[
M_S^{\mathrm{tf}} = D^{\top} D
\]

- with diagonal $P^{\mathrm{idf}}(i, i) = \mathrm{idf}(t_i)$ and $d = \{\mathrm{tf}(d, t_1), \ldots, \mathrm{tf}(d, t_m)\}^{\top}$ we get the classical tf-idf-similarity:

\[
M_S^{\mathrm{tf\text{-}idf}} = D^{\top} (P^{\mathrm{idf}})^{\top} P^{\mathrm{idf}} D
\]
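A quick numerical check of these two matrix forms (a sketch with an invented random term-by-document matrix, not Quantnet data):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.integers(0, 3, size=(6, 4)).astype(float)   # 6 terms x 4 documents (tf counts)

# tf-similarity: P = I_m
M_tf = D.T @ D

# tf-idf-similarity: diagonal P_idf with P_idf(i, i) = idf(t_i)
n_t = np.maximum((D > 0).sum(axis=1), 1)            # guard for terms absent everywhere
P_idf = np.diag(np.log(D.shape[1] / n_t))
M_tfidf = D.T @ P_idf.T @ P_idf @ D

# both are pairwise dot products of the (transformed) document vectors
assert np.allclose(M_tfidf, (P_idf @ D).T @ (P_idf @ D))
```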
Drawbacks of BVSM
- Uncorrelated/orthogonal terms in the feature space
- Documents must have common terms to be similar
- Sparseness of document vectors and similarity matrices

Question
- How to incorporate information about semantics?

Solution
- Using statistical information about term-term correlations
- Semantic smoothing
Generalized VSM – term-term correlations

- $S(d_1, d_2) = (D^{\top} d_1)^{\top} (D^{\top} d_2) = d_1^{\top} D D^{\top} d_2$ – the GVSM similarity
- $M_S = D^{\top} (D D^{\top}) D$ – similarity matrix
- $D D^{\top}$ – term by term matrix, having a nonzero $ij$ entry if and only if there is a document containing both the $i$-th and the $j$-th terms
- terms become semantically related if they co-occur often in the same documents
- also known as a dual space method (Sheridan and Ballerini, 1996)
- when there are fewer documents than terms – dimensionality reduction
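The effect can be demonstrated with a tiny invented corpus: two documents without any common terms become GVSM-similar through a third “bridging” document (a sketch):

```python
import numpy as np

# term-by-document matrix: 4 terms x 3 documents
# doc0 uses terms {0,1}, doc1 uses terms {2,3}, doc2 uses {1,2} (the "bridge")
D = np.array([[1, 0, 0],
              [1, 0, 1],
              [0, 1, 1],
              [0, 1, 0]], dtype=float)

G = D @ D.T              # term-by-term matrix, nonzero ij iff terms i, j co-occur
MS = D.T @ G @ D         # GVSM similarity matrix

# BVSM sees no similarity between doc0 and doc1 (no shared terms) ...
assert (D.T @ D)[0, 1] == 0
# ... but GVSM does, because terms 1 and 2 co-occur in doc2
assert MS[0, 1] > 0
```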
Generalized VSM – Semantic smoothing

- A more natural method of incorporating semantics is to directly use a semantic network
- Miller et al. (1993) used the semantic network WordNet
- Term distance in the hierarchical tree provided by WordNet gives an estimation of their semantic proximity
- Siolas and d’Alché-Buc (2000) have included the semantics into the similarity matrix by handcrafting the VSM matrix $P$
- $M_S = D^{\top} (P^{\top} P) D = D^{\top} P^2 D$ – similarity matrix
LSI – Latent Semantic Indexing

- LSI measures semantic information through co-occurrence analysis (Deerwester et al., 1990)
- Technique – singular value decomposition (SVD) of the matrix $D = U \Sigma V^{\top}$
- $P = U_k^{\top} = I_k U^{\top}$ – projection operator onto the first $k$ dimensions
- $M_S = D^{\top} (U I_k U^{\top}) D$ – similarity matrix
- It can be shown that $M_S = V \Lambda_k V^{\top}$, with $D^{\top} D = V \Sigma^{\top} U^{\top} U \Sigma V^{\top} = V \Lambda V^{\top}$ and $\Lambda_{ii} = \lambda_i = \sigma_i^2$ the eigenvalues of $D^{\top} D$; $\Lambda_k$ consists of the first $k$ eigenvalues and zeros elsewhere.
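The LSI construction can be sketched with NumPy’s SVD (an illustration on an invented random matrix; $k$ is the number of latent dimensions kept):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.random((8, 5))     # term-by-document matrix: 8 terms, 5 documents
k = 2                      # latent dimensions kept

U, s, Vt = np.linalg.svd(D, full_matrices=False)
P = U[:, :k].T                     # P = U_k^T, projection onto first k dimensions
MS = (P @ D).T @ (P @ D)           # M_S = D^T U_k U_k^T D

# equivalently M_S = V Lambda_k V^T with lambda_i = sigma_i^2
Lam_k = np.zeros(len(s)); Lam_k[:k] = s[:k] ** 2
assert np.allclose(MS, Vt.T @ np.diag(Lam_k) @ Vt)
```

The resulting similarity matrix has rank $k$, which is the source of the semantic smoothing: documents are compared through the $k$ dominant co-occurrence directions only.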
Empirical results
3 Models for the QuantNet
- Models – BVSM, GVSM and LSI
- Dataset – the whole Quantnet
- Documents – 1580 Quantlets
Figure 6: Model characteristics
Figure 7: Quantiles of similarity values of 3 models
- Blue dots – BVSM; green dots – GVSM; red line – LSI
Sparseness results
                 BVSM      GVSM      LSI
Absolute number  2452444   2214064   2083526
Relative number  0.982     0.887     0.835

Matrix dimension: 2496400 (= 1580 × 1580)

Table 1: Model performance regarding the number of zero-values in the similarity matrix.
Keyword Extracting
Index Term Selection I
Goal: decrease the number of words for indexing, so that only the selected keywords describe the documents (Deerwester et al., 1990; Witten et al., 1999).

A simple method for keyword extracting is based on entropy. For all $t \in T$ the entropy is defined as:

\[
W(t) = 1 + \frac{1}{\log_2 |D|} \sum_{d \in D} P(d, t) \log_2 P(d, t),
\qquad \text{with } P(d, t) = \frac{\mathrm{tf}(d, t)}{\sum_{l=1}^{n} \mathrm{tf}(d_l, t)}
\]
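A sketch of this entropy measure (the function name `term_entropy` is ours, not from the slides; the convention $0 \cdot \log 0 = 0$ is applied):

```python
import numpy as np

def term_entropy(tf):
    """Entropy-based keyword weight W(t) for a |D| x |T| term-frequency matrix.

    W(t) = 1 + (1 / log2 |D|) * sum_d P(d, t) log2 P(d, t),
    with P(d, t) = tf(d, t) / sum_l tf(d_l, t).
    """
    n_docs = tf.shape[0]
    P = tf / tf.sum(axis=0, keepdims=True)   # P(d, t)
    plogp = np.where(P > 0, P * np.log2(np.where(P > 0, P, 1.0)), 0.0)  # 0*log 0 := 0
    return 1 + plogp.sum(axis=0) / np.log2(n_docs)

# a term concentrated in one document gets W = 1 (a good index term),
# a term spread uniformly over all documents gets W = 0
tf = np.array([[2.0, 1.0],
               [0.0, 1.0]])
W = term_entropy(tf)
```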
Index Term Selection II
The entropy serves as a measure of the importance of a word in the given domain context:

W(t) is high ⇒ prefer this t as an index term.
An index term selection method (fixed number of index terms) is discussed in “Experiments in Term Weighting and Keyword Extraction in Document Clustering” (Borgelt et al., 2004).
Conclusion
Conclusion I
- Similarity and Distance available for extended Visualization
- Different weighting scheme approaches and Vector Space Models allow adapted similarity-based Text Searching
- Incorporating term-term Correlations and Semantics significantly improves the comparison performance
- More automation and quality through Index Term Selection
Conclusion II
Text Mining offers more models and methods, such as:

- Classification
- Clustering
- Latent Dirichlet Allocation (LDA) topic model
- TopicTiling

They are worth researching and applying to the Quantnet.
References
Borgelt, C. and Nürnberger, A.
Experiments in Term Weighting and Keyword Extraction in Document Clustering
LWA, pp. 123-130, Humboldt-Universität zu Berlin, 2004
Bostock, M., Heer, J., Ogievetsky, V. and community
D3: Data-Driven Documents
available on d3js.org, 2014

Chen, C., Härdle, W. and Unwin, A.
Handbook of Data Visualization
Springer, 2008
Elsayed, T., Lin, J. and Oard, D. W.
Pairwise Document Similarity in Large Collections with MapReduce
Proceedings of the 46th Annual Meeting of the Association of Computational Linguistics (ACL), pp. 265-268, 2008

Feldman, R. and Dagan, I.
Mining Text Using Keyword Distributions
Journal of Intelligent Information Systems, 10(3), pp. 281-300, DOI: 10.1023/A:1008623632443, 1998

Gentle, J. E., Härdle, W. and Mori, Y.
Handbook of Computational Statistics
Springer, 2nd ed., 2012
Hastie, T., Tibshirani, R. and Friedman, J.
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Springer, 2nd ed., 2009

Härdle, W. and Simar, L.
Applied Multivariate Statistical Analysis
Springer, 3rd ed., 2012

Hotho, A., Nürnberger, A. and Paass, G.
A Brief Survey of Text Mining
LDV Forum, 20(1), pp. 19-62, available on www.jlcl.org, 2005
Salton, G., Allan, J., Buckley, C. and Singhal, A.
Automatic Analysis, Theme Generation, and Summarization of Machine-Readable Texts
Science, 264(5164), pp. 1421-1426, DOI: 10.1126/science.264.5164.1421, 1994

Witten, I., Paynter, G., Frank, E., Gutwin, C. and Nevill-Manning, C.
KEA: Practical Automatic Keyphrase Extraction
DL ’99 Proceedings of the fourth ACM conference on Digital libraries, pp. 254-255, DOI: 10.1145/313238.313437, 1999
Appendix
Data Mining (DM)

DM is the computational process of discovering and representing patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

1. Numerical DM
2. Visual DM
3. Text Mining (applied to considerably weaker structured text data)
Text Mining
Text Mining or Knowledge Discovery from Text (KDT) deals with the machine-supported analysis of text (Feldman et al., 1995).

It uses techniques from:
- Information Retrieval (IR)
- Information Extraction
- Natural Language Processing (NLP)

and connects them with the methods of DM.
Similarity, Distance, Data Mining – Overview

1. Find a formal representation of the Quantlets
2. Find a similarity measure on the space of Quantlets
3. Afterwards the construction of a distance measure is simple:

\[
\mathrm{distance}(x, y) = \sqrt{\mathrm{sim}(x, x) + \mathrm{sim}(y, y) - 2 \cdot \mathrm{sim}(x, y)}
\]

Having similarity and distance ⇒ a vast amount of Data Mining, Text Mining and Visualization techniques becomes available.
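This construction translates directly into code (a sketch; the helper name `sim_to_dist` is ours, applied here to the rounded similarity matrix of Example 1):

```python
import numpy as np

def sim_to_dist(S):
    """Distance matrix from a similarity matrix via
    distance(x, y) = sqrt(sim(x,x) + sim(y,y) - 2 sim(x,y))."""
    diag = np.diag(S)
    D2 = diag[:, None] + diag[None, :] - 2 * S
    return np.sqrt(np.maximum(D2, 0))   # clip tiny negatives from rounding

S = np.array([[1.0,   0.0,   0.014],
              [0.0,   1.0,   0.014],
              [0.014, 0.014, 1.0]])
MD = sim_to_dist(S)   # off-diagonals: sqrt(2) and sqrt(2 - 0.028) ~ 1.404
```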
Distance measure
A frequently used distance measure is the Euclidean distance:

\[
\mathrm{dist}(d_1, d_2) \stackrel{\mathrm{def}}{=} \mathrm{dist}\{w(d_1), w(d_2)\} \stackrel{\mathrm{def}}{=} \sqrt{\sum_{k=1}^{m} \{w(d_1, t_k) - w(d_2, t_k)\}^2}
\]

It holds for tf-idf:

\[
\cos\varphi = \frac{x^{\top} y}{|x| \cdot |y|} = 1 - \frac{1}{2}\, \mathrm{dist}^2\!\left(\frac{x}{|x|}, \frac{y}{|y|}\right),
\]

where $x/|x|$ means $w(d_1)$, $y/|y|$ means $w(d_2)$ and $\varphi$ is the angle between $x$ and $y$.
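The identity follows by expanding the squared Euclidean distance of the unit vectors $u = x/|x|$ and $v = y/|y|$:

```latex
\mathrm{dist}^2(u, v) = \|u - v\|^2
  = \|u\|^2 + \|v\|^2 - 2\, u^{\top} v
  = 2 - 2\cos\varphi
\quad\Longrightarrow\quad
\cos\varphi = 1 - \tfrac{1}{2}\,\mathrm{dist}^2(u, v).
```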
Figure 8: Algorithm for computing the pairwise similarity matrix

Remark: postings(t) denotes the list of documents that contain term t.

The idea: a term contributes to the similarity between two documents only if it has non-zero weights in both.
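Since Figure 8 itself is not reproduced here, the postings-list idea can be sketched on a single machine (function and variable names are ours; Elsayed et al. (2008) describe the MapReduce version):

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarity(weights):
    """Accumulate S(d1, d2) = sum_t w(d1, t) * w(d2, t) via postings lists.

    weights: dict doc_id -> dict term -> weight."""
    # build postings(t): documents with a non-zero weight for term t
    postings = defaultdict(list)
    for doc, wd in weights.items():
        for term, w in wd.items():
            if w != 0:
                postings[term].append((doc, w))
    # a term contributes only to pairs of documents in its postings list
    sim = defaultdict(float)
    for plist in postings.values():
        for (d1, w1), (d2, w2) in combinations(plist, 2):
            sim[(d1, d2)] += w1 * w2
    return dict(sim)
```

Document pairs without a common term never meet in any postings list, so they are simply absent from the result, which is exactly what makes the algorithm efficient on sparse data.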
3 Models on 3 Datasets
- Models – BVSM, GVSM and LSI
- Datasets – 2 books, 1 project from Quantnet
- Project 1 – TEDAS: Tail Event Driven Asset Allocation (micro size – 4 Qlets)
- Book 1 – BCS: Basic Elements of Computational Statistics (low size – 48 Qlets)
- Book 2 – SFE: Statistics of Financial Markets (medium size – 337 Qlets)
Figure 9: Model characteristics of TEDAS
Figure 10: Quantiles of similarity values of 3 models on TEDAS
- Blue dots – BVSM; green line – GVSM; red line – LSI
Figure 11: Model characteristics of BCS
Figure 12: Quantiles of similarity values of 3 models on BCS
- Blue dots – BVSM; green line – GVSM; red line – LSI
Figure 13: Model characteristics of SFE
Figure 14: Quantiles of similarity values of 3 models on SFE
- Blue dots – BVSM; green line – GVSM; red line – LSI
Sparseness results
            TEDAS   BCS    SFE      MVA*   STF*   SFS*
BVSM        8       504    108668   75424  44576  17146
GVSM        8       0      96940    71464  44204  16612
LSI         8       262    84262    65712  43952  15400
Matrix Dim  16      2304   113569   77841  45369  18225

Table 2: Model performance regarding the number of zero-values in the similarity matrix. MVA*, STF* and SFS* were additionally examined.