Dmitry Bugaychenko, Eugeny Malytin
Trend detection at OK
Trend detection at a glance
Texts extraction
Input: raw user activity logs in JSON
Output: extracted text and metadata
In-between:
- Unified data collection pipeline: Kafka + Hadoop + Samza
- Different types of objects: posts, photos, videos, comments
- Large volumes: 50 GB of raw data daily, 20 GB after extraction
- Initial filtering applied: documents that are too small are removed
Language detection
Input: a single extracted text
Output: text labeled with its language
In-between:
- Based on the open-source library https://github.com/optimaize/language-detector
- The math is built on top of trigram distributions; 70+ languages supported
- Custom language profiles added for Azerbaijani, Armenian, Georgian, Kazakh, Kyrgyz, Tajik, Turkmen, and Uzbek: https://github.com/denniean/language_profiles
- Language distribution priors are important!
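The idea behind trigram-based detection can be shown with a toy sketch. This is an illustration, not the optimaize implementation: the smoothing constant `eps` and the structure of `profiles`/`priors` are assumptions.

```python
import math
from collections import Counter

def trigrams(text):
    """Character trigrams of a lowercased text."""
    t = text.lower()
    return [t[i:i + 3] for i in range(len(t) - 2)]

def build_profile(corpus):
    """Relative trigram frequencies for one language's training corpus."""
    counts = Counter(tg for text in corpus for tg in trigrams(text))
    total = sum(counts.values())
    return {tg: c / total for tg, c in counts.items()}

def detect(text, profiles, priors, eps=1e-6):
    """Label text with the language maximizing prior-weighted trigram likelihood."""
    best_lang, best_score = None, float("-inf")
    for lang, profile in profiles.items():
        score = math.log(priors.get(lang, eps))  # the prior matters!
        for tg in trigrams(text):
            score += math.log(profile.get(tg, eps))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang
```

The `priors` argument reflects the slide's last point: on a network where most content is in one language, a flat prior misclassifies short texts in closely related languages.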
Tokenization and canonization
Input: text with a language label
Output: token stream
In-between:
- Apache Lucene Analyzers (tokenization, stop-word removal, stemming)
- Profiles for 23 languages available, including Russian, Armenian, and Latvian
- Most ex-USSR languages are still missing: Azerbaijani, Belarusian, Georgian, Kazakh, Kyrgyz, Tajik, Turkmen, Ukrainian, Uzbek, etc.
Dictionary extraction
Input: corpus as a set of token streams
Output: word index (dictionary)
In-between:
- Term frequency limits for inclusion
- The previous day's dictionary is analyzed to keep the indices of common tokens the same
- Large enough to capture multiple languages (1M+ entries)
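A sketch of the index-stability trick (`min_tf` and `max_size` are illustrative parameters, not the production limits): tokens that survive into the new day keep the identifier they had in the previous day's dictionary, so downstream vectors stay comparable across days.

```python
from collections import Counter

def build_dictionary(token_streams, previous=None, min_tf=2, max_size=1_000_000):
    """Word index for the day; tokens present in the previous day's
    dictionary keep their old ids, new tokens get fresh ones."""
    previous = previous or {}
    tf = Counter(tok for stream in token_streams for tok in stream)
    kept = {t for t, c in tf.items() if c >= min_tf}
    # Reuse yesterday's ids for tokens that are still frequent enough.
    dictionary = {t: i for t, i in previous.items() if t in kept}
    # Never reuse a retired id: yesterday's vectors may still reference it.
    next_id = max(dictionary.values(), default=-1) + 1
    for tok in sorted(kept):
        if tok not in dictionary and len(dictionary) < max_size:
            dictionary[tok] = next_id
            next_id += 1
    return dictionary
```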
Vectorization
Input: token stream and dictionary
Output: sparse vector
In-between:
- Raw term frequency vectorization
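Given such a dictionary, raw term-frequency vectorization is a one-liner; a sketch with sparse vectors represented as plain {index: count} dicts:

```python
from collections import Counter

def vectorize(tokens, dictionary):
    """Raw term-frequency sparse vector as {dictionary index: count};
    out-of-dictionary tokens are simply dropped."""
    return dict(Counter(dictionary[t] for t in tokens if t in dictionary))
```

For example, `vectorize(["a", "b", "a", "x"], {"a": 0, "b": 1})` returns `{0: 2, 1: 1}`.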
Deduplication
Input: corpus as a set of vectors
Output: corpus with duplicates removed
In-between:
- Cosine as the similarity measure (> 0.9 => duplicates)
- Random projection hashing to speed up the computation
- 18-bit hash, 50% basis sparsity
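A minimal sketch of the scheme, with sparse vectors as {index: value} dicts. The 18-bit hash and 50% basis sparsity come from the slide; restricting the exact cosine check to vectors that share a hash bucket is the simplification that makes this fast (a production LSH setup would typically use several hash tables to cut the miss rate).

```python
import math
import random

def sparse_cosine(u, v):
    """Cosine similarity of two sparse {index: value} vectors."""
    dot = sum(x * v.get(i, 0.0) for i, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def make_basis(dim, bits=18, sparsity=0.5, seed=0):
    """'bits' sparse random hyperplanes with +-1 entries on ~sparsity of coords."""
    rng = random.Random(seed)
    return [{i: rng.choice((-1.0, 1.0))
             for i in range(dim) if rng.random() < sparsity}
            for _ in range(bits)]

def rp_hash(vec, basis):
    """Sign of each random projection, packed into one integer hash."""
    h = 0
    for plane in basis:
        proj = sum(x * plane.get(i, 0.0) for i, x in vec.items())
        h = (h << 1) | int(proj > 0)
    return h

def deduplicate(vectors, basis, threshold=0.9):
    """Keep one representative per group of near-duplicates; only vectors
    sharing a hash bucket are compared with the exact cosine."""
    buckets, kept = {}, []
    for vec in vectors:
        h = rp_hash(vec, basis)
        if not any(sparse_cosine(vec, other) > threshold
                   for other in buckets.get(h, [])):
            kept.append(vec)
        buckets.setdefault(h, []).append(vec)
    return kept
```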
Current day statistics
Input: filtered corpus as a set of token streams
Output: % of documents in which each term or 2-gram was used
In-between:
- 2-gram addition
- Aggregation
- Absolute filtration
- Different limits for terms and 2-grams
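A sketch of this statistics step; the absolute filtration limits `min_docs_term` and `min_docs_bigram` are placeholder values standing in for the separate term and 2-gram limits mentioned above.

```python
from collections import Counter

def daily_stats(token_streams, min_docs_term=2, min_docs_bigram=2):
    """Fraction of documents containing each term and each 2-gram,
    with separate absolute filtration limits for the two kinds."""
    n = len(token_streams)
    term_df, bigram_df = Counter(), Counter()
    for tokens in token_streams:
        term_df.update(set(tokens))                      # count each doc once
        bigram_df.update(set(zip(tokens, tokens[1:])))   # adjacent 2-grams
    terms = {t: c / n for t, c in term_df.items() if c >= min_docs_term}
    bigrams = {b: c / n for b, c in bigram_df.items() if c >= min_docs_bigram}
    return terms, bigrams
```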
Accumulated state aggregation
Input: current day statistics, previous day's accumulated state
Output: exponentially weighted moving average and variance for terms and 2-grams (the new accumulated state)
In-between:
- Inclusion limit > exclusion limit
- Different limits for terms and 2-grams
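One standard way to maintain such a state incrementally (`alpha` is an assumed smoothing factor; the inclusion/exclusion hysteresis from the slide is omitted for brevity):

```python
def update_state(freq_today, state, alpha=0.3):
    """One step of exponentially weighted moving average/variance per term.
    state maps term -> (ewma, ewmvar); unseen terms start from today's value."""
    new_state = {}
    for term, x in freq_today.items():
        mean, var = state.get(term, (x, 0.0))
        delta = x - mean
        mean = mean + alpha * delta
        var = (1 - alpha) * (var + alpha * delta * delta)
        new_state[term] = (mean, var)
    return new_state
```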
Trending terms identification
Input: exponentially weighted moving average and variance for terms and 2-grams
Output: trending terms and 2-grams with significance
In-between:
- A term is trending when today's frequency exceeds the accumulated average by significantly more than the accumulated standard deviation (see the SigniTrend paper under "More links")
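A sketch of such a significance test in the spirit of the SigniTrend paper linked below: today's frequency is compared against the accumulated EWMA, scaled by the accumulated standard deviation. `beta` is an assumed bias term that keeps terms with a near-empty history from producing unbounded scores; the threshold value is also an assumption.

```python
import math

def find_trending(freq_today, state, beta=1e-4, threshold=2.0):
    """Significance of today's frequency against the accumulated state.
    state maps term -> (ewma, ewmvar); returns {term: significance}."""
    result = {}
    for term, x in freq_today.items():
        mean, var = state.get(term, (0.0, 0.0))
        # Bias term beta guards both a near-zero mean and a zero variance.
        z = (x - max(mean, beta)) / (math.sqrt(var) + beta)
        if z > threshold:
            result[term] = z
    return result
```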
Trending terms clustering
Input: list of trending terms, corpus as a set of token streams
Output: trending terms grouped into clusters with a high level of co-occurrence
In-between:
- Term-term matrix of normalized pointwise mutual information (NPMI)
- DBSCAN clustering (ELKI implementation) with cosine distance
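The NPMI matrix can be estimated from document-level co-occurrence counts; a sketch (the subsequent DBSCAN step over this matrix is left to ELKI):

```python
import math
from collections import Counter

def npmi_matrix(trending_terms, token_streams):
    """Pairwise normalized PMI between trending terms, estimated from
    document-level co-occurrence; values lie in [-1, 1]."""
    n = len(token_streams)
    doc_sets = [set(ts) & set(trending_terms) for ts in token_streams]
    df = Counter(t for s in doc_sets for t in s)
    pair_df = Counter(frozenset((a, b)) for s in doc_sets
                      for a in s for b in s if a < b)
    npmi = {}
    for pair, c in pair_df.items():
        a, b = sorted(pair)
        p_ab = c / n
        if 0 < p_ab < 1:
            pmi = math.log(p_ab / ((df[a] / n) * (df[b] / n)))
            npmi[(a, b)] = pmi / -math.log(p_ab)
        else:
            npmi[(a, b)] = 1.0  # the pair co-occurs in every document
    return npmi
```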
Relevant documents extraction
Input: identified trending term clusters, corpus as a set of token streams
Output: a set of relevant documents and a "spamminess" level for each cluster
In-between:
- For each document, find the most relevant cluster by counting terms
- For each cluster, select the top-liked documents
- Count unique users/groups/IPs relative to the overall count
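A sketch of the first and last steps; the per-document "user" field used for spamminess is a hypothetical schema (the slide also counts groups and IPs), and the like-based ranking is omitted.

```python
def most_relevant_cluster(tokens, clusters):
    """Assign a document to the trending-term cluster whose terms it
    mentions most often; returns None when no trending term occurs."""
    counts = {cid: sum(1 for t in tokens if t in terms)
              for cid, terms in clusters.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None

def spamminess(docs_in_cluster):
    """Share of repeated authors among a cluster's documents: the fewer
    unique users relative to the document count, the spammier the cluster."""
    if not docs_in_cluster:
        return 0.0
    users = {d["user"] for d in docs_in_cluster}
    return 1.0 - len(users) / len(docs_in_cluster)
```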
Results visualization
Input: trending term clusters with relevant documents
Output: a nice interactive visualization
In-between:
- Add navigation for dates and clusters
- Extract the geo location of each document
- Plot on an interactive map
- Display details on hover
Visualization
Need for speed!
Trends are valuable only while they are trending; daily batch processing is inherently lagging.
Alternatives:
- Mini-batch
- Streaming!
Lambda architecture
Streaming trending terms detection
Not yet there!
- Visualizing just the trending terms is not informative
- Clustering is required
- Relevant documents extraction is required
- A mini-batch model is more appropriate here
Mini-batch trend clustering
Technologies used
- Apache Kafka for data collection
- Apache YARN for resource negotiation
- Apache Spark for batch and mini-batch processing
- Apache Samza for stream processing
- Apache Lucene for text preprocessing
- Optimaize language-detector for language detection
- ELKI for clustering
More links
- language-detector: https://github.com/optimaize/language-detector
- Extra profiles: https://github.com/denniean/language_profiles
- Trends math (SigniTrend): http://www.dbs.ifi.lmu.de/Publikationen/Papers/KDD14-SigniTrend-preprint.pdf
- NPMI: https://en.wikipedia.org/wiki/Pointwise_mutual_information
- DBSCAN: https://en.wikipedia.org/wiki/DBSCAN
Thank you for your attention!
?