Dmitry Bugaychenko, Eugeny Malytin
Trend detection at OK
Trend detection at a glance
Texts extraction
Input: raw user activity logs in JSON
Output: extracted text and metadata
In-between:
- Unified data collection pipeline: Kafka + Hadoop + Samza
- Different types of objects: posts, photos, videos, comments
- Large volumes: 50 GB of raw data daily, 20 GB after extraction
- Initial filtering applied: documents that are too small are removed
Language detection
Input: a single extracted text
Output: text labeled with its language
In-between:
- Based on the open-source library https://github.com/optimaize/language-detector
- The math is built on top of trigram distributions; 70+ languages supported
- Custom language profiles added for Azerbaijani, Armenian, Georgian, Kazakh, Kyrgyz, Tajik, Turkmen, and Uzbek: https://github.com/denniean/language_profiles
- Language distribution priors are important!
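The idea behind trigram-based detection can be shown with a toy sketch. This is an illustration, not the optimaize implementation: the smoothing constant `eps` and the structure of `profiles`/`priors` are assumptions.

```python
import math
from collections import Counter

def trigrams(text):
    """Character trigrams of a lowercased text."""
    t = text.lower()
    return [t[i:i + 3] for i in range(len(t) - 2)]

def build_profile(corpus):
    """Relative trigram frequencies for one language's training corpus."""
    counts = Counter(tg for text in corpus for tg in trigrams(text))
    total = sum(counts.values())
    return {tg: c / total for tg, c in counts.items()}

def detect(text, profiles, priors, eps=1e-6):
    """Label text with the language maximizing prior-weighted trigram likelihood."""
    best_lang, best_score = None, float("-inf")
    for lang, profile in profiles.items():
        score = math.log(priors.get(lang, eps))  # the prior matters!
        for tg in trigrams(text):
            score += math.log(profile.get(tg, eps))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang
```

The `priors` argument reflects the slide's last point: on a network where most content is in one language, a flat prior misclassifies short texts in closely related languages.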
Tokenization and canonization
Input: text with a language label
Output: token stream
In-between:
- Apache Lucene Analyzers (tokenization, stop-word removal, stemming)
- Profiles for 23 languages available, including Russian, Armenian, and Latvian
- Most ex-USSR languages are still missing: Azerbaijani, Belarusian, Georgian, Kazakh, Kyrgyz, Tajik, Turkmen, Ukrainian, Uzbek, etc.
Dictionary extraction
Input: corpus as a set of token streams
Output: word index (dictionary)
In-between:
- Term frequency limits for inclusion
- The previous day's dictionary is analyzed to keep the indices of common tokens the same
- Large enough to capture multiple languages (1M+ entries)
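A sketch of the index-stability trick (`min_tf` and `max_size` are illustrative parameters, not the production limits): tokens that survive into the new day keep the identifier they had in the previous day's dictionary, so downstream vectors stay comparable across days.

```python
from collections import Counter

def build_dictionary(token_streams, previous=None, min_tf=2, max_size=1_000_000):
    """Word index for the day; tokens present in the previous day's
    dictionary keep their old ids, new tokens get fresh ones."""
    previous = previous or {}
    tf = Counter(tok for stream in token_streams for tok in stream)
    kept = {t for t, c in tf.items() if c >= min_tf}
    # Reuse yesterday's ids for tokens that are still frequent enough.
    dictionary = {t: i for t, i in previous.items() if t in kept}
    # Never reuse a retired id: yesterday's vectors may still reference it.
    next_id = max(dictionary.values(), default=-1) + 1
    for tok in sorted(kept):
        if tok not in dictionary and len(dictionary) < max_size:
            dictionary[tok] = next_id
            next_id += 1
    return dictionary
```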
Vectorization
Input: token stream and dictionary
Output: sparse vector
In-between:
- Raw term frequency vectorization
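Given such a dictionary, raw term-frequency vectorization is a one-liner; a sketch with sparse vectors represented as plain {index: count} dicts:

```python
from collections import Counter

def vectorize(tokens, dictionary):
    """Raw term-frequency sparse vector as {dictionary index: count};
    out-of-dictionary tokens are simply dropped."""
    return dict(Counter(dictionary[t] for t in tokens if t in dictionary))
```

For example, `vectorize(["a", "b", "a", "x"], {"a": 0, "b": 1})` returns `{0: 2, 1: 1}`.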
Deduplication
Input: corpus as a set of vectors
Output: corpus with duplicates removed
In-between:
- Cosine as the similarity measure (> 0.9 => duplicates)
- Random projection hashing to speed up the computation
- 18-bit hash, 50% basis sparsity
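A minimal sketch of the scheme, with sparse vectors as {index: value} dicts. The 18-bit hash and 50% basis sparsity come from the slide; restricting the exact cosine check to vectors that share a hash bucket is the simplification that makes this fast (a production LSH setup would typically use several hash tables to cut the miss rate).

```python
import math
import random

def sparse_cosine(u, v):
    """Cosine similarity of two sparse {index: value} vectors."""
    dot = sum(x * v.get(i, 0.0) for i, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def make_basis(dim, bits=18, sparsity=0.5, seed=0):
    """'bits' sparse random hyperplanes with +-1 entries on ~sparsity of coords."""
    rng = random.Random(seed)
    return [{i: rng.choice((-1.0, 1.0))
             for i in range(dim) if rng.random() < sparsity}
            for _ in range(bits)]

def rp_hash(vec, basis):
    """Sign of each random projection, packed into one integer hash."""
    h = 0
    for plane in basis:
        proj = sum(x * plane.get(i, 0.0) for i, x in vec.items())
        h = (h << 1) | int(proj > 0)
    return h

def deduplicate(vectors, basis, threshold=0.9):
    """Keep one representative per group of near-duplicates; only vectors
    sharing a hash bucket are compared with the exact cosine."""
    buckets, kept = {}, []
    for vec in vectors:
        h = rp_hash(vec, basis)
        if not any(sparse_cosine(vec, other) > threshold
                   for other in buckets.get(h, [])):
            kept.append(vec)
        buckets.setdefault(h, []).append(vec)
    return kept
```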
Current day statistics
Input: filtered corpus as a set of token streams
Output: % of documents in which each term or 2-gram was used
In-between:
- 2-gram addition
- Aggregation
- Absolute filtration
- Different limits for terms and 2-grams
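A sketch of this statistics step; the absolute filtration limits `min_docs_term` and `min_docs_bigram` are placeholder values standing in for the separate term and 2-gram limits mentioned above.

```python
from collections import Counter

def daily_stats(token_streams, min_docs_term=2, min_docs_bigram=2):
    """Fraction of documents containing each term and each 2-gram,
    with separate absolute filtration limits for the two kinds."""
    n = len(token_streams)
    term_df, bigram_df = Counter(), Counter()
    for tokens in token_streams:
        term_df.update(set(tokens))                      # count each doc once
        bigram_df.update(set(zip(tokens, tokens[1:])))   # adjacent 2-grams
    terms = {t: c / n for t, c in term_df.items() if c >= min_docs_term}
    bigrams = {b: c / n for b, c in bigram_df.items() if c >= min_docs_bigram}
    return terms, bigrams
```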
Accumulated state aggregation
Input: current day statistics, previous day's accumulated state
Output: exponentially weighted moving average and variance for terms and 2-grams (the new accumulated state)
In-between:
- Inclusion limit > exclusion limit
- Different limits for terms and 2-grams
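One standard way to maintain such a state incrementally (`alpha` is an assumed smoothing factor; the inclusion/exclusion hysteresis from the slide is omitted for brevity):

```python
def update_state(freq_today, state, alpha=0.3):
    """One step of exponentially weighted moving average/variance per term.
    state maps term -> (ewma, ewmvar); unseen terms start from today's value."""
    new_state = {}
    for term, x in freq_today.items():
        mean, var = state.get(term, (x, 0.0))
        delta = x - mean
        mean = mean + alpha * delta
        var = (1 - alpha) * (var + alpha * delta * delta)
        new_state[term] = (mean, var)
    return new_state
```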
Trending terms identification
Input: exponentially weighted moving average and variance for terms and 2-grams
Output: trending terms and 2-grams with significance
In-between:
- A term is trending when today's frequency exceeds the accumulated average by significantly more than the accumulated standard deviation (see the SigniTrend paper under "More links")
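A sketch of such a significance test in the spirit of the SigniTrend paper linked below: today's frequency is compared against the accumulated EWMA, scaled by the accumulated standard deviation. `beta` is an assumed bias term that keeps terms with a near-empty history from producing unbounded scores; the threshold value is also an assumption.

```python
import math

def find_trending(freq_today, state, beta=1e-4, threshold=2.0):
    """Significance of today's frequency against the accumulated state.
    state maps term -> (ewma, ewmvar); returns {term: significance}."""
    result = {}
    for term, x in freq_today.items():
        mean, var = state.get(term, (0.0, 0.0))
        # Bias term beta guards both a near-zero mean and a zero variance.
        z = (x - max(mean, beta)) / (math.sqrt(var) + beta)
        if z > threshold:
            result[term] = z
    return result
```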
Trending terms clustering
Input: list of trending terms, corpus as a set of token streams
Output: trending terms grouped into clusters with a high level of co-occurrence
In-between:
- Term-term matrix of normalized pointwise mutual information (NPMI)
- DBSCAN clustering (ELKI implementation) with cosine distance
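The NPMI matrix can be estimated from document-level co-occurrence counts; a sketch (the subsequent DBSCAN step over this matrix is left to ELKI):

```python
import math
from collections import Counter

def npmi_matrix(trending_terms, token_streams):
    """Pairwise normalized PMI between trending terms, estimated from
    document-level co-occurrence; values lie in [-1, 1]."""
    n = len(token_streams)
    doc_sets = [set(ts) & set(trending_terms) for ts in token_streams]
    df = Counter(t for s in doc_sets for t in s)
    pair_df = Counter(frozenset((a, b)) for s in doc_sets
                      for a in s for b in s if a < b)
    npmi = {}
    for pair, c in pair_df.items():
        a, b = sorted(pair)
        p_ab = c / n
        if 0 < p_ab < 1:
            pmi = math.log(p_ab / ((df[a] / n) * (df[b] / n)))
            npmi[(a, b)] = pmi / -math.log(p_ab)
        else:
            npmi[(a, b)] = 1.0  # the pair co-occurs in every document
    return npmi
```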
Relevant documents extraction
Input: identified trending term clusters, corpus as a set of token streams
Output: a set of relevant documents and a "spamminess" level for each cluster
In-between:
- For each document, find the most relevant cluster by counting terms
- For each cluster, select the top-liked documents
- Count unique users/groups/IPs relative to the overall count
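A sketch of the first and last steps; the per-document "user" field used for spamminess is a hypothetical schema (the slide also counts groups and IPs), and the like-based ranking is omitted.

```python
def most_relevant_cluster(tokens, clusters):
    """Assign a document to the trending-term cluster whose terms it
    mentions most often; returns None when no trending term occurs."""
    counts = {cid: sum(1 for t in tokens if t in terms)
              for cid, terms in clusters.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None

def spamminess(docs_in_cluster):
    """Share of repeated authors among a cluster's documents: the fewer
    unique users relative to the document count, the spammier the cluster."""
    if not docs_in_cluster:
        return 0.0
    users = {d["user"] for d in docs_in_cluster}
    return 1.0 - len(users) / len(docs_in_cluster)
```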
Results visualization
Input: trending term clusters with relevant documents
Output: a nice interactive visualization
In-between:
- Add navigation for dates and clusters
- Extract the geo location of each document
- Plot on an interactive map
- Display details on hover
Visualization
Need for speed!
Trends are valuable only while they are trending; daily batch processing is inherently lagging.
Alternatives:
- Mini-batch
- Streaming!
Lambda architecture
Streaming trending terms detection
Not yet there!
- Visualizing just the trending terms is not informative
- Clustering is required
- Relevant documents extraction is required
- A mini-batch model is more appropriate here
Mini-batch trend clustering
Technologies used
- Apache Kafka for data collection
- Apache YARN for resource negotiation
- Apache Spark for batch and mini-batch processing
- Apache Samza for stream processing
- Apache Lucene for text preprocessing
- Optimaize language-detector for language detection
- ELKI for clustering
More links
- language-detector: https://github.com/optimaize/language-detector
- Extra profiles: https://github.com/denniean/language_profiles
- Trends math (SigniTrend): http://www.dbs.ifi.lmu.de/Publikationen/Papers/KDD14-SigniTrend-preprint.pdf
- NPMI: https://en.wikipedia.org/wiki/Pointwise_mutual_information
- DBSCAN: https://en.wikipedia.org/wiki/DBSCAN
Thank you for your attention!
?