NLP in Practice… and at Scale
Ivan Bilan, Data Engineer
Ivan Bilan (LinkedIn)
● B. Sc. and currently M. Sc. at CIS
● Honors Degree in Technology Management at CDTM
● Data Engineer at TrustYou
● Previously Data Engineer at MobileX AG and IT at LMU
● External Consultant at Adidas, Cloudeo AG and Paterva/Maltego
Research interests:
● Author Profiling and Identification
● Automated Summarization
● Relation Extraction
What does TrustYou do?
For every hotel on the planet, provide a summary of traveler reviews.
✓ Excellent hotel!*
✓ Nice building: “Clean, hip & modern, excellent facilities”
✓ Great view: « Vue superbe » (French: “Superb view”)
✓ Great for partying: “Nice weekend getaway or for partying”
✗ Solo travelers complain about TVs
nhow Berlin (Full summary)
TrustYou Architecture
[Diagram components: Hadoop Cluster, Local Grammars, Text Generation, Machine Learning, Aggregation, Crawling, API]
3M new reviews per week!
Crawling
Scrapy
● Build your own web crawlers
  ○ Extract data via CSS selectors, XPath, regexes …
  ○ Handles queuing, request parallelism, cookies, throttling …
● Scrapy Example Code
Semantic Analysis
● “Nice room”
● “Room wasn’t so great”
● “The air-conditioning was so powerful that we were cold in the room even when it was off.”
● “อาหารรสชาติดี” (Thai: “The food tastes good”)
● “خدمة جیدة” (Arabic: “Good service”)
Semantic Analysis at TrustYou
● At core: rule-based linguistic system (context-free grammars, CFGs)
● 20 languages
● Classify opinions in 140 categories
● ML mostly in summarization
● Hadoop: scale out CPU
  ○ ~1B opinions in DB
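To give a feel for what a rule-based system does with the example sentences, here is a toy recursive-descent parser for a two-rule context-free grammar. The grammar, the lexicon and the polarity scores are all invented for this sketch; TrustYou's actual grammars are not shown in these slides and are certainly far larger.

```python
import re

# Toy lexicon (hypothetical): surface word -> category or polarity.
ASPECT = {"room": "room", "rooms": "room", "view": "view", "staff": "service"}
ADJ = {"nice": 1, "great": 1, "clean": 1, "dirty": -1, "noisy": -1}
NEG = {"not"}
INTENS = {"so", "very", "really"}
COPULA = {"is", "was", "are", "were"}


def tokenize(text):
    """Lowercase, normalise apostrophes, split "wasn't" into "was not"."""
    text = text.lower().replace("’", "'").replace("‘", "'")
    text = re.sub(r"n't\b", " not", text)
    return re.findall(r"[a-z]+", text)


def parse_adjp(tokens, i):
    """ADJP -> NEG ADJP | INTENS ADJP | ADJ. Returns (polarity, next_index) or None."""
    if i >= len(tokens):
        return None
    tok = tokens[i]
    if tok in NEG:  # negation flips the polarity of the rest
        rest = parse_adjp(tokens, i + 1)
        return None if rest is None else (-rest[0], rest[1])
    if tok in INTENS:  # intensifiers are skipped in this toy version
        return parse_adjp(tokens, i + 1)
    if tok in ADJ:
        return ADJ[tok], i + 1
    return None


def parse_opinion(text):
    """OPINION -> ADJP ASPECT | ASPECT COPULA ADJP. Returns (category, polarity) or None."""
    tokens = tokenize(text)
    # Rule 1: "nice room"
    adjp = parse_adjp(tokens, 0)
    if adjp and adjp[1] == len(tokens) - 1 and tokens[-1] in ASPECT:
        return ASPECT[tokens[-1]], adjp[0]
    # Rule 2: "room was not so great"
    if len(tokens) >= 3 and tokens[0] in ASPECT and tokens[1] in COPULA:
        adjp = parse_adjp(tokens, 2)
        if adjp and adjp[1] == len(tokens):
            return ASPECT[tokens[0]], adjp[0]
    return None


print(parse_opinion("Nice room"))             # ('room', 1)
print(parse_opinion("Room wasn't so great"))  # ('room', -1)
```

The air-conditioning sentence from the examples falls outside both rules and yields `None`, which hints at why real grammars need many more rules per language.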
ML @ TrustYou
● gensim doc2vec model to create hotel embeddings
● TF-IDF vectors
● Combined with geographical data
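A rough sketch of the "TF-IDF combined with geographical data" idea, in plain Python so the arithmetic is visible. It does not reproduce the doc2vec part. The IDF formula used, log(N/df) + 1, is one common variant; the hotel names, tokens and the two geo features (distance to coast and to the nearest ski lift, in km) are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: {hotel: [tokens]} -> (vocab, {hotel: [tf-idf weights in vocab order]})."""
    vocab = sorted({tok for toks in docs.values() for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many hotels' reviews does each token occur?
    df = Counter(tok for toks in docs.values() for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) + 1.0 for tok in vocab}
    vectors = {}
    for hotel, toks in docs.items():
        tf = Counter(toks)
        n = max(len(toks), 1)  # avoid division by zero for empty documents
        vectors[hotel] = [tf[tok] / n * idf[tok] for tok in vocab]
    return vocab, vectors


def hotel_features(tfidf_vec, geo_vec):
    """Concatenate text features with geographical features."""
    return list(tfidf_vec) + list(geo_vec)


# Invented example data.
docs = {
    "hotel_a": ["beach", "sea", "pool"],
    "hotel_b": ["ski", "snow", "pool"],
}
geo = {"hotel_a": [0.3, 250.0], "hotel_b": [400.0, 0.1]}  # km to coast, km to ski lift

vocab, vectors = tfidf_vectors(docs)
features = {h: hotel_features(vectors[h], geo[h]) for h in docs}
print(vocab)
print(features["hotel_a"])
```

"pool" occurs in both hotels and therefore gets a lower weight than "beach" or "ski", which is the point of the IDF term.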
Creating Training Sets
● For each hotel type, we need to create reliable training sets
● Based on review content, amenities and (when possible) geographical information
● OpenStreetMap project: contains geo information of interest for several categories
Examples:
- Coordinates of coastlines
- Highways
- Ski lifts
- Tourist attractions
...
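One way geographical data could feed a training set: compute the distance from a hotel to the nearest coastline point and use it as a labelling heuristic. This is a sketch under assumptions; the 1 km threshold, the function names and the coordinates are hypothetical, and real coastline data would come from OpenStreetMap rather than a hand-written list.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def km_to_nearest(hotel, points):
    """Distance from a hotel (lat, lon) to the closest of the given points."""
    return min(haversine_km(*hotel, *p) for p in points)


def looks_like_beach_hotel(hotel, coastline_points, max_km=1.0):
    """Heuristic label: within max_km of the coast (threshold is an assumption)."""
    return km_to_nearest(hotel, coastline_points) <= max_km


# Invented coordinates for illustration.
coastline = [(0.0, 0.005), (0.0, 0.010)]
print(looks_like_beach_hotel((0.0, 0.0), coastline))  # about 0.56 km away
print(looks_like_beach_hotel((1.0, 0.0), coastline))  # roughly 111 km away
```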
Word2Vec
● Map words to vectors
● “Step up” from bag-of-words model
● ‘Cats’ and ‘dogs’ should get similar vectors, because they occur in similar contexts
>>> m["python"]
array([-0.1351, -0.1040, -0.0823, -0.0287, 0.3709,
       -0.0200, -0.0325, 0.0166, 0.3312, -0.0928,
       -0.0967, -0.0199, -0.2498, -0.4445, -0.0445,
       # ...
       -1.0090, -0.2553, 0.2686, -0.4121, 0.3116,
       -0.0639, -0.3688, -0.0273, -0.1266, -0.2606,
       -0.1549, 0.0023, 0.0084, 0.2169, 0.0060],
      dtype=float32)
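"Similar" for word vectors usually means a high cosine similarity. The sketch below computes it in plain Python on made-up 3-dimensional vectors; real word2vec vectors, like the one printed above, have far more dimensions, and the numbers here are not taken from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors, for illustration only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "hotel": [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))    # close to 1
print(cosine_similarity(toy_vectors["cat"], toy_vectors["hotel"]))  # much lower
```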
Classification Pipeline
- Have a separate TF-IDF vectorizer (or word2vec model) for each language
- Combine the TF-IDF vectors before the classification step
- Give different weights to each language to control their contribution to the final classification results
- Select only the top 80% of word features using the chi-squared test
- Classify using Gradient Boosting, an ensemble of decision trees (https://github.com/dmlc/xgboost, a frequent choice in winning Kaggle solutions)
- Gradient Boosting for the curious (lots of math): https://en.wikipedia.org/wiki/Gradient_boosting
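The pipeline steps above can be sketched with scikit-learn. Two things differ from the slide and are assumptions of this example: scikit-learn's `GradientBoostingClassifier` stands in for xgboost, and the training texts, labels and language weights are invented toy values.

```python
from scipy.sparse import hstack
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2

LANGS = ["en", "de"]

# Toy training data: one review text per hotel and language (invented).
train = {
    "en": ["beach sand sea view", "ski slope snow lift", "sea beach sunset", "snow ski mountain"],
    "de": ["strand meer sonne", "ski piste schnee", "meer strand sand", "schnee berg ski"],
}
labels = [1, 0, 1, 0]  # 1 = beach hotel, 0 = ski hotel

# Hypothetical per-language weights controlling each language's contribution.
weights = {"en": 1.0, "de": 0.5}

# One TF-IDF vectorizer per language.
vectorizers = {lang: TfidfVectorizer().fit(train[lang]) for lang in LANGS}


def featurize(texts_by_lang):
    """Weight each language's TF-IDF matrix and stack them side by side."""
    blocks = [weights[lang] * vectorizers[lang].transform(texts_by_lang[lang]) for lang in LANGS]
    return hstack(blocks).tocsr()


X = featurize(train)

# Keep the top 80% of features ranked by the chi-squared statistic.
selector = SelectPercentile(chi2, percentile=80)
X_sel = selector.fit_transform(X, labels)

# Gradient boosting over decision trees (stand-in for xgboost).
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_sel, labels)


def predict_hotel_type(texts_by_lang):
    """Classify new hotels given their review text in every language."""
    return clf.predict(selector.transform(featurize(texts_by_lang)))


print(predict_hotel_type({"en": ["sea beach"], "de": ["strand meer"]}))
```

Because the weights are applied before stacking, lowering a language's weight shrinks its feature values, which in turn lowers its chi-squared scores during feature selection.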
Workflow Management & Scaling Up
Spark
● Framework for distributed data processing
  a. Load data into a special collection called an “RDD”
  b. Apply operations to it, e.g. .map, .reduceByKey …
  c. Spark distributes work in a cluster automatically
● Written in Scala, but has a great Python API
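A small PySpark sketch of the three steps: load into an RDD, apply `.map` and `.reduceByKey`, and let Spark distribute the work. The record layout (a dict with a `"category"` key) and the function names are assumptions made for this example. The Spark import sits inside the function so that the per-record logic stays testable without a Spark installation.

```python
from operator import add


def to_category_count(opinion):
    """Map one opinion record to a (category, 1) pair."""
    return (opinion["category"], 1)


def count_by_category(opinions, master="local[*]"):
    """Count opinions per category with Spark; returns {category: count}."""
    # Imported lazily so to_category_count can be tested without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master(master).appName("opinion-counts").getOrCreate()
    try:
        # a. Load data into an RDD.
        rdd = spark.sparkContext.parallelize(opinions)
        # b. Transformations are lazy: nothing runs yet.
        counts = rdd.map(to_category_count).reduceByKey(add)
        # c. collect() triggers execution, distributed across the cluster.
        return dict(counts.collect())
    finally:
        spark.stop()
```

Calling `count_by_category([{"category": "room"}, {"category": "view"}, {"category": "room"}])` on a machine with PySpark installed runs the job locally; pointing `master` at a cluster URL distributes the same code unchanged.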
Luigi
● Build complex pipelines of batch jobs
● Example:
  a. First, crawl new hotel reviews of the day
  b. Then, analyze text
  c. Store results in DB
● Helps you with parallelism and error recovery
Workflows at TrustYou
https://www.trustyou.com/job-department/engineering
If you have any questions: [email protected]