1. Big Data at Move
   The past, present and future of Big Data at Move - Realtor.com
2. About me
   - Data Warehouse Architect, Move Inc (realtor.com)
   - Pluralsight Author
   - Passion for data and technology
   - Ahmad Alkilani
   - linkedin.com/in/ahmadalkilani
   - EASkills.com
3. Topics
   - History of Move's enterprise data warehouse
   - Why Hadoop found a home at Move
   - High level architecture
   - Where we are now
   - Where we're heading in the future
   - Q & A
4. Move Inc.
   - Leader in online real estate and operator of realtor.com
   - Over 410 million minutes per month on Move websites
   - Over 300 million user engagement events per day on realtor.com and mobile apps
   - Connecting consumers and customers requires lots of data
5. growth
   [Chart: Raw Events, Move Inc. (Realtor.com and Mobile); axis ticks from 1,000,000,000 to 7,000,000,000]
6. proactive...
   - Transitioned from legacy warehouse and ETLs
   - Near real-time collection
7. Bigger servers
   - 8 processors, 10 cores each
   - 2 TB of RAM!
   - Solid state drives
   - Fusion IO cards
   - 10 terabytes each server
   - Worked great! Until we realized we could only store 50 days' worth of data!
   - reactive
   [Chart: Raw Events in billions, Move Inc. (Realtor.com and Mobile); axis ticks from 1 to 7]
8. proactive
   - Cost: started with 13 nodes at a fraction of the cost of our SSD monster servers
   - Ease of scalability: plan to continue to scale out
   - Good starting point: current capacity is ~125 TB
9. Big picture
10. In more detail
   - Hive over HCatalog
   - Transferred to HDFS and then the Hive Warehouse
   - HDFS
   - External Tables against data in HDFS
   - Data moves to Hive Warehouse
   - Dynamic Partition Inserts
   - Partition Pruning
   - Snappy Compression
   - Dynamic Tables with Maps and Arrays
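
The load path on this slide (an external table over raw files in HDFS, then a dynamic partition insert into a Snappy-compressed warehouse table, then a query that prunes partitions) can be sketched in HiveQL. This is a minimal sketch, not Move's actual DDL: the table names `raw_events` and `events`, the columns, the delimiters, the `SEQUENCEFILE` storage format, and the path `/data/raw/events` are all hypothetical. The statements are wrapped in Python only so they can be rendered and checked as text.

```python
# Hypothetical sketch: table names, columns, delimiters, storage format and
# the HDFS path are illustrative assumptions, not taken from the talk.

EXTERNAL_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS {raw} (
  event_time STRING,
  event_type STRING,
  attributes MAP<STRING, STRING>,
  tags       ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':'
LOCATION '{location}';
"""

WAREHOUSE_TABLE = """
CREATE TABLE IF NOT EXISTS {warehouse} (
  event_time STRING,
  event_type STRING,
  attributes MAP<STRING, STRING>,
  tags       ARRAY<STRING>
)
PARTITIONED BY (event_date STRING)
STORED AS SEQUENCEFILE;
"""

# The dynamic partition column (event_date) must come last in the SELECT list.
LOAD_PARTITIONS = """
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

INSERT OVERWRITE TABLE {warehouse} PARTITION (event_date)
SELECT event_time, event_type, attributes, tags,
       to_date(event_time) AS event_date
FROM {raw};
"""

# Filtering on the partition column lets Hive read only that day's directory.
PRUNED_QUERY = """
SELECT event_type, COUNT(*) AS events
FROM {warehouse}
WHERE event_date = '{day}'
GROUP BY event_type;
"""


def render_load_script(raw="raw_events", warehouse="events",
                       location="/data/raw/events", day="2013-01-01"):
    """Return the HiveQL steps above as one script string."""
    return "".join([
        EXTERNAL_TABLE.format(raw=raw, location=location),
        WAREHOUSE_TABLE.format(warehouse=warehouse),
        LOAD_PARTITIONS.format(raw=raw, warehouse=warehouse),
        PRUNED_QUERY.format(warehouse=warehouse, day=day),
    ])


if __name__ == "__main__":
    print(render_load_script())
```

The rendered script could be saved to a file and run with `hive -f`; the `attributes` and `tags` columns stand in for the map and array columns the slide mentions.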
11. ETL & Querying Hive
   - Hive Warehouse
   - Aggregates
   - SQL Server (EDW)
   - Multi-Inserts
   - Single Pass
   - Details
   - Stats
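
The "Multi-Inserts, Single Pass" idea is that one scan of the source table can feed several aggregate tables at once. Below is a rough sketch of how such a statement is shaped in HiveQL, built by a small Python helper. The helper `multi_insert` and the table names `events`, `agg_events_by_type` and `agg_events_by_day` are hypothetical and are not taken from the talk.

```python
def multi_insert(source, inserts):
    """Build one Hive multi-insert statement.

    `inserts` is a list of (target_table, select_body) pairs. Every pair
    becomes an INSERT clause fed by the same single scan of `source`.
    """
    if not inserts:
        raise ValueError("need at least one insert clause")
    clauses = [
        "INSERT OVERWRITE TABLE {0}\n  {1}".format(target, select_body)
        for target, select_body in inserts
    ]
    return "FROM {0}\n{1};".format(source, "\n".join(clauses))


# Hypothetical aggregates: both are computed from one pass over `events`.
statement = multi_insert("events", [
    ("agg_events_by_type",
     "SELECT event_type, COUNT(*) GROUP BY event_type"),
    ("agg_events_by_day",
     "SELECT event_date, COUNT(*) GROUP BY event_date"),
])

if __name__ == "__main__":
    print(statement)
```

Because the `FROM` clause appears once and leads the statement, Hive reads the source a single time instead of once per aggregate.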
12. ETL & Querying Hive
   - Separate files for different keys of a Map
   - Resort to MapReduce instead of Hive and use MultipleOutputs class
   - Dynamic Partition Inserts again & hadoop fs -getmerge
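
The goal on this slide, one output file per key of a map column, can be illustrated without Hadoop. The sketch below is a plain-Python stand-in for the idea and does not use the Hadoop `MultipleOutputs` Java API; the function `split_by_map_key`, the sample records, and the `.tsv` naming are hypothetical, and keys are assumed to be safe to use as file names.

```python
import os
import tempfile
from collections import defaultdict


def split_by_map_key(records, out_dir):
    """Write each map key's values to its own file under out_dir.

    records: iterable of (event_id, attributes_dict) pairs.
    Returns {key: path}. The input is read once and routed to one output
    per key, which is the role MultipleOutputs plays in a MapReduce job.
    """
    lines = defaultdict(list)
    for event_id, attributes in records:
        for key, value in attributes.items():
            lines[key].append("{0}\t{1}\n".format(event_id, value))

    paths = {}
    for key, key_lines in lines.items():
        # Assumes keys contain no path separators or other unsafe characters.
        path = os.path.join(out_dir, "{0}.tsv".format(key))
        with open(path, "w") as handle:
            handle.writelines(key_lines)
        paths[key] = path
    return paths


# Hypothetical sample: two events with different attribute keys.
sample = [
    ("e1", {"city": "Austin", "beds": "3"}),
    ("e2", {"city": "Denver"}),
]
out_dir = tempfile.mkdtemp()
paths = split_by_map_key(sample, out_dir)
```

The alternative named on the slide stays inside Hive: a dynamic partition insert keyed on the map key, followed by `hadoop fs -getmerge` to combine each partition's files into one local file.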
13. Some lessons learned
   - Our ETLs are still expensive
   - Putting our data loads and cluster at the mercy of our analysts. Not a very good idea
   - Use Queues to guarantee room for ETLs to do their job
     - Default queue is for users
     - Specialized queue is for ETL
   - Keep an eye on the slots available
   - Use .hiverc file to automatically control behavior
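
One way the queue split and the `.hiverc` file could fit together is for each account's `.hiverc` to pin its Hive sessions to a scheduler queue. The sketch below renders that one-line file. It assumes the MapReduce v1 property `mapred.job.queue.name` (later Hadoop releases name it `mapreduce.job.queuename`), assumes the queues already exist in the scheduler configuration, and uses a hypothetical queue called `etl`; the helper `render_hiverc` is also hypothetical.

```python
def render_hiverc(queue_name):
    """Return .hiverc text that pins Hive sessions to one scheduler queue."""
    if not queue_name or any(ch.isspace() for ch in queue_name):
        raise ValueError("queue name must be non-empty with no whitespace")
    # Assumed MR1-era property name; see the note above for newer releases.
    return "SET mapred.job.queue.name={0};\n".format(queue_name)


# Analysts stay on the default queue; the ETL account gets its own queue.
analyst_hiverc = render_hiverc("default")
etl_hiverc = render_hiverc("etl")  # "etl" is a hypothetical queue name

if __name__ == "__main__":
    print(analyst_hiverc, end="")
    print(etl_hiverc, end="")
```

The Hive CLI runs the statements in `.hiverc` at startup, so the setting applies without anyone having to remember it in each session.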
14. Where we're headed
   - Re-evaluate tool selection: Talend/Pentaho
   - Real-time analytics: Kafka/Honu/Flume/Storm/StreamInsight
   - Hive Geospatial
   - Integrating different technologies is OK
15. D3.js with ASP.NET SignalR
   - Visualizing search activity and active listings in different states