  1. Big Data at Move: the past, present and future of Big Data at Move
  2. About me: Ahmad Alkilani. Data Warehouse Architect, Move Inc.; Pluralsight author; passion for data and technology.
  3. Topics: history of Move's enterprise data warehouse; why Hadoop found a home at Move; high-level architecture; where we are now; where we're heading in the future; Q & A.
  4. Move Inc.: leader in online real estate and operator of [...]; over 410 million minutes per month on Move websites; over 300 million user engagement events per day on [...] and mobile apps; connecting consumers and customers requires lots of data.
  5. Growth. [Chart: raw events, Move Inc. ([...] and mobile); axis from 0 to 7,000,000,000.]
  6. Proactive: transitioned from legacy warehouse and ETLs; near real-time collection.
  7. Reactive: bigger servers. 8 processors with 10 cores each; 2 TB of RAM; solid-state drives; Fusion-io cards; 10 terabytes per server. Worked great, until we realized we could only store 50 days' worth of data. [Chart: raw events in billions, Move Inc. ([...] and mobile); axis from 1 to 7.]
  8. Proactive: started with 13 nodes at a fraction of the cost of our SSD monster servers; plan to continue to scale out; ease of scalability; current capacity is ~125 TB; good starting point. [Chart: cost.]
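
The storage figures on slides 7 and 8 can be put side by side with some back-of-the-envelope arithmetic. This is only a rough sketch: it assumes a constant daily volume (the growth chart suggests volume was actually rising) and treats the 10 TB and ~125 TB figures as directly comparable, with no allowance for HDFS replication or compression, neither of which the slides specify. The function and variable names are illustrative.

```python
# Rough retention estimate from the capacities quoted on the slides.
# Assumptions (not from the source): constant daily volume, and no
# adjustment for replication or compression.

def daily_volume_tb(capacity_tb: float, days_retained: float) -> float:
    """Average volume per day implied by a capacity and a retention period."""
    return capacity_tb / days_retained


def retention_days(capacity_tb: float, daily_tb: float) -> float:
    """Days of data that fit in a capacity at a given daily volume."""
    return capacity_tb / daily_tb


ssd_capacity_tb = 10.0      # per-server SSD capacity (slide 7)
ssd_retention_days = 50.0   # retention reached on that server (slide 7)
hadoop_capacity_tb = 125.0  # starting cluster capacity (slide 8)

per_day = daily_volume_tb(ssd_capacity_tb, ssd_retention_days)
hadoop_days = retention_days(hadoop_capacity_tb, per_day)

print(f"Implied volume: {per_day:.2f} TB/day")
print(f"Retention at {hadoop_capacity_tb:.0f} TB: {hadoop_days:.0f} days")
```

Under these assumptions the implied volume is about 0.2 TB per day, and ~125 TB would hold roughly 625 days of data; real retention would be lower once replication and growth are taken into account.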
  9. Big picture
  10. In more detail: Hive over HCatalog. Data is transferred to HDFS and then to the Hive warehouse. HDFS: external tables against data in HDFS. Data moves to the Hive warehouse: dynamic partition inserts; partition pruning; Snappy compression; dynamic tables with maps and arrays.
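
To illustrate two of the ideas named on slide 10, the sketch below mimics them in plain Python rather than in Hive itself. A dynamic partition insert routes each row to a partition chosen from a column value, and partition pruning then reads only the partitions a query's filter can match. The `dt` column, the `dt=<value>` naming, and the sample rows are assumptions made for the example, not details from the talk.

```python
from collections import defaultdict
from datetime import date


def dynamic_partition_insert(rows):
    """Group rows by their 'dt' column, one bucket per partition name."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"dt={row['dt']}"].append(row)
    return dict(partitions)


def prune_partitions(partitions, start, end):
    """Return only the partition names whose date lies in [start, end]."""
    kept = []
    for name in sorted(partitions):
        dt = date.fromisoformat(name.split("=", 1)[1])
        if start <= dt <= end:
            kept.append(name)
    return kept


# Hypothetical sample events
rows = [
    {"dt": "2013-01-01", "event": "search"},
    {"dt": "2013-01-01", "event": "view"},
    {"dt": "2013-01-02", "event": "search"},
    {"dt": "2013-01-03", "event": "view"},
]

partitions = dynamic_partition_insert(rows)
wanted = prune_partitions(partitions, date(2013, 1, 2), date(2013, 1, 3))
print(wanted)  # only these partitions would need to be scanned
```

The point of the layout is that a date-filtered query never touches the `dt=2013-01-01` bucket at all, which is what keeps scans cheap as history accumulates.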
  11. ETL & Querying (Hive): Hive warehouse; aggregates; SQL Server (EDW); multi-inserts in a single pass; details; stats.
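
The "multi-inserts, single pass" idea on slide 11 is that the source data is scanned once while several outputs are filled from that one scan. The sketch below shows the same pattern in plain Python, not Hive; the event fields and the choice of a detail list plus per-type counts are assumptions made for illustration.

```python
from collections import Counter


def single_pass(events):
    """Fill a detail output and a stats output from one scan of the input."""
    details = []       # rows kept in full (here: search events only)
    stats = Counter()  # aggregate: number of events per type
    for event in events:
        if event["type"] == "search":
            details.append(event)
        stats[event["type"]] += 1
    return details, stats


def event_stream():
    """A generator can be consumed only once, so the scan really is single pass."""
    yield {"type": "search", "state": "CA"}
    yield {"type": "view", "state": "CA"}
    yield {"type": "search", "state": "TX"}


details, stats = single_pass(event_stream())
print(len(details), dict(stats))
```

Running two separate queries over the same raw events would read the data twice; feeding both outputs from one loop is the cheaper alternative when the input is large.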
  12. ETL & Querying (Hive): separate files for different keys of a map; resort to MapReduce instead of Hive and use the MultipleOutputs class; dynamic partition inserts again, plus hadoop fs -getmerge.
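
Slide 12 describes writing a separate file for each key of a map. The sketch below shows that routing pattern in plain Python on a local directory; it is not the Hadoop MultipleOutputs API, only an analogy for what such a job does. The record contents and file naming are assumptions, and keys are assumed to be safe to use as file names.

```python
import os
import tempfile


def write_by_key(records, out_dir):
    """Write each map value to a file named after its key; return the keys seen."""
    handles = {}
    try:
        for record in records:
            for key, value in record.items():
                if key not in handles:
                    path = os.path.join(out_dir, f"{key}.txt")
                    handles[key] = open(path, "w")
                handles[key].write(f"{value}\n")
    finally:
        # Close every file even if a write fails part-way through.
        for handle in handles.values():
            handle.close()
    return sorted(handles)


# Hypothetical records, each a map of attribute name to value
records = [
    {"city": "Austin", "beds": 3},
    {"city": "Denver"},
    {"beds": 2, "price": 250000},
]

with tempfile.TemporaryDirectory() as out_dir:
    keys = write_by_key(records, out_dir)
    with open(os.path.join(out_dir, "city.txt")) as f:
        city_lines = f.read().splitlines()

print(keys, city_lines)
```

One open handle per key works when the number of distinct keys is modest; with very many keys a job would need to cap or batch the open files.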
  13. Some lessons learned: our ETLs are still expensive. Putting our data loads and cluster at the mercy of our analysts is not a very good idea. Use queues to guarantee room for ETLs to do their job: the default queue is for users, a specialized queue is for ETL. Keep an eye on the slots available. Use a .hiverc file to automatically control behavior.
  14. Where we're headed: re-evaluate tool selection (Talend/Pentaho); real-time analytics (Kafka/Honu/Flume/Storm/StreamInsight); Hive geospatial; integrating different technologies is OK.
  15. D3.js with ASP.NET SignalR: visualizing search activity and active listings in different states.
  16. Questions?
