Big Data Analytics
Carlos Ordonez
Big Data Analytics research
• Input? BIG DATA (large data sets, large files, many documents, many tables, fast growing)
• How? Fast external algorithms; memory-efficient data structures at two storage levels; parallel: multi-threaded or multi-node
• Efficiency goal: linear time O(n) and linear speedup
• Hardware? single node or parallel cluster
• Infrastructure? parallel file system; any large files
• Challenging: Theory+programming in action
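As an illustration of the multi-threaded, linear-time goal above, here is a minimal C++ sketch, not taken from the slides. It sums a data set in one O(n) pass, splitting it into contiguous partitions with one thread each. The function name `parallel_sum` and the partitioning scheme are assumptions made for this example; linear speedup is the ideal case and is not guaranteed in practice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sum `data` in a single pass using `num_threads` threads.
// Each thread scans one contiguous partition and writes only to its own slot
// in `partial`, so no locking is needed.
double parallel_sum(const std::vector<double>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    const std::size_t n = data.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;  // ceiling division
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;

    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            double s = 0.0;
            for (std::size_t i = begin; i < end; ++i) s += data[i];
            partial[t] = s;  // one write per thread, after the scan
        });
    }
    for (auto& w : workers) w.join();

    // Combine the per-partition results: O(num_threads) extra work.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Total work stays O(n) regardless of the thread count; with p threads each partition holds about n/p values, which is where the ideal linear speedup would come from.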
Systems research today
• Transaction processing? Main memory, lock-free
• Efficient analysis? Optimal joins, compiled queries, streams, exploit ample RAM, exploit multi-core
• Compiler versus interpreter?
• Massive storage? Posix, HDFS
• Fast external algorithms? Simple tasks.
• Parallel computation? Multi-core with threads, Shared-nothing, message-passing
• Exploiting new hardware? Difficult/customized
• Analyzing: queries, cubes, statistics. Machine learning
• Hot today: Information integration (database+files)
DB Systems involves Core CS research: Theory+Programming
• Theory we use:
– Time complexity (big O()) and I/O cost (disk, solid state memory)
– Data structures (trees, hash tables, linked lists)
– Relational model and information retrieval models
– Multivariate statistics, machine learning, discrete mathematics, linear algebra
– Compilers and programming languages: parsing/compiling/optimizing code; recursion
• Programming:
– Languages: mostly C++, but also R, SQL, Java
– Unix, but we have a lot of past work on MS Windows
– Systems: Threads, binary I/O, parallel file systems, code generation, code optimization, interpreter runtime
Sample of target problems
Business Intelligence: cubes, lattices
Big Data summarization: vector outer products
Bayesian statistics: MCMC, classification, regression, variable/feature selection
Graph transitive closure and linear recursion
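To make "summarization by vector outer products" concrete, here is a minimal C++ sketch under stated assumptions. It assumes the summary of n points x_i in d dimensions consists of the count n, the linear sum L = Σ x_i, and the quadratic sum Q = Σ x_i x_iᵀ, each point contributing one outer product. The struct `Summary` and its member names are hypothetical and not taken from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical one-pass summary of d-dimensional points:
// n = number of points, L = sum of x_i, Q = sum of x_i * x_i^T (d x d, row-major).
struct Summary {
    std::size_t d;
    std::size_t n = 0;
    std::vector<double> L;
    std::vector<double> Q;

    explicit Summary(std::size_t dims) : d(dims), L(dims, 0.0), Q(dims * dims, 0.0) {}

    // Fold one point into the summary: O(d^2) work per point, O(n d^2) overall.
    void add(const std::vector<double>& x) {
        assert(x.size() == d);
        ++n;
        for (std::size_t a = 0; a < d; ++a) {
            L[a] += x[a];
            for (std::size_t b = 0; b < d; ++b) {
                Q[a * d + b] += x[a] * x[b];  // entry (a, b) of the outer product
            }
        }
    }

    // Mean of dimension a, derived from the summary alone.
    double mean(std::size_t a) const { return L[a] / static_cast<double>(n); }

    // Population covariance of dimensions a and b: Q_ab / n - mean_a * mean_b.
    double covariance(std::size_t a, std::size_t b) const {
        return Q[a * d + b] / static_cast<double>(n) - mean(a) * mean(b);
    }
};
```

After a single scan, statistics such as means and covariances can be derived from n, L, and Q without reading the data again, so the summary size depends on d rather than on n.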
Why join the DBMS group?
• Just came back from ATT Labs (formerly the famous ATT Bell Labs); my head is spinning with C++14 and Unix commands. Currently programming with my PhD students.
• Balance between theory (mathematics) and programming (C++)
• Mature and stable CS research area
• Job prospects upon graduation are excellent. Great opportunity to join industrial labs.
• Visit my web page and DBLP. Google “Ordonez SQL”, or stop by during my office hours.