Big Data Analytics

Carlos Ordonez

Big Data Analytics research

• Input? BIG DATA (large data sets, large files, many documents, many tables, fast growing)

• How? Fast external algorithms; memory-efficient data structures at two storage levels; parallel: multi-threaded or multi-node

• Efficiency goal: linear time O(n) and linear speedup

• Hardware? single node or parallel cluster

• Infrastructure? parallel file system; any large files

• Challenging: Theory+programming in action
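As an illustration of the "fast external algorithm" idea above, here is a minimal C++ sketch of a one-pass scan over a binary file of raw doubles. Only one fixed-size block is held in RAM at a time, so the two storage levels are the file on disk and the block buffer, and the running time is O(n) in the number of values. The names `Summary`, `scan_file`, and `block_elems` are hypothetical, chosen for this example; the slides do not prescribe a file layout or an API.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Running summary of a stream of doubles: count, sum, and sum of squares.
struct Summary {
    std::size_t n = 0;
    double sum = 0.0;
    double sum_sq = 0.0;
};

// Scan a binary file of raw doubles in one pass, keeping one block in RAM.
// block_elems is a hypothetical tuning knob: the number of doubles per block.
Summary scan_file(const std::string& path, std::size_t block_elems = 1 << 16) {
    assert(block_elems > 0);
    Summary s;
    std::ifstream in(path, std::ios::binary);
    if (!in) return s;  // unreadable file: return an empty summary

    std::vector<double> block(block_elems);
    while (true) {
        in.read(reinterpret_cast<char*>(block.data()),
                static_cast<std::streamsize>(block.size() * sizeof(double)));
        // gcount() reports the bytes actually read, also for a short last block.
        std::size_t got = static_cast<std::size_t>(in.gcount()) / sizeof(double);
        if (got == 0) break;
        for (std::size_t i = 0; i < got; ++i) {
            s.n += 1;
            s.sum += block[i];
            s.sum_sq += block[i] * block[i];
        }
    }
    return s;
}
```

Calling `scan_file("data.bin")` on a file written with the same raw-double layout returns the count, sum, and sum of squares. Because the summary is additive, separate threads or nodes could each scan a disjoint part of the input and add their results, which is one way to pursue the linear-speedup goal.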

Systems research today

• Transaction processing? Main memory, lock-free

• Efficient analysis? Optimal joins, compiled queries, streams, exploit ample RAM, exploit multi-core

• Compiler versus interpreter?

• Massive storage? POSIX, HDFS

• Fast external algorithms? Simple tasks.

• Parallel computation? Multi-core with threads, Shared-nothing, message-passing

• Exploiting new hardware? Difficult/customized

• Analyzing: queries, cubes, statistics. Machine learning

• Hot today: Information integration (database+files)

DB systems involve core CS research: Theory+Programming

• Theory we use:

– Time complexity (big O()) and I/O cost (disk, solid state memory)

– Data structures (trees, hash tables, linked lists)

– Relational model and information retrieval models

– Multivariate statistics, machine learning, discrete mathematics, linear algebra

– Compilers and programming languages: parsing/compiling/optimizing code; recursion

• Programming:

– Languages: mostly C++, but also R, SQL, Java

– Unix, but we have a lot of past work on MS Windows

– Systems: Threads, binary I/O, parallel file systems, code generation, code optimization, interpreter runtime

Sample of target problems

Business Intelligence: cubes, lattices

Big Data summarization: vector outer products
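One way to read "summarization with vector outer products", offered here as an assumption and not as the slide's own definition, is to accumulate the count n, the linear sum L = Σ xᵢ, and the quadratic sum Q = Σ xᵢ xᵢᵀ in a single pass. Means and covariances can then be derived from the summary without rereading the data. The C++ sketch below uses hypothetical names (`OuterSummary`, `add`, `merge`).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sufficient statistics for d-dimensional points:
// n = count, L = sum of x_i, Q = sum of outer products x_i * x_i^T (row-major d x d).
struct OuterSummary {
    std::size_t d;
    std::size_t n = 0;
    std::vector<double> L;
    std::vector<double> Q;

    explicit OuterSummary(std::size_t dims)
        : d(dims), L(dims, 0.0), Q(dims * dims, 0.0) {}

    // Fold one point into the summary: O(d^2) work, no point is stored.
    void add(const std::vector<double>& x) {
        assert(x.size() == d);
        ++n;
        for (std::size_t i = 0; i < d; ++i) {
            L[i] += x[i];
            for (std::size_t j = 0; j < d; ++j) Q[i * d + j] += x[i] * x[j];
        }
    }

    // Combine a partial summary computed on a disjoint partition of the data.
    void merge(const OuterSummary& other) {
        assert(other.d == d);
        n += other.n;
        for (std::size_t i = 0; i < d; ++i) L[i] += other.L[i];
        for (std::size_t k = 0; k < d * d; ++k) Q[k] += other.Q[k];
    }

    double mean(std::size_t a) const { return L[a] / static_cast<double>(n); }

    // Population covariance derived from the summary: Q_ab / n - mean_a * mean_b.
    double covariance(std::size_t a, std::size_t b) const {
        return Q[a * d + b] / static_cast<double>(n) - mean(a) * mean(b);
    }
};
```

Since `merge` is a plain element-wise addition, each thread or node can summarize its own partition and the partial results can be combined at the end. Note that the covariance formula above can lose precision when the means are large relative to the spread of the data.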

Bayesian statistics: MCMC, classification, regression, variable/feature selection

Graph transitive closure and linear recursion
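To make the transitive closure problem concrete, the following C++ sketch evaluates the linear recursion TC(x, z) ← TC(x, y), E(y, z) with semi-naive iteration: each round joins only the paths discovered in the previous round with the edge set, and stops when no new path appears. It is a small in-memory illustration under assumed names (`Edge`, `transitive_closure`), not the group's implementation. The nested-loop join is kept for brevity; a real system would index the edges by source vertex.

```cpp
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;  // (source, destination)

// Semi-naive evaluation of the transitive closure of a directed graph.
std::set<Edge> transitive_closure(const std::vector<Edge>& edges) {
    std::set<Edge> closure(edges.begin(), edges.end());
    std::set<Edge> delta = closure;  // paths found in the previous round

    while (!delta.empty()) {
        std::set<Edge> next;
        for (const Edge& p : delta) {
            for (const Edge& e : edges) {
                if (p.second != e.first) continue;  // join condition
                Edge q{p.first, e.second};
                // insert() reports whether q is new; only new paths are re-joined.
                if (closure.insert(q).second) next.insert(q);
            }
        }
        delta = std::move(next);
    }
    return closure;
}
```

The loop terminates on cyclic graphs as well, because a graph with v vertices has at most v² reachable pairs and every round must add at least one new pair to continue.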

Why join the DBMS group?

• Just came back from AT&T Labs (formerly the famous AT&T Bell Labs). My head is spinning with C++14 and Unix commands. Currently programming with my PhD students.

• Balance between theory (mathematics) and programming (C++)

• Mature and stable CS research area

• Job prospects upon graduation are excellent. Great opportunity to join industrial labs.

• Visit my web page or DBLP, Google “Ordonez SQL”, or stop by during my office hours
