Big Data Analytics Carlos Ordonez



Big Data Analytics research

• Input? BIG DATA (large data sets, large files, many documents, many tables, fast growing)

• How? Fast external algorithms; memory-efficient data structures at two storage levels; parallel: multi-threaded or multi-node

• Efficiency goal: linear time O(n) and linear speedup

• Hardware? single node or parallel cluster

• Infrastructure? parallel file system; any large files

• Challenging: Theory+programming in action
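The "How?" and "Efficiency goal" bullets can be illustrated with a minimal sketch of a one-pass external scan. It is not code from the talk. The names `Summary`, `scan_blocks`, and `to_bytes` are hypothetical, and the input is assumed to be a flat binary stream of `double` values. Only one block is held in RAM at a time, so memory stays bounded while the running time is O(n).

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Running summary kept in RAM while the data set stays on disk.
struct Summary {
    std::size_t n = 0;    // number of values seen so far
    double sum = 0.0;     // sum of the values
    double sum_sq = 0.0;  // sum of the squared values
};

// Hypothetical helper: scan a binary stream of doubles one block at a time.
// Memory use is O(block_rows) regardless of input size; time is O(n).
Summary scan_blocks(std::istream& in, std::size_t block_rows) {
    assert(block_rows > 0);
    Summary s;
    std::vector<double> block(block_rows);
    while (in) {
        in.read(reinterpret_cast<char*>(block.data()),
                static_cast<std::streamsize>(block_rows * sizeof(double)));
        // gcount() reports the bytes actually read; the last block may be short.
        std::size_t got = static_cast<std::size_t>(in.gcount()) / sizeof(double);
        for (std::size_t i = 0; i < got; ++i) {
            s.n += 1;
            s.sum += block[i];
            s.sum_sq += block[i] * block[i];
        }
    }
    return s;
}

// Helper for demos: serialize values into an in-memory binary buffer.
std::string to_bytes(const std::vector<double>& values) {
    return std::string(reinterpret_cast<const char*>(values.data()),
                       values.size() * sizeof(double));
}
```

To try it, wrap `to_bytes(values)` in a `std::istringstream` and pass it to `scan_blocks` with a small block size. For a real file, a `std::ifstream` opened with `std::ios::binary` would be used instead. The three sums are enough to derive the mean and variance without a second pass.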

Systems research today

• Transaction processing? Main memory, lock-free

• Efficient analysis? Optimal joins, compiled queries, streams, exploit ample RAM, exploit multi-core

• Compiler versus interpreter?

• Massive storage? Posix, HDFS

• Fast external algorithms? Simple tasks.

• Parallel computation? Multi-core with threads, Shared-nothing, message-passing

• Exploiting new hardware? Difficult/customized

• Analyzing: queries, cubes, statistics. Machine learning

• Hot today: Information integration (database+files)

DB systems involve core CS research: Theory+Programming

• Theory we use:

– Time complexity (big O()) and I/O cost (disk, solid state memory)

– Data structures (trees, hash tables, linked lists)

– Relational model and information retrieval models

– Multivariate statistics, machine learning, discrete mathematics, linear algebra

– Compilers and programming languages: parsing/compiling/optimizing code; recursion

• Programming:

– Languages: mostly C++, but also R, SQL, Java

– Unix, but we have a lot of past work on MS Windows

– Systems: Threads, binary I/O, parallel file systems, code generation, code optimization, interpreter runtime

Sample of target problems

Business Intelligence: cubes, lattices

Big Data summarization: vector outer products

Bayesian statistics: MCMC, classification, regression, variable/feature selection

Graph transitive closure and linear recursion
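As an illustration of the last item, here is a minimal sketch of transitive closure computed by semi-naive evaluation of a linear recursion: R(x, y) ← E(x, y) and R(x, z) ← R(x, y), E(y, z). This is a textbook technique, not necessarily the method used by the group. The names `Edge` and `transitive_closure` are hypothetical, and the graph is assumed to fit in memory.

```cpp
#include <cassert>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;  // (source, destination)

// Semi-naive evaluation: each round joins only the paths discovered in the
// previous round (delta) with the edge table E, instead of the whole closure.
std::set<Edge> transitive_closure(const std::vector<Edge>& edges) {
    // Index E by source vertex so the join on the shared vertex is a hash lookup.
    std::unordered_map<int, std::vector<int>> successors;
    for (const Edge& e : edges) successors[e.first].push_back(e.second);

    std::set<Edge> closure(edges.begin(), edges.end());  // base case: R = E
    std::set<Edge> delta = closure;
    while (!delta.empty()) {
        std::set<Edge> next;
        for (const Edge& path : delta) {
            auto it = successors.find(path.second);
            if (it == successors.end()) continue;
            for (int z : it->second) {
                Edge candidate{path.first, z};
                // Keep only pairs not already in the closure.
                if (closure.insert(candidate).second) next.insert(candidate);
            }
        }
        delta.swap(next);  // recursion stops when no new pairs appear
    }
    return closure;
}
```

The loop terminates on cyclic graphs as well, because the closure over a finite vertex set is finite and each round must add at least one new pair to continue. In SQL the same computation would typically be written as a recursive query.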

Why join the DBMS group?

• Just came back from AT&T Labs (formerly the famous AT&T Bell Labs). My head is spinning with C++14 and Unix commands. Currently programming with my PhD students.

• Balance between theory (mathematics) and programming (C++)

• Mature and stable CS research area

• Job prospects upon graduation are excellent. Great opportunity to join industrial labs.

• Visit my web page and DBLP, Google “Ordonez SQL”, or stop by during my office hours.