3. Vertical DataFirst, a brief description of Data Warehouses versus Database Management Systems
C.J. Date recommended, circa 1980,Do transaction processing on a DataBase Management System (DBMS), rather than doing file processing on file systems. Using a DBMS, instead of file systems, unifies data resources, centralizes control, standardizes usages, minimizes redundancy and inconsistency, maximizes data value and usage, yadda, yadda, yadda...
Inmon, et all, circa 1990Buy a separate Data Warehouse for long-running queries and data mining (separate from DBMS for transaction processing).Double your hardware! Double your software! Double your fun!
Data Warehouses (DWs)vs.DataBase Management Systems (DBMSs)What happened?
Inmon's idea was a great marketing success!,
but fortold a great Concurrency Control Research & Development (CC R&D) failure!CC R&D people had failed to integrate transaction and query processing, Also Known As (AKA) OnLine Transaction Processing (OLTP) and OnLine Analytic Processing (OLAP), that is, update and read workloads) in one system with acceptable performance!
Marketing of Data Warehouses was so successful, nobody noticed the failure! (or seem to mind paying double ;-(
Most enterprises now have a separate DW from their DBMS
Some still hope that DWs and DBs will one day be unified again.
The industry may demand it eventually; e.g., Already, there is research work on real time updating of DWs
For now lets just focus on DATA.
You run up against two curses immediately in data processing.
Curse of cardinality: solutions dont scale well with respect to record volume."files are too deep!"
Curse of dimensionality: solutions dont scale with respect to attribute dimension."files are too wide!"
Curse of cardinality is a problem in the horizontal and vertical world!In the horizontal world it was disguised as curse of the slow join. In the horizontal world we decompose relations to get good design (e.g., 3rd normal form), but then we pay for that by requiring many slow joins to get the answers we need.
Techniques to address these curses.Horizontal processing of vertical data (instead of the ubiquitous vertical processing of horizontal (record orientated) data.
Parallelizing the processing engine.Parallelize the software engine on clusters of computers.
Parallelize the greyware engine on clusters of people (i.e., enable visualization and use the web...).Why do we need better techniques for data analysis, querying and mining?Data volume expands by Parkinsons Law: Data volume expands to fill available data storage.Disk-storage expands by Moores law: Available storage doubles every 9 months!
A few successes: 1. Precision AgricultureProducer are able to analyze the color intensity patterns fromaerial and satellite photos taken in mid season to predict yield(find associations between electromagnetic reflection and yeild).One is hi_green & low_red hi_yield. That is very intuitive.
A stronger association was found strictly by data mining: hi_NIR & low_redhi_yieldOnce found in historical data (through data mining), producers just query TIFF images mid-season for low_NIR & high_red grid cells.Where low yeild is predicted, they then apply additional nitrogen.Can producers use Landsat images of China of predict wheat prices before planting? and a synchronized yield map (crop yield taken at harvest); thus, 4 feature attributes (B,G,R,Y) and ~100,000 pixelsYield prediction: dataset consists of an aerial photograph (RGB TIFF image taken during the growing season)Grasshopper Infestation Prediction (again involving RSI data)Pixel classification on remotely sensed imagery holds significant promise to achieve early detection. Pixel classification (signaturing) has many apps: pest detection, fire detection, wet-lands monitoring (for signaturing we developed the SMILEY software/greyware system) http:midas.cs.ndsu.nodak.edu/~smiley
2. Sensor Network DataMicro and Nano scale sensor blocksare being developed for sensing
Biological agentsChemical agentsMotion detectioncoatings deteriorationRF-tagging of inventory (RFID tags for Supply Chain Mgmt)Structural materials fatigue
There will be trillions++ of individual sensors creating mountains of data.
2. A Sensor Network Application:Each energized nano-sensor transmits a ping (location is triangulated from the ping). These locations are then translated to 3-dimensional coordinates at the display. The corresponding voxel on the display lights up. This is the expendable, one-time, cheap sensor version.
A more sophisticated CEASR device could sense and transmit the intensity levels, lighting up the display voxel with the same intensity.Wherever threshold level is sensed (chem, bio, thermal...)a ping is registered in a compressed structure (P-tree detailed definition coming up) for that location.Nano-sensors droppedinto the Situation space Soldier sees replica of sensedsituation prior to entering space.:.:.:.:..::.:. : ::: ..:. . :: :.:: :..:..::. .:: ..:.::...:.:.:.:..::.:. : ::: ..:. . :: :.:: :..:..::. .:: ..:.::...:.:.:.:..::.:. : ::: ..:. . :: :.:: :..:..::. .:: ..:.::..
Using Alien Technologys Fluidic Self-assembly (FSA) technology, clear plexiglass laminates are joined into a cube, with a embedded nano-LED at each voxel.CubE for Active Situation Replication (CEASR)The single compressed structure (P-tree) containing all the information is transmitted to the cube, where the pattern is reconstructed (uncompress, display).
3. Anthropology Application
Digital Archive Network for Anthropology (DANA)(analyze, query and mine arthropological artifacts (shape, color, discovery location,)
What has spawned these successes?(i.e., What is Data Mining?)Querying is asking specific questions for specific answersData Mining is finding the patterns that exist in data (going into MOUNTAINS of raw data for the information gems hidden in that mountain of data.)
Data Mining versus QueryingEven on the Query end, much work is yet to be done (D. DeWitt, ACM SIGMOD Record02).On the Data Mining end, the surface has barely been scratched.But even those scratches had a great impact One of the early scatchers becamethe biggest corporation in the world. A Non-scratcher filed for bankruptcy protection. There is a whole spectrum of techniques to get information from data:Our Approach: Vertical, compressed data structures, Predicate-trees or Peano-trees (Ptrees in either case)1 processed horizontally (DBMSs process horizontal data vertically)Ptrees are data-mining-ready, compressed data structures, which attempt to address the curses of scalability and curse of dimensionality.1 Ptree Technology is patentedby North Dakota State University
00000011To find # occurences of 7 0 1 4, horizontally AND basic Ptrees(next slide)Predicate trees (Ptrees): vertically project each attribute, Given a table structure into horizontal records. Process it (scan it) verticallyTop-down construction of the 1-dimensional Ptree of R11, denoted, P11:
Record the truth of the universal predicate pure 1 in a tree recursively on halves (1/21 subsets),until purity is achieved.But it is pure (pure0) so this branch endsthen vertically project each bit position of each attribute,then compress each bit slice into a basic Ptree. e.g., compression of R11 into P11 goes as follows:P11R(A1 A2 A3 A4)2 7 6 16 7 6 03 7 5 12 7 5 73 2 1 42 2 1 57 0 1 47 0 1 4Base 10Base 2What are Ptrees?(e.g., to find # of occurences of 7 0 1 4 =?)2
To count occurrences of 7,0,1,4 use 111000001100: 0 P11^P12^P13^P21^P22^P23^P31^P32^P33^P41^P42^P43 = 0 0 01 ^R(A1 A2 A3 A4)2 7 6 13 7 6 02 7 5 12 7 5 75 2 1 42 2 1 57 0 1 47 0 1 421-level has the only 1-bit so the 1-count = 1*21 = 2# change
00001011Top-down construction of basic P-trees is best for understanding, bottom-up is much faster (once across).Bottom-up construction of 1-Dim, P11, is done using in-order tree traversal, collapsing of pure siblings as we go:0 1 0 1 1 1 1 1 0 0 0 10 1 1 1 1 1 1 1 0 0 0 00 1 0 1 1 0 1 0 1 0 0 10 1 0 1 1 1 1 0 1 1 1 11 0 1 0 1 0 0 0 1 1 0 00 1 0 0 1 0 0 0 1 1 0 11 1 1 0 0 0 0 0 1 1 0 01 1 1 0 0 0 0 0 1 1 0 0
R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43
In this example database (which is used throughout these notes), there are two entities, Students (a student has a number, S#, a name, SNAME, and gender, GENCourses (course has a number, C#, name, CNAME, State where the course is offered, ST, TERMand ONE relationship,Enrollments (a student, S#, enrolls in a class, C#, and gets a grade in that class, GR).The horizontal Education Database consists of 3 files, each of which consists of a number of instances of identically structured horizontal records:C#|CNAME|ST|TERM0 |BI |ND| F1 |DB |ND| S2 |DM |NJ| S3 |DS |ND| F4 |SE |NJ| S5 |AI |ND| FCoursesS#|SNAME|GEN 0 |CLAY | M1 |THAD | M2 |QING | F3 |AMAL | M4 |BARB | F5 |JOAN | FStudentS#|C#|GR 0 |1 |B 0 |0 |A 3 |1 |A 3 |3 |B 1 |3 |B1 |0 |D2 |2 |D2 |3 |A4 |4 |B5 |5 |BEnrollmentsA Education Database ExampleWe have already talked about the process of structuring data in a horizontal database (e.g., develop an Entity-Relationship diagram or ER diagram, etc. - in this case:What is the process of structuring this data into a vertical database? To be honest, that is an open question. Much research is needed on that issue! (great term paper topics exist here!!!)We will discuss this a little more on the next slide.
S:S#___|SNAME|GEN 0 000|CLAY |M 0 1 001|THAD |M 0 2 010|QING |F 1 3 011|BARB |F 1 4 100|AMAL |M 0 5 101|JOAN |F 11. Code some attributes in binary (shown in red italics to the right of each field value encoded). For numeric fields, we have used standard binary encoding. For gender, F=1 and M=0. For term, Fall=0, Spring=1. For grade, A=11, B=10, C=01, D=00 (which could be called GPA encoding?). We have also abreviated STUDENT to S, COURSE to C and ENROLLMENT to E.C:C#___|CNAME|ST|TERM 0 000|BI |ND|F 0 1 001|DB |ND|S 1 2 010|DM |NJ|S 1 3 011|DS |ND|F 0 4 100|SE |NJ|S 1 5 101|AI |ND|F 0E:S#___|C#___|GR . 0 000|1 001|B 10 0 000|0 000|A 11 3 011|1 001|A 11 3 011|3 011|D 00 1 001|3 011|D 00 1 001|0 000|B 10 2 010|2 010|B 10 2 010|3 011|A 11 4 100|4 100|B 10 5 101|5 101|B 10One way to begin to vertically structure this data is:The above encoding seem natural. But how did we decide which attributes are to be encoded and which are not? As a term paper topic, that would be one of the main issues to researchNote, we have decided not to encode names (our rough reasoning (not researched) is that there would be little advantage and it would be difficult (e.g. if name is a CHAR(25) datatype, then in binary that's 25*8 = 200 bits!). Note that we have decided not to encode State. That may be a mistake! Especially in this case, since it would be so easy (only 2 States ever? so 1 bit), but more generally there could be 50 and that would mean at least 6 bits.2. Another binary encoding scheme (which can be used for numeric and non-numeric fields) is value map or bitmap encoding. The concept is simple. For each possible value, a, in the domain of the attribute, A, we encode 1=true and 0=false for the predicate A=a. The resulting single bit column becomes a map where a 1 means that row has A-value = a and a 0 means that row or tuple has A-value which is not a.There is a wealth of existing research on bit encoding. There is also quite a bit of research on vertical databases. There is even the first commercial vertical database announced called Vertica (check it out by Googling that name). Vertica was created by the same guy, Mike Stonebraker, who created one of the first Relational Databases, Ingres.
The Vertical bit sliced (uncompressed P-trees - P for Predicate) attributes stored as:
S.s2S.s1S.s0S.gC.c2C.c1C.c0C.tE.s2E.s1E.s0E.c2E.c1E.c0E.g1E.g000000000000001100010001100000011 1000100100101100 1011101000100010 0101010101101111 0111011001101100 01001010 01000111 10010010 10110110
The Vertical (un-bit-sliced) attributes are stored:S.name C.name C.st |CLAY | |BI | |ND| |THAD | |DB | |ND| |QING | |DM | |NJ| |BARB | |DS | |ND| |AMAL | |SE | |NJ| |JOAN | |AI | |ND|Way-1 for vertically structuring the Educational DatabaseThe W1 VDBMS would then be stored as:Before moving on to 2 dimesional data encoding (e.g., images), we show one query processing algorithm for VDBMSs. Much more on this comes later in section 9 of the course.
S.s2001100S:s n gen|0 000| |0||1 001| |0||2 100| |1||3 111| |1||4 010| |0||5 011| |1|Vertical Query Processing (another great term paper research area - see the notes on Query Processing, section 9 for more details)SELECT S.n, E.g FROM S, EWHERE S.s=E.s & E.g=DE:s |grade|0 000 |B 10||0 000 |A 11||3 011 |A 11||3 011 |D 00||1 001 |D 00||1 001 |B 10||2 010 |B 10||2 010 |A 11||4 100 |B 10||5 101 |B 10|E.s20000000011S.s1000011S.s0010101S.g110010E.s10011001100E.s00011110001E.g11110011111E.g00110000100In the EDUC database (Students, Courses, Enrollments), numeric attributes are represented vertically as P-trees (not compressed). Categorical attributes (non-numeric) are projected to a 1 column vertical file
S.s2001100S.s1000011S.s0010101S.g1 10010E.s20000000011E.s10011001100E.s00011110001E.g11110011111E'.g01011011011C.c10011C.c10101C.r10111C.r21110SELECT S.n, E.g FROM S, EWHERE S.s=E.s & E.g=DEM0001000000For the selection mask, E.g=D we perform EM= E'.g1 AND E'.g2(want both bits to be zero).E'.g10001100000E.g00100100100
For the join, E.s=S.s an indexed nested loop like method can be used.S'.s1111101S.s2110010S.s2000000S.s1000010S.s0010000S.nCLAYTHADGOODBRADPERYJOANE.s20000000000E.s10010000100E.s00010000000Get 1st masked E.s value, 000b Mask S tuples: PS.s2AND PS.s1AND PS.s0 S.s0100010SM100000Get S.n-value(s), C, pair it with E.g-value(s), output concatenation, S.n E.gCLAY DEM0001000000SELECT S.n, E.g FROM S, EWHERE S.s=E.s & E.g=DNOTE: The cost of processing this query is almost independent of cardinality of the files (number of rows). So if there were 12,000 students and 100,000 enrollments, except for the extra cost of ANDing longer bit vectors (which is insignificant - AND is the fasted operation in any computer), the query would process very rapidly. This is because no scan is required.