lucidworks
•Logs
•Clickstream
•Call Detail Record (CDR)
•Financial Transactions
•Sensor Data
Example Event Data
•Denormalized (a lot of redundancy)
•Nested
•Schema-less
•Historical (immutable i.e. will never change)
•Isolated (can often be processed in parallel)
Characteristics of Event Data
• A time series is a sequence of data points.
• The points are typically measured at successive points in
time spaced at uniform time intervals.
• A time series has a natural temporal ordering.
Time Series Data
• A stochastic model for a time series will generally reflect the fact that
observations close together in time will be more closely related than
observations further apart.
• This is one of the reasons to group some event data into sessions.
• A session is a time window for all events generated by a given source
(like a unique user ID).
Stochastic Model
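The idea of grouping one source's events into sessions can be sketched as follows. This is a minimal illustration, not the deck's implementation: the `(source_id, timestamp, payload)` event shape, the `sessionize` helper, and the 30-minute inactivity gap used to close a session are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical inactivity gap (in seconds) that closes a session; not from the slides.
SESSION_GAP = 30 * 60

def sessionize(events, gap=SESSION_GAP):
    """Group (source_id, timestamp, payload) events into per-source sessions."""
    by_source = defaultdict(list)
    for source_id, ts, payload in events:
        by_source[source_id].append((ts, payload))

    sessions = []
    for source_id, items in by_source.items():
        items.sort(key=lambda item: item[0])  # restore temporal ordering
        current = [items[0]]
        for prev, cur in zip(items, items[1:]):
            if cur[0] - prev[0] > gap:
                # Too much time has passed: close the session and start a new one.
                sessions.append((source_id, current))
                current = []
            current.append(cur)
        sessions.append((source_id, current))
    return sessions

events = [
    ("u1", 0, "q:solr"),
    ("u1", 120, "doc:42"),
    ("u1", 7200, "q:lucene"),
    ("u2", 60, "q:solr"),
]
sessions = sessionize(events)
```

Here the first two `u1` events land in one session because they are close together in time, while the third starts a new one.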
• If entities frequently co-occur, we may infer an interdependence
between those entities.
• For example, a click event can be described using a click(user, query,
docId) tuple.
• We can then associate each of those entities with each other and with
other similar events within a session.
Co-Occurrence (Adjacency) Matrix
• NB: We generate the co-occurrence relations not just between the
same field types e.g. (query, query) pairs, but also "cross-field"
relations e.g. (query, docId), (docId, user) pairs etc.
• This will give us an N x N adjacency matrix, where N is the number of
unique instances of the field types for which we want to calculate
co-occurrence relations.
Co-Occurrence (Adjacency) Matrix
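A small sketch of how such co-occurrence counts could be collected from click(user, query, docId) tuples, including the cross-field pairs. The sparse `Counter` representation, the `cooccurrence` helper, the type-prefixed entity keys, and the choice to count each pair once per session are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(sessions):
    """Count how often two typed entities appear in the same session."""
    counts = Counter()
    for clicks in sessions:
        # Tag each value with its field type so that same-field and
        # cross-field pairs share one key space without collisions.
        entities = set()
        for user, query, doc_id in clicks:
            entities.update({("user", user), ("query", query), ("docId", doc_id)})
        for a, b in combinations(sorted(entities), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # store both directions (symmetric matrix)
    return counts

# Each session is a list of click(user, query, docId) tuples (made-up data).
sessions = [
    [("u1", "solr", "d1"), ("u1", "lucene", "d1")],
    [("u2", "solr", "d1")],
]
counts = cooccurrence(sessions)

# N is the number of distinct entities across all field types.
nodes = sorted({a for a, _ in counts})
n = len(nodes)
```

Only non-zero cells are stored, which is usually preferable to a dense N x N array when most entity pairs never co-occur.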
• The co-occurrence matrix can be used to generate a weighted
directed graph.
• Here each node can be of any of the types we are interested in (users,
queries, sessions, documents etc.).
• The weights for the edges between the nodes in the graph are the
frequency of co-occurrence i.e. how many times the source and target
nodes have occurred together.
Weighted Directed Graph
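One way to turn co-occurrence counts into a weighted directed graph is a plain adjacency dictionary, sketched below. The `counts` data, the string node labels, and the `build_graph` and `neighbors` helpers are hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical co-occurrence counts keyed by (source, target); values are frequencies.
counts = {
    ("query:solr", "doc:d1"): 2,
    ("doc:d1", "query:solr"): 2,
    ("query:lucene", "doc:d1"): 1,
    ("doc:d1", "query:lucene"): 1,
    ("query:solr", "query:lucene"): 1,
    ("query:lucene", "query:solr"): 1,
}

def build_graph(counts):
    """Build node -> {target: weight}, one directed edge per non-zero count."""
    graph = defaultdict(dict)
    for (source, target), weight in counts.items():
        if weight > 0:
            graph[source][target] = weight
    return dict(graph)

def neighbors(graph, node, limit=10):
    """Return the outgoing edges of a node, heaviest first."""
    edges = graph.get(node, {})
    return sorted(edges.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]

graph = build_graph(counts)
top = neighbors(graph, "doc:d1")
```

Nodes of different types (queries, documents, users) live in the same graph, so a single lookup can return related items of any type.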
• For a given item, we may only have a small number of items that
co-occur with it (based on the co-occurrence matrix).
• By adding in items reached by navigating the graph (which is built
from the matrix), we increase the diversity of suggestions.
• Items that appear in multiple source lists then rise to the top.
• This helps improve the quality of the recommendations & reduce bias.
Graph Navigation
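The navigation step might look like the sketch below: start from an item's direct co-occurrences, add one candidate list per neighbor, and rank candidates that show up in several lists first. The graph data, the `recommend` helper, the two-hop limit, and the ranking rule (number of lists, then summed weight) are assumptions chosen for illustration, not the method used in the deck.

```python
from collections import Counter

# Hypothetical weighted graph: node -> {neighbor: co-occurrence frequency}.
graph = {
    "doc:a": {"doc:b": 3, "doc:c": 2},
    "doc:b": {"doc:a": 3, "doc:d": 2, "doc:e": 1},
    "doc:c": {"doc:a": 2, "doc:d": 2},
    "doc:d": {"doc:b": 2, "doc:c": 2},
    "doc:e": {"doc:b": 1},
}

def recommend(graph, item, limit=5):
    """Rank candidates from the direct list plus one list per neighbor."""
    direct = graph.get(item, {})
    scores = Counter(direct)                   # summed edge weights per candidate
    hits = Counter(dict.fromkeys(direct, 1))   # number of source lists per candidate
    for neighbor in direct:
        for candidate, weight in graph.get(neighbor, {}).items():
            if candidate == item:
                continue  # never recommend the item itself
            scores[candidate] += weight
            hits[candidate] += 1
    # Candidates found in more source lists come first, then heavier ones.
    ranked = sorted(scores, key=lambda c: (-hits[c], -scores[c], c))
    return ranked[:limit]

suggestions = recommend(graph, "doc:a")
```

In this made-up graph, `doc:d` and `doc:e` never co-occur with `doc:a` directly, yet both become suggestions, and `doc:d` ranks first because it appears in the lists of two different neighbors.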