DESCRIPTION
Social is generating large volumes of data about the business (who interacts with whom, when, and in what context). However, little of this data is actively used to generate insights that allow the business to work smarter and faster. This technical session describes how to capture and collect interactions within IBM Connections through its public APIs and apply a variety of analytics, including map/reduce and graph analytics, on a scalable Hadoop platform. This allows us to uncover insights into what the corporate network structure looks like, how information propagates across the organization, how opinions are formed, and how resilient the organization is to attrition.
© 2014 IBM Corporation
AD306 Turbocharge Your Enterprise Social Network with Analytics
Vincent Burckhardt, IBM
David Robinson, IBM
Please Note
IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
Agenda
A Peek into Data Science
Extracting IBM Connections data for analytical purposes
Analytics And Connections Data
A Peek Into Data Science
What Is This Thing Called Data Science ?
Credit: Rachel Schutt/Cathy O’Neil
A Single Coffee Receipt
date        time   cashier  location    size  qty  item   spent
12/10/2013  13:09  Chris    Raleigh500  reg   1    mocha  .80
A Year’s Worth Of Coffee Receipts For One Person
date        time   cashier  location    size  qty  item     spent
01/10/2013  13:53  Chris    Raleigh500  reg   1    mocha    .80
01/12/2013  14:02  Doug     Carrabou    reg   1    mocha    .80
01/14/2013  13:09  Nadia    Raleigh500  reg   1    vanilla  .75
02/01/2013  14:02  Nadia    Raleigh500  lg    1    mocha    1.10
03/14/2013  13:14  Chris    Raleigh500  reg   1    blend    .60
04/20/2013  13:32  Nadia    Stardoe     lg    1    mocha    1.10
…
12/14/2013  13:14  Bev      Raleigh500  reg   1    blend    .60
12/20/2013  13:32  Nadia    Winston’s   reg   1    mocha    1.10
Insights: M-F, 1-2 pm; 72% Raleigh500; 75% regular; 63% mocha; $.87 avg spending
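The step from raw receipts to insights can be sketched in a few lines of Python. This is only an illustration: it uses the eight rows visible on the slide (the elided "…" rows are not reconstructed), so its percentages differ from the full-year figures quoted above. The names `RECEIPTS`, `top_share`, and `summarize` are made up for this sketch.

```python
from collections import Counter

# Rows visible on the slide; the elided "…" rows are deliberately left out.
# Columns: date, time, cashier, location, size, qty, item, spent.
RECEIPTS = [
    ("01/10/2013", "13:53", "Chris", "Raleigh500", "reg", 1, "mocha", 0.80),
    ("01/12/2013", "14:02", "Doug", "Carrabou", "reg", 1, "mocha", 0.80),
    ("01/14/2013", "13:09", "Nadia", "Raleigh500", "reg", 1, "vanilla", 0.75),
    ("02/01/2013", "14:02", "Nadia", "Raleigh500", "lg", 1, "mocha", 1.10),
    ("03/14/2013", "13:14", "Chris", "Raleigh500", "reg", 1, "blend", 0.60),
    ("04/20/2013", "13:32", "Nadia", "Stardoe", "lg", 1, "mocha", 1.10),
    ("12/14/2013", "13:14", "Bev", "Raleigh500", "reg", 1, "blend", 0.60),
    ("12/20/2013", "13:32", "Nadia", "Winston's", "reg", 1, "mocha", 1.10),
]


def top_share(rows, column):
    """Return the most frequent value in a column and its share of all rows."""
    counts = Counter(row[column] for row in rows)
    value, count = counts.most_common(1)[0]
    return value, count / len(rows)


def summarize(rows):
    """Aggregate many single transactions into a handful of habits."""
    return {
        "top_location": top_share(rows, 3),
        "top_size": top_share(rows, 4),
        "top_item": top_share(rows, 6),
        "avg_spent": sum(row[7] for row in rows) / len(rows),
    }


if __name__ == "__main__":
    for key, value in summarize(RECEIPTS).items():
        print(key, value)
```

The same aggregation, grouped by person, is what turns one customer's habits into a view across many customers.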
A Year’s Worth Of Coffee Receipts For Many People
date        time   cashier  location    size  qty  item     spent
01/10/2013  13:53  Chris    Raleigh500  reg   1    mocha    .80
01/12/2013  14:02  Doug     Carrabou    reg   1    mocha    .80
01/14/2013  13:09  Nadia    Raleigh500  reg   1    vanilla  .75
02/01/2013  14:02  Nadia    Raleigh500  lg    1    mocha    1.10
03/14/2013  13:14  Chris    Raleigh500  reg   1    blend    .60
04/20/2013  13:32  Nadia    Stardoe     lg    1    mocha    1.10
…
12/14/2013  13:14  Bev      Raleigh500  reg   1    blend    .60
12/20/2013  13:32  Nadia    Winston’s   reg   1    mocha    1.10
Added column: person (Joel, Toni, Joni, Joe, Dan, Dave, Ken, Sally)
You get the idea…
Business Actions From Insights
From a single transaction (one receipt)
To engaging the customer with relevant actions (many receipts)
- Coupons for food
- Weekend offers?
- Loyalty card?
- Employee rewards?
Datafication
“The process of taking all aspects of life and turning them into data”
– Google’s augmented-reality glasses
– Twitter for thoughts
– LinkedIn for professional networks
Creating new products with data, improving existing products with data
Credit: Kenneth Cukier/Viktor Mayer-Schoenberger, Foreign Affairs, May/June 2013, http://tinyurl.com/ke6cqku
Today we’ll show you how to add Lotus Connections to the list
The Value of Connections ?
Obvious value:
– Collaboration tool
Perhaps “not so obvious” value:
– “Social Receipts” … Datafication of Interaction Patterns … Business Insights!
[Diagram labels: Connections Analytics, Business Insights]
Possible Questions Connections Data Can Help Answer
Are you effectively communicating your message ?
Are others responding to your message?
Are customers, business partners, contractors, employees responding to your message?
Who are brokers of information in the organization ?
What Lotus communities are the most effective ?
What are the communication patterns like between divisions ?
What are the communication characteristics of high performing organizations ?
Ask Your Question… Find Your Business Value
Extracting IBM Connections data for analytical purposes
IBM Connections
Home page: See what's happening across your social network
Communities: Work with people who share common roles and expertise
Files: Post, share, and discover documents, presentations, images, and more
Micro-blogging: Reach out to your social network for help
Profiles: Find the people you need
Wikis: Create web content together
Activities: Organize your work and tap your professional network
Bookmarks: Save, share, and discover bookmarks
Blogs: Present your own ideas, and learn from others
Forums: Exchange ideas with, and benefit from the expertise of, others
Connections Maximizes The Value of Social Data
IBM Connections provides APIs and SPIs that allow the value of the social data to be maximized by external systems:
– ALL Connections data can be accessed by external systems
– Open, transparent, breaking down silos
Pull data from IBM Connections
– Programmatically access much of the same information that you can through the IBM Connections user interface
Have Connections push data to you
– All data change (CUD) events in all IBM Connections components can be supplied to external consumers
Connections Architecture
[Architecture diagram, built up over three slides.
Build 1: IBM Connections Apps on top of Common Services (Search, Person Card, Navigational Header, User Directory, Administration through JMX / WSAdmin), backed by an RDB, a Directory, and the File System.
Build 2 adds the client side: Browser, Feed Reader, Lotus Notes, Sametime, Portlets, Microsoft Office, Mashups, and Your App reach the apps through an HTTP Server & Proxy Cache, using HTML, HTML Form, JavaScript, JSON, and the Connections Atom API (REST API; Atom Feed and Atom Entry; GET, POST, PUT, DELETE).
Build 3 adds the Event SPI, an Integration bus, Other Enterprise Services, and Your App as an event consumer.]
The Event SPI is the social data fire-hose
Designed to let 3rd parties get notified whenever a data change happens in any of the IBM Connections services
– Real-time events generated by IBM Connections include all create, update, and delete (CUD) operations.
– Potential to represent the complete interaction footprint of the enterprise
– Lets you capture, persist, model, analyze, visualize and monetize your enterprise network
SPI (System Programming Interface) vs API (Application Programming Interface)
– SPI at lower level than APIs ... contribute Java code at system level
– By contributing Java code written to this SPI, 3rd parties can listen to creation, deletion and update (and more!) events of content within IBM Connections
Event SPI – Programming aspects
Events: collections of data generated when activities (data-modifying, notifications) occur in IBM Connections
– In the SPI, an event is represented by a Java bean / object
– An event encapsulates data such as the type of action and the object (and container) involved in the action
Events are delivered to Event Handlers:
– An event handler is a Java class implemented by a 3rd party (you!)
– Event handlers are registered in an XML file (events-config.xml)
• Instructing what type of event to send to a given handler
– Connections delivers a Java bean representing the event to the registered event handler(s)
[Diagram: the Event SPI, configured by events-config.xml, delivers events to Handler 1, Handler 2, ... Handler N]
Event SPI – available data in each event
Example: blog.entry.created: “Amy Jones posted a blog entry in the blog named XYZ”
Actor: The person who initiated this action.
Details: External id, name and, if not disabled, email address
Type: Type of action
Example: CREATE, UPDATE, DELETE, NOTIFY, MEMBERSHIP, ..
Item: General concept for representing an individual entity within a container
Details: id, name, textual content, HTML and ATOM paths
Container: General concept for representing a "bucket" or "container" that contains other items
Details: id, name
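To make the four parts of an event concrete, here is a minimal Python sketch of the model described above. It is not the Event SPI itself (which is Java, documented in its JavaDoc); the class and field names below are invented for illustration and only mirror the details listed on the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Actor:
    """The person who initiated the action."""
    external_id: str
    name: str
    email: Optional[str] = None  # absent when email exposure is disabled


@dataclass(frozen=True)
class Item:
    """An individual entity within a container (for example a blog entry)."""
    id: str
    name: str
    content: str = ""
    html_path: str = ""
    atom_path: str = ""


@dataclass(frozen=True)
class Container:
    """The "bucket" that contains items (for example a blog)."""
    id: str
    name: str


@dataclass(frozen=True)
class Event:
    name: str  # e.g. "blog.entry.created"
    type: str  # e.g. "CREATE", "UPDATE", "DELETE"
    actor: Actor
    item: Item
    container: Container


def describe(event: Event) -> str:
    """Render an event as a short human-readable sentence."""
    return (f"{event.actor.name} {event.type} "
            f"{event.item.name!r} in {event.container.name!r}")


if __name__ == "__main__":
    example = Event(
        name="blog.entry.created",
        type="CREATE",
        actor=Actor(external_id="u-1", name="Amy Jones"),
        item=Item(id="e-1", name="My entry"),
        container=Container(id="b-1", name="XYZ"),
    )
    print(describe(example))
```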
Event SPI – available data in each event
Many more data fields encapsulated in events:
– Correlation item set to represent parent-child relationship (events about commenting action)
– Target set, allowing to deduce interaction between content and people
– Membership delta field, indicating who has been added/removed from a community, activity, ...
– ... see Event SPI documentation for full list (JavaDoc)
Key point: the event model encapsulates all of the data needed to understand the interaction between people, content and containers in the platform
Event SPI in the context of an analytic solution
Challenges of analytics:
Large incoming event stream
– 100+ CUD events per second
– Growing over the longer term
– Scalable framework for analysis
• Horizontal scale to address growth
(Near) real-time indexing
No data loss
Taming the fire-hose... (1/2)
Analysis, even basic, is time consuming, thus:
Analysis should not occur in the event handler, but in an external system (“Analytics Service”)
The event handler should not wait until the analytic service processes the event
– It would result in an accumulation of events at Connections level
– Problematic as Connections queue retaining events to be delivered to event handler has a limited depth
=> Design event handler to consume and process events as fast as possible, ie: as the interface between IBM Connections and an external system
[Diagram: Event SPI, then Event Handler, then “Data backbone” (storage for asynchronous processing), then Analytics Service]
Goal: retaining as many events as possible for further analysis
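The "hand off and return immediately" idea can be sketched as follows. This is a simplified Python illustration, not Connections code: an in-memory `queue.Queue` stands in for the data backbone (a real deployment would use a durable store or broker so that nothing is lost), and `BufferingHandler`, `handle_event`, and `run_worker` are hypothetical names.

```python
import queue
import threading


class BufferingHandler:
    """Hypothetical handler: accept an event, enqueue it, return at once."""

    def __init__(self, maxsize=10000):
        self.backlog = queue.Queue(maxsize=maxsize)
        self.dropped = 0

    def handle_event(self, event):
        # Never block the caller; analysis happens elsewhere.
        try:
            self.backlog.put_nowait(event)
        except queue.Full:
            self.dropped += 1  # counted so that loss is at least visible


def run_worker(handler, process, stop):
    """Drain the backlog on a separate thread until asked to stop."""
    while not (stop.is_set() and handler.backlog.empty()):
        try:
            event = handler.backlog.get(timeout=0.1)
        except queue.Empty:
            continue
        process(event)  # the slow analytics step
        handler.backlog.task_done()


if __name__ == "__main__":
    handler = BufferingHandler()
    stop = threading.Event()
    worker = threading.Thread(target=run_worker, args=(handler, print, stop))
    worker.start()
    for n in range(3):
        handler.handle_event({"seq": n})
    handler.backlog.join()
    stop.set()
    worker.join()
```

The bounded in-memory queue is only there to show the decoupling; it does not meet the "no data loss" requirement on its own.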
Taming the fire-hose... (2/2)
Characteristics of the data backbone
– Distributed and highly available
– Horizontal scale
– High throughput
– Agnostic to consumers' state
Multiple options
– Message broker: MQ / MQTT / ActiveMQ / Apache Kafka
– Database
– ...
Integration with a message broker – Apache Kafka
Send JSON representation of the event. Serialization to JSON through Open Source GSON library
Java class implementing the EventHandler interface
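The handler on the slide is Java and serializes with GSON; the Python sketch below only illustrates the same shape (serialize the event to JSON, pass it to a producer). The `send(topic, payload)` callable is a placeholder for whatever producer client is used, and the topic name `connections-events` and the class name are made up for this example.

```python
import json


def event_to_json(event: dict) -> bytes:
    """Serialize an event to compact, key-sorted UTF-8 JSON."""
    return json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")


class JsonForwardingHandler:
    """Hypothetical handler that forwards each event to a message broker.

    `send` is any callable taking (topic, payload_bytes); in a real
    deployment it would wrap a broker client's send method.
    """

    def __init__(self, send, topic="connections-events"):
        self._send = send
        self._topic = topic

    def handle_event(self, event: dict) -> None:
        self._send(self._topic, event_to_json(event))


if __name__ == "__main__":
    handler = JsonForwardingHandler(lambda topic, payload: print(topic, payload))
    handler.handle_event({"name": "blog.entry.created", "type": "CREATE"})
```

Keeping the handler this thin leaves all analysis to downstream consumers of the topic.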
Integration with a message broker – Apache Kafka
Registration – through events-config.xml
Java class implementing EventHandler interface
Subscriptions define the events delivered by the SPI to the event handler.
Filtered by event name, source (IBM service), and/or type (CREATE, UPDATE, DELETE, ...)
Properties: name/value pairs injected into the event handler Java class. Typically used to pass config. settings
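The filtering that subscriptions perform can be pictured with a small Python sketch. It does not reproduce the events-config.xml syntax; the `Subscription` class, the `*` wildcard convention, and the source and event names other than blog.entry.created are assumptions made for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Subscription:
    """Hypothetical subscription: each field is a pattern, "*" matches all."""
    event_name: str = "*"
    source: str = "*"
    type: str = "*"

    def matches(self, event: dict) -> bool:
        return (fnmatchcase(event["name"], self.event_name)
                and fnmatchcase(event["source"], self.source)
                and fnmatchcase(event["type"], self.type))


def should_deliver(event: dict, subscriptions) -> bool:
    """An event reaches the handler if any subscription matches it."""
    return any(sub.matches(event) for sub in subscriptions)


if __name__ == "__main__":
    subs = [Subscription(source="BLOGS", type="CREATE"),
            Subscription(event_name="forum.*")]
    event = {"name": "blog.entry.created", "source": "BLOGS", "type": "CREATE"}
    print(should_deliver(event, subs))
```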
Integration with a message broker – Apache Kafka
Deployment – jar and dependencies made available to the SPI (running in the IBM Connections News application) through a Shared Library in WebSphere Application Server
3rd party events can also participate in the social analytics solution
IBM Connections provides OpenSocial Activity Streams APIs allowing 3rd party to push their own events to the Activity Stream
From Connections 4.5:
– Events pushed through the Activity Stream APIs are also surfaced in the Event SPI
– An option lets you NOT surface an event in the Activity Stream, ie: surface it only through the Event SPI
=> 3rd party applications can also participate in the social analytics graph simply by publishing to the Connections Activity Stream APIs
Pulling data – when is it needed ?
You can “pull” all data from Connections...
but is it really needed?
Good news:
Events surface, in most cases, all data needed for analytics purposes (including the content the event is about)
Events about the same object repeat data
– If there are X events about the same object, the item/correlation data set will always contain the most up-to-date information about the referenced object
For an analytic solution, in a nutshell, this means that the Event SPI should be sufficient in most cases
Pulling data – when is it needed ?
“Push” approach (Event SPI) is sufficient to build most analytic solutions
– All necessary content (textual content, tags, …) is surfaced in every single event
– All operations changing relationships (ie: adding/removing member, colleague, follower) are surfaced as events
“Pull” (REST APIs) approaches should stay limited to:
1. “Bootstrap” the Analytics Service based on a Connections system with data existing prior to the introduction of the event handler used in your analytic solution
• Essentially building membership/network data (as needed)
• Seeding the content should not be needed, as it is repeated whenever an event about the content is generated
2. Fetching data not available through the Event SPI
• Relatively “rare” for events generated from Connections
Pulling data from Connections
2 main approaches for pulling data from Connections
1. REST APIs (Atom / OpenSocial format)
– REST-style HTTP based APIs (XML, JSON format)
– Transparency: programmatically access much of the same information that can be accessed through the IBM Connections UI
– “Drink your own champagne”: public APIs used internally by plug-ins, mobile … and even some components' Web UI (Activity Stream, Activities, …)
2. Seedlist
– Designed to allow crawling of Connections data for indexing purposes by a search engine
– Surfaces all content in the system, therefore it can be of some value for an analytic solution
– HTTP based APIs (Atom XML format)
Seedlist
Example: /forums/seedlist/myserver returns ALL forum entries in the system
– Textual content, author, number of comments, number of recommendations, parent id, ACL
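Since both the REST APIs and the seedlist return Atom XML, a consumer mostly needs a feed parser. The Python sketch below parses a tiny, made-up Atom document with the standard library; it is not real seedlist output, which carries additional fields (comment counts, ACL, and so on) that are not modeled here.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Made-up sample in plain Atom; a real seedlist response has more fields.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Forum seedlist (made-up sample)</title>
  <entry>
    <id>entry-1</id>
    <title>First topic</title>
    <author><name>Amy Jones</name></author>
    <summary>Hello world</summary>
  </entry>
  <entry>
    <id>entry-2</id>
    <title>Second topic</title>
    <author><name>Bob Smith</name></author>
    <summary>Another post</summary>
  </entry>
</feed>
""".strip()


def parse_entries(feed_xml: str):
    """Extract id, title, author name and summary from each Atom entry."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall(f"{ATOM}entry"):
        entries.append({
            "id": entry.findtext(f"{ATOM}id", default=""),
            "title": entry.findtext(f"{ATOM}title", default=""),
            "author": entry.findtext(f"{ATOM}author/{ATOM}name", default=""),
            "summary": entry.findtext(f"{ATOM}summary", default=""),
        })
    return entries


if __name__ == "__main__":
    for row in parse_entries(SAMPLE_FEED):
        print(row)
```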
Authentication aspects for the REST APIs
REST APIs support basic authentication, form-based authentication and (for most APIs) OAuth
Private data: strict enforcement of access on API calls
– Not very convenient for access by an analytic system...
“Super user”
– Concept of “super user”: access control checks on private data are bypassed
– The “super user” is a user mapped to the JEE “admin” role across all Connections services
Public data: APIs that access public data don't require authentication
– Provided that the environment is not configured to prevent anonymous access
Pulling data from Connections – What to use, when?
REST APIs (Atom / OS APIs)
Pros:
• Fine granularity: access content / meta-data for a specific object / container
• Access relationship information: APIs are available for fetching membership lists, network information, who liked a given object, ...
Cons:
• Lack of batch retrieval capabilities
Seedlist
Pros:
• Batch retrieval of textual content
• Incremental updates (but the Event SPI is much more suitable for this purpose)
Cons:
• Focused around content - does not expose all the data (missing tags membership information, ...)
In some very specific cases, data is not available in a form easily consumable to build an analytic solution
– Example: getting the list of followers for a given object in the system
– Query the Connections databases directly (in these specific cases only)
– The database schema can change over time and is private
Key points
Leverage the Event SPI as much as possible
– Provides (most of) the data needed for any elaborate analytics solution
– Just let Connections push data to you! Easier, and performs well
“Fill the gaps” by pulling data from the Atom/Seedlist APIs
– Initial loading of relationship / content data
– Data not available through the Event SPI
One final warning:
– The analytic solution has access to private data through the Event SPI and the Atom/Seedlist APIs (with the admin role)
=> Ensure your solution is not leaking private data to unauthorized users
Analytics And Connections Data
The “Enterprise” Workflow
[Diagram: workflow stages: Data Sources, ETL, Data Prep, Analytics, Data Consumption]
Credit: Paco Nathan
The Analytics Data Service
[Diagram: components of the analytics data service: Hadoop/Zookeeper, Map/Reduce tools, Big Table DB, Graph Database, Graph Analytics, Web Server (node.js), data analytics service, UI service, identity service, Workflow coordinator, Stream Processing, pub/sub]
Frequently Heard Big Data Dimensions
A fuzzy definition:
– 4Vs: volume, velocity, variety, value
– Can’t fit or be processed on a single machine
– Data intensive vs. compute intensive
– Analytics focused
Big Data Aspects For Us To Consider
Connections data:
semi-structured, line formatted output, that works well with “a hadoop cluster” and graph
time and spatial aspects
de-normalized
combined with multiple data sources
calculations = data too
explored for insights, innovate with data
doesn’t ‘expire’, sticky
The difference between “BI” and “Analytics”
– Hadoop environments are designed to interpret the data at processing time
– Processing attributes chosen by the person processing the data
‘Simple’ Analytics Are Often Best
More data usually beats better algorithms
– LOTS of data. Simple algorithms are not a bad plan.
But you will probably always want to ‘sample’ for efficiency
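One common way to sample a stream of unknown length is reservoir sampling; the slide does not name a technique, so this Python sketch is just one option, with the function name chosen for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of any length."""
    rng = rng or random.Random()
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            sample.append(item)  # fill the reservoir first
        else:
            # Replace an existing element with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = item
    return sample


if __name__ == "__main__":
    events = ({"seq": n} for n in range(100000))
    print(reservoir_sample(events, 5, random.Random(42)))
```

It needs only one pass and constant memory, which suits an event stream that never ends.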
Credit: Anand Rajaraman, Netflix, http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
Handling The Data From Connections
Full Refresh
– Often called “bulk load”
Delta Updates
– Streaming via the SPI
What do you do with the data as it comes in?
– Files?
– Directly into stores?
– Directly into analytics?
A need for real time analytics?
Why A Property Graph In Analytics ?
A property graph has:
– key/value properties
– both vertices and edges can have any number of properties
– directed relationships
– (hint: this is not RDF)
Reference: https://github.com/tinkerpop/gremlin/wiki/Defining-a-Property-Graph
We want to answer questions like:
– Context around the event
– Cause and effect of an event
– Things related to an event
Property graphs are a very useful tool
– Data science part
– Production part
[Diagram: two vertices, one with Name: bob and one with Name: roger, joined by an edge labeled “calls”]
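A property graph is small enough to sketch directly. The Python class below is a toy in-memory version, not a graph database API; the class and method names are invented, and the direction of the “calls” edge (bob to roger) is an assumption, since the extracted slide does not show it.

```python
class PropertyGraph:
    """Toy property graph: vertices and directed edges carry key/value properties."""

    def __init__(self):
        self.vertices = {}  # vertex id -> properties
        self.edges = []     # (out id, label, in id, properties)

    def add_vertex(self, vertex_id, **props):
        self.vertices[vertex_id] = dict(props)

    def add_edge(self, out_id, label, in_id, **props):
        if out_id not in self.vertices or in_id not in self.vertices:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append((out_id, label, in_id, dict(props)))

    def out_neighbors(self, vertex_id, label=None):
        """Vertices reachable over outgoing edges, optionally filtered by label."""
        return [dst for src, lbl, dst, _ in self.edges
                if src == vertex_id and (label is None or lbl == label)]


if __name__ == "__main__":
    graph = PropertyGraph()
    graph.add_vertex(1, name="bob")
    graph.add_vertex(2, name="roger")
    graph.add_edge(1, "calls", 2, count=3)  # direction assumed
    print([graph.vertices[v]["name"] for v in graph.out_neighbors(1, "calls")])
```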
Graph Analytics: A Specific Example For Connections Data
em·i·nence
noun \ˈe-mə-nən(t)s\
: a condition of being well-known and successful
Source: Merriam-Webster Online
How might we use graph technology in our analytics service to calculate a person’s eminence?
Graph Analytics – A Glimpse At Eminence Calculations
[Diagram: graph pattern with vertices Person A, Person B, Status Update, and Comment, joined by edges labeled “creates”, “creates”, and “comments on”]
Look for this graph pattern, then count comments and weight by who commented, normalize… = an eminence score element
A real eminence score can have 13 or more measures just from Connections metadata alone.
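The "count comments, weight by who commented, normalize" recipe for one score element might look like the Python sketch below. The details are assumptions made for illustration: self-comments are ignored, unknown commenters get weight 1.0, and scores are normalized by the highest raw value. The function name and data shapes are invented.

```python
from collections import defaultdict


def eminence_element(status_updates, comments, commenter_weight=None):
    """Score authors by weighted comments received on their status updates.

    status_updates: {update_id: author}
    comments: iterable of (commenter, update_id)
    commenter_weight: optional {commenter: weight}, default weight 1.0
    """
    commenter_weight = commenter_weight or {}
    raw = defaultdict(float)
    for commenter, update_id in comments:
        author = status_updates.get(update_id)
        if author is None or author == commenter:
            continue  # skip unknown updates and self-comments (assumption)
        raw[author] += commenter_weight.get(commenter, 1.0)
    top = max(raw.values(), default=0.0)
    # Normalize to the range (0, 1] relative to the top scorer.
    return {person: value / top for person, value in raw.items()} if top else {}


if __name__ == "__main__":
    updates = {"s1": "alice", "s2": "bob"}
    comments = [("bob", "s1"), ("carol", "s1"), ("alice", "s2")]
    print(eminence_element(updates, comments, {"carol": 2.0}))
```

In a graph store the same pattern would be expressed as a traversal; the dictionary version here only shows the arithmetic.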
Visualizing Analytics: A Real Dashboard Example
Scores are fictionalized
Gradually Add More Data and Analytics For Deeper Insights
Finding potentially obese people…
Source: The Wall Street Journal
What other sources of data are there outside of Connections ?
What other data is coming in the Connections Event SPI? (hint: it can be more than just Connections data)
For us: CRM, Connections, Articles, E-mail, Other…
Summary: Find Business Value In Your Connections Data
From “transactions”/“social receipts” To insights
Effective use of Connections APIs
Key insights using Big Data Analytics on Connections Data
Engagement for better productivity and faster execution – at the personal, organizational and company-wide levels
Your insights are limited only by the data and your ability to process it
For More Information
Visit IBM’s Emerging Technology Page !
http://www.ibm.com/sna
http://www.ibm.com/engage
Stop by the Innovation center to see more
I’ll be there to answer your specific questions !
More information about the Connections APIs and SPIs in the IBM Connections product wiki under “Developing”
Access Connect Online to complete your session surveys using any:
– Web or mobile browser
– Connect Online kiosk onsite
Engage Online
SocialBiz User Group: socialbizug.org
– Join the epicenter of Notes and Collaboration user groups
Follow us on Twitter
– @IBMConnect and @IBMSocialBiz
LinkedIn: http://bit.ly/SBComm
– Participate in the IBM Social Business group on LinkedIn
Facebook: https://www.facebook.com/IBMSocialBiz
– Like IBM Social Business on Facebook
Social Business Insights blog: ibm.com/blogs/socialbusiness
– Read and engage with our bloggers
Acknowledgements and Disclaimers
© Copyright IBM Corporation 2014. All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com, Lotus, and IBM Connections are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.