Identity Data Mining: Building an Identity Data Mining Engine in PHP, by Jonathan LeBlanc (@jcleblanc)

Building an Identity Extraction Engine

DESCRIPTION

When it comes to building customized experiences for your users, the key is understanding who those users are and what they're interested in. The largest problem with the traditional method for doing this, a profile system, is that it is all user-curated content: users can enter whatever they want and be whoever they want. While this gives people the opportunity to portray themselves as they wish to the outside world, it is an unreliable identity source because it is based on perceived identity.

In this session we will take a practical look at constructing an identity entity extraction engine, using PHP, from web sources. This gives us a highly personalized, automated identity mechanism for driving customized experiences to users based on their derived personalities. We will explore concepts such as:

- Building a categorization profile of interests for users using web sources that the user interacts with.
- Using weighting mechanisms, like the Open Graph Protocol, to drive higher levels of entity relevance.
- Creating personality overlays between multiple users to surface new content sources.
- Dealing with users who are unknown to you by combining identity data capturing with HTML5 storage mechanisms.

Page 1: Building an Identity Extraction Engine

Identity Data Mining

Building an Identity Data Mining Engine in PHP

Jonathan LeBlanc (@jcleblanc)

Page 2: Building an Identity Extraction Engine

Premise

You can determine the personality profile of a person based on their browsing habits

Page 3: Building an Identity Extraction Engine

Technology was the Solution!

Page 4: Building an Identity Extraction Engine

Then I Read This…

Us & Them

The Science of Identity

By David Berreby

Page 5: Building an Identity Extraction Engine

The Different States of Knowledge

What a person knows

What a person knows they don’t know

What a person doesn’t know they don’t know

Page 6: Building an Identity Extraction Engine

Technology was NOT the Solution

Identity and discovery are

NOT a technology solution

Page 7: Building an Identity Extraction Engine

Our Subject Material

Page 8: Building an Identity Extraction Engine

Our Subject Material

HTML content is unstructured

There are some pretty bad web practices on the interwebz

You can’t trust that anything semantically valid will be present

Page 9: Building an Identity Extraction Engine

How We’ll Capture This Data

Start with base linguistics

Extend with available extras

Page 10: Building an Identity Extraction Engine

The Components

Page 11: Building an Identity Extraction Engine

The Basic Pieces

Page Data: Scrapey Scrapey

Keywords: Without all the fluff

Weighting: Word diets FTW

Page 12: Building an Identity Extraction Engine

Capture Raw Page Data

Semantic data on the web is sucktastic

Assume 5 year olds built the sites

Language is the key

Page 13: Building an Identity Extraction Engine

Extract Keywords

We now have a big jumble of words. Let’s extract

Why is “and” a top word? Stop words = sad panda

Page 14: Building an Identity Extraction Engine

Weight Keywords

All content is not created equal

Meta and headers and semantics oh my!

This is where we leech off the work of others

Page 15: Building an Identity Extraction Engine

Simple Extraction Engine

Page 16: Building an Identity Extraction Engine

Questions to Keep in Mind

Should I use regex to parse web content?

How do users interact with page content?

What key identifiers can be monitored to detect interest?

Page 17: Building an Identity Extraction Engine

Fetching the Data: The Request

The Simple Way

$html = file_get_contents('URL');

The Controlled Way

$c = curl_init('URL');

Page 18: Building an Identity Extraction Engine

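Fetching the Data: cURL

$req = curl_init($url);

//$header is a boolean: true includes the response headers in the output
$options = array(
    CURLOPT_URL            => $url,
    CURLOPT_HEADER         => $header,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_AUTOREFERER    => true,
    CURLOPT_TIMEOUT        => 15,
    CURLOPT_MAXREDIRS      => 10
);

curl_setopt_array($req, $options);

//run the request; returns the page as a string, or false on failure
$page_content = curl_exec($req);
curl_close($req);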

Page 19: Building an Identity Extraction Engine

//list of findable / replaceable string characters
$find = array('/\r/', '/\n/', '/\s\s+/');
$replace = array(' ', ' ', ' ');

//perform page content modification: drop script and style blocks
$mod_content = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $page_content);
$mod_content = preg_replace('#<style(.*?)>(.*?)</style>#is', '', $mod_content);

//strip remaining tags, normalize case and whitespace, split into words
$mod_content = strip_tags($mod_content);
$mod_content = strtolower($mod_content);
$mod_content = preg_replace($find, $replace, $mod_content);
$mod_content = trim($mod_content);
$mod_content = explode(' ', $mod_content);

natcasesort($mod_content);
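On the earlier question of whether regex should parse web content: one alternative is to let a DOM parser remove the script and style nodes instead. The sketch below is not from the slides. The function name `html_to_words` and the sample markup are made up for illustration, and it assumes PHP's DOM extension is available.

```php
<?php
// Hypothetical helper (not from the slides): turn raw HTML into a list of
// lowercase words using DOMDocument instead of regular expressions.
function html_to_words(string $html): array
{
    $dom = new DOMDocument();

    // real-world markup is rarely valid, so collect parse errors quietly
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // collect the nodes first, then remove them: the node list is live
    $remove = array();
    foreach (array('script', 'style') as $tag) {
        foreach ($dom->getElementsByTagName($tag) as $node) {
            $remove[] = $node;
        }
    }
    foreach ($remove as $node) {
        $node->parentNode->removeChild($node);
    }

    // read the remaining text and split on any run of whitespace
    $root = $dom->documentElement;
    $text = $root ? strtolower($root->textContent) : '';

    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// made-up sample page
$sample = '<html><head><style>p{color:red}</style></head>'
        . '<body><p>Hello World</p><script>var x = 1;</script></body></html>';
$words = html_to_words($sample);
print_r($words);
```

The trade-off is speed against robustness: the regex version is quick, while the parser copes with unclosed tags and attributes that contain angle brackets.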

Page 20: Building an Identity Extraction Engine

//set up list of stop words and the final found stopped list
$common_words = array('a', ..., 'zero');
$searched_words = array();

//extract list of keywords with number of occurrences
foreach($mod_content as $word) {
    $word = trim($word);
    if(strlen($word) > 2 && !in_array($word, $common_words)){
        //start the counter at zero on first sight to avoid an undefined index
        $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
    }
}

//sort by number of occurrences, highest first
arsort($searched_words, SORT_NUMERIC);
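The keyword step can be packaged as a function and tried on a short string. This is a minimal sketch: `extract_keywords`, the four-word stop list and the sample sentence are illustrative assumptions, and a real stop word list would be far longer.

```php
<?php
// Hypothetical helper: count the words in plain text that are longer than
// two characters and not in the stop word list.
function extract_keywords(string $text, array $stop_words): array
{
    // lowercase, then split on anything that is not a letter, digit or apostrophe
    $words = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $counts = array();
    foreach ($words as $word) {
        if (strlen($word) > 2 && !in_array($word, $stop_words, true)) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }

    // most frequent keywords first
    arsort($counts, SORT_NUMERIC);
    return $counts;
}

// tiny illustrative stop word list
$stop_words = array('and', 'the', 'for', 'with');
$keywords = extract_keywords('PHP and the web: scraping the web with PHP', $stop_words);
print_r($keywords);
```

With the stop words removed, "and" and "the" no longer compete with the words that actually describe the page.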

Page 21: Building an Identity Extraction Engine

Scraping Site Meta Data

//load scraped page data as a valid DOM document
$dom = new DOMDocument();
@$dom->loadHTML($page_content);

//scrape title
$title = $dom->getElementsByTagName("title");
$title = $title->item(0)->nodeValue;

Page 22: Building an Identity Extraction Engine

//loop through all found meta tags
$dataReturn = array();
$metas = $dom->getElementsByTagName("meta");
for ($i = 0; $i < $metas->length; $i++){
    $meta = $metas->item($i);
    if($meta->getAttribute("property")){
        //Open Graph tags carry their name in the "property" attribute
        if ($meta->getAttribute("property") == "og:description"){
            $dataReturn["description"] = $meta->getAttribute("content");
        }
    } else {
        if($meta->getAttribute("name") == "description"){
            $dataReturn["description"] = $meta->getAttribute("content");
        } else if($meta->getAttribute("name") == "keywords"){
            $dataReturn["keywords"] = $meta->getAttribute("content");
        }
    }
}
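The title and meta scraping can be combined into one function and run against a small page. This is a sketch under assumptions: the name `scrape_meta`, the sample markup, and the rule that `og:description` takes precedence over the plain description are illustrative choices, not taken from the slides.

```php
<?php
// Hypothetical helper: pull title, description and keywords out of raw HTML.
function scrape_meta(string $html): array
{
    $data = array('title' => null, 'description' => null, 'keywords' => null);

    $dom = new DOMDocument();
    // collect parse errors quietly instead of suppressing them with @
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $dom->getElementsByTagName('title');
    if ($titles->length > 0) {
        $data['title'] = trim($titles->item(0)->nodeValue);
    }

    $og_found = false;
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $property = strtolower($meta->getAttribute('property'));
        $name     = strtolower($meta->getAttribute('name'));
        $content  = $meta->getAttribute('content');

        if ($property === 'og:description') {
            // assumption: the Open Graph description always wins
            $data['description'] = $content;
            $og_found = true;
        } elseif ($name === 'description' && !$og_found) {
            $data['description'] = $content;
        } elseif ($name === 'keywords') {
            $data['keywords'] = $content;
        }
    }

    return $data;
}

// made-up sample page
$sample = '<html><head><title>Test Page</title>'
        . '<meta name="description" content="Plain">'
        . '<meta property="og:description" content="Graph">'
        . '<meta name="keywords" content="php, identity">'
        . '</head><body></body></html>';
$meta = scrape_meta($sample);
print_r($meta);
```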

Page 23: Building an Identity Extraction Engine

Extending the Engine

Page 24: Building an Identity Extraction Engine

Weighting Important Data

Tags you should care about: meta (include OG), title, description, h1+, header

Bonus points for adding in content location modifiers

Page 25: Building an Identity Extraction Engine

Weighting Important Tags

//our keyword weights
$weights = array("keywords" => "3.0",
                 "meta" => "2.0",
                 "header1" => "1.5",
                 "header2" => "1.2");

//add modifier here
if(strlen($word) > 2 && !in_array($word, $common_words)){
    $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
}
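One way the modifier could be applied is to add the weight of the source a word came from, rather than adding 1. The sketch below assumes the words have already been grouped by source. The `weight_keywords` function, the `body` baseline of 1.0 and the sample data are hypothetical, and the weights are stored as floats instead of strings.

```php
<?php
// weights from the slide as floats; "body" is an assumed baseline for plain page text
$weights = array(
    'keywords' => 3.0,
    'meta'     => 2.0,
    'header1'  => 1.5,
    'header2'  => 1.2,
    'body'     => 1.0,
);

// Hypothetical helper: $sources maps a source name to the words found there.
function weight_keywords(array $sources, array $weights, array $stop_words): array
{
    $scores = array();
    foreach ($sources as $source => $words) {
        // unknown sources fall back to the baseline weight
        $modifier = $weights[$source] ?? 1.0;
        foreach ($words as $word) {
            $word = strtolower(trim($word));
            if (strlen($word) > 2 && !in_array($word, $stop_words, true)) {
                $scores[$word] = ($scores[$word] ?? 0.0) + $modifier;
            }
        }
    }

    // highest weighted keywords first
    arsort($scores, SORT_NUMERIC);
    return $scores;
}

// made-up sample: words grouped by where they appeared on the page
$sources = array(
    'keywords' => array('php'),
    'header1'  => array('identity', 'php'),
    'body'     => array('the', 'identity', 'engine'),
);
$scores = weight_keywords($sources, $weights, array('the'));
print_r($scores);
```

A word that appears in the keywords meta tag and in an h1 ends up well ahead of one that only appears in body text.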

Page 26: Building an Identity Extraction Engine

Expanding to Phrases

2-3 adjacent words, making up a direct relevant callout

Seems easy right? Just like single words

Language gets wonky without stop words
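A sketch of phrase extraction that keeps stop words inside a phrase but drops any phrase that starts or ends with one. That rule is an assumption, one possible answer to the wonkiness, not something the slides prescribe. `extract_phrases` and the sample sentence are hypothetical.

```php
<?php
// Hypothetical helper: count phrases of $min to $max adjacent words.
// Stop words may sit in the middle of a phrase but not at either end.
function extract_phrases(array $tokens, array $stop_words, int $min = 2, int $max = 3): array
{
    $counts = array();
    $total = count($tokens);

    for ($size = $min; $size <= $max; $size++) {
        for ($i = 0; $i + $size <= $total; $i++) {
            $slice = array_slice($tokens, $i, $size);

            // skip phrases that begin or end with a stop word
            if (in_array($slice[0], $stop_words, true)
                || in_array($slice[$size - 1], $stop_words, true)) {
                continue;
            }

            $phrase = implode(' ', $slice);
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }

    // most frequent phrases first
    arsort($counts, SORT_NUMERIC);
    return $counts;
}

// made-up sample, already lowercased and split into words
$tokens = explode(' ', 'internet of things and the internet of things');
$phrases = extract_phrases($tokens, array('of', 'the', 'and'));
print_r($phrases);
```

Stripping stop words before building phrases would turn "internet of things" into "internet things", which is why the filter is applied to the phrase edges only.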

Page 27: Building an Identity Extraction Engine

Working with Unknown Users

The majority of users won’t be immediately targetable

Use HTML5 LocalStorage & Cookie backup
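LocalStorage lives in the browser, so only the cookie backup can be sketched in PHP. The example below assumes an anonymous visitor is tracked with a random ID. The cookie name `visitor_id`, the 32-character hex format and the one-year lifetime are illustrative choices, not from the slides.

```php
<?php
// Hypothetical helper: returns array(visitor id, whether it was newly created).
function resolve_visitor_id(array $cookies, string $cookie_name = 'visitor_id'): array
{
    $existing = $cookies[$cookie_name] ?? '';

    // accept only IDs in the format this sketch generates (32 hex characters)
    if (is_string($existing) && preg_match('/^[a-f0-9]{32}$/', $existing) === 1) {
        return array($existing, false);
    }

    // unknown visitor: create a fresh random ID
    return array(bin2hex(random_bytes(16)), true);
}

list($visitor_id, $is_new) = resolve_visitor_id($_COOKIE);

// only send the cookie when running under a web server, before any output
if ($is_new && PHP_SAPI !== 'cli' && !headers_sent()) {
    setcookie('visitor_id', $visitor_id, time() + 31536000, '/'); // keep for one year
}
```

The extracted keyword profile can then be stored against that ID until the visitor becomes a known user.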

Page 28: Building an Identity Extraction Engine

Adding in Time Interactions

Interaction with a site does not necessarily mean interest in it

Time needs to also include an interaction component

Gift buying seasons see interest variations
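The slides give no formula for this, so the following is purely an illustrative assumption of how time and interaction might be combined: time on page counts for nothing without at least one interaction, and both factors level off. The function name, the 300-second cap and the curve are all hypothetical, and seasonal variation is not modelled.

```php
<?php
// Hypothetical scoring function: returns a value from 0.0 (no interest signal)
// up to, but never reaching, 1.0.
function engagement_score(float $seconds_on_page, int $interactions, float $cap_seconds = 300.0): float
{
    // an idle tab accumulates time without showing interest
    if ($interactions <= 0) {
        return 0.0;
    }

    // time counts linearly up to the cap, then stops growing
    $time_factor = min($seconds_on_page, $cap_seconds) / $cap_seconds;

    // each additional click or scroll adds less than the one before it
    $interaction_factor = 1.0 - 1.0 / (1 + $interactions);

    return $time_factor * $interaction_factor;
}

$idle   = engagement_score(600.0, 0); // ten minutes, no interaction
$brief  = engagement_score(150.0, 1); // half the cap, one interaction
$active = engagement_score(300.0, 3); // full cap, three interactions
```

A score like this could scale the keyword weights for a page before they are merged into the user's profile.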

Page 29: Building an Identity Extraction Engine

Grouping Using Commonality

[Venn diagram: Interests of User A and Interests of User B, overlapping in Common Interests]
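The overlap can be sketched with PHP's array functions, assuming each user's profile is an array of keyword => score. The function name, the sample profiles and the choice to score a shared interest by the weaker of the two scores are illustrative assumptions.

```php
<?php
// Hypothetical helper: keywords both users share, strongest overlap first.
function common_interests(array $user_a, array $user_b): array
{
    $common = array();
    foreach (array_intersect_key($user_a, $user_b) as $keyword => $score_a) {
        // assumption: a shared interest is only as strong as its weaker side
        $common[$keyword] = min($score_a, $user_b[$keyword]);
    }

    arsort($common, SORT_NUMERIC);
    return $common;
}

// made-up sample profiles
$user_a = array('php' => 4.5, 'identity' => 2.5, 'hiking' => 1.0);
$user_b = array('php' => 2.0, 'hiking' => 3.0, 'cooking' => 5.0);

$common = common_interests($user_a, $user_b);

// interests user B has that user A lacks: candidates to surface to user A
$suggest_for_a = array_diff_key($user_b, $user_a);

print_r($common);
print_r($suggest_for_a);
```

When two users overlap strongly, what one has outside the overlap is a natural source of new content for the other.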

Page 30: Building an Identity Extraction Engine

Thank You!

Questions?

www.slideshare.com/jcleblanc