December 2014

TECHNICAL WHITE PAPER | TLD Suggestion System

TLD Suggestion System
A probabilistic model for the TLD suggestion

Overview
The TLD/GTLD suggestion system aims to offer a decision-making tool. This tool is intended to give domain name service clients a relevant classification of TLDs (top-level domains) based on qualitative and quantitative indicators (such as location, language and buying frequency). With the arrival of the new GTLD (generic top-level domain) extensions, Gandi's offer continues to grow and become richer. The TLD/GTLD suggestion system simplifies and clarifies the choice of extensions that are popular among customers with a similar profile. It also allows better targeting of which extensions to buy. In this way, clients can obtain a list of relevant TLDs among those that Gandi offers, depending on their language and geographical location.

Fig.1 describes the overall approach of our suggestion system, a learning system comprised of two main phases:
• A learning phase, which estimates the distribution of a given TLD, its frequency, and its correlation with other TLDs. This estimation is done periodically, based on anonymous purchase data previously obtained during the prediction phase. The estimation runs once a day and is performed against the purchase data of the past 6 months.
• A prediction phase, which determines an order of relevance for a given TLD with observation vector X, obtained from the preprocessing of a given query. The relevance scores of the TLDs are sorted in descending order and presented to the customer. The observation X is added to the database of a given TLD if the client purchases this TLD.

The estimations in the learning phase are based on three probabilistic models: a model of the TLD distribution in an N-dimensional space (language, location, etc.) for a given TLD; a model of the TLD buying frequency weighted by a boost factor; and a model of the inter-TLD correlation, which determines a distance margin between two TLDs. The TLD classification in the prediction phase is based on the score of each TLD, obtained through a linear product of the log-likelihood of the TLD distribution, the weighted frequency and the correlation coefficient.

Fig 1. TLD suggestion system

TLD distribution modelling
The TLD distribution is modelled as a "Gaussian Mixture Model" (GMM [1]). A GMM is a probabilistic model for evaluating random variables with multiple Gaussian components. This model has become a popular tool, used successfully in a growing number of disciplines such as astronomy, genetics, biology and economics. Its popularity comes essentially from its ability to adapt to the abundance or scarcity, as well as the diversity, of the elements in the dataset to analyse. Indeed, this model is known for its ability to effectively reconstruct missing data in a dataset.

In our parametric modeling, each TLD is considered to have I observations $(X_1, X_2, \ldots, X_i, \ldots, X_I)$. We consider that our observations (dataset) follow a probability distribution whose density function is normal. Each observation $X_i$ of the TLD dataset is a vector of n dimensions $(x_1, x_2, \ldots, x_n)$ belonging to $\mathbb{R}^n$ (i.e. n continuous variables). Each of these observations belongs to one of the components $C_k \in (C_1, C_2, \ldots, C_K)$, each one following a normal distribution with a mean $\mu_k \in \mathbb{R}^n$ and a covariance matrix $\Sigma_k \in \mathbb{R}^{n \times n}$.

The probability density function of a K-component mixture is expressed as a linear combination of the K density functions of the normal distribution, given by:

$$f(X, G) = \sum_{k=1}^{K} \pi_k \, f(X \mid \mu_k, \Sigma_k)$$

where $\pi_1, \ldots, \pi_K \in \mathbb{R}$ are the proportions of the K components of the mixture (with $\pi_k > 0$ and $\sum_{k=1}^{K} \pi_k = 1$), $G = (\pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K)$ is the global parameter of the mixture, and the density function of the normal distribution is given by:

$$f(X \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{n/2} \, |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2} (X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k)\right)$$

[1] GMM (Gaussian Mixture Model): Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B 39 (1): 1–38.

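The mixture density described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it relies on NumPy, the function names `gaussian_density` and `mixture_density` are hypothetical, and the two-component parameters are made-up values, not estimates from Gandi's data.

```python
import numpy as np


def gaussian_density(x, mu, sigma):
    """Density of the multivariate normal N(mu, sigma) at point x."""
    n = mu.shape[0]
    diff = x - mu
    # Solve sigma * y = diff rather than inverting sigma explicitly.
    mahalanobis = diff @ np.linalg.solve(sigma, diff)
    norm = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(sigma))
    return np.exp(-0.5 * mahalanobis) / norm


def mixture_density(x, weights, means, covariances):
    """Density of a K-component Gaussian mixture at point x."""
    return sum(
        w * gaussian_density(x, mu, sigma)
        for w, mu, sigma in zip(weights, means, covariances)
    )


# Hypothetical 2-component mixture over (longitude, latitude).
weights = np.array([0.6, 0.4])
means = [np.array([2.35, 48.85]), np.array([-74.0, 40.7])]
covariances = [np.eye(2) * 4.0, np.eye(2) * 9.0]

density = mixture_density(np.array([2.0, 48.0]), weights, means, covariances)
```

Solving the linear system instead of forming the inverse covariance is a numerical choice made for this sketch; the formula itself is the one given in the text.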



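As the global parameter $G = (\pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K)$ is unknown, the goal is to estimate it from a set of observations $(X_1, X_2, \ldots, X_I)$. To do that, we maximize the observed log-likelihood (calculated on the observations $X_t$), given by:

$$L(X, G) = \sum_{i=1}^{I} \log \sum_{k=1}^{K} \pi_k \, f(X_i \mid \mu_k, \Sigma_k)$$

However, there is no analytical solution that maximizes the log-likelihood. We therefore use the EM (Expectation-Maximization) [1] algorithm, which offers good results when there are missing observations. The EM algorithm is simple to implement when the log-likelihood maximization is done on the observations $(X_1, X_2, \ldots, X_I)$, which also represent the missing observations. Starting with an arbitrary global parameter $G^-$, the algorithm iterates over the following steps:

Step E (mathematical expectation):
Calculate the mathematical expectation, using the current parameter $G^-$:

$$\gamma_{ik} = \frac{\pi_k \, f(X_i \mid \mu_k, \Sigma_k)}{\sum_{l=1}^{K} \pi_l \, f(X_i \mid \mu_l, \Sigma_l)}$$

Step M (maximization): parameter $G^+$
Calculate the ratio:

$$\pi_k^+ = \frac{1}{I} \sum_{i=1}^{I} \gamma_{ik}$$

Calculate the average (vector):

$$\mu_k^+ = \frac{\sum_{i=1}^{I} \gamma_{ik} X_i}{\sum_{i=1}^{I} \gamma_{ik}}$$

Calculate the covariance (matrix):

$$\Sigma_k^+ = \frac{\sum_{i=1}^{I} \gamma_{ik} (X_i - \mu_k^+)(X_i - \mu_k^+)^T}{\sum_{i=1}^{I} \gamma_{ik}}$$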

The EM algorithm stops after a predefined number of iterations $N_{iter}$ or when the log-likelihood, which grows at every iteration, becomes stationary. Fig.2 shows an example set of observations $X \in \mathbb{R}^2$, $X = (x_1, x_2)$, from which a global parameter G is to be estimated. We can imagine, for example, that X represents the geographical coordinates of the IP addresses of I clients of the .com TLD ($x_1$ = longitude, $x_2$ = latitude). The number of components K of the mixture can be predefined, or calculated with a K-means type of clustering mechanism or with a Dirichlet process mixture (DPM). In our example, K is fixed to 3 components. With the EM algorithm we can estimate the parameters of the three components $(\pi_1, \mu_1, \Sigma_1)$, $(\pi_2, \mu_2, \Sigma_2)$ and $(\pi_3, \mu_3, \Sigma_3)$ of the global parameter G. Fig.3 shows the membership of the X observations in the three components. Once the three components are estimated, we can then calculate the probability density of missing observations. Fig.4 shows the probability density of the Gaussian mixture G on the $\mathbb{R}^2$ space.
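The E and M steps and the stopping rule can be sketched as follows. This is a minimal illustration, not the production implementation: it assumes NumPy, the names `log_gaussian` and `em_gmm` are hypothetical, the starting parameter (uniform proportions, K random observations as means, the global covariance) is one arbitrary choice among many, and the small ridge added to each covariance is a numerical safeguard that is not part of the algorithm described in the text. The observations are synthetic.

```python
import numpy as np


def log_gaussian(X, mu, sigma):
    """Log-density of N(mu, sigma) for every row of X (shape I x n)."""
    n = X.shape[1]
    diff = X - mu
    solved = np.linalg.solve(sigma, diff.T).T  # sigma^{-1} (X_i - mu)
    mahalanobis = np.sum(diff * solved, axis=1)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + mahalanobis)


def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0, ridge=1e-6):
    """Estimate (proportions, means, covariances) of a K-component mixture."""
    rng = np.random.default_rng(seed)
    I, n = X.shape
    # Arbitrary starting parameter G-.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(I, size=K, replace=False)].copy()
    base_cov = np.atleast_2d(np.cov(X.T)) + ridge * np.eye(n)
    sigma = np.array([base_cov.copy() for _ in range(K)])
    history = []  # observed log-likelihood at each iteration
    for _ in range(n_iter):
        # Step E: membership of each observation in each component.
        log_w = np.stack(
            [np.log(pi[k]) + log_gaussian(X, mu[k], sigma[k]) for k in range(K)],
            axis=1,
        )
        log_norm = np.logaddexp.reduce(log_w, axis=1)
        gamma = np.exp(log_w - log_norm[:, None])
        history.append(log_norm.sum())
        # Stop when the log-likelihood becomes stationary.
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
        # Step M: new proportions, means and covariances (G+).
        Nk = gamma.sum(axis=0)
        pi = Nk / I
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + ridge * np.eye(n)
    return pi, mu, sigma, history


# Synthetic 2-D observations drawn around three hypothetical centres.
rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centres])
pi, mu, sigma, history = em_gmm(X, K=3)
```

The responsibilities are computed in log space to avoid underflow when an observation lies far from every component. As with any EM run, the result depends on the starting parameter, so several restarts may be needed in practice.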

Fig 3. Distribution of observations on the 3 components

Fig 4. Probability density of a Gaussian mixture with 3 components

In our suggestion system, which has a number T of TLD extensions, the distribution of each extension t is modeled as a Gaussian mixture $G_t = ((\pi_1^t, \mu_1^t, \Sigma_1^t), \ldots, (\pi_K^t, \mu_K^t, \Sigma_K^t))$ of K components. The global parameter $G_t$ is estimated with the EM algorithm, based on the observations $X_t = (X_1^t, X_2^t, \ldots, X_i^t, \ldots, X_I^t)$. Each time an extension t is bought, an observation is added to the set $X_t$.

An observation is a vector of n dimensions $(x_1, x_2, \ldots, x_n)$ from an $\mathbb{R}^n$ space. The dimension of our vector depends on the number of search criteria used in our suggestion system. In our study, three search criteria are defined (leaving open opportunities for improvement with the addition of new relevant criteria):
• Location coordinates of the IP address of the client: $x_1$ = longitude, $x_2$ = latitude.
• The language used by the client browser, represented in 2 dimensions: $x_3$, $x_4$.
• Type of client (individual, company, …), if the client is logged in, represented in 2 dimensions: $x_5$, $x_6$.

Note that a large vector dimension increases the learning time (estimation time) of the global parameter $G_t$, as well as the time needed to calculate the probability density of a missing observation.

Fig 2. Representation of observations X(x1, x2)


Fig.5 and Fig.6 respectively show the probability density of the Gaussian mixture of the .com TLD for the English language and of the .fr TLD for the French language. The parameters $G_{.com}$ and $G_{.fr}$ are estimated from observations of the .com and .fr TLD purchases over 6 months.

Fig 5. Probability density of a Gaussian mixture of the .com TLD for the English language

Fig 6. Probability density of a Gaussian mixture of the .fr TLD for the French language

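TLD frequency modelling
The TLD frequency is modeled as the weighted purchase frequency of a TLD t over a period of time $\Delta t$, given by:

$$\check{f}_t = \frac{w_t \, f_t}{\sum_{t'=1}^{T} w_{t'} \, f_{t'}}$$

where the purchase frequency of a TLD t is itself given by:

$$f_t = \frac{\sum_{\delta \in \Delta t} a_t(\delta)}{\sum_{t'=1}^{T} \sum_{\delta \in \Delta t} a_{t'}(\delta)}, \qquad \sum_{t=1}^{T} f_t = 1,$$

with $a_t(\delta)$ the number of purchases of TLD t at time $\delta$ within the period. $W = \{w_1, w_2, \ldots, w_T\}$ is a vector of positive weights corresponding to each TLD. The weights can be used as boost constants for a given TLD, or modeled as a linear function, given by:

$$w_t = \alpha \, c_t + b_t$$

where $c_t$ is a criterion of interest, $b_t$ is the boost for TLD t, and $\alpha$ is the weighting of the criterion on the boost.

Fig.7 shows the frequency with a time period $\Delta t$ of 15 days. We can see the impact of a weight $w_{.eu} = 5$ for the .eu TLD on the other TLD frequencies.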

Fig 7. TLD frequency
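The effect of a boost weight on the frequencies can be illustrated with a short sketch. It assumes that the weighted frequency is renormalized so that the values still sum to 1, which is consistent with a boost on one TLD lowering the others; the function names and the purchase counts are hypothetical.

```python
def purchase_frequencies(purchases):
    """Share of purchases per TLD over the period (values sum to 1)."""
    total = sum(purchases.values())
    return {tld: count / total for tld, count in purchases.items()}


def weighted_frequencies(frequencies, weights):
    """Apply per-TLD boost weights (default 1.0), then renormalize."""
    boosted = {tld: weights.get(tld, 1.0) * f for tld, f in frequencies.items()}
    norm = sum(boosted.values())
    return {tld: value / norm for tld, value in boosted.items()}


# Hypothetical purchase counts over a 15-day period.
purchases = {".com": 500, ".fr": 300, ".eu": 200}
f = purchase_frequencies(purchases)

# Boost .eu with a weight of 5; every other TLD keeps a weight of 1.
boosted = weighted_frequencies(f, {".eu": 5.0})
```

With these made-up counts, the boost raises the share of .eu and lowers the shares of .com and .fr, because the weighted values are divided by their new total.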

TLD correlation modelling
The third part of our suggestion tool is the modeling of the correlation between extensions. Some correlations are obvious: the semantic meaning of some GTLDs is the same, which makes them virtually identical. For example, the GTLDs "photo", "photos", "pic" and "photography" all have equivalent semantic meaning. Other correlations are less obvious and must be obtained through analysis of the purchase data and of the percentage of TLD purchases made in pairs. To model this correlation as a mathematical function, we first modeled a distance metric between TLDs, given by:

$$d_{tt'} = s_{tt'} + (1 - p_{tt'})$$


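where $s_{tt'}$ ($s_{tt'} \geq \epsilon$, with $\epsilon$ a fixed power of 10) represents a semantic distance between the TLD $t$ and the TLD $t'$, for example the distance between two of the photo-related GTLDs mentioned above, and $p_{tt'}$ is the simultaneous purchase frequency of the pair of TLDs $t$ and $t'$.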

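The correlation coefficient is then modeled as the probability density of the exponential distribution, with rate parameter $\lambda$, applied to the distance:

$$c_{tt'} = \lambda \, e^{-\lambda \, d_{tt'}}$$

We thus obtain a matrix of correlation coefficients $C \in \mathbb{R}^{T \times T}$ that we have to process.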
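The distance and the correlation coefficient can be sketched as below. Several points are assumptions made for illustration only: the distance is taken as the semantic distance plus one minus the simultaneous purchase frequency, the exponential density uses a rate of 1.0, and the semantic distances and pair frequencies are invented values. The function names are hypothetical.

```python
import math


def distance(semantic_distance, pair_frequency):
    """Distance between two TLDs: small when the meaning is close
    and when the two TLDs are often bought together."""
    return semantic_distance + (1.0 - pair_frequency)


def correlation(d, rate=1.0):
    """Exponential probability density evaluated at distance d."""
    return rate * math.exp(-rate * d)


tlds = [".photo", ".photos", ".com"]

# Hypothetical semantic distances and simultaneous purchase frequencies.
semantic = {(".photo", ".photos"): 0.01, (".photo", ".com"): 1.0, (".photos", ".com"): 1.0}
together = {(".photo", ".photos"): 0.30, (".photo", ".com"): 0.05, (".photos", ".com"): 0.04}

# Symmetric T x T matrix of correlation coefficients (diagonal left at 0.0 here).
C = {a: {b: 0.0 for b in tlds} for a in tlds}
for (a, b), s in semantic.items():
    c = correlation(distance(s, together[(a, b)]))
    C[a][b] = c
    C[b][a] = c
```

In this example the pair (.photo, .photos) receives a higher coefficient than either pair involving .com, since its distance is smaller on both terms.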

Fig 8. TLD suggestion system with the three models: distribution, frequency and correlation

Validation
An analysis of the results (Fig.9) of the sales positions of TLDs indicates that:
• More than 91% of purchases are made from the first 30 positions listed.
• Purchases made from the system ABC (February) are well positioned in the suggestion system.
• For each added criterion, a comparison is made to examine the global impact on the suggestion system. The main goal is to achieve more than 95% of purchases within the first 10 positions.

The experimental results, obtained using a model, showed us two things: the reactivity of our system in processing queries (response time of 57 ms for 20 simultaneous queries), and its efficiency in determining a TLD relevance order based on qualitative and quantitative indicators (location, language, purchase frequency, etc.).

Summary
Our suggestion system offers the following advantages. We can:
• Efficiently suggest relevant TLDs for clients based on specific criteria.
• Modify the various search criteria (suggestion / learning) without changing the current model and without using other systems.
• Have response time be independent of the number of TLDs queried. The calculation of scores from the suggestion is linear in terms of TLDs and in terms of number of criteria.
• Have low dependency between TLDs through the learning phases, so that TLDs can be added continuously without impacting calculations already made.

Fig 9. TLD position: Suggest vs ABC