
Frequency of Word Combinations using Apriori Algorithm
B649 Term Project - Group 9

Hemanth Gokavarapu                    Santhosh Kumar Saminathan
[email protected]                     [email protected]

Introduction:

Data processing and data mining are becoming more popular and vital in the modern days of data-intensive computing. We execute the computations in parallel in order to increase speed and performance. In this scenario, we have to get the right data at the right time effectively. Our project concentrates on searching for the right combinations of words in a big data set to find the correct pairs. In this way we can get the correct combinations of words from a huge data set. Our approach uses the Apriori algorithm. The project is implemented using the MapReduce framework and the HBase database.

Motivation:

Frequently accessed itemsets are very useful in real-world applications for providing better solutions. For example, an e-commerce application is very interested in knowing what customers have bought frequently in the past. Similarly, frequently asked queries or questions are useful in many modern applications. In our project we calculate the frequency of exact combinations of words using the Apriori algorithm.

Approach:

The Apriori algorithm is used to find the association rules for a given transaction set. It finds the most frequent subsets by using association rule mining. It follows a bottom-up approach in which frequent subsets are extended one item at a time (the candidate generation step), and groups of candidates are tested against the data. The algorithm terminates when no further extensions are found. This approach is explained in the flow chart below. Counting the number of times each item occurs in the given set forms the initial set of frequent sets. From this set the candidate itemsets are generated for each combination, leading to the next level. For the newly formed set, the frequency is determined by checking it against the data. The following measures are considered when eliminating itemsets:

Confidence(A -> B) = (Number of tuples containing both A and B) / (Number of tuples containing A)
Support(A -> B) = (Number of tuples containing both A and B) / (Total number of tuples)
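The report itself contains no code, so as a minimal illustration of these two measures, the sketch below computes support and confidence for a word pair over a small in-memory list of transactions. The class name, method name and sample data are ours rather than the project's; in the actual project the counts would come from scanning the full data set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal illustration of the support and confidence measures used by Apriori.
// Transaction contents and all names here are illustrative only.
public class SupportConfidence {

    // Counts how many transactions contain every item in 'items'.
    static long countContaining(List<Set<String>> transactions, Set<String> items) {
        return transactions.stream().filter(t -> t.containsAll(items)).count();
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = Arrays.asList(
                new HashSet<>(Arrays.asList("data", "mining", "apriori")),
                new HashSet<>(Arrays.asList("data", "mining")),
                new HashSet<>(Arrays.asList("data", "cloud")),
                new HashSet<>(Arrays.asList("mining", "apriori")));

        Set<String> a = new HashSet<>(Arrays.asList("data"));
        Set<String> ab = new HashSet<>(Arrays.asList("data", "mining"));

        double support = (double) countContaining(transactions, ab) / transactions.size();
        double confidence = (double) countContaining(transactions, ab)
                / countContaining(transactions, a);

        // support(data -> mining) = 2/4 = 0.50, confidence(data -> mining) = 2/3
        System.out.printf("support=%.2f confidence=%.2f%n", support, confidence);
    }
}
```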


The next step is to generate the set of candidate items for the next level. This operation is repeated until the candidate set becomes empty; once the set is empty, the association rules are formed.
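The report does not show how this level-wise step is coded. The sketch below is one possible version, under our own naming: it joins frequent k-itemsets into (k+1)-candidates and prunes any candidate that has an infrequent k-subset, which is the standard Apriori join-and-prune idea rather than the project's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the Apriori level-wise step: join frequent k-itemsets to form
// (k+1)-candidates, then prune any candidate with an infrequent k-subset.
public class CandidateGeneration {

    static List<Set<String>> nextCandidates(List<Set<String>> frequentK) {
        List<Set<String>> candidates = new ArrayList<>();
        for (int i = 0; i < frequentK.size(); i++) {
            for (int j = i + 1; j < frequentK.size(); j++) {
                Set<String> union = new TreeSet<>(frequentK.get(i));
                union.addAll(frequentK.get(j));
                // Join step: keep only unions that are exactly one item larger.
                if (union.size() != frequentK.get(i).size() + 1 || candidates.contains(union)) {
                    continue;
                }
                // Prune step: every k-subset of the candidate must itself be frequent.
                boolean allSubsetsFrequent = true;
                for (String item : union) {
                    Set<String> subset = new TreeSet<>(union);
                    subset.remove(item);
                    if (!frequentK.contains(subset)) {
                        allSubsetsFrequent = false;
                        break;
                    }
                }
                if (allSubsetsFrequent) {
                    candidates.add(union);
                }
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // Frequent 1-itemsets; the next level contains their pairwise unions.
        List<Set<String>> level1 = new ArrayList<>();
        level1.add(new TreeSet<>(Arrays.asList("data")));
        level1.add(new TreeSet<>(Arrays.asList("mining")));
        level1.add(new TreeSet<>(Arrays.asList("apriori")));
        System.out.println(nextCandidates(level1));
    }
}
```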

[Flow chart: level-wise generation and testing of candidate itemsets]

Implementation:

In this project we use the MapReduce framework for parallel execution and HBase to store the results.

MapReduce:

MapReduce is a framework introduced by Google to perform large computations over large sets of data in a distributed environment, on a cluster or cloud platform. With MapReduce, the computation can be executed in parallel, which increases performance. It has Mapper and Reducer tasks. The Mapper partitions the input into sub-problems and assigns them to the worker nodes; the worker nodes process these sub-problems. The Reducer collects all the output sent by the worker nodes and forms the final output.
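The project's Hadoop sources are not included in the report; the following is a minimal sketch of a Mapper/Reducer pair, written against Hadoop's org.apache.hadoop.mapreduce API, that counts single-word frequencies, corresponding to the first Apriori level. The class names are ours, and the actual project code may differ.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a Mapper/Reducer pair that counts single-word frequencies,
// i.e. the first Apriori level (1-itemsets). Names are illustrative only.
public class FrequentItems {

    public static class FrequentItemsMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken().toLowerCase());
                context.write(word, ONE);
            }
        }
    }

    public static class FrequentItemsReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the partial counts; items below the support threshold would be
            // dropped here before moving to the next Apriori level.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```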


HBase:

HBase is a non-relational, open-source database modeled after Google's Bigtable. It runs on top of HDFS. In our project, every time we calculate the frequent sets we store the values in the HBase database (a minimal write sketch is shown at the end of this section).

We use three map tasks in our project, namely FrequentItemsMap, CandidateGenMap and AssociationRuleMap. FrequentItemsMap calculates the initial frequency of the items, CandidateGenMap calculates the candidate sets for the intermediate results, and AssociationRuleMap calculates the association rules. Similarly, we have the following reducers: FrequentItemsReduce, CandidateGenReduce and AssociationRuleReduce.

Time Schedule:

1 week  – Talking to the experts at FutureGrid to understand the problem
2 weeks – Survey of HBase and the Apriori algorithm
4 weeks – Kick-start and work on implementing the Apriori algorithm
2 weeks – Testing the code and getting the results
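The report does not show how the results are written to HBase. Below is a minimal sketch using the current HBase client API (which post-dates the original project), storing one frequent itemset and its count; the table name "frequent_sets", the column family "f" and the row-key layout are assumptions of ours, not taken from the report.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of storing one frequent itemset and its count in HBase.
// Table name, column family and row-key scheme are assumed, not from the report.
public class FrequentSetStore {

    public static void storeCount(String itemset, long count) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("frequent_sets"))) {
            // Row key: the itemset itself (e.g. "data mining"); one column holds the count.
            Put put = new Put(Bytes.toBytes(itemset));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("count"), Bytes.toBytes(count));
            table.put(put);
        }
    }
}
```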

Results:

We implemented the project using Hadoop and measured the computation time for 2, 4 and 6 mappers, obtaining results for both single-node and multi-node environments.

Using the results from this program, we developed a web interface: when a word is typed, it returns the corresponding combinations along with their frequencies.
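The report does not describe how the web interface queries the stored combinations. One possible lookup, assuming the table layout from the write sketch above, is a prefix scan over the row keys for the queried word; everything here, including the filter choice, is an assumption rather than the project's actual code.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the lookup a web interface might run: scan for all stored itemsets
// whose row key starts with the queried word and print their frequencies.
// The table layout matches the (assumed) write-side sketch above.
public class CombinationLookup {

    public static void printCombinations(String word) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("frequent_sets"))) {
            Scan scan = new Scan();
            scan.setFilter(new PrefixFilter(Bytes.toBytes(word)));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    String itemset = Bytes.toString(result.getRow());
                    long count = Bytes.toLong(
                            result.getValue(Bytes.toBytes("f"), Bytes.toBytes("count")));
                    System.out.println(itemset + " : " + count);
                }
            }
        }
    }
}
```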


Conclusion:

From the results we obtained, we conclude that the execution time is higher for a single node. As the number of mappers is increased in the multi-node environment, we see better performance in terms of time. When the data is extensively large, single-node execution takes more time and sometimes behaves erratically.

Acknowledgement:

We thank Professor Judy Qiu for helping us by clarifying our doubts and kindling our thoughts to come up with new ideas and implementations. We also thank the assistant instructor, Stephen, who helped us by giving details about the errors we encountered.

Future Work:

The project can be enhanced further in many ways. We have implemented it using Hadoop; the same project can also be implemented in Twister, and a performance analysis of both can be carried out to give a better view of each. This simple algorithm can be molded into many real-world applications that involve machine-learning techniques.