
iFocus: A Vertical Search Engine for Domain Knowledge Syndication

Ke Mao, University of Chinese Academy of Sciences

iFocus is a vertical search engine for knowledge syndication that can be used in a variety of domains. In this project we demonstrated its usefulness in the e-commerce field: we implemented a series of focused crawlers to collect product data from online shopping websites in China such as Taobao, TMall, and DangDang. This report is organized in the STAR format, i.e., Situation, Task, Action, Result.

    Situation

The initial motivation for developing a vertical search engine was to compete in the 1st China Software Cup Software Development Contest in 2012. One of the contest topics was vertical search, which matched my laboratory's strength in Internet software technologies. I teamed up with two other classmates from my laboratory, as we shared an interest in implementing such a search engine. Since all three of us had daily research assignments at the time, we did the development work in our spare time, mainly on weekends.

    Task

[Goals of the System] The overall objective of the system is to meet the requirements of vertical search for a specific domain, e.g., online shopping: the system must index the focused knowledge of that domain and present the requested knowledge to users in a friendly way via a web browser. This objective can be broken down into three goals: a flexible storage structure, user-friendly data acquisition and presentation, and high-performance data extraction.

Figure 1: The framework of the iFocus system

[Design of the System] The design of the system was agreed upon through a brainstorming and discussion process involving all three team members. To satisfy the above objectives, we employed NoSQL storage, which is schema-free and can accommodate incremental changes in the shape of the requested data. For user-friendly data acquisition, we used Apache Lucene to support functions such as fast indexing and ranked searching. We also planned to use a Learning to Rank algorithm to train the ranking model from user click logs. For data presentation, we adopted the MVC architecture. For high-performance data extraction, we utilized the Scrapy framework to implement focused web spiders. The framework of the system is shown in Figure 1.
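To illustrate the schema-free storage point: in a document store such as MongoDB, records with different attribute sets can live in the same collection, so new fields can be added incrementally without schema migrations. The field names below are hypothetical examples, not the project's actual schema; a real deployment would insert such dicts through a MongoDB driver.

```python
# Two product documents with different shapes; a schema-free store
# accepts both in the same collection without any migration step.
book = {
    "source": "DangDang",
    "title": "Learning Python",
    "price": 59.0,
    "isbn": "978-7-xxx",        # book-specific field (hypothetical)
}
shirt = {
    "source": "Taobao",
    "title": "Cotton T-Shirt",
    "price": 39.9,
    "sizes": ["S", "M", "L"],   # apparel-specific field (hypothetical)
}

collection = [book, shirt]      # in-memory stand-in for a MongoDB collection

# Queries simply skip documents that lack a field.
with_isbn = [d for d in collection if "isbn" in d]
print(len(with_isbn))  # 1
```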

[My Duty] My duty in this project was to realize the goal of high-performance data extraction, which required me to develop our web crawler module. The crawler should stay focused on pages of the given domain, extract the target information precisely, interact with the NoSQL database, be capable of recovering from an interrupted crawl, and meanwhile be efficient in crawling a large number of web pages.

linkedin.com/in/kemao • DeNA Requested Project Report

    Action

To implement the focused web spider described in the Task section, I first read several research articles on focused crawling to get a view of the state of the art from the academic perspective, and then investigated a few open-source frameworks that support crawler implementation to understand how crawling techniques are applied in industry.

After a few days of investigation, I chose the Scrapy framework to realize our focused spider, as it had already exhibited many good qualities in previous projects: it is well-documented, well-tested, fast, and extensible, yet built from simple Python code. Another reason I chose this framework is that it is 100% Python, and Python is a wise choice for web data processing. The architecture of the original Scrapy framework is illustrated in Figure 2 (without the Redis part).

    Figure 2: The framework of the distributed Scrapy spider

[The problems I ran into] The initial version of the focused crawler was implemented smoothly. I simply used depth-first search (DFS) to traverse the URL paths of the online shopping websites; the focus on product information was guaranteed by defining URL regular expressions and by checking the content relevance of the target page. The attributes of each item were identified through XPath, and the crawled items were processed in a pipeline and inserted into or updated in the MongoDB database.
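A minimal sketch of the URL-filtering half of that focus check, using Python's `re` module. The domain names and patterns below are hypothetical stand-ins, not the real sites' URL schemes; in the actual crawler such patterns would be supplied to Scrapy's link-extraction rules.

```python
import re

# Hypothetical product-page URL patterns, one per site; the real
# crawler used site-specific patterns in its crawl rules.
PRODUCT_URL_PATTERNS = [
    re.compile(r"item\.example-taobao\.com/item\.htm\?id=\d+"),
    re.compile(r"product\.example-dangdang\.com/\d+\.html"),
]

def is_product_url(url: str) -> bool:
    """Return True if the URL looks like a product detail page."""
    return any(p.search(url) for p in PRODUCT_URL_PATTERNS)

print(is_product_url("http://item.example-taobao.com/item.htm?id=123456"))  # True
print(is_product_url("http://www.example-taobao.com/help/faq.html"))        # False
```

URL filtering alone is cheap but coarse, which is why the crawler paired it with a content-relevance check on the fetched page.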

However, the crawler was not efficient enough for crawling a large number of pages across the target websites: the average speed was only about 187 pages/minute. This speed was not acceptable, as it would have taken us months to crawl all the products for the given domain websites.
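To put that figure in perspective, a quick back-of-the-envelope calculation (the 30-million-page corpus size is a hypothetical illustration, not a number measured in the project):

```python
pages_per_minute = 187
pages_per_day = pages_per_minute * 60 * 24   # ~269k pages/day

total_pages = 30_000_000  # hypothetical size of the product-page corpus
days_needed = total_pages / pages_per_day

print(pages_per_day)        # 269280
print(round(days_needed))   # 111 -> roughly 3.7 months at the old speed
```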

[The solution I came up with] To tackle the efficiency issue, I first dived into it and analysed the root cause of the low speed. I found that the DFS traversal strategy was one key factor, and that single-threaded crawling was another point that could be improved. To overcome these problems, I tried and experimented with a variety of solutions. During this process, I kept communicating with my team members to hear their opinions, and I also searched the Internet to acquire the knowledge I needed. Finally, I managed to speed up the focused crawler significantly in the following two ways:

1. Strategic Spider Group: Instead of DFS, I first implemented a BFS crawler, but the improvement was small and the memory usage of BFS was not affordable. I then came up with heuristic rules for optimizing the URL traversal order: for example, by analysing the anchor text of the URLs, proper priorities can be assigned to the queued URLs. Another good heuristic was to follow the menu structure of the websites, which leads directly to the target product pages. Moreover, instead of using one fixed spider, I implemented a group of strategic spiders to suit different application contexts.
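A minimal sketch of the anchor-text heuristic, assuming a keyword-based score (the keywords, weights, and URLs below are illustrative, not the project's actual rules): links whose anchor text suggests a product or category page are popped from the frontier first.

```python
import heapq

# Illustrative keyword priorities; the real heuristics were site-specific.
PRIORITY_KEYWORDS = {"product": 0, "category": 1}  # lower = crawled earlier
DEFAULT_PRIORITY = 9

def priority(anchor_text: str) -> int:
    """Score a link by its anchor text; smaller scores are crawled first."""
    text = anchor_text.lower()
    for keyword, prio in PRIORITY_KEYWORDS.items():
        if keyword in text:
            return prio
    return DEFAULT_PRIORITY

frontier = []  # a heap of (priority, url) pairs

def push(url: str, anchor_text: str) -> None:
    heapq.heappush(frontier, (priority(anchor_text), url))

push("http://shop.example.com/about", "About us")
push("http://shop.example.com/item/42", "Product detail")
push("http://shop.example.com/books", "Books category")

# URLs are popped most-promising first.
print(heapq.heappop(frontier)[1])  # http://shop.example.com/item/42
```

A priority queue like this keeps the frontier's memory profile closer to DFS while steering the crawl toward product pages the way BFS-with-priorities would.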

2. Redis + MongoDB: As shown in Figure 2, I modified the original Scrapy framework by utilizing Redis, which made the spiders distributed. Several nodes can run different strategic spiders while sharing the Redis-backed public scheduler. I stored the URL queue in Redis and kept the item data in MongoDB.
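A minimal sketch of the shared-scheduler idea: a seen-set for deduplication plus a FIFO request queue. The tiny in-memory client below is a stand-in so the sketch is self-contained; a real deployment would pass a `redis.Redis` client instead, whose `sadd`, `rpush`, and `lpop` commands have the same semantics (modulo bytes vs. str return values), letting any number of spider processes share one frontier.

```python
class FakeRedis:
    """In-memory stand-in for a Redis client (sadd/rpush/lpop subset)."""
    def __init__(self):
        self.sets, self.lists = {}, {}
    def sadd(self, key, value):
        s = self.sets.setdefault(key, set())
        if value in s:
            return 0
        s.add(value)
        return 1
    def rpush(self, key, value):
        self.lists.setdefault(key, []).append(value)
    def lpop(self, key):
        q = self.lists.get(key)
        return q.pop(0) if q else None

class SharedScheduler:
    """Dedup + FIFO frontier shared by all spider nodes via Redis."""
    def __init__(self, client, name="ifocus"):
        self.client = client
        self.seen_key = name + ":seen"
        self.queue_key = name + ":queue"
    def enqueue(self, url):
        # sadd returns 1 only the first time a URL is seen,
        # so duplicates submitted by any node are dropped here.
        if self.client.sadd(self.seen_key, url):
            self.client.rpush(self.queue_key, url)
    def next_request(self):
        return self.client.lpop(self.queue_key)

sched = SharedScheduler(FakeRedis())
sched.enqueue("http://shop.example.com/item/1")
sched.enqueue("http://shop.example.com/item/1")  # duplicate, ignored
sched.enqueue("http://shop.example.com/item/2")
print(sched.next_request())  # http://shop.example.com/item/1
print(sched.next_request())  # http://shop.example.com/item/2
print(sched.next_request())  # None
```

Because both the seen-set and the queue live in one external store, adding a crawler node is just starting another process pointed at the same Redis instance.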

    Figure 3: Performance Comparison of Spiders

    Result

The performance comparison of the different versions of the focused spiders is presented in Figure 3. After two months of development in our spare time, and through team cooperation, our system was successfully realized. Our iFocus system won the first prize of the China Software Cup Software Development Contest (ranked 2nd among 1,700 teams nationwide).


Figure 4: iFocus screenshots
