CSCI 311_V3.docx

  • 8/10/2019 CSCI 311_V3.docx

    CSCI 311 Software Process Management

    Building a Distributed Search Engine

    Submitted By

    No | Name | UOW ID | Contributions | Signature
    1 | Wong Jun Hao Roy | 4639327 | Contributed |
    2 | Yee Mon Min Aung Ingrid | 4639017 | Contributed |
    3 | Ho Yee Ting | 4639066 | Contributed |
    4 | Goh Ming Ming | 4395220 | Contributed |
    5 | Tan Wei Yao | 4395529 | Contributed |

    Literature Review

    Introduction

    Search engines may be one of the core reasons the Internet is so successful
    and popular today. A search engine looks easy to use, and it is, but it is
    a lot more sophisticated than it appears. In general, a search tool
    consists of an Interface, a Crawler, an Index and a Search component.

    The Interface

    The interface usually looks like any other web page and is made up of a
    text box for phrases or keywords, plus other forms of input that help
    fine-tune what the user wants the search engine to look for. It could also
    take the form of an application, such as a desktop search engine that
    crawls and searches the hard drives of a computer.

    Crawler

    The crawler is the essence of every search engine and is what makes
    searching the Internet possible. Crawlers are programs that crawl through
    the Internet just like a spider crawling on its web. They constantly look
    for new web pages that they are not yet aware of and that have not been
    included in their index. This is usually done by searching through pages
    they already know, looking for new links. When a crawler finds a new link,
    it follows the link and collects the data it needs for the index. The
    crawler then returns and adds the new page to the index. As a result, the
    search engine grows automatically.
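
The link-following loop described above can be sketched as a breadth-first traversal, which is the crawling method this report names in its conclusion. This is a minimal sketch and not the project's crawler: the project targets C#, Python is used here only for brevity, and `TOY_WEB`, `crawl` and `toy_fetch_links` are hypothetical names. The "web" is an in-memory dictionary, so no pages are actually downloaded.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would download and parse pages instead of reading this dict.
TOY_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/c": [],
}


def crawl(seed_urls, fetch_links):
    """Breadth-first crawl: return URLs in the order they were added to the index."""
    index = []                 # pages the crawler has collected, in discovery order
    seen = set(seed_urls)      # pages the crawler is already aware of
    queue = deque(seed_urls)   # pages waiting to be visited
    while queue:
        url = queue.popleft()
        index.append(url)
        for link in fetch_links(url):
            if link not in seen:   # a new link: remember it and follow it later
                seen.add(link)
                queue.append(link)
    return index


def toy_fetch_links(url):
    """Look up the outgoing links of a page in the toy web (assumption)."""
    return TOY_WEB.get(url, [])


if __name__ == "__main__":
    print(crawl(["http://example.org/"], toy_fetch_links))
```

The `seen` set is what keeps the crawler from revisiting pages it is already aware of, so the loop ends once no new links turn up.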

    Index

    An index is a database that works behind the search page. Each index
    covers a different portion of the Internet. Some sites have huge indexes
    and some have smaller ones. Regardless of size, each has its own pros and
    cons.

    System Overview

    The system is designed and implemented by retaining the original
    architecture. At the top level, the system consists of crawler, searcher,
    and database components. Figure 1 shows the overall design of the system,
    Figure 2 shows the system design in more detail, and Figures 3 and 4 show
    the structures of the crawler and searcher classes respectively. The
    following subsections describe the modifications to the original system.

    Figure 1: System Design (Overview)

    Improve Reliability of the Crawling Process

    Web crawling is the process of collecting all the web pages that are of
    interest to the search engine. It is usually a challenging task for a
    general search engine [1]. Web crawling is much easier for a site-specific
    search engine, because its developers usually have access to the web pages
    of the web site. In this project, I downloaded about 12,000 of the most
    frequently accessed web pages from the ICS web site as the main data set
    used by the ICS web search engine. Each web page is assigned a unique id
    (called the document id in this report). The mapping between the unique id
    and the URL of the web page is created and stored in a text file in the
    crawling phase.
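
As an illustration of the id-to-URL mapping, the sketch below assigns document ids and round-trips them through a text file. The report does not give the file layout, so the one-line-per-page `id<TAB>url` format, the function names and the example URLs are all assumptions.

```python
import io


def assign_document_ids(urls):
    """Give each distinct URL a unique integer document id, starting at 1."""
    mapping = {}
    for url in urls:
        if url not in mapping:
            mapping[url] = len(mapping) + 1
    return mapping


def write_mapping(mapping, handle):
    """Write one 'id<TAB>url' line per page (the file layout is an assumption)."""
    for url, doc_id in mapping.items():
        handle.write(f"{doc_id}\t{url}\n")


def read_mapping(handle):
    """Read the text file back into a {doc_id: url} dictionary."""
    id_to_url = {}
    for line in handle:
        doc_id, url = line.rstrip("\n").split("\t", 1)
        id_to_url[int(doc_id)] = url
    return id_to_url


if __name__ == "__main__":
    pages = ["http://example.org/index.html", "http://example.org/about.html"]
    buffer = io.StringIO()          # stands in for the text file on disk
    write_mapping(assign_document_ids(pages), buffer)
    buffer.seek(0)
    print(read_mapping(buffer))
```

The search phase can then turn a document id from the index back into the URL shown in the hit list.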

    Web indexing is one of the most complicated and critical processes in
    building a search engine. It includes the following steps:

    Web parsing: In this step each web page is parsed into pure text without
    HTML tags, and the title of the web page is extracted. The pure text of
    each web page is used as the document to match against the user's query in
    the search phase. The title of each web page is used as the description of
    the web link in the hit list returned to the user.

    Inverted indexing: An inverted index is a mapping from a keyword to the
    documents in which it appears. The correspondence between a particular
    keyword and a document is called a posting. In this step a posting data
    structure that records all the appearances of the keywords in the web
    pages is created. For each keyword, the posting includes the following
    information: total frequency, document frequency, the frequency within
    each document in which the keyword appears, and the document id. The
    document ids are sorted in descending order of the frequency of the
    keyword in the document, so that documents with more occurrences of the
    keyword are retrieved first. In order to normalize the similarity
    according to the length of each document, the document length (the number
    of keywords the document has) is also counted and stored in the document
    length file at this step.
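
The two indexing steps above can be sketched together as follows. This is an illustrative sketch under stated assumptions rather than the project's implementation: `PageParser`, `parse_page`, `tokenize` and `build_index` are hypothetical names, keywords are taken to be lower-cased alphanumeric runs, and ties in frequency are broken by document id. It uses only Python's standard `html.parser` module.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser
import re


class PageParser(HTMLParser):
    """Collect the <title> text and the visible text of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0      # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0:
            self.text_parts.append(data)


def parse_page(html):
    """Return (title, plain_text) for an HTML string."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    text = " ".join(" ".join(parser.text_parts).split())
    return title, text


def tokenize(text):
    """Lower-case the text and split it into alphanumeric keywords (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """pages: {doc_id: html}. Return (postings, doc_lengths, titles)."""
    per_term = defaultdict(dict)    # keyword -> {doc_id: frequency in that doc}
    doc_lengths, titles = {}, {}
    for doc_id, html in pages.items():
        titles[doc_id], text = parse_page(html)
        words = tokenize(text)
        doc_lengths[doc_id] = len(words)
        for word, freq in Counter(words).items():
            per_term[word][doc_id] = freq
    postings = {}
    for word, freqs in per_term.items():
        # Sort documents by descending frequency; ties broken by document id.
        ranked = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
        postings[word] = {
            "total_frequency": sum(freqs.values()),
            "document_frequency": len(freqs),
            "documents": ranked,     # list of (doc_id, frequency)
        }
    return postings, doc_lengths, titles
```

`doc_lengths` plays the role of the document length file: a ranking function could divide a keyword's frequency by the document's length so that long pages are not favoured merely for being long.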

    Test Plans

    Testing Types

    Type of Testing | Project Performing? | If No, Rationale/Approach to Mitigate Risk | Owner/Lead
    Unit | Yes | N/A | Roy Wong
    Functional | Yes | N/A | Ho Yee Ting
    Load/Performance | No | May have a dual-level static cache placed on the server, acting on behalf of web crawlers | N/A
    Regression | Yes | N/A | Ho Yee Ting
    Integration | Yes | N/A | Roy Wong
    User Acceptance | No | We do not have anyone who can test from an end-user or client perspective | N/A

    Testing Schedule

    Test Design Process

    Testers are required to understand each requirement before preparing test cases

    Testers must ensure that all requirements are covered

    Testers may use any use case and functional specification to write test cases

    Testers will exchange test cases for review

    Testers will then review the comments by peers and make amendments if necessary

    Test Execution Process

    Assign test cases to respective tester

    Tester to execute every single test case

    Tester to ensure every test case is marked and documented either Pass / Fail

    Any test case marked as Fail should be raised with the respective severity level

    Provide step-by-step instructions, with screenshots if necessary, to replicate the failure

    Any defects detected outside of the test steps should also be captured and raised after confirming with the Project Manager / Test Lead

    Responsibilities

    i) Test Team

    Develop test conditions, test cases, expected results and execution scripts

    Perform execution and validation

    Identify, document and prioritize defects

    Re-test after software modifications have been made

    ii) Development Team

    Review test plan, cases, scripts, expected results and provide timely feedback

    Validating the test results

    Support test team when needed

    Implement fixes to defects that were raised during the testing phase

    Defect Severity References

    Severity | Impact
    1 (Critical) | System crashes, database corruption, or potential data loss; abnormal return to the operating system (crash or a system failure message appears); application hangs and requires rebooting the system
    2 (High) | Lack of essential program functionality, with no workaround
    3 (Medium) | Inferior quality of the system, though there is a workaround for achieving the desired functionality; prevents other areas of the product from being tested, though other areas can be tested independently
    4 (Low) | Insufficient or unclear error message; minimum impact on product usage
    5 (FYI) | Insufficient or unclear error message that has no impact on product usage
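
One way to record defects against these severity levels, so that failed test cases can be raised in priority order, is sketched below. The `Severity` values mirror the table; the `Defect` record, the `prioritize` helper and the field names are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the table above; a lower number is more severe."""
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4
    FYI = 5


@dataclass
class Defect:
    test_case: str       # e.g. "S4" (identifier of the failed test case)
    summary: str         # short description of the failure
    severity: Severity


def prioritize(defects):
    """Return defects ordered from most to least severe, keeping input order on ties."""
    return sorted(defects, key=lambda defect: defect.severity)
```

Because `sorted` is stable, defects with the same severity stay in the order in which they were raised.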

    Test Environment

    Windows environment with Internet Explorer 8, 9 and 10, Firefox 24.0, and Google Chrome 32.0 and later should be available to the test team.

    Mac OS environment with Safari 5 and later will also be available for testing.

    Testing for the search engine should be on Visual Studio 2010 web development on Windows if required

    Testing for the Database should be on SQL 2012 on Windows if required

    Testing for the crawler should be on Visual Studio 2012 on Windows if required

    Test cases & execution

    S1 Actual search should be conducted on Database

    Purpose: So that user can get faster search results

    Pre-Requisite:

    Test Data:

    Steps:

    Expected Result:

    Test Result: Pass/ Fail

    Remarks:

    S2 System must be distributed and transparent to users

    Purpose: So that multiple users can use the system concurrently

    Pre-Requisite:

    Test Data:

    Steps:

    Expected Result: System is running on multiple machines at the same time

    Test Result: Pass/ Fail

    Remarks:

    S3 System should process text files

    Purpose:

    Pre-Requisite:

    Test Data:

    Steps:

    Expected Result:

    Test Result: Pass/ Fail

    Remarks:

    S3.1 System should process HTML files

    Purpose:

    Pre-Requisite:

    Test Data:

    Steps:

    Expected Result:

    Test Result: Pass/ Fail

    Remarks:

    S3.2 System should process source code

    Purpose:

    Pre-Requisite:

    Test Data:

    Steps:

    Expected Result:

    Test Result: Pass/ Fail

    Remarks:

    S4 Search should include both content and file name

    Purpose: So that search results from within a file would not be missed

    Pre-Requisite: Test result for S3 must be Pass

    Test Data: abc.txt with content 123

    Steps: 1) User keys in abc in the search field
           2) User presses enter or clicks the search button

           OR

           1) User keys in 123 in the search field
           2) User presses enter or clicks the search button

    Expected Result: 1) Both searches display abc.txt or any other files that contain 123

    Test Result: Pass/ Fail

    Remarks:
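
Test case S4 could be automated along the following lines. This is a sketch only: `search_files` is a hypothetical stand-in for the real search component, and the case-insensitive substring match is an assumption, since the report does not specify the matching rule. The file `abc.txt` with content `123` is the test data given in S4; the other two files are made up for illustration.

```python
def search_files(files, keyword):
    """Return the names of files whose name or content contains the keyword.

    files: {file_name: content}. Matching is a case-insensitive substring test,
    which is an assumption; the report does not specify the matching rule.
    """
    needle = keyword.lower()
    return sorted(
        name for name, content in files.items()
        if needle in name.lower() or needle in content.lower()
    )


if __name__ == "__main__":
    test_data = {"abc.txt": "123", "notes.txt": "call 123 now", "other.txt": "xyz"}
    print(search_files(test_data, "abc"))
    print(search_files(test_data, "123"))
```

Both searches return `abc.txt`: the first matches on the file name, the second on the content.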

    S5 Search should recognize a blank space for AND

    Purpose: So that search would include a combination of key words

    Pre-Requisite:

    Test Data:

    Steps:

    Expected Result:

    Test Result: Pass/ Fail

    Remarks:

    S5.1 Search should recognize | for OR

    Purpose: So that search result would be either one or the other key word

    Pre-Requisite:

    Test Data:

    Steps:

    Expected Result:

    Test Result: Pass/ Fail

    Remarks:
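
The two operators covered by S5 and S5.1 can be sketched as a small query evaluator. The report does not define operator precedence, so this sketch assumes that AND (blank space) binds tighter than OR (`|`); `matches`, `search` and the sample documents are hypothetical.

```python
def matches(query, text):
    """Evaluate a query against a text: blank space means AND, '|' means OR.

    Assumption: AND binds tighter than OR, so 'a b | c' means (a AND b) OR c.
    """
    words = set(text.lower().split())
    for alternative in query.lower().split("|"):
        terms = alternative.split()
        if terms and all(term in words for term in terms):
            return True
    return False


def search(query, documents):
    """Return the sorted ids of documents that match the query."""
    return sorted(doc_id for doc_id, text in documents.items() if matches(query, text))
```

For example, with documents `{1: "web crawler design", 2: "inverted index design", 3: "web index"}`, the query `web index` matches only document 3, while `crawler | index` matches all three.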

    S5.2 Search should recognize \ for exclusion

    Purpose: So that unwanted results can be excluded

    Pre-Requisite:

    Test Data:

    Steps:

    Expected Result:

    Test Result: Pass/ Fail

    Remarks:

    S5.3 Search should recognize double quotation marks for exact-word search

    Purpose: So that results would be more specific to key word

    Pre-Requisite:

    Test Data:

    Steps:

    Expected Result:

    Test Result: Pass/ Fail

    Remarks:

    S5.4 Search should recognize brackets to define the sequence of operations when operators are combined

    Purpose:

    Pre-Requisite:

    Test Data:

    Steps:

    Expected Result:

    Test Result: Pass/ Fail

    Remarks:

    Conclusion

    We have presented web search engine software that is suitable for research
    and learning purposes because of its simplicity, portability, and
    modifiability. The strength of our program lies in the search component,
    since we provide many scoring functions to rank pages by relevance to user
    queries; in particular, the inclusion of anchor text analysis lets our
    program also find relevant pages that do not contain the terms in the
    query.

    In the crawler component, only a small modification was made. However,
    this small modification can improve the crawling reliability
    significantly. Readers who have

    read the documentation will notice that the crawling method is a
    breadth-first search without politeness policies (e.g., obeying robots.txt
    and controlling access to the servers), spam page detection, a priority
    URL queue, or memory management to divide the load of the crawling process
    between disk and RAM.

    Risk analysis and measures

    Risk classification

    Risk Level | Risk Description & Necessary Actions
    High | The loss of confidentiality, integrity, or availability could be expected to have a severe or catastrophic adverse effect on organizational operations, organizational assets or individuals.
    Moderate | The loss of confidentiality, integrity, or availability could be expected to have a serious adverse effect on organizational operations, organizational assets or individuals.
    Low | The loss of confidentiality, integrity, or availability could be expected to have a limited adverse effect on organizational operations, organizational assets or individuals.

    Domain | Risk | Measures | Risk level
    Actor | Team members differ in programming knowledge and in the languages they know. There is a risk of choosing a programming language that someone in the team is not familiar with. | The majority of team members are familiar with C#, so it will be chosen as the language for the project. | High
    Actor | The project may fail because members might spend a lot of time researching how to develop it. | Choose a web search engine, as members are more familiar with web development. | Medium
    Technology | There is a chance that the program might not function on the presentation day due to different machine settings, as the project is not done on the school computers. | Each team member is to have their laptop on standby on the day of the presentation and to make sure beforehand that the program runs on their laptop. | Medium
    Technology | Due to the complexity of developing a distributed system, there is a risk that members might not fully understand how to make the system distributed. | Each member is to research how distributed systems work. | Medium
    Structure | Members of the team might not know what to do or what was discussed when they are not available for a meeting. | Minutes are taken at each meeting so that what is discussed is recorded. | Low
    Technology | The database is not secure when everything is on one machine. | Separate the database from the crawler and website. The database will be on another machine. | Medium
    Technology | The search engine might return poor information. | Search engine rules and algorithms for ranking pages might not be effective. | High
    Technology | The system might not be able to handle multiple queries at a time. | Research the requirements for handling multiple queries. | Medium
    Task | Team members might not be able to meet deadlines. | Tasks are distributed evenly so that no one member is overloaded with work. | Medium
    Technology | The website might not be compatible with multiple browsers. | Research the requirements of each type of browser. | Medium
    Technology | The database might be corrupted, causing potential data loss. | RAID and backup plans can be implemented to make sure all data are secured in the case of system crashes. | High
    Technology | Advanced search operators might not work properly. The system cannot process text files. | The test team is to test properly so that it can revert to the developer when any of the search operators is not working. | Medium

    Meeting Minutes

    References

    [1] H.T. Lee, D. Leonard, X. Wang, and D. Loguinov, "IRLbot: Scaling to 6 Billion Pages and Beyond," in Proc. 17th International WWW Conference, Beijing (2008) pp. 427-436.

    [2] A. Heydon and M. Najork, "Mercator: A Scalable, Extensible Web Crawler," World Wide Web, Vol. 2, No. 4 (1999) pp. 219-229.

    [3] V. Shkapenyuk and T. Suel, "Design and Implementation of a High-Performance Distributed Web Crawler," in Proc. IEEE ICDE (2002) pp. 357-368.

    [4] T. Segaran, Programming Collective Intelligence: Building Smart Web 2.0 Applications, O'Reilly Media Inc. (2007) pp. 49-52.

    [5] Z. Bar-Yossef, I. Keidar, and U. Schonfeld, "Do Not Crawl in the DUST: Different URLs with Similar Text," in Proc. 16th International WWW Conference (2007) pp. 111-120.

    [6] R. Cai, J.M. Yang, W. Lai, Y. Wang, and L. Zhang, "IRobot: An Intelligent Crawler for Web Forums," in Proc. 17th International WWW Conference (2008) pp. 447-456.

    [7] M. Najork and A. Heydon, "High-Performance Web Crawling," Compaq Systems Research Center, Tech. Report 173 (2001).

    [8] P. Boldi, M. Santini, and S. Vigna, "UbiCrawler: A Scalable Fully Distributed Web Crawler," Software: Practice & Experience, Vol. 34, No. 8 (2004) pp. 711-726.

    [9] Guide to conduct research on the Internet, http://www.gsn.org/web/research/internet/tool2.htm

    [10] Software Testing Help, http://www.softwaretestinghelp.com/how-to-write-test-plan-document-software-testing-training-day3/

    [11] Index of /methodology/documents - Test Approach, http://www.nextgen.umich.edu/methodology/documents/