8/10/2019 CSCI 311_V3.docx
CSCI 311 Software Process Management
Building a Distributed Search Engine
Submitted By
No Name UOW ID Contributions Signature
1 Wong Jun Hao Roy 4639327 Contributed
2 Yee Mon Min Aung Ingrid 4639017 Contributed
3 Ho Yee Ting 4639066 Contributed
4 Goh Ming Ming 4395220 Contributed
5 Tan Wei Yao 4395529 Contributed
Table of Contents
Literature Review
Introduction
Search engines may be one of the core reasons the Internet is so successful and
popular today. A search engine may look, and indeed is, easy to use; however,
it is far more sophisticated than it appears. In general, a search tool
consists of an interface, a crawler, an index, and a search component.
The Interface
It is usually like any other web page: a text box for a phrase or keyword(s),
plus other form inputs that help fine-tune what the user wants the search
engine to look for. It could also take the form of an application or desktop
search engine that crawls and searches the hard drives of a computer.
Crawler
This is the essence of every search engine and what makes searching the
Internet possible. Crawlers are programs that crawl through the Internet just
like a spider crawling on its web. Crawlers constantly look for new web pages
that they are not yet aware of and that have not been included in their index.
They usually do this by searching through pages they already know, looking for
new links. When they find a new link, the crawler follows it and collects the
data needed for the index. Afterwards, the crawler returns and adds the new
page to the index. As a result, the search engine grows automatically.
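The crawl loop described above — start from pages already known, follow newly discovered links, and add each new page to the index — can be sketched as a breadth-first traversal. This is a minimal illustration, not the project's implementation: `FAKE_WEB`, `fetch_links`, and the example URLs are stand-ins invented for the sketch.

```python
from collections import deque

# A tiny in-memory "web" standing in for real HTTP fetches (hypothetical data).
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": [],
    "http://example.com/c": ["http://example.com/"],
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])

def crawl(seed):
    """Breadth-first crawl: follow links from pages already known,
    adding each newly discovered page to the index."""
    index = {}              # URL -> document id, built as pages are found
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in index:
            continue        # already indexed, skip
        index[url] = len(index)     # assign the next document id
        queue.extend(fetch_links(url))
    return index

index = crawl("http://example.com/")
```

Note that, as the conclusion of this report points out, a real crawler would also need politeness policies (robots.txt, rate limiting) that this sketch omits.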
Index
An index is a database that works at the back end of the web page. Each index
covers a different portion of the Internet: some sites have huge indexes and
some have smaller ones. Regardless of size, each has its own pros and cons.
System Overview
The system is designed and implemented following the original architecture. At
the top level, the system consists of crawler, searcher, and database
components. Figure 1 shows the overall design of the system, Figure 2 shows
the system design in more detail, and Figures 3 and 4 show the structure of
the crawler and searcher classes respectively. The following subsections
describe the modifications to the original system.
Figure 1: System Design (Overview)
Improve Reliability of the Crawling Process
Web crawling is the process of collecting all the web pages that are of
interest to a search engine. It is usually a challenging task for a general
search engine [1], but it is quite easy for a site-specific search engine,
because the developers of a site-specific search engine usually have access to
the web pages of the site. In this project, I downloaded about 12,000 of the
most frequently accessed web pages from the ICS web site as the main data set
used by the ICS web search engine. Each web page is assigned a unique ID
(called a document ID in this report). The mapping between the unique ID and
the URL of the web page is created and stored in a text file during the
crawling phase.
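The ID-to-URL mapping can be written out and reloaded as a plain text file. This is a hedged sketch only: the report does not specify the actual file layout, so the tab-separated `doc_id<TAB>url` format and the example URLs below are assumptions.

```python
import io

def write_mapping(pairs, fh):
    """Store one `doc_id<TAB>url` line per crawled page (assumed layout)."""
    for doc_id, url in pairs:
        fh.write(f"{doc_id}\t{url}\n")

def read_mapping(fh):
    """Rebuild the doc_id -> URL dictionary from the text file."""
    mapping = {}
    for line in fh:
        doc_id, url = line.rstrip("\n").split("\t", 1)
        mapping[int(doc_id)] = url
    return mapping

# Round-trip the mapping through an in-memory file (hypothetical URLs).
buf = io.StringIO()
write_mapping([(0, "https://www.ics.example/index.html"),
               (1, "https://www.ics.example/courses.html")], buf)
buf.seek(0)
mapping = read_mapping(buf)
```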
Web indexing is one of the most complicated and critical processes in building
a search engine. It includes the following steps:
Web parsing: In this step each web page is parsed into pure text without HTML
tags, and the title of the web page is extracted. The pure text of each page
is used as the document matched against the user's query in the search phase.
The title of each page is used as the description of the web link in the hit
list returned to the user.
Inverted indexing: An inverted index is a mapping from a keyword to the
documents in which it appears. The correspondence between a particular keyword
and a document is called a posting. In this step a posting data structure that
records all the appearances of the keywords in the web pages is created. For
each keyword, the posting includes the following information: total frequency,
document frequency, the frequency in each document, and the document ID. The
document IDs are sorted in descending order of the frequency of the keyword in
the document, so that documents with more occurrences of the keyword are
retrieved first. In order to normalize the similarity according to the length
of each document, the document length (the number of keywords the document
contains) is also counted and stored in the document length file at this step.
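The posting structure above can be sketched as follows. This is an illustrative version, not the project's implementation; total frequency and document frequency can be derived from each posting list (the sum of the per-document frequencies and its length, respectively).

```python
from collections import Counter

def build_index(docs):
    """Build an inverted index from {doc_id: text}.

    Each posting list maps a keyword to (doc_id, frequency) pairs, sorted
    so that documents with more occurrences are retrieved first.  Document
    lengths (token counts) are kept for similarity normalization.
    """
    postings = {}    # keyword -> list of (doc_id, frequency)
    doc_length = {}  # doc_id -> number of tokens in the document
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        doc_length[doc_id] = len(tokens)
        for word, freq in Counter(tokens).items():
            postings.setdefault(word, []).append((doc_id, freq))
    for word in postings:
        # descending frequency, so high-count documents come first
        postings[word].sort(key=lambda p: -p[1])
    return postings, doc_length

postings, doc_length = build_index({
    1: "web crawler visits web pages",
    2: "search engine index",
})
```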
Test Plans
Testing Types
Type of Testing | Performing? | If No, rationale / approach to mitigate risk | Owner/Lead
Unit | Yes | N/A | Roy Wong
Functional | Yes | N/A | Ho Yee Ting
Load/Performance | No | May have a dual-level static cache placed on the server, acting on behalf of web crawlers | N/A
Regression | Yes | N/A | Ho Yee Ting
Integration | Yes | N/A | Roy Wong
User Acceptance | No | We do not have anyone who can test from an end-user or client perspective | N/A
Testing Schedule
Test Design Process
Testers are required to understand each requirement when preparing test cases
Ensure all requirements are covered
Testers may use any use case and functional specifications to write test cases
Testers will exchange test cases for peer review
Testers will then review the comments from peers and make amendments if necessary
Test Execution Process
Assign test cases to the respective tester
Tester to execute every single test case
Tester to ensure every test case is marked and documented as either Pass or Fail
Any test case marked as Fail should be raised with the respective severity level
Provide step-by-step instructions, with screenshots if necessary, to replicate the failure
Any defects detected outside of the test steps should also be captured and raised after confirming with the Project Manager / Test Lead
Responsibilities
i) Test Team
Develop test conditions, test cases, expected results and execution scripts
Perform execution and validation
Identify, document and prioritize defects
Re-test after software modifications have been made
ii) Development Team
Review test plans, cases, scripts and expected results, and provide timely feedback
Validate the test results
Support the test team when needed
Implement fixes to defects raised during the testing phase
Defect Severity References
Severity | Impact
1 (Critical) | System crashes, database corruption, or potential data loss; abnormal return to the operating system (a crash or a system failure message appears); application hangs and requires rebooting the system
2 (High) | Lack of essential program functionality, with no workaround
3 (Medium) | Inferior quality of the system, though there is a workaround for achieving the desired functionality; prevents other areas of the product from being tested, though those areas can be tested independently
4 (Low) | Insufficient or unclear error message; minimal impact on product usage
5 (FYI) | Insufficient or unclear error message that has no impact on product usage
Test Environment
A Windows environment with Internet Explorer 8, 9 and 10, Firefox 24.0 and later, and Google Chrome 32.0 and later should be available to the test team.
A Mac OS environment with Safari 5 and later will also be available for testing.
Testing for the search engine should be done with Visual Studio 2010 web development on Windows if required.
Testing for the database should be done on SQL Server 2012 on Windows if required.
Testing for the crawler should be done with Visual Studio 2012 on Windows if required.
Test cases & execution
S1: Actual search should be conducted on the database
Purpose: So that the user can get faster search results
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
S2: System must be distributed and transparent to users
Purpose: So that multiple users can use the system concurrently
Pre-Requisite:
Test Data:
Steps:
Expected Result: System is running on multiple machines at the same time
Test Result: Pass/ Fail
Remarks:
S3: System should process text files
Purpose:
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
S3.1: System should process HTML files
Purpose:
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
S3.2: System should process source code
Purpose:
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
S4: Search should include both content and file name
Purpose: So that a search result from within a file would not be missed
Pre-Requisite: Test result for S3 must be Pass
Test Data: abc.txt with content 123
Steps: 1) User keys in "abc" in the search field
2) User presses Enter or clicks the search button
OR
1) User keys in "123" in the search field
2) User presses Enter or clicks the search button
Expected Result: Both searches display abc.txt or any other files that contain 123
Test Result: Pass/ Fail
Remarks:
S5: Search should recognize a blank space for AND
Purpose: So that search would include a combination of key words
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
S5.1: Search should recognize | for OR
Purpose: So that search results would match either one keyword or the other
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
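The flat operators covered by test cases S5 through S5.2 (a blank space for AND, `|` for OR, `\` for exclusion) can be evaluated over a set-based index. This is an illustrative sketch, not the project's implementation: it omits the quoting (S5.3) and bracket-grouping (S5.4) cases, and the `index` and `all_docs` data are invented for the example.

```python
def search(query, index, all_docs):
    """Evaluate a flat boolean query over {keyword: set(doc_ids)}.

    A blank space between terms means AND, `|` separates OR
    alternatives, and a leading `\\` excludes a term.  Quoting and
    bracket grouping are deliberately left out of this sketch.
    """
    result = set()
    for or_group in query.split("|"):      # `|` means OR between groups
        group = set(all_docs)
        for term in or_group.split():      # blank space means AND
            if term.startswith("\\"):      # `\term` excludes documents
                group -= index.get(term[1:], set())
            else:
                group &= index.get(term, set())
        result |= group
    return result

# Hypothetical three-document index used only for this illustration.
index = {"cat": {1, 2}, "dog": {2, 3}, "fish": {3}}
docs = {1, 2, 3}
```

For example, `search("cat dog", index, docs)` intersects the two posting sets, while `search("cat \\dog", index, docs)` removes every document containing "dog" from the "cat" results.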
8/10/2019 CSCI 311_V3.docx
12/18
12
S5.2: Search should recognize \ for exclusion
Purpose: So that unwanted results can be excluded
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
S5.3: Search should recognize double quotation marks for exact-word search
Purpose: So that results would be more specific to the keyword
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
S5.4: Search should recognize brackets to define the sequence of operations when operators are combined
Purpose:
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
Conclusion
We have presented web search engine software that is suitable for research and
learning purposes because of its simplicity, portability, and modifiability.
The strength of our program is in the search function component, since we
provide many scoring functions to sort pages by relevance to user queries; in
particular, the inclusion of anchor text analysis enables our program to find
relevant pages that do not contain the query terms.
In the crawler component, only a small modification was made. However, this
small modification improves crawling reliability significantly. Readers who
have read the documentation will notice that the crawling method is
breadth-first search without politeness policies (e.g., obeying robots.txt and
controlling access to the servers), spam page detection, a priority URL queue,
or memory management to divide the load of the crawling process between disk
and RAM.
Risk analysis and measures
Risk classification
Risk Level Risk Description & Necessary Actions
High The loss of confidentiality, integrity, or availability could be expected to have a
severe or catastrophic adverse effect on organizational operations,
organizational assets or individuals.
Moderate The loss of confidentiality, integrity, or availability could be expected to have a
serious adverse effect on organizational operations, organizational assets or
individuals.
Low The loss of confidentiality, integrity, or availability could be expected to have a
limited adverse effect on organizational operations, organizational assets or
individuals.
Domain | Risk | Measures | Risk level
Actor | Programming knowledge differs among team members and the languages known are different; there is a risk of choosing a programming language that someone in the team is not familiar with. | The majority of team members are familiar with C#, so it will be chosen as the language for the project. | High
Actor | The project may fail as members might spend a lot of time researching how to develop. | Choose a web search engine, as members are more familiar with web development. | Medium
Technology | There is a chance that the program might not function on the presentation day due to different machine settings, as the project is not done on the school computers. | Each team member is to have their laptop on standby on the day of the presentation, and to make sure beforehand that the program runs on their laptop. | Medium
Technology | Due to the complexity of developing a distributed system, there is a risk that members might not fully understand how to make the system distributed. | Each member is to conduct findings and research on how distributed systems work. | Medium
Structure | Members of the team might not know what to do, or what was discussed, when they are not available for a meeting. | Minutes of each meeting are taken so that what is discussed on meeting days is recorded. | Low
Technology | The database is not secure when everything is on one machine. | Separate the database from the crawler and website; the database will be on another machine. | Medium
Technology | Poor information is found by the search engine; the search engine's rules and ranking algorithms might not be effective. | | High
Technology | The system might not be able to handle multiple queries at a time. | Conduct findings and research on the requirements for handling multiple queries. | Medium
Task | Team members might not be able to meet deadlines. | Tasks are distributed evenly so that no single member is overloaded with work. | Medium
Technology | The website might not be compatible with multiple browsers. | Conduct findings and research on each type of browser's requirements. | Medium
Technology | The database might be corrupted, causing potential data loss. | RAID can be implemented, along with backup plans, to make sure all data is safe in the case of system crashes. | High
Technology | Advanced search operators might not work properly, or the system cannot process text files. | The test team is to do proper testing, so that any non-working search operator can be reported back to the developers. | Medium
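For the multiple-queries risk above, a common mitigation is to serve requests from a fixed-size worker pool rather than a single handler. This is a sketch using Python's standard library, purely for illustration (the project itself targets C#; `INDEX` and `handle_query` are hypothetical names).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory index used only for this illustration.
INDEX = {"crawler": {1, 2}, "index": {2, 3}}

def handle_query(term):
    """Serve one search request; the real system would query the database."""
    return sorted(INDEX.get(term, set()))

# A fixed-size worker pool lets several queries run concurrently
# instead of queueing behind a single handler.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_query, ["crawler", "index", "missing"]))
```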
Meeting Minutes
References
[1] H.T. Lee, D. Leonard, X. Wang, and D. Loguinov, IRLbot: Scaling to 6 Billion Pages and Beyond, in Proc. 17th International WWW Conference, Beijing (2008), pp. 427-436.
[2] A. Heydon and M. Najork, Mercator: A Scalable, Extensible Web Crawler, World Wide Web, Vol. 2, No. 4 (1999), pp. 219-229.
[3] V. Shkapenyuk and T. Suel, Design and Implementation of a High-Performance Distributed Web Crawler, in Proc. IEEE ICDE (2002), pp. 357-368.
[4] T. Segaran, Programming Collective Intelligence: Building Smart Web 2.0 Applications, O'Reilly Media Inc. (2007), pp. 49-52.
[5] Z. Bar-Yossef, I. Keidar, and U. Schonfeld, Do Not Crawl in the DUST: Different URLs with Similar Text, in Proc. 16th International WWW Conference (2007), pp. 111-120.
[6] R. Cai, J.M. Yang, W. Lai, Y. Wang, and L. Zhang, IRobot: An Intelligent Crawler for Web Forums, in Proc. 17th International WWW Conference (2008), pp. 447-456.
[7] M. Najork and A. Heydon, High-Performance Web Crawling, Compaq Systems Research Center, Tech. Report 173 (2001).
[8] P. Boldi, M. Santini, and S. Vigna, UbiCrawler: A Scalable Fully Distributed Web Crawler, Software: Practice & Experience, Vol. 34, No. 8 (2004), pp. 711-726.
[9] Guide to Conduct Research on the Internet, http://www.gsn.org/web/research/internet/tool2.htm
[10] Software Testing Help, http://www.softwaretestinghelp.com/how-to-write-test-plan-document-software-testing-training-day3/
[11] Test Approach, http://www.nextgen.umich.edu/methodology/documents/