8/10/2019 CSCI 311_V3.docx
CSCI 311 Software Process Management
Building a Distributed Search Engine
Submitted By
No Name UOW ID Contributions Signature
1 Wong Jun Hao Roy 4639327 Contributed
2 Yee Mon Min Aung Ingrid 4639017 Contributed
3 Ho Yee Ting 4639066 Contributed
4 Goh Ming Ming 4395220 Contributed
5 Tan Wei Yao 4395529 Contributed
Table of Contents
Literature Review
Introduction
Search engines may be one of the core reasons the Internet is so successful and
popular today. A search engine may look, and indeed is, easy to use; however,
it is far more sophisticated than it appears. In general, a search tool
consists of an interface, a crawler, an index, and a search component.
The Interface
It is usually like any other web page: a text box for a phrase or keyword(s),
plus other form inputs that help fine-tune what the user wants the search
engine to look for. It could also take the form of an application or desktop
search engine that crawls and searches the hard drives of a computer.
Crawler
This is the essence of every search engine and what makes searching the
Internet possible. Crawlers are programs that crawl through the Internet just
like a spider crawling on its web. Crawlers constantly look for new web pages
that they are not yet aware of and that have not been included in their index.
They usually do this by searching through pages they already know, looking for
new links. When they find a new link, the crawler follows it and collects the
data needed for the index. Afterwards, the crawler returns and adds the new
page to the index. As a result, the search engine grows automatically.
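The crawl loop described above — start from pages already known, follow newly discovered links, and add each new page to the index — can be sketched as a breadth-first traversal. This is a minimal illustration, not the project's implementation: `FAKE_WEB`, `fetch_links`, and the example URLs are stand-ins invented for the sketch.

```python
from collections import deque

# A tiny in-memory "web" standing in for real HTTP fetches (hypothetical data).
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": [],
    "http://example.com/c": ["http://example.com/"],
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])

def crawl(seed):
    """Breadth-first crawl: follow links from pages already known,
    adding each newly discovered page to the index."""
    index = {}              # URL -> document id, built as pages are found
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in index:
            continue        # already indexed, skip
        index[url] = len(index)     # assign the next document id
        queue.extend(fetch_links(url))
    return index

index = crawl("http://example.com/")
```

Note that, as the conclusion of this report points out, a real crawler would also need politeness policies (robots.txt, rate limiting) that this sketch omits.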
Index
An index is a database that works at the back end of the web page. Each index
covers a different portion of the Internet: some sites have huge indexes and
some have smaller ones. Regardless of size, each has its own pros and cons.
System Overview
The system is designed and implemented following the original architecture. At
the top level, the system consists of crawler, searcher, and database
components. Figure 1 shows the overall design of the system, Figure 2 shows
the system design in more detail, and Figures 3 and 4 show the structure of
the crawler and searcher classes respectively. The following subsections
describe the modifications to the original system.
Figure 1: System Design (Overview)
Improve Reliability of the Crawling Process
Web crawling is the process of collecting all the web pages that are of
interest to a search engine. It is usually a challenging task for a general
search engine [1], but it is quite easy for a site-specific search engine,
because the developers of a site-specific search engine usually have access to
the web pages of the site. In this project, I downloaded about 12,000 of the
most frequently accessed web pages from the ICS web site as the main data set
used by the ICS web search engine. Each web page is assigned a unique ID
(called a document ID in this report). The mapping between the unique ID and
the URL of the web page is created and stored in a text file during the
crawling phase.
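The ID-to-URL mapping can be written out and reloaded as a plain text file. This is a hedged sketch only: the report does not specify the actual file layout, so the tab-separated `doc_id<TAB>url` format and the example URLs below are assumptions.

```python
import io

def write_mapping(pairs, fh):
    """Store one `doc_id<TAB>url` line per crawled page (assumed layout)."""
    for doc_id, url in pairs:
        fh.write(f"{doc_id}\t{url}\n")

def read_mapping(fh):
    """Rebuild the doc_id -> URL dictionary from the text file."""
    mapping = {}
    for line in fh:
        doc_id, url = line.rstrip("\n").split("\t", 1)
        mapping[int(doc_id)] = url
    return mapping

# Round-trip the mapping through an in-memory file (hypothetical URLs).
buf = io.StringIO()
write_mapping([(0, "https://www.ics.example/index.html"),
               (1, "https://www.ics.example/courses.html")], buf)
buf.seek(0)
mapping = read_mapping(buf)
```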
Web indexing is one of the most complicated and critical processes in building
a search engine. It includes the following steps:
Web parsing: In this step each web page is parsed into pure text without HTML
tags, and the title of the web page is extracted. The pure text of each page
is used as the document matched against the user's query in the search phase.
The title of each page is used as the description of the web link in the hit
list returned to the user.
Inverted indexing: An inverted index is a mapping from a keyword to the
documents in which it appears. The correspondence between a particular keyword
and a document is called a posting. In this step a posting data structure that
records all the appearances of the keywords in the web pages is created. For
each keyword, the posting includes the following information: total frequency,
document frequency, the frequency in each document, and the document ID. The
document IDs are sorted in descending order of the frequency of the keyword in
the document, so that documents with more occurrences of the keyword are
retrieved first. In order to normalize the similarity according to the length
of each document, the document length (the number of keywords the document
contains) is also counted and stored in the document length file at this step.
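The posting structure above can be sketched as follows. This is an illustrative version, not the project's implementation; total frequency and document frequency can be derived from each posting list (the sum of the per-document frequencies and its length, respectively).

```python
from collections import Counter

def build_index(docs):
    """Build an inverted index from {doc_id: text}.

    Each posting list maps a keyword to (doc_id, frequency) pairs, sorted
    so that documents with more occurrences are retrieved first.  Document
    lengths (token counts) are kept for similarity normalization.
    """
    postings = {}    # keyword -> list of (doc_id, frequency)
    doc_length = {}  # doc_id -> number of tokens in the document
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        doc_length[doc_id] = len(tokens)
        for word, freq in Counter(tokens).items():
            postings.setdefault(word, []).append((doc_id, freq))
    for word in postings:
        # descending frequency, so high-count documents come first
        postings[word].sort(key=lambda p: -p[1])
    return postings, doc_length

postings, doc_length = build_index({
    1: "web crawler visits web pages",
    2: "search engine index",
})
```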
Test Plans
Testing Types
Type of Testing | Performing? | If No, rationale / approach to mitigate risk | Owner/Lead
Unit | Yes | N/A | Roy Wong
Functional | Yes | N/A | Ho Yee Ting
Load/Performance | No | May have a dual-level static cache placed on the server, acting on behalf of web crawlers | N/A
Regression | Yes | N/A | Ho Yee Ting
Integration | Yes | N/A | Roy Wong
User Acceptance | No | We do not have anyone who can test from an end-user or client perspective | N/A
Testing Schedule
Test Design Process
Testers are required to understand each requirement when preparing test cases
Ensure all requirements are covered
Testers may use any use case and functional specifications to write test cases
Testers will exchange test cases for peer review
Testers will then review the comments from peers and make amendments if necessary
Test Execution Process
Assign test cases to the respective tester
Tester to execute every single test case
Tester to ensure every test case is marked and documented as either Pass or Fail
Any test case marked as Fail should be raised with the respective severity level
Provide step-by-step instructions, with screenshots if necessary, to replicate the failure
Any defects detected outside of the test steps should also be captured and raised after confirming with the Project Manager / Test Lead
Responsibilities
i) Test Team
Develop test conditions, test cases, expected results and execution scripts
Perform execution and validation
Identify, document and prioritize defects
Re-test after software modifications have been made
ii) Development Team
Review test plans, cases, scripts and expected results, and provide timely feedback
Validate the test results
Support the test team when needed
Implement fixes to defects raised during the testing phase
Defect Severity References
Severity | Impact
1 (Critical) | System crashes, database corruption, or potential data loss; abnormal return to the operating system (a crash or a system failure message appears); application hangs and requires rebooting the system
2 (High) | Lack of essential program functionality, with no workaround
3 (Medium) | Inferior quality of the system, though there is a workaround for achieving the desired functionality; prevents other areas of the product from being tested, though those areas can be tested independently
4 (Low) | Insufficient or unclear error message; minimal impact on product usage
5 (FYI) | Insufficient or unclear error message that has no impact on product usage
Test Environment
A Windows environment with Internet Explorer 8, 9 and 10, Firefox 24.0 and later, and Google Chrome 32.0 and later should be available to the test team.
A Mac OS environment with Safari 5 and later will also be available for testing.
Testing for the search engine should be done with Visual Studio 2010 web development on Windows if required.
Testing for the database should be done on SQL Server 2012 on Windows if required.
Testing for the crawler should be done with Visual Studio 2012 on Windows if required.
Test cases & execution
S1: Actual search should be conducted on the database
Purpose: So that the user can get faster search results
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
S2: System must be distributed and transparent to users
Purpose: So that multiple users can use the system concurrently
Pre-Requisite:
Test Data:
Steps:
Expected Result: System is running on multiple machines at the same time
Test Result: Pass/ Fail
Remarks:
S3: System should process text files
Purpose:
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
S3.1: System should process HTML files
Purpose:
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
S3.2: System should process source code
Purpose:
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
S4: Search should include both content and file name
Purpose: So that a search result from within a file would not be missed
Pre-Requisite: Test result for S3 must be Pass
Test Data: abc.txt with content 123
Steps: 1) User keys in "abc" in the search field
2) User presses Enter or clicks the search button
OR
1) User keys in "123" in the search field
2) User presses Enter or clicks the search button
Expected Result: Both searches display abc.txt or any other files that contain 123
Test Result: Pass/ Fail
Remarks:
S5: Search should recognize a blank space for AND
Purpose: So that search would include a combination of key words
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
S5.1: Search should recognize | for OR
Purpose: So that search results would match either one keyword or the other
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
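The flat operators covered by test cases S5 through S5.2 (a blank space for AND, `|` for OR, `\` for exclusion) can be evaluated over a set-based index. This is an illustrative sketch, not the project's implementation: it omits the quoting (S5.3) and bracket-grouping (S5.4) cases, and the `index` and `all_docs` data are invented for the example.

```python
def search(query, index, all_docs):
    """Evaluate a flat boolean query over {keyword: set(doc_ids)}.

    A blank space between terms means AND, `|` separates OR
    alternatives, and a leading `\\` excludes a term.  Quoting and
    bracket grouping are deliberately left out of this sketch.
    """
    result = set()
    for or_group in query.split("|"):      # `|` means OR between groups
        group = set(all_docs)
        for term in or_group.split():      # blank space means AND
            if term.startswith("\\"):      # `\term` excludes documents
                group -= index.get(term[1:], set())
            else:
                group &= index.get(term, set())
        result |= group
    return result

# Hypothetical three-document index used only for this illustration.
index = {"cat": {1, 2}, "dog": {2, 3}, "fish": {3}}
docs = {1, 2, 3}
```

For example, `search("cat dog", index, docs)` intersects the two posting sets, while `search("cat \\dog", index, docs)` removes every document containing "dog" from the "cat" results.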
8/10/2019 CSCI 311_V3.docx
12/18
12
S5.2: Search should recognize \ for exclusion
Purpose: So that unwanted results can be excluded
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
S5.3: Search should recognize double quotation marks for exact-word search
Purpose: So that results would be more specific to the keyword
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
S5.4: Search should recognize brackets to define the sequence of operations when operators are combined
Purpose:
Pre-Requisite:
Test Data:
Steps:
Expected Result:
Test Result: Pass/ Fail
Remarks:
Conclusion
We have presented web search engine software that is suitable for research and
learning purposes because of its simplicity, portability, and modifiability.
The strength of our program is in the search function component, since we
provide many scoring functions to sort pages by relevance to user queries; in
particular, the inclusion of anchor text analysis enables our program to find
relevant pages that do not contain the query terms.
In the crawler component, only a small modification was made. However, this
small modification improves crawling reliability significantly. Readers who
have read the documentation will notice that the crawling method is
breadth-first search without politeness policies (e.g., obeying robots.txt and
controlling access to the servers), spam page detection, a priority URL queue,
or memory management to divide the load of the crawling process between disk
and RAM.
Risk analysis and measures
Risk classification
Risk Level Risk Description & Necessary Actions
High The loss of confidentiality, integrity, or availability could be expected to have a
severe or catastrophic adverse effect on organizational operations,
organizational assets or individuals.
Moderate The loss of confidentiality, integrity, or availability could be expected to have a
serious adverse effect on organizational operations, organizational assets or
individuals.
Low The loss of confidentiality, integrity, or availability could be expected to have a
limited adverse effect on organizational operations, organizational assets or
individuals.
Domain | Risk | Measures | Risk level
Actor | Programming knowledge differs among team members and the languages known are different; there is a risk of choosing a programming language that someone in the team is not familiar with. | The majority of team members are familiar with C#, so it will be chosen as the language for the project. | High
Actor | The project may fail as members might spend a lot of time researching how to develop. | Choose a web search engine, as members are more familiar with web development. | Medium
Technology | There is a chance that the program might not function on the presentation day due to different machine settings, as the project is not done on the school computers. | Each team member is to have their laptop on standby on the day of the presentation, and to make sure beforehand that the program runs on their laptop. | Medium
Technology | Due to the complexity of developing a distributed system, there is a risk that members might not fully understand how to make the system distributed. | Each member is to conduct findings and research on how distributed systems work. | Medium
Structure | Members of the team might not know what to do, or what was discussed, when they are not available for a meeting. | Minutes of each meeting are taken so that what is discussed on meeting days is recorded. | Low
Technology | The database is not secure when everything is on one machine. | Separate the database from the crawler and website; the database will be on another machine. | Medium
Technology | Poor information is found by the search engine; the search engine's rules and ranking algorithms might not be effective. | | High
Technology | The system might not be able to handle multiple queries at a time. | Conduct findings and research on the requirements for handling multiple queries. | Medium
Task | Team members might not be able to meet deadlines. | Tasks are distributed evenly so that no single member is overloaded with work. | Medium
Technology | The website might not be compatible with multiple browsers. | Conduct findings and research on each type of browser's requirements. | Medium
Technology | The database might be corrupted, causing potential data loss. | RAID can be implemented, along with backup plans, to make sure all data is safe in the case of system crashes. | High
Technology | Advanced search operators might not work properly, or the system cannot process text files. | The test team is to do proper testing, so that any non-working search operator can be reported back to the developers. | Medium
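For the multiple-queries risk above, a common mitigation is to serve requests from a fixed-size worker pool rather than a single handler. This is a sketch using Python's standard library, purely for illustration (the project itself targets C#; `INDEX` and `handle_query` are hypothetical names).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory index used only for this illustration.
INDEX = {"crawler": {1, 2}, "index": {2, 3}}

def handle_query(term):
    """Serve one search request; the real system would query the database."""
    return sorted(INDEX.get(term, set()))

# A fixed-size worker pool lets several queries run concurrently
# instead of queueing behind a single handler.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_query, ["crawler", "index", "missing"]))
```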
Meeting Minutes
References
[1] H.T. Lee, D. Leonard, X. Wang, and D. Loguinov, IRLbot: Scaling to 6 Billion Pages and Beyond, in Proc. 17th International WWW Conference, Beijing (2008), pp. 427-436.
[2] A. Heydon and M. Najork, Mercator: A Scalable, Extensible Web Crawler, World Wide Web, Vol. 2, No. 4 (1999), pp. 219-229.
[3] V. Shkapenyuk and T. Suel, Design and Implementation of a High-Performance Distributed Web Crawler, in Proc. IEEE ICDE (2002), pp. 357-368.
[4] T. Segaran, Programming Collective Intelligence: Building Smart Web 2.0 Applications, O'Reilly Media Inc. (2007), pp. 49-52.
[5] Z. Bar-Yossef, I. Keidar, and U. Schonfeld, Do Not Crawl in the DUST: Different URLs with Similar Text, in Proc. 16th International WWW Conference (2007), pp. 111-120.
[6] R. Cai, J.M. Yang, W. Lai, Y. Wang, and L. Zhang, IRobot: An Intelligent Crawler for Web Forums, in Proc. 17th International WWW Conference (2008), pp. 447-456.
[7] M. Najork and A. Heydon, High-Performance Web Crawling, Compaq Systems Research Center, Tech. Report 173 (2001).
[8] P. Boldi, M. Santini, and S. Vigna, UbiCrawler: A Scalable Fully Distributed Web Crawler, Software: Practice & Experience, Vol. 34, No. 8 (2004), pp. 711-726.
[9] Guide to Conduct Research on the Internet, http://www.gsn.org/web/research/internet/tool2.htm
[10] Software Testing Help, http://www.softwaretestinghelp.com/how-to-write-test-plan-document-software-testing-training-day3/
[11] Test Approach, http://www.nextgen.umich.edu/methodology/documents/