

Acquisition Research Program
GRADUATE SCHOOL OF BUSINESS & PUBLIC POLICY
NAVAL POSTGRADUATE SCHOOL

Approved for public release, distribution unlimited.

Prepared for: Naval Postgraduate School, Monterey, California 93943

NPS-AM-08-035

Excerpt from the Proceedings of the Fifth Annual Acquisition Research Symposium

ReSEARCH: A REQUIREMENTS SEARCH ENGINE

Published: 23 April 2008

by

Dr. Craig Martell, LT Paige H. Adams, Dr. Pranav Anand, ENS Grant Gehrke, Dr. Ralucca Gera, CPT Marco Draeger, Dr. Kevin Squire

5th Annual Acquisition Research Symposium of the Naval Postgraduate School:

Acquisition Research: Creating Synergy for Informed Change

May 14-15, 2008


Report Documentation Page (Form Approved, OMB No. 0704-0188)

Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number.

1. REPORT DATE: 23 APR 2008
2. REPORT TYPE:
3. DATES COVERED: 00-00-2008 to 00-00-2008
4. TITLE AND SUBTITLE: ReSEARCH: A Requirements Search Engine
5a. CONTRACT NUMBER:
5b. GRANT NUMBER:
5c. PROGRAM ELEMENT NUMBER:
5d. PROJECT NUMBER:
5e. TASK NUMBER:
5f. WORK UNIT NUMBER:
6. AUTHOR(S):
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Naval Postgraduate School, Department of Computer Science, Monterey, CA, 93943
8. PERFORMING ORGANIZATION REPORT NUMBER:
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES):
10. SPONSOR/MONITOR'S ACRONYM(S):
11. SPONSOR/MONITOR'S REPORT NUMBER(S):
12. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release; distribution unlimited
13. SUPPLEMENTARY NOTES: 5th Annual Acquisition Research Symposium: Creating Synergy for Informed Change, May 14-15, 2008, in Monterey, CA
14. ABSTRACT:
15. SUBJECT TERMS:
16. SECURITY CLASSIFICATION OF:
a. REPORT: unclassified
b. ABSTRACT: unclassified
c. THIS PAGE: unclassified
17. LIMITATION OF ABSTRACT: Same as Report (SAR)
18. NUMBER OF PAGES: 53
19a. NAME OF RESPONSIBLE PERSON:

Standard Form 298 (Rev. 8-98), prescribed by ANSI Std Z39-18


The research presented at the symposium was supported by the Acquisition Chair of the Graduate School of Business & Public Policy at the Naval Postgraduate School.

To request Defense Acquisition Research or to become a research sponsor, please contact:

NPS Acquisition Research Program
Attn: James B. Greene, RADM, USN (Ret.), Acquisition Chair
Graduate School of Business and Public Policy
Naval Postgraduate School
555 Dyer Road, Room 332
Monterey, CA 93943-5103
Tel: (831) 656-2092
Fax: (831) 656-2253
E-mail: [email protected]

Copies of the Acquisition Sponsored Research Reports may be printed from our website: www.acquisitionresearch.org
Conference Website: www.researchsymposium.org


Proceedings of the Annual Acquisition Research Program

The following article is taken as an excerpt from the proceedings of the annual Acquisition Research Program. This annual event showcases the research projects funded through the Acquisition Research Program at the Graduate School of Business and Public Policy at the Naval Postgraduate School. Featuring keynote speakers, plenary panels, multiple panel sessions, a student research poster show and social events, the Annual Acquisition Research Symposium offers a candid environment where high-ranking Department of Defense (DoD) officials, industry officials, accomplished faculty and military students are encouraged to collaborate on finding applicable solutions to the challenges facing acquisition policies and processes within the DoD today. By jointly and publicly questioning the norms of industry and academia, the resulting research benefits from myriad perspectives and collaborations which can identify better solutions and practices in acquisition, contract, financial, logistics and program management.

For further information regarding the Acquisition Research Program, electronic copies of additional research, or to learn more about becoming a sponsor, please visit our program website at: www.acquisitionresearch.org

For further information on or to register for the next Acquisition Research Symposium during the third week of May, please visit our conference website at: www.researchsymposium.org


ReSEARCH: A Requirements Search Engine

Presenter: Craig Martell holds a PhD from the University of Pennsylvania and is an Associate Professor in Computer Science at the Naval Postgraduate School. He specializes in natural-language processing applied to on-line chat, weblogs and semantic search.

Dr. Craig Martell
Associate Professor
Department of Computer Science
Naval Postgraduate School
Monterey, CA 93943
E-mail: [email protected]

Author: Paige H. Adams, Lieutenant, is a US Navy Information Warfare Officer. He has served as a Russian and Burmese linguist aboard a variety of surface, subsurface, and airborne platforms in the Atlantic, Pacific, and Arabian Gulf, as well as at joint field sites and on flag staff. Most recently, he served as Operations Officer and Joint Operations Department Head at the Misawa Security Operations Center in Japan. He is a 2000 graduate (summa cum laude) of Hawaii Pacific University with a BS/BA in Computer Information Systems.

LT Paige Adams, USN
Graduate Student
Department of Computer Science
Naval Postgraduate School
Monterey, CA 93943
E-mail: [email protected]

Author: Dr. Pranav Anand is an Assistant Professor of Linguistics at UC Santa Cruz. He specializes in theoretical syntax and semantics, with an emphasis on how contextual factors affect the form and meaning of linguistic expressions. He received his AB from Harvard in 2001 and his PhD from MIT in 2006.

Dr. Pranav Anand
Assistant Research Professor
Department of Computer Science
Naval Postgraduate School
Monterey, CA 93943
E-mail: [email protected]

Author: Grant Gehrke, Ensign, is a native of Louisville, Kentucky. He graduated in 2007 from the Naval Academy with a degree in Computer Science. His current research focuses on using NLP approaches for authorship attribution in blogs. He will be graduating from the Naval Postgraduate School in June and reporting to Forrest Sherman, a destroyer homeported in Norfolk, Virginia.

ENS Grant Gehrke, USN
Graduate Student
Department of Computer Science
Naval Postgraduate School
Monterey, CA 93943
E-mail: [email protected]


Author: Dr. Ralucca Gera received her BS (2000), MA (2002), and PhD (2005) in Mathematics, all from Western Michigan University. She has been an Assistant Professor in the Applied Mathematics Department at NPS since 2005. Gera’s research interests are in combinatorics and graph theory, particularly domination and alliances in graphs.

Dr. Ralucca Gera
Assistant Professor
Department of Computer Science
Naval Postgraduate School
Monterey, CA 93943
E-mail: [email protected]

Author: Marco Draeger is a Captain in the German Army and studied Computer Science at the University of the Federal Armed Forces from 2000 to 2004. From 2004 to 2005, he was assigned to the Modeling and Simulation Department of the Center for Transformation of the Federal Armed Forces. Following this, he served at the German Command Support School until 2007. Since June 2007, Draeger has been a student in the Operations Research Department at the Naval Postgraduate School.

Marco Draeger
Graduate Student
Department of Computer Science
Naval Postgraduate School
Monterey, CA 93943
E-mail: [email protected]

Author: Dr. Kevin Squire received his PhD degree from the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign (UIUC), Urbana, in 2004. He is currently an Assistant Professor of Computer Science at the Naval Postgraduate School, Monterey, CA. His research focuses on developing and applying machine-learning algorithms and models to many different domains—including natural language processing, autonomous language acquisition, scheduling, robotics, computer vision, and remote sensing.

Dr. Kevin Squire
Assistant Professor
Department of Computer Science
Naval Postgraduate School
Monterey, CA 93943
E-mail: [email protected]

Abstract

This research addresses three closely related problems. (1) Most current search technology is based on a popularity metric (e.g., PageRank or ExpertRank), but not on the semantic content of the document. (2) When building components in a service-oriented architecture (SOA), developers must investigate whether components that meet certain requirements already exist. (3) There is no easy way for writers of requirements documents to formally specify the meaning and domain of their requirements. Our goal in the research presented here is to address these concerns by designing a search engine that searches over the “meanings” of requirements documents. In this paper, we present the current state of the ReSEARCH project.

Keywords: Semantic Search, Requirements, Open Architecture, Information Systems Technology

[The body of the paper (pp. 215-229 of the Proceedings) is not preserved in this excerpt.]

List of References

Apache Lucene. (2008). Lucene-java wiki. Retrieved March 23, 2008, from http://wiki.apache.org/lucene-java

Ask.com. (2008). Ask search technology. Retrieved March 28, 2008, from http://about.ask.com/en/docs/about/ask technology.shtml

Brin, S., & Page, L. (1998, April). The anatomy of a large-scale hypertextual web search engine. In Proceedings of the seventh international world wide web conference (pp. 107-117). Published in Computer Networks and ISDN Systems, 30(1-7).

Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42(1), 7-15.

Harris, Z.S. (1954). Distributional structure. Word, 10, 146-162.

Hearst, M.A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on computational linguistics (pp. 539-545). Morristown, NJ: Association for Computational Linguistics.

Hersh, W.R. (2003). Information retrieval: A health and biomedical perspective. New York: Springer.

Johnson, J., & Blais, C. (2008). Share repository framework: Component specific and ontology. In Proceedings of the fifth annual acquisition research symposium. Monterey, CA: Naval Postgraduate School.

Langville, A.N., & Meyer, C.D. (2006, July). Google's PageRank and beyond: The science of search engine rankings. Princeton, NJ: Princeton University Press.

Lin, D. (1998). Extracting collocations from text corpora. In Workshop on computational terminology (pp. 57-63). Montreal, Canada.

Lin, D. (2008). MINIPAR. Retrieved March 28, 2008, from http://www.cs.ualberta.ca/~lindek/minipar.htm

Lin, D., & Pantel, P. (2001). DIRT: Discovery of inference rules from text. In KDD '01: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining (pp. 323-328). New York: ACM.

Notess, G.R. (n.d.). Review of AltaVista. Retrieved March 23, 2008, from http://www.searchengineshowdown.com/features/av/review.html

Sapp, G. (2000). AltaVista proffers updated search technology to online businesses. Retrieved March 23, 2008, from http://www.infoworld.com/articles/ic/xml/00/03/21/000321icalta.html

Schmidt, M. (2005). Lucene wikipedia indexer. Retrieved March 23, 2008, from http://schmidt.devlib.org/software/lucene-wikipedia.html

Schwartz, C. (1998, September). Web search engines. Journal of the American Society for Information Science, 49(11), 973-982.

Snow, R., Jurafsky, D., & Ng, A.Y. (2005). Learning syntactic patterns for automatic hypernym discovery. In L.K. Saul, Y. Weiss & L. Bottou (Eds.), Advances in neural information processing systems 17 (pp. 1297-1304). Cambridge, MA: MIT Press.

Vajda, A. (2005). Pulling java lucene into python: Pylucene. Retrieved March 23, 2008, from http://chandlerproject.org/PyLucene/Paper

Zhang, D., & Dong, Y. (2000). An efficient algorithm to rank web resources. Computer Networks, 33(1-6), 449-455.


2003 - 2008 Sponsored Research Topics

Acquisition Management
• Software Requirements for OA
• Managing Services Supply Chain
• Acquiring Combat Capability via Public-Private Partnerships (PPPs)
• Knowledge Value Added (KVA) + Real Options (RO) Applied to Shipyard Planning Processes
• Portfolio Optimization via KVA + RO
• MOSA Contracting Implications
• Strategy for Defense Acquisition Research
• Spiral Development
• BCA: Contractor vs. Organic Growth

Contract Management
• USAF IT Commodity Council
• Contractors in 21st Century Combat Zone
• Joint Contingency Contracting
• Navy Contract Writing Guide
• Commodity Sourcing Strategies
• Past Performance in Source Selection
• USMC Contingency Contracting
• Transforming DoD Contract Closeout
• Model for Optimizing Contingency Contracting Planning and Execution

Financial Management
• PPPs and Government Financing
• Energy Saving Contracts/DoD Mobile Assets
• Capital Budgeting for DoD
• Financing DoD Budget via PPPs
• ROI of Information Warfare Systems
• Acquisitions via Leasing: MPS Case
• Special Termination Liability in MDAPs


Human Resources
• Learning Management Systems
• Tuition Assistance
• Retention
• Indefinite Reenlistment
• Individual Augmentation

Logistics Management
• R-TOC Aegis Microwave Power Tubes
• Privatization-NOSL/NAWCI
• Army LOG MOD
• PBL (4)
• Contractors Supporting Military Operations
• RFID (4)
• Strategic Sourcing
• ASDS Product Support Analysis
• Analysis of LAV Depot Maintenance
• Diffusion/Variability on Vendor Performance Evaluation
• Optimizing CIWS Lifecycle Support (LCS)

Program Management
• Building Collaborative Capacity
• Knowledge, Responsibilities and Decision Rights in MDAPs
• KVA Applied to Aegis and SSDS
• Business Process Reengineering (BPR) for LCS Mission Module Acquisition
• Terminating Your Own Program
• Collaborative IT Tools Leveraging Competence

A complete listing and electronic copies of published research are available on our website: www.acquisitionresearch.org

Acquisition Research Program
Graduate School of Business & Public Policy
Naval Postgraduate School
555 DYER ROAD, INGERSOLL HALL
MONTEREY, CALIFORNIA 93943

www.acquisitionresearch.org


A Semantic-Based Search Engine for Open Architecture Requirements Documents

Dr. Craig Martell
Associate Professor, Naval Postgraduate School

Our Team

• Paige Adams, LT USN
• Pranav Anand, Assistant Prof, Linguistics, UCSC
• Grant Gehrke, ENS USN
• Ralucca Gera, Assistant Prof, Mathematics, NPS
• Marco Draeger, CPT German Army
• Craig Martell, Associate Prof, Computer Science, NPS
• Kevin Squire, Assistant Prof, Computer Science, NPS

The ReSEARCH Project: What we’re up to

• Using open-sourced components, we want to design a semantic search engine for requirements documents that supports the SHARE repository.
• We need to match over the meaning of a requirement, not a question or a query string.
• Do processing to “enrich” both the query and the documents with semantic information.
• Automatically augment ontologies with new hypernymy (“is a”) and mereology (“is a part of”) relations. That is, do we have to be told that a Hummer is a vehicle and one of its parts is a steering wheel, or can we discover this from the text? (See the Hearst-pattern sketch below.)
• Etc.
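Hearst (1992), cited in the references below, showed that “is a” relations like these can be harvested from raw text with simple lexico-syntactic patterns. The following is a minimal illustrative sketch of that idea, not ReSEARCH code; the pattern set, helper names, and the sample sentence are all invented for the example.

```python
import re

# Two of Hearst's (1992) lexico-syntactic patterns for hypernym discovery,
# restricted to single-word noun phrases to keep the sketch short:
#   "NP_h such as NP1, NP2 and NP3"  =>  each NPi is-a NP_h
#   "NP_h, including NP1 and NP2"    =>  each NPi is-a NP_h
TAIL = r"((?:\w+,\s*)*\w+(?:\s+and\s+\w+)?)"
PATTERNS = [
    re.compile(r"(\w+)\s+such\s+as\s+" + TAIL),
    re.compile(r"(\w+),?\s+including\s+" + TAIL),
]

def extract_is_a(text):
    """Return (hyponym, hypernym) pairs matched by the patterns."""
    pairs = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            hypernym = m.group(1)
            for hyponym in re.split(r",\s*|\s+and\s+", m.group(2)):
                pairs.append((hyponym, hypernym))
    return pairs

print(extract_is_a("The motor pool maintains vehicles such as Hummers, trucks and trailers."))
# -> [('Hummers', 'vehicles'), ('trucks', 'vehicles'), ('trailers', 'vehicles')]
```

The same pattern style extends to the “is a part of” (mereology) relations the project also targets, using cues such as “the Y of the X.”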


Why use semantic search?

• Existing keyword-based search engines do not take into account the semantics of the documents they are searching.

• This is important when trying to find components that do what you need, not what you type.

• The query string and the desired documents may not use the same phrases:

Q: What fuel does the F-22A consume?
A: The F-22A Raptor uses JP-8.

• Query and answer convey the same meaning but use different forms; here, “consume” and “uses” are synonymous.

Prior and current strategies

Keyword Search Model

[Figure: word clusters illustrating the Synonymy Problem (e.g., auto, engine, bonnet, tyres, lorry, boot vs. car, emissions, hood, make, model, trunk) and the Polysemy Problem (e.g., “make,” “model,” and “emissions” appearing in unrelated contexts such as “hidden Markov model” and “emissions normalize”).]


• Brin and Page (1998) revolutionized search by using PageRank, which changes the order in which the pages that match the keywords in the query are returned.

• The essence of the Google innovation is in how the PageRank algorithm works.

PageRank algorithm

The rank of a particular page depends on:
• the number of pages pointing to it,
• the rank of each page pointing to it,
• the number of outgoing links on each of those pages.

• The PageRank metric PR(P) recursively defines the rank/importance of each page P by

$$PR(P) \propto \frac{PR(T_1)}{C(T_1)} + \frac{PR(T_2)}{C(T_2)} + \cdots + \frac{PR(T_n)}{C(T_n)}$$

where T_1, T_2, ..., T_n are all the pages pointing to P, and each T_i has C(T_i) outgoing links.

Random Surfer

• To further determine the rank of all web pages, Google simulates the behavior of virtual surfers randomly surfing the web.
• A page’s rank is then updated based on how frequently the random surfers visit that page.
• This pre-existing rank of each individual website is assigned independently of any query.
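Putting the recurrence and the random surfer together, the sketch below is a generic power-iteration implementation of PageRank; the damping factor d is the probability that the surfer follows a link rather than jumping to a random page. The toy graph and parameter values are invented, and this is a textbook formulation, not Google’s production algorithm.

```python
def pagerank(links, d=0.85, iterations=50):
    """Power-iteration PageRank.

    links maps each page to the list of pages it links to.
    With probability 1-d the random surfer jumps to a page
    chosen uniformly at random instead of following a link.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += d * rank[page] / n
            else:
                share = d * rank[page] / len(outgoing)  # PR(T)/C(T)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy web: D links to everything, everything links back to A.
toy_web = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "D": ["A", "B", "C"]}
for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```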

ExpertRank

Ask.com (Ask Jeeves) uses the ExpertRank algorithm, which:
• uses the number of incoming links as well,
• attempts to identify topic clusters related to the search, and
• finds experts within these topics to “seed” the rank of some websites as “expert” sites.

PageRank, by contrast, has the problem that “correct” is not the same as “highly-ranked.”

Current Online Semantic Search

• Powerset Labs has emerged as a forerunner in online semantic search, using natural language to extract facts from text.
• On 11 May 08, Powerset’s search moved from beta to a public release.
• Currently, Powerset searches only Wikipedia documents, but intends to expand search to the Internet in the future.

Powerset Indexing System

• Powerset’s algorithms are not publicly available, but their behavior can be inferred from publicly available demos:
– Powerset parses documents to extract “factz.”
– “Factz” are generally triples of subject-verb-object.
– Search is performed over these “factz” rather than the full text.

Question Answering

• Keywords such as “When” tell the system how to narrow results.
• “W” words such as “Who” and “When” act as wildcards for matching “factz,” allowing many searches to be matched exactly (see the matching sketch at the end of these question-answering notes).

• Other functional words, such as “From” in the search “Politician from Virginia,” improve results significantly over searching on just the keywords “Politician” and “Virginia.”

• The question-answering task does not align exactly with requirements-document search.
• Requirements documents do not hold “answers” to questions.
• Encoding of facts is, however, useful:
– computationally less demanding,
– efficient use of storage space for the index,
– allows domain-specific constructs of facts to be formulated and recognized in the corpus.
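To make the triple encoding concrete, the sketch below stores some “factz” (adapted from this paper’s own F-22A and zombie examples; the “brains” object is invented) and treats a question word as a wildcard slot. It is our reconstruction of the behavior inferred from the public demos, not Powerset’s code.

```python
# Sample "factz" in subject-verb-object form.
FACTZ = [
    ("F-22A", "uses", "JP-8"),
    ("F-22A's engine", "uses", "JP-8"),
    ("zombies", "devour", "brains"),
]

WILDCARD = None  # a "W" word such as "who" or "when" leaves its slot open

def match(query, factz=FACTZ):
    """Return all triples consistent with the query.

    query is a (subject, verb, object) triple in which any slot may be
    WILDCARD, e.g. ("F-22A", "uses", WILDCARD) for the question
    "What does the F-22A use?".
    """
    return [
        fact for fact in factz
        if all(q is WILDCARD or q == f for q, f in zip(query, fact))
    ]

print(match(("F-22A", "uses", WILDCARD)))  # -> [('F-22A', 'uses', 'JP-8')]
```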

Query Expansion with Synonymy

• The search “What do zombies eat?” suggests that Powerset searches for synonyms of query terms, matching “devour.”
• Additionally, stemming matches the inverted form “eaten by.”

• Stemming of terms is found in most search engines and is fairly easy to perform.
• Matching synonyms allows “close” matches on meaning without requiring an exact keyword match.
• Using a structured ontology, expansion is not limited to synonyms but may be extended to hypernyms, hyponyms, and meronyms as well:
– Ontology-based query expansion does not appear to be used in current Powerset searches.
– This is one of our primary approaches. Research question: Can we automatically augment a given ontology using the text of the documents? (A small expansion sketch follows.)
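As a concrete sketch of the expansion step, the fragment below pulls synonyms, hypernyms, and hyponyms from WordNet via NLTK. WordNet is a stand-in here (the slides do not name the ontology ReSEARCH uses), the five-term cap is an arbitrary readability choice, and the function names are ours.

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def expand_term(term):
    """Collect synonyms, hypernyms, and hyponyms of a term from WordNet."""
    expansions = set()
    for synset in wn.synsets(term):
        for related in [synset] + synset.hypernyms() + synset.hyponyms():
            for lemma in related.lemmas():
                expansions.add(lemma.name().replace("_", " "))
    expansions.discard(term)
    return sorted(expansions)

def expand_query(query):
    """Rewrite each query word as an OR-group of itself plus expansions."""
    parts = []
    for word in query.lower().split():
        alternatives = [word] + expand_term(word)[:5]  # cap for readability
        parts.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(parts)

print(expand_query("zombies eat"))
# e.g. "(zombies OR zombi ...) AND (eat OR consume OR devour ...)"
```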

Discovering Synonymous Sentences

• Harris (1954): Synonymous words will occur in the same kinds of environments.
• Lin & Pantel (2001): Synonymous sentences will contain the same kinds of words.

The F-22A consumes JP-8.
The F-22A’s engine uses JP-8.

• Idea: construct a sentence similarity metric.


• Sentence similarity is the geometric average of the similarity of the positions in the sentence:
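The formula on the original slide did not survive extraction. In Lin and Pantel’s (2001) notation, which the slide appears to follow, the similarity of two paths (sentence patterns with two argument positions, the X and Y slots) is the geometric mean of the similarities of the corresponding positions:

$$S(p_1, p_2) = \sqrt{\mathit{sim}\big(\mathrm{Slot}_X^{p_1}, \mathrm{Slot}_X^{p_2}\big) \times \mathit{sim}\big(\mathrm{Slot}_Y^{p_1}, \mathrm{Slot}_Y^{p_2}\big)}$$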


• Position similarity is a normalized sum of the pointwise mutual information of all words that appear in both positions of the respective paths:
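Again reconstructed from Lin and Pantel (2001), on the assumption that the slide reproduced their slot-similarity formula: letting T(p, s) be the set of words seen filling slot s of path p, and mi(p, s, w) the pointwise mutual information between the path-slot pair (p, s) and filler word w,

$$\mathit{sim}(\mathrm{slot}_1, \mathrm{slot}_2) = \frac{\sum_{w \in T(p_1,s)\,\cap\,T(p_2,s)} \big(\mathit{mi}(p_1,s,w) + \mathit{mi}(p_2,s,w)\big)}{\sum_{w \in T(p_1,s)} \mathit{mi}(p_1,s,w) + \sum_{w \in T(p_2,s)} \mathit{mi}(p_2,s,w)}$$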

• Lin & Pantel evaluated their system against the TREC-8 Question Answering Task question set.

The ReSEARCH Project: Work for us

• Using open-sourced components, design a semantic search engine for requirements documents that supports the SHARE repository.
• Match over the meaning of a requirement, not a question.
• Do processing to “enrich” both the query and the documents with semantic information.
• Automatically augment ontologies with new hypernymy (“is a”) and mereology (“is a part of”) relations. That is, do we have to be told that a Hummer is a vehicle and one of its parts is a steering wheel, or can we discover this from the text?
• Lots more!

Backup Slides

Building the Index
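The original slide (to judge by its title, a view of the Lucene indexing step) was an image and is lost. As a conceptual stand-in, the sketch below builds the data structure at the heart of Lucene, an inverted index, in plain Python over an invented miniature requirements corpus; real Lucene adds analyzers, stemming, fields, scoring, and on-disk segment files.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Crude stand-in for a Lucene analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_index(documents):
    """Map each term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

# Invented miniature requirements corpus.
docs = {
    "req-001": "The component shall consume no more than 5 W of power.",
    "req-002": "The engine uses JP-8 fuel.",
    "req-003": "The module shall use the common messaging bus.",
}
index = build_index(docs)
print(sorted(index["uses"]))  # -> ['req-002']
```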

A Sample Search with Lucene
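This slide was a screenshot and is likewise lost. As a stand-in, the sketch below runs a plain conjunctive keyword search over a small hand-built inverted index of the kind constructed above; in Lucene’s query-parser syntax the equivalent search would be written along the lines of contents:(consume AND power). All index contents are invented.

```python
# Hand-built inverted index (term -> ids of documents containing it).
INDEX = {
    "consume": {"req-001"},
    "power": {"req-001"},
    "uses": {"req-002"},
    "fuel": {"req-002"},
}

def search(index, query):
    """AND semantics: return ids of documents containing every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

print(search(INDEX, "consume power"))  # -> ['req-001']
```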

Using an Augmented Search String
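This screenshot is lost as well; by its title, it showed the same search rewritten with semantic expansions. The sketch below performs that rewriting with a hand-written synonym table, which is purely illustrative; in ReSEARCH the expansions would come from the ontology work described earlier.

```python
# Hypothetical synonym table; in ReSEARCH this would come from an
# ontology or corpus-derived relations, not a hand-written dict.
SYNONYMS = {
    "consume": ["use", "burn"],
    "fuel": ["propellant"],
}

def augment(query):
    """Build a Lucene-style query string with OR-groups of synonyms."""
    groups = []
    for word in query.lower().split():
        alternatives = [word] + SYNONYMS.get(word, [])
        groups.append("(" + " OR ".join(alternatives) + ")")
    return "contents:(" + " AND ".join(groups) + ")"

print(augment("consume fuel"))
# -> contents:((consume OR use OR burn) AND (fuel OR propellant))
```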

Our Work

GOAL: Design an alternative method to explicitly store/represent semantic metadata in order to enable semantic search.