Page 1: Google Search Appliance November 2, 2010 Susan Fagan

Google Search Appliance

November 2, 2010

Susan Fagan

Why Google Search Appliance?

• A different approach to search at EPA

• Smarter ranking

• Improved indexing

• Easier operations

• A future

We’re going to call it GSA from here on in

How GSA ranks documents

• It’s a secret, but we know some things

– PageRank

– Self learning

• We can control some things

– Date biasing

– Source biasing

– Metadata biasing

– Best bets

• We’re going to let it do its thing before we tune it too much

How GSA ranks documents: PageRank

• Who links to your pages?

• Who links to pages that link to your pages?

• How does everybody link?

– What does it say in the link text?

– Is the link always the primary URL (because if it isn’t, you don’t get any points)?

A primary URL is a URL that contains no non-primary aliases. "Primary" is defined by what you put in the TSSMS Alias Tool.
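
The link questions above can be illustrated with a toy ranking by power iteration. This is only a sketch of the publicly described PageRank idea, not the GSA's actual (secret) ranking. The page names, the damping factor of 0.85, and the function name are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by power iteration (illustrative only).

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical site: nobody links to "orphan", so it ranks lowest.
links = {
    "home": ["water", "air"],
    "water": ["home"],
    "air": ["home", "water"],
    "orphan": ["home"],
}
ranks = pagerank(links)
```

In this toy graph, "home" collects links from every other page and ends up on top, while "orphan" keeps only the baseline share.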

How GSA Ranks Documents: Things We Can Control

• Date biasing

– Newer is better

– We control how much better

• Source biasing

– Boost or decrease chunks of our website

– Regions are slightly decreased for Agency search

• Metadata biasing

– We control how much each metadata field counts

– We can turn up the bias as metadata quality improves
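
One way to picture date and source biasing is as multipliers applied to a base relevance score. The GSA's real formulas are not public, so this is a sketch under stated assumptions: the function name, the one-year freshness window, the weights, and the example.gov URLs are all hypothetical.

```python
from datetime import date


def biased_score(base, last_modified, url, today,
                 date_weight=0.2, source_bias=None):
    """Apply an assumed date bias and source bias to a base score."""
    # Date bias: newer is better; date_weight controls how much better.
    age_days = max((today - last_modified).days, 0)
    freshness = max(0.0, 1.0 - age_days / 365.0)
    date_factor = 1.0 + date_weight * freshness

    # Source bias: the longest matching URL prefix sets the multiplier.
    source_factor = 1.0
    best_len = -1
    for prefix, factor in (source_bias or {}).items():
        if url.startswith(prefix) and len(prefix) > best_len:
            source_factor, best_len = factor, len(prefix)

    return base * date_factor * source_factor


today = date(2010, 11, 2)
bias = {"http://www.example.gov/region": 0.9}  # regions slightly decreased

fresh = biased_score(1.0, date(2010, 10, 1),
                     "http://www.example.gov/water/", today, source_bias=bias)
stale = biased_score(1.0, date(2005, 1, 1),
                     "http://www.example.gov/water/", today, source_bias=bias)
regional = biased_score(1.0, date(2010, 10, 1),
                        "http://www.example.gov/region5/", today, source_bias=bias)
```

Raising date_weight makes recency count for more. A metadata bias could be added as a third multiplier in the same way.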

How GSA Ranks Documents: More Things We Can Control

• Best Bets

– Like buying keywords from Google.com

– Specific pages for specific keywords or phrases

– Always featured at the top

– Take effect immediately
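
Conceptually, Best Bets behave like a lookup table consulted before the ranked results are shown. The sketch below assumes exact matching on a normalized query. The table contents, URLs, and function name are hypothetical and are not taken from the GSA configuration.

```python
def apply_best_bets(query, organic_results, best_bets):
    """Put any Best Bets for the query ahead of the ranked results."""
    # Normalize case and whitespace so "  Radon " matches "radon".
    key = " ".join(query.lower().split())
    featured = best_bets.get(key, [])
    # Avoid listing a featured page a second time further down.
    rest = [url for url in organic_results if url not in featured]
    return featured + rest


# Hypothetical Best Bets table and ranked results.
best_bets = {"radon": ["http://www.example.gov/radon/"]}
organic = [
    "http://www.example.gov/news/radon-week.html",
    "http://www.example.gov/radon/",
]
results = apply_best_bets("  Radon ", organic, best_bets)
```

Because the table is consulted at query time, a new entry shows up on the next search without waiting for a crawl.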

How GSA Indexes Documents

• Continuous crawl

• Learns by experience

• Crawl rates tunable by host and time

• Requires some starting points (seeds)

• Restricted by Do Not Crawl list

A manually maintained list, in the GSA Admin UI, of URL patterns that the crawler should not visit.

• Respects robots.txt (in its own way)
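
The two crawl restrictions above can be sketched as a single check. This uses Python's standard robots.txt parser and treats Do Not Crawl entries as plain URL prefixes. That is a simplifying assumption: the GSA has its own pattern syntax and its own reading of robots.txt. The URLs and the crawler name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for the site being crawled.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical Do Not Crawl entries, simplified to URL prefixes.
DO_NOT_CRAWL = ["http://www.example.gov/old/"]


def may_crawl(url, robots, do_not_crawl, user_agent="example-crawler"):
    """Return True only if neither restriction blocks the URL."""
    if any(url.startswith(prefix) for prefix in do_not_crawl):
        return False
    return robots.can_fetch(user_agent, url)


robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

allowed = may_crawl("http://www.example.gov/water/index.html", robots, DO_NOT_CRAWL)
blocked_by_list = may_crawl("http://www.example.gov/old/page.html", robots, DO_NOT_CRAWL)
blocked_by_robots = may_crawl("http://www.example.gov/private/x.html", robots, DO_NOT_CRAWL)
```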

How EPA is implementing GSA

• Same Java webapp on the same servers

• Your search form will stay the same

• Area search won’t change much

• Your XML search application may change (most won’t)

• Smart, fast indexing, with some help

• Only indexing primary URLs

Implementing GSA: Your search form will stay the same

• Implemented Northern Light via an object-oriented Java application

– We get to keep our code this time

– 6 weeks to change it, instead of 6 months

– Nothing changes for client pages

• Two Model 7007 Google Search Appliances

– Primary

– Hot spare for failover

– Parallel indexes

• 2,000,000 document license

Implementing GSA: Your search form

• URL is the same

• All common elements work the same

• Some obscure elements go away

– weighted_search, search_crumbs

• Custom result templates work the same

• Advanced search works the same

Implementing GSA: Area Search

• Area search is here for now

• If you search by TSSMS

– We will translate it on the fly to URL

– We will only translate TSSMS to primary alias

• If you search by URL

– Nothing changes …

– … but aliases are your problem

• Contact Peter to test your area search

Implementing GSA: Your XML search app

• Parameters and templates are unchanged

• GSA response packet automatically transformed to original NL format

• Only 1,000 results are available for a single query

• Three applications have been observed exceeding that limit
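
An application that pages through results can stay inside the 1,000-result ceiling by clamping its requests. This sketch assumes an offset-and-count paging scheme and a page size of 100. The function name and the page size are illustrative and are not taken from the EPA webapp.

```python
def page_requests(total_wanted, page_size=100, limit=1000):
    """Return (start, count) pairs that never reach past the limit."""
    capped = min(total_wanted, limit)
    pages = []
    start = 0
    while start < capped:
        count = min(page_size, capped - start)
        pages.append((start, count))
        start += count
    return pages


small = page_requests(250)    # fits under the limit
large = page_requests(2500)   # silently capped at 1,000 results
```

An application that truly needs more than 1,000 documents would have to narrow its query (for example by date range or site area) and issue several smaller searches.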

Implementing GSA: Smart, fast indexing

• Continuous crawl – scans the website at least daily for new links

• If it’s not linked, it won’t be found

• Librarian looks daily for new content

• If all this doesn’t work (quickly), tell the librarian

• Notes databases do not require Verity Views

Implementing GSA: Indexing your primary URL

• Search engines think different URLs are different

documents

• This means duplicates in search results

• All non-primary aliases are being placed in the Do Not Crawl list
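
The duplicate problem can be shown with a small alias table. The sketch below rewrites alias URLs to their primary form and then drops repeats. The alias table, the URLs, and the function names are hypothetical. The real mapping lives in the TSSMS Alias Tool.

```python
def to_primary(url, alias_to_primary):
    """Rewrite a URL that starts with a known alias to its primary form."""
    for alias, primary in alias_to_primary.items():
        if url.startswith(alias):
            return primary + url[len(alias):]
    return url


def dedupe_by_primary(urls, alias_to_primary):
    """Keep the first occurrence of each document, keyed by primary URL."""
    seen = set()
    unique = []
    for url in urls:
        primary = to_primary(url, alias_to_primary)
        if primary not in seen:
            seen.add(primary)
            unique.append(primary)
    return unique


# Hypothetical alias table: two URL prefixes that serve the same pages.
aliases = {"http://www.example.gov/h2o/": "http://www.example.gov/water/"}
hits = [
    "http://www.example.gov/water/index.html",
    "http://www.example.gov/h2o/index.html",  # same page via an alias
    "http://www.example.gov/air/index.html",
]
unique_hits = dedupe_by_primary(hits, aliases)
```

Putting the non-primary aliases on the Do Not Crawl list achieves the same result earlier, because the duplicate URLs never enter the index.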

What will our customers see?

• The same thing … at first.

• Breadcrumbs are gone … what were they, anyway?

• Folders replaced by Related Searches

• FAQ will come back

• Best Bets for top documents

• The document they’re looking for!

What do we have to do?

• Plan our November 19 public access implementation

• Test (with your help)

• Implement

• Make it better

What do you have to do?

• Keep working on ROT

• Keep working on metadata

• Don’t change your search form…

• … Area search will work, if you want it

• Tell us what you think

What are we leaving out … for now?

• EPA thesaurus

– Contains only general terms

– We will add EPA vocabulary

• Google’s spellchecker

– We’ll use our own for now

– We’ll compare and use the winner

• RSS presentation – delivers only raw XML in search results, for now

• Recent searches

What’s in our future?

• Marketplace of One Box modules

– Faceted search?

– Contextual search?

– Business intelligence?

• More social media

• OneEPA integration

• Web CMS integration

• Advanced analytics

• Special collections

• Geographic search?

• GSA for intranet

Contact:

Susan [email protected]

202-566-2021
