Google Search Appliance November 2, 2010 Susan Fagan

  • Slide 1

Google Search Appliance
November 2, 2010
Susan Fagan

  • Slide 2

Why Google Search Appliance?

    • A different approach to search at EPA
    • Smarter ranking
    • Improved indexing
    • Easier operations
    • A future
    • We're going to call it GSA from here on in

  • Slide 3

How GSA ranks documents

    • It's a secret, but we know some things
        • Page rank
        • Self learning
    • We can control some things
        • Date biasing
        • Source biasing
        • Metadata biasing
        • Best bets
    • We're going to let it do its thing before we tune it too much

  • Slide 4

How GSA ranks documents: Page Rank

    • Who links to your pages?
    • Who links to pages that link to your pages?
    • How does everybody link?
    • What does it say in the link text?
    • Is the link always the primary URL (because if it isn't, you don't get any points)?
        • A primary URL is a URL that contains no aliases that are not primary.
        • "Primary" is defined by what you put in the TSSMS Alias Tool.

  • Slide 5

How GSA Ranks Documents: Things We Can Control

    • Date biasing
        • Newer is better
        • We control how much better
    • Source biasing
        • Boost or decrease chunks of our website
        • Regions are slightly decreased for Agency search
    • Metadata biasing
        • We control how much each metadata field counts
        • We can turn up the bias as metadata quality improves

  • Slide 6

How GSA Ranks Documents: More Things We Can Control

    • Best Bets
        • Like buying keywords from
        • Specific pages for specific keywords or phrases
        • Always featured at the top
        • Take effect immediately

  • Slide 7

How GSA Indexes Documents

    • Continuous crawl
    • Learns by experience
    • Crawl rates tunable by host and time
    • Requires some starting points (seeds)
    • Restricted by Do Not Crawl list
        • A manually maintained list, in the GSA Admin UI, of URL patterns that the crawler should not visit
    • Respects robots.txt (in its own way)

  • Slide 8

How EPA is implementing GSA

    • Same Java webapp on the same servers
    • Your search form will stay the same
    • Area search won't change much
    • Your XML search application may change (most won't)
    • Smart, fast indexing, with some help
    • Only indexing primary URLs

  • Slide 9

Implementing GSA: Your search form will stay the same

    • Implemented Northern Light via an object-oriented Java application
    • We get to keep our code this time
    • 6 weeks to change it, instead of 6 months
    • Nothing changes for client pages
    • Two Model 7007 Google Search Appliances
        • Primary
        • Hot spare for failover
        • Parallel indexes
    • 2,000,000 document license

  • Slide 10

Implementing GSA: Your search form

    • URL is the same
    • All common elements work the same
    • Some obscure elements go away
        • weighted_search, search_crumbs
    • Custom result templates work the same
    • Advanced search works the same

  • Slide 11

Implementing GSA: Area Search

    • Area search is here for now
    • If you search by TSSMS
        • We will translate it on the fly to URL
        • We will only translate TSSMS to primary alias
    • If you search by URL
        • Nothing changes. But aliases are your problem
    • Contact Peter to test your area search

  • Slide 12

Implementing GSA: Your XML search app

    • Parameters and templates are unchanged
    • GSA response packet automatically transformed to original NL format
    • Only 1,000 results are available for a single query
        • 3 applications have been observed exceeding that limit

  • Slide 13

Implementing GSA: Smart, fast indexing

    • Continuous crawl scans the website at least daily for new links
        • If it's not linked, it won't be found
    • Librarian looks daily for new content
    • If all this doesn't work (quickly), tell the librarian
    • Notes databases do not require Verity Views

  • Slide 14

Implementing GSA: Indexing your primary URL

    • Search engines think different URLs are different documents
        • This means duplicates in search results
    • All non-primary aliases are being placed in the Do Not Crawl list

  • Slide 15

What will our customers see?

    • The same thing. At first.
    • Breadcrumbs are gone (what were they, anyway?)
    • Folders replaced by Related Searches
    • FAQ will come back
    • Best Bets for top documents
    • The document they're looking for!

  • Slide 16

What do we have to do?

    • Plan our November 19 public access implementation
    • Test (with your help)
    • Implement
    • Make it better

  • Slide 17

What do you have to do?

    • Keep working on ROT
    • Keep working on metadata
    • Don't change your search form
    • Area search will work, if you want it
    • Tell us what you think

  • Slide 18

What are we leaving out for now?

    • EPA thesaurus
        • Contains only general terms
        • We will add EPA vocabulary
    • Google's spellchecker
        • We'll use our own for now
        • We'll compare and use the winner
    • RSS presentation delivers only raw XML in search results, for now
    • Recent searches

  • Slide 19

What's in our future?

    • Marketplace of One Box modules
    • Faceted search?
    • Contextual search?
    • Business intelligence?
    • More social media
    • OneEPA integration
    • Web CMS integration
    • Advanced analytics
    • Special collections
    • Geographic search?
    • GSA for intranet

  • Slide 20

Contact: Susan Fagan 202-566-2021
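
The slides on primary URLs (4, 11 and 14) say that non-primary aliases are placed in the Do Not Crawl list and that area searches are translated only to the primary alias. The sketch below illustrates that idea and is not the EPA implementation: the alias table, host names and function names are invented for the example, and it uses plain prefix matching, which may not match the pattern syntax the appliance actually accepts.

```python
# Hypothetical alias table: non-primary alias prefix -> primary URL prefix.
# Hosts and paths are invented; the real mapping lives in the TSSMS Alias Tool.
ALIASES = {
    "http://www.example.gov/air/": "http://www.example.gov/oar/",
    "http://example.gov/oar/": "http://www.example.gov/oar/",
}


def to_primary(url):
    """Rewrite a URL that starts with a known alias prefix to its primary form."""
    for alias, primary in ALIASES.items():
        if url.startswith(alias):
            return primary + url[len(alias):]
    return url  # no alias matched, so treat the URL as already primary


def do_not_crawl_patterns():
    """Each non-primary alias prefix becomes one Do Not Crawl entry."""
    return sorted(ALIASES)


def is_crawlable(url):
    """Assumes simple prefix matching against the Do Not Crawl entries."""
    return not any(url.startswith(p) for p in do_not_crawl_patterns())
```

With this table, `to_primary("http://www.example.gov/air/index.html")` yields the `/oar/` form of the same page, and `is_crawlable` rejects any URL under an alias prefix, so each document is indexed once under its primary URL. The sketch only handles an alias at the start of a URL; a URL containing more than one alias segment would need repeated rewriting.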