DESCRIPTION
At the DLF Forum in 2010 we gave a general presentation about search engine optimization for digital repositories. In that presentation we revealed some new and surprising information about Google Scholar (GS) harvesting requirements and how they affect institutional repositories’ visibility in the GS index. We learned, for instance, that the Webmaster Inclusion Guidelines for Google Scholar caution us to “use Dublin Core only as a last resort” for metadata tags. One reason for this instruction is that Dublin Core cannot represent publication citation information very well. We have also learned that getting an item indexed in Google Scholar results in a higher ranking for that same item in Google’s main index. Working with OCLC, we have continued to research SEO practices for Google Scholar as well as for the main Google index, and that research has resulted in a book contract with Neal-Schuman. We also gave a similar presentation at CNI last spring: http://content.lib.utah.edu/u?/ir-main,60502. In this year’s research update we offer a solid set of practices that can be applied broadly to institutional repositories to improve the percentage of items that are indexed by Google Scholar.
Invisible Institutional Repositories: Addressing the Low Indexing Ratio of IRs in Google Scholar by Transforming Metadata Schema
Kenning Arlitsch & Patrick OBrien
October 31, 2011
2011 Fall DLF, Baltimore, MD

Today’s Objectives
- Discuss Marriott Library SEO program
  - Program Priorities & Results
  - Issues & Opportunity
  - Google Scholar
Marriott Library SEO program priorities
- Digital repositories vs. general websites
  - Millions of objects in databases
  - Include IR
- Priority 1 – Increase Reach
  - Get objects indexed in search engines
- Priority 2 – Increase Visibility
  - Provide robust descriptive content
Collection Google Index Ratios have increased across the board…
[Figure: Google Index Ratio - All Collections*. Bar chart of the High** and Average index ratios on 07/05/10, 04/04/11, and 10/16/11, on a 0% to 100% axis. Data labels in extraction order: 100%, 74%, 87%, 51%, 37%, 12%.]
* Google Index Ratio = URLs indexed by Google / URLs submitted, for about 150 collections containing ~170,00 URLs
** Highest index ratio achieved for collections with over 500 URLs submitted to Google
…increasing Google referrals by 200% and total visitors by 79% (12-week year-over-year).
However, Google Scholar Index Ratios??
Google Scholar Index Ratio: 0%
You can find Marriott IR papers in Google now, but cannot find them in Google Scholar. Why?
College Students Begin Research - 2005
DeRosa, Cathy, et al. “Perceptions of Libraries, 2010: Context and Community: A Report to the OCLC Membership.” OCLC, 2010.
College Students Begin Research - 2010
Start with the 800-pound gorilla – Google.
Marriott Library Management Experiences
- Large digital collections built over a decade
  - 1.3+ million items
- Why weren’t we getting indexed?
  - Harvesting/indexing rates as low as 8%
  - Non-existent IR showing in Google Scholar
- Sitemaps generated for Google
MWDL Repositories Survey: % w/ Indirect URL (October 2010)
[Figure: horizontal bar chart, 0% to 100%, one bar per repository. Bar labels in extraction order: Utah State Library; University of Nevada, Las Vegas; Health Education Assets Library; Weber State University; Utah Valley University; Utah State University; Utah State Archives; Utah State University; Brigham Young University; Southern Utah University; University of Utah; University of Nevada, Reno; Utah Digital Newspapers Repository.]
MWDL Repositories Survey: % w/ Direct URL (October 2010)
[Figure: horizontal bar chart, 0% to 100%, one bar per repository. Bar labels in extraction order: Utah Digital Newspapers Repository; Utah State Archives; Utah State Library; Southern Utah University; Health Education Assets Library; Weber State University; Brigham Young University; Utah Valley University; University of Nevada, Las Vegas; Utah State University; University of Utah; Utah State University; University of Nevada, Reno.]
Literature Lessons
- Most are dated
- Most deal with general websites
- Few deal with digital collections in databases
- Some suggest duplicating the content outside the database
Why does Google Scholar Matter??
- “researchers find Google and Google Scholar to be amazingly effective” and accept the results as “good enough in many cases” (Kroll & Forsman 2010)
- “broader awareness of specialized Google tools such as Google Scholar and Google Book among faculty members and graduate students” (Rieger 2009)
- “the amount of qualified scholarly content has increased considerably in Google Scholar since it was launched in 2004” (Mikki 2009)
- 4% - 27% use increase in four-year U Miss study (Herrera 2010)
USpace IR Google Index Ratios baseline
[Figure: Google Index Ratio by USpace collection, 0% to 100% axis; legend dates 07/05/10, 11/19/10, 10/16/11. Collections in chart order: Board of Regents, UScholar Works, ETD 2, ETD 1. Data labels in chart order: 4%, 23%, 0%, 12%.]
*Weighted Average Google Index Ratio = 18.33% (1,188/6,482)
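The weighted average in the footnote pools the counts (total indexed URLs over total submitted URLs) rather than averaging the per-collection percentages. A minimal Python sketch of that arithmetic follows; the function names are illustrative, and only the two totals printed on the slide are used because the per-collection counts are not given.

```python
def index_ratio(indexed, submitted):
    """Share of submitted URLs that the search engine reports as indexed."""
    if submitted <= 0:
        raise ValueError("submitted must be positive")
    return indexed / submitted


def weighted_average_ratio(collections):
    """Pool (indexed, submitted) pairs so that large collections count for more."""
    total_indexed = sum(indexed for indexed, _ in collections)
    total_submitted = sum(submitted for _, submitted in collections)
    return index_ratio(total_indexed, total_submitted)


# Totals from the slide footnote: 1,188 indexed out of 6,482 submitted.
baseline = index_ratio(1188, 6482)
print(f"{baseline:.2%}")  # 18.33%
```

Pooling matters here because the four collections differ in size, so a simple mean of the four percentages would generally not reproduce the footnote's figure.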
USpace IR Google Index Ratios baseline (continued)
[Figure: the same baseline chart with the Google Scholar Index Ratio added: 0%.]
*Weighted Average Google Index Ratio = 18.33% (1,188/6,482)
Low GS indexing ratios cut across institutions
[Figure: Google Scholar Indexing Ratio for Selected Institutional and Disciplinary Repositories, October 2011. Horizontal bar chart, 0% to 100%.]
UW - ResearchWorks Archive: 3%
Univ of Rochester Research: 6%
CaltechAuthors: 10%
D-Scholarship@Pitt: 12%
Columbia Univ - Academic Commons: 13%
IU Scholarworks: 13%
Texas A&M Repository: 16%
UW Madison - Minds@UW: 17%
eCommons@Cornell: 18%
Harvard Univ - DASH: 28%
Univ of Oregon - Scholars Bank: 29%
Michigan - Deep Blue: 34%
BYU Scholars Archive: 34%
IUPUI Scholar: 38%
Cornell - Digital Commons@ILR: 40%
Cornell - arXiv: 47%
Aquatic Commons: 56%
Virginia Tech - CS Tech Reports: 60%
Digital Commons@UNLincoln: 60%
Baylor U - BearDocs: 89%
Survey Methodology Key Points
- Selected from OpenDOAR
  - Only IRs from the U.S.
    - “Pure” institutional or disciplinary repositories
  - Different software types
    - DSpace, Digital Commons, EPrints, IR+, CONTENTdm, DigiTool, arXiv
- Calculated total items in each repository
- Site operator search
  - site:repositoryURL
  - Shows approximation
GS “site” operator provides a close approximation for indexing ratio
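The survey calculation can be sketched in a few lines of Python: build a `site:` query for the repository, read the approximate result count from Google Scholar by hand, and divide by the repository's item count. The search URL pattern and the function names are assumptions made for illustration, and the sketch deliberately makes no network request.

```python
from urllib.parse import urlencode


def site_query_url(repository_url):
    """Build a Google Scholar search URL restricted to one repository."""
    query = urlencode({"q": f"site:{repository_url}"})
    return "https://scholar.google.com/scholar?" + query


def indexing_ratio(items_in_scholar, repository_items):
    """Approximate share of a repository's items that Google Scholar reports."""
    return items_in_scholar / repository_items


# Figures from the survey table: Harvard Univ - DASH, 1,710 of 6,193 items.
print(site_query_url("dash.harvard.edu"))
print(f"{indexing_ratio(1710, 6193):.0%}")  # 28%
```

Because the result count returned for a `site:` query is itself an estimate, the ratio is an approximation, as the slide says.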
Repository software does not appear to be the deciding factor
Repository Name | Repository Software | Repository URL | Repository Items | Items in Google Scholar | Indexing Ratio
Boston College - eScholarship@BC | DigiTool | dcollections.bc.edu | 1,635 | 1 | 0%
UW - ResearchWorks Archive | DSpace | digital.lib.washington.edu/dspace | 11,285 | 304 | 3%
Univ of Rochester Research | IR+ | urresearch.rochester.edu | 16,184 | 983 | 6%
CaltechAuthors | EPrints | authors.library.caltech.edu | 22,000 | 2,290 | 10%
D-Scholarship@Pitt | EPrints | d-scholarship.pitt.edu | 5,888 | 686 | 12%
Columbia Univ - Academic Commons | Digital Commons | academiccommons.columbia.edu | 4,631 | 586 | 13%
IU Scholarworks | DSpace | scholarworks.iu.edu/dspace | 7,782 | 1,030 | 13%
Texas A&M Repository | DSpace | repository.tamu.edu | 46,324 | 7,250 | 16%
UW Madison - Minds@UW | DSpace | minds.wisconsin.edu | 15,078 | 2,520 | 17%
eCommons@Cornell | DSpace | ecommons.library.cornell.edu | 18,544 | 3,410 | 18%
Harvard Univ - DASH | DSpace | dash.harvard.edu | 6,193 | 1,710 | 28%
Univ of Oregon - Scholars Bank | DSpace | scholarsbank.uoregon.edu/xmlui | 9,740 | 2,840 | 29%
Michigan - Deep Blue | DSpace | deepblue.lib.umich.edu | 66,038 | 22,200 | 34%
BYU Scholars Archive | CONTENTdm | scholarsarchive.lib.byu.edu | 7,421 | 2,520 | 34%
IUPUI Scholar | DSpace | scholarworks.iupui.edu | 2,109 | 800 | 38%
Cornell - Digital Commons@ILR | Digital Commons | digitalcommons.ilr.cornell.edu | 14,669 | 5,880 | 40%
Cornell - arXiv | Other (arXiv) | arxiv.org | 706,906 | 330,000 | 47%
Aquatic Commons | EPrints | aquaticcommons.org | 5,722 | 3,230 | 56%
Virginia Tech - CS Tech Reports | EPrints | eprints.cs.vt.edu | 983 | 586 | 60%
Digital Commons@UNLincoln | Digital Commons | digitalcommons.unl.edu | 50,657 | 30,200 | 60%
Baylor U - BearDocs | DSpace | beardocs.baylor.edu | 928 | 829 | 89%
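One way to check the claim that software is not the deciding factor is to group the survey rows by platform and look at the spread of ratios within each group. The sketch below does that in Python with the counts copied from the table; the function name and the choice of min/max as the summary are illustrative, not the presenters' method.

```python
from collections import defaultdict

# (software, repository items, items in Google Scholar), copied from the survey table.
SURVEY = [
    ("DigiTool", 1635, 1), ("DSpace", 11285, 304), ("IR+", 16184, 983),
    ("EPrints", 22000, 2290), ("EPrints", 5888, 686),
    ("Digital Commons", 4631, 586), ("DSpace", 7782, 1030),
    ("DSpace", 46324, 7250), ("DSpace", 15078, 2520), ("DSpace", 18544, 3410),
    ("DSpace", 6193, 1710), ("DSpace", 9740, 2840), ("DSpace", 66038, 22200),
    ("CONTENTdm", 7421, 2520), ("DSpace", 2109, 800),
    ("Digital Commons", 14669, 5880), ("Other (arXiv)", 706906, 330000),
    ("EPrints", 5722, 3230), ("EPrints", 983, 586),
    ("Digital Commons", 50657, 30200), ("DSpace", 928, 829),
]


def ratio_range_by_software(rows):
    """Map each software platform to the (lowest, highest) indexing ratio observed."""
    ratios = defaultdict(list)
    for software, items, in_scholar in rows:
        ratios[software].append(in_scholar / items)
    return {name: (min(values), max(values)) for name, values in ratios.items()}


for name, (low, high) in sorted(ratio_range_by_software(SURVEY).items()):
    print(f"{name}: {low:.0%} to {high:.0%}")
```

The ranges within DSpace, EPrints, and Digital Commons each span tens of percentage points, which is consistent with the slide's reading that platform alone does not explain the ratio.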
Google Scholar wants the right metadata tags used consistently and accurately.
“Use Dublin Core tags (e.g., DC.title) as a last resort - they work poorly for journal papers...”
- Google Scholar Inclusion Guidelines for Webmasters
“… there's a good chance that many of your papers aren't included at all, because documents with the same title are often considered duplicates.”
- Google Scholar Inclusion Guidelines for Webmasters
“… incorrect identification of references could lead to exclusion of your papers from Google Scholar or to low ranking of your papers in the search results.”
- Google Scholar Inclusion Guidelines for Webmasters
“… the most common cause of indexing problems is incorrect extraction of bibliographic data by the automated parser software.”
- Google Scholar Inclusion Guidelines for Webmasters
The challenge is presenting bibliographic citations that GS can identify, parse, and digest
[Screenshot of an IR record page, 10/31/11: content.lib.utah.edu/cdm4/document.php?CISOROOT=/ir-main&CISOPTR=824&REC=3]
Title: Thanks for nothing: changes in income and labor force participation for never-married mothers since 1982
University of Utah creator: Wolfinger, Nicholas H.
Other Creator: McKeever, Matthew
Subject.Keyword: Motherhood; Single Mothers; Income; Population surveys;
Subject.LCSH: Single mothers; Income
Description: This study examines whether the changing social and economic characteristics of women who give birth out of wedlock have led to higher family incomes. Using Current Population Survey data collected between 1982 and 2002, we find that never-married mothers remain poor. They have made modest economic gains, but these have disproportionately occurred at the top of the income distribution. Yet there is no evidence of a burgeoning class of "Murphy Browns," middle-class professional women who give birth out of wedlock. Surprisingly, never-married mothers' incomes have stagnated in spite of impressive gains in education and other personal and vocational characteristics that should have resulted in greater economic progress than has been the case. These gains cast doubt on various stereotypes about women who give birth out of wedlock.
Publisher: University of Utah
Date.Original: 2006-07-26
Type: Text
Format.Extent: 370,155 Bytes
Format.Medium: application/pdf
Resource Identifier: ir-main,824
Language: eng
Series: Institute of Public and International Affairs Working Papers
Relation: McKeever, M. & Wolfinger, N.H. (2006). Thanks for Nothing: Changes in Income and Labor Force Participation for Never-Married Mothers since 1982. Institute of Public & International Affairs (IPIA), 4, 1-43.
Rights Management: (c) Matthew McKeever and Nicholas H. Wolfinger
Research Institute: Institute of Public and International Affairs (IPIA)
Department: Family & Consumer Studies; Sociology
School / College: College of Social & Behavioral Science
Contributing Institution: University of Utah
Publication Type: working paper
The first step was to begin aligning Highwire Press tags with existing Dublin Core fields
Google Scholar HTML speak
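As a rough illustration of what such markup looks like, the Python sketch below renders Highwire Press style `citation_*` fields as HTML `<meta>` tags. The field values come from the working-paper example in this deck; the function name is hypothetical, and emitting one `citation_author` tag per author reflects a reading of Google Scholar's guidelines rather than anything shown on the slides.

```python
from html import escape


def highwire_meta_tags(fields):
    """Render (name, content) pairs as HTML <meta> tags, skipping empty values."""
    tags = []
    for name, content in fields:
        if content:
            tags.append(f'<meta name="{escape(name)}" content="{escape(content)}">')
    return "\n".join(tags)


working_paper = [
    ("citation_title", "Thanks for nothing: changes in income and labor force "
                       "participation for never-married mothers since 1982"),
    ("citation_author", "Wolfinger, Nicholas H."),
    ("citation_author", "McKeever, Matthew"),
    ("citation_date", "2006-07-26"),
    ("citation_technical_report_institution",
     "Institute of Public & International Affairs (IPIA), University of Utah"),
    ("citation_language", "en"),
    ("citation_doi", ""),  # left blank on the slide, so no tag is emitted
]

html_head = highwire_meta_tags(working_paper)
print(html_head)
```

Escaping the content matters for values such as the institution name, where a bare ampersand would otherwise produce invalid HTML.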
Google Scholar Pilot 1 tested importance of metadata model
- 6,482 URLs in Sitemaps submitted via Google Webmaster Tools.
- Errors generated during Google crawls were analyzed and addressed.
- Updated & corrected metadata for 20 pilot articles
  - Ensured full-text PDF met GS inclusion guideline requirements.
  - Provided a “landing page,” per GS inclusion guidelines, within a few clicks of the home page and containing links to the 20 IR pilot papers.
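For readers unfamiliar with the Sitemaps step in this pilot, the sketch below builds a minimal sitemaps.org `<urlset>` document with Python's standard library. It is an illustration under assumptions, not the tooling used in the pilot: the function name is hypothetical, and the two URLs are taken from elsewhere in this deck (the `http://` scheme on the first is assumed).

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemaps.org <urlset> document with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url  # ElementTree escapes "&" on output
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    "http://content.lib.utah.edu/cdm4/document.php?CISOROOT=/ir-main&CISOPTR=824&REC=3",
    "http://content.lib.utah.edu/u?/ir-main,60502",
])
print(sitemap_xml)
```

A single sitemap file may list up to 50,000 URLs under the sitemaps.org protocol, so the 6,482 URLs mentioned above would fit in one file.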
USpace IR Google Index Ratios increased
[Figure: Google Index Ratio by USpace collection (Board of Regents, UScholar Works, ETD 2, ETD 1) on 07/05/10, 11/19/10, and 10/16/11, 0% to 100% axis. Data labels in extraction order: 97%, 98%, 98%, 97%; 47%, 51%, 68%, 69%; 4%, 23%, 0%, 12%.]
*October 16, 2011 Weighted Average Google Index Ratio = 97.82% (10,306/10,536).
USpace IR Google Index Ratios increased (continued)
[Figure: the same chart with the Google Scholar Index Ratio added: 0%.]
*October 16, 2011 Weighted Average Google Index Ratio = 97.82% (10,306/10,536).
GS Pilot 2 Utilized OCLC’s relationship with Google Scholar
- 19 papers in GS Pilot 2
  - 6 of 7 GS paper types represented
  - 19 full-text PDFs
- Augmented CONTENTdm v.6
  - Highwire Press meta tags
  - Browse By Year
  - Recently Added
  - College & Department
Google Scholar Index Ratio: 62%
A Pre-Print Author Manuscript is not the Journal Article.
Meta Tag | Pre-Print | Journal Article
1 - citation_author | Maloney, Krisellen; Antelman, Kristin; Arlitsch, Kenning; Butler, John | Maloney, Krisellen; Antelman, Kristin; Arlitsch, Kenning; Butler, John
2 - citation_date | 2009 | 2010
3 - citation_title | Future leaders' views on organizational culture | Future leaders' views on organizational culture
4 - citation_publisher | N/A | Association of College & Research Libraries
5 - citation_journal_title | N/A | College and Research Libraries
6 - citation_volume | (blank) | 71
7 - citation_issue | (blank) | 4
8 - citation_firstpage | 1 | 322
9 - citation_lastpage | 56 | 347
10 - citation_doi | (blank) | (blank)
11 - citation_issn | (blank) | (blank)
12 - citation_isbn | (blank) | (blank)
13 - citation_keywords | Organizational culture | Organizational culture
16 - citation_technical_report_institution | USpace Institutional Repository, University of Utah | N/A
17 - citation_technical_report_number | (blank) | N/A
18 - citation_language | en | en
21 - citation_pdf_url | http://cdm6gs.lib.utah.edu/utils/getfile/collection/uspace/id/10/filename/3.pdf | http://cdm6gs.lib.utah.edu/utils/getfile/collection/uspace/id/16/filename/17.pdf
22 - citation_abstract_html_url | http://cdm6gs.lib.utah.edu/cdm/singleitem/collection/uspace/id/10/rec/1 | http://cdm6gs.lib.utah.edu/cdm/singleitem/collection/uspace/id/16/rec/2
Not Relevant: 14 - citation_dissertation_institution; 15 - citation_dissertation_name; 19 - citation_conference_title; 20 - citation_inbook_title
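Since Google Scholar may treat documents with the same title as duplicates, it helps to see which tags actually distinguish the two versions. The Python sketch below compares the two records from this slide and lists the tags that differ; the function name is illustrative, and placing volume 71 and issue 4 on the journal article only is an assumption drawn from the slide layout.

```python
PREPRINT = {
    "citation_title": "Future leaders' views on organizational culture",
    "citation_date": "2009",
    "citation_firstpage": "1",
    "citation_lastpage": "56",
    "citation_technical_report_institution":
        "USpace Institutional Repository, University of Utah",
}

JOURNAL_ARTICLE = {
    "citation_title": "Future leaders' views on organizational culture",
    "citation_date": "2010",
    "citation_publisher": "Association of College & Research Libraries",
    "citation_journal_title": "College and Research Libraries",
    "citation_volume": "71",  # assumed to belong to the journal article only
    "citation_issue": "4",    # assumed to belong to the journal article only
    "citation_firstpage": "322",
    "citation_lastpage": "347",
}


def differing_tags(first, second):
    """Tags whose values differ; a tag present on only one side counts as a difference."""
    names = first.keys() | second.keys()
    return sorted(name for name in names if first.get(name) != second.get(name))


print(differing_tags(PREPRINT, JOURNAL_ARTICLE))
```

The title does not appear in the result, which is the point of the slide: the date, pagination, and publication-venue tags carry the distinction between the two versions.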
A minor nuance is the difference between Books and Book Chapters
Meta Tag | Book Chapter | Book
1 - citation_author | Riloff, Ellen M. | Ram, Ashwin
2 - citation_date | 1999 | 1999
3 - citation_title | Information extraction as a stepping stone toward story understanding | Understanding Language: Understanding Computational Models of Reading
4 - citation_publisher | MIT Press | MIT Press
8 - citation_firstpage | 435 | 1
9 - citation_lastpage | 460 | 519
12 - citation_isbn | 0-262-18192-4 | 0-262-18192-4
13 - citation_keywords | Information extraction; Story understanding; | Information extraction; Story understanding;
18 - citation_language | en | en
20 - citation_inbook_title | Understanding Language: Understanding Computational Models of Reading | N/A
21 - citation_pdf_url | http://cdm6gs.lib.utah.edu/utils/getfile/collection/uspace/id/9/filename/5.pdf | (blank)
22 - citation_abstract_html_url | http://cdm6gs.lib.utah.edu/cdm/singleitem/collection/uspace/id/9/rec/1 | (blank)
Not Relevant: 5 - citation_journal_title; 6 - citation_volume; 7 - citation_issue; 10 - citation_doi; 11 - citation_issn; 14 - citation_dissertation_institution; 15 - citation_dissertation_name; 16 - citation_technical_report_institution; 17 - citation_technical_report_number; 19 - citation_conference_title
ETDs use very different metadata tags
Meta Tag | PhD | Masters
1 - citation_author | Rague, Brian William | Wu, Shangduan
2 - citation_date | 2010/08 | 2010/07
3 - citation_title | A CS1 pedagogical approach to parallel thinking | Electronic structure and transport property of disordered graphene
8 - citation_firstpage | 1 | 1
9 - citation_lastpage | 234 | 84
13 - citation_keywords | Computer; CS1; Education; Parallel; Programming; | Disorder; Electronic structure; Graphene; Transport property; Electronic structure;
14 - citation_dissertation_institution | University of Utah, College of Engineering | University of Utah, College of Science
15 - citation_dissertation_name | PhD | MS
18 - citation_language | en | en
21 - citation_pdf_url | http://cdm6gs.lib.utah.edu/utils/getfile/collection/uspace/id/5/filename/19.pdf | http://cdm6gs.lib.utah.edu/utils/getfile/collection/uspace/id/0/filename/4.pdf
22 - citation_abstract_html_url | http://cdm6gs.lib.utah.edu/cdm/singleitem/collection/uspace/id/5/rec/1 | http://cdm6gs.lib.utah.edu/cdm/singleitem/collection/uspace/id/0/rec/1
Not Relevant: 4 - citation_publisher; 5 - citation_journal_title; 6 - citation_volume; 7 - citation_issue; 10 - citation_doi; 11 - citation_issn; 12 - citation_isbn; 16 - citation_technical_report_institution; 17 - citation_technical_report_number; 19 - citation_conference_title; 20 - citation_inbook_title
Working papers have a unique combination of metadata tags.
Meta Tag | Working Paper
1 - citation_author | Wolfinger, Nicholas H.; McKeever, Matthew
2 - citation_date | 2006-07-26
3 - citation_title | Thanks for nothing: changes in income and labor force participation for never-married mothers since 1982
6 - citation_volume | (blank)
7 - citation_issue | (blank)
8 - citation_firstpage | 1
9 - citation_lastpage | 43
10 - citation_doi | (blank)
13 - citation_keywords | Motherhood; Single Mothers; Income; Population surveys;
16 - citation_technical_report_institution | Institute of Public & International Affairs (IPIA), University of Utah
17 - citation_technical_report_number | 2006-07-04
18 - citation_language | en
19 - citation_conference_title | 101st American Sociological Association (ASA) Annual Meeting; 2006 Aug 11-14; Montreal, Canada
21 - citation_pdf_url | http://cdm6gs.lib.utah.edu/utils/getfile/collection/uspace/id/7/filename/21.pdf
22 - citation_abstract_html_url | http://cdm6gs.lib.utah.edu/cdm/singleitem/collection/uspace/id/7/rec/1
Not Relevant: 4 - citation_publisher; 5 - citation_journal_title; 11 - citation_issn; 12 - citation_isbn; 14 - citation_dissertation_institution; 15 - citation_dissertation_name; 20 - citation_inbook_title
Conference Articles may or may not have published proceedings
Meta Tag | Conference Article
1 - citation_author | Balasubramonian, Rajeev; Awasthi, Manu; Sudan, Kshitij; Carter, John
2 - citation_date | 2009/02/14
3 - citation_title | Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches
4 - citation_publisher | Institute of Electrical and Electronics Engineers (IEEE)
5 - citation_journal_title | High Performance Computer Architecture, 2009. HPCA 2009. IEEE 15th International Symposium on
6 - citation_volume | (blank)
7 - citation_issue | (blank)
8 - citation_firstpage | 250
9 - citation_lastpage | 261
10 - citation_doi | 10.1109/HPCA.2009.4798260
11 - citation_issn | 1530-0897
12 - citation_isbn | 978-1-4244-2932-5
13 - citation_keywords | Page coloring; Shadow-memory addresses; Cache capacity allocation; Data/page migration
18 - citation_language | en
19 - citation_conference_title | 15th International Symposium on High Performance Computer Architecture (HPCA-15 2009) [14-18 Feb. 2009, Raleigh, NC, USA]
21 - citation_pdf_url | http://cdm6gs.lib.utah.edu/utils/getfile/collection/uspace/id/1/filename/11.pdf
22 - citation_abstract_html_url | http://cdm6gs.lib.utah.edu/cdm/ref/collection/uspace/id/1
Not Relevant: 14 - citation_dissertation_institution; 15 - citation_dissertation_name; 16 - citation_technical_report_institution; 17 - citation_technical_report_number; 20 - citation_inbook_title
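The "Not Relevant" lists on the preceding slides can be turned into a simple check that flags populated tags that do not belong to a given paper type. The Python sketch below encodes those lists as given; the paper-type keys and the function name are hypothetical labels chosen for illustration.

```python
# Tags the slides mark as "Not Relevant" for each paper type.
NOT_RELEVANT = {
    "journal_article": {
        "citation_dissertation_institution", "citation_dissertation_name",
        "citation_conference_title", "citation_inbook_title",
    },
    "book_or_chapter": {
        "citation_journal_title", "citation_volume", "citation_issue",
        "citation_doi", "citation_issn", "citation_dissertation_institution",
        "citation_dissertation_name", "citation_technical_report_institution",
        "citation_technical_report_number", "citation_conference_title",
    },
    "etd": {
        "citation_publisher", "citation_journal_title", "citation_volume",
        "citation_issue", "citation_doi", "citation_issn", "citation_isbn",
        "citation_technical_report_institution",
        "citation_technical_report_number", "citation_conference_title",
        "citation_inbook_title",
    },
    "working_paper": {
        "citation_publisher", "citation_journal_title", "citation_issn",
        "citation_isbn", "citation_dissertation_institution",
        "citation_dissertation_name", "citation_inbook_title",
    },
    "conference_article": {
        "citation_dissertation_institution", "citation_dissertation_name",
        "citation_technical_report_institution",
        "citation_technical_report_number", "citation_inbook_title",
    },
}


def misplaced_tags(paper_type, tags):
    """Return populated tags that are marked not relevant for this paper type."""
    excluded = NOT_RELEVANT[paper_type]
    return sorted(name for name, value in tags.items() if value and name in excluded)


record = {
    "citation_author": "Rague, Brian William",
    "citation_dissertation_name": "PhD",
    "citation_publisher": "University of Utah",  # hypothetical stray value
}
print(misplaced_tags("etd", record))  # ['citation_publisher']
```

A check like this could run before tags are written to a page, so that a dissertation record does not pick up a publisher or journal tag carried over from a Dublin Core field.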
Questions?
Kenning Arlitsch [email protected] Patrick OBrien www.RevXcorp.com [email protected] 805.509.2586