20130412 Productivity of the crowd [acrl indianapolis]

Page 1: 20130412 Productivity of the crowd [acrl indianapolis]

Productivity of the crowd

Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.

Slides @ http://bit.ly/crowdsourceacrl2013

Frederick Zarndt, Chair, IFLA Newspapers Section

CCS / Digital Divide Data / DL Consulting

@cowboyMontana, #crowdsourceacrl2013

[email protected]

Brian Geiger, Director, Center for Bibliographic Studies and Research

[email protected]

Page 2: 20130412 Productivity of the crowd [acrl indianapolis]

News

Page 3: 20130412 Productivity of the crowd [acrl indianapolis]

Crowds

Page 4: 20130412 Productivity of the crowd [acrl indianapolis]

News + Crowds

Page 5: 20130412 Productivity of the crowd [acrl indianapolis]

Demographics

“British volunteers for "Kitchener's Army" waiting for their pay in the churchyard of St. Martin-in-the-Fields, Trafalgar Square, London” Public domain photo from Imperial War Museum

Page 6: 20130412 Productivity of the crowd [acrl indianapolis]

purpose / motive / reason

50%

Page 7: 20130412 Productivity of the crowd [acrl indianapolis]

purpose / motive / reason

50%

Page 8: 20130412 Productivity of the crowd [acrl indianapolis]

purpose / motive / reason

72%

Page 9: 20130412 Productivity of the crowd [acrl indianapolis]

purpose / motive / reason

80%

Page 10: 20130412 Productivity of the crowd [acrl indianapolis]

purpose / motive / reason

67%

Page 11: 20130412 Productivity of the crowd [acrl indianapolis]

?

Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.

Page 12: 20130412 Productivity of the crowd [acrl indianapolis]

age

50%

Page 13: 20130412 Productivity of the crowd [acrl indianapolis]

age

80%

Page 14: 20130412 Productivity of the crowd [acrl indianapolis]

age

67%

Page 15: 20130412 Productivity of the crowd [acrl indianapolis]

?

Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.

Page 16: 20130412 Productivity of the crowd [acrl indianapolis]

User Demographic

genealogists and family historians 50+ years old

• In 2012 the National Library of Australia reported that ~50% of Trove users are family historians

• National Library of New Zealand survey found that ~50% of PapersPast users are genealogists

• A 2013 California Digital Newspaper Collection survey shows that more than 65% of its users are genealogists; 75% are 50 years old or older

• A 2012 Utah Digital Newspapers survey showed that 72% of its users are genealogists*

• A 2013 Cambridge Public Library survey shows that more than 80% of its users are genealogists; 73% are 50 years old or older

*John Herbert and Randy Olsen. “Small town papers: Still delivering the news”. Paper given at 2012 World Library and Information Congress. Helsinki. August 2012.

Page 17: 20130412 Productivity of the crowd [acrl indianapolis]

Deaths. lln»rieff, Esq. of <c .. Qn. Sunday, the till. greatly Drandrellt, of Orms4\irJi.- ~ ; ;✓ ' • * On ijfr r inn l j j j i l F i i j ' 1 1 f H a v o d i v y d , Carnarvonshire, S ; **" *- ' « ' March Oxford, F. Tfovmeud, Uerald. » • V . •On Tncsdav last , Mr. Charles. IWilinson, this 8 ; had vf thesis#,, a week ago, which tcrminate<i'iu his death. . / ' ■ O'i Sunday, dJst nit. at. A s b t C n v H a l l , m a r L a n c a s t e r , Mr.,Geo. Worn ick , many years house'steward hit late Once The Hamilton and Brandon. He locked himself h»oWn'r«wte<: soon. twelve o'clock" that dny, and fii»-d a loaded pistol "through Ins bead, 1 which instantaneously killed him. Coronet's Verdict, shot himself in a temporary fit of Friday week,

raw OCR text

Excerpt from The British Newspaper Archive, Chester Courant, Tuesday 6-Apr-1819, page 3.

newspaper image

Page 18: 20130412 Productivity of the crowd [acrl indianapolis]

Edwin Klijn (Koninklijke Bibliotheek, the Netherlands) reports raw OCR character accuracies of 68% for early 20th century newspapers

Rose Holley (National Library of Australia) reports that raw OCR character accuracy varied from 71% to 98% on a sample of Trove digitized newspapers

Rose Holley. “How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs.” D-Lib Magazine. March/April 2009.

Edwin Klijn. “The current state-of-art in newspaper digitization.” D-Lib Magazine. January/February 2008.

Graphic is logo for Accuracy in Media (http://www.aim.org/)

Public domain graphic images courtesy of Wikimedia Commons.

Page 19: 20130412 Productivity of the crowd [acrl indianapolis]

Crowdsourcing is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, and especially from an online community, rather than from traditional employees or suppliers. ... [It] is different from ordinary outsourcing since it is a task or problem that is outsourced to an undefined public rather than a specific, named group.

Wikipedia contributors, "Crowdsourcing," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Crowdsourcing (accessed March 17, 2013)

Page 20: 20130412 Productivity of the crowd [acrl indianapolis]

Graphic from Kaufmann et al. “More than fun and money. Worker Motivation in Crowdsourcing – A Study on Mechanical Turk.”

Motivation

Page 21: 20130412 Productivity of the crowd [acrl indianapolis]

You can make a difference

Graphic courtesy of TYPEinspire (http://typeinspire.com/)

Page 22: 20130412 Productivity of the crowd [acrl indianapolis]

User  Lines corrected
1     242,965
2     87,515
3     31,318
4     24,144
5     23,184
6     19,240
7     18,898
8     16,875
9     11,784
10    9,762

Lines corrected  User
1,456,906        1
1,385,369        2
1,010,360        3
960,230          4
847,340          5
786,147          6
657,187          7
600,513          8
582,276          9
565,384          10

Statistics from Oct 2012

Page 23: 20130412 Productivity of the crowd [acrl indianapolis]

uncorrected OCR accuracy by newspaper title

Title                                                  Years        Raw character accuracy  ~Raw word accuracy*
PRP  Pacific Rural Press                               1871 - 1922  92.6%                   68.1%
SFC  San Francisco Call                                1890 - 1913  92.6%                   68.1%
LAH  Los Angeles Herald                                1873 - 1910  88.7%                   54.9%
LH   Livermore Herald                                  1877 - 1899  88.6%                   54.6%
DAC  Daily Alta California                             1841 - 1891  88.2%                   53.4%
CFJ  California Farmer and Journal of Useful Sciences  1855 - 1880  86.5%                   48.4%
SN   Sausalito News                                    1885 - 1922  70.4%                   17.3%

*Word accuracy assumes average word length is 5 characters
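
The footnote's approximation can be reproduced with a short sketch. It assumes, as the slide does, an average word length of 5 characters, and additionally assumes that character errors are independent of each other, so that a word is correct only if all of its characters are. The function name `word_accuracy` is illustrative and not taken from the slides.

```python
def word_accuracy(char_accuracy: float, word_length: int = 5) -> float:
    """Estimate the share of words in which every character is correct.

    Assumption of this sketch: character errors are independent, so the
    probability that all characters are right is char_accuracy ** word_length.
    """
    return char_accuracy ** word_length


# Raw character accuracies by title, copied from the table above.
raw_char_accuracy = {
    "PRP": 0.926,
    "SFC": 0.926,
    "LAH": 0.887,
    "LH": 0.886,
    "DAC": 0.882,
    "CFJ": 0.865,
    "SN": 0.704,
}

for title, accuracy in raw_char_accuracy.items():
    print(f"{title}: {word_accuracy(accuracy):.1%}")
```

Under this model, small drops in character accuracy compound quickly: a title at 70.4% character accuracy has fewer than one in five words fully correct.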

Page 24: 20130412 Productivity of the crowd [acrl indianapolis]

corrected OCR accuracy by newspaper title

Title                                                  Years        Raw character accuracy  Corrected accuracy
PRP  Pacific Rural Press                               1871 - 1922  92.6%                   99.3%
SFC  San Francisco Call                                1890 - 1913  92.6%                   99.6%
LAH  Los Angeles Herald                                1873 - 1910  88.7%                   99.1%
LH   Livermore Herald                                  1877 - 1899  88.6%                   99.9%
DAC  Daily Alta California                             1841 - 1891  88.2%                   99.9%
CFJ  California Farmer and Journal of Useful Sciences  1855 - 1880  86.5%                   98.3%
SN   Sausalito News                                    1885 - 1922  70.4%                   100.0%

Page 25: 20130412 Productivity of the crowd [acrl indianapolis]

corrected OCR accuracy by newspaper title

Title  Years        Raw character accuracy  ~Raw word accuracy*  Corrected accuracy  ~Corrected word accuracy*
PRP    1871 - 1922  92.6%                   68.1%                99.3%               96.5%
SFC    1890 - 1913  92.6%                   68.1%                99.6%               98.0%
LAH    1873 - 1910  88.7%                   54.9%                99.1%               95.6%
LH     1877 - 1899  88.6%                   54.6%                99.9%               99.5%
DAC    1841 - 1891  88.2%                   53.4%                99.9%               99.5%
CFJ    1855 - 1880  86.5%                   48.4%                98.3%               91.8%
SN     1885 - 1922  70.4%                   17.3%                100.0%              100.0%

*Word accuracy assumes average word length is 5 characters

Page 26: 20130412 Productivity of the crowd [acrl indianapolis]

correction accuracy by user

User  Average uncorrected text accuracy  Average corrected text accuracy
A     70.4%                              100.0%
B     87.1%                              99.5%
C     95.4%                              99.5%
D     86.5%                              98.3%
E     95.3%                              100.0%
F     91.0%                              100.0%
G     91.0%                              99.8%
H     90.5%                              99.0%
I     96.6%                              99.8%
J     94.8%                              100.0%
K     86.8%                              99.3%

Page 27: 20130412 Productivity of the crowd [acrl indianapolis]

Crowdsourcing benefits

Public domain photo courtesy of US Navy

Page 28: 20130412 Productivity of the crowd [acrl indianapolis]

Economics

Financial value of outsourced OCR text correction for newspapers?

The Assumptions

• 25 to 50 characters per line in a newspaper column: assume 40 characters per line (CDNC sample average)

• Outsourced text transcription or correction costs USD $0.35 to $1.20 per 1000 characters: assume $0.50 per 1000 characters

Page 29: 20130412 Productivity of the crowd [acrl indianapolis]

Economics

• CDNC: 578,000 lines x 40 characters per line x 1/1000 x $0.50 = $11,560

• Trove: 68,908,757 lines x 40 characters per line x 1/1000 x $0.50 = $1,378,175
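
As a rough check, the outsourcing estimate can be written as a small function. This is a sketch under the slide's stated assumptions (40 characters per line, USD $0.50 per 1000 characters); the function and parameter names are illustrative.

```python
def outsourced_cost(lines: int,
                    chars_per_line: int = 40,
                    usd_per_1000_chars: float = 0.50) -> float:
    """Estimated cost in USD of paying a vendor to correct `lines` lines."""
    return lines * chars_per_line / 1000 * usd_per_1000_chars


# Line counts taken from the slide.
for lines in (578_000, 68_908_757):
    print(f"{lines:,} lines -> ${outsourced_cost(lines):,.0f}")
```

Passing `usd_per_1000_chars=0.35` or `usd_per_1000_chars=1.20` gives the low and high ends of the quoted price range.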

Page 30: 20130412 Productivity of the crowd [acrl indianapolis]

Economics

Financial value of in-house OCR text correction?

The Assumptions

• Correction takes 15 seconds per line

• Cost is hourly wage plus benefits of lowest level employee, $10 for CDNC, $41.88* for Australia

*AUD $40.38 = USD $41.88 is the actual labor value assumed by the National Library of Australia to calculate avoided costs due to crowdsourced OCR text correction in its 2012 Trove Status Report.

Page 31: 20130412 Productivity of the crowd [acrl indianapolis]

Economics

• CDNC: 578,000 lines x 15 seconds per line x 1/3600 hrs per second x $10.00 per hr = $24,083

• Trove: 68,908,757 lines x 15 seconds per line x 1/3600 hrs per second x $41.88 per hr = $12,024,578
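
The in-house estimate follows the same pattern: convert lines to staff hours, then multiply by an hourly rate. The sketch below uses the slide's assumptions (15 seconds per line, $10.00 or $41.88 per hour); the names are illustrative.

```python
def in_house_cost(lines: int,
                  seconds_per_line: float = 15,
                  hourly_rate_usd: float = 10.00) -> float:
    """Estimated cost in USD of staff correcting `lines` lines of OCR text."""
    hours = lines * seconds_per_line / 3600  # 3600 seconds per hour
    return hours * hourly_rate_usd


print(f"${in_house_cost(578_000):,.0f}")                             # $10.00 per hour
print(f"${in_house_cost(68_908_757, hourly_rate_usd=41.88):,.0f}")   # $41.88 per hour
```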

Page 32: 20130412 Productivity of the crowd [acrl indianapolis]

Accuracy

“His Accuracy Depends on Ours!” Office for Emergency Management. Office of War Information. Domestic Operations Branch. Bureau of Special Services. [Photo held at US National Archives and Records Administration]

Page 33: 20130412 Productivity of the crowd [acrl indianapolis]

How does low text accuracy affect search recall?

The Facts

Average uncorrected OCR character accuracy of the CDNC data is ~89%

Average length of an English word is 5 characters

Average word accuracy is 89% x 89% x 89% x 89% x 89% = 55.8%; round up to 60%, or 6 out of 10 words correct

Accuracy

Public domain graphic images courtesy of Wikimedia Commons.

Page 34: 20130412 Productivity of the crowd [acrl indianapolis]

Search recall no text correction

[Graphic: ten instances of “ARNDT”, each marked as found or not found]

Legend: instances of “ARNDT” found / instances of “ARNDT” not found

Image © Nevit Dilmen found at Wikimedia Commons

Public domain graphic images courtesy of Wikimedia Commons.

Page 35: 20130412 Productivity of the crowd [acrl indianapolis]

Accuracy

The Facts

Average corrected character accuracy of the CDNC data is ~99.4%

Average word accuracy of the CDNC corrected text is 99.4% x 99.4% x 99.4% x 99.4% x 99.4% = 97.0%

Public domain graphic images courtesy of Wikimedia Commons.

Page 36: 20130412 Productivity of the crowd [acrl indianapolis]

Search recall with text correction

[Graphic: ten instances of “ARNDT”, each marked as found or not found]

Legend: instances of “ARNDT” found / instances of “ARNDT” not found

Image © Nevit Dilmen found at Wikimedia Commons

Public domain graphic image courtesy of Wikimedia Commons.

Page 37: 20130412 Productivity of the crowd [acrl indianapolis]

Accuracy

A search for my grandmother’s maiden name “Arndt” gives 11,154 results*

* Search performed 8 April 2013

Public domain graphic image courtesy of Wikimedia Commons.

Page 38: 20130412 Productivity of the crowd [acrl indianapolis]

A search for my grandmother’s maiden name “Arndt” gives 11,154 results*

If text accuracy is 55.8% (same as uncorrected CDNC sample), then 8,835 instances of “Arndt” were not found

Accuracy

* Search performed 8 April 2013

Public domain graphic images courtesy of Wikimedia Commons.

Page 39: 20130412 Productivity of the crowd [acrl indianapolis]

A search for my grandmother’s maiden name “Arndt” gives 11,154 results*

If text accuracy is 55.8% (same as uncorrected CDNC sample), then 8,835 instances of “Arndt” were not found

If text accuracy is 97.0%, then 345 instances of “Arndt” were not found
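
The missed-result figures can be reproduced with a short sketch. It assumes that the results returned are the correctly recognized share of all true occurrences, so the estimated total is the number found divided by the word accuracy. The function name `missed_results` is illustrative.

```python
def missed_results(found: int, word_accuracy: float) -> float:
    """Estimate how many true occurrences a search failed to return.

    Assumption of this sketch: `found` equals word_accuracy times the
    true number of occurrences, so total = found / word_accuracy.
    """
    if not 0 < word_accuracy <= 1:
        raise ValueError("word_accuracy must be in (0, 1]")
    estimated_total = found / word_accuracy
    return estimated_total - found


found = 11_154  # search results for "Arndt" reported on the slide
print(round(missed_results(found, 0.558)))  # uncorrected text: 8835
print(round(missed_results(found, 0.970)))  # corrected text: 345
```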

Accuracy

* Search performed 8 April 2013

Public domain graphic images courtesy of Wikimedia Commons.

Page 40: 20130412 Productivity of the crowd [acrl indianapolis]

Suppose the name is longer than 5 characters?

The Facts

Assume that average uncorrected / corrected OCR character accuracy is ~89% / ~99%, same as CDNC.

Accuracy

Name        Name length  Raw text accuracy  Corrected text accuracy
Eklund      6            49.7%              94.2%
Kennedy     7            44.2%              93.2%
Espinosa    8            39.4%              92.3%
Bonaparte   9            35.0%              91.4%
Chatterjee  10           31.2%              90.4%
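
The same per-character model extends to names of any length: raise the character accuracy to the power of the name's length. The sketch below assumes independent character errors and the slide's ~89% and ~99% character accuracies; its output may differ from the slide in the last digit because of rounding.

```python
def name_accuracy(name: str, char_accuracy: float) -> float:
    """Probability that every character of `name` was recognized correctly,
    assuming independent character errors (an assumption of this sketch)."""
    return char_accuracy ** len(name)


RAW, CORRECTED = 0.89, 0.99  # character accuracies assumed on the slide

for name in ("Eklund", "Kennedy", "Espinosa", "Bonaparte", "Chatterjee"):
    raw = name_accuracy(name, RAW)
    corrected = name_accuracy(name, CORRECTED)
    print(f"{name:<10} {len(name):>2}  {raw:.1%}  {corrected:.1%}")
```

The longer the name, the larger the gap between raw and corrected text, which is why surname searches gain the most from correction.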

Public domain graphic images courtesy of Wikimedia Commons.

Page 41: 20130412 Productivity of the crowd [acrl indianapolis]

Accuracy

Name        Number of search results  Missing results with raw text accuracy  Missing results with corrected text accuracy
Eklund      2,951                     2,987                                   182
Kennedy     360,723                   455,392                                 26,111
Espinosa    1,918                     2,950                                   160
Bonaparte   44,664                    82,947                                  4,203
Chatterjee  19                        42                                      2

Searches done 19-Mar-2013 (6,025,474 pages from 1836 to 1922).

Public domain graphic images courtesy of Wikimedia Commons.

Page 42: 20130412 Productivity of the crowd [acrl indianapolis]

Hard-to-measure-but-shouldn’t-be-overlooked

benefits

Public domain photo “A useful instruction for young sailors from the Royal Hospital School, Greenwich” from the National Maritime Museum.

Page 43: 20130412 Productivity of the crowd [acrl indianapolis]

“when someone transcribes a document, they are actually better fulfilling the mission of a cultural heritage organization than someone who simply stops by to flip through the pages”

HTMBSBO benefit

Paraphrased from Trevor Owen’s Crowdstorming blog http://crowdstorming.wordpress.com/

Page 44: 20130412 Productivity of the crowd [acrl indianapolis]

“in addition to increasing search accuracy or lowering the costs of document transcription, crowdsourcing is the single greatest advancement in getting people using and interacting with library collections”

Paraphrased from Trevor Owen’s Crowdstorming blog http://crowdstorming.wordpress.com/

HTMBSBO benefit

Page 45: 20130412 Productivity of the crowd [acrl indianapolis]

Cognitive surplus

... people are learning to use their free time for creative activities rather than consumptive ones [such as watching TV] ...

... the total human cognitive effort in creating all of Wikipedia in every language is about one hundred million hours ...

... Americans alone watch two hundred billion hours of TV every year, or enough time, if it would be devoted to projects similar to Wikipedia, to create about 2000 of them ...

Clay Shirky. Cognitive surplus: Creativity and generosity in a connected age. Penguin Press. New York. 2010.

Page 46: 20130412 Productivity of the crowd [acrl indianapolis]

Conclusion of the Sonata for piano #32, opus 111 by Ludwig van Beethoven

Page 47: 20130412 Productivity of the crowd [acrl indianapolis]

?

Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.

Slides @ http://bit.ly/crowdsourceacrl2013

Frederick Zarndt, Chair, IFLA Newspapers Section

CCS / Digital Divide Data / DL Consulting

@cowboyMontana, #crowdsourceacrl2013

[email protected]

Brian Geiger, Director, Center for Bibliographic Studies and Research

[email protected]

Page 48: 20130412 Productivity of the crowd [acrl indianapolis]

Correct California newspapers at http://cdnc.ucr.edu

Correct Cambridge MA newspapers http://bit.ly/cambridgepublic

Correct Australian newspapers http://trove.nla.gov.au

Correct Tennessee newspapers http://tndp.lib.utk.edu

Correct Virginia newspapers http://virginiachronicle.com

Try crowdsourcing!

Page 49: 20130412 Productivity of the crowd [acrl indianapolis]

Correct Vietnamese newspapers http://bit.ly/nationallibraryofvietnam

Hãy thử crowdsourcing! (Vietnamese: “Try crowdsourcing!”)

Or try Russian language periodicals http://bit.ly/russianperiodicals

Попробуйте краудсорсинга! (Russian: “Try crowdsourcing!”)

Or try Finnish newspapers http://digi.lib.helsinki.fi/sanomalehti

Kokeile crowdsourcing! (Finnish: “Try crowdsourcing!”)

Page 50: 20130412 Productivity of the crowd [acrl indianapolis]

Graphic from Kaufmann et al. “More than fun and money. Worker Motivation in Crowdsourcing – A Study on Mechanical Turk.”

Motivation

Page 51: 20130412 Productivity of the crowd [acrl indianapolis]

• “I enjoy the correction - it’s a great way to learn more about past history and things of interest whilst doing a ‘service to the community’ by correcting text for the benefit of others.”

• “I have recently retired from IT and thought that I could be of some assistance to the project. It benefits me and other people. It helps with family research.”

From Rose Holley in “Many Hands Make Light Work.” National Library of Australia March 2009.

Motivation: Trove users’ report

Page 52: 20130412 Productivity of the crowd [acrl indianapolis]

“I am interested in all kinds of history. I have pursued genealogy as a hobby for many years. I correct text at CDNC because I see it as a constructive way to contribute to a worthwhile project. Because I am interested in history, I enjoy it.”

Wesley, California

Personal communications with CDNC text correctors.

Motivation: CDNC users’ report

Page 53: 20130412 Productivity of the crowd [acrl indianapolis]

“I only correct the text on articles of local interest - nothing at state, national or international level, no advertisements, etc. The objective is to be able to help researchers to locate local people, places, organizations and events using the on-line search at CDNC. I correct local news & gossip, personal items, real estate transactions, superior court proceedings, county and local board of supervisors meetings, obituaries, birth notices, marriages, yachting news, etc.”

Ann, California

Personal communications with CDNC text correctors.

Motivation: CDNC users’ report

Page 54: 20130412 Productivity of the crowd [acrl indianapolis]

“I am correcting text for the Coronado Tent City Program for 1903. It is important to correct any problems with personal names and other information so that researchers will be able to search by keyword and be assured of retrieving desired results. ... type fonts cause a great deal of difficulty in digitizing the text and can cause problems for searchers. Also, many of the guests' names at Tent City and Hotel Del Coronado were taken from the registration books and reported in the Program. This led to many problems in spelling of last names and the editors were not careful to be consistent in the spellings. This Program is an important resource since it provides an excellent picture of daily life in Tent City and captures much of the history of Coronado itself.”

Gene, California

Personal communications with CDNC text correctors.

Motivation: CDNC users’ report

Page 55: 20130412 Productivity of the crowd [acrl indianapolis]

“I have always been interested in history, especially the development of the American West, and nothing brings it alive better than newspapers of the time. I believe them to be an invaluable source of knowledge for us and future generations.”

David, United Kingdom

Personal communications with CDNC text correctors.

Motivation: CDNC users’ report

Page 56: 20130412 Productivity of the crowd [acrl indianapolis]

“CDNC is an excellent source of information matching my personal interest in such topics as sea history, development of shipbuilding, clippers and other ships etc. ... Unfortunately, the quality of text ... is rather poor I’m afraid. This is why I started to do all corrections necessary for myself ... and to leave the corrected text for use of others. ... I am not doing this very regularly as this is just my hobby and pleasure.”

Jerzey, Poland

Personal communications with CDNC text correctors.

Motivation: CDNC users’ report

Page 58: 20130412 Productivity of the crowd [acrl indianapolis]

As of 17-Mar-2013 the National Library of Australia’s (http://trove.nla.gov.au/) Alexa Internet traffic rank is 14,490 (global) / 330 (Australia). Trove gets ~75% of all National Library web traffic.

Page 59: 20130412 Productivity of the crowd [acrl indianapolis]

National Library of Australia

Statistics from private communication with the National Library of Australia Oct 2012

• Online since 2008

• 8,000,000+ pages

• Top text corrector 1,772,090 lines

• 2,400,000+ lines corrected each month (average for Mar 2012 to Mar 2013)

• 90,489,875 lines corrected as of Mar 2013, up from 61,682,883 lines corrected Mar 2012

• 88,935 total registered users

• 8,743 active users

Page 60: 20130412 Productivity of the crowd [acrl indianapolis]

As of 17-Mar-2013 National Library of Finland’s (http://www.nationallibrary.fi/) Alexa Internet global traffic rank is 4,303,901. Its Internet traffic rank for Finland was 199 as of 2-Apr-2012.

Page 61: 20130412 Productivity of the crowd [acrl indianapolis]

National Library of Finland

• Digitalkoot is a project to improve OCR text in digitized newspapers -- by playing games!

• Digitalkoot is a collaboration between the National Library and Microtask

• Players correct OCR text by playing Myyräsillassa (Mole Bridge) or Myyräjahdissa (Mole Hunt)

• National Library has 4,000,000+ digitized pages

• 109,321 registered players (October 2012)

• Since February 2011 8,024,530 micro-tasks have been completed

Page 62: 20130412 Productivity of the crowd [acrl indianapolis]

As of 17-Mar-2013 UC Riverside’s Alexa Internet traffic rank is 11,782 (global) / 4,120 (USA). CDNC gets ~3.30% of all UC Riverside web traffic.

Page 63: 20130412 Productivity of the crowd [acrl indianapolis]

California Digital Newspaper Collection

• CDNC began digitizing newspapers in 2005 as part of the Library of Congress National Digital Newspapers Program (NDNP)

• Newspapers digitized to article-level in addition to page-level as required by NDNP (same as Utah Digital Newspapers)

• Since 2009 hosted on Veridian at http://cdnc.ucr.edu

• Collection size 55,970 issues, 495,175 pages, 5,658,224 articles, 498,000,000+ lines (Mar 2013)

Page 64: 20130412 Productivity of the crowd [acrl indianapolis]

OCR text correction

• OCR text correction added August 2011

• Corrections are done line by line

• ~578,000+ lines of text corrected Oct 2012

• ~935,398+ lines of text corrected Mar 2013

• ~2% of the collection corrected, 98% to go!

• Top corrector 327,244 lines > 2x 2nd corrector

Page 65: 20130412 Productivity of the crowd [acrl indianapolis]
Page 66: 20130412 Productivity of the crowd [acrl indianapolis]

Cambridge Public Library Historic Newspaper Collection

• Cambridge Historic Newspapers online since Jan 2012.

• Cambridge Massachusetts Public Library digitized local newspapers (http://cambridge.dlconsulting.com/)

• Newspapers digitized to article-level

• Collection size 6,346 issues, 59,070 pages, 669,406 articles (Mar-2013)

• Collection includes 13,099 obituary cards