Data Mining at Scale(s): Collaborating to Build Sharable Skill Sets and Data Sets

SCOTT WARREN, DOUG DUHAIME, PATRICK WILLIAMS




Who we are

» Doug Duhaime, ProQuest

  » Text and Data Mining Product Manager

» Scott Warren, Syracuse University Libraries

  » Associate Dean for Research and Scholarship

» Patrick Williams, Syracuse University Libraries

  » Librarian for Literature, Communication & Rhetorical Studies, Composition & Cultural Rhetoric, English/Textual Studies


Roles

» Patrick: Work with researcher, identify resources

» Scott: Negotiate, review licenses, allocate funds, process

» Doug: Put together data set, data mine


Why seek data?


Knowledge can be distributed in a collection


Seeing the big picture – developing new insights – new questions

Sebastian Opitz


Researcher questions

» What don’t I know?

» Is this something possibly answered by (enough) data?

» What data?

» Where does it come from?

» What or who is the source of the data?

» Do I have access to this data in a form I can use?


Process questions – library and vendor

» Who owns that data? Who provides that data?

» Is there a way to (meaningfully) define that data?

» Once defined, is there a way to extract that data?

» What sort of costs are attached to that process?

» Are legal issues attached to that process?


We have the data, what next?!?

» Who does the actual mining? Library or researcher?

» Is statistical, analytical, computing, or visualizing help needed? Who provides that?

» Who preserves the data? Library, researcher, or vendor? No one? Re-use would be nice


Broken Pencil

» Earlier request by same faculty member

» PQ does not own perpetual rights to this current title

  » Aggregator title that is licensed, rather than sold

  » Title runs from the 1990s to present

» Ultimately unable to deliver data set

  » Publisher sent all print copies to researcher

  » Old School TDM!!


Bookman

» The Bookman was a monthly magazine published in London from 1891 until 1934 by Hodder & Stoughton.

» Part of PQ’s British Periodicals Collection 1

  » PQ has permanent ownership of this, unlike Broken Pencil

  » Which is why they can pass it on to us

» “It was a catalogue of the current publications that also contained reviews, advertising and illustrations.” -Wikipedia


Why Bookman?

» Need for a very specific type of information

  » From a specialized historical periodical

  » Not so much just a giant dataset

» A periodical to which we already have “access”

  » But the question is not served by our purchased access

  » Access distributed article by article

  » Good for reading, not analyzing at scale


Service

» TDM is a new type of service provision

» Is it sustainable? For libraries? For vendors/publishers?

» The 80/20 rule

  » Built-in applications on platforms for ‘basic’ mining?

  » Or customized data dumps that researchers can explore?


Cost recovery issues for vendors

» How many TDM requests are going on at the same time?

» Impact on other projects, routine maintenance, development?

» Does the data sit in one central database? Or is it distributed? Normalized or not?

» What third party storage or delivery costs are involved in the process?

» Will the size impact delivery mode?

» Is the timeframe reasonable – or realistic?


Cost recovery issues for libraries

» Price based on???

» Number of users likely to be 1, or perhaps a couple

» Libraries generally do not license or purchase for one-off use

  » At least not at scale and not at 4+ figures per transaction

» Content already licensed, so fee is seen as a process fee

  » Rather than a content fee


What did we learn? The SU Libraries…

» How things work on the vendor side. Way harder than thought.

» How PQ is anticipating needs we’re seeing locally

» We need to be thinking about:

  » Preservation & reuse

  » Policies on maintaining “medium data” collections like this

  » Beyond Patrick having it on a thumbdrive in his desk!!

» So far nothing scales – this is all case by case


What did we learn? ProQuest…

» Publishing rights are complex.

  » Two fundamental licenses come into play when discussing data mining:

    » The original publisher’s contract with ProQuest

    » ProQuest’s contract with the university

  » Various national laws

» Legacy software is expensive to maintain.

  » Simple tasks, such as retrieving all files in a given newspaper collection, become incredibly difficult.

» Data mining drives algorithmic analysis against the platform

» Publisher worries: researchers sometimes post data

  » EEBO placed on the Internet Archive

» A solution that meets the needs of researchers, librarians, and publishers is an important task, but not easy


Preserving data - remember Patrick’s thumbdrive?!?


Centralize data


Liblicense model license

» Section 3. Authorized Users and Uses

» Clause J. Text and Data Mining.

» Authorized Users may use the Licensed Materials to perform and engage in text and/or data mining activities for academic research, scholarship, and other educational purposes, utilize and share the results of text and/or data mining in their scholarly work, and make the results available for use by others, so long as the purpose is not to create a product for use by third parties that would substitute for the Licensed Materials. Licensor will cooperate with Licensee and Authorized Users as reasonably necessary in making the Licensed Materials available in a manner and form most useful to the Authorized User. If Licensee or Authorized Users request the Licensor to deliver or otherwise prepare copies of the Licensed Materials for text and data mining purposes, any fees charged by Licensor shall be solely for preparing and delivering such copies on a time and materials basis.


Some further thoughts

» Association of Research Libraries issue brief, June 2015

» Text and Data Mining and Fair Use in the United States, by Krista L. Cox, Director of Public Policy Initiatives

» http://www.arl.org/storage/documents/TDM-5JUNE2015.pdf
