Assessing a human mediated
current awareness service
International Symposium of Information Science (ISI 2015)
Zadar, 2015-05-20
Zeljko Carevic1, Thomas Krichel2 and Philipp Mayr1
Outline
1. Introduction
2. RePEc and NEP
3. Results
3.1 Editing time
3.2 Indicators for report success
3.3 Editing effort
4. Conclusion and Outlook
Slide 2 / 31
Motivation
• Thomas Krichel, the founder of
RePEc, visited GESIS – Cologne
in Oct. 2014
• Sharing his Russian souvenir
• ~100 GB of XML log files
1. Introduction
• Current awareness in digital libraries
– To inform users / subscribers about new / relevant acquisitions in their libraries [1].
• Current awareness services allow subscribers to keep up to date with new additions in a certain area of research.
• Selection of relevant documents can be done (semi-)automatically or manually.
• In this work we focus on the intellectual (manual) editing process.
• Aim of this work:
How do editors work when creating a subject-specific report in Digital Libraries (DL)?
2. Use case: RePEc
• RePEc (Research Papers in Economics) is a DL for working papers in economics research.
• Covers metadata for working papers and journal articles.
• Document metadata usually contains links to full texts.
2. RePEc statistics
[Figure: number of documents per year, 1996 to 2016 (x-axis: Year; y-axis: Number of documents)]
Contributing archives: ~1,700
Documents: 1.77 million
Full-text documents: 1.63 million
Registered authors: ~45,000
Abstract views (April 2015): >2 million
2. Current awareness service NEP
• NEP (New Economics Papers) is a current awareness service for
new additions in RePEc.
• NEP covers subject-specific reports from over 90 fields, for example:
– Business, Economic and Financial History
– Public Economics
– Social Norms and Social Capital
• Issues are sent to subscribers via e-mail, RSS and Twitter.
• Reports on new additions are generated by subject-specific editors.
• Relevant document selection is done manually by the editor!
[Diagram: NEP workflow]
Nep-all
• Contains all new RePEc documents
• Created on a roughly weekly basis
• Contains avg. 488 documents
The editor of each subject report (shown: nep-acc, nep-afr, nep-upt, nep-ure) selects relevant documents from nep-all and sends an issue to the report's subscribers.
Manual selection of relevant documents is a time-consuming task.
ERNAD
• ERNAD (Editing Reports on New Academic Documents) is a purpose-built system.
• It re-ranks nep-all for each editor based on the specific report topic.
• It uses past issues of a report to produce a ranked nep-all.
• If presorting works well, editors select highly ranked documents from nep-all.
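The idea of presorting nep-all from past issues can be illustrated with a toy Python sketch. The slides do not describe how ERNAD actually ranks documents, so the word-overlap score, the function names (`tokenize`, `presort`) and all titles below are hypothetical and serve only to show the principle.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case a title and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def presort(nep_all, past_selected):
    """Rank nep-all titles by word overlap with previously selected titles.

    Assumption for illustration only: a document is scored by summing,
    over its distinct words, how often each word occurred in titles that
    the editor selected in past issues.
    """
    weights = Counter(w for title in past_selected for w in tokenize(title))

    def score(title):
        return sum(weights[w] for w in set(tokenize(title)))

    # sorted() is stable, so documents with equal scores keep their
    # original nep-all order.
    return sorted(nep_all, key=score, reverse=True)


# Hypothetical titles, not taken from NEP.
past_selected = ["Growth and trade in Africa", "Aid flows to Africa"]
nep_all = [
    "Tax policy in Europe",
    "Household saving behaviour",
    "Mobile money in Africa",
    "Trade between China and Africa",
]
ranked = presort(nep_all, past_selected)
```

With these toy inputs the two Africa-related titles move to the top of the list, which mirrors the effect shown in the NEP-AFR example.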
ERNAD example for NEP-Africa (NEP-AFR)
Nep-all unsorted:
1. Tax compliance..
2. Mental accounting..
…
212. Ethnic ..in Africa
317. Sino-African relations:
Nep-all presorted:
1. Ethnic ..in Africa
2. Sino-African relations:
…
50. Tax compliance..
51. Mental accounting..
Research questions
• RQ 1: How long is the editing duration?
• RQ 2: What influences the success of a report?
– Editing duration
– Issue size
• RQ 3: How much effort is invested in selecting and sorting papers per issue?
– Precision @ N
– Relative search length
Pre-selection
• Editing an issue can be interrupted.
• Interruptions would distort the measured editing time.
• Interrupted issues are excluded by grouping the edit durations into 3-minute chunks.
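A minimal Python sketch of this pre-selection step, assuming edit durations are available in minutes. The 3-minute chunk width and the 90-minute limit come from the slides; the chunk labelling by upper bound, the function names and the sample durations are assumptions made for illustration.

```python
import math
from collections import Counter

CHUNK_MINUTES = 3       # chunk width used on the slides
MAX_EDIT_MINUTES = 90   # cut-off stated on the histogram slide


def chunk_upper_bound(duration_min):
    """Return the upper bound (in minutes) of the 3-minute chunk
    that contains the given duration, e.g. 4.0 -> 6."""
    chunks = max(1, math.ceil(duration_min / CHUNK_MINUTES))
    return chunks * CHUNK_MINUTES


def histogram(durations_min):
    """Count issues per 3-minute chunk."""
    return Counter(chunk_upper_bound(d) for d in durations_min)


def keep_uninterrupted(durations_min):
    """Keep only issues whose edit time is below the 90-minute limit."""
    return [d for d in durations_min if d < MAX_EDIT_MINUTES]


# Hypothetical edit durations in minutes, not taken from the NEP logs.
durations = [2.5, 4.0, 15.5, 53.0, 95.0, 240.0]
counts = histogram(durations)
kept = keep_uninterrupted(durations)
```

Here the two durations above 90 minutes are treated as interrupted sessions and dropped before any averages are computed.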
Pre-selection
[Figure: histogram of the number of issues per 3-minute chunk of edit duration (x-axis: 3-minute chunks, from 3 to 90 and >90; y-axis: Number of issues)]
Limit edit time < 90 min
RQ 1: Editing time
[Figure: bar chart of the average editing time per report (x-axis: Report; y-axis: Average editing time in minutes)]
Avg. 15.5 minutes (sd = 10.1)
Min. 2.5 minutes: NEP-RES (Resource economics)
Max. 53 minutes: NEP-ETS (Economic time series)
Summary of RQ 1
• The average editing time is comparatively low, at 15.5 minutes.
• Large variation between the reports:
– Min. 2.5 minutes
– Max. 53 minutes
RQ 2: Influences on report success
• The popularity of a report can be measured by its number of subscribers.
• Large variation in the number of subscribers per report:
– Max. 6859: NEP-HIS (Business, Economic and Financial History)
– Min. 75: NEP-CIS (Confederation of Independent States)
• Factors influencing report success include, for example, the topic and the age of a report.
• Does the issue size or the editing time influence the report success?
Editing time
[Figure: scatter plot of reports (x-axis: Average editing time; y-axis: Number of subscribers), with reference lines for the avg. edit time and the avg. number of subscribers]
Highlighted: Education, 2198 subscribers (avg. 836)
Highlighted: Project, Program and Portfolio Management, 43.5 min (avg. 15.5)
Issue size
[Figure: scatter plot of reports (x-axis: Average issue size; y-axis: Number of subscribers), with reference lines for the avg. issue size and the avg. number of subscribers]
Highlighted: Sports, issue size 2.5 (avg. 12.4)
Highlighted: Demographic Economics, issue size 21 (avg. 12.4)
Summary of RQ 2
• There is no correlation between:
– Issue size and number of subscribers
– Editing time and number of subscribers
• We assume that the success of a report is
mainly driven by topic and age.
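One way to check for such a relationship is a correlation coefficient. The slides do not say which measure was used, so the Pearson coefficient below is an assumption, and the input values are toy numbers chosen only to show the calculation, not NEP data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy values: the first pair is unrelated, the second is perfectly linear.
r_none = pearson([1, 2, 3, 4], [1, 2, 2, 1])
r_full = pearson([1, 2, 3], [2, 4, 6])
```

A coefficient near 0 for (issue size, subscribers) and (editing time, subscribers) would correspond to the "no correlation" finding above; a value near 1 or -1 would indicate a strong linear relationship.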
RQ 3: Effort in selecting and sorting
How much effort is invested in selecting and sorting relevant documents from nep-all?
Two measures are used:
• Precision @ N
• Relative search length
Precision @ N
• How many of the top n documents from pre-sorted
nep-all are selected for the issue?
• N set to: 5, 10, 15, 20
• We only consider issues where issue size > N
• A document is relevant if its index position in nep-all
is < N.
Example: P@ 5
• M={(D1, 4), (D2, 1), (D3, 7), (D4, 3), (D5, 9)}, where each pair gives a selected document and its index position in pre-sorted nep-all
• P@5 for issue I in report J = ⅗
• Editors vary between using pre-sorted and
un-sorted nep-all. Therefore:
– Only consider issues with pre-sort usage > 50
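The P@5 example can be reproduced with a short Python sketch. It applies the rule as stated on the slides (a selected document counts if its index position in nep-all is < N) and divides the count by N. The function name `precision_at_n` and the pair representation are illustrative assumptions; the filters on issue size and pre-sort usage are left out.

```python
def precision_at_n(selected, n):
    """Share of the top-n slots of pre-sorted nep-all that the editor
    selected for the issue.

    `selected` is a list of (document_id, index_position) pairs, where
    index_position is the document's position in pre-sorted nep-all.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # Rule from the slides: a document counts if its position is < n.
    hits = sum(1 for _doc, position in selected if position < n)
    return hits / n


# The issue from the example: D1, D2 and D4 have positions below 5.
issue = [("D1", 4), ("D2", 1), ("D3", 7), ("D4", 3), ("D5", 9)]
p_at_5 = precision_at_n(issue, 5)  # 3 hits out of 5
```

Averaging this value over all qualifying issues of a report would give a per-report figure such as those reported for P@5, P@10, P@15 and P@20.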
Results for P@N
Avg. P@5 (82 rep): 0.77
Avg. P@10 (64 rep): 0.80
Avg. P@15 (50 rep): 0.80
Avg. P@20 (31 rep): 0.82
• Max. found for nep-env (Environmental Economics) with P@5 = 0.99
• Min. found for nep-cba (Central Bank) with P@5 = 0.35
Summary of P@N
• Editors work comfortably with the presorting in nep-all.
• The number of papers per issue has no significant influence on the precision.
Relative Search Length
• We know how many of the top N documents from nep-all are selected.
• To what depth do editors inspect nep-all?
• RSL: the ratio between the highest index position (hin) of the last relevant document in nep-all and the length of nep-all
Example RSL
• Editor is given a nep-all containing 300 documents.
• M={(D1, 4), (D2, 10), (D3, 7)}
• RSL = 10/300
• We assume that the editor has inspected nep-all to document 10.
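The RSL example can be written as a small Python sketch. The function name `relative_search_length` and the pair representation are illustrative assumptions; the calculation follows the definition given on the slides (highest index position of a selected document divided by the length of nep-all).

```python
def relative_search_length(selected, nep_all_length):
    """Highest index position among the selected documents, divided by
    the number of documents in nep-all.

    `selected` is a list of (document_id, index_position) pairs.
    """
    if nep_all_length <= 0:
        raise ValueError("nep-all must contain at least one document")
    if not selected:
        raise ValueError("the issue must contain at least one document")
    highest = max(position for _doc, position in selected)
    return highest / nep_all_length


# The issue from the example: the deepest selected document is at position 10.
issue = [("D1", 4), ("D2", 10), ("D3", 7)]
rsl = relative_search_length(issue, 300)  # 10 / 300
```

A small RSL means the editor found everything needed near the top of the presorted list; a value close to 1 would mean the editor had to go through almost all of nep-all.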
Relative Search Length
[Figure: bar chart of the average RSL per report (x-axis: Report; y-axis: Average RSL per report, 0 to 0.35)]
NEP-MAC (Macroeconomics): RSL = 0.35
NEP-SPO (Sports and Economics): RSL = 0.01
Avg. RSL = 0.08
Summary of RSL
• The relative search length is comparatively low, at 0.08.
• Editors select papers from the very top of nep-all.
Conclusion
• Focused on observable system features
– Editing time
– Influences on report success
– Effort in creating an issue
• In summary: the system supports the editor well in creating an issue.
• A complete view requires a more user-centred observation.
• Future work:
– Why and under what conditions is a document relevant?
• NEP provides many opportunities for further research on data that is relatively easily available.