Using data to improve student research
EasyBib is an automatic bibliography composer.
Students use it to cite sources for their
research.
We teach information literacy.
18% of all student papers include plagiarism (1)
50% likelihood of using a credible vs. non-credible source (1)
4% increase in the use of paper mills and cheating sites (1)
~16% of students are adequately prepared for college (2)
Source: (1) TurnItIn; (2) Both Sides Now: Librarians Looking at Information Literacy from High School and College.
That’s how we felt too.
The problem is becoming bigger.
Unprepared students make for unprepared adults.
It’s not just students who plagiarize:
•Pal Schmitt, former president of Hungary
•German education minister
•Jayson Blair (former New York Times writer)
•Jonah Lehrer, journalist and author
•Fareed Zakaria (reporter, author, host)
We are in the right place to figure it out.
Over half of all students in the US (40M)
Over half a billion
citations
We asked ourselves the following questions:
•What are students using in their research?
•How good are their sources?
•How can we help them?
We started with the basics.
_gaq.push([ 'citations._trackEvent', citationTitle, citationPublisher, citationId]);
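The call above uses the asynchronous `_gaq` queue of classic Google Analytics (ga.js), where `_trackEvent` takes a category, an action, and an optional string label. A minimal sketch of how such a command could be assembled per citation is shown here. The `buildCitationEvent` helper and the shape of the `citation` object are assumptions for illustration, not EasyBib's actual code.

```javascript
// Hypothetical helper: builds the command array for the classic ga.js queue.
// The tracker name "citations" comes from the slide; the citation fields are assumed.
function buildCitationEvent(citation) {
  return [
    'citations._trackEvent',
    citation.title,      // event category
    citation.publisher,  // event action
    String(citation.id)  // event label (ga.js expects a string here)
  ];
}

// Standard ga.js pattern: create the queue if the library has not loaded yet.
var _gaq = _gaq || [];

// Example with made-up values.
_gaq.push(buildCitationEvent({
  id: 42,
  title: 'Example article',
  publisher: 'Example Publisher'
}));
```

Until ga.js loads, `_gaq` is a plain array, so queued commands can be inspected or tested without a browser.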
Here’s what we found.
Top sources 2010
•Wikipedia
•Google
1. The New York Times
2. CIA World Factbook
3. Oracle Thinkquest
4. Buzzle
5. US BLS
6. Dictionary.com
7. CDC
8. PBS
9. eHow
Source: EasyBib Google Analytics Oct 2010-Nov 2010 data.
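A ranking like this came from Google Analytics reports. As a rough sketch of the same kind of tally done by hand, the function below counts tracked events per publisher and returns the most cited ones. The event shape and the sample values are made up for illustration and are not EasyBib data.

```javascript
// Counts how often each publisher appears in a list of tracked citation events
// and returns the top `limit` publishers, most cited first.
// Ties are broken alphabetically so the output is deterministic.
function topSources(events, limit) {
  const counts = new Map();
  for (const event of events) {
    counts.set(event.publisher, (counts.get(event.publisher) || 0) + 1);
  }
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit)
    .map(([publisher, count]) => ({ publisher, count }));
}

// Made-up sample events, not real data.
const sampleEvents = [
  { publisher: 'A' }, { publisher: 'B' }, { publisher: 'A' },
  { publisher: 'C' }, { publisher: 'B' }, { publisher: 'A' }
];

console.log(topSources(sampleEvents, 2));
```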
What could we do?
•Warn them when their source’s credibility is in question
•Analyze the quality of their full bibliography
•Make it easier to not plagiarize
•Suggest better sources
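The first two ideas can be sketched as a single pass over a bibliography: flag every source that is not rated credible, and report what share of the list is credible. The rating table, the rating labels, and the `auditBibliography` name are all hypothetical; the slides do not say how credibility is defined or stored.

```javascript
// Checks each cited domain against a ratings table (assumed to exist) and
// returns the share of credible sources plus a warning for every other source.
function auditBibliography(domains, ratings) {
  const warnings = [];
  let credible = 0;
  for (const domain of domains) {
    const rating = ratings.has(domain) ? ratings.get(domain) : 'unknown';
    if (rating === 'credible') {
      credible += 1;
    } else {
      warnings.push({ domain, rating });
    }
  }
  const credibleShare = domains.length === 0 ? 0 : credible / domains.length;
  return { credibleShare, warnings };
}

// Hypothetical ratings and bibliography, for illustration only.
const exampleRatings = new Map([
  ['good.example', 'credible'],
  ['farm.example', 'questionable']
]);
const exampleBibliography = ['good.example', 'farm.example', 'unknown.example', 'good.example'];

console.log(auditBibliography(exampleBibliography, exampleRatings));
```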
Define credibility.
Improve citation quality
Gave students access to their own analytics
To combat plagiarism, we built an audit trail for
notes
So after all this... Does it blend (tm)?
1. Wikipedia
2. Bio.com
3. History.com
4. PBS
5. Mayo Clinic
6. CDC
7. The New York Times
8. BBC
9. CNN
10. WebMD
11. US BLS
•Wikipedia still on top, but ...
•No content farms, no Google.
•WebMD is questionable, but its credibility can be argued for.
Source: Apr-May 2013 Google Analytics data
We have to admit, it’s getting better...
Help students find better sources
How does the Research engine currently work?
Cloudant (CouchDB)
MySQL
Lucene/Solr
Slow, asynchronous, lots of moving parts.
Starting to do a bit more
StatsD::increment($metrics);
$response = $rediska->publish( array('realtime'), $citation );
There’s a lot more we can do, and data will help us.
Cloudant Search
•Full-text search integrated into Cloudant
•Lucene syntax
•Indexing is easy
function(doc){
  index("title", doc.title, {"store": "yes"});
}
•Grouping of sources via chained map-reduce
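map: function(doc){ if (doc.title){ emit({"title": doc.title}, 1); } }
reduce: _sum
dbcopy: citationGroup
------
// Documents written by dbcopy carry the first stage's result in "key" and "value",
// so the guard checks doc.key rather than doc.title.
map: function(doc){ if (doc.key && doc.key.title){ emit(doc.value, doc.key.title); } }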
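To make the two stages concrete, here is a plain JavaScript emulation that runs without a database. It assumes that the documents `dbcopy` writes to the target database expose the reduced result as `key` and `value` fields, and that view rows come back sorted by key. The function names are illustrative.

```javascript
// Stage 1: emulate the first map plus the _sum reduce, grouping by title.
function countByTitle(docs) {
  const counts = new Map();
  for (const doc of docs) {
    if (doc.title) {
      counts.set(doc.title, (counts.get(doc.title) || 0) + 1);
    }
  }
  // Mimic the dbcopy output: one document per group with `key` and `value`.
  return Array.from(counts, ([title, count]) => ({ key: { title }, value: count }));
}

// Stage 2: emulate the second map, which emits the count as the row key
// so that rows sort by how often a title was cited.
function indexByCount(groupDocs) {
  const rows = [];
  for (const doc of groupDocs) {
    if (doc.key && doc.key.title) {
      rows.push({ key: doc.value, value: doc.key.title });
    }
  }
  return rows.sort((a, b) => a.key - b.key);
}

// Made-up citation documents; the one without a title is skipped.
const citationDocs = [{ title: 'A' }, { title: 'A' }, { title: 'B' }, { publisher: 'X' }];
console.log(indexByCount(countByTitle(citationDocs)));
```

Emitting the count as the key in the second stage is what makes "most cited titles" a simple range query over the view.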
Live data analysis. Crowdsourcing.
•Use Cloudant Search to power feedback on sources (# of times cited in real time, quality of bibliographies derived from)
•Allow users to submit their own credibility evaluations and aggregate results
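Aggregating user-submitted evaluations could be as simple as counting votes per source. The sketch below assumes each evaluation is a yes/no credibility vote; the field names and the `aggregateEvaluations` helper are hypothetical, since the slides do not describe the submission format.

```javascript
// Aggregates user-submitted evaluations per source.
// Assumed evaluation shape: { source: string, credible: boolean }.
function aggregateEvaluations(evaluations) {
  const totals = new Map();
  for (const { source, credible } of evaluations) {
    const entry = totals.get(source) || { votes: 0, credibleVotes: 0 };
    entry.votes += 1;
    if (credible) {
      entry.credibleVotes += 1;
    }
    totals.set(source, entry);
  }
  // Convert raw counts into a vote total and a credible share per source.
  const result = {};
  for (const [source, { votes, credibleVotes }] of totals) {
    result[source] = { votes, credibleShare: credibleVotes / votes };
  }
  return result;
}

// Made-up votes for illustration.
const exampleVotes = [
  { source: 'a.example', credible: true },
  { source: 'a.example', credible: false },
  { source: 'b.example', credible: true }
];
console.log(aggregateEvaluations(exampleVotes));
```

Keeping the vote count next to the share lets a caller discount sources that only a handful of users have rated.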
SourceRank!
Credibility weighting + crowdsourcing
Synchronous & realtime via Cloudant Search
Value nodes based on nearest neighbors
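The slides do not spell out how SourceRank values a node. One possible reading of "nearest neighbors", sketched here purely as an assumption, is to estimate an unrated source's score as the mean score of the rated sources it is linked to, for example sources that appear in the same bibliographies. The graph structure, the score scale, and the `neighborScore` name are all hypothetical.

```javascript
// Estimates a score for `node` as the mean score of its already-rated neighbors.
// `graph` maps a source to the sources it is linked with (assumed structure).
// Returns null when no neighbor has a known score.
function neighborScore(node, graph, knownScores) {
  const neighbors = graph.get(node) || [];
  const rated = neighbors.filter((n) => knownScores.has(n));
  if (rated.length === 0) {
    return null; // nothing to infer from
  }
  const sum = rated.reduce((acc, n) => acc + knownScores.get(n), 0);
  return sum / rated.length;
}

// Made-up graph and scores on an assumed 0..1 scale.
const exampleGraph = new Map([['x.example', ['a.example', 'b.example', 'c.example']]]);
const exampleScores = new Map([['a.example', 1], ['b.example', 0.5]]);

console.log(neighborScore('x.example', exampleGraph, exampleScores));
```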
And other things...
Driving growth
We have the largest UGC
citation set. Making this
searchable creates a “moat.”
The more people that use
EasyBib, the better the tool
becomes.
What about other data analytics tools?
Too stretched to learn more complex tools (looking for easy answers)
Costs (GA is free!)
EMR, Hadoop, Redshift, Cloudant Search: This is what’s next.
Questions?