Digital Humanities 2010 London. About the CKCC project: Dutch Republic of Letters. With Charles van den Heuvel.

  • 1. Letters, Ideas and Information Technology
    Using digital corpora of letters to disclose the circulation of knowledge in the 17th century
    scholarly communication @ 1650 / scholarly communication @ 2050
    Erik-Jan Bos, Univ. Utrecht, erik-jan.bos@phil.uu.nl
    Charles van den Heuvel, VKS, charles.vandenheuvel@vks.knaw.nl
    Dirk Roorda (that's me), DANS, dirk.roorda@dans.knaw.nl

2. http://ckcc.huygens.knaw.nl/

3. [Diagram; recoverable labels: Nota, Beeckman, Cats, STEVIN, Huygens, Langeren. Legend: relation / disciplines: direct - water, indirect - literature]

4. Corpora of 17th century scholars
  • Constantijn Huygens
  • Christiaan Huygens
  • Grotius
  • Descartes
  • Swammerdam
  • Leeuwenhoek
  • Barlaeus
  • Spinoza
  • and more?

5.
  Corpus               Number of letters  In possession?  Format        Metadata                 Normalized?
  Grotius              7946               Yes             TEI           In Interp element        Yes, DBNL codes
  Van Leeuwenhoek      337                Yes             TEI           In Interp element        Yes, DBNL codes
  Descartes            750                Yes             XML (no TEI)  other markup             No, plain text
  Barlaeus             1200               300 ready       Word          unknown                  unknown
  Swammerdam           80                 Yes             Word          unknown                  unknown
  Constantijn Huygens  7295               Yes             xml           Probably Interp element  DBNL codes
  Christiaan Huygens   2900?              Mid-2010        probably TEI  Probably Interp element  DBNL codes

6. CEN - Metadata
  • Catalogus Epistularum Neerlandicarum
  • 265,000 descriptions of approximately 1,000,000 letters from 1600 until now
  • of which 100,000 letters are from the 17th century

7. Research Questions
  • History of science: How did knowledge circulate in the 17th-century Dutch Republic?
  • Patterns in knowledge growth: How can we visualise sets of letters that exhibit features of knowledge circulation?
  • Re-use: How can we expose the sources, annotations, and resulting patterns to further research?

8. Challenge
  • Traditional scholarship (East): interpretation, close reading, solving puzzles
  • Computational methods (West): dealing with patterns gleaned from large quantities of texts by automatic tools
  • East is east and West is west and ...

9. Issues to deal with
  • making the sources uniformly available: well coded in TEI, access rights
  • overcoming the language barrier (17th-century varieties of French, Latin, Dutch)
  • named entity recognition & concepts: people, places, dates, concepts, instruments; a mixture of interpretation and algorithms
  • creating useful visualisations: aiding exploration by historians of science

10.
ICT in Humanities Research
  • collaboratory: e-Laborate as starting point
  • algorithmic pipelines: from source material to visualisation
  • infrastructure: archiving results, re-using data, developing new algorithms, disseminating the methodology

11. collaboratory

12. pipelines

13. pipelines (current)
  • language detection, using "Language Identification from Text Using N-gram Based Cumulative Frequency Addition", Bashir Ahmed, Sung-Hyuk Cha, and Charles Tappert, 2004
  • results (chart labels): latin, dutch, french, german

14. pipelines (current)
  • spelling normalisation: VARD (http://www.comp.lancs.ac.uk/~barona/vard2/) with help from (http://www.dicollecte.org/home.php?prj=fr)
  • results
    • French: VARD works (after improvements), although designed for historical English
    • Dutch: still on the lookout for a combination of resources, tools, and dexterity
    • Latin: later

15. pipelines (current)

16. pipelines (current)
  • named entity recognition
  • known tools get 70%
  • search for optimal tools in the next stage

17. pipelines (insights)
  • expect the most from statistical methods
  • language technology may boost results
  • it remains to be seen by how much

18. Topic-Author-Time (source: Scott Weingart, UIA)

19. infrastructure

20. the project's legacy
  • more than publications: curated sources, annotations, visualisations
  • more than algorithms: a framework for analysis of historical texts
  • more than a piece of historical research: data and (intermediate) results worthwhile to linguists, computer scientists, sociologists
  • more than a passive dataset: extensible, dynamic, interactive

21. preserving the results
  • part of the CLARIN infrastructure: http://www.clarin.eu/ http://www.clarin.nl/
  • materials in a Trusted Digital Repository (DANS): http://easy.dans.knaw.nl/dms

22. working with CLARIN
  • CLARIN-EU
    • Outreach to humanities: use cases
    • CKCC one of 10 selected projects
    • received expert input for choice of language tools
  • CLARIN-NL
    • CKCC one of 10 initial projects in the Dutch national construction effort
    • support for applying language technology

23.
Adapting to CLARIN
  • Conforming to standards
  • CLARIN standards are in evolution (and will remain evolvable)
  • Common MetaData Infrastructure: a registry of metadata components, defined by the community, with explicit semantics (http://www.isocat.org/)
  • Data in TEI (as export/import format)

24. Trusted Digital Repository
  • materials
    • reliable (provenance metadata)
    • findable (CMDI metadata)
    • referable (persistent identifiers)
    • accessible (viewable in web browser)
    • usable (downloadable)
  • sooner or later: high-performance computing
  • memento: a time-sensitive web interface to the dynamic contents of the collaboratory (http://arxiv.org/abs/0911.1112)

25. http://www.clarin.eu/node/3073

26. http://ckcc.huygens.knaw.nl/
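
Note on slide 13: the language-detection step cites N-gram based cumulative frequency addition (Ahmed, Cha and Tappert 2004). The Python sketch below illustrates the general idea only and is not the CKCC pipeline code. The function names, the n-gram range (1 to 3 characters), the normalisation by the total n-gram count per language, and the four toy training sentences are all assumptions made for illustration; a real model would be trained on much larger samples per language.

```python
from collections import Counter


def char_ngrams(text, n_max=3):
    """Return all character n-grams (n = 1..n_max) of a lower-cased text.

    Whitespace is collapsed and the text is padded with one space on each
    side so that word boundaries show up in the n-grams.
    """
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(padded) - n + 1)]


def train(samples, n_max=3):
    """Build one relative-frequency table per language from {language: text}."""
    models = {}
    for lang, text in samples.items():
        counts = Counter(char_ngrams(text, n_max))
        total = sum(counts.values())
        # Assumption: normalise by the total n-gram count of this language.
        models[lang] = {gram: c / total for gram, c in counts.items()}
    return models


def identify(text, models, n_max=3):
    """Cumulative frequency addition.

    For every language, add up the relative frequency of each n-gram that
    occurs in the text (unseen n-grams add 0); the largest sum wins.
    """
    grams = char_ngrams(text, n_max)
    scores = {lang: sum(freqs.get(g, 0.0) for g in grams)
              for lang, freqs in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical toy training sentences, far too small for real use.
SAMPLES = {
    "latin": "gratias tibi ago pro litteris tuis quas nuper accepi",
    "dutch": "ik dank u voor uw brief die ik onlangs heb ontvangen",
    "french": "je vous remercie de votre lettre que je viens de recevoir",
    "german": "ich danke ihnen herzlich und habe ihren brief mit freude gelesen",
}

models = train(SAMPLES)
language, scores = identify("uw brief heb ik met genoegen gelezen", models)
print(language, scores)
```

The method needs no tokeniser or dictionary, which suits 17th-century spelling variation: a letter is scored purely on how familiar its character sequences look to each language model. With training texts this small the scores are unreliable, so the printed result should be read as a demonstration of the mechanics rather than a measurement.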