Data and Slides
Download @: http://marcedit.reeset.net/workshops/um_marcedit7.zip
Download, open, and extract, saving to your desktop (or wherever)
MarcEdit 7!
MarcEdit 7 was released over the U.S. Thanksgiving holiday.
The release:
1. Has been in development for close to 9 months, with ~20 testers in 7 countries using 4 different MARC flavors providing direct feedback
2. Touched nearly every part of the program; when finished, the release updated a shade under 350,000 lines of code
3. Was tested against almost 20 million records
4. Is the first version of MarcEdit designed with accessibility in mind
5. Is fast (I’m going to show you a few places where)
MarcEdit 7 highlights
Lightweight clustering has been added directly into the program
New way to process XML/JSON data
A new linked data engine, with support for locally defined RDF vocabularies in reconciliation
New task processing
Consolidated Z39.50/SRU client
Added Editing Functions
• New Add/Delete Field Tools (deduplication)
• Expanded Regular Expression options
• Updated OCLC Integrations
Integrated Help
Today’s topics
Quick overview of MarcEdit 7 Changes
Explore MarcEdit 7’s new Clustering Functionality
Working with non-MARC data using known and unknown metadata formats
Explore MarcEdit 7’s Linked Data Platform
MarcEdit Regular Expressions Primer
Integration opportunities with
• Alma or other ILS systems
• OCLC
• Connexion
Let’s look at what’s new
Welcome to Project Hazel, your friendly (and sometimes helpful) installation agent
◦ Hazel is there to help highlight important options, and to make sure you can work with Unicode data by checking that you have a Unicode font.
Accessibility
◦ MarcEdit 7 includes an improved font/sizing engine for better layout on different screen sizes and resolutions
◦ All images are tagged with text and are accessible via screen readers or the operating system’s accessibility tooling
◦ Availability of themes, to allow you to customize windowing and contrast to ease eye strain
◦ Keyboard shortcuts (everywhere)
◦ Sound cues
◦ Window transparency
Let’s look at what’s new
More International
◦ MarcEdit 7 uses an intelligent machine translation service, providing an interface in close to 26 languages at this point
It’s Faster
◦ Lists have been virtualized (lower overhead)
◦ Pages load quicker
◦ Tasks have been super-charged
It’s leaner, in part because Windows XP support is no longer provided
Let’s look at what’s new
Program is easier to manage
◦ The program has 4 installation modes
  ◦ 32-bit Administrator and non-Administrator installation modes
  ◦ 64-bit Administrator and non-Administrator installation modes
◦ How do I choose?
  ◦ Depends on your needs: http://marcedit.reeset.net/downloads
Let’s talk about task changes
How they worked in MarcEdit 6
Task Changes
How they work in MarcEdit 7
So what does that mean to me?
In MarcEdit 6, the optimal task size was ~20 operations or fewer. As the operation count grew, processing time became exponentially slower. In MarcEdit 7, that performance line actually goes the other direction: the more task actions a task list contains, the more records per second the tool processes.
Real‐World Example
A library in Greece has a task list with over 1,000 task actions. They use this task to clean up large portions of their database in one pass. Generally, this means processing ~300,000-500,000 records at a time. In MarcEdit 6, this process would take as many as 10 hours to complete. Using MarcEdit 7 task processing, it now takes less than 20 minutes.
But seeing is believing
Let’s compare processing using one record, but with a task list that uses north of 100 task actions, in MarcEdit 6 and MarcEdit 7.
Other comparisons
•Comparing the Extract Selected Records Tooling
Virtual Lists
•Loading large data files
Loading Files
MarcEdit 7 continues growing
Near term planned additions
◦ Completion of the updated MarcEdit Mac 3.0 upgrade (to include the new functionality)
◦ New plugins for individual record creation templates
◦ Support for HDT and linked data fragments (this is awesome stuff)
◦ Additional clustering algorithms
Clustering in MarcEdit 7
How people clustered MARC data in the past
1. Export the fields considered for investigation into a tab-delimited format
2. Import into OpenRefine
3. Cluster the data
4. Make edits
5. Export the delimited data out of OpenRefine
6. Develop a process to merge the changed data back into MARC
If you need to have your data start or end up in MARC, working with OpenRefine can be challenging because there isn’t a natural process to move between these two formats
Clustering in MarcEdit 7
MarcEdit’s built-in clustering tools support native grouping and batch editing, and work well on files of a million records or fewer (they can work on larger sets, but the larger the file, the longer the cluster operation takes)
Clustering Options
Clustering Algorithms
◦ Levenshtein Distance
  ◦ This algorithm is best for people, places, and subjects
  ◦ This algorithm builds clusters based on the number of character positions that differ between words or phrases
  ◦ This algorithm is generally faster
◦ Composite Coefficient
  ◦ This algorithm is best for highly variable data where a great deal of fuzziness is desired.
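To make the Levenshtein option concrete, here is a minimal Python sketch of edit-distance clustering. It is not MarcEdit's implementation: the greedy grouping strategy, the `max_distance` threshold, and the sample headings are all assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or substitutions
    needed to turn a into b (classic dynamic-programming formulation)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster(values, max_distance=2):
    """Greedy grouping (an assumption, not MarcEdit's method): each value
    joins the first cluster whose first member is within max_distance edits."""
    clusters = []
    for value in values:
        for group in clusters:
            if levenshtein(value.lower(), group[0].lower()) <= max_distance:
                group.append(value)
                break
        else:
            clusters.append([value])
    return clusters


# Invented sample headings with typical keying variations.
headings = ["Twain, Mark", "Twain, Marc", "Twian, Mark", "Austen, Jane"]
groups = cluster(headings)
```

A small threshold keeps clusters tight, which suits names, places, and subjects where variants usually differ by only a character or two.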
Clustering Changes
Clustered changes are queued and stacked. Changes happen once all edits have been set.
Clustered changes can be made by group, across groups, or selected items within a group
Clustering Enhancements
Things I’m thinking about:
◦ Enabling clustering support to be run on non-MARC data
I’d like to hear your ideas as well
XML Conversions
MarcEdit: crosswalking design
MarcEdit model:
◦ So long as a schema has been mapped to MARCXML, any metadata combination can be utilized. This means that no more than two transformations will ever take place. Example: MODS → MARCXML → EAD
[Diagram: MarcEdit crosswalking model, with MARC21XML as the hub connecting EAD, FGDC, MODS, MARC, and Dublin Core]
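The hub model can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MarcEdit's code: the stylesheet file names are hypothetical, and the function only plans which transformations would run rather than executing any XSLT.

```python
HUB = "MARCXML"

# Hypothetical stylesheet names, one per direction for each mapped schema.
TO_HUB = {
    "MODS": "mods_to_marcxml.xsl",
    "EAD": "ead_to_marcxml.xsl",
    "FGDC": "fgdc_to_marcxml.xsl",
    "Dublin Core": "dc_to_marcxml.xsl",
}
FROM_HUB = {
    "MODS": "marcxml_to_mods.xsl",
    "EAD": "marcxml_to_ead.xsl",
    "FGDC": "marcxml_to_fgdc.xsl",
    "Dublin Core": "marcxml_to_dc.xsl",
}


def plan_crosswalk(source: str, target: str) -> list:
    """Return the stylesheets to apply, in order, to get from source to target.
    Because every schema is mapped only to the hub, the plan never has more
    than two steps."""
    if source == target:
        return []
    steps = []
    if source != HUB:
        steps.append(TO_HUB[source])    # step 1: into MARCXML
    if target != HUB:
        steps.append(FROM_HUB[target])  # step 2: out of MARCXML
    return steps


mods_to_ead = plan_crosswalk("MODS", "EAD")
```

The design payoff is that adding a new schema means writing two stylesheets (to and from the hub) instead of one pair for every other format.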
MarcEdit: Crosswalks for everyone
What’s MarcEdit doing?
◦ Facilitates the crosswalk by:
1. Performing character translations (MARC8-UTF8)
2. Facilitating interaction between binary and XML formats
Setting up Crosswalks
XML Function Wizard
The wizard was created to help fill a gap: to enable metadata crosswalking when a user doesn’t have a lot of expertise building XSLT or XQuery transformations
OAI Harvesting
MarcEdit’s OAI Harvester can run in two modes
◦ User Initiated
◦ Scheduled
Let’s look at both!
OAI Harvesting - User Initiated
Harvesting supports the following verbs
◦ GetRecord
◦ ListRecords
◦ ResumptionToken
Any metadataPrefix can be accommodated, but by default, the tool has XSLT crosswalks for:
◦ MARCXML
◦ OAIMARC
◦ Dublin Core
◦ MODS
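For orientation, this is roughly what the underlying OAI-PMH requests look like. The sketch below only builds request URLs; the base URL and record identifier are hypothetical, and it makes no claim about how MarcEdit issues requests internally. Note that in the OAI-PMH protocol, `resumptionToken` is an argument passed to `ListRecords` to fetch the next page of results, rather than a verb of its own.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix=None, set_spec=None,
                     resumption_token=None):
    """Build a ListRecords request. When a resumption token is supplied it
    replaces the other arguments, as the protocol requires."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build a GetRecord request for a single identifier."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    return f"{base_url}?{urlencode(params)}"


BASE = "https://example.org/oai"  # hypothetical repository endpoint
first_page = list_records_url(BASE, metadata_prefix="oai_dc")
next_page = list_records_url(BASE, resumption_token="abc123")
one_record = get_record_url(BASE, "oai:example.org:1", "oai_dc")
```

A harvester would request `first_page`, read the token from the response, and keep requesting follow-up pages until the repository returns no token.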
OAI Harvesting - Scheduled
Using the scheduler on Windows, cron on Linux, or the equivalent on macOS, you can create harvesting jobs and schedule them for regular harvest
Working with Linked Data In MarcEdit
Objects not strings
Probably the biggest reason people talk about linked data is the notion of moving from strings to objects
Strings
Objects
Objects Not Strings
URIs provide actionable data
◦ Controlled terms can be updated without user intervention (generally)
◦ And URIs can provide access to more information
  ◦ E.g., a URI to VIAF provides access not just to author information, but to all their related works and collaborators as well.
So why aren’t we doing this already?
Great question!
◦ We aren’t ready
1. Changing strings to objects is hard and expensive
2. We have some folks, like OCLC, that could be in a position to help us, but our current systems are not set up to use (and in some cases, store) the data.
3. Many of our controlled vocabularies are not designed to support reconciliation work
• And those that are aren’t production ready
• Or are proprietary
So what can we do right now?
A lot:
◦ Many of the large national services are making resources and infrastructure available to enable libraries to begin doing this work
◦ OCLC has been largely supportive, and provides its own tools that output linked data content
◦ We can start lobbying our systems vendors to not just store the data, but make use of the information when provided
◦ We can start the reconciliation process (because this process takes time)
MarcEdit and Linked Data
MarcEdit 6 and 7 include a linked data platform. This is an integration platform that enables MarcEdit to work with various linked data services, and provides a way to build new services around this functionality
◦ Designed to support RDF, JSON-LD, SPARQL, and a wide range of library-specific services currently providing one-off access to controlled data
◦ The framework has been utilized in MarcEdit for the development of a toolset called MARCNext
MARCNext
These are experimental services that allow catalogers to play with their data and visualize it through the BIBFRAME lens, as well as begin the process of turning strings into objects.
Linked Data Tool
The linked data tool enables reconciliation services
Works from a rules file, which enables users to customize the output provided
◦ MarcEdit 7 provides a rules file optimized for MARC21, but I have rules files being tested for a number of MARC formats (including UNIMARC)
Currently supports the insertion of $0 and $1 into bibliographic and authority data
Includes support for ~25 remote linked data endpoints
Can use local rdf files as locally mounted SPARQL stores
Allows for targeted, or automatic processing
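To show what inserting a $0 means at the field level, here is a small Python sketch working on a field in MarcEdit's mnemonic format. Everything specific in it is an assumption: the lookup table stands in for a real reconciliation service, the URI is a placeholder, and the way the heading string is assembled is a simplification rather than what MarcEdit's rules file does.

```python
import re

# Stand-in for a reconciliation service (hypothetical heading and URI).
KNOWN_URIS = {
    "Twain, Mark, 1835-1910": "http://example.org/names/twain-mark",
}


def add_uri(field: str, lookup) -> str:
    """Append a $0 carrying the URI returned by lookup(heading).
    The field is left alone if it already has a $0 or nothing is found."""
    if "$0" in field:
        return field
    # Skip "=", the 3-character tag, 2 blanks, and 2 indicators (8 characters).
    data = field[8:]
    # Simplification: build the heading by dropping the subfield codes.
    heading = re.sub(r"\$[a-z0-9]", " ", data)
    heading = re.sub(r"\s+", " ", heading).strip()
    uri = lookup(heading)
    return f"{field}$0{uri}" if uri else field


before = r"=100  1\$aTwain, Mark,$d1835-1910"
after = add_uri(before, KNOWN_URIS.get)
unmatched = add_uri(r"=100  1\$aUnknown, Person", KNOWN_URIS.get)
```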
Does this currently scale?
I get asked this question because the Library of Congress actively throttles data requests made against their service. So do many other service providers. They have to; it’s a method of self-preservation. When I test-reconciled Ohio State’s entire database (~6 million records), I estimated that I would end up making, on the low end, 48,000,000 requests just to the Library of Congress. Over a very short period, that’s a lot of requests, and it can overwhelm their services.
However…
◦ I work closely with many of the large data providers, and they give MarcEdit some leeway because:
  ◦ MarcEdit follows some established patterns. In LC’s case, they can provide an HTTP status code that lets the application know that their service is under load, and MarcEdit will start slowing down requests for a specified time period.
  ◦ MarcEdit does its own internal caching, so an item is only retrieved once per reconciliation session. Using this method, I can likely cut the number of requests to a service like LC by a third or more. In fact, the more data that’s processed, the faster it goes and the fewer requests it has to make to the source vocabulary.
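The two behaviors described above, backing off when the service signals load and caching each term for the session, can be sketched as follows. This is an illustration only: the slides do not say which status code LC returns, so treating 429 and 503 as "under load" is an assumption, as are the class name, the pause length, and the `fetch` callable that stands in for a real HTTP request.

```python
import time


class ThrottledLookup:
    """Session-scoped cache in front of a remote vocabulary lookup."""

    BUSY_STATUSES = (429, 503)  # assumed "service under load" signals

    def __init__(self, fetch, pause_seconds=30.0, sleep=time.sleep):
        self.fetch = fetch              # callable: term -> (status, body)
        self.pause_seconds = pause_seconds
        self.sleep = sleep              # injectable so it can be tested
        self.cache = {}
        self.requests_made = 0

    def resolve(self, term, max_attempts=3):
        # Each term is retrieved at most once per session.
        if term in self.cache:
            return self.cache[term]
        for _ in range(max_attempts):
            self.requests_made += 1
            status, body = self.fetch(term)
            if status in self.BUSY_STATUSES:
                self.sleep(self.pause_seconds)  # slow down, then retry
                continue
            result = body if status == 200 else None
            self.cache[term] = result
            return result
        return None  # still busy after max_attempts; not cached
```

Because repeated headings are common in a catalog, the cache hit rate climbs as more records are processed, which matches the observation that larger runs make proportionally fewer requests.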
We can build new services
Using linked data tools for heading validation
Validate Headings
How it works
◦ Working directly with the U.S. Library of Congress, MarcEdit queries the NACO and SACO headings directly
◦ Returning information about URIs and variants/changes
◦ MarcEdit then generates a report, automatically corrects headings (when possible), and can generate brief authority records or download the existing authority record
Questions
Again, I would like to hear from you.
I’ve been working with members of the PCC task force looking at how we embed linked data into MARC records (and outside of MARC records), and I’ve been actively building these tools into MarcEdit (both for research and production).
How would you like to work with linked data recordset in your library?
What could MarcEdit do to make this easier for you?
MarcEdit Regular Expression Primer
MarcEdit Regular Expression Support
Functions that presently support regular expressions
◦ Delete Field
◦ Edit Field
◦ Copy Field
◦ Swap Field
◦ Build New Field
◦ Validation
◦ Extract/Delete Selected Records
Expression Scope
Deciding which function to use depends on the scope of data needing to be evaluated
◦ Add/Delete Field: regular expressions have access to the entire field (from the “=” to the end of line (eol))
◦ Edit Subfield: regular expressions have access from the subfield code to the end of the subfield
◦ Edit Field: regular expressions have access to all subfield data, but *not* indicator data
◦ Edit Indicators: only access to indicator data
◦ Copy Field: regular expressions have access to indicator data + all subfield data
◦ Replace Function: regular expressions have access to all record data
Microsoft’s Regular Expression language
Concepts:
◦ Character escapes
◦ Anchors
◦ Character classes
◦ Grouping
◦ Quantifiers
◦ Substitutions
Let’s open Regular Expression Language - Quick Reference.html or https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
How we use Regular Expressions in MarcEdit
The most important parts of the regular expression language are:
1. Character escapes: \d \r \n \$ \x##
2. Character classes: [] & [^]
3. Grouping elements: ()
4. Anchors: ^ $
5. Quantifiers: * ? + {#}
6. Substitutions: $#
Examples
Looking at regex_example.mrk using the replace function:
◦ Add a period to the 500 if it is missing
◦ Update the 300 to reflect electronic information
◦ Split the 856 into two fields, breaking on the $u.
Example 1
◦ Add a period to the 500 if it is missing
◦ Find What: (=500  ..)(.*[^\W]$)
◦ Replace With: $1$2.
Explanation:
◦ (=500  ..)
  ◦ Searches for the 500 field. We leave two blanks because there are always 2 blank characters after the tag in the mnemonic format. The two periods stand for any character (here, the two indicators). If you want to search for exact indicators, place those values rather than the periods.
◦ (.*[^\W]$)
  ◦ Take any characters, and match only when the last character in the field is a word character (a letter, digit, or underscore), which means the field doesn’t already end in a period or other punctuation.
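To try this pattern outside MarcEdit, here is a minimal Python sketch. Python's `re` module is used as a stand-in for the .NET engine MarcEdit relies on, so the replacement uses `\g<1>` where MarcEdit uses `$1`; the sample fields are invented. The pattern is applied to one field at a time, which mirrors MarcEdit's default behavior.

```python
import re

# Same pattern as the slide: tag, two blanks, two indicators, then field
# data that must end in a word character.
FIND = re.compile(r"(=500  ..)(.*[^\W]$)")

# Invented sample fields in mnemonic format ("\" marks a blank indicator).
fields = [
    r"=245  10$aA title",
    r"=500  \\$aNote without a period",
    r"=500  \\$aNote with a period.",
]

# Python writes group references as \g<1>, \g<2> instead of $1, $2.
fixed = [FIND.sub(r"\g<1>\g<2>.", field) for field in fields]
```

Only the second field changes: the 245 never matches the tag, and the third field already ends in a period, so the final character class fails.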
Example 2
Add online resource information to the 300 field
Example:
◦ Change: 300 \\$a 32 p.
◦ To: 300 \\$a1 online resource (32 p.)
Explanation:
◦ (=300  ..)
  ◦ Searches for the 300 field. We leave two blanks because there are always 2 blank characters after the tag in the mnemonic format. The two periods stand for any character (here, the two indicators). If you want to search for exact indicators, place those values rather than the periods.
◦ (?<one>\$a)([^$]*)
  ◦ Capture the $a (as the named group “one”) and then all data in the subfield until you get to the next subfield (if there is one)
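Put together, the two pieces above could be used as in the following Python sketch. The slide does not give the full Find What or Replace With values, so the combined pattern and the replacement string here are assumptions. Python also spells named groups `(?P<one>...)` where .NET accepts `(?<one>...)`.

```python
import re

# Assumed combination of the two explained pieces, in Python syntax.
FIND = re.compile(r"(=300  ..)(?P<one>\$a)([^$]*)")

# Assumed replacement: keep the tag and $a, wrap the old extent in
# "1 online resource (...)".
REPLACE = r"\g<1>\g<one>1 online resource (\g<3>)"

field = r"=300  \\$a32 p."  # invented sample field
updated = FIND.sub(REPLACE, field)
```

Because `[^$]*` stops at the next subfield marker, any trailing punctuation that belongs to the $a (such as " ;" before a $c) would end up inside the parentheses, so real data may need a slightly more careful pattern.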
Example 3
Split the 856 into two fields, breaking on the $u.
◦ Find What: (=856.{4})(\$u.*[^$])(\$u.*)
  ◦ (=856.{4})
    ◦ Matches the 856 field (the tag, two blanks, and two indicators)
  ◦ (\$u.*[^$])
    ◦ Match the first $u, but stop at the end of that subfield
  ◦ (\$u.*)
    ◦ Match the remainder of the field
◦ Replace With: $1$2\n=856  41$3
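The same split can be tried in Python as a rough check of the pattern. As before, Python's `re` stands in for the .NET engine, group references are written `\g<n>`, and the sample field with its example.org URLs is invented.

```python
import re

FIND = re.compile(r"(=856.{4})(\$u.*[^$])(\$u.*)")

# \n in the replacement template becomes a real line break, starting a new
# 856 field that carries the second $u.
REPLACE = r"\g<1>\g<2>\n=856  41\g<3>"

field = r"=856  41$uhttp://example.org/a$uhttp://example.org/b"
split_fields = FIND.sub(REPLACE, field).split("\n")
```

The greedy `.*` in the second group first runs to the end of the field, then backtracks until the third group can match a `$u`, which is how the break lands on the last `$u` in the field.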
lcase/ucase
MarcEdit’s regular expression engine includes two extension functions for dealing with case switching of characters.
◦ lcase & ucase
◦ Usage: (=450.{4})(\$a.)(.*)
◦ $1$2lcase($3)
◦ Example: Find the 500 with all upper case characters and convert the case of all values but the first letter in the sentence to lower case.
Example (lcase)
Find the 500 with all upper case characters and convert the case of all values but the first letter in the sentence to lower case.
◦ Find What: (=500.{4})(\$a.)([A-Z .]*)
◦ Replace With: $1$2lcase($3)
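`lcase` and `ucase` are MarcEdit extensions, so other regex engines have no direct equivalent. In Python the same effect can be approximated with a replacement function, as in this sketch (the sample field is invented):

```python
import re

FIND = re.compile(r"(=500.{4})(\$a.)([A-Z .]*)")


def lower_tail(match):
    """Keep the tag, indicators, $a, and first letter; lower-case the rest,
    mimicking the $1$2lcase($3) replacement."""
    return match.group(1) + match.group(2) + match.group(3).lower()


field = r"=500  \\$aTHIS IS A NOTE."
converted = FIND.sub(lower_tail, field)
```

The second group, `(\$a.)`, is what protects the first letter: it captures the subfield code plus one character, leaving only the remainder for the case change.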
Multi-Field Replacements
By default, MarcEdit handles one field at a time when doing regular expressions.
◦ However, when you need to do evaluations against multiple fields, you can do so by adding /m to the end of your replacement in the Replace Function in the MarcEditor
◦ This is a special function added to the MarcEdit regular expression engine
Example
Using regex_example.mrk
Changing video disc to Blu-ray in the 300 if the 538 is marked as Blu-ray
Multi‐Line Example
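The slides show this example as a screenshot, so the exact expression is not reproduced here. As a rough Python analogue of a multi-field evaluation, the sketch below runs one pattern over the whole record text so that the 300 is changed only when a later 538 mentions Blu-ray. The pattern, the replacement wording, and the sample records are all assumptions, and `re.DOTALL` plays the role that /m plays in MarcEdit by letting the match cross field boundaries.

```python
import re

# Match the 300 $a up to "videodisc", then lazily scan across line breaks
# until a 538 whose $a starts with "Blu-ray".
FIND = re.compile(
    r"(=300.{4}\$a\d+ )videodisc(.*?=538.{4}\$aBlu-ray)",
    re.DOTALL,
)

bluray_record = "\n".join([
    r"=245  10$aA film",
    r"=300  \\$a1 videodisc (120 min.)",
    r"=538  \\$aBlu-ray.",
])
dvd_record = "\n".join([
    r"=245  10$aAnother film",
    r"=300  \\$a1 videodisc (95 min.)",
    r"=538  \\$aDVD.",
])

changed = FIND.sub(r"\g<1>Blu-ray disc\g<2>", bluray_record)
unchanged = FIND.sub(r"\g<1>Blu-ray disc\g<2>", dvd_record)
```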
Placeholder
Are there specific editing tasks that folks are interested in?
We can talk about these now
Questions
Integrations
ILS Integration
ILS Integration currently supports direct integration with Koha, Alma, and a local option.
Are other integrations possible?
◦ http://blog.reeset.net/archives/2133
Let’s talk about Alma Integration
How MarcEdit Works with Alma
MarcEdit works through the following API endpoints:
◦ https://developers.exlibrisgroup.com/alma/apis/bibs
◦ Because the API is rate limited (i.e., you can only process so many transactions concurrently through the API, and all Alma operations use the API), MarcEdit limits API processes to a single thread. It takes a little longer, but eliminates the possibility that using MarcEdit to automate workflows will bring down your system because the tool is trying to communicate with the system too quickly.
Through this API, MarcEdit can:
◦ Edit holdings data (and holdings records)
◦ Create and update bibliographic data
◦ Extract records
  ◦ Though discovery should be done via Z39.50 or SRU (which is preferred)
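The single-thread design can be illustrated with a short Python sketch: all jobs go through one worker, so no two API calls are ever in flight at once. This is a generic pattern, not MarcEdit's code; `send` is a hypothetical stand-in for whatever function performs one Alma API call.

```python
import queue
import threading


def run_serially(jobs, send):
    """Process jobs one at a time on a single worker thread and return the
    results in submission order."""
    pending = queue.Queue()
    results = []

    def worker():
        while True:
            job = pending.get()
            if job is None:      # sentinel: no more work
                break
            results.append(send(job))

    thread = threading.Thread(target=worker)
    thread.start()
    for job in jobs:
        pending.put(job)
    pending.put(None)
    thread.join()
    return results
```

The trade-off is the one the slide names: throughput is lower than with parallel requests, but the tool cannot exceed the concurrency the API allows.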
Working with OCLC Connexion
https://youtu.be/a7Cen0gxFCw?list=PLrHRsJ91nVFScJLS91SWR5awtFfpewMWg
Working with OCLC’s Metadata API
MarcEdit can work directly with WorldCat via the Metadata API.
MarcEdit: Batch WorldCat Holdings Management
MarcEdit: Batch Bibliographic Record Upload
More Information
OCLC’s Developer Network:
◦ http://oclc.org/developer/
OCLC Metadata API Documentation:
◦ http://oclc.org/developer/services/worldcat-metadata-api
Notes on MarcEdit Integration:
◦ http://blog.reeset.net/archives/1245
C# OCLC API Library
◦ https://github.com/reeset/oclc_api