
Advanced Tooling in MarcEdit

TERRY REESE

THE OHIO STATE UNIVERSITY

[email protected]

Data and Slides

Download @: http://marcedit.reeset.net/workshops/um_marcedit7.zip

Download, open, and extract, saving to your desktop (or wherever)

MarcEdit 7!

MarcEdit 7 was released over the U.S. Thanksgiving holiday.

The release:
1. Was in development for close to 9 months, with ~20 testers in 7 countries using 4 different MARC flavors providing direct feedback
2. Touched nearly every part of the program – when finished, the release had updated a shade under 350,000 lines of code
3. Was tested against almost 20 million records
4. Is the first version of MarcEdit designed with accessibility in mind
5. Is fast (I’m going to show you a few places where)

MarcEdit 7 highlights

Lightweight clustering has been added directly into the program

New way to process XML/JSON data

A new linked data engine, with support for locally defined RDF vocabularies in reconciliation

New task processing

Consolidated Z39.50/SRU client

Added Editing Functions
• New Add/Delete Field Tools (deduplication)
• Expanded Regular Expression options
• Updated OCLC Integrations

Integrated Help

Today’s topics

Quick overview of MarcEdit 7 Changes

Explore MarcEdit 7’s new Clustering Functionality

Working with non-MARC data using known and unknown metadata formats

Explore MarcEdit 7’s Linked Data Platform

MarcEdit Regular Expressions Primer

Integration opportunities with
• Alma or other ILS Systems
• OCLC
• Connexion

Let’s look at what’s new

Welcome to Project Hazel, your friendly (and sometimes helpful) installation agent
◦ Hazel is there to help highlight important options, and to make sure you can work with Unicode data by checking that you have a Unicode font.

Accessibility
◦ MarcEdit 7 includes an improved font/sizing engine for better layout on different screen sizes and resolutions
◦ All images are tagged with text and are accessible via screen readers or the operating system’s accessibility tooling
◦ Availability of themes, to allow you to customize windowing and contrast to ease eye strain
◦ Keyboard shortcuts (everywhere)
◦ Sound cues
◦ Window transparency

Let’s look at what’s new

More International
◦ MarcEdit 7 uses an intelligent machine translation service, providing an interface in close to 26 languages at this point

It’s Faster
◦ Lists have been virtualized (lower overhead)
◦ Pages load quicker
◦ Tasks have been super-charged

It’s leaner – in part because Windows XP support is no longer provided

Let’s look at what’s new

Program is easier to manage
◦ The program has 4 installation modes
  ◦ 32-bit Administrator and non-Administrator installation modes
  ◦ 64-bit Administrator and non-Administrator installation modes
◦ How do I choose?
  ◦ Depends on your needs: http://marcedit.reeset.net/downloads

Let’s talk about task changes

How they worked in MarcEdit 6

Task Changes

HOW THEY WORK IN MARCEDIT 7

So what does that mean to me?

In MarcEdit 6, the optimal task size was ~20 operations or fewer. As the operation count grew, the time needed to process data grew exponentially. In MarcEdit 7, that performance line actually goes the other direction: the more task actions there are to complete, the faster the tool processes records and the more records it handles per second.

Real‐World Example

A library in Greece has a task list with over 1,000 task actions. They use this task to clean up large portions of their database in one pass. Generally, this means processing ~300,000-500,000 records at a time. In MarcEdit 6, this process would take as many as 10 hours to complete. Using the MarcEdit 7 task processing, this process now takes less than 20 minutes.
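To put those numbers side by side, here is a small back-of-the-envelope calculation. It uses only the figures quoted above (10 hours, 20 minutes, 300,000-500,000 records); the function name is just for illustration, and because both times are stated as upper bounds, the results are rough.

```python
# Figures come from the example above; the arithmetic is only illustrative.
def records_per_second(records: int, seconds: float) -> float:
    return records / seconds

ME6_SECONDS = 10 * 60 * 60   # "as many as 10 hours"
ME7_SECONDS = 20 * 60        # "less than 20 minutes"

speedup = ME6_SECONDS / ME7_SECONDS  # how many times shorter the run is

for records in (300_000, 500_000):
    me6 = records_per_second(records, ME6_SECONDS)
    me7 = records_per_second(records, ME7_SECONDS)
    print(f"{records:,} records: {me6:.1f}/s in MarcEdit 6, {me7:.0f}/s in MarcEdit 7")
print(f"Roughly {speedup:.0f}x faster")
```

In other words, the same job goes from roughly 8-14 records per second to at least 250-417 records per second.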

But seeing is believing

Let’s compare processing using one record, but with a task list that uses north of 100 task actions, in MarcEdit 6 and MarcEdit 7.

Other comparisons

•Comparing the Extract Selected Records Tooling

Virtual Lists

•Loading large data files

Loading Files

MarcEdit 7 continues growing

Near-term planned additions
◦ Completion of the MarcEdit Mac 3.0 upgrade (to include the new functionality)
◦ New plugins for individual record creation templates
◦ Support for HDT and linked data fragments (this is awesome stuff)
◦ Additional clustering algorithms

Clustering in MarcEdit 7

How people clustered MARC data in the past
1. Export the fields under investigation into a tab-delimited format
2. Import into OpenRefine
3. Cluster the data
4. Make edits
5. Export the delimited data out of OpenRefine
6. Develop a process to merge the changed data back into MARC

If you need to have your data start or end up in MARC, working with OpenRefine can be challenging because there isn’t a natural process to move between these two formats.

Clustering in MarcEdit 7

MarcEdit’s built-in clustering tools support native grouping and batch editing, and work well on files of a million records or fewer (they can work on larger sets, but the larger the file, the longer the cluster operation takes)

Clustering Options

Clustering Algorithms
◦ Levenshtein Distance
  ◦ This algorithm is best for people, places, and subjects
  ◦ This algorithm builds clusters based on the number of character positions that differ between words or phrases
  ◦ This algorithm is generally faster
◦ Composite Coefficient
  ◦ This algorithm is best for highly variable data where a great deal of fuzziness is desired.
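To make the Levenshtein option concrete, here is a minimal sketch of edit-distance clustering. It is not MarcEdit's implementation: the greedy grouping strategy, the `max_distance` threshold, and the sample headings are all assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn a into b (classic dynamic-programming form)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster(values, max_distance=2):
    """Greedy clustering: each value joins the first cluster whose
    representative (its first member) is within max_distance edits."""
    clusters = []
    for value in values:
        for group in clusters:
            if levenshtein(value.lower(), group[0].lower()) <= max_distance:
                group.append(value)
                break
        else:
            clusters.append([value])
    return clusters


# Made-up headings with typical keying errors.
headings = ["Twain, Mark", "Twain, Marc", "Twian, Mark", "Austen, Jane"]
print(cluster(headings))
```

The three Twain variants land in one group and the Austen heading in another, which is the kind of grouping you would then review and batch edit.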

Clustering Changes

Clustered changes are queued and stacked. Changes happen once all edits have been set.

Clustered changes can be made by group, across groups, or selected items within a group

Clustering Enhancements

Things I’m thinking about:
◦ Enabling clustering support to be run on non-MARC data

I’d like to hear your ideas as well

XML Conversions

MarcEdit: crosswalking design

MarcEdit model:
◦ So long as a schema has been mapped to MARCXML, any metadata combination can be used. This means that no more than two transformations will ever take place. Example: MODS → MARCXML → EAD
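The hub model above can be sketched in a few lines. This is an illustration of the idea only: the `MAPPED` set and the `crosswalk_path` function are hypothetical names, not part of MarcEdit.

```python
HUB = "MARCXML"
# Hypothetical registry: formats that have a stylesheet to and from the hub.
MAPPED = {"MODS", "EAD", "FGDC", "Dublin Core"}


def crosswalk_path(source: str, target: str) -> list[str]:
    """Return the chain of formats a record passes through."""
    for fmt in (source, target):
        if fmt != HUB and fmt not in MAPPED:
            raise ValueError(f"{fmt} has no mapping to {HUB}")
    if source == target:
        return [source]
    path = [source]
    if source != HUB:
        path.append(HUB)      # first transformation: into the hub
    if target != HUB:
        path.append(target)   # second transformation: out of the hub
    return path


print(crosswalk_path("MODS", "EAD"))
```

Because every path goes through the hub at most once, the number of transformations (path length minus one) never exceeds two, and adding a new format only requires mapping it to MARCXML.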

MarcEdit Crosswalking model

[Diagram: MARC21XML, EAD, FGDC, MODS, MARC, Dublin Core]

MarcEdit: Crosswalks for everyone

What’s MarcEdit doing?
◦ Facilitates the crosswalk by:
1. Performing character translations (MARC8-UTF8)
2. Facilitating interaction between binary and XML formats

Setting up Crosswalks

XML Function Wizard

The wizard was created to help fill a gap – to enable metadata crosswalking when a user doesn’t have a lot of expertise building XSLT or XQuery transformations

OAI Harvesting

MarcEdit’s OAI Harvester can run in two modes
◦ User Initiated
◦ Scheduled

Let’s look at both!

OAI Harvesting – User Initiated

Harvesting supports the following verbs
◦ GetRecord
◦ ListRecords
◦ ResumptionToken (the paging argument used to continue a ListRecords request)

Any metadataPrefix can be accommodated, but by default, the tool has XSLT crosswalks for:
◦ MARCXML
◦ OAIMARC
◦ Dublin Core
◦ MODS
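For readers who have not worked with OAI-PMH directly, here is a minimal sketch of how a ListRecords harvest pages through results with a resumption token. It only builds request URLs and reads the token out of a response, so it runs without a network; the base URL, the token value, and the function names are placeholders, and this is not how MarcEdit itself is written.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request. In OAI-PMH a resumptionToken is sent
    on its own, without repeating metadataPrefix."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml):
    """Return the resumption token in a response, or None on the last page."""
    root = ET.fromstring(response_xml)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


# A trimmed, made-up response containing only the paging element.
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first = list_records_url("https://example.org/oai")
following = list_records_url("https://example.org/oai", resumption_token=next_token(sample))
print(first)
print(following)
```

A harvester keeps requesting the next page until the response no longer carries a token.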

OAI Harvesting – Scheduled

Using Task Scheduler on Windows, cron on Linux, or the equivalent on macOS, you can create harvesting jobs and schedule them for regular harvest

Working with Linked Data In MarcEdit

Objects not strings

Probably the biggest reason people talk about linked data is the notion of moving from strings to objects

Strings


Objects

Objects Not Strings

URIs provide actionable data
◦ Controlled terms can be updated without user intervention (generally)
◦ And URIs can provide access to more information
  ◦ E.g., a URI to VIAF provides access not just to author information, but to all their related works and collaborators as well.

So why aren’t we doing this already?

Great question!
◦ We aren’t ready

So why aren’t we doing this already?

1. Changing strings to objects is hard and expensive
2. We have some folks, like OCLC, that could be in a position to help us, but our current systems are not set up to use (and in some cases, store) the data.
3. Many of our controlled vocabularies are not designed to support reconciliation work
  • And those that are aren’t production ready
  • Or are proprietary

So what can we do right now?

A lot –
◦ Many of the large national services are making resources and infrastructure available to enable libraries to begin doing this work
◦ OCLC has been largely supportive, and provides its own tools that output linked data content
◦ We can start lobbying our systems to not just store the data, but make use of the information when provided
◦ We can start the reconciliation process (because this process takes time)

MarcEdit and Linked Data

MarcEdit 6 and 7 include a linked data platform – an integration platform that enables MarcEdit to work with various linked data services, and provides a way to build new services around this functionality
◦ Designed to support RDF, JSON-LD, SPARQL – and a wide range of library-specific services currently providing one-off access to controlled data
◦ The framework has been utilized in MarcEdit for the development of a toolset called MARCNext

MARCNext

These are experimental services that allow catalogers to play with their data and visualize it through the BIBFRAME lens – as well as begin the process of turning strings into objects.

Linked Data Tool

Linked data tool enables reconciliation services

Works from a rules file, which enables users to customize the output provided
◦ MarcEdit 7 provides a rules file optimized for MARC21, but I have rules files being tested for a number of MARC formats (including UNIMARC)

Currently supports the insertion of $0 and $1 into bibliographic and authority data

Includes support for ~25 remote linked data endpoints

Can use local RDF files as locally mounted SPARQL stores

Allows for targeted, or automatic processing
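As a rough picture of what "inserting a $0" means for a single field, here is a minimal sketch that works on MarcEdit's mnemonic text format. It assumes a plain dictionary in place of a real reconciliation service and a rules file; the `add_uri` function, the lookup table, and the example.org URI are all hypothetical.

```python
import re

# Hypothetical lookup standing in for a linked data service (placeholder URI).
KNOWN_URIS = {"Libraries": "http://example.org/subjects/libraries"}


def add_uri(field: str, uris: dict) -> str:
    """Append $0 with a URI when the $a heading is known and no $0 exists yet."""
    if "$0" in field:
        return field                      # already reconciled
    match = re.search(r"\$a([^$]*)", field)
    if not match:
        return field
    heading = match.group(1).rstrip(" .,")  # ignore trailing punctuation
    uri = uris.get(heading)
    return f"{field}$0{uri}" if uri else field


before = r"=650  \0$aLibraries."
after = add_uri(before, KNOWN_URIS)
print(after)
```

Fields with unknown headings, or fields that already carry a $0, pass through unchanged.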

Does this currently scale?

I get asked this question because the Library of Congress actively throttles data requests made against their service. So too do many other service providers. They have to; it’s a method of self-preservation. When I test-reconciled Ohio State’s entire database (~6 million records), I estimated that I would end up making, on the low end, 48,000,000 requests just to the Library of Congress. Over a very short period, that’s a lot of requests, and it can overwhelm their services.

However…
◦ I work closely with many of the large data providers, and they give MarcEdit some leeway because:
  ◦ MarcEdit follows some established patterns. In LC’s case, they can provide an HTTP status code that lets the application know that their service is under load, and MarcEdit will start slowing down requests for a specified time period.
  ◦ MarcEdit does its own internal caching, so an item is only retrieved once per reconciliation session. Using this method, I can likely cut the number of requests to a service like LC by a third or more. In fact, the more data that’s processed, the faster it goes and the fewer requests it has to make to the source vocabulary
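The two patterns described above (back off when the service signals load, and cache each term for the session) can be sketched as follows. For scale, 48,000,000 requests over ~6 million records works out to about 8 lookups per record. This is an illustration, not MarcEdit's code: the class name, the 30-second pause, and the choice of 429/503 as the "under load" status codes are assumptions, and the fetch function is a fake so the example runs offline.

```python
import time


class ThrottledLookup:
    def __init__(self, fetch, pause_seconds=30.0, sleep=time.sleep):
        self.fetch = fetch                # callable: term -> (status_code, uri_or_None)
        self.pause_seconds = pause_seconds
        self.sleep = sleep
        self.cache = {}                   # term -> uri, kept for the whole session
        self.requests_made = 0

    def resolve(self, term, max_attempts=3):
        if term in self.cache:            # repeated headings cost no request
            return self.cache[term]
        for _ in range(max_attempts):
            self.requests_made += 1
            status, uri = self.fetch(term)
            if status in (429, 503):      # assumed "service under load" signals
                self.sleep(self.pause_seconds)
                continue
            self.cache[term] = uri
            return uri
        return None                       # not cached, so it can be retried later


calls = []

def fake_fetch(term):
    """Stand-in for a remote service: busy on the first call, fine afterwards."""
    calls.append(term)
    if len(calls) == 1:
        return 503, None
    return 200, f"http://example.org/{term}"


lookup = ThrottledLookup(fake_fetch, sleep=lambda seconds: None)  # skip real waiting
for heading in ["Ohio", "Ohio", "Ohio", "Greece"]:
    lookup.resolve(heading)
print(lookup.requests_made)
```

Four lookups cost three requests here (one of them a retry); in a real file, where the same headings recur across many records, the cache does most of the work.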

We can build new services

USING LINKED DATA TOOLS FOR HEADING VALIDATION

Validate Headings

How it works
◦ Working directly with the U.S. Library of Congress – MarcEdit queries the NACO and SACO headings directly
  ◦ Returning information about URIs and variants/changes
◦ MarcEdit then generates a report, automatically corrects headings (when possible), and can generate brief authority records or download the existing authority record

Questions

Again – I would like to hear from you.

I’ve been working with members of the PCC task force looking at how we embed linked data into MARC records (and outside of MARC records), and I’ve been actively building these tools into MarcEdit (both for research and production).

How would you like to work with linked data record sets in your library?

What could MarcEdit do to make this easier for you?

MarcEdit Regular Expression Primer

MarcEdit Regular Expression Support

Functions that presently support regular expressions
◦ Delete Field
◦ Edit Field
◦ Copy Field
◦ Swap Field
◦ Build New Field
◦ Validation
◦ Extract/Delete Selected Records

Expression Scope

Deciding which function to use depends on the scope of data needing to be evaluated
◦ Add/Delete Field – Regular Expressions have access to the entire field (from the “=” to the end of line (EOL))
◦ Edit Subfield – Regular Expressions have access from the subfield code to the end of the subfield
◦ Edit Field – Regular Expression has access to all subfield data, but *not* indicator data
◦ Edit Indicators – only access to indicator data
◦ Copy Field – Regular Expression has access to indicator data + all subfield data
◦ Replace Function – Regular Expression has access to all record data
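One way to picture these scopes is to slice a single mnemonic line into the pieces each function sees. The sketch below is an approximation for illustration only: the `scopes` function and its parsing are assumptions about the mnemonic layout (tag, two blanks, two indicators, subfields), it ignores control fields, and it is not how MarcEdit works internally.

```python
import re

# "=" + tag, two blanks, two indicator characters, then the subfield data.
FIELD = re.compile(r"^=(?P<tag>\d{3})  (?P<indicators>..)(?P<subfields>.*)$")


def scopes(line: str) -> dict:
    """Map each editing function to the text its expression is matched against."""
    m = FIELD.match(line)
    if m is None:
        raise ValueError("not a mnemonic data field")
    return {
        "Add/Delete Field": line,
        "Copy Field": m["indicators"] + m["subfields"],
        "Edit Field": m["subfields"],
        "Edit Indicators": m["indicators"],
        "Edit Subfield": re.findall(r"\$[^$]+", m["subfields"]),  # one string per subfield
    }


for function, text in scopes("=245  10$aMarcEdit :$ba primer.").items():
    print(f"{function}: {text}")
```

An expression anchored on "=245" can therefore only match in the functions that see the whole field (or the whole record, in the Replace Function).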

Microsoft’s Regular Expression language

Concepts:
◦ Character escapes
◦ Anchors
◦ Character classes
◦ Grouping
◦ Quantifiers
◦ Substitutions

Let’s open Regular Expression Language - Quick Reference.html or https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx

How we use Regular Expressions in MarcEdit

The most important parts of the regular expression language are:
1. Character escapes: \d \r \n \$ \x##
2. Character classes: [] and [^]
3. Grouping elements: ()
4. Anchors: ^ $
5. Quantifiers: * ? + {#}
6. Substitutions: $#

Examples

Looking at regex_example.mrk using the replace function:
◦ Add a period to the 500 if it is missing
◦ Update the 300 to reflect electronic information
◦ Split the 856 into two fields, breaking on the $u.

Example 1
◦ Add a period to the 500 if it is missing
◦ Find What: (=500  ..)(.*[^\W]$)
◦ Replace With: $1$2.

Explanation:
◦ (=500  ..)
  ◦ Searches for the 500 field. We leave two blanks after the tag because the mnemonic format always has two blank characters there. The two periods stand for any character (here, the two indicators). If you want to search for exact indicators, place those values rather than the periods.
◦ (.*[^\W]$)
  ◦ Take any characters, and match only when the last character in the field is a word character (a letter, digit, or underscore), so a field that already ends in a period or other punctuation is left alone.
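If you want to try this expression outside MarcEdit, here is a minimal sketch of the same find/replace in Python. The pattern is written with two blanks after the tag, as the explanation describes; note that Python writes substitutions as \1 and \2 where the .NET syntax used by MarcEdit writes $1 and $2. The sample note text is made up.

```python
import re

# Same pattern as "Find What"; two blanks follow the tag in the mnemonic format.
pattern = re.compile(r"(=500  ..)(.*[^\W]$)")


def add_period(line: str) -> str:
    # \1\2. is the Python spelling of MarcEdit's "$1$2."
    return pattern.sub(r"\1\2.", line)


print(add_period(r"=500  \\$aIncludes index"))
print(add_period(r"=500  \\$aIncludes index."))
```

The first line gains a period; the second already ends in punctuation, so the expression does not match and the line is returned unchanged.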

Example 2

Add online resource information to the 300 field

Example:
◦ Change: 300 \\$a 32 p.
◦ To: 300 \\$a1 online resource (32 p.)

Explanation:
◦ (=300  ..)
  ◦ Searches for the 300 field. We leave two blanks after the tag because the mnemonic format always has two blank characters there. The two periods stand for any character (here, the two indicators). If you want to search for exact indicators, place those values rather than the periods.
◦ (?<one>\$a)([^$]*)
  ◦ Capture the $a, and then all data in the subfield until you get to the next subfield (if there is one)
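The slide does not spell out the full find/replace pair for this example, so the following is one possible reading, sketched in Python. The replacement string is an assumption, and two syntax differences from MarcEdit's .NET engine apply: Python spells the named group (?P<one>...) and refers to it as \g<one>. The sample field has no blank after the $a.

```python
import re

# 300 field, two blanks, two indicators, then $a and its contents.
pattern = re.compile(r"(=300  ..)(?P<one>\$a)([^$]*)")


def add_online_resource(line: str) -> str:
    # Keep the field prefix and the $a code, then wrap the old extent in parentheses.
    return pattern.sub(r"\1\g<one>1 online resource (\3)", line)


print(add_online_resource(r"=300  \\$a32 p."))
```

Because ([^$]*) stops at the next subfield code, only the contents of the $a are wrapped.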

Example 3

Split the 856 into two fields, breaking on the $u.
◦ Find What: (=856.{4})(\$u.*[^$])(\$u.*)
◦ (=856.{4})
  ◦ Matches the 856 field
◦ (\$u.*[^$])
  ◦ Match the first $u, stopping at the end of that subfield
◦ (\$u.*)
  ◦ Match the remainder of the field
◦ Replace With: $1$2\n=856  41$3
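Here is a minimal sketch of the same split in Python, useful for checking how the three groups divide a field. The URLs are made up, the new field is written with two blanks after the tag as the mnemonic format expects, and \1, \2, \3 are Python's spelling of $1, $2, $3.

```python
import re

pattern = re.compile(r"(=856.{4})(\$u.*[^$])(\$u.*)")


def split_856(line: str) -> str:
    # Group 1: tag, blanks, and indicators; group 2: first $u; group 3: the rest.
    # The second field gets hard-coded indicators 4 and 1, as in the example.
    return pattern.sub(r"\1\2\n=856  41\3", line)


field = "=856  41$uhttp://example.org/a$uhttp://example.org/b"
print(split_856(field))
```

A field with only one $u does not match the expression and is left as it is.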

lcase/ucase

MarcEdit’s regular expression engine includes two extension functions for switching the case of characters.
◦ lcase & ucase
◦ Usage: (=450.{4})(\$a.)(.*)
◦ $1$2lcase($3)
◦ Example: Find the 500 with all upper case characters and convert the case of all values but the first letter in the sentence to lower case.

Example (lcase)

Find the 500 with all upper case characters and convert the case of all values but the first letter in the sentence to lower case.
◦ Find What: (=500.{4})(\$a.)([A-Z .]*)
◦ Replace With: $1$2lcase($3)
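lcase and ucase are MarcEdit extensions, so other regular expression engines need a different route to the same result. As a minimal sketch, Python can do it with a replacement function; the sample note is made up.

```python
import re

pattern = re.compile(r"(=500.{4})(\$a.)([A-Z .]*)")


def sentence_case(line: str) -> str:
    # Equivalent of "$1$2lcase($3)": keep the field prefix and the first
    # letter after $a, and lower-case the run of capitals that follows.
    return pattern.sub(lambda m: m.group(1) + m.group(2) + m.group(3).lower(), line)


print(sentence_case(r"=500  \\$aTHIS IS A NOTE."))
```

Note that the third group only spans capitals, blanks, and periods, so it stops at the first digit, comma, or lower-case letter.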

Multi-Field Replacements

By default, MarcEdit handles one field at a time when doing regular expressions.
◦ However, when you need to evaluate multiple fields, you can do so by adding /m to the end of your replacement in the Replace Function in the MarcEditor
◦ This is a special function added to the MarcEdit regular expression engine

Example

Using regex_example.mrk

Changing video disc to Blu-ray in the 300 if the 538 is marked as Blu-ray
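To show the logic of a cross-field edit, here is a minimal sketch in Python that treats the whole record as one string. It is a two-step check-then-replace rather than MarcEdit's single expression with /m, and the record text, the word "videodisc", and the function name are assumptions made for the illustration.

```python
import re


def mark_bluray(record: str) -> str:
    """Rewrite the 300 only when a 538 in the same record mentions Blu-ray."""
    has_bluray_538 = re.search(r"^=538  ..\$a.*Blu-ray", record,
                               flags=re.MULTILINE | re.IGNORECASE)
    if not has_bluray_538:
        return record
    # ^ matches at each line start in MULTILINE mode; "." never crosses a line.
    return re.sub(r"^(=300  ..\$a.*?)videodisc", r"\1Blu-ray disc", record,
                  flags=re.MULTILINE)


record = "\n".join([
    r"=300  \\$a1 videodisc (120 min.)",
    r"=538  \\$aBlu-ray.",
])
print(mark_bluray(record))
```

A record whose 538 says something else (or that has no 538) comes back unchanged.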

Multi‐Line Example

Placeholder

Are there specific editing tasks that folks are interested in?

We can talk about these now

Questions

Integrations

ILS Integration

ILS Integration currently supports direct integration with Koha, Alma, and a local option.

Are other integrations possible?
◦ http://blog.reeset.net/archives/2133

Let’s talk about Alma Integration

How MarcEdit Works with Alma

MarcEdit works through the following API endpoints:
◦ https://developers.exlibrisgroup.com/alma/apis/bibs
◦ Because the API is rate limited (i.e., you can only process so many transactions concurrently through the API, and all Alma operations use the API), MarcEdit limits API processes to a single thread. It takes a little longer, but eliminates the possibility that using MarcEdit to automate workflows will bring down your system because the tool is trying to communicate with the system too quickly.
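The single-thread choice can be pictured with a one-worker pool: jobs are queued and sent one at a time, so the tool never has more than one request in flight. This is a generic sketch, not MarcEdit's code and not the Alma API; `fake_send` and the job names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def process_serially(jobs, send):
    """Run send(job) for every job on one worker thread, in order."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        return list(pool.map(send, jobs))


seen_threads = set()

def fake_send(job):
    """Stand-in for an API call; records which thread did the work."""
    seen_threads.add(threading.current_thread().name)
    return f"updated {job}"


results = process_serially(["bib-1", "bib-2", "bib-3"], fake_send)
print(results, len(seen_threads))
```

Raising max_workers would finish sooner, but concurrent requests are exactly what a rate-limited API counts against you.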

Through this API, MarcEdit can:
◦ Edit holdings data (and holdings records)
◦ Create and update bibliographic data
◦ Extract records
  ◦ Though discovery should be done via Z39.50 or SRU (which is preferred)

Working with OCLC Connexion

https://youtu.be/a7Cen0gxFCw?list=PLrHRsJ91nVFScJLS91SWR5awtFfpewMWg

Working with OCLC’s Metadata API

MARCEDIT CAN WORK DIRECTLY WITH WORLDCAT VIA THE METADATA API.

MarcEdit: Batch WorldCat Holdings Management

MarcEdit: Batch Bibliographic Record Upload

More Information

OCLC’s Developer Network:
◦ http://oclc.org/developer/

OCLC Metadata API Documentation:
◦ http://oclc.org/developer/services/worldcat-metadata-api

Notes on MarcEdit Integration:
◦ http://blog.reeset.net/archives/1245

C# OCLC API Library
◦ https://github.com/reeset/oclc_api
