The Cost of Archiving: The AILLA Perspective Susan Smythe Kung, PhD skung@austin.utexas.edu 3rd...


The Cost of Archiving: The AILLA Perspective

Susan Smythe Kung, PhD
skung@austin.utexas.edu

www.ailla.utexas.org

3rd INNET Conference: “Costing and sustainable funding of endangered language archives”

April 29, 2014

Increased number of deposits due to:

• Increased awareness of need to preserve primary language materials
• Increased awareness of AILLA
• “New” requirement (of US federal funding agencies) for a Data Management Plan (DMP) – NSF requirement since Jan. 2011.

3 Parts

• Part 1 – AILLA’s background
• Part 2 – The costing exercise
• Part 3 – AILLA’s administrative costs

Part 1:

AILLA’s Background

• AILLA is a digital repository of multimedia resources in and about the indigenous languages of Latin America. It is a small, special collection within the Benson Latin American Collection at UT-Austin.
• The collections consist of
  • linguistic primary source field data such as field notes, audio and video recordings, photos and sketches in a wide range of genres (stories, myths, chants, songs, conversations, prayers, rituals, etc.)
  • analyzed data such as grammars, dictionaries, ethnographies, and manuscripts.

AILLA’s Mission:
• Preservation: To preserve irreplaceable materials in and about indigenous languages of Latin America, especially primary source field data of the type that has traditionally not been publicly available.
• Access: To make these materials and/or their metadata available to everyone, especially indigenous people, over the Internet.

History:
• Founded as a joint project between the College of Liberal Arts (COLA) and the University of Texas Libraries (UTL) by
  • Joel Sherzer, Anthropology
  • Anthony Woodbury, Linguistics
  • Mark McFarland, UTL Digital Initiatives
• Project began in 2000 with seed money from COLA.
• Pilot site launched March 2001.
• Permanent site launched Jan. 31, 2003.
• Repository and website upgrade to take place 2015-2017 (we hope!).

Today:
• Jointly supported by COLA and UTL.
• Part of LLILAS Benson Latin American Studies and Collections
• Located inside the Nettie Lee Benson Latin American Collection on the campus of the University of Texas at Austin

AILLA Collection Statistics:
(stats as of August 29, 2014)

• 298 languages
• 22 Latin American countries
• 12,796 resources
• 100,041 media files
• 19,294 audio recordings (6,773 hrs, 14 min, 18 sec)
• 2,373 video recordings (1,215 hrs, 34 min, 23 sec)

AILLA Collection Statistics (cont’d):

• 5,302 digital texts (97,580 pages)
• 38,491 scanned pages
• 4,331 images
• Only 20% restricted access
• 1.8 TB
• 138 depositors
• Over 5,000 registered users from all over the world

AILLA Staff:
• Full-time Manager – Susan Kung (supported by COLA & UTL)
• 2 Graduate Research Assistants, 20 hrs/wk each (supported by grant-funded projects)

Work is also done by:
• UTL Digital Library Services staff, who provide server management and minimal technical support – their salaries do NOT come out of the AILLA budget.
• Undergraduate interns (paid university stipends; independent research credit; volunteer)
• MLIS/MSIS Capstone (thesis) projects
• Volunteers

AILLA’s Costs:

1. Digitization, curation, ingestion – Part 2
2. Data and metadata storage – Part 2
3. Administration – Part 3
4. Software development and maintenance – Not covered here
5. Data and metadata migration – Not covered here

Part 2:

The Costing Exercise

Collection 1: Analog

Collection 2: Born-Digital

Collection 1 Analog Contents:
• 20 audio cassettes, each 60 min. long, in good condition (unknown number of recording events)
• Metadata spreadsheet for recordings on cassettes
• 5 transcriptions, hand-written (unknown # of pages)
• 200 photographs on photo paper + paper list of photo contents
• Collection size (after digitization) = 100 GB

Additional specifications needed at AILLA for Collection 1:
• Q1: How many different speech events are on each tape? Our preference is to separate different speech events into separate resources.
  • I’ll assume there are 3 narratives (of about 10 minutes) per side for a total of (3x2x20) 120 speech events.
• Q2: How long are the transcriptions?
  • I’ll assume they are about 25 pages each, for a total of 125 scanned pages.
• Q3: How many research participants were involved?
  • I’ll assume that there were 10 participants.
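The assumed counts above are simple products. A minimal sketch of the arithmetic follows; the variable names are illustrative only and are not AILLA terminology.

```python
# Assumptions stated on the slide for Collection 1 (analog).
cassettes = 20
sides_per_cassette = 2
narratives_per_side = 3        # assumed: about 10 minutes each
transcriptions = 5
pages_per_transcription = 25   # assumed

# 3 narratives x 2 sides x 20 cassettes
speech_events = narratives_per_side * sides_per_cassette * cassettes
# 5 transcriptions x 25 pages
scanned_pages = transcriptions * pages_per_transcription

print(speech_events)  # 120
print(scanned_pages)  # 125
```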

A resource is AILLA’s term for an organized bundle or set of related files.
A resource might consist of
• just 1 file, e.g., a single mp3 audio file of a recorded narrative, or
• numerous files, e.g., simultaneous audio and video recordings of a speech event, plus an Elan transcription, or a semester’s worth of recorded lectures about indigenous languages, plus the class syllabus and handouts.

Collection 1 Audio: Required Tasks

1. Digitize the cassettes: each side of each cassette = 1 wav file; total wavs = 40; file names = tape1_sideA, etc.
2. Edit the wav files into individual speech events and assign AILLA IDs: assuming 3 speech events per side, total speech events = 3x2x20 = 120 wav files
3. Convert wav (archival) files to mp3 (access) files
4. Add AILLA IDs to the metadata spreadsheet and collect additional metadata about each speech event, e.g., length of wav, recording specifications, original source, etc.
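The naming pattern in task 1 can be sketched in a few lines. This is only an illustration: the function name and the ".wav" extension are assumptions, since the slide gives just the pattern tape1_sideA.

```python
def cassette_side_filenames(n_tapes, sides=("A", "B"), ext=".wav"):
    """Return one file name per cassette side, e.g. tape1_sideA.wav.

    Hypothetical helper; the ".wav" extension is an assumption.
    """
    return [
        f"tape{tape}_side{side}{ext}"
        for tape in range(1, n_tapes + 1)
        for side in sides
    ]


names = cassette_side_filenames(20)
print(len(names))  # 40
print(names[0])    # tape1_sideA.wav
print(names[-1])   # tape20_sideB.wav
```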

Collection 1 Paper Transcriptions: Required Tasks

1. Scan each page and create 5 multi-page tif files. Simultaneously assign AILLA IDs as filenames.

2. Add row to spreadsheet for the AILLA ID & Metadata.

3. Convert tif (archival) files to pdf/a (access) files.

Collection 1 Paper Photos: Required Tasks

1. Scan each photo and create 8 multipage tif files of 25 photos each; assign AILLA IDs.

2. Add AILLA IDs, photo contents from paper list, and other metadata to the metadata (MD) spreadsheet.

3. Convert tif (archival) files to jpg (access) files.

Collection 1: Ingestion Required Tasks

1. Create a collection for the depositor
2. Add all of the research participants to AILLA’s “people database” – assume 10 participants
3. Upload all files to the server (100 GB)
4. These steps are done together, but consecutively:
   • Create 121 AILLA resources (120 speech events & 1 photo resource),
   • Link the relevant files,
   • Enter the metadata & assign access level, and
   • Complete Spanish (or English) translations.

Collection 1: Total One-time Cost = $1,922.34

Task    Audio     Paper Transcription   Paper Photos   Ingestion
1       $535.90   $75.90                $121.90        $11.50
2       $52.90    $1.84                 $9.20          $39.10
3       $92.00    $9.20                 $13.80         $11.50
4       $4.60     0                     0              $943.00
Total   $685.40   $86.94                $144.90        $1005.10
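The column totals and the one-time total can be checked from the per-task figures. A minimal sketch follows, with the amounts copied from the table above; the dictionary layout and names are illustrative only.

```python
from decimal import Decimal

# Per-task costs (tasks 1-4) for each column of the Collection 1 table.
costs = {
    "Audio": ["535.90", "52.90", "92.00", "4.60"],
    "Paper Transcription": ["75.90", "1.84", "9.20", "0"],
    "Paper Photos": ["121.90", "9.20", "13.80", "0"],
    "Ingestion": ["11.50", "39.10", "11.50", "943.00"],
}

# Decimal avoids floating-point rounding when adding currency amounts.
column_totals = {
    name: sum(Decimal(value) for value in values)
    for name, values in costs.items()
}
grand_total = sum(column_totals.values())

for name, total in column_totals.items():
    print(name, total)
print(grand_total)  # 1922.34
```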

Collection 1: Recurring Cost = ???
• Yearly server storage for 100 GB = $66/yr
• Future file conversion when/if archival and access formats change = ???
• Future upgrades of digital repository and asset management software = ???
• Future file and metadata migration when repository and asset management software upgrades = ???

Collection 2 Born-Digital Contents:
• 150 audio wav files, average length = 15 min.
• 20 video mp4 files, average length = 30 min.
• 250 digital images
• 120 eaf files (20 for video, 100 for audio)
• Metadata spreadsheet listing contents of all files
• Collection size = 150 GB

Additional specifications needed at AILLA for Collection 2:

• Q1: How many research participants were involved?
  • Again, I’ll assume that there were 10 participants.
• Q2: What is the file format of the digital images?
  • I’ll assume that it is jpg.

Collection 2: Required Tasks for Digital Collections
• Massage the metadata (study its organization, rearrange as necessary, add missing info)
• Rename files w/ AILLA IDs:
  • Rename audio and video files and add the AILLA IDs to the MD spreadsheet;
  • Match each eaf file to its corresponding audio or video file, assign the appropriate related AILLA ID, rename the file, and rearrange the MD spreadsheet if necessary.
• Create mp3 access copies from the wav files.

Collection 2: Ingestion Required Tasks

1. Create a collection for the depositor
2. Add all of the research participants to AILLA’s “people database” – assume 10 participants
3. Upload all files to the server (150 GB)
4. These steps are done together, but consecutively:
   • Create 171 AILLA resources (150 audio, 20 video & 1 photo resource),
   • Link the relevant files,
   • Enter the metadata & assign access level, and
   • Complete Spanish (or English) translations.

Collection 2: Total One-time Cost = $1,910.15

Task    Digital   Ingestion
1       $23.00    $11.50
2a      $57.50    $39.10
2b      $6.90     NA
2c      $230.00   NA
3a      $5.75     $11.50
3b      $190.90   NA
4       0         $1334.00
Total   $514.05   $1396.10
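As a check on the Collection 2 figures, the two columns can be summed while skipping the "NA" cells. This is a sketch only: the amounts are copied from the table above, and the names and the use of None for "NA" are illustrative choices.

```python
from decimal import Decimal

# (digital cost, ingestion cost) per task; None stands for "NA".
rows = {
    "1": ("23.00", "11.50"),
    "2a": ("57.50", "39.10"),
    "2b": ("6.90", None),
    "2c": ("230.00", None),
    "3a": ("5.75", "11.50"),
    "3b": ("190.90", None),
    "4": ("0", "1334.00"),
}


def column_total(index):
    """Sum one column of the table, ignoring NA cells."""
    return sum(
        Decimal(row[index]) for row in rows.values() if row[index] is not None
    )


digital_total = column_total(0)
ingestion_total = column_total(1)
grand_total = digital_total + ingestion_total

print(digital_total, ingestion_total, grand_total)  # 514.05 1396.10 1910.15
```

On these numbers ingestion accounts for roughly 73% of the one-time cost of the born-digital collection.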

Collection 2: Recurring Cost = ???

• Yearly server storage for 150 GB = $99/yr
• Future file conversion when/if archival and access formats change = ???
• Future upgrades of digital repository and asset management software = ???
• Future file and metadata migration when repository and asset management software upgrades = ???

Price List Categories

1. Digitization of analog media and digital video transfer (all formats except mp4, mpeg, mpg)
2. Curation & organization
3. File conversion
4. Ingestion (file upload, collection creation, participant metadata entry, resource creation and metadata entry)
5. Server storage fees

Category 1: Analog Media and video transfer Part I

Audio cassette tape, 60 min.     $25
Audio cassette tape, 90 min.     $30
Audio open reel tape, 60 min.    $35
Audio open reel tape, 90 min.    $45
Audio open reel tape, 120 min.   $55
Audio minidisk, 60 min.          $25
Audio DAT tape, 60 min.          $25

Category 1: Analog Media and video transfer Part II

Category 2: Curation & Organization

• Metadata handling fee (required for all collections so that we can determine the state and organization of the metadata): $25 flat fee
• Metadata compilation (e.g., from notes written on paper, tape covers, etc.): $25/hour
• File/materials organization: $25/hour
• Digital file renaming: 50¢/file

Category 3: File Splitting and Conversion

• Audio file splitting (only done when there are detailed and specific notes indicating where the split should be made): $10 per wav file created by the split
• Digital audio files, wav to mp3: 5¢/file
• Video file conversions, mpeg/mpg to mp4: Missing info
• Image file conversions to pdf/a (manuscripts) or jpg (images): $2/file
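To show how the per-file prices scale, here is a minimal sketch that applies them to the Collection 1 assumptions (120 wav files created by splitting, 120 mp3 copies, and 5 + 8 multi-page tif files). The function and constant names are hypothetical, the video conversion price is left out because it is missing from the list, and the result illustrates the price list only; it is not a figure from the costing exercise.

```python
from decimal import Decimal

# Per-file prices from the Category 3 list (USD).
SPLIT_PER_WAV = Decimal("10.00")    # per wav file created by a split
WAV_TO_MP3 = Decimal("0.05")        # per file
IMAGE_CONVERSION = Decimal("2.00")  # per file, to pdf/a or jpg


def conversion_fee(split_wavs=0, mp3_copies=0, image_files=0):
    """Total splitting and conversion fee for the given file counts."""
    return (
        split_wavs * SPLIT_PER_WAV
        + mp3_copies * WAV_TO_MP3
        + image_files * IMAGE_CONVERSION
    )


# Collection 1 assumptions: 120 speech events, 5 transcription tifs, 8 photo tifs.
fee = conversion_fee(split_wavs=120, mp3_copies=120, image_files=5 + 8)
print(fee)  # 1232.00
```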

Category 4: Ingestion

• On-line collection creation (for 1st-time depositors or to start a new/different collection; there is no collection fee to add new/more data to an existing collection): $15
• Add participants to the AILLA people database (all research participants MUST be added to the database unless they have chosen to be or are required to be anonymous, or unless the data are very old or poorly identified and the research participants’ names are not known): $4/participant
• Upload all files to AILLA’s server: $25
• Create resources, link the relevant files, enter the metadata, and complete Spanish (or English) translations, IFF Spanish/English translations are included in your metadata: $10/resource
• Create resources, link the relevant files, enter the metadata, and add Spanish/English translations when they are not included in your metadata: $15/resource
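The ingestion prices combine into a simple fee formula. The sketch below is one possible reading of the list, not AILLA's actual billing logic: the function name, parameter names, and the example inputs (10 participants and 121 resources, as assumed for Collection 1) are illustrative.

```python
from decimal import Decimal

# Prices from the Category 4 list (USD).
COLLECTION_CREATION = Decimal("15")
PER_PARTICIPANT = Decimal("4")
UPLOAD = Decimal("25")
PER_RESOURCE_TRANSLATIONS_INCLUDED = Decimal("10")
PER_RESOURCE_TRANSLATIONS_ADDED = Decimal("15")


def ingestion_fee(participants, resources,
                  new_collection=True, translations_included=True):
    """Estimate the ingestion fee for one deposit (hypothetical helper)."""
    per_resource = (
        PER_RESOURCE_TRANSLATIONS_INCLUDED
        if translations_included
        else PER_RESOURCE_TRANSLATIONS_ADDED
    )
    fee = UPLOAD + participants * PER_PARTICIPANT + resources * per_resource
    if new_collection:
        # No collection fee when adding data to an existing collection.
        fee += COLLECTION_CREATION
    return fee


print(ingestion_fee(10, 121))                               # 1290
print(ingestion_fee(10, 121, translations_included=False))  # 1895
```

Under this reading the per-resource charge dominates, so supplying translations with the metadata is the main lever a depositor has for lowering the fee.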

Category 5: Storage Fees

I haven’t quite figured out how to calculate this charge. I think it’s better to charge a flat fee up front (which can be written into a grant budget), but I want to hear the results of our DELAMAN discussion.

100 GB   $66/year
150 GB   $99/year
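Both storage figures imply the same rate of $0.66 per GB per year. A minimal sketch of a fee calculation on that basis follows; the function names are hypothetical, and the flat up-front variant with a 5-year horizon is only an assumption for illustration, since the charging model is still undecided.

```python
from decimal import Decimal

# Rate implied by the two data points: $66 / 100 GB = $99 / 150 GB.
RATE_PER_GB_YEAR = Decimal("66") / Decimal("100")  # 0.66 USD per GB per year


def annual_storage_fee(size_gb):
    """Yearly server storage fee in USD for a collection of size_gb."""
    return (Decimal(size_gb) * RATE_PER_GB_YEAR).quantize(Decimal("0.01"))


def upfront_storage_fee(size_gb, years):
    """Flat fee covering a fixed number of years (hypothetical model)."""
    return annual_storage_fee(size_gb) * years


print(annual_storage_fee(100))      # 66.00
print(annual_storage_fee(150))      # 99.00
print(upfront_storage_fee(100, 5))  # 330.00
```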

Part 3:

AILLA’s Administrative Costs

3 Areas that fund AILLA’s Administrative Costs:
1. Institutional Support
2. Grants – Direct Costs
3. Grants – Indirect Costs

A 4th area, the AILLA endowment, which was, and still is, built from monetary donations to AILLA, will cover some costs (to be determined) in the future, but it has not been accessed yet.

Institutional Support covers:
• Manager’s salary & fringe (UTL & COLA)
• Office space & some furniture (UTL)
• Phone service (COLA)
• Electricity (UT)
• Manager’s travel for professional development (UTL & COLA)
• Computer ITS – COLA
• Server ITS – UTL

Direct Grant Costs (currently) cover:
• Manager travel to get collections & to make presentations about them at conferences
• 2 GRAs: salary, fringe & tuition remission
• Depositor/collaborator trips to AILLA
• Shipping
• Some server costs

Direct Grant Costs have covered (past):
All of the above, plus:
• PC and Mac computers and laptops
• Scanners – 2 flat bed, 2 ADF
• Software – digitization and conversion
• Audio equipment (tape cassette decks, MD deck, reel-to-reel players)
• Workshops organized by AILLA (including travel for invited participants)

Indirect Grant Costs cover:
• Computers for administration, digitization and ingestion
• Other computer accessories – sound cards, storage media, printers
• Software – both administrative and for digitization and conversion
• Equipment repair (e.g., cassette decks, reel-to-reel players)
• Office supplies (paper, printer ink, pens, pencils, sticky notes, paper clips, etc.)
• Printing – AILLA brochure, business cards
• Shipping
• Visitor expenses (e.g., lunches, parking)
• Manager’s membership dues
• Administrative cloud storage
• Some office furniture

Operating Budget

Counting direct and indirect costs from AILLA’s grants, our operating budget is about $75,000.

This # does not include the administrative costs that are provided by UT-Austin.

BUT, we have a data backlog of approximately 3 years: we do not have time to process the unsolicited deposits because we are so busy with our “solicited” deposit (our DEL grant to archive Terrence Kaufman’s collection).

Thank you!
www.ailla.utexas.org

Please send comments or questions to ailla@ailla.utexas.org or skung@austin.utexas.edu
