29
1 LingDy February 13, 2012 TUFS, Tokyo David Nathan Endangered Languages Archive Hans Rausing Endangered Languages Project SOAS, University of London Data management

Data management

  • Upload
    saima

  • View
    31

  • Download
    0

Embed Size (px)

DESCRIPTION

Data management. LingDy February 13, 2012 TUFS, Tokyo David Nathan Endangered Languages Archive Hans Rausing Endangered Languages Project SOAS, University of London. Two most valuable strategies. design and use a filename system - PowerPoint PPT Presentation

Citation preview

Page 1: Data management

1

LingDyFebruary 13, 2012

TUFS, Tokyo

David NathanEndangered Languages Archive

Hans Rausing Endangered Languages ProjectSOAS, University of London

Data management

Page 2: Data management

2

Two most valuable strategies

design and use a filename system work out your basic units of documentation

and the relationships between them

- if you get these right, it will do the “heavy lifting” of your data management strategy- data and metadata are intertwined, points in a spectrum rather than different things

Page 3: Data management

3

Three most important qualities

consistency machine readability

“computer programs can act on data in terms of its proper structures and categories”bad example

documentation of conventions, structures, methods

Page 4: Data management

4

Data management

understand and model the data (units, relationships)

use appropriate data structure methods – in both file contents and organisation

use appropriate and conventional data encoding methods (e.g. Unicode)

be explicit and consistent plan for flow of data, working with others,

across different systems document steps, decisions, conventions,

structures think ahead to archiving

Page 5: Data management

5

Managing data in your computer

design a well-organised system of folders so that you can always find your stuff according to what it is, not: where the software decided to put it what the software decided to call it when/where you last used it what someone else called it

Page 6: Data management

6

File structures and names

design folder structure as a logical hierarchy that suits your goals, content and work style have documentary materials within one

overall directory (e.g. for backup) make directories for relevant categories,

e.g. sessions, media types, dates design it so that you will always be able to

find things you may need to restructure at different

points in your project, e.g. move from date-based to session-based structures

Page 7: Data management

7

Designing a file/folder structure

it should relate to reality locations should make sense, so you (and

others) will know where to look for things (where do you keep your passport; favourite cup?)

the best location is “the place that one would naturally look to find it”

Page 8: Data management

8

3 models for file system structures

tree of descriptive folder- and file-names one folder with descriptive filenames one folder with numerical filenames

… what else is needed?

Page 9: Data management

9

On identifiers

real world objects are uniquely identified because they are physically unique - an unlabelled cassette is poorly identified

digital objects have no physical existence - they depend on identifiers that we give them

three types of identifiers: semantic keys relative

Page 10: Data management

10

On identifiers

semantic, e.g. Nelson Mandela The Sound of Music SA_JA_Bongo_Palace_Land Dispute

Trial_015_29-04-2010.wav *

* SA_JA_Bongo_Palace_Land Dispute Trial_015_29-04-2010.wav

Page 11: Data management

11

On identifiers

keys (disambiguators), e.g. 1137204 (a student number) 0803 211 6148 (a telephone number),

p12893fh23.pdf (some system's reference number)

Page 12: Data management

12

On identifiers

relative, e.g. 67 High Street the secretary index.html metadata.xls

Page 13: Data management

13

On identifiers

your collection may have a mix of these but it is important to be aware of their differences and limitations, for example: semantic identifiers: invite name clashes keys: a program or process might depend

on the identifier to work properly relative identifiers: if you move them, you

probably change or destroy their meaning

Page 14: Data management

14

Objects and identities

a digital object’s identity includes its location a file’s full identity = path + filename the path is a representation of the volume

and the directory (folder) hierarchy if the full identity is unambiguous then

everything can be fine, compare: c:\\dogs\spaniels\rover.jpg c:\\cars\british\rover.jpg

or lectures\syntax\2013-02-12\notes.doc

Page 15: Data management

15

Objects and identities

but semantic identifiers are potentially dangerous, because just adding more chunks to disambiguate them will not work: my\rover.jpg my\white_rover.jpg

so domains that do not offer semantic uniqueness may need identifiers which are either keys, or relative identifiers

Page 16: Data management

16

Segue to file names

(having said all that) filenames are only filenames, and do not

necessarily provide information common mistaken assumptions:

that a filename “dp_verbs_39.wav” means there is an entity “dp_verbs_39”

that files are logically linked just by sharing some part of their filenames- these are only true if your system ensures it (and you state it explicitly)

Page 17: Data management

17

File naming

filenames that are unsystematic or are non-standard will cause problems, eventually

unsystematic file naming might be OK if you already have many files you have a working method that already

does everything you need to do your “system” will do everything you need

to do in the future

Page 18: Data management

18

Manage file names from the start

a new file: don’t just accept the default filename or

location suggested by the application when you first save the file

put it where it belongs, immediately. If necessary, create the place (directory/path) where it belongs

name it according to your naming system! if you have an inventory/index of files, add

an entry for the new file

Page 19: Data management

19

Filename rules

all filenames should have correct extensions each filename should have only one ".",

before the extension use only ASCII characters (US keyboard) use only letters, numbers, hyphens (-) and

underscores (_) keep filenames short, just long enough to

contain the necessary identifier - don't fill them up with lots of information about the content (that is metadata!)

Page 20: Data management

20

How about these file names?

1. ready.audio.wav2. ReAlLyOdDtOReAd.txt3. éclair.jpg4. éclair_fr.jpg5. e'clair.jpg6. french-cake.jpeg7. french-cake.jaypeg8. lexicon-master9. ɘɫIɲʰ.eaf10. ice cream.doc11. OBAMA.TXT12. Obama.txt

Page 21: Data management

21

Make filenames sortable

make filenames usefully sortable:

20100119lecture.doc 20100203lecture.doc

gr_transcription_1.txtgr_transcription_2.txtgr_transcription_9.txt gr_transcription_53.txt

gr_transcription_001.txtgr_transcription_002.txtgr_transcription_009.txtgr_transcription_053.txt

Page 22: Data management

22

Associating files

you can make resources sortable together by giving them the same filename root (the part before the extension), or part of the root:

document your conventions and system if you do this

gr_reefs.wavgr_reefs.eafgr_reefs.txt

paaka_photo001.jpgpaaka_photo002.jpgpaaka_txt_conv203.wavpaaka_txt_conv203.eafpaaka_txt_lex.doc

Page 23: Data management

23

Avoid metadata in filenames

avoid putting metadata into filenames. A filename is an identifier, not a data container

better to use a simple (semantic) filename or a key (i.e. meaningless) filename, and then create a metadata table to contain all the relevant information

a table can properly express all the information, contain links etc, and is extensible for further metadata

Page 24: Data management

24

Avoid metadata in filenames

e.g. Paaka_Reefs_Dan_BH_3Oct97.wav better:

paaka_063.wavplus

paaka_063.txt

language topic speaker location date

Paakantyi Reefs at Mutawintyi

Dan Herbert Broken Hill 1997-10-03

paaka_063.txt

Page 25: Data management

25

A filenaming system

carefully design a filename system for your data and document the system so that somebody else can understand it

one documenter’s new system:

aaa_bb_cc_yyyy-mm-dd_nnn.wav

Page 26: Data management

26

A filenaming system

aaa_bb_cc_yyyy-mm-dd_nnn.wavaaa = village codebb = (main) speaker codecc = genre/event codeyyyy-mm-dd = date (why this order?)nnn = optional number (e.g. 001).wav = correct extension for file content type

Page 27: Data management

27

Also (for Part 2)

creating an inventory/index/metadata metadocumentation data/file versions transferring data sharing data backup

Page 28: Data management

28

Documenting the filename system

describe the system- how would you describe it?- where would you put the description?

document the codes – this is probably part of your metadata

Page 29: Data management

29

On changing file names

decide if it’s possible, benefits and side effects (e.g. loss of links in ELAN files)

design a system first don’t change names in situ – copy data set

and gradually migrate it to your new system document file name changes if possible, automate or copy and paste

filenames if possible, use machine processes, e.g.

system filename listings, XLS formulas