Dr. Sven Nahnsen/Dr. Marius Codrea, Quantitative Biology Center (QBiC) Data Management for Quantitative Biology Lecture 5: Database systems (continued) LIMS and E-lab books

Page 1: Data Management for Quantitative Biology - Database Systems (continued) LIMS and E-lab books by Dr. Marius Codrea

Dr. Sven Nahnsen/Dr. Marius Codrea,

Quantitative Biology Center (QBiC)

Data Management for Quantitative Biology

Lecture 5: Database systems (continued)

LIMS and E-lab books

Page 2

Many database designs & concepts

http://dataconomy.com/wp-content/uploads/2014/07/fig2large.jpg

Page 3

Databases

DB = "A database is an organized collection of data" http://en.wikipedia.org/wiki/Database

DB = DB + data model for the application at hand (business logic) + implementation

DB = DB + database management system (DBMS). Software that enables:

CRUD

• Create entries

• Read (retrieve)

• Update / edit

• Delete

DB = DB + Administration (User privileges, monitoring)
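The four CRUD operations can be tried out without installing a database server. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in DBMS and a hypothetical two-row mice table; the table layout and values are illustrative and not taken from the lecture.

```python
import sqlite3

# SQLite (Python standard library) stands in for a full DBMS here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mice ("
    " Mouse_number INTEGER PRIMARY KEY,"
    " Gender TEXT NOT NULL,"
    " Treatment TEXT NOT NULL)"
)

# Create entries
conn.execute("INSERT INTO mice VALUES (1, 'Male', 'Vitamin A')")
conn.execute("INSERT INTO mice VALUES (2, 'Female', 'Vitamin B')")

# Read (retrieve)
rows_after_create = conn.execute(
    "SELECT * FROM mice ORDER BY Mouse_number"
).fetchall()

# Update / edit
conn.execute("UPDATE mice SET Treatment = 'Vitamin C' WHERE Mouse_number = 2")
treatment_after_update = conn.execute(
    "SELECT Treatment FROM mice WHERE Mouse_number = 2"
).fetchone()[0]

# Delete
conn.execute("DELETE FROM mice WHERE Mouse_number = 1")
remaining = conn.execute("SELECT COUNT(*) FROM mice").fetchone()[0]
```

The same INSERT, SELECT, UPDATE and DELETE statements should carry over to MySQL largely unchanged; only the connection setup differs.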

Page 4

Selected database systems

I. Relational databases

MySQL

II. NoSQL databases

MongoDB

Page 5

Specific characteristics MongoDB vs MySQL

More details here: http://db-engines.com/en/system/MongoDB%3BMySQL

System property        MongoDB              MySQL
Initial release        2009                 1995
Current release        3.0.2, April 2015    5.6.24, April 2015
Triggers               No                   Yes
MapReduce              Yes                  No
Foreign keys           No                   Yes
Transaction concepts   No                   ACID*

* A database transaction must be Atomic, Consistent, Isolated and Durable.

Page 6

Terminology - Relational databases

[Figure: Mice table and Samples table. Labels: Fields, Record 1, Record 6, Primary key (in each table), Foreign key (Ref Mice.Mouse_number)]

● The values of the primary key uniquely identify the rows of the table

● The foreign key links each row of the host table to exactly 1 record in the referenced table

Page 7

Relational databases (Normalization)

[Figure: Mice table and Samples table]

Samples are RELATED to mice

1:N one-to-many relationship

Page 8

Foreign Keys

CREATE TABLE samples (
  Sample_ID SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  Mouse_number SMALLINT UNSIGNED NOT NULL,
  Timepoint VARCHAR(15) NOT NULL,
  PRIMARY KEY (Sample_ID),
  FOREIGN KEY (Mouse_number)
    REFERENCES mice(Mouse_number)
    ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
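What the foreign key buys in practice is that the engine refuses samples that point to a mouse that does not exist. The sketch below illustrates this under stated assumptions: SQLite (via Python's sqlite3 module) replaces MySQL/InnoDB, the column types are simplified accordingly, and the mice columns and inserted values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE mice (
    Mouse_number INTEGER PRIMARY KEY,
    Gender TEXT NOT NULL,
    Treatment TEXT NOT NULL
);
CREATE TABLE samples (
    Sample_ID INTEGER PRIMARY KEY AUTOINCREMENT,
    Mouse_number INTEGER NOT NULL,
    Timepoint TEXT NOT NULL,
    FOREIGN KEY (Mouse_number) REFERENCES mice(Mouse_number) ON DELETE CASCADE
);
""")

conn.execute("INSERT INTO mice VALUES (1, 'Male', 'Vitamin A')")
conn.execute("INSERT INTO samples (Mouse_number, Timepoint) VALUES (1, '5h')")

# Mouse 99 does not exist, so the engine rejects this sample.
try:
    conn.execute("INSERT INTO samples (Mouse_number, Timepoint) VALUES (99, '5h')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

sample_count = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
```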

Page 9

Queries

SELECT Gender, Treatment, COUNT(*) AS count_per_gender FROM mice GROUP BY Gender, Treatment;

“How many males and how many females per treatment?”
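A grouped count of this kind can be reproduced on a toy table. This is a sketch only, assuming SQLite through Python's sqlite3 module instead of MySQL and six hypothetical mice; it selects just the grouping columns plus the count, which keeps the result well defined on any SQL engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mice ("
    " Mouse_number INTEGER PRIMARY KEY, Gender TEXT, Treatment TEXT)"
)
# Hypothetical example data, not the lecture's actual table.
conn.executemany(
    "INSERT INTO mice VALUES (?, ?, ?)",
    [
        (1, "Male", "Vitamin A"),
        (2, "Female", "Vitamin A"),
        (3, "Male", "Vitamin A"),
        (4, "Male", "Vitamin B"),
        (5, "Female", "Vitamin B"),
        (6, "Female", "Vitamin B"),
    ],
)

# One output row per (Gender, Treatment) combination.
counts = conn.execute(
    "SELECT Gender, Treatment, COUNT(*) AS count_per_gender "
    "FROM mice GROUP BY Gender, Treatment "
    "ORDER BY Gender, Treatment"
).fetchall()
```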

Page 10

JOIN queries

SELECT Sample_ID, Treatment, Timepoint, mice.Mouse_number
FROM samples JOIN mice ON samples.Mouse_number = mice.Mouse_number
WHERE mice.Mouse_number = 2;

“What samples do I have from mouse number 2?”
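The join can be run end to end on a small example. The sketch assumes SQLite (Python's sqlite3 module) in place of MySQL, and the rows in both tables are hypothetical; only the table and column names follow the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mice (
    Mouse_number INTEGER PRIMARY KEY, Gender TEXT, Treatment TEXT
);
CREATE TABLE samples (
    Sample_ID INTEGER PRIMARY KEY,
    Mouse_number INTEGER NOT NULL REFERENCES mice(Mouse_number),
    Timepoint TEXT NOT NULL
);
INSERT INTO mice VALUES (1, 'Male', 'Vitamin A'), (2, 'Female', 'Vitamin A');
INSERT INTO samples VALUES (1, 1, '5h'), (2, 2, '5h'), (3, 2, '24h');
""")

# Treatment lives in mice, Timepoint in samples: the join brings them together.
samples_of_mouse_2 = conn.execute("""
    SELECT Sample_ID, Treatment, Timepoint, mice.Mouse_number
    FROM samples JOIN mice ON samples.Mouse_number = mice.Mouse_number
    WHERE mice.Mouse_number = 2
    ORDER BY Sample_ID
""").fetchall()
```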

Page 11

Relational „facts“

1. Rigid schema (once the structure is defined, it may be difficult to adjust)

2. Normalization introduces/requires additional tables, joins, indices and it scatters data

3. Each field in each record has a single value of a pre-defined type

Page 12

Relational „facts“ 1

Rigid schema (once the structure is defined, it may be difficult to adjust)

[Figure: Mice table and Samples table]

1:N one-to-many relationship

Generalization to other Projects/Experiments in the lab?

Page 13

Relational „facts“ 1

[Figure: Organisms table, Mice table and Samples table]

BROKEN 1:N one-to-many relationship

Page 14

Deleted relationship

CREATE TABLE samples (
  Sample_ID SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  Mouse_number SMALLINT UNSIGNED NOT NULL,
  Timepoint VARCHAR(15) NOT NULL,
  PRIMARY KEY (Sample_ID),
  FOREIGN KEY (Mouse_number)
    REFERENCES mice(Mouse_number)
    ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Page 15

Relational „facts“ 1

[Figure: Organisms table and Projects table]

Page 16

Relational „facts“ 1

[Figure: Projects table, Users table and Projects_Users table]

Many users can be involved in many projects. With many roles?

Page 17

Relational „facts“ 2

Normalization introduces/requires additional tables, joins, indices and scatters data

[Figure: Projects table, Users table and Projects_Users table]

CREATE INDEX usr ON Projects_Users (User_ID);
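A many-to-many relationship with roles can be sketched with a join table that holds one row per (project, user, role) combination. The example below is an illustration under assumptions: SQLite (Python's sqlite3 module) replaces MySQL, and the Name and Role columns as well as all inserted rows are hypothetical; only Project_ID, User_ID and the index name usr come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (Project_ID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE users (User_ID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
-- Join table: one row per (project, user, role) combination.
CREATE TABLE projects_users (
    Project_ID INTEGER NOT NULL REFERENCES projects(Project_ID),
    User_ID INTEGER NOT NULL REFERENCES users(User_ID),
    Role TEXT NOT NULL,
    PRIMARY KEY (Project_ID, User_ID, Role)
);
-- Index to speed up "which projects is this user involved in?"
CREATE INDEX usr ON projects_users (User_ID);

INSERT INTO projects VALUES (1, 'Vitamin study'), (2, 'Vaccine study');
INSERT INTO users VALUES (1, 'Meyer'), (2, 'Schmidt');
INSERT INTO projects_users VALUES
    (1, 1, 'PI'), (1, 2, 'Technician'), (2, 1, 'Analyst'), (2, 1, 'PI');
""")

# Two joins' worth of structure just to answer one question about user 1.
projects_of_user_1 = conn.execute("""
    SELECT projects.Name, projects_users.Role
    FROM projects_users
    JOIN projects ON projects.Project_ID = projects_users.Project_ID
    WHERE projects_users.User_ID = 1
    ORDER BY projects.Name, projects_users.Role
""").fetchall()

index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND name = 'usr'"
    )
]
```

Putting Role into the primary key is one possible design choice; it lets the same user hold several roles in one project.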

Page 18

Relational „facts“ 3

Each field in each record has a single value of a pre-defined type

Primary Key Field 1 Field 2 Field 3

A 2-D map (tuples)

Page 19

Relational „facts“ 3

A single value ?!?

What if a person has 2 affiliations and thus 2 addresses, 2 phone numbers, etc?

Normalization? Again?

Page 20

NoSQL

DB = "A database is an organized collection of data" http://en.wikipedia.org/wiki/Database

● Can we allow for “some” heterogeneity of the data?

● Can the records be highly similar but not necessarily identical? (e.g., most of the users having just 1 phone number but others more?)

Page 21

MongoDB is a document-oriented DB

{ Mouse_number: "1", Gender: "Male", Age: 3, Treatment: "Vitamin A" }

Field:value pairs

Document ~ Record

http://www.mongodb.org/

Page 22

MongoDB Documents

{ Mouse_number: "1", Gender: "Male", Age: 3, Treatment: "Vitamin A" }

Field:value pairs

Documents are stored as BSON (binary JSON)

Closely resemble structures in programming languages (key-value association)

Each field can be

● NULL

● Single value (integer, string, etc)

● An array of many values

● Other embedded documents

● A reference to another document

Page 23

MongoDB Collections

Documents are stored in Collections

Collection ~ Table

{ Mouse_number: "6", Gender: "Female", Age: 2, Treatment: "Vitamin B" }

Page 24

Different representation – The challenge remains the same:

Model the relationships between data

[Figure: relationship diagram connecting Organisms, Projects, Users, Affiliations and Samples; the edges carry N and 1 cardinality labels]

Page 25

Design & operational mechanisms

MySQL

● Primary Key

● Foreign Key

● Join Tables

MongoDB

● Unique ID

● References

● Embedding

Page 26

MongoDB – Field types

Users document

{
  _id: <ObjectID1>,
  Username: { first_name: "Hans", last_name: "Meyer" },
  Gender: "Male",
  Age: 30,
  Phones: ["+490777", "+350777"],
  Affiliations_id: <UUID_affiliation>
}

● Unique ID (_id)

● embedded document (Username)

● array (Phones)

● reference (Affiliations_id)

Page 27

MongoDB – Field types

● Unique ID _id: <ObjectID1>

Acts as a primary key

ObjectId is a 12-byte BSON type, constructed using:

● a 4-byte value representing the seconds since the Unix epoch,

● a 3-byte machine identifier,

● a 2-byte process id, and

● a 3-byte counter, starting with a random value.

http://docs.mongodb.org/manual/reference/object-id/

ObjectId("507f1f77bcf86cd799439011")
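The byte layout listed on this slide can be checked by taking the example ObjectId apart. The helper below is a plain-Python sketch, not part of any MongoDB driver; the function name split_object_id is hypothetical, and the split follows the 4 + 3 + 2 + 3 byte layout described on the slide.

```python
from datetime import datetime, timezone


def split_object_id(hex_string):
    """Split a 24-character hex ObjectId into the four parts named on the slide."""
    raw = bytes.fromhex(hex_string)
    if len(raw) != 12:
        raise ValueError("an ObjectId has exactly 12 bytes")
    return {
        # Seconds since the Unix epoch, stored most significant byte first.
        "timestamp": int.from_bytes(raw[0:4], "big"),
        "machine": raw[4:7].hex(),
        "process_id": raw[7:9].hex(),
        "counter": raw[9:12].hex(),
    }


parts = split_object_id("507f1f77bcf86cd799439011")
created = datetime.fromtimestamp(parts["timestamp"], tz=timezone.utc)
```

Because the timestamp comes first, sorting by _id roughly sorts documents by creation time.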

Page 28

MongoDB – Field types

● array Phones: ["+490777", "+350777"]

● Upon indexing, each value in the array is in the index

● Query for ANY matching value
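The two array properties above (one index entry per element, match on ANY element) can be mimicked in a few lines. This is a sketch with plain Python dicts standing in for documents; it does not use MongoDB, and the second user and the names find_by_phone and phone_index are hypothetical.

```python
from collections import defaultdict

users = [
    {"_id": 1, "Username": {"first_name": "Hans", "last_name": "Meyer"},
     "Phones": ["+490777", "+350777"]},
    # Hypothetical second user with a single phone number.
    {"_id": 2, "Username": {"first_name": "Anna", "last_name": "Schmidt"},
     "Phones": ["+491234"]},
]


def find_by_phone(docs, phone):
    """Match a document if ANY element of its Phones array equals phone."""
    return [doc for doc in docs if phone in doc.get("Phones", [])]


# One index entry per array element, each pointing back to its document.
phone_index = defaultdict(list)
for doc in users:
    for phone in doc.get("Phones", []):
        phone_index[phone].append(doc["_id"])
```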

Page 29

MongoDB – Field types

{ _id: <ObjectID1>, Username: { first_name: "Hans", last_name: "Meyer" }, Gender: "Male" }

● embedded document

● Pre-joined data?

● Can be indexed

● Query at any level on any field

Page 30

MongoDB – Field types

● reference

Users document

{
  _id: <ObjectID1>,
  Username: { first_name: "Hans", last_name: "Meyer" },
  Gender: "Male",
  Age: 30,
  Phones: ["+490777", "+350777"],
  Affiliations_id: <UUID_affiliation>
}

Affiliations document

{ _id: <UUID_affiliation>, Name: "My lab", Address: "Tübingen" }

Page 31

Where is the catch?

● "In MongoDB, write operations are atomic at the document level, and no single write operation can atomically affect more than one document or more than one collection."

● OK, then references (normalized model) are not really Foreign Keys that the DB engine resolves. "Client-side applications must issue follow-up queries to resolve the references".

● “A denormalized data model with embedded data combines all related data for a represented entity in a single document. This facilitates atomic write operations since a single write operation can insert or update the data for an entity.”

● OK, denormalize. But the maximum document size is 16 MB.

http://docs.mongodb.org/manual/
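The "follow-up query" needed to resolve a reference can be sketched as two separate lookups issued by the client. The example uses plain Python structures as stand-ins for the two collections and does not talk to MongoDB; the ids "user-1" and "aff-1" and both function names are hypothetical.

```python
# Stand-ins for the Affiliations and Users collections.
affiliations = {
    "aff-1": {"_id": "aff-1", "Name": "My lab", "Address": "Tübingen"},
}
users = [
    {
        "_id": "user-1",
        "Username": {"first_name": "Hans", "last_name": "Meyer"},
        "Affiliations_id": "aff-1",  # reference, stored as a plain value
    },
]


def find_user(last_name):
    """First query: fetch the user document."""
    for user in users:
        if user["Username"]["last_name"] == last_name:
            return user
    return None


def resolve_affiliation(user):
    """Follow-up query: the client, not the engine, resolves the reference."""
    return affiliations.get(user["Affiliations_id"])


user = find_user("Meyer")
affiliation = resolve_affiliation(user)
```

Nothing stops the referenced affiliation from being deleted between the two lookups, which is the consistency gap a relational foreign key closes.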

Page 32

Foreign key „ON DELETE CASCADE“

“Mouse number 3 went wrong. Let's just delete it.”

SELECT * from samples;

DELETE from mice where Mouse_number = 3;

SELECT * from samples;

Where have these two samples gone?
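The disappearing samples can be reproduced in a few lines. The sketch assumes SQLite (Python's sqlite3 module) instead of MySQL/InnoDB and uses hypothetical rows: three mice, with two of the three samples belonging to mouse 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys (and cascades) with this pragma on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE mice (Mouse_number INTEGER PRIMARY KEY, Gender TEXT NOT NULL);
CREATE TABLE samples (
    Sample_ID INTEGER PRIMARY KEY,
    Mouse_number INTEGER NOT NULL,
    Timepoint TEXT NOT NULL,
    FOREIGN KEY (Mouse_number) REFERENCES mice(Mouse_number) ON DELETE CASCADE
);
INSERT INTO mice VALUES (1, 'Male'), (2, 'Female'), (3, 'Male');
INSERT INTO samples VALUES (1, 1, '5h'), (2, 3, '5h'), (3, 3, '24h');
""")


def sample_ids():
    """Return all sample ids currently in the samples table."""
    query = "SELECT Sample_ID FROM samples ORDER BY Sample_ID"
    return [row[0] for row in conn.execute(query)]


before = sample_ids()
conn.execute("DELETE FROM mice WHERE Mouse_number = 3")
after = sample_ids()  # samples 2 and 3 were deleted along with mouse 3
```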

Page 33

The key challenge

Find the structure of the documents (references and embedded documents) that best fits

● the requirements of the application (queries, updates), i.e. the data usage

● the performance of the database engine

Page 34

Model the relationships between data

[Figure: relationship diagram connecting Organisms, Projects, Users, Affiliations and Samples; the edges carry N and 1 cardinality labels]

Page 35

Model the relationships between data 1:N

[Figure: Organisms (1) linked to Samples (N)]

Organisms document with Sample_ids: [ ]

Samples document with Organism_id:

OR?

● Organism embeds multiple samples

Depends on the most frequent question?

● What samples do I have from Organism X?

● Where did Sample Y come from?

● How many samples? Reach the 16 MB limit?
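The three ways of modeling this 1:N relationship can be put side by side. The sketch uses plain Python dicts as stand-ins for documents (no MongoDB involved); the ids and timepoints match the example organism used later in the slides, and the helper names are hypothetical.

```python
# Option A: the organism holds an array of sample ids.
organism_a = {"_id": 4, "Species": "human", "Sample_ids": [10, 11]}
samples_a = [{"_id": 10, "Timepoint": "5h"}, {"_id": 11, "Timepoint": "24h"}]

# Option B: each sample holds the id of its organism.
organisms_b = [{"_id": 4, "Species": "human"}]
samples_b = [
    {"_id": 10, "Timepoint": "5h", "Organism_id": 4},
    {"_id": 11, "Timepoint": "24h", "Organism_id": 4},
]

# Option C: the organism embeds its samples (watch the document size limit).
organism_c = {
    "_id": 4,
    "Species": "human",
    "Samples": [{"_id": 10, "Timepoint": "5h"}, {"_id": 11, "Timepoint": "24h"}],
}


def samples_of_organism_a(organism, samples):
    """'What samples do I have from Organism X?' under option A."""
    wanted = set(organism["Sample_ids"])
    return [s for s in samples if s["_id"] in wanted]


def organism_of_sample_b(sample, organisms):
    """'Where did Sample Y come from?' under option B."""
    for organism in organisms:
        if organism["_id"] == sample["Organism_id"]:
            return organism
    return None


timepoints_a = [s["Timepoint"] for s in samples_of_organism_a(organism_a, samples_a)]
origin_b = organism_of_sample_b(samples_b[1], organisms_b)
timepoints_c = [s["Timepoint"] for s in organism_c["Samples"]]  # no second lookup
```

Options A and B each make one of the two questions cheap and the other a second lookup; option C answers the first question from a single document.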

Page 36

Relational „facts“ 3

Each field in each record has a single value of a pre-defined type

Primary Key Field 1 Field 2 Field 3

A 2-D map (tuples)

Page 37

MongoDB

Nested documents

_id Field 1 Field 2 Field 3

Page 38

Queries

Organisms

{ _id: 4, Project_ID: 2, Species: "human", Gender: "", Age: 30, Treatment: "Vaccine A" }

db.organisms.insert(
  { Project_ID: 2, Species: "human", Gender: "", Age: 30, Treatment: "Vaccine A" }
)

Page 39

Queries

Organisms

{ _id: 4, Project_ID: 2, Species: "human", Gender: "", Age: 30, Treatment: "Vaccine A" }

db.organisms.find( { Project_ID: { $eq: 2 } } )

SELECT * FROM organisms WHERE Project_ID = 2;

Page 40

Queries

Organisms

{ _id: 4, Project_ID: 2, Species: "human", Gender: "", Age: 30, Treatment: "Vaccine A" }

db.organisms.find( { $and: [ { Species: /^h/ }, { Age: { $gt: 20 } } ] } )

SELECT * FROM organisms WHERE Species LIKE 'h%' AND Age > 20;

Page 41

Schema flexibility

Organisms

{ _id: 4, Project_ID: 2, Species: "human", Gender: "", Age: 30, Treatment: "Vaccine A" }

{ _id: 14, Project_ID: 5, Species: "human", Gender: "Female", Age: 10, Genetic_background: "WT" }

Data IS the schema!

Page 42

Queries

Organisms

{ _id: 4, Project_ID: 2, Species: "human", Gender: "", Age: 30, Treatment: "Vaccine A" }

db.organisms.find( { Genetic_background: { $exists: true } } )

SELECT ???
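Asking whether a field exists only makes sense when records may differ in their fields. The sketch below mimics that with plain Python dicts standing in for the two Organisms documents from the schema flexibility slide; it does not use MongoDB, and the helper name find_with_field is hypothetical.

```python
organisms = [
    {"_id": 4, "Project_ID": 2, "Species": "human", "Gender": "",
     "Age": 30, "Treatment": "Vaccine A"},
    {"_id": 14, "Project_ID": 5, "Species": "human", "Gender": "Female",
     "Age": 10, "Genetic_background": "WT"},
]


def find_with_field(docs, field):
    """Return the documents that contain the field at all, whatever its value."""
    return [doc for doc in docs if field in doc]


with_background = [d["_id"] for d in find_with_field(organisms, "Genetic_background")]
with_treatment = [d["_id"] for d in find_with_field(organisms, "Treatment")]
```

In a relational table every row has every column, so the closest SQL counterpart would test for NULL values rather than for the presence of a column.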

Page 43

Model the relationships between data 1:N

[Figure: Organisms (1) linked to Samples (N)]

Organisms document with Sample_ids: [ ]

Samples document with Organism_id:

OR?

● Organism embeds multiple samples

Page 44

MongoDB

Nested documents

_id Field 1 Field 2 Field 3

Page 45

Queries

Organisms

{
  _id: 4, Project_ID: 2, Species: "human", Gender: "", Age: 30,
  Samples: [ { _id: 10, Timepoint: "5h" }, { _id: 11, Timepoint: "24h" } ],
  Treatment: "Vaccine A"
}

db.organisms.find( { _id: 4, 'Samples._id': 11 } )

db.organisms.find( { _id: 4, 'Samples.Timepoint': '5h' } )
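How a dotted field name reaches into embedded samples can be illustrated with a small matcher. This is a plain-Python sketch covering only equality on one level of nesting, not MongoDB's actual query engine; the functions matches and find are hypothetical.

```python
organisms = [
    {
        "_id": 4, "Project_ID": 2, "Species": "human", "Gender": "", "Age": 30,
        "Samples": [{"_id": 10, "Timepoint": "5h"}, {"_id": 11, "Timepoint": "24h"}],
        "Treatment": "Vaccine A",
    },
]


def matches(doc, dotted_field, value):
    """Equality test; 'Samples._id' matches if ANY embedded sample has that _id."""
    top, _, sub = dotted_field.partition(".")
    if not sub:
        return doc.get(top) == value
    return any(item.get(sub) == value for item in doc.get(top, []))


def find(docs, query):
    """Return the documents that satisfy every field: value pair in query."""
    return [d for d in docs if all(matches(d, f, v) for f, v in query.items())]


by_sample_id = find(organisms, {"_id": 4, "Samples._id": 11})
by_timepoint = find(organisms, {"_id": 4, "Samples.Timepoint": "5h"})
# Types matter: the string "4" is not the number 4, so nothing matches.
by_string_id = find(organisms, {"_id": "4", "Samples._id": 11})
```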

Page 46

Summary

● Database design requires technical and substantial domain-specific knowledge

● Normalization

● Indices

MySQL

● Primary Key

● Foreign Key

● Join Tables

MongoDB

● Unique ID

● References

● Embedding

Hint: http://en.wikipedia.org/wiki/Category:Web_application_frameworks

Page 47

Laboratory information management system (LIMS)

[Figure: relationship diagram connecting Organisms, Projects, Users, Affiliations and Samples; the edges carry N and 1 cardinality labels]

An underlying data structure of a simple LIMS design

Page 48

LIMS definition

http://en.wikipedia.org/wiki/Laboratory_information_management_system

„A Laboratory Information Management System (LIMS), sometimes referred to as a Laboratory Information System (LIS) or Laboratory Management System (LMS), is a software-based laboratory and information management system that offers a set of key features that support a modern laboratory's operations.“

Page 49

LIMS properties and functionality

http://en.wikipedia.org/wiki/Laboratory_information_management_system

● Metadata of any sample entering the laboratory

● Tracking of processes throughout sample treatment and preparation; scheduling of the sample and the associated analytical workload

● Quality control associated with the sample and the utilized equipment and inventory

● Inspection, approval, and compilation of the sample data for reporting and/or further analysis

Page 50

Advantages of LIMS

● Fewer transcription errors

● Faster sample processing

● Real-time control of data and metadata

● Reproducibility of experimental processes

● Direct electronic reporting to clients

● Despite many advantages,...

Page 51

Disadvantages of LIMS

● Customization of LIMS

● Interface is required

● Adequate validation to ensure data quality

Page 52

With a good LIMS in place, we can consider Electronic Laboratory Notebooks

Page 53

Electronic laboratory notebooks (ELN)

http://en.wikipedia.org/wiki/Electronic_lab_notebook

An electronic lab notebook (also known as electronic laboratory notebook, or ELN) is a computer program designed to replace paper laboratory notebooks. Lab notebooks in general are used by scientists, engineers and technicians to document research, experiments and procedures performed in a laboratory. A lab notebook is often maintained to be a legal document and may be used in a court of law as evidence.

Page 54

Prominent use-case: review process

http://rushthecourt.net/mag/wp-content/uploads/2010/09/Three-Ring-Binders.jpg

● You submit a paper

● A review process of several months is not unlikely

● Reviewers ask for a more detailed description of the experiments you did two years back

Page 55

Traditional Paper Lab Books

Page 56

ELN, a survey

Journal of Laboratory Automation 18(3) 229–234, 2012 Society for Laboratory Automation and Screening

DOI: 10.1177/2211068212471834

Page 57

Examples of ELN software

Page 58

Practical issues

● Lab technicians “have only two hands”

● Labs are often not equipped with desktop PCs

● Data security of ELNs poses challenges

● Scientists are classically reluctant adopters

● There is activation energy required to change work habits

● In academic science there is no formal obligation

● Establishment requires stringent modeling (see previous slides on databases) or significant investment in existing tools

Page 59

Mobile application of ELNs

Nature Methods 8, 541–543 (2011) doi:10.1038/nmeth.1631

● Handwriting capture technology

● All functionality as on paper

● Sketch and manipulate equations

● Draw figures

● All notes can be linked, reordered, archived, edited, tagged, annotated and bundled in virtual 'notebooks' representing different projects

Page 60

Easy solutions

Page 61

Evernote as lab notebook

Journal of Laboratory Automation 18(3) 229–234, 2012 Society for Laboratory Automation and Screening

DOI: 10.1177/2211068212471834