Data Management for Quantitative Biology - Database Systems (continued) LIMS and E-lab books by Dr....

Preview:

Citation preview

Dr. Sven Nahnsen/Dr. Marius Codrea,

Quantitative Biology Center (QBiC)

Data Management for Quantitative Biology

Lecture 5: Database systems (continued)

LIMS and E-lab books

Many database design & concepts

http://dataconomy.com/wp-content/uploads/2014/07/fig2large.jpg2

Databases

DB = "A database is an organized collection of data" http://en.wikipedia.org/wiki/Database

DB = DB + data model for the application at hand (business logic) + implementation

DB = DB + database management system (DBMS). Software than enables:

3

CRUD

• Create entries

• Read (retrieve)

• Update / edit

• Delete

DB = DB + Administration (User privilages, monitoring)

Selected database systems

I. Relational databases

MySQL

II.NoSQL databases

MongoDB

Specific characteristics MongoDB vs MySQL

More details here: http://db-engines.com/en/system/MongoDB%3BMySQL

System Property MongoDB MySQL

Initial release 2009 1995

Current release 3.0.2, April 2015 5.6.24, April 2015

Triggers No Yes

MapReduce Yes No

Foreign keys No Yes

Transaction concepts No ACID*

*A database transaction, must be Atomic, Consistent, Isolated and Durable.

Fields

Record 1

Record 6

Primary keyPrimary

key

Foreign KeyRef

Mice.Mouse_number

● The values of the primary keys uniquely identifies the rows of the table● The foreign key uniquely links the rows of the host table to 1 record in the referencing table

Mice table

Samples table

Terminology - Relational databases

Mice tableSamples table

Samples are RELATED to mice

1:N one-to-many relationship

Relational databases (Normalization)

Foreign Keys

CREATE TABLE samples (  Sample_ID SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,  Mouse_number SMALLINT UNSIGNED NOT NULL,  Timepoint VARCHAR(15) NOT NULL,  PRIMARY KEY (Sample_ID),  FOREIGN KEY (Mouse_number)         REFERENCES mice(Mouse_number)        ON DELETE CASCADE )ENGINE=InnoDB DEFAULT CHARSET=utf8;

Queries

SELECT * , COUNT(*) as count_per_gender from mice group by Gender, Treatment;

“How many males and how many females per treatment?”

JOIN queries

SELECT Sample_ID, Treatment, Timepoint, mice.Mouse_number from samples join mice 

on samples.Mouse_number = mice.Mouse_number where mice.Mouse_number=2; 

“What samples do I have from mouse number 2?”

Relational „facts“

1.Rigid schema (once the structure is defined, it may be difficult to adjust)

2.Normalization introduces/requires additional tables, joins, indices and it scatters data

3.Each field in each record has a single value of a pre-defined type

Mice tableSamples table

1:M one-to-many relationship

Relational „facts“ 1

Generalization to other Projects/Experiments in the lab?

Rigid schema (once the structure is defined, it may be difficult to adjust)

Organisms table

Samples tableBROKEN 1:N one-to-many relationship

Relational „facts“ 1

Mice table

Deleted relationship

CREATE TABLE samples (  Sample_ID SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,  Mouse_number SMALLINT UNSIGNED NOT NULL,  Timepoint VARCHAR(15) NOT NULL,  PRIMARY KEY (Sample_ID),  FOREIGN KEY (Mouse_number)         REFERENCES mice(Mouse_number)        ON DELETE CASCADE )ENGINE=InnoDB DEFAULT CHARSET=utf8;

Organisms table

Projects table

Relational „facts“ 1

Projects table

Relational „facts“ 1 Users table

Many users can be involved in many projects. With many roles?

Projects_Users table

Projects table

Relational „facts“ 2

Users table

Normalization introduces/requires additional tables, joins, indices and scatters data

Projects_Users table

CREATE INDEX usr on Project_Users (User_ID);

Relational „facts“ 3

Each field in each record has a single value of a pre-defined type

Primary Key Field 1 Field 2 Field 3

A 2-D map (tuples)

Relational „facts“ 3

A single value ?!?

What if a person has 2 affiliations and thus 2 addresses, 2 phone numbers, etc?

Normalization? Again?

NoSQL

DB = "A database is an organized collection of data" http://en.wikipedia.org/wiki/Database

● Can we allow for “some” heterogeneity of the data?

● Can the records be highly similar but not necessarily identical? (e.g., most of the users having just 1 phone number but others more?)

MongoDB is a document-oriented DB

{ Mouse_number: “1”, Gender: “Male”, Age: 3, Treatment: “Vitamin A” }

Field:value pairs

Document ~ Record

http://www.mongodb.org/

MongoDB Documents

{ Mouse_number: “1”, Gender: “Male”, Age: 3, Treatment: “Vitamin A” }

Field:value pairs

Documents are BSON files (binary JSON)

Closely resemble structures in programming languages (key-value association)

Each field can be

● NULL● Single value (integer, string, etc)● An array of many values● Other embedded documents● A reference to another document

MongoDB Collections

Documents are stored in Collections

Collection ~ Table

{ Mouse_number: “6”, Gender: “Female”, Age: 2, Treatment: “Vitamin B” , }

Different representation – The challenge remains the same:

Model the relationships between data

Organisms

Projects

Users

AffiliationsSamples

N

N

N

N

N

N

11

Design & operational mechanisms

MySQL

● Primary Key

● Foreign Key

● Join Tables

MongoDB

● Unique ID

● References

● Embedding

MongoDB – Field types

{ _id: <ObjectID1> Username: { first_name: “Hans”, last_name: “Meyer” }, Gender: “Male”, Age: 30, Phones: [“+490777”, “+350777”],

Affiliations_id: <UUID_affiliation>}

Users document

● array

● embedded document

● reference

● Unique ID

MongoDB – Field types

● Unique ID _id: <ObjectID1>

Acts as a primary key

ObjectId is a 12-byte BSON type, constructed using:

● a 4-byte value representing the seconds since the Unix time,● a 3-byte machine identifier,● a 2-byte process id, and● a 3-byte counter, starting with a random value.

http://docs.mongodb.org/manual/reference/object-id/

ObjectId("507f1f77bcf86cd799439011")

MongoDB – Field types

● array Phones: [“+490777”, “+350777”]

● Upon indexing, each value in the array is in the index

● Query for ANY matching value

MongoDB – Field types

{ _id: <ObjectID1> Username: { first_name: “Hans”, last_name: “Meyer” },

Gender: “Male”,}

● embedded document

● Pre-joined data?

● Can be indexed

● Query at any level on any field

MongoDB – Field types

{ _id: <ObjectID1> Username: { first_name: “Hans”, last_name: “Meyer” }, Gender: “Male”, Age: 30, Phones: [“+490777”, “+350777”],

Affiliations_id: <UUID_affiliation>}

Users document

● reference

Affiliations document

{ _id: <UUID_affiliation> Name: “My lab”, Address: “Tübingen”}

Where is the catch?

● "In MongoDB, write operations are atomic at the document level, and no single write operation can atomically affect more than one document or more than one collection."

● OK, then references (normalized model) are not really Foreign Keys that the DB engine resolves. "Client-side applications must issue follow-up queries to resolve the references".(see next slide)

● “A denormalized data model with embedded data combines all related data for a represented entity in a single document. This facilitates atomic write operations since a single write operation can insert or update the data for an entity.”

● OK, denormalize. Maximum default document size is 16MB.

http://docs.mongodb.org/manual/

Foreign key „ON DELETE CASCADE“

“Mouse number 3 went wrong. Let's just delete it.”

SELECT * from samples;

DELETE from mice where Mouse_number = 3;

SELECT * from samples;

Where are these two samples gone?

The key challenge

Find the right structure of the documents (references and embedded documents) that best fit

● the requirements of the application (queries, updates) -data usage

● the performance of the database engine

Model the relationships between data

Organisms

Projects

Users

AffiliationsSamples

N

N

N

N

N

N

11

Model the relationships between data 1:N

Organisms

Samples

N

1

OrganismsSample_ids: [ ]

SamplesOrganism_id:

OR?

Depends on the most frequent question?

● What samples do I have from Organism X ?● Where Sample Y came from?

● How many samples? Reach the 16MB limit?

● Organism embeds multiple samples

Relational „facts“ 3

Each field in each record has a single value of a pre-defined type

Primary Key Field 1 Field 2 Field 3

A 2-D map (tuples)

MongoDB

Nested documents

_id Field 1 Field 2 Field 3

Queries

{ _id: 4, Project_ID: 2, Species: “human”, Gender :””, Age: 30, Treatment:”Vaccine A”}

Organisms

db.organisms.insert(  {   Project_ID: 2,  Species: “human”,  Gender :””,  Age: 30,  Treatment: ”Vaccine A”}  

Queries

{ _id: 4, Project_ID: 2, Species: “human”, Gender :””, Age: 30, Treatment:”Vaccine A”}

Organisms

db.organisms.find(  { Project_ID: { $eq : 2} })  

SELECT * from organismsWHERE Project_ID = 2;

Queries

{ _id: 4, Project_ID: 2, Species: “human”, Gender :””, Age: 30, Treatment:”Vaccine A”}

Organisms db.organisms.find(  { $and: [{Species: /h.*/}, {Age: {$gt: 20 }}]})

 SELECT * from organisms WHERE Species like 'h%' AND Age > 20;

Schema flexibility

{ _id: 4, Project_ID: 2, Species: “human”, Gender :””, Age: 30, Treatment:”Vaccine A”}

Organisms

{ _id: 14, Project_ID: 5, Species: “human”, Gender :”Female”, Age: 10, Genetic_background: “WT”}

Data IS the schema!

Queries

{ _id: 4, Project_ID: 2, Species: “human”, Gender :””, Age: 30, Treatment:”Vaccine A”}

Organisms db.organisms.find( { Genetic_background: $exists: true } })

 SELECT ??? 

Model the relationships between data 1:N

Organisms

Samples

N

1

OrganismsSample_ids: [ ]

SamplesOrganism_id:

OR?

● Organism embeds multiple samples

MongoDB

Nested documents

_id Field 1 Field 2 Field 3

Queries

{ _id: 4, Project_ID: 2, Species: “human”, Gender :””, Age: 30, Samples: [ { _id: 10, Timepoint:”5h”},

{ _id: 11, Timepoint:”24h” } ], Treatment:”Vaccine A”}

Organisms

db.organisms.find( { '_id': '4',   'Samples._id':'11'  } )

db.organisms.find( { '_id': '4',  'Samples.Timepoint':'5h'  } )

Summary

● Database design requires technical and substantial domain specific knowledge

● Normalization

● Indices

MySQL

● Primary Key

● Foreign Key

● Join Tables

MongoDB

● Unique ID

● References

● Embedding

Hint: http://en.wikipedia.org/wiki/Category:Web_application_frameworks

Laboratory information management system (LIMS)

Organisms

Projects

Users

AffiliationsSamples

N

N

N

N

N

N

11

An underlying data structure of a simple LIMS design

LIMS definition

http://en.wikipedia.org/wiki/Laboratory_information_management_system

„A Laboratory Information Management System (LIMS), sometimes referred to as a Laboratory Information System (LIS) or Laboratory Management System (LMS), is a software-based laboratory and information management system that offers a set of key features that support a modern laboratory's operations.“

LIMS properties and functionality

http://en.wikipedia.org/wiki/Laboratory_information_management_system

● Meta data of any sample entering the laboratory

● Tracking of processes throughout sample treatment and preparation; scheduling of the sample and the associated analytical workload

● Quality control associated with the sample and the utilized equipment and inventory

● Inspection, approval, and compilation of the sample data for reporting and/or further analysis

Advantages of LIMS

50

● Fewer transcription errors

● Faster sample processing

● Real-time control of data and metadata

● Reproducibility of experimental processes

● Direct electronic reporting to clients

● Despite many advantages,...

Disadvantages of LIMS

● Customization of LIMS

● Interface is required

● Adequate validation to ensure data quality

With a good LIMS in place, we can consider Electronic Laboratory Notebooks

Electronic laboratory notebooks (ELN)

http://en.wikipedia.org/wiki/Electronic_lab_notebook

An electronic lab notebook (also known as electronic laboratory notebook, or ELN) is a computer program designed to replace paper laboratory notebooks. Lab notebooks in general are used by scientists, engineers and technicians to document research, experiments and procedures performed in a laboratory. A lab notebook is often maintained to be a legal document and may be used in a court of as evidence.

Prominent use-case: review process

http://rushthecourt.net/mag/wp-content/uploads/2010/09/Three-Ring-Binders.jpg

● You submit a paper

● Several months of review process is not unlikely

● Reviewers ask for a more detailed description of the experiments you did two years back

Traditional Paper Lab Books

ELN, a survey

Journal of Laboratory Automation 18(3) 229–234, 2012 Society for Laboratory Automation and Screening

DOI: 10.1177/2211068212471834

Examples of ELN software

Practical issues

● Lab technicians “have only two hands”

● Labs are often not equipped with desktop PCs

● Data security of ELNs opposes challenges

● Scientists are classically reluctant adopters

● There is activation energy required to change work habits

● In academic science there is no formal obligation

● Establishment requires stringent modeling (see previous slides on databases) or significant investments into existing tools

Mobile application of ELNs

Nature Methods 8, 541–543 (2011) doi:10.1038/nmeth.1631

● Handwriting capture technology

● All functionality as on paper● Sketch and manipulate

equations● Draw figures

● All notes can be linked, reordered, archived, edited, tagged, annotated and bundled in virtual 'notebooks' representing different projects

Easy solutions

Evernote as lab notebook

Journal of Laboratory Automation 18(3) 229–234, 2012 Society for Laboratory Automation and Screening

DOI: 10.1177/2211068212471834

Recommended