Storing data in databases
The webinar will begin at 3pm
• You now have a menu in the top right corner of your screen.
• The red button with a white arrow allows you to expand and contract the webinar menu, in which you can write questions/comments.
• We won’t have time to answer questions while we are presenting, but will answer them at the end
• You will be on mute throughout – we can’t hear you.
Storing data in databases
Webinar
25 October 2016
Peter Smyth, UK Data Service
Can you hear us?
• If not:
• Check your volume, and that your speaker/headset is plugged in.
• Your invitation also included a phone number that you can call to listen in.
o UK +44 (0) 330 221 9914
o US +1 (914) 614-3429
• We are recording this webinar, so you can always listen to it later.
Overview of this webinar
• Definition of a database
• Why Excel isn’t always good enough
• Different database types and availability
• Relational Databases
o A bit of history
o Data organisation
o Limitations
o Query examples
• Document Databases
o MongoDB
o Query examples
• Graph Database demo
Definition of Database
“A structured set of data held in a computer, especially one that is accessible in various ways.” (Oxford University Press)
• Structured = Ordered? Or Arranged?
o Nothing about the details of the structuring
• Accessible = Searchable, able to query the contents to see what is there
Not a database! - Why not?
What about Excel?
• Worksheets are tabular in nature - very structured
• You can join sheets together using the VLOOKUP
function
• There is a set of Database type functions (DSUM,
DCOUNT etc.)
• You can write queries to filter the rows
Excel Restrictions
• Sheets have a limit of about 1 million rows (2^20 = 1,048,576)
• VLOOKUP can only return a single column
• The database functions can only return a single value
• Setting up queries is quite complex
Why use a desktop database?
• Size of data
• Convenience of a desktop system
• Flexibility in collecting and persisting data
• Flexibility in querying and analysis
Growing and shrinking data
[Figure: Tweets and Smart meter data placed on a scale of data size running from 1Kb through 1Mb and 1Gb to 25+ Gb. Tweet labels: Data from Tweet, Sent Tweet, All tweets from user, All tweets from User & Friends. Smart meter labels: All Smart meter data, Smart meter by day, Smart meter by Month, By Month and Geography. Regions of the scale: Desktop Application, Desktop Database (5GB to 25+ GB), Big Data Environment.]
Types of Databases
There are many different types of databases. For the end user there are probably four main types.
• Relational databases (MySQL, MS SQL, SQLite, Postgres, …)
• Document databases (MongoDB, CouchDB, …)
• Graph databases (Neo4j, Titan, …)
• Wide column stores (Cassandra, HBase, …)
Types of Databases
• Relational Databases predominate – by a long way
o Data held in tables with defined relationships between the tables
• Document databases and wide column databases use storage architectures designed to overcome some of the scalability problems of relational databases. Since Big Data sources have become available, these are gaining in popularity
• Graph Databases are designed to optimise a specific type of querying of data, where you are more interested in the relationships between different items than in the actual attributes of the items. They are often used with social networks
Types of Databases
• The link below provides a table of the different database systems available and their relative use. Both commercial and free database systems are included.
• http://db-engines.com/en/ranking
Types of Databases (Table)
Freely available options
The Relational Model
• Why do we have it?
• What is it good for?
• What are the pros and cons?
• What do we mean by relational?
The Relational Model - History
• The term "relational database" was first used by E. F. Codd in 1970 in the paper "A Relational Model of Data for Large Shared Data Banks"
• Although not necessarily the primary driver, it should be
noted that at the time computer storage was very
expensive
• The Relational model can be very efficient when storing data.
Typically data items are stored only once
The Relational Model - History
Storage prices fell from about $193K per GB in 1980 to about $0.03 per GB in 2014
http://www.mkomo.com/cost-per-gigabyte-update
The Relational Model – How it works
• If I wanted to record the details of a house and the people who lived there, I could create a table like this:
• I would need a single record for each person at that address
Table: HouseHold_All
Columns: HouseHold_Id, Address, PostCode, Person_id, FirstName, LastName, DOB, Sex, Age, No_of_Rooms, No_of_Occupants, Type, Construction
The Relational Model – How it works
And populate it with data, like this:
HouseHold_Id | Address | PostCode | Person_id | FirstName | LastName | DOB | Sex | Age | No_of_Rooms | No_of_Occupants | Type | Construction
1 | Some street, Some Town | AA1 2BB | 1 | Alfie | Smith | 17/09/1963 | M | 60 | 8 | 5 | Semi | Brick
1 | Some street, Some Town | AA1 2BB | 2 | Jane | Smith | 05/02/1970 | F | 60 | 8 | 5 | Semi | Brick
1 | Some street, Some Town | AA1 2BB | 3 | John | Smith | 03/01/2001 | M | 60 | 8 | 5 | Semi | Brick
1 | Some street, Some Town | AA1 2BB | 4 | Jack | Smith | 10/10/2005 | M | 60 | 8 | 5 | Semi | Brick
1 | Some street, Some Town | AA1 2BB | 5 | Jenny | Smith | 07/05/2009 | F | 60 | 8 | 5 | Semi | Brick
These records all relate to the same household, but the data about the house itself is repeated for each person in the house.
The Relational Model – How it works
• It makes more sense to use multiple tables and split the data
between them
• This eliminates the need to duplicate data
• The arrows represent relationships between the tables.
• If I only wanted details about a person, I wouldn’t need to refer to the other tables
The Relational Model – How it works
• All of the Occupant information is kept in a single table.
• Details of the Property are only recorded once in the three
smaller tables
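The split into related tables can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the slide's table diagram is not reproduced here, so the table names (Household, Occupant) and the choice of just two tables are assumptions; the column names and values come from the HouseHold_All example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Assumed layout: one table for the property, one for the people.
conn.executescript("""
CREATE TABLE Household (
    HouseHold_Id INTEGER PRIMARY KEY,
    Address      TEXT,
    PostCode     TEXT,
    No_of_Rooms  INTEGER,
    Type         TEXT
);
CREATE TABLE Occupant (
    Person_id    INTEGER PRIMARY KEY,
    HouseHold_Id INTEGER REFERENCES Household(HouseHold_Id),
    FirstName    TEXT,
    LastName     TEXT
);
""")

# The house details are stored once...
conn.execute(
    "INSERT INTO Household VALUES (1, 'Some street, Some Town', 'AA1 2BB', 8, 'Semi')"
)
# ...and each person is one row that points at the house by its id.
conn.executemany(
    "INSERT INTO Occupant VALUES (?, ?, ?, ?)",
    [(1, 1, "Alfie", "Smith"), (2, 1, "Jane", "Smith"), (3, 1, "John", "Smith")],
)

# Details about people only: no need to touch the Household table.
names = [row[0] for row in conn.execute(
    "SELECT FirstName FROM Occupant ORDER BY Person_id")]

# Combining people with house details uses the relationship (a join).
joined = conn.execute("""
    SELECT o.FirstName, h.PostCode
    FROM Occupant AS o
    JOIN Household AS h ON o.HouseHold_Id = h.HouseHold_Id
    ORDER BY o.Person_id
""").fetchall()

household_rows = conn.execute("SELECT COUNT(*) FROM Household").fetchone()[0]
```

Here the address and postcode appear in one Household row however many occupants there are, and the join rebuilds the wide view only when it is needed.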
The Relational Model - Advantages
• Data is only stored once (across multiple tables if
necessary)
• Efficient for well-known and structured data
• Well defined and understood query language (SQL)
o Variants available for all relational databases
• Schema on Write allows comprehensive data checking
before loading – making for cleaner data
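As a rough illustration of schema on write, the sketch below uses Python's built-in sqlite3 module to declare rules on a table and then shows the database refusing rows that break them. The table definition, the specific constraints, and the helper name try_insert are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared before any data is loaded (schema on write).
conn.execute("""
CREATE TABLE Occupant (
    Person_id INTEGER PRIMARY KEY,
    FirstName TEXT NOT NULL,
    Sex       TEXT NOT NULL CHECK (Sex IN ('M', 'F'))
)
""")

def try_insert(row):
    """Hypothetical helper: return True if the row was accepted."""
    try:
        conn.execute("INSERT INTO Occupant VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        # Raised when a NOT NULL or CHECK constraint is violated.
        return False

accepted = try_insert((1, "Alfie", "M"))         # satisfies every rule
rejected_code = try_insert((2, "Jane", "X"))     # 'X' fails the CHECK
rejected_missing = try_insert((3, None, "F"))    # missing FirstName

stored = conn.execute("SELECT COUNT(*) FROM Occupant").fetchone()[0]
```

Only the valid row ends up in the table, which is the sense in which checking at load time makes for cleaner data.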
The Relational Model - Disadvantages
• The need for multiple tables increases loading times
• Uses vertical scaling
o Not really relevant for desktop databases
• Schema on write cannot deal with unstructured data
efficiently, if at all
Document Databases
• Why do we have it?
• What is it good for?
• What are the pros and cons?
• What is meant by a document?
Document Database
• A ‘document’ does not mean a pdf or word document
• A document is semi-structured data
• It is ‘structured’ in that every data item in the document has a name associated with it
• It is ‘semi-’ in that different documents in the same
collection of documents don’t have to have the same set
of names
JSON Example – semi-structured data
• The most popular format for Semi-structured data is
JSON.
• Most data that can be downloaded from a Web based
API will be in JSON format (or at least offer JSON as a
choice of format)
JSON Example – semi-structured data
The following is a simple example of JSON formatted data
{
  "Name": "Manchester",
  "PostCode": "M13 9PL",
  "Established": 1824
}
It is split over several lines just to aid reading. Everything between the ‘{’ and ‘}’ represents a single record, or document
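A short sketch of reading such a document in Python, using the standard json module. Note that JSON itself requires double quotes around names and text values; the variable names are chosen for this example only.

```python
import json

# The document from the slide, written as valid JSON text.
text = """
{
  "Name": "Manchester",
  "PostCode": "M13 9PL",
  "Established": 1824
}
"""

# json.loads turns the text into a Python dictionary.
document = json.loads(text)

# Every value is reached through its name.
name = document["Name"]
established = document["Established"]   # numbers stay numeric
field_names = sorted(document)          # the names present in this document
```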
Document Databases
• The semi-structured nature means that it is difficult to store the data in tables
o Not all fields need to be in each document
o Fields don’t need to be in the same order
{ "id" : 1234, "Name" : "Peter", "Tel" : "012345678" }
{ "Name" : "John", "id" : 3523, "Email" : ["[email protected]", "[email protected]"], "Mob" : "012345678" }
• Even more difficult to create a schema for the data in advance
• Instead, data is stored ‘as-is’ and a schema is ‘created’ when the data is read – Schema on read
Document Databases - NoSQL
• Non-relational databases like MongoDB typically do not use SQL to query the data.
• When you install MongoDB you are provided with a simple Shell interface from which you can query the database.
• Use of the Shell to query requires a knowledge of JavaScript.
• As an alternative, both Python and R have packages which interface to MongoDB to allow querying of the database using native Python or R like constructs
• The semi-structured nature of the data adds to the complexity of querying
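MongoDB queries are themselves written as documents: a filter lists the names and values a matching document must have. The sketch below imitates that idea in plain Python without connecting to MongoDB; the find function here is a hypothetical stand-in, not the MongoDB API. With the pymongo package and a running server, the comparable call would be collection.find({"Name": "John"}).

```python
# A small in-memory "collection" of semi-structured documents.
collection = [
    {"id": 1234, "Name": "Peter", "Tel": "012345678"},
    {"id": 3523, "Name": "John", "Mob": "012345678"},
    {"id": 4001, "Name": "John"},
]

def find(docs, query):
    """Hypothetical stand-in for a query-by-example search.

    Returns every document that contains all the name/value pairs
    given in the query document.
    """
    return [
        doc for doc in docs
        if all(name in doc and doc[name] == value
               for name, value in query.items())
    ]

johns = find(collection, {"Name": "John"})
john_with_mobile = find(collection, {"Name": "John", "Mob": "012345678"})
no_match = find(collection, {"Email": "anything"})
```

Because documents need not share the same names, a query has to cope with fields that are simply absent, which is part of the added complexity.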
A Graph Database – Neo4j
• The default installation of Neo4j provides a simple
default ‘Movies’ database.
• It also comes with tutorials to help get you started
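To give a feel for the relationship-centred queries that graph databases are built for, here is a plain Python sketch of a "friends of friends" question. It does not use Neo4j or its query language; the names, the follows data, and the function are invented for illustration.

```python
# Each person maps to the set of people they are connected to.
follows = {
    "Ann": {"Bob", "Cat"},
    "Bob": {"Cat", "Dan"},
    "Cat": {"Dan"},
    "Dan": set(),
}

def friends_of_friends(graph, person):
    """People two steps away who are not already direct connections."""
    direct = graph.get(person, set())
    second_step = set()
    for friend in direct:
        second_step |= graph.get(friend, set())
    # Exclude the person themselves and anyone they already follow.
    return second_step - direct - {person}

suggestions = friends_of_friends(follows, "Ann")
```

The query never looks at attributes of the people, only at how they are connected, which is the kind of access a graph database optimises.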
Summary
• The size of your data may be enough to make you
decide on using a desktop database
• But it may not be the only consideration
o How are you collecting the data over time?
o What is the structure of the data?
o How do you intend to use the data?
o Can you clean and structure the data as you collect it?
o Do you need to keep all of the raw data just in case?
Questions
Peter Smyth
ukdataservice.ac.uk/help/
Subscribe to the UK Data Service news list at https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=UKDATASERVICE
Follow us on Twitter https://twitter.com/UKDataService or Facebook https://www.facebook.com/UKDataService