Data Modelling In Google App Engine For Java

Guidelines and tips

Who is this presentation for?

• This presentation is aimed at data modellers and developers interested in our experiences with the Google App Engine for Java (GAEJ) and in particular data modelling.

• We’ve tried to introduce a bit of background for each of the topics, but we do make some assumptions:

– We assume familiarity with Java syntax

– We also assume you’ve done some data modelling before and are familiar with basic relational or object principles

– If not, there are some great resources that you can find via Google that explain the basics of data modelling.

• This presentation outlines some of the things we’ve learned about the datastore whilst developing apps on GAEJ.

– These aren’t immutable laws but merely some guidelines that may make things easier

– We encourage readers to test things out themselves.

• Where we include code snippets we use JDO and annotations, as that’s our preference

– Rather than JPA and XML

Slide 2

Google App Engine for Java: What and why?

• Google App Engine (GAE) is a cloud-computing platform that allows users to build and deliver applications over the internet

• It was originally released in 2008 with support only for Python, and was extended in 2009 to provide Java support with Google App Engine for Java (GAEJ)

• Google charge depending on usage (storage, bandwidth, CPU)

• It provides a cheap way to deploy scalable web applications in Java

• The development environment is built in to Eclipse, and it is fast and easy to develop applications

• As well as Google’s proprietary datastore there are other built-in services (email, chat, queues, etc.)

Slide 3

Working with the datastore

• Google allows applications deployed on the GAE to persist data in a data storage repository (“the datastore”)

– It’s the only storage choice available when using the GAE

• It’s an Object database built on Google’s proprietary BigTable implementation

• To comply with existing standards and make things as painless as possible for Java developers, Google App Engine for Java (GAEJ) exposes this datastore via JPA and JDO

– In addition, a low-level API is also provided

– Also, there are some non-implemented parts of the JDO spec

• There are big differences in the design and implementation choices that can be made between the datastore and traditional relational databases (RDBs)

– These differences are often obscured by the fact that developers use JDO/JPA, which are more commonly associated with ORMs

Slide 4

http://bit.ly/bKVtta

Reader beware!

• Some of these guidelines mean that your model will not follow strict data modelling design rules (either relational or object)

– We are both designers and developers and yes we do understand those rules ...

– We also understand that (some) designers can feel passionately about those rules ...

– However we have chosen to make things developer-friendly rather than designer-friendly.

– When designers get called out on support calls at 1 a.m. we will revise these guidelines to be more designer-friendly

• The tips in the following sections are all rules of thumb

• Sometimes they make extra work in other areas but simplify dealing with the datastore

• As always, design of real-world artefacts is a trade-off between time, cost, and materials

– We like GAEJ because it is cheap and quick to develop on so have made design choices for our apps that help us work around its differences from other platforms

• The tips are GAEJ-specific and we don’t necessarily recommend them for other data stores

– We are working on the Eternal and Universally Applicable Data Storage Guide ... watch this space

Slide 5

How this presentation is organised

• Each of the sections contains some guidelines and rules of thumb that we have found useful

• We use the term data modelling to cover object and relational modelling

• The sections run from high level architecture decisions down through design choices to development choices. Sections are:

– Architecture: Comparison to RDBs and when to choose GAEJ

– Design: Tips for defining your persistence model

– Development: Understanding the persistence lifecycle, and miscellaneous tips

• Space is limited so we only include snippets of code that are relevant

• Where we have found them useful, we include links to other sources. We’d encourage you to read and understand them too.

• The documentation around the datastore is getting a lot better and that is always a good starting point

• The DataNucleus JDO documentation is good and definitely worth reading.

Slide 6

http://bit.ly/9vZZCw

http://bit.ly/cOWAPf

Architecture: Relational compared to datastore

Relational database

• Relational databases are set-based

– Relations are intersections of sets

• Relational databases are “strongly-typed”

– The table defines the shape of the data (i.e. what the data must look like)

• Uses indexes and foreign keys to navigate

– Easier to navigate relationships

• Supports SQL as a means to access the underlying data

– Access to data independent of implementation

GAE datastore

• The datastore holds objects rather than rows

– Relations via associations

• The datastore defines Entities but these are not strongly typed

– Different versions of an Entity can have different attributes. Up to the app to deal with this.

• Uses object identity to navigate relationships

– Hold object ids as associations

• No support for SQL or other independent access

– Data access is via applications or GAEJ management console
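The point that different versions of an Entity can have different attributes, and that the application has to deal with it, can be sketched in plain Java. This is only an illustration under stated assumptions: SurveyRecord is a hypothetical stand-in for a stored entity (a bag of named properties) and does not use the GAEJ or JDO APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a stored entity: a bag of named properties,
// so two "versions" of the same kind can carry different attributes.
class SurveyRecord {
    private final Map<String, Object> properties = new HashMap<>();

    SurveyRecord set(String name, Object value) {
        properties.put(name, value);
        return this;
    }

    // The application, not the datastore, decides what a missing attribute means.
    String getString(String name, String defaultValue) {
        Object value = properties.get(name);
        return (value instanceof String) ? (String) value : defaultValue;
    }
}

class SchemaVersionDemo {
    public static void main(String[] args) {
        // An older record written before the "language" attribute existed.
        SurveyRecord oldVersion = new SurveyRecord().set("title", "Staff survey");
        // A newer record that carries the extra attribute.
        SurveyRecord newVersion = new SurveyRecord()
                .set("title", "Customer survey")
                .set("language", "fr");

        System.out.println(oldVersion.getString("language", "en"));
        System.out.println(newVersion.getString("language", "en"));
    }
}
```

Reading through a default like this keeps older records usable without rewriting them, at the cost of the application carrying the knowledge of every version.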

Slide 7

Architecture: When to choose GAEJ?

GAEJ may be a good choice if...

• Your application is read-biased

– Object datastores can be (crudely) viewed in relational terms as “denormalised”.

– Typically denormalised structures are easier to read than write (as you need to update many more objects / rows in a denormalised write)

• Your application is based around a single key entity

– In order to participate in a transaction, objects must be in the same EntityGroup. This is more easily achieved if there is a single key object (e.g. a Survey in a survey application, a User in a directory, etc.)

• Your application doesn’t have many relationships

– Relationships can be modelled through associations but it is less straightforward to navigate these relationships.

– GAEJ’s JDO and JPA implementations do not support the concept of joins

• Your application does not return large datasets

– Google has recently removed the previous limit of 1000 rows in a result set, but GAEJ isn’t the natural choice for batch or ETL type applications as you will probably want something that can support checkpoints

GAEJ is not a good choice if...

• Your object model is large or complex

– GAE is a good service, but larger, more complex models will fall foul of its limitations at design time (there is no easy way to use CASE tools to model the data)

• You are working with a highly-normalised domain

– E.g. providing access to a master data store. Normalised data is best represented in a relational database

• Your data profile is more write-based

– Writes are more expensive in the datastore owing to the nature of the underlying implementation

– Also, because the data is denormalised, a write may need to span several entities

• You are updating lots of data at once

– This is an extension of the point above re. efficiency. Also, the datastore does not support the concept of savepoints (although the behaviour can be replicated).

• You have transactions spanning multiple objects

– Transactions in the datastore are limited to objects in the same EntityGroup, which can lead to complexity in design (more about this later)

– Having said that, think carefully about whether you really need transactions. Working with transactions (esp. failed ones) is complex from a business rules perspective and (controversial statement coming) most apps can probably do without them

Slide 8

Design: A bit about persistence

Slide 9

Design tips: Selecting persistent entities

• Before you begin annotating your object model we suggest you divide the classes in your model into managed data and reference data

• Managed data

– Things that your application looks after, often from birth to death. For example, in a survey application this might be the survey, the participant, the questions, etc.

– Sometimes the data can be imported (e.g. getting a list of users from an LDAP directory and then adding the users’ qualifications to the data) but it is still managed

• Reference data

– Data that supports your application. For example, country codes, airport codes, currency codes, etc.

• If the data is managed then you need to mark it as persistence managed

• If the data is reference data then you need to choose.

– If you import the data (e.g. our LDAP example) then it may be better to manage the lifecycle (import and destruction) yourself. Otherwise you end up multi-mastering the data and having to reconcile the two lists.

– If it is reference data that your end-user controls (e.g. a set of product codes) then you might want to mark it as persistent, but make it standalone (i.e. it doesn’t appear in any relationships and is stored by value in the entity it enhances, which is “denormalised” in relational-speak)

– If it can’t be stored by value then it probably isn’t reference data.

• There will be a third, grey area of data

– These are things like customer addresses, customer loyalty levels, blog post category, etc.

– These are usually data that belong to or are associated with other data (we cover this in the next section)

– It pays to think how and when this data changes. Is it only when the key entity changes, or can they change independently?

– If they can change independently then maybe they should be persistence managed

Slide 10

Design: Dealing with relationships (1/3)

• Collections

– Ordered collections are supported, but GAEJ needs to create a separate index representing the position in the collection.

– This means that manipulation of the collection can be potentially inefficient, especially if items are being added in the middle of the collection, as reindexing will need to occur.

– A good use for an ordered collection would be a set of question options in a survey (as the options are extremely stable once defined).

– A bad use for order would be a league table in an online game (better to store it in a different data structure that holds the position, and then order in the client).

• Foreign key constraints

– Don't really exist in the sense of referential integrity. There is the concept of a Key object which can be used to associate a child with a parent (and so feels like an FKEY) but there is no checking by the datastore to see whether it exists.

• Owned (parent-child) relationships

– The two entities belong in the same EntityGroup.

– In order to perform an update within a transaction, the entities must belong to the same EntityGroup (as being in the same entity group means that the data is physically co-located)

– Do this by setting the parent’s key in the child object (see Google's documentation)

– However, there is a limitation with this in that GAEJ can only perform 1-10 writes per second to an entity group. This means that if the object(s) involved receive more writes per second than this then the call will block.
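The league table case above can be sketched as follows: store the scores without relying on collection order, then work out the ranking when the table is read. This is a plain-Java illustration only; PlayerScore and LeagueTable are hypothetical names, and the in-memory list stands in for whatever was loaded from the datastore.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stored record: the score is persisted, the position is not.
class PlayerScore {
    final String player;
    final int score;

    PlayerScore(String player, int score) {
        this.player = player;
        this.score = score;
    }
}

class LeagueTable {
    // Returns a new list sorted by score, highest first.
    // The order of the stored list is never relied upon or modified.
    static List<PlayerScore> ranked(List<PlayerScore> stored) {
        List<PlayerScore> copy = new ArrayList<>(stored);
        copy.sort(Comparator.comparingInt((PlayerScore p) -> p.score).reversed());
        return copy;
    }
}

class LeagueTableDemo {
    public static void main(String[] args) {
        List<PlayerScore> stored = Arrays.asList(
                new PlayerScore("ann", 10),
                new PlayerScore("bob", 30),
                new PlayerScore("cy", 20));

        List<PlayerScore> table = LeagueTable.ranked(stored);
        for (int i = 0; i < table.size(); i++) {
            System.out.println((i + 1) + ". " + table.get(i).player + " " + table.get(i).score);
        }
    }
}
```

A score change then touches one record, rather than forcing positions in an ordered collection to be reindexed.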

Slide 11

http://bit.ly/9XIkFB


Design: Dealing with relationships (2/3)

• What kind of association is it: Composition or Aggregation?

• Composition

– The relationship is created and destroyed along with the owning entity (e.g. a survey may have a set of questions associated with it. If you delete the survey then the questions should be deleted too, assuming there isn’t the concept of a question bank)

– In this case think about modelling the object as an embedded object

– An embedded object is stored by value with its owner, so you don’t have to worry about navigating associations

– Make an object embedded by annotating the property with @Embedded. A similar store-by-value effect comes from @Persistent(serialized="true"), which requires the class to be Serializable and serializes/deserializes the whole object on each store and load (strictly, JDO treats embedded and serialized fields as different things). We prefer the latter as it is not only clearer but it allows you to specify other attributes (more later)

– If you have 1000s of objects then maybe serializing them isn’t such a great idea, as it is much more of an overhead than looking them up.

• Aggregation

– A relationship exists between your entity and the other object but the associated object has some kind of independent existence (e.g. just because a football team folds it doesn’t mean that all the players have to be destroyed)

– How large is your associated object? Do the associated objects get updated in bulk or usually one at a time?

– It may be worth just storing the objectId of the associated object and then resolving the relationship yourself (using PersistenceManager.getObjectById())

– If the objects can be updated independently of the key entity then you probably need to model them as independent persistent entities. You definitely won’t want to model them as embedded objects in this case.

Slide 12
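The aggregation advice (hold only the id of the associated object and resolve it yourself) can be sketched in plain Java. The assumptions are loud ones: PlayerStore is a hypothetical in-memory stand-in for a datastore lookup such as PersistenceManager.getObjectById(), and none of these classes use the JDO API. Because nothing enforces referential integrity, the resolving code has to cope with ids that no longer point at anything.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Player {
    final String id;
    final String name;

    Player(String id, String name) {
        this.id = id;
        this.name = name;
    }
}

class Team {
    final String name;
    // Only the ids are held; players exist (and are deleted) independently of the team.
    final List<String> playerIds = new ArrayList<>();

    Team(String name) {
        this.name = name;
    }
}

// Hypothetical in-memory stand-in for looking objects up by id in the datastore.
class PlayerStore {
    private final Map<String, Player> byId = new HashMap<>();

    void save(Player player) {
        byId.put(player.id, player);
    }

    void delete(String id) {
        byId.remove(id);
    }

    // Returns null when nothing is stored under the id.
    Player findById(String id) {
        return byId.get(id);
    }
}

class Roster {
    // Resolves the association by hand, skipping ids that no longer resolve.
    static List<Player> resolve(Team team, PlayerStore store) {
        List<Player> found = new ArrayList<>();
        for (String id : team.playerIds) {
            Player player = store.findById(id);
            if (player != null) {
                found.add(player);
            }
        }
        return found;
    }
}

class RosterDemo {
    public static void main(String[] args) {
        PlayerStore store = new PlayerStore();
        store.save(new Player("p1", "Ann"));
        store.save(new Player("p2", "Bob"));

        Team team = new Team("Rovers");
        team.playerIds.add("p1");
        team.playerIds.add("p2");

        store.delete("p2"); // the team still holds the id, but it no longer resolves
        System.out.println(Roster.resolve(team, store).size());
    }
}
```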


Design: Dealing with relationships (3/3)

• Can the attributes of the association be denormalised?

– What data do I really need when modelling the association?

• For an address, is the house number and postcode sufficient?

– What data actually changes in the association?

• In a survey participant is it just the answers and the status?

– If so, perhaps these data can be stored in the entity the other side of the relationship

– Updates become a bit more complex as you need to keep the denormalised attributes in step

– If there is a lot of updating happening then denormalising isn’t the right choice

• What use cases cause the association to be updated?

– This is a general way of asking the same questions as above

– But until you understand when and where your data is created then you will not be able to model it right (unless you get lucky)

– We use sequence diagrams to help us

Slide 13

• The sequence diagram opposite helps us understand, and shows a potential case of when not to denormalise

– A user completes a survey that causes her status to be updated. We also need to keep track of the number of completions to understand survey uptake

– The example updates a denormalised count in the Survey

– But do we really want to update the survey every time a participant completes the survey? If there are 15 participants maybe? If there are 15,000 participants probably not

– Making it explicit through sequence diagrams can help

– The diagrams can be done with paper and pencil. The important thing is understanding the object lifecycle
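The trade-off in the completion-count example can be made concrete with a small plain-Java sketch. Everything here is hypothetical: the class and field names are invented for illustration, surveyWrites simply counts how often the Survey itself would have to be stored, and counting on read is one possible alternative rather than a recommendation from the slides.

```java
import java.util.ArrayList;
import java.util.List;

class Participant {
    boolean completed;
}

class Survey {
    final List<Participant> participants = new ArrayList<>();
    int completionCount; // denormalised count held on the survey
    int surveyWrites;    // how many times the survey itself had to be stored

    // Option 1: keep the count in step on every completion.
    // Each completion costs a participant write plus a survey write.
    void completeDenormalised(Participant participant) {
        participant.completed = true;
        completionCount++;
        surveyWrites++;
    }

    // Option 2: only the participant changes; the survey is left alone.
    void completeParticipantOnly(Participant participant) {
        participant.completed = true;
    }

    // With option 2 the count is worked out when somebody asks for it.
    int countCompletionsOnRead() {
        int count = 0;
        for (Participant participant : participants) {
            if (participant.completed) {
                count++;
            }
        }
        return count;
    }
}

class CompletionCountDemo {
    public static void main(String[] args) {
        Survey survey = new Survey();
        for (int i = 0; i < 15; i++) {
            Participant participant = new Participant();
            survey.participants.add(participant);
            survey.completeParticipantOnly(participant);
        }
        System.out.println(survey.countCompletionsOnRead() + " completions, "
                + survey.surveyWrites + " survey writes");
    }
}
```

Which option wins depends on how often the count is read compared with how often participants complete the survey, which is exactly what the sequence diagram is meant to expose.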

Development: The persistence lifecycle (1/2)

• There are three main states (and lots of sub-states) in the persistence lifecycle

– See the JDO spec for full details

• Transient objects behave just like normal POJOs.

– We don’t really talk about these

• Persistent objects have their behaviour linked to the underlying transactional datastore.

– This means that JDO tracks changes to the instances and refreshes the instance with the values in the datastore and also stores changes back to the datastore.

– They will be under the stewardship of a (single) PersistenceManager.

• Detached objects are similar to persistent objects and maintain a datastore identity.

– However, not all fields will be loaded and any attempt to access an unloaded field is denied.

– Detached objects can have their loaded fields changed and these will be stored to the datastore (if reattached). Detached objects do not adhere to transactional boundaries.

• A web application will typically perform the following:

– Get an object from the datastore based on a user’s request. For example, a clerk calling up a customer’s record

– Manipulate and change the object via the user interface. For example, a clerk adds a postcode to a customer’s record

– Store the newly updated object back in the datastore

– In order to achieve this, the object must be detached from the PersistenceManager’s stewardship, so that the web layer can make updates to it

Slide 14

http://bit.ly/4vzcI2

Development: The persistence lifecycle (2/2)

• Detaching

– You can manually detach

• Call PersistenceManager.detachCopy(object) to detach a copy of the object, or detachCopyAll(collection) to detach persistent collections

• This can be a bit messy and error-prone (if you forget to detach)

– You can detach following a successful transaction (our preference)

• By calling PersistenceManager.setDetachAllOnCommit(true)

• This can also be set in the JDO configuration file but we prefer to make it explicit in the code

• Detaching @Embedded properties can be quirky

– If you have an embedded collection then it is not detached even when setting detachAllOnCommit (you get an error when trying to access the collection)

– You can work around this by “touching” the object whilst in the transaction

• This effectively fetches the object so that when the detach is called, the collection is present

– A better way is to change the persistence declaration of the object to keep it as embedded but ensure that the collection is in the same fetch group as its object

• Do this through @Persistent(serialized="true", defaultFetchGroup="true")

• If you have named fetch groups you will need to change this

Slide 15

Development tips: Working with queries

• Queries in GAEJ look similar to queries in other ORMs, but there are differences

• The entity you are querying on has to have an index

• There are restrictions on some operators, e.g. <>

• You can’t query on Blob and Text fields

• Sometimes what appears to be a single query will in fact result in two or more queries

• The differences are subtle and non-trivial so we recommend reading the Google documentation

• If you are going to perform a “join-query”, i.e. filtering on an attribute held in a Collection (e.g. find all customers with postcode beginning “W1”), then it is probably worth denormalising (e.g. holding the postcode as an attribute of the Customer)

– You can always make it read-only to callers of the object attribute through class design

• We try to keep our use of queries as simple as possible and only use them for list behaviours where we return a collection of objects of the same type

– We know that this isn’t always possible, but it is worth thinking of alternatives to other queries (e.g. splitting into multiple queries, resolving in the middle layer) just to make life easier

– Again, this will depend on how many objects you have, how often the query will be run, and how often the data changes

– These are all things that you need to consider before writing your data access code
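The postcode example above (hold the postcode on the Customer and make it read-only to callers) might look something like this in plain Java. It is a sketch under assumptions: the class names and the setPrimaryAddress method are hypothetical, and the in-memory loop in CustomerSearch only stands in for a datastore query that filters on a single attribute of Customer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class Address {
    final String line1;
    final String postcode;

    Address(String line1, String postcode) {
        this.line1 = line1;
        this.postcode = postcode;
    }
}

class Customer {
    private final String name;
    private final List<Address> addresses = new ArrayList<>();
    // Denormalised copy of the primary address postcode, so it can be filtered on directly.
    private String postcode;

    Customer(String name) {
        this.name = name;
    }

    // The only way to change the postcode: it is kept in step with the primary address.
    void setPrimaryAddress(Address address) {
        addresses.add(0, address);
        postcode = address.postcode;
    }

    String getName() {
        return name;
    }

    // Read-only to callers: there is deliberately no setPostcode().
    String getPostcode() {
        return postcode;
    }

    List<Address> getAddresses() {
        return Collections.unmodifiableList(addresses);
    }
}

class CustomerSearch {
    // Stand-in for a query on the single denormalised attribute.
    static List<Customer> withPostcodePrefix(List<Customer> all, String prefix) {
        List<Customer> hits = new ArrayList<>();
        for (Customer customer : all) {
            String postcode = customer.getPostcode();
            if (postcode != null && postcode.startsWith(prefix)) {
                hits.add(customer);
            }
        }
        return hits;
    }
}

class CustomerSearchDemo {
    public static void main(String[] args) {
        Customer ann = new Customer("Ann");
        ann.setPrimaryAddress(new Address("1 High St", "W1A 1AA"));
        Customer bob = new Customer("Bob");
        bob.setPrimaryAddress(new Address("2 Low Rd", "SE1 9GF"));

        for (Customer hit : CustomerSearch.withPostcodePrefix(Arrays.asList(ann, bob), "W1")) {
            System.out.println(hit.getName());
        }
    }
}
```

Routing every change through setPrimaryAddress is what keeps the denormalised copy in step with the address it was copied from.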

Slide 16

http://bit.ly/djG57o

Development tips: Miscellaneous tips

• GAEJ provides access to a memcached cache

• We try to use the cache wherever possible

– The cache built in to GAEJ is easy to use and follows the JSR-107 spec

– There is also a low-level API although we have never had occasion to use it

– The cache is significantly faster than using the datastore

– The Google documentation tells you all you need to know about the API

• More importantly, you must understand your objects’ lifecycle before caching

– When will the object in the cache need to be refreshed?

– When should objects be evicted from the cache?

– This exercise is no different in GAEJ than designing caching in other applications

• Rule of thumb #1: We “always” cache reference data

– There will no doubt be an occasion when we don’t cache some reference data

• Rule of thumb #2: We cache objects that are frequently accessed

– For example, an online survey that is currently in progress
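Rule of thumb #1 can be sketched as a read-through cache in plain Java. The assumptions matter here: a HashMap stands in for the GAEJ cache, the loader function stands in for a datastore read, and ReferenceDataCache is a hypothetical name rather than part of any Google or JSR-107 API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical read-through cache for reference data such as country codes.
class ReferenceDataCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> loader; // stand-in for a datastore read
    private int loads;                              // how often the loader was called

    ReferenceDataCache(Function<String, String> loader) {
        this.loader = loader;
    }

    // Serve from the cache when possible; otherwise load and remember the value.
    String get(String code) {
        String value = cache.get(code);
        if (value == null) {
            value = loader.apply(code);
            loads++;
            if (value != null) {
                cache.put(code, value);
            }
        }
        return value;
    }

    // Call this when the underlying reference data changes.
    void evict(String code) {
        cache.remove(code);
    }

    int getLoads() {
        return loads;
    }
}

class ReferenceDataCacheDemo {
    public static void main(String[] args) {
        ReferenceDataCache countries =
                new ReferenceDataCache(code -> code.equals("GB") ? "United Kingdom" : null);

        countries.get("GB"); // first read goes to the loader
        countries.get("GB"); // second read is served from the cache
        System.out.println(countries.getLoads());
    }
}
```

The evict method is where the lifecycle questions above show up in code: something has to decide when a cached value is no longer trustworthy.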

Slide 17

http://bit.ly/9QT0DL

We’d like to hear from you

• We have several blog posts that delve into more detail on GAEJ. Feel free to comment on them and share your experiences

– True North blog

• Did you know that Google have an AppEngine blog oriented to developers?

• We’re always happy to learn from your experiences with the datastore or with GAEJ:

– Contact us via Twitter

– Or email us

Slide 18

http://bit.ly/9R8AXm

http://bit.ly/btRj35

http://twitter.com/truenorth_buzz

mailto:[email protected]
