Architectural Anti Patterns
Notes on Data Distribution and Handling Failures
Gleicon Moraes
http://zenmachine.wordpress.com
http://github.com/gleicon
@gleicon

Architectural anti-patterns for data handling


DESCRIPTION

Now with three more anti patterns and a new required listening. This is the Discipline release, all hail to King Crimson and Fripp's care with details.


Page 2: Architectural anti-patterns for data handling

Required Listening: King Crimson - Discipline

Page 3: Architectural anti-patterns for data handling

Anti Patterns

- Evolution from SQL Anti Patterns (NoSQL:br May 2010)
- More than just RDBMS
- Large volumes of data
- Distribution
- Architecture
- Research on other tools
- Message Queues, DHT, Job Schedulers, NoSQL
- Indexing, Map/Reduce
- New revision since QConSP 2010: included Hierarchical Sharding, Embedded lists and Distributed Global Locking

Page 4: Architectural anti-patterns for data handling

RDBMS Anti Patterns

Not all things fit in a relational database, single or distributed.

- The eternal table-as-a-tree
- Dynamic table creation
- Table as cache
- Table as queue
- Table as log file
- Stoned Procedures
- Row Alignment
- Extreme JOINs
- Your scheme must be printed in an A3 sheet
- Your ORM issues full queries for dataset iterations
- Hierarchical Sharding
- Embedded lists
- Distributed global locking

Page 5: Architectural anti-patterns for data handling

Doing it wrong, Junior !

Page 6: Architectural anti-patterns for data handling

The eternal tree

Problem: Most threaded discussion examples use a single table that holds all threads and answers, related to each other by an id. Usually the developer will come up with their own binary-tree version to manage this mess.

id - parent_id - author - text
1 - 0 - gleicon - hello world
2 - 1 - elvis - shout !

Alternative: Document storage:

{
  thread_id: 1,
  title: 'the meeting',
  author: 'gleicon',
  replies: [
    { author: 'elvis', text: 'shout', replies: [{...}] }
  ]
}
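A minimal sketch, in Python, of how the flat id/parent_id rows could be folded into a nested document like the one in the alternative. The function name `rows_to_thread` and the tuple layout are assumptions made for illustration; a document store would simply persist the resulting structure.

```python
def rows_to_thread(rows):
    """Fold (id, parent_id, author, text) rows into nested reply documents.

    parent_id 0 marks a top-level post, as in the example table.
    """
    children = {}
    for row_id, parent_id, author, text in rows:
        children.setdefault(parent_id, []).append((row_id, author, text))

    def build(parent_id):
        # Recursively attach each post's replies under its own "replies" key.
        return [
            {"author": author, "text": text, "replies": build(row_id)}
            for row_id, author, text in children.get(parent_id, [])
        ]

    return build(0)


rows = [
    (1, 0, "gleicon", "hello world"),
    (2, 1, "elvis", "shout !"),
]
thread = rows_to_thread(rows)
```

Once the thread is stored in this shape, reading a whole discussion is a single lookup instead of a recursive walk over parent_id.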

Page 7: Architectural anti-patterns for data handling

Dynamic table creation

Problem: To avoid huge tables, one comes up with a "dynamic schema". For example, let's think about a document management company which is adding new facilities across the country. For each storage facility, a new table is created:

item_id - row - column - stuff
1 - 10 - 20 - cat food
2 - 12 - 32 - trout

Now you have to come up with "dynamic queries", which will probably query a "central storage" table and issue a huge join to check whether you have enough cat food across the country.

Alternatives:

- Document storage, modeling a facility as a document

- Key/Value, modeling each facility as a SET
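The key/value alternative can be sketched with one SET per facility. This is only an in-memory illustration using a Python dict of sets, not a real store; the key names such as `facility:north` and the helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for a key/value store with SET values:
# one key per facility, each holding the set of items stored there.
facilities = defaultdict(set)


def store_item(facility, item):
    facilities[facility].add(item)


def facilities_with(item):
    """Return the facilities holding an item, with no per-facility table or JOIN."""
    return sorted(name for name, items in facilities.items() if item in items)


store_item("facility:north", "cat food")
store_item("facility:north", "trout")
store_item("facility:south", "cat food")
```

Adding a new facility is just a new key, so no table has to be created and no query has to be generated dynamically.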

Page 8: Architectural anti-patterns for data handling

Table as cache

Problem: Complex queries demand that a result be stored in a separate table so it can be queried quickly. Worse than views.

Alternatives: - Really ?

- Memcached

- Redis + AOF + EXPIRE

- De-normalization
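What a cache with expiration gives you, compared to a cache table, can be sketched in a few lines. This is an in-memory illustration in the spirit of EXPIRE, not the API of Memcached or Redis; the class name `ExpiringCache` and the injectable `clock` parameter are assumptions made so that expiry is easy to test.

```python
import time


class ExpiringCache:
    """Tiny in-memory cache whose entries expire after a time-to-live."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tested
        self._data = {}      # key -> (value, deadline)

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, deadline = entry
        if self._clock() >= deadline:
            del self._data[key]  # lazily drop the stale entry
            return default
        return value
```

Stale results disappear on their own, so nobody has to schedule a job to purge or rebuild a cache table.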

Page 9: Architectural anti-patterns for data handling

Table as queue

Problem: A table which holds messages to be processed. Worse, they must be ordered by time of creation.

Corollary: Job Scheduler table

Alternatives: - RestMQ, Resque

- Any other message broker

- Redis (LISTS - LPUSH + RPOP)

- Use the right tool
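The LPUSH + RPOP combination gives first-in, first-out ordering without any ORDER BY on a creation timestamp. The sketch below mimics those two operations with an in-memory deque; the class `ListQueue` is a hypothetical stand-in, not a Redis client.

```python
from collections import deque


class ListQueue:
    """In-memory stand-in for a Redis LIST used as a FIFO queue."""

    def __init__(self):
        self._items = deque()

    def lpush(self, message):
        # Producers push on the left (head) of the list.
        self._items.appendleft(message)

    def rpop(self):
        # Consumers pop from the right (tail), so the oldest message comes
        # out first; None is returned when the queue is empty.
        return self._items.pop() if self._items else None


queue = ListQueue()
for job in ("job-1", "job-2", "job-3"):
    queue.lpush(job)
```

Ordering falls out of the data structure itself, which is the point of using a queue instead of a table.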

Page 10: Architectural anti-patterns for data handling

Table as log file

Problem: A table in which data gets written as a log file. From time to time it needs to be purged. Truncating this table once a day usually is the first task assigned to new DBAs.

Alternative:

- MongoDB capped collection

- Redis, and RRD pattern

- RIAK
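The idea behind a capped collection or an RRD-style buffer is that the storage has a fixed size and old entries are overwritten automatically. A rough in-memory analogue, assuming a cap expressed as a number of entries and a hypothetical class name `CappedLog`:

```python
from collections import deque


class CappedLog:
    """Fixed-size log: once full, each append discards the oldest entry."""

    def __init__(self, max_entries):
        self._entries = deque(maxlen=max_entries)

    def append(self, entry):
        self._entries.append(entry)

    def entries(self):
        return list(self._entries)


log = CappedLog(max_entries=3)
for n in range(5):
    log.append(f"event {n}")
```

With this shape there is nothing to truncate once a day: the purge is part of every write.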

Page 11: Architectural anti-patterns for data handling

Stoned procedures

Problem: Stored procedures hold most of your application's logic. Also, some triggers are used to, well, trigger important data events.

Stored procedures and triggers have the magic property of vanishing from our memories and being impossible to keep versioned.

Alternative:

- Now be careful not to use map/reduce as modern stoned procedures. It is unfit for real-time search/processing.

- Use your preferred language for business stuff, and leave event handling to pub/sub or message queues.
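A minimal sketch of that split: business logic in ordinary application code, with the reaction a trigger would have performed moved to a subscriber. The `EventBus` class, the `order.created` topic and `create_order` are all hypothetical names, and the in-process bus only stands in for a real pub/sub system or message queue.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process publish/subscribe, standing in for a real broker."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self._handlers[topic]:
            handler(payload)


bus = EventBus()
audit_trail = []

# What a trigger would have done inside the database is now ordinary,
# versionable application code.
bus.subscribe("order.created", lambda order: audit_trail.append(order["id"]))


def create_order(order_id):
    order = {"id": order_id}  # business logic lives in the application
    bus.publish("order.created", order)
    return order


create_order(42)
```

Both the logic and the event handler now live in source control next to the rest of the code.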

Page 12: Architectural anti-patterns for data handling

Row Alignment

Problem: Extra columns are created but not used, just in case. Usually they are named a1, a2, a3, a4 and called padding.

There's good will behind that, especially when version 1 of the software needed an extra column in a 150M-row database and it took 2 days to run an ALTER TABLE. But that's no excuse.

Alternative:

- Quit being cheap. Quit feeling 'hacker' about padding

- Document-based databases such as MongoDB and CouchDB have no schema. New attributes are local to the document and can be added easily.

Page 13: Architectural anti-patterns for data handling

Extreme JOINs

Problem: Business stuff modeled as tables. Table inheritance (Product -> SubProduct_A). To find the complete data for a user plan, one must issue gigantic queries with lots of JOINs.

Alternative:

- Document storage such as MongoDB might help keep important information together.

- De-normalization

- Serialized objects
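De-normalization and serialized objects can be combined: resolve the references once, at write time, and store the result as one document. The sketch below uses hypothetical in-memory tables (`users`, `plans`, `products`) and JSON as the serialization format; none of these names come from a real system.

```python
import json

# Hypothetical normalized rows, as they might sit in three separate tables.
users = {1: {"name": "alice", "plan_id": 10}}
plans = {10: {"name": "basic", "product_id": 100}}
products = {100: {"name": "Product_A"}}


def denormalize_user(user_id):
    """Resolve the references once, at write time, into a single document."""
    user = users[user_id]
    plan = plans[user["plan_id"]]
    product = products[plan["product_id"]]
    return {
        "user": user["name"],
        "plan": {"name": plan["name"], "product": product["name"]},
    }


# The serialized document can be stored under one key and read back whole.
stored = json.dumps(denormalize_user(1))
user_plan = json.loads(stored)
```

The read path no longer needs a JOIN at all; the cost moves to keeping the stored document up to date when a plan or product changes.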

Page 14: Architectural anti-patterns for data handling

Your scheme fits in an A3 sheet

Problem: Huge data schemas are difficult to manage. Extreme specialization creates tables which converge to a key/value model. The normal form gets priority over common sense.

Product_A      Product_B
id - desc      id - desc

Alternatives:

- De-normalization

- Another scheme ?

- Document store for flattening model

- Key/Value

- See 'Extreme JOINs'

Page 15: Architectural anti-patterns for data handling

Your ORM ...

Problem: Your ORM issues full queries for dataset iterations, your ORM maps and creates tables which mimic your classes, even the inheritance, and the performance is bad because the queries are huge, etc.

Alternative:

- Apart from denormalization and good old common sense, ORMs are trying to bridge two things with distinct impedance.

- There is nothing in the relational model that maps cleanly to classes and objects. Not even the basic unit, which is the domain (set) of each column. Black Magic ?

Page 16: Architectural anti-patterns for data handling

Hierarchical Sharding

Problem: Genius at work. Distinct databases inside an RDBMS, ranging from A to Z; each database has tables for the users whose names start with that letter. Each table has user data. Fictional example: e-mail accounts management

> show databases;
a b c d e f g h i j k l m n o p q r s t u w x z
> use a
> show tables;
...
alberto alice alma ... (and a lot more)

There is no way to query anything in common for all users without application-side processing. In this particular case the sharding was uncalled for, as relational databases have all the tools to deal with this particular case of 'different clients and data'.

Page 17: Architectural anti-patterns for data handling

Embedded Lists

Problem: As data complexity grows, one thinks that it's proper for the application to handle different data structures embedded in a single cell or row. The popular 'Let's use commas to separate it'. Another name could be Stored CSV.

> select group_field from that_email_sharded_database.user
"[email protected], [email protected],[email protected]"

This somewhat complements 'Your scheme fits in an A3 sheet'. This kind of optimization, while harmless at first sight, spreads without control. Querying such fields yields unwanted results.

You could restrict yourself to languages where string splitting is not expensive, or remember to compile RegExps before querying a large data volume. Either learn to model your data, or resort to K/V stores such as Redis.
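A short illustration of why querying a comma-separated field goes wrong. The addresses below are made-up `example.org` placeholders, and `naive_contains` only approximates what a substring match (such as a LIKE query) would do.

```python
group_field = "ann@example.org, joann@example.org,bob@example.org"


def naive_contains(field, address):
    """Substring search, roughly what a LIKE '%...%' query would do."""
    return address in field


def parse_members(field):
    """Split once and normalize whitespace, giving a real set of members."""
    return {part.strip() for part in field.split(",") if part.strip()}


members = parse_members(group_field)

# A field that does NOT list ann, only joann and bob.
other_field = "joann@example.org,bob@example.org"
false_positive = naive_contains(other_field, "ann@example.org")
```

The substring search reports a match in `other_field` because `joann@example.org` ends with `ann@example.org`, while membership in the parsed set does not. A store with a native set type avoids the parsing step altogether.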

Page 18: Architectural anti-patterns for data handling

Distributed Global Locking

Problem: Someone learns Java and synchronized. A bit later, the genius thinks that a distributed synchronized would be awesome. The proper place to do that would be the database, of course. It starts with a reference counter in a table and ends up with this:

> select COALESCE(GET_LOCK('my_lock',0 ),0 )

Plain and simple, you might find it embedded in a magic class called DistributedSynchronize or ClusterSemaphore. Locks, transactions and reference counters (which may act as soft locks) don't belong in the database. While their use is questionable even in code, the fact of the matter is that if you are doing it like that, you are doing it wrong.

Page 19: Architectural anti-patterns for data handling

No silver bullet

- Think about data handling and your system architecture

- Think outside the norm

- De-normalize

- Simplify

- Know stuff (Message queues, NoSQL, DHT)

Page 20: Architectural anti-patterns for data handling

Cycle of changes - Product A

1. There was the database model
2. Then, the cache was needed. Performance was no good.
3. Cache key: query, value: resultset
4. High or nonexistent expiration time [w00t]

(Now there's a turning point. Data didn't need to change often. Denormalization was a given with cache)

5. The cache needs to be warmed or the app won't work.
6. Key/Value storage was a natural choice. No data on MySQL anymore.
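The "cache key: query, value: resultset" step can be sketched as follows. `run_query` is a hypothetical stand-in for the real database call and returns a canned resultset; hashing the query text with SHA-1 is just one possible way to derive a key.

```python
import hashlib

cache = {}     # stand-in for the key/value cache
db_calls = []  # records every query that actually reached the database


def run_query(sql):
    """Hypothetical database call; returns a canned resultset."""
    db_calls.append(sql)
    return [("cat food", 3)]


def cached_query(sql):
    key = hashlib.sha1(sql.encode("utf-8")).hexdigest()  # cache key: the query
    if key not in cache:
        cache[key] = run_query(sql)                      # value: the resultset
    return cache[key]


first = cached_query("SELECT item, qty FROM stock")
second = cached_query("SELECT item, qty FROM stock")
```

With no expiration the second call never reaches the database, which is how the cache quietly turns into the primary data store described in the last step.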

Page 21: Architectural anti-patterns for data handling

Cycle of changes - Product B

1. Postgres DB storing crawler results.
2. There was a counter in each row, and updating this counter caused contention errors.
3. Memcache for reads. Performance is better.
4. First MongoDB test, no more deadlocks from counter update.
5. Data model was simplified, the entire crawled doc was stored.

Page 22: Architectural anti-patterns for data handling

Stuff to think about

Consider whether the data you use is already de-normalized somewhere (cached).

Most of the anti-patterns signal that there are architectural issues, not only database issues.

The NoSQL route (or at least a partial NoSQL route) may simplify it.

Are you dependent on cache ? Does your application fail when there is no cache ? Does it just slow down ?

Think about the way to put and to get back your data from the database (be it SQL or NoSQL).

Page 23: Architectural anti-patterns for data handling

Thanks

http://scienceblogs.com/evolgen/upload/2007/04/rube_back_scratch.gif
http://en.wikipedia.org/wiki/Rube_Goldberg_machine