
ObjectiveView ovmag.com

for professional software developers and managers

Victoria Horkan – The Landing Limited Edition Print Available

IN THIS ISSUE - 52 PAGES:

Plus:

Interview with Grady Booch

Scott Ambler surveys TDD in the field

David Heinemeier Hansson: TDD is dead!

Clean Coding with Uncle Bob

NoSQL:

Pramod Sadalage overviews NoSQL DBs

Jan Lehnardt: CouchDB - The Definitive Intro

Ian Robinson on Neo4j's Graph Database

Also: Agile India - A Conference Report


See ObjectiveView on

www.infoq.com/objective-view

Join ObjectiveView at QCon SF 2014

qconsf.com

CONTACTS

Editor

Mark Collins-Cope

[email protected] http://markcollinscope.info

Editorial Advisors

Scott Ambler

[email protected] http://ScottAmbler.com

Kevlin Henney

[email protected] http://curbralan.com/

Subscribe for notification: see www.ovmag.com

CONTENTS

Interview with Grady Booch 3
TDD is Dead 10
No TDD is not dead 11
The Life and Times of TDD 12
Let Me Graph That For You 14
Agile India 2014 - A Review 24
NoSQL - A Complete Introduction 26
CouchDB: The Definitive Overview 34
Clean Coding with Uncle Bob 49

Featured Artist

Victoria Horkan’s work offers a bold, vibrant and expressive milieu of forms and colours that falls somewhere between the realms of impressionism, abstraction and expressionism. Her central focus is on colour, gesture and mark making. Her strong, confidently placed marks are the sign of an assured and mature artist, and the manner in which they are applied creates a sense of movement, giving the work an energetic, flickering quality. The Landing (cover image) is a great example of this. She has already exhibited in London, Leeds, Belfast and Edinburgh and with clients in America, Italy, Dubai and Abu Dhabi, and we believe Victoria Horkan is…

Interview with Grady Booch

Grady Booch: Creator of the Unified Modelling Language (UML), chief scientist of the former Rational Software Corp., founding member of the Agile Alliance and the Hillside Group, and chief scientist – software engineering at IBM Research, talks to Mark Collins-Cope about UML, Agile, XP and a bunch of other things…

Mark: Hi Grady, thanks for agreeing to do this interview.

Grady: Mark, my pleasure. Thank you for the opportunity.

UML and UP

Mark: In the mid ’90s modeling software using notations was big news, and we were all arguing about which notation to use (Booch, Jacobson, OMT, Coad, Firesmith, Selic, HOOD, RDD, Jackson, there were easily two dozen or more!). Then the ‘three amigos’ - as yourself, Rumbaugh and Jacobson affectionately became known - got together and created the UML - the Unified Modelling Language (see also: http://students.cs.byu.edu/~pbiggs/survey.html). This was good for everyone as it meant we could all communicate using a single notation. Then UP came out, iterative and incremental (mini-waterfalls) became the order of the day, and then XP hit the scene and the beginnings of “full agile” began to appear. Modelling using UML then seemed to lose the focus it had had previously. Modelling in general - to some degree - but mostly ‘big design up front’ (BDUF) in particular seemed - on some discussion groups at least - to become the work of the devil. Why do you think that was?

Grady: The mid ’90s were a vibrant time...but there was a deeper reason for this, beyond the methodology wars. You must remember that in the ’80s, the programming language landscape was fundamentally different than today. The industry was making the transition from algorithmically-oriented languages such as Fortran, Cobol, and C to object-oriented ones, such as Smalltalk, Ada, and C++. The problem therein was two-fold. First, we were building systems of exponentially greater complexity than before, and second, we didn’t have the proper methodological abstractions to attend to these new ways of programming (namely, objects). As such, there was rabid innovation in the art and practice of software engineering as we moved from structured methods to object-oriented ones. Today, objects are part of the atmosphere, and so we don’t think about it, but back then, the very ideas were controversial. So, we need to separate methodology from process, for the two are not the same. On the one hand, there was a general recognition that we needed better ways to reason about our systems, and that led to this era of visual modeling languages. On the other hand, it was clear that traditional waterfall methods of the 60s and 70s were simply not right for contemporary software development, which led us to the predecessors of agile methods. Waterfall (from Wyn Royce, although even Wyn recognized the need for incrementality) begat the spiral model (from Boehm), which begat incremental and iterative approaches, which were always a part of the OOAD processes we at Rational developed. We chose to separate the UML as notation from RUP as process. The former we standardized through the OMG, the latter we made mostly open source.

Mark: It was an unusual period in time. Transitional.

Grady: All times are transitional!

Mark: :)

Mark: Do you think UP was a transitional approach - in the sense it broke the path for later methods by carving out the iterative & incremental nature of software - but allowing the mini-waterfalls for perhaps ‘comfort’ reasons rather than fully technical ones?

Grady: I concur that the UP was transitional, but the notion of incremental and iterative as comforting was not the reason. I am heavily influenced by Herbert Simon’s work, especially as described in The Sciences of the Artificial, in which he observes that all complex systems grow through a series of stable intermediate states (John Gall in Systemantics says very much the same things). So, the notion of these stable points is quite legitimate. Even agile methods have them...these are manifest in each build. Process-wise, this is what mini-waterfalls provide. Also, Parnas’s A Rational Design Process: How and Why to Fake It applies here as well.

Mark: Coming back to UML - there was certainly a degree to which it ceased to be flavour of the day. To what degree do you think that change of emphasis was justified or not justified? Was it perhaps that because UML was new, it became too much of a focus?

Grady: Sic transit gloria mundi: there is a time and place for all things, and when those things prove intrinsically valuable, they become part of the atmosphere and they morph to meet the needs of the present time. So it is with the UML. The notation retains modest use, but the underlying concepts live on. The penetration of the UML probably never exceeded 10-20% of the industry, although some domains - such as in the hard real time world - found use much higher. So, honestly, I’m pleased with what the UML achieved to the degree that it did, because it helped transform objects from something strange into something in the interstitial spaces of the software world. That being said, I think that the UML eventually suffered from the standard growing to be overly complex. The MDD movement turned the UML into more of a programming language. While I celebrate organizations who were quite successful in that use - such as Siemens, who has used the UML deeply in its telecom products - our intended use case for the UML was more modest: to be a language for reasoning about systems. In practice, one should throw away most UML diagrams; in practice, the architecture of even the most complex software-intensive system can be codified in a few dozen UML diagrams, thus capturing the essential design decisions. Much more makes the UML a programming language, for which I certainly never intended it. So, to continue, in many ways XP and its successors were an outgrowth of the changing nature of software development due to the Web. We began to see a dichotomy arise: lots of simple code being written on the edge of the Internet, and smaller volume/greater complexity below the surface. XP flourished out of the dynamics of building these things at the edge, where experimentation was key, there was no legacy of any material amount, and for which we had domains wherein there was no obvious dominant design, and thus required rapid build and scrap and rework...all good things!

Mark: We’ll come back to XP and Agile a little later, but sticking for the moment to UML - was there anything in the notation - as per release 1.0 - that with hindsight you regretted or thought could have been done in a better way?

Grady: Two general things come to mind. First, we never got the notation for collaborations right. I was trying to find the Right Way to describe patterns, and collaborations were the attempt. Second, component and deployment diagrams needed more maturing. Kruchten’s 4+1 view model was, in my opinion, one of the great ideas of software engineering, and being a systems engineer, not just a software engineer, I designed the UML to reflect those views. However, my systems-orientation was not well accepted by others. Oh, there’s a third one - typically programmer error on my part, off by one! The UML metamodel became grossly bloated, because of the drive to model driven development. I think that complexity was an unnecessary mistake.

Mark: How would you like to have seen collaboration diagrams?

Grady: I think we needed something akin to what National Instruments did in LabVIEW for subsystems, but with a bit of special sauce to express cross-cutting concerns. I had hoped the aspect-oriented programming community could have contributed to advances here, but they seemed to have gotten lost in the weeds and forgot the forest.

“I believe in moderation in all things, even in the edict of writing all tests first. I also believe in the moderation of moderation”

Mark: What would you consider to be the most important core techniques of UML, and why? And what modelling techniques have you seen most used in your experience with the wider industry?

Grady: Two things. First, the very notion of objects and classes as meaningful abstractions was a core concept that the presence of the UML helped in transforming the way people approached design. Second, the presence of the UML, I think, helped lubricate the demand pull for design patterns. I was always a great fan of the design patterns Gang of Four, and I hope that the UML and the work I did in this space contributed in some small manner to making their work more visible.

Mark: So in a sense UML was key to introducing the object-oriented mindset into industry. Perhaps without it we wouldn’t have the mainstream adoption of OO as we do today?

Grady: The UML - and all that surrounded it - was simply a part of the journey.

Mark: What notations would you recommend be used on agile projects today?

Grady: Oh, I rather still like the UML :-) Seriously, you need about 20% of the UML to do 80% of the kind of design you might want to do in a project - agile or not - but keep in mind that this is in light of my recommendation for using the UML with a very light touch: use the notation to reason about a system, to communicate your intent to others...and then throw away most of your diagrams.

Mark: Perhaps approaching modeling using something like Scott Ambler’s “Agile Modelling”?

Grady: Scott’s work is good, as is of course Martin Fowler’s. I’d also add Ruth Malan’s writings to the mix.

Mark: UML was eventually handed over to the Object Management Group (OMG), who later released version 2.0. Was this an improvement over version 1.0?

Grady: I do celebrate the stewardship the OMG gave to the UML. In an era when open source was just emerging, handing over the UML standard to another body, to put it into the wild, was absolutely the right thing. Having a proprietary language serves no one well, and by making the UML a part of the open community, it had the opportunity to flourish.

Mark: What, in your opinion, does modelling give you that simply sitting down and writing code doesn’t?

Grady: As I often say, the code is the truth, but it is not the whole truth. There are design decisions and design patterns that transcend individual lines of code, and at those points, the UML adds value. The UML was never intended to replace textual languages... it was meant to complement them. Consider the example diagrams above, coming from the one million plus SLOC code base of Watson. You could find all these things in the code, but having them in a diagram offers a fresh and simple expression of cross-cutting concerns and essential design decisions.

Mark: So models in UML can assist in conveying a higher level of thinking about the intent in the code?

Grady: Absolutely...and if a visualization such as the UML doesn’t, then we have failed. As I have often said, the history of software engineering is one of rising levels of abstraction (and the UML was a step in that direction).

The Unified Process (UP)

Mark: Was UP the first major software development process that embraced iterative and incremental development, over the older-style ‘waterfall’ model?

Grady: Actually, if you read Wyn Royce’s waterfall paper, or Parnas’ classic paper A Rational Design Process: How and Why To Fake It, you’ll realize that the seeds for iterative and incremental processes were already there. Additionally, Boehm’s Spiral Model (and Simon’s intermediate stable states) were all in the atmosphere. We just brought them together in the UP.

Mark: What were the motivations behind that at the time?

Grady: The UP reflected our experience at Rational Software - and the experience of our customers - who were building ultra-large systems. We were simply documenting best practices that worked, and that had sound theoretical foundations.

Mark: To clarify for readers: what is the difference between being iterative and incremental?

Grady: Consider washing an elephant. Iterative means you lather, rinse, then repeat; incremental means you don’t do the whole elephant at once, but rather you attack it one part at a time. All good projects observe a regular iterative heartbeat. What bit you choose is a matter of a) attacking risk, b) reducing unknowns, and c) delivering something executable. Honestly, everything else is just details. The dominant methodological problems that follow are generally not technical in nature, but rather social, and part of the organizational architecture and dynamics.

Mark: UP was pretty large - especially if you looked to follow it with any degree of rigour. In retrospect is that something that you regret? Or do you think perhaps people got the wrong end of the stick about how UP should be used in practice?

Grady: Here’s how I would say it (and still do). The fundamentals of good software engineering can be summarized in one sentence: grow the significant design decisions of your system through the incremental and iterative release of testable executables. Honestly, everything beyond that is details or elaboration. Note that there are really three parts here: the most important artifact is executable code; you do it incrementally and iteratively with these stable intermediate forms; you grow the system’s architecture. The UP in its exquisite detail had a role...remember that this was a transitional time in which objects were novel.

Mark: All times are transitional :).

Grady: Even this interview! :-)

Mark: Touché :)

Mark: The four major phases of UP are inception, elaboration, construction and transition. To what degree do you think these phases are relevant in today’s ‘agile’ world?

Grady: No matter what you name something, these are indeed phases that exist in the cycles of every software-intensive system. One must have a spark of an idea, one must build it, one must grow it, and then eventually you must release it into the wild.

Mark: Which aspects of UP do you think agile projects could benefit from in particular?

Grady: Two things come to mind. First, it’s a reminder of the one-sentence methodology I explained earlier - there is a simplicity that underlies this all; second, it’s a reminder of the importance of views and design patterns in the making of any complex system. By views, I mean the concept that one cannot fully understand a system from just one point of view; rather, each set of stakeholders has a particular set of concerns that must be considered. For example, the view of a data analyst is quite different from the view of a network engineer...and yet, in many complex systems, each has valid concerns that must be reconciled. Indeed, every engineering process is an activity of reconciling the forces on a system, and by examining these forces from different points of view, it is possible to build a system with a reasonable separation of concerns among the needs of these stakeholder groups. For more detail, go look at Philippe Kruchten’s classic paper “The 4+1 View of Architecture”.

XP

Mark: XP was the first ‘agile’ approach to software development to gain a really big following. Why do you think that was?

Grady: XP was the right method at the right time, led by charismatic - and very effective - developers. This is as it should be: as I often say, if something works, it is useful. XP worked, and was useful.

Mark: What do you think of the practices of XP?

Grady: I think that the dogma of pair programming was overrated. TDD was - and is - still key. The direct involvement of a customer is a great idea in principle but often impractical. Doing the simplest thing possible is absolutely correct, but needs to be tempered with the reality of balancing risk. Finally, the notion of continuous development is absolutely the right thing.

Mark: On the subject of TDD - do you think comprehensive unit testing is a good thing - and is it necessary to write all the tests before the code? Or to put it another way, are you sometimes tempted to write the tests afterwards?

Grady: I believe in moderation in all things, even in the edict of writing ALL tests first. I also believe in the moderation of moderation :-)

Mark: Before XP, refactoring existing code during later iterations seemed to be completely ignored as a major activity. Or was it? Do you think XP has made a major contribution here?

Grady: XP gave a name and a legitimacy to the notion of refactoring. In that regard, XP has made a major contribution. Still, one must use refactoring in moderation. At the extreme, refactoring can become a major contributor to scrap and rework, especially if you choose a process that encourages considerable technical debt.

Mark: Do you think there is a balance to be struck between up-front design and refactoring?

Grady: Well, of course, and it all goes back to risk. Remember also that in many domains, the key developers already intrinsically know the major design decisions, and so can proceed apace. It is when those decisions are not known, when there is high risk, that you must rework the balance. It’s ok to refactor a bit of JavaScript; it is not ok to refactor a large subsystem on which human lives depend.

Mark: Continuing that theme, are there some design decisions that are more important than others? Decisions that need to be tied down early in the project lifecycle? If so, why?

Grady: Again, it all goes to risk. What are those design decisions that, if left unattended to, will introduce risk of failure or risk of cost of change? This is why I suggest that decisions should be attended to as a matter of reducing risk and uncertainty. In all other cases, where the risk and cost are low, then you proceed with the simplest thing possible. Often, you must also remember, you may not even know the questions you need to ask about a system until you have built something.

For example, suppose I’m building a limited-memory, embedded system. I might do the simplest thing first, but if that simple thing ignores the reality of constrained memory resources, I may be screwed in operation. Similarly, suppose I do the simplest thing first, just to get functionality right, but then realize I must move to Internet-scale interactions. If I don’t attend to that sooner rather than later, then I am equally screwed.

Mark: In what way do you think some techniques of UML might help with the last couple of points?

Grady: The UML should be used to reason about alternatives. Put up some diagrams. Throw some use cases against it. Throw away those diagrams, then write some code against your best decision. Repeat (and refactor).

Mark: Do you have a preference for iteration size - and do you think it is necessary to differentiate between different types of releases when talking about iterations?

Grady: It depends entirely on the domain, the risk profile, and the development culture. Remember, I have been graced with the opportunity to work on software-intensive systems of a staggeringly broad spectrum, from common Web-centric gorp to hard real time embedded stuff. So, it really depends. In some cases, a daily release is right; in others, it’s every few weeks. I’d say the sweet spot is to have stable builds every day for every programmer, with external releases every week or two.

Mark: I was surprised to see BPML as a separate notation to UML. Was this really necessary?

Grady: I think it was a sad and foolish mistake to separate BPML from the UML. There really is no need for Yet Another Notation. I think the failure came about because we failed to find a common vocabulary with the business individuals who drove BPML. In many ways, there really is a cultural divide between business modeling and systems modeling...but there really shouldn’t be.

Mark: Do you think UML would have been more popular if it had had a Japanese name :-)

Grady: I would have preferred Klingon: Hol ghantoH Unified. Or even a Borg designation (for resistance would then be futile).

Agile

Mark: After XP came the ‘Agile Alliance’ - a very talented group of independents who seemed to push the gains XP had made even further. Was this a good thing?

Grady: Absolutely. I was a founding member of the Agile Alliance (and would have signed the Snowbird document, but I was working with a customer that week). Don’t forget also the Hillside Group, which at the same time was promoting the use of design patterns.

Mark: One of the big things with the Agile Alliance was a move to make software development more of a collaborative thing than a contractual thing - is that something you agree with, and in which circumstances?

Grady: Development is a team sport, so of course I support this. BTW, I think we must go even further: development as a social activity, with attendant issues of ethics, morality, and its impact on the human experience. This, by the way, is exactly what we are trying to explore in Computing: The Human Experience, a multi-part documentary we are developing for public television - see computingthehumanexperience.com

Mark: How would you deal with a customer who was insistent they wanted to have the full cost of a system defined upfront?

Grady: I would either bid an outrageously high cost or walk away. Most likely, I would walk away. I don’t like working with organizations who are clueless as to the realities of systems development, for I find that I spend most of my time educating them.

Mark: Scrum seems to have come out as something of a winner in the agile approach stakes. It’s interesting that Scrum itself doesn’t really refer to any detailed software development steps - but seems to focus more on the product management and team-working aspects of development. What do you think of Scrum?

Grady: As a former rugby player, I like scrums. Software development is a team sport, and scrums attend to the social dynamics of teams in a (generally) positive way. That being said, there’s a danger that teams get caught up in the emotional meaning of the word and don’t really do what it entails. There’s a lot of ceremony about what makes for a good scrum (and lots of consultants who will mentor a project), but the essence of the concept is quite simple. Wrapping it up in lots of clique-like terminology - scrum master, sprints, and so on - makes it seem more complex than it is.

The Cloud and SaaS/PaaS

Mark: Software as a Service is an emerging, or perhaps emerged, model. Are you involved with that in any way?

Grady: This is something I talked about over a decade ago, as I projected out the trajectory of software-intensive systems. We first saw systems built on bare hardware, then we saw the rise of operating systems, then the rise of the Internet as a platform...and now we are seeing the rise of domain-specific platforms such as Amazon, AUTOSAR (for in-car electronics), Salesforce, Fawkes (for robotic systems), and many others. In effect, this is a natural consequence of Utterback and Abernathy’s idea of dominant design: as software-intensive systems become economically interesting, there will arise a dominant platform around which ecosystems emerge.

Mark: Using the SaaS model, the browser - in most cases - becomes the vehicle of the UI. We seem to have had quite a lot of shifts in our run-time environments over the years…

Grady: It’s been interesting to see the shift of platforms: from systems built on bare metal, to those built on top of operating systems, to those on top of the Web, to those on top of domain-specific platforms. The browser was the gateway to systems on the Web, but even that is changing, as we move to mobile devices and now the Internet of Things. Apps are the gateway to mobile systems, and what the IoT will bring is up for grabs.

Software Development - Hype versus Reality?

Mark: I’ve heard people criticise software development as being more like a fashion industry than a serious engineering discipline. For example we’ve had: Structured Programming; 4GLs; SOAs; CBD; RAD; Agile; Aspect oriented analysis and design; etc. That doesn’t mean they don’t add value, but some seem to come and go quickly...

Grady: If you looked inside other engineering domains, you’d see a similar history of ideas, so we are not unique. Indeed, I’d be disappointed if we had everything figured out, because it would mean that we are not pushing the limits of building real things.

Programming Languages

Mark: Dynamic or scripting languages have gained immensely in popularity over the last ten to fifteen years - it seems they have grown from being ‘simple’ ways to add some functionality - client or server side - to HTML pages, and are now being used for ‘full blown’ applications. Are they fit for purpose in that context?

Grady: This attends to what I alluded to earlier, the notion that a lot of new software is being written at the edges of the Web. In such circumstances, you really do need a language that lets you weave together loosely-coupled components in a rapid fashion. Scripting languages fit this need perfectly. Personally, I use PHP and JavaScript the most in that world.

Mark: Functional languages or functional programming seem also to be an area of growing interest. Do you think that offers any major benefits? Or is it perhaps just another ‘fad’?

Grady: I had the opportunity to interview John Backus, just about a month before his death. We spoke of functional languages - he was responsible for a lot of what’s gone on in FP - and something he said has stuck with me: the reason that FP failed in his time was that it was easy to do hard things but almost impossible to do easy things. I don’t think that circumstances have changed much since John’s time.

Drivers of Major Change and Open Source

Mark: It seems that the real driver of major change to software development isn’t actually software at all, but is hardware/infrastructure driven: increased processor power, increased network speed, the advent of broadband, etc. Software development, on the other hand, has changed rather slowly in comparison - all the current major paradigms (OO, functional, procedural) have been around for over - what - forty years now. Would you agree, and if so is there an underlying reason for this, do you think?

Grady: I disagree. The real driver of major change has been the reality that software-intensive systems have woven themselves into the interstitial spaces of our civilization, and ergo there is a tremendous demand pull for such systems. In many ways, hardware development is reaching a plateau - it is increasingly a commodity business. We will see more breakthroughs, but consider that all the major hardware paradigms have been around for 60 or more years.

Mark: One final question - perhaps on a lighter note - do you have any predictions for how the world of technology - as it relates to software development - will appear in, say, 20 years’ time (we won’t hold you to them :).

Grady: Yes (note that I answered your question precisely) :-)

Mark: Grady Booch, thank you very much for your time on this interview - it’s been a pleasure talking to you.

Follow him: @grady_booch See also computingthehumanexperience.com

Grady is an IBM Fellow, an ACM Fellow, an IEEE Fellow, a World Technology Network Fellow, a Software Development Forum Visionary, and a recipient of Dr. Dobb's Excellence in Programming award plus three Jolt Awards. Grady was a founding board member of the Agile Alliance, the Hillside Group, and the Worldwide Institute of Software Architects, and now also serves on the advisory board of the International Association of Software Architects. He is also a member of the IEEE Software editorial board. Additionally, Grady serves on the board of the Computer History Museum, where he helped establish work for the preservation of classic software and therein has conducted several oral histories for luminaries such as John Backus, Fred Brooks, and Linus Torvalds. He previously served on the board of the Iliff School of Theology.

Opinion: TDD is Dead – Long live Testing

David Heinemeier Hansson - renowned creator of Ruby on Rails - explains why he has moved away from a fundamentalist approach to test-first development.

Test-first fundamentalism is like abstinence-only sex ed: An unrealistic, ineffective morality campaign for self-loathing and shaming.

It didn't start out like that. When I first discovered TDD, it was like a courteous invitation to a better world of writing software. A mind hack to get you going with the practice of testing where no testing had happened before. It opened my eyes to the tranquility of a well-tested code base, and the bliss of confidence it grants those making changes to software. The test-first part was a wonderful set of training wheels that taught me how to think about testing at a deeper level, but I also left some of those thoughts behind fairly quickly.

Over the years, the test-first rhetoric got louder and angrier, though. More mean-spirited. And at times I got sucked into that fundamentalist vortex, feeling bad about not following the true gospel. Then I'd try test-first for a few weeks, only to drop it again when it started hurting my designs. It was a yo-yo cycle of pride, when I was able to adhere to the literal letter of the teachings, and a crash of despair, when I wasn't. It felt like falling off the wagon. Something to keep quiet about. Certainly not something to admit in public. In public, I at best just alluded to not doing test-first all the time, and at worst continued to support the practice as "the right way". I regret that now.

Maybe it was necessary to use test-first as the counterintuitive ram for breaking down the industry's sorry lack of automated regression testing. Maybe it was a parable that just wasn't intended to be a literal description of the day-to-day workings of software writing. But whatever it started out as, it was soon since corrupted. Used as a hammer to beat down the non-believers, declare them unprofessional and unfit for writing software. A litmus test.

Enough. No more. My name is David, and I do not write software test-first. I refuse to apologize for that any more, much less hide it. I'm grateful for what TDD did to open my eyes to automated regression testing, but I've long since moved on from the design dogma.

I suggest you take a hard look at what that approach is doing to the integrity of your system design as well. If you're willing to honestly consider the possibility that it's not an unqualified good, it'll be like taking the red pill. You may not like what you see after that.

So where do we go from here? Step one is admitting there's a problem. I think we've taken that now. Step two is to rebalance the testing spectrum from unit to system. The current fanatical TDD experience leads to a primary focus on the unit tests, because those are the tests capable of driving the code design (the original justification for test-first).

I don't think that's healthy. Test-first units lead to an overly complex web of intermediary objects and indirection in order to avoid doing anything that's "slow". Like hitting the database. Or file IO. Or going through the browser to test the whole system. It's given birth to some truly horrendous monstrosities of architecture. A dense jungle of service objects, command patterns, and worse.

I rarely unit test in the traditional sense of the word, where all dependencies are mocked out, and thousands of tests can close in seconds. It just hasn't been a useful way of dealing with the testing of Rails applications. I test active record models directly, letting them hit the database, and through the use of fixtures. Then layered on top is currently a set of controller tests, but I'd much rather replace those with even higher level system tests through Capybara or similar.

I think that's the direction we're heading. Less emphasis on unit tests, because we're no longer doing test-first as a design practice, and more emphasis on, yes, slow, system tests. (Which - by the way - do not need to be so slow any more, thanks to advances in parallelization and cloud runner infrastructure).

Rails can help with this transition. Today we do nothing to encourage full system tests. There's no default answer in the stack. That's a mistake we're going to fix. But you don't have to wait until that's happening. Give Capybara a spin today, and you'll have a good idea of where we're heading tomorrow.

But first of all take a deep breath. We're herding some sacred cows to the slaughter right now. That's painful and bloody. TDD has been so successful that it's interwoven in a lot of programmer identities. TDD is not just what they do, it's who they are. We have some serious deprogramming ahead of us as a community to get out from under that, and it's going to take some time. The worst thing we can do is just rush into another testing religion. I can just imagine the golden calf of "system tests only!" right now. Please don't go there.

Yes, test-first is dead to me. But rather than dance on its grave, I'd rather honor its contributions than linger on the travesties. It marked an important phase in our history, yet it's time to move on.

Long live testing.

David Heinemeier Hansson is the creator of Ruby on Rails, founder & CTO at Basecamp (formerly 37signals), best-selling author, Le Mans class-winning racing driver, public speaker, hobbyist photographer, and family man.

Follow him: @dhh

Opinion: No It’s Not (dead) - Robert Martin responds to DHH’s post

Robert Martin, TDD expert and a founder of the Agile Alliance, states his point of view.

When an article begins like this...

"Test-first fundamentalism is like abstinence-only sex ed: An unrealistic, ineffective morality campaign for self-loathing and shaming."

...you have to wonder if the rest of the article can recover its credibility, or whether it will continue as an unreasoned rant. Of course I understand what the author was trying to say. There is a stridence in the preaching of TDD that makes him uncomfortable. I have used that stridence myself; and I believe the stridence is called for. The reason is simple. As an industry, we suck. If you aren't doing TDD, or something as effective as TDD, then you should feel bad.

Why do we do TDD? We do TDD for one overriding reason and several less important reasons. The less important reasons are:

‒ We spend less time debugging.
‒ The tests act as accurate, precise, and unambiguous documentation at the lowest level of the system.
‒ Writing tests first requires decoupling that other testing strategies do not; and we believe that such decoupling is beneficial.

Those are ancillary benefits of TDD; and they are debatable. There is, however, one benefit that, given certain conditions are met, cannot be debated: If you have a test suite that you trust so much that you are willing to deploy the system based solely on those tests passing; and if that test suite can be executed in seconds, or minutes, then you can quickly and easily clean the code without fear. Now there are two predicates in that statement, and they are big predicates. But, given those predicates are met, then developers can quickly and easily clean the code without fear of breaking anything. And that is power. Because if you can clean the code, you can keep the development team from bogging down into the typical Big Ball of Mud. You can keep the team moving fast. Indeed, the benefit of keeping the code clean, and keeping the team moving fast, is so great, that those two predicates begin to pale in comparison. Yes! If I can keep the team moving fast, then I will find a way to trust my test suite, and I will keep those tests running fast.

Anyway, that's where the stridence comes from. Those of us who have experienced a fast and trustworthy test suite, and have thereby kept a large code base clean enough to keep development going fast, are very enthusiastic. So enthusiastic, in fact, that we exhibit a stridence that the author has unfortunately, and inaccurately, dubbed as "fundamentalism"; claiming it to be ineffective and unrealistic. What does the author suggest as an alternative? As someone who writes systems in Rails he suggests integration tests that use the database and operate through the GUI (using Capybara). My response to this is: If you can meet my two predicates of trustworthiness and speed, go for it! If you trust those integration tests so much that you are willing to deploy when they pass; and if they execute so quickly that you can continuously and effectively refactor and clean the code, then you aren't doing any better than me. Do it. But (and this is a big "but"), it seems to me that integration tests have very little chance of meeting my two predicates. First I doubt they can attain the necessary trustworthiness because they operate through the GUI; and you can't reach all the code from the GUI. There's lots of code in a normal system that deals with exceptions, errors, and odd corner cases that cannot be reached through the normal user interface. Indeed, I reckon you can only cover a bit more than half the code that way. It seems unlikely to me that anyone would be willing to deploy a system based on tests that leave such a large fraction of the code uncovered. Second, it seems very unlikely to me, despite the ability to spin up hundreds of servers in the cloud, that you can get those tests executed in anything like the speed you'd need to effectively and continuously refactor the code. Databases and GUIs are slow. Now, I could be wrong. I'd love to be proven wrong. And perhaps, despite the author's poor reasoning at the start of his blog, he really can trust his tests enough, and execute them quickly enough, to make them effective for keeping the code clean. If so, then I'll holler: "Amen, Brother, and Hallelujah!" and will become a strident convert to his particular brand of fundamentalist polygamy.

The Life and Times of TDD

Disciplined Agilist Scott Ambler discusses a recent mini-survey designed to find out how TDD is being used in practice out there.

There’s been a lot of hullabaloo lately about the state of test-driven development (TDD). This was the result of a blog posting and conference presentation by David Heinemeier Hansson entitled TDD is dead. Long live testing. TDD is of course alive and well, albeit not as common as its protagonists would like. This is because TDD requires both skill and discipline on the part of practitioners, and as you will soon see TDD is only one of many practices that disciplined agile practitioners follow.

Let’s start by defining a few terms. TDD is an approach that combines test-first development (TFD) and refactoring. With TFD you write a single test and then just enough production code to fulfill that test. Refactoring is a strategy where you improve the quality of something (source code, your database schema, your user interface) by making small changes to it that do not change its semantics - in other words refactoring is a clean-up activity that makes something better but does not add or subtract functionality. Because you write a test before you write the code, the test in effect does double duty in that it both specifies and validates that piece of code. A test-driven approach can be applied at both the requirements level and at the design level, more on this later.

In May of 2014 I ran a mini-survey to explore how teams have adopted TDD in practice. There were 247 respondents from around the world, 40% of whom were TDD practitioners with an average of 16 years of experience in software development. I purposely gave the survey the title “Is TDD Dead?” so as to attract TDD practitioners. The goal of the survey wasn’t to determine whether people were doing TDD - with that survey title I would have hopelessly biased the survey - but instead to determine what teams were doing in addition to TDD to specify and validate their work. Not surprisingly the survey found that TDD practitioners are doing far more than just TDD to get the job done.
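
To make the test-first rhythm concrete, here is a minimal sketch of a single TFD step using JUnit 4 (the shopping-cart example and all class names are illustrative, not taken from the survey): the tests are written first and fail, then just enough production code is added to make them pass, and any clean-up happens as a separate refactoring step with the tests kept green.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

// Written first: these tests specify the behaviour before the code exists.
public class ShoppingCartTest {

    @Test
    public void totalOfEmptyCartIsZero() {
        assertEquals(0, new ShoppingCart().totalInCents());
    }

    @Test
    public void totalSumsLineItemPrices() {
        ShoppingCart cart = new ShoppingCart();
        cart.add("book", 1999);
        cart.add("pen", 250);
        assertEquals(2249, cart.totalInCents());
    }
}

// Written second: just enough production code to make the tests above pass.
class ShoppingCart {
    private int totalInCents = 0;

    void add(String item, int priceInCents) {
        totalInCents += priceInCents;
    }

    int totalInCents() {
        return totalInCents;
    }
}

Because each test is written before the code it exercises, it both specifies and validates that piece of code - exactly the double duty described above.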

TDD and Requirements

A TDD approach can be used to specify detailed requirements on a just-in-time (JIT) basis throughout construction. Instead of writing a document to capture your requirements, you instead capture them as acceptance tests which are added into your regression test suite throughout construction. This practice is often referred to as acceptance test-driven development (ATDD) or behavior-driven development (BDD). People will sometimes argue that ATDD and BDD are different, but the argument boils down to nuances that don’t seem to matter in practice - so call it whatever you want and move on. An ATDD approach is supported by a wide variety of development tools including Cucumber, FitNesse, and JBehave, to name a few (a small, tool-agnostic sketch of the idea follows the list below).

Is TDD sufficient to explore requirements? Very likely not. Yes, ATDD is an incredibly valuable technique that I highly recommend, but it isn’t sufficient in most cases. The survey found that practitioners who were following an ATDD approach were also applying other requirements-related activities:

• Sketches captured on whiteboards or paper (76% of respondents)

• Text-based requirements using a word processor or wiki (54%)

• Text-based requirements captured on paper (perhaps using index cards or sticky notes) (46%)

• Diagrams captured using modeling tools (such as Blueprint, MagicDraw, or Enterprise Architect) (18%)
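
As promised above, here is a rough, tool-agnostic sketch of the ATDD idea in plain JUnit 4 rather than Cucumber, FitNesse, or JBehave (the discount rule, class and method names are hypothetical): the requirement is captured directly as an executable acceptance test that then lives in the regression suite.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

// Hypothetical requirement: "gold customers get 10% off orders over $100",
// captured as an acceptance test instead of a requirements document.
public class GoldDiscountAcceptanceTest {

    @Test
    public void goldCustomerGetsTenPercentOffOrdersOverOneHundredDollars() {
        // Given an order of $120.00 placed by a gold customer
        PricingService pricing = new PricingService();

        // When the pricing rules are applied
        int payableInCents = pricing.priceInCents(12000, true /* gold customer */);

        // Then the customer pays $108.00
        assertEquals(10800, payableInCents);
    }
}

// Minimal production code satisfying the acceptance criterion above.
class PricingService {
    int priceInCents(int amountInCents, boolean goldCustomer) {
        if (goldCustomer && amountInCents > 10000) {
            return amountInCents - amountInCents / 10;
        }
        return amountInCents;
    }
}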

Years ago in Agile Modeling we described how modeling and TDD fit together in practice. The observation is that modeling is very good for thinking through high-level concepts but usually rather clunky for specifying details. TDD, on the other hand, is great for specifying details but inappropriate for high-level concepts. As a result, TDD practitioners need to adopt lightweight agile modeling strategies to enhance their TDD activities, something that comes out very clearly in the survey.

TDD and Design

A TDD approach can be used to specify the detailed design of your application code, database schema, or user interface (UI) in a JIT executable manner throughout construction. This is referred to as developer TDD or unit TDD and is typically done via xUnit tools such as JUnit for Java and PL/Unit for Oracle. Not surprisingly the survey found that TDD practitioners are commonly doing more than just TDD to explore their designs. People doing developer TDD were also working on teams who were applying other design-related activities, such as:

• Sketches captured on whiteboards or paper (79%)

• Diagrams captured using diagramming tools (such as Visio or PowerPoint) (40%)

• Text-based designs captured on paper (perhaps using index cards or sticky notes) (33%)

• Text-based designs using a word processor or wiki (29%)

• Diagrams captured using modeling tools (such as Enterprise Architect, Magic Draw, ...) (18%)

TDD and Testing

TDD is an aspect of confirmatory testing, the equivalent of testing against the specification (in this case the tests are the detailed specifications). Confirmatory testing is important but it isn’t sufficient as it assumes that you’ve been told what the requirements are, something we know that stakeholders are not very good at in practice. Disciplined agile teams realize that they need to also perform exploratory testing, the aim of which is to identify potential issues that the stakeholders haven’t told you about. Furthermore there are some forms of testing, such as integration testing within complex environments, which may be better performed by professional testers with experience in such activities. Sure enough, the survey found that, in addition to TDD, agile teams were adopting additional testing strategies as well:

• Testers are members of the development team (62%)

• Some testing will occur at the end of the lifecycle (41%)

• There is a team of testers working in parallel to the development team (33%)

Got Discipline?

Being successful at agile software development requires discipline to identify, adopt, and then execute the practices described in this article. This article has just touched on a few of the practices, in addition to TDD, that agile teams are adopting. There is a plethora of software development strategies that your team may choose to adopt, and these strategies can be combined in many different ways. The Disciplined Agile Delivery (DAD) process framework captures several hundred agile techniques, providing advice for when and when not to adopt them. DAD has effectively done a lot of the heavy lifting for you when it comes to process, helping you to identify ways to enhance your approach to software development - including, but not limited to, strategies around TDD.

Recommended Resources

• At Surveys Exploring the Current State of IT Practices I post the original questions, the source data, and my analysis of the results for all the surveys that I run. I am a firm believer in open research, and this is the highest level of openness possible.

• At Disciplined Agile Delivery and Disciplined Agile Consortium you can find out more about the Disciplined Agile Delivery (DAD) process framework.

• To find out more about TDD, you may find my article An Introduction to TDD to be a good starting point.

• The Agile Modeling site has a wide range of advice for effective modeling and documentation strategies, including advice for how modeling and TDD fit together in practice. The article Introduction to Agile Model Driven Development (AMDD) in particular should be a valuable read.

Scott Ambler is Senior Consulting Partner with Scott Ambler + Associates, and specializes in helping organizations to successfully adopt disciplined agile strategies. He is the author of many books and a regular contributor to many magazines and industry conferences. http://www.ambysoft.com/scottAmbler.html Follow him: @scottwambler

Let Me Graph That For You

Neo4j is an open source, schema-free graph database; an online ACID transactional system that allows you to model, store and query your data in the form of a graph or network structure. Ian Robinson takes a look at what it has to offer.

In this article we’ll look at some of the challenges in the contemporary data landscape that Neo4j is designed to solve, and the ways in which it addresses them. After taking a tour of Neo4j’s underlying graph data model, we'll look at how we can apply its data model primitives when developing our own graph database-backed applications. We’ll finish by reviewing some modelling tips and strategies.

Tackling Complex Data

Why might we consider using a graph database? In short, to tackle complexity, and generate insight and end user value from complex data. More specifically, to wrest insight from the kind of complexity that arises wherever three contemporary forces meet—where an increase in the amount of data being generated and stored is accompanied by a need both to accommodate a high degree of structural variation and to understand the multiply-faceted connectedness inherent in the domain to which the data belongs.

Increased data size—big data—is perhaps the most well understood of these three forces. The volume of net new data being created each year is growing exponentially—a trend that looks set to continue for the foreseeable future. But as the volume of data increases, and we learn more about the instances in our domain, so each instance begins to look subtly different from every other instance. In other words, as data volumes grow, we trade insight for uniformity. The more data we gather about a group of entities, the more that data is likely to be variably structured. Variably structured data is the kind of messy, real-world data that doesn't fit comfortably into a uniform, one-size-fits-all, rigid relational schema; the kind that gives rise to lots of sparse tables and null checking logic. It’s the increasing prevalence of variably structured data in today’s applications that has led many organisations to adopt schema-free alternatives to the relational model, such as key-value and document stores.

But the challenges that face us today aren’t just around having to manage increasingly large volumes of data, nor do they extend simply to us having to accommodate ever increasing degrees of structural variation in that data. The real challenge to generating significant insight is understanding connectedness. That is, to answer many of the most important questions we want to ask of our domains, we must first know which things are connected, and then, having identified these connected entities, understand in what ways, and with what strength, weight or quality, they are connected. If you've ever had to answer questions such as:

• Which friends and colleagues do we have in common?

• Which applications and services in my network will be affected if a particular network element—a router or switch, for example—fails? Do we have redundancy throughout the network for our most important customers?

• What's the quickest route between two stations on the underground?

• What do you recommend this customer should buy, view, or listen to next?

• Which products, services and subscriptions does a user have permission to access and modify?

• What's the cheapest or fastest means of delivering this parcel from A to B?

• Which parties are likely working together to defraud their bank or insurer?

• Which institutions are most at risk of poisoning the financial markets?

—then you've already encountered the need to manage and make sense of large volumes of variably-structured, densely-connected data. These are the kinds of problems for which graph databases are ideally suited. Understanding what depends on what, and how things flow; identifying and assessing risk, and analysing the impact of events on deep dependency chains: these are all connected data problems.

Today, Neo4j is being used in business-critical applications in domains as diverse as social networking, recommendations, datacenter management, logistics, entitlements and authorization, route finding, telecommunications network monitoring, fraud analysis, and many others. Its widespread adoption challenges the notion that the relational database is the best tool for working with connected data. At the same time, it proposes an alternative to the simplified, aggregate-oriented data models adopted by NOSQL. The rise of NOSQL was largely driven by a need to remedy the perceived performance and operational limitations of relational technology. But in addressing performance and scalability, NOSQL has tended to surrender the expressive and flexible modelling capabilities of its relational predecessor, particularly with regard to connected data. Graph databases, in contrast, revitalise the world of connected data, shunning the simplifications of the NOSQL models, yet outperforming relational databases by several orders of magnitude. To understand how graphs and graph databases help tackle complexity, we need first to understand Neo4j’s graph data model.
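
As a rough illustration of how one of the questions above (the route between two underground stations) is put to a graph, the sketch below uses the official Neo4j Java driver. The connection details, the Station label and the CONNECTED_TO relationship type are assumptions made for the example, not taken from the article, and the API shown is the 4.x-style driver.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class RouteFinder {
    public static void main(String[] args) {
        // Placeholder connection details for a locally running Neo4j instance.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // shortestPath returns the path with the fewest hops (not a
            // weighted "quickest" journey) between the two named stations.
            Result result = session.run(
                "MATCH (a:Station {name: $from}), (b:Station {name: $to}), "
                + "p = shortestPath((a)-[:CONNECTED_TO*..30]-(b)) "
                + "RETURN [n IN nodes(p) | n.name] AS stations",
                Values.parameters("from", "Covent Garden", "to", "Bank"));

            result.forEachRemaining(record -> System.out.println(record.get("stations")));
        }
    }
}

The point is less the specific query than the shape of the work: the traversal is expressed directly over the connections, rather than reconstructed through joins.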

The Labelled Property Graph Model

Neo4j uses a particular graph data model, called the labelled property graph model, to represent network structures. A labelled property graph consists of nodes, relationships, properties and labels. Here’s an example of a property graph:

Nodes

Nodes represent entity instances. To capture an entity's attributes, we attach key-value pairs—properties—to a node, thereby creating a record-like structure for each individual thing in our domain. Because Neo4j is a schema-free database, no two nodes need share the same set of properties: no two nodes representing persons, for example, need have the exact same attributes.

Relationships

Relationships represent the connections between entities. By connecting pairs of nodes with relationships, we introduce structure into the model. Every relationship must have a start node and an end node. Just as importantly, every relationship must have a name and a direction. A relationship's name and direction lend semantic clarity and context to the nodes attached to the relationship. This allows us—in, for example, a Twitter-like graph—to say that “Bill” (a node) “FOLLOWS” (a named and directed relationship) “Sally” (another node). Just like nodes, relationships can also contain properties. We typically use relationship properties to represent some distinguishing feature of each connection. This is particularly important when, in answering the questions we want to ask of our domain, we must not only trace the connections between things, but also take account of the strength, weight or quality of each of those connections.

Node Labels

Nodes, relationships and properties provide for tremendous flexibility. In effect, no two parts of the graph need have anything in common. Labels, in contrast, allow us to introduce an element of commonality that groups nodes together and indicates the roles they play within our domain. We do this by attaching one or more labels to each of the nodes we want to group: we can, for example, label a node as representing both a User and, more specifically, an Administrator. (Labels are optional: therefore, each node can have zero or more labels.) Node labels are similar to relationship names insofar as they lend additional semantic context


to elements in the graph, but whereas a relationship instance must perform exactly one role, because it connects precisely two nodes, a node, by virtue of the fact it can be connected to zero or more other nodes, has the potential to fulfil several different roles: hence the ability to attach zero or more labels to each node. On top of this simple grouping capability, labels also allow us to associate indexes and constraints with nodes bearing specific labels. We can, for example, require that all nodes labelled Book are indexed by their ISBN property, and then further require that each ISBN property value is unique within the context of the graph.
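Here is a hedged sketch of how this might look in Neo4j 2.x Cypher, using the Book label and isbn property from the example above (the title property in the second statement is an extra assumption, purely for illustration):

// Require isbn to be unique across all nodes labelled Book;
// the uniqueness constraint is backed by an index on :Book(isbn)
CREATE CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE

// A plain, non-unique index on another property, used for lookups only
CREATE INDEX ON :Book(title)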

Representing Complexity

This graph model is probably the best abstraction we have for modelling both variable structure and connectedness. Variable structure is provided for by virtue of connections being specified at the instance level rather than the class level. Relationships join individual nodes, not classes of nodes: in consequence, no two nodes need be connected in exactly the same way to their neighbours; no two subgraphs need be structured exactly alike. Each relationship in the graph represents a specific connection between two particular things. It's this instance-level focus on things and the connections between things that makes graphs ideal for representing and navigating a variably structured domain.

Relationships not only specify that two things are connected, they also describe the nature and quality of that connection. To the extent that complexity is a function of the ways in which the semantic, structural and qualitative aspects of the connections in a domain can vary, our data models require a means of expressing and exploiting this connectedness. Neo4j's labelled property graph model, wherein every relationship can not only be specified independently of every other, but also annotated with properties that describe how and in what degree, and with what weight, strength or quality, entities are connected, provides one of the most powerful means for managing complexity today.

And Doing It Fast

Join-intensive queries in a relational database are notoriously expensive, in large part because joins must be resolved at query time by way of an indirect index lookup. As an application’s dataset size grows, these join-inspired lookups slow down, causing performance to deteriorate. In Neo4j, in contrast, every relationship acts as a pre-computed join, every node as an index of its associated nodes. By having each element maintain direct references to its adjacent entities in this way, a graph database avoids the performance penalty imposed by index lookups—a feature sometimes known as index-free adjacency. As a result, for densely connected queries, Neo4j can be many thousands of times faster than the equivalent join-intensive query in a relational database.

Index-free adjacency provides for queries whose performance characteristics are a function of the amount of the graph they choose to explore, rather than the overall size of the dataset. In other words, query performance tends to remain reasonably constant even as the dataset grows. Consider, for example, a social network in which every person has, on average, fifty friends. Given this invariant, the cost of friend-of-a-friend queries will remain reasonably constant, irrespective of whether the network has a thousand, a million, or a billion nodes.
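To make the shape of such a query concrete, a friend-of-a-friend lookup might be sketched as follows (the Person label, FRIEND relationship and name property are assumptions for illustration only; they are not part of the skills-finder model developed below):

// Friends-of-friends of Ian, excluding Ian himself and his direct friends
MATCH (me:Person {name:'Ian'})-[:FRIEND]-(friend)-[:FRIEND]-(foaf)
WHERE foaf <> me AND NOT (me)-[:FRIEND]-(foaf)
RETURN DISTINCT foaf.name

However large the network grows, this query only ever explores the subgraph within two hops of the starting node.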

Graph Data Modelling

In this section we’ll look at how we go about designing and implementing an application’s graph data model and associated queries.

From User Story to Domain Questions

Imagine we’re building a cross-organizational skills finder: an application that allows us to find people with particular skills in a network of professional relationships. To see how we might design a data model and associated queries for this application, we’ll follow the progress of one of our agile user stories, from analysis through to implementation in the database. Here’s the story:

As an employee
I want to know which of my colleagues have similar skills to me
So that I can exchange knowledge with them or ask them for help


Given this description of an end-user goal, our first task is to identify the questions we would have to ask of our domain in order to satisfy it. Here’s the story rephrased as a question:

Which people, who work for the same company as me, have similar skills to me?

Whereas the user story describes what it is we’re trying to achieve, the questions we pose to our domain provide a clue as to how we might satisfy our users’ goals. A good application graph data model makes it easy to ask and answer such questions. Fortunately, the questions themselves contain the germ of the structure we’re looking for. Language itself is a structuring of logical relationships. At its simplest, a sentence describes a person or thing, some action performed by this person or thing, and the target or recipient of that action, together with circumstantial detail such as when, where or how this action was accomplished. By attending closely to the language we use to describe our domain and the questions we want to ask of our domain, we can readily identify a graph structure that represents this logical structuring in terms of nodes, relationships, properties and labels.

From Domain Questions to Cypher Path Expressions

The particular question we outlined earlier names some of the significant entities in our domain: people, companies and skills. Moreover, the question tells us something about how these entities are connected to one another:

• A person works for a company
• A person has several skills

These simple natural-language representations of our domain can now be transformed into Neo4j’s query language, Cypher. Cypher is a declarative, SQL-like graph pattern matching language built around the concept of path expressions: declarative structures that allow us to describe to the database the kinds of graph patterns we wish either to find or to create inside our graph.

When translating our ordinary language descriptions of the domain into Cypher path expressions, the nouns become candidate node labels, the verbs relationship names:

(:Person)-[:WORKS_FOR]->(:Company), (:Person)-[:HAS_SKILL]->(:Skill)

Cypher uses parentheses to represent nodes, and dashes and less-than and greater-than signs (<-- and -->) to represent relationships and their directions. Node labels and relationship names are prefixed with a colon; relationship names are placed inside square brackets in the middle of the relationship. In creating our Cypher expressions, we’ve tweaked some of the language. The labels we’ve chosen refer to entities in the singular. More importantly, we’ve used HAS_SKILL rather than HAS to denote the relationship that connects a person to a skill. The reason for this is that HAS is far too general a term. Right-sizing a graph’s relationship names is key to developing a good application graph model. If the same relationship name is used with different semantics in several different contexts, queries that traverse those relationships will tend to explore far more of the graph than is strictly necessary—something we are mindful to avoid.

The expressions we’ve derived from the questions we want to ask of our domain form a prototypical path for our data model. In fact, we can refactor the expressions to form a single path expression:

(:Company)<-[:WORKS_FOR]-(:Person)-[:HAS_SKILL]->(:Skill)

While there are likely many other requirements for our application, and many other data elements to be discovered as a result of analysing those requirements, for the story at hand, this path structure captures all that is needed to meet our end-users’ immediate goals. There is still some work to do to design an application that can create instances of this path structure at runtime as users add and amend their details, but insofar as this article is focussed on the design and implementation of the data model and associated queries, our next task is to implement the queries that target this structure.


A Sample Graph

To illustrate the query examples, we’ll use Cypher’s CREATE statement to build a small sample graph comprising two companies, their employees, and the skills and levels of proficiency possessed by each employee:

// Create skills-finder network
CREATE (p1:Person{username:'ben'}),
       (p2:Person{username:'charlie'}),
       (p3:Person{username:'lucy'}),
       (p4:Person{username:'ian'}),
       (p5:Person{username:'sarah'}),
       (p6:Person{username:'emily'}),
       (p7:Person{username:'gordon'}),
       (p8:Person{username:'kate'}),
       (c1:Company{name:'Acme'}),
       (c2:Company{name:'Startup'}),
       (s1:Skill{name:'Neo4j'}),
       (s2:Skill{name:'REST'}),
       (s3:Skill{name:'DotNet'}),
       (s4:Skill{name:'Ruby'}),
       (s5:Skill{name:'SQL'}),
       (s6:Skill{name:'Architecture'}),
       (s7:Skill{name:'Java'}),
       (s8:Skill{name:'Python'}),
       (s9:Skill{name:'Javascript'}),
       (s10:Skill{name:'Clojure'}),
       (p1)-[:WORKS_FOR]->(c1),
       (p2)-[:WORKS_FOR]->(c1),
       (p3)-[:WORKS_FOR]->(c1),
       (p4)-[:WORKS_FOR]->(c1),
       (p5)-[:WORKS_FOR]->(c2),
       (p6)-[:WORKS_FOR]->(c2),
       (p7)-[:WORKS_FOR]->(c2),
       (p8)-[:WORKS_FOR]->(c2),
       (p1)-[:HAS_SKILL{level:1}]->(s1),
       (p1)-[:HAS_SKILL{level:3}]->(s2),
       (p2)-[:HAS_SKILL{level:2}]->(s1),
       (p2)-[:HAS_SKILL{level:1}]->(s9),
       (p2)-[:HAS_SKILL{level:2}]->(s5),
       (p3)-[:HAS_SKILL{level:3}]->(s3),
       (p3)-[:HAS_SKILL{level:2}]->(s6),
       (p3)-[:HAS_SKILL{level:1}]->(s8),
       (p4)-[:HAS_SKILL{level:2}]->(s7),
       (p4)-[:HAS_SKILL{level:3}]->(s1),
       (p4)-[:HAS_SKILL{level:2}]->(s2),
       (p5)-[:HAS_SKILL{level:1}]->(s1),
       (p5)-[:HAS_SKILL{level:3}]->(s7),
       (p5)-[:HAS_SKILL{level:2}]->(s2),
       (p5)-[:HAS_SKILL{level:1}]->(s10),
       (p6)-[:HAS_SKILL{level:2}]->(s1),
       (p6)-[:HAS_SKILL{level:1}]->(s3),
       (p6)-[:HAS_SKILL{level:2}]->(s8),
       (p7)-[:HAS_SKILL{level:3}]->(s3),
       (p7)-[:HAS_SKILL{level:1}]->(s4),
       (p8)-[:HAS_SKILL{level:2}]->(s6),
       (p8)-[:HAS_SKILL{level:3}]->(s8)

This statement uses Cypher path expressions to declare or describe the kind of graph structure we wish to introduce into the graph. In the first half we create all the nodes we’re

interested in—in this instance, nodes representing companies, people and skills—and then in the second half we connect these nodes using appropriately named and directed relationships. The entire statement, however, executes as a single transaction. Let’s take a look at the first node definition:

(p1:Person{username:'ben'})

This expression describes a node labelled Person. The node has a username property whose value is “ben”. The node definition is contained within parentheses. Inside the parentheses we specify a colon-prefixed list of the labels attached to the node (there’s just one here, Person), together with the node’s properties. Cypher uses a JSON-like syntax to define the properties belonging to a node. Having created the node, we then assign it to an identifier, p1. This identifier allows us to refer to the newly created node elsewhere in the query. Identifiers are arbitrarily named, ephemeral, in-memory phenomena; they exist only within the scope of the query (or subquery) where they are declared. They are not considered part of the graph, and are, therefore, discarded when the data is persisted to disk.

Having created all the nodes representing people, companies and skills, we then connect them as per our prototypical path expression: each person WORKS_FOR a company; each person HAS_SKILL one or more skills. Here’s the first of the HAS_SKILL relationships:

(p1)-[:HAS_SKILL{level:1}]->(s1)

This relationship connects the node identified by p1 to the node identified by s1. Besides specifying the relationship name, we’ve also attached a level property to this relationship using the same JSON-like syntax we used for node properties.

(We’ve used a single CREATE statement here to create an entire sample graph. This is not how we would populate a graph in a running application, where individual end-user activities trigger the creation or modification of data. For such applications, we’d use a mixture of CREATE, SET, MERGE and DELETE to create


and modify portions of the graph. You can read more about these operations in the online Cypher documentation.)
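By way of a hedged illustration (the usernames are taken from the sample data, but the statement itself is just a sketch), an application adding or updating a single person at runtime might use MERGE so that the write is idempotent:

// Ensure the person and company exist, then connect them,
// creating only whatever is missing
MERGE (p:Person {username:'ben'})
MERGE (c:Company {name:'Acme'})
MERGE (p)-[:WORKS_FOR]->(c)

Running this statement repeatedly leaves the graph unchanged after the first execution.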

The following diagram shows a portion of the sample data. Within this structure you can clearly see multiple instances of our prototypical path:

Find Colleagues With Similar Skills

Now that we’ve a sample dataset that exemplifies the path expressions we derived from our user story, we can return to the question we want to ask of our domain, and express it more formally as a Cypher query. Here’s the question again:

Which people, who work for the same company as me, have similar skills to me?

To answer this question, we’re going to have to find a particular graph pattern in our sample data. Let’s assume that somewhere in the existing data is a node labelled Person that represents me (I have the username “ian”). That node will be connected to a node labelled Company by way of an outgoing WORKS_FOR relationship. It will also be connected to one or more nodes labelled Skill by way of several outgoing HAS_SKILL relationships. To find colleagues who share my skillset, we’re going to have to find all the other nodes labelled

Person that are connected to the same company node as me, and which are also connected to at least one of the skill nodes to which I’m connected. In diagrammatic form, this is the pattern we’re looking for:
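The same diamond shape can also be written as two Cypher paths that share their endpoints, which is the pattern the full query below uses:

(me:Person)-[:WORKS_FOR]->(company:Company)<-[:WORKS_FOR]-(colleague:Person),
(me)-[:HAS_SKILL]->(skill:Skill)<-[:HAS_SKILL]-(colleague)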


Our query will look for multiple instances of this pattern inside the existing graph data. For each colleague who shares one skill with me, we’ll match the pattern once. If a person has two skills in common with me, we’ll match the

pattern twice, and so on. Each match will be anchored on the node that represents me. Using Cypher path expressions, we can describe this pattern to Neo4j. Here’s the full query:

// Find colleagues with similar skills
MATCH (me:Person{username:'ian'})-[:WORKS_FOR]->(company:Company),
      (me)-[:HAS_SKILL]->(skill:Skill),
      (colleague:Person)-[:WORKS_FOR]->(company),
      (colleague)-[:HAS_SKILL]->(skill)
RETURN colleague.username AS username,
       count(skill) AS score,
       collect(skill.name) AS skills
ORDER BY score DESC

This query comprises two clauses: a MATCH clause and a RETURN clause. The MATCH clause describes the graph pattern we want to find in the existing data; the RETURN clause generates a projection of the results on behalf of the client. The first line of the MATCH clause, (me:Person{username:'ian'}), locates the node in the existing data that represents me—a node labelled Person with a username property whose value is “ian”—and assigns it to the identifier me. If there are multiple nodes matching these criteria (unlikely, because username ought to be unique), me will be bound to a list of nodes. The rest of the MATCH clause then describes the diamond-shaped pattern we want to find in the graph. In describing this pattern, we specify the labels that must be attached to a node for it to match (Company for companies, Skill for skills, Person for colleagues), and the names and the directions of the relationships that must be present between nodes for them to match (a Person must be connected to a Company with an outgoing WORKS_FOR relationship, and to a Skill with an outgoing HAS_SKILL relationship). Where we want to refer to a matched node later in the query, we assign it to an identifier (we’ve chosen colleague, company and skill). By being as explicit as we can about the pattern, we help ensure Cypher explores no more of the graph than is strictly necessary to answer the query.

The RETURN clause generates a tabular projection of the results. As I mentioned earlier, we’re matching multiple instances of the

pattern. Colleagues with more than one skill in common with me will match multiple times. In the results, however, we only want to see one line per colleague. Using the count and collect functions, we aggregate the results on a per-colleague basis. The count function counts the number of skills we’ve matched per colleague, and aliases this as their score. The collect function creates a comma-separated list of the skills that each colleague has in common with me, and aliases this as skills. Finally, we order the results, highest score first. Executing this query against the sample dataset generates the following results:

username   score   skills
ben        2       ['Neo4j', 'REST']
charlie    1       ['Neo4j']

The important point about this query, and the process that led to its formulation, is that the paths we use to search the data are very similar to the paths we use to create the data in the first place. The diamond-shaped pattern at the heart of our query has two legs, each comprising a path that joins a person to a company and a skill:

(:Company)<-[:WORKS_FOR]-(:Person)-[:HAS_SKILL]->(:Skill)

This is the very same path structure we came up with for our data model. The similarity shouldn’t surprise us: after all, both the underlying model and the query we execute against that model are derived from the question we wanted to ask of our domain.


Filter By Skill Level

In our sample graph we qualified each HAS_SKILL relationship with a level property that indicates an individual’s proficiency with regard to the skill to which the relationship points: 1 for beginner, 2 for intermediate, 3 for expert. We can use this property in our query to restrict matches to only those people who are, for example, level 2 or above in the skills they share with us:

// Find colleagues with shared skills, level 2 or above
MATCH (me:Person{username:'ian'})-[:WORKS_FOR]->(company),
      (me)-[:HAS_SKILL]->(skill),
      (colleague)-[:WORKS_FOR]->(company),
      (colleague)-[r:HAS_SKILL]->(skill)
WHERE r.level >= 2
RETURN colleague.username AS username,
       count(skill) AS score,
       collect(skill.name) AS skills
ORDER BY score DESC

There are two changes from the original query. In the MATCH clause we now assign a colleague’s HAS_SKILL relationships to an identifier r (meaning that r will be bound to a list of such relationships). We then introduce a WHERE clause that limits the match to cases where the value of the level property on the relationships bound to r is 2 or greater. Running this query against the sample data returns the following results:

username   score   skills
charlie    1       ['Neo4j']
ben        1       ['REST']

Search Across Companies

As a final illustration of the flexibility of our simple data model, we’ll tweak the query again so that we no longer limit it to the company where I work, but instead search across all companies for people with skills in common with me:

// Find people with shared skills, level 2 or above
MATCH (me:Person{username:'ian'})-[:HAS_SKILL]->(skill),
      (other)-[:WORKS_FOR]->(company),
      (other)-[r:HAS_SKILL]->(skill)
WHERE r.level >= 2
RETURN other.username AS username,
       company.name AS company,
       count(skill) AS score,
       collect(skill.name) AS skills
ORDER BY score DESC

To facilitate this search, we’ve removed the requirement that the other person must be connected to the same company node as me. We do, however, still identify the company for whom this other person works. This then allows us to add the company name to the results. The pattern described by the MATCH clause now looks like this:
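Written as a single Cypher path (equivalent to the MATCH clause above), the pattern is:

(me:Person)-[:HAS_SKILL]->(skill)<-[r:HAS_SKILL]-(other)-[:WORKS_FOR]->(company)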

Running this query against the sample data returns the following results:

username   company   score   skills
sarah      Startup   2       ['Java', 'REST']
ben        Acme      1       ['REST']
emily      Startup   1       ['Neo4j']
charlie    Acme      1       ['Neo4j']

Modelling Strategies and Tips

We’ve looked at how we derive an application’s graph data model and associated queries from end-user requirements. In summary:

• Describe the client or end-user goals that motivate our model;

• Rewrite those goals as questions we would have to ask of our domain;

• Identify the entities and the relationships between them that appear in these questions;


• Translate these entities and relationships into Cypher path expressions;

• Express the questions we want to ask of our domain as graph patterns using path expressions similar to the ones we used to model the domain.

In these last sections we’ll discuss a few strategies and tips to bear in mind as we undertake this design process.

Use Cypher to Describe Your Model

Use Cypher path expressions, rather than an intermediate modelling language such as UML, to describe your domain and its model. As we've seen, many of the noun and verb phrases in the questions we want to ask of our domain can be straightforwardly transformed into Cypher path expressions, which then become the basis of both the model itself, and the queries we want to execute against that model. In such circumstances, the use of an intermediate modelling language adds very little. This is not to say that Cypher path expressions comprehensively address all of our modelling needs. Besides capturing the structure of the graph, we also need to describe in what ways both the graph structure and the values of individual node and relationship properties ought to be constrained. Cypher does provide for some constraints today, and the number of constraints it supports will rise with each release, but there are occasions today where domain invariants must be expressed as annotations to the expressions we use to capture the core of the model.

Name Relationships Based on Use Cases

Derive your relationship names from your use cases. Doing so creates paths in your model that align easily with the patterns you want to find in your data. This ensures that queries that take advantage of these paths will ignore all other nodes and relationships.

Relationships both compose and partition the graph. In connecting nodes, they structure the whole, creating a complex composite from what would otherwise be simple islands of data. At the same time, because they can be differentiated from one another based on their name, direction and property values, relationships also serve to partition the graph, allowing us to identify specific subgraphs within a larger, more generally connected structure. By focussing our queries on certain relationship

names and directions, and the paths they form, we exclude other relationships and other paths, effectively materializing a particular view of the graph dedicated to addressing a particular need. You might think this smacks somewhat of an overly specializing approach, and indeed, in many ways it is. But it’s rarely an issue. Graphs don't exhibit the same degree of specialization tax as relational models.

The relational world has an uneasy relationship with specialization, both abhorring it and yet requiring it, and then suffering when it does so. Consider: we apply the normal forms in order to derive a logical structure capable of supporting ad hoc queries—that is, queries we haven't yet thought of. All well and good—until we go into production. At that point, for the sake of performance, we denormalize the data, effectively specializing it on behalf of an application's specific access patterns. This denormalization helps in the near term, but poses a risk for the future, for in specializing for one access pattern, we effectively close the door on many others. Relational modellers are frequently faced with these kinds of either/or dilemmas: either stick with the normal forms and have performance suffer, or denormalize, and limit the scope for evolving the application further down the line.

Not so with graph modelling. Because the graph allows us to introduce new relationships at the level of individual node instances, we can specialize it over and over again, use case by use case, in an additive fashion—that is, by adding new routes to an existing structure. We don't need to destroy the old to accommodate the new; rather, we simply introduce the new configuration by connecting old nodes with new relationships. These new relationships effectively materialize previously unthought-of graph structures to new queries. Their being introduced into the graph, however, need not upset the view enjoyed by existing queries.
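For instance, a later requirement to record mentoring relationships could be met additively, without touching the existing WORKS_FOR and HAS_SKILL structure (the MENTORS relationship and its since property are hypothetical, introduced here purely for illustration):

// Add a new use case to the existing skills graph
MATCH (mentor:Person {username:'ian'}), (mentee:Person {username:'ben'})
CREATE (mentor)-[:MENTORS {since:2014}]->(mentee)

Queries that traverse only WORKS_FOR and HAS_SKILL paths are unaffected by the new relationship.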

Pay Attention to Language

In our modelling example, we derived a couple of path expressions from the noun and verb phrases we used to describe our domain. There are a few rules of thumb when analyzing a natural language representation of a domain. Common nouns become candidates for labels: “person”, “company” and “skill” become Person, Company and Skill respectively. Verbs that take an object—“owns”, “wrote” and


“bought”, for example—become candidate relationship names. Proper nouns—a person or company's name, for example—refer to an instance of a thing, which we then typically model as a node.

Things aren’t always so straightforward. Subject-verb-object constructs are easily transformed into graph structures, but a lot of the sentences we use to describe our domain are not as simple as this. Adverbial phrases, for example—those additional parts of a sentence that describe how, when or where an action was performed—result in what entity-relationship modelling calls n-ary relationships; that is, complex, multi-dimensional relationships that bind together several things and concepts. N-ary relationships would appear to require something more sophisticated than the property graph for their representation; a model that allows relationships to connect more than two nodes, or that permits one relationship to connect to, and thereby qualify, another. Such data model constructs are, however, almost always unnecessary. To express a complex interrelation of several different things, we need only introduce an intermediate node—a hub-like node that connects all the parties to an n-ary relationship.

Intermediate nodes are a common occurrence in many application graph data models. Does their widespread use imply that there is a deficiency in the property graph model? I think not. More often than not, an intermediate node makes visible one more element of the domain—a hidden or implicit concept with informational content and a meaningful domain semantic all of its own. Intermediate nodes are usually self-evident wherever an adverbial phrase qualifies a clause. “Bill worked at Acme, from 2005-2007, as a Software Engineer” leads us to introduce an intermediate node that connects Bill, Acme and the role of Software Engineer. It quickly becomes apparent that this node represents a job, or an instance of employment, to which we can attach the date properties from and to.
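A minimal Cypher sketch of this employment example, with the job as an intermediate node (the labels, relationship names and property keys are illustrative assumptions, not a prescribed model):

CREATE (bill:Person {name:'Bill'})-[:HAS_JOB]->(job:Job {title:'Software Engineer', from:2005, to:2007}),
       (job)-[:AT]->(acme:Company {name:'Acme'})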

It's not always so straightforward. Some intermediate nodes lie hidden in far more obscure locales. Verbing—the language habit whereby a noun is transformed into a verb—can often occlude the presence of an intermediate node. Technical and business jargon is particularly rife with such neologisms: we “email” one another, rather than send an email, “google” for results, rather than search Google. The verb “email” provides a ready example of the kinds of difficulties we can encounter if we miss out on the noun origins of some verbs. The following path shows the result of us treating “email” as a relationship name:

(:Person{name:'Alice'})-[:EMAILED]->(:Person{name:'Lucy'})

This looks straightforward enough. In fact, it's a little too straightforward, for with this construct it becomes extremely difficult to indicate that Alice also copied in Alex. But if we unpack the noun origins of “email”, we discover both an important domain concept—the electronic communication itself—and an intermediate node that connects senders and receivers:

(:Person{name:'Alice'})-[:SENT]->(e:Email{subject:'Annual report'})-[:TO]->(:Person{name:'Lucy'}),
(e)-[:CC]->(:Person{name:'Alex'})

If you're struggling to come up with a graph structure that captures the complex interdependencies between several things in your domain, look for the nouns, and hence the domain concepts, hidden on the far side of some of the verb phrases you use to describe the structuring of your domain.

Conclusion

Once a niche academic topic, graphs are now a commodity technology. In this article we’ve looked at the kinds of problems graph databases are intended to solve. We’ve seen how Neo4j makes it easy to model, store and query large amounts of variably structured, densely connected data, and how we can design and implement an application graph data model by transforming user stories into graph structures and declarative graph pattern matching queries. If, having got this far, you’re now beginning to think in graphs, head over to http://neo4j.com and grab a copy of Neo4j.

Ian Robinson is an Engineer for Neo Technology, the company behind Neo4j. Blog: http://iansrobinson.com Follow him: @iansrobinson


Agile India 2014 - A Report

Naresh Jain (@nashjain) and Pramod Sadalage discuss the 2014 event…

The Agile Software Community of India was happy to host 1,236 attendees from 28 different countries at the Agile India 2014 conference. We had attendees playing 342 different roles from 226 different companies sharing, learning, networking and enabling the community to improve their agility. For 10 years we have been running these conferences, and every year the community feeling keeps getting better. This year, finally, one could sense the true spirit of a large-scale community at the conference. It was not a one-person show anymore. Also, it was amazing to see how well folks were networking and learning from each other

(peer-to-peer learning). This year we got tremendous support from a diverse set of companies sponsoring the event. Many people appreciated that the conference was not only supported by Agile tools & consulting companies, but was also supported by companies like JP Morgan, HP and Siemens. This clearly shows that the industry believes in the agile movement and wants to invest in nurturing our budding community.

Another thing the participants really appreciated was how inclusive the conference program was. In the early days of Agile India, we were very heavily influenced by eXtreme Programming. But over the years, we’ve tried our best to be more inclusive of other methods (Scrum, Kanban, Lean Startup, DSDM, etc.) and frameworks (SAFe, DAD, etc.). We strongly believe that the conference's job is to create an equal platform for everyone, get the best in the industry and let people decide what makes most sense to them, in their context. The entire conference program was put together by a committee of volunteers (http://2014.agileindia.org/organizers/), who are selected via a self-nomination process. Also, anyone is allowed to put in a proposal via our open submission system (http://present.agileindia.org). We got 263 proposals for talks, out of which 64 were selected.

The conference ran for four days, with each day dedicated to a specific theme:

Scaling Agile Adoption - Day focused on adopting Agile in organizations of all types and sizes, scaling from a single team doing agile to multiple teams, departments and non-IT adoption of agile practices. Ellen Grove headed this theme as its chair.

Offshore/Distributed Agile - Day focused on offshore teams and distributed teams, either in the same time-zone or different time-zones, and their impact on agile projects. Ravi Kumar headed this theme as its chair.


Agile Lifecycle - Day focused on the entire lifecycle of a project/product starting from product discovery, project kickoff, release planning, user story mapping, development & testing practices, CI pipelines from development to deployment, measuring feature usage, doing A/B testing and beyond. Michael Norton (Doc) headed this theme as its chair.

Beyond Agile - Day focused on taking Agile methods to the next level, such as Lean Startups for enterprises, Continuous Delivery, using cloud services such as IaaS or PaaS, automation in device/embedded software, Agile in mobile development including CI & A/B testing, and the challenges in managing generalizing specialists. Tathagat Verma headed this theme as its chair.

Morning keynotes by Martin Fowler (@martinfowler), Todd Little, Ash Maurya (@ashmaurya) and Dave Thomas (@pragdave), and evening keynotes by Rae Abileah and Ryan Martens, were inspiring not just from a technology perspective but also opened our eyes in many different ways. We tried many unique concepts during this year's conference:

Agile Art!:

During all three evening receptions, the participants created a visual art piece together with the help of Richard Kasperowski and the team from McAfee. This helped the participants to create new connections and build/reinforce the community of Agilists in India and around the world.

Book Signing and Book Store: Every year Agile India attracts top speakers from around the world. Most of these speakers have a track record of writing very influential books. For the fans and followers of these authors, we set up a book store at the conference and held book-signing events where attendees were able to get a personally autographed copy from the authors. Many folks appreciated this initiative. And we plan to make it even stronger next year.

Agile India Webinar Series: We invited many speakers to the Agile India 2014 Conference who, due to travel constraints or other conflicts, were not able to make it. A few of them agreed to do an exclusive webinar (Google Hangout) with us. The recordings of these webinars are available at: http://2014.agileindia.org/program/webinars/

Agile India Job Fair: Agile India was happy to host the world’s first job fair dedicated to hiring Agile practitioners. The goal of the Agile Job Fair was to create a dedicated platform for Agile practitioners to meet potential Agile employers, and for companies to find Agile practitioners to enable their journey to Agile adoption and excellence.

See videos of all the talks at 2014.agileindia.org. Find out about next year’s event at agileindia.org.


NoSQL Databases: An overview

Relational databases have dominated the software industry for a long time, but this dominance is cracking with the rise of new types of

databases - NoSQL databases.

Pramod Sadalage investigates...

NoSQL: what does it mean?

What does NoSQL mean and how do you categorize these databases? NoSQL means Not Only SQL, implying that when designing a software solution or product, there is more than one storage mechanism that could be used, based on the needs. NoSQL was a hashtag (#nosql) chosen for a meetup to discuss these new databases. The most important result of the rise of NoSQL is Polyglot Persistence.

NoSQL does not have a prescriptive definition but we can make a set of common observations, such as:

• Not using the relational model
• Running well on clusters
• Mostly open-source
• Built for the 21st century web estates
• Schemaless

Why NoSQL Databases?

Application developers have been frustrated with the impedance mismatch between the relational data structures and the in-memory data structures of the application. Using NoSQL databases allows developers to develop without having to convert in-memory structures to relational structures. There is also movement away from using databases as

integration points in favor of encapsulating databases with applications and integrating using services. The rise of the web as a platform also created a vital factor for change in data storage: the need to support large volumes of data by running on clusters. Relational databases were not designed to run efficiently on clusters. The data storage needs of an ERP application


are a lot different from the data storage needs of Facebook, Etsy or Stripe.

Aggregate Data Models:

Relational database modelling is vastly different from the types of data structures that application developers use. Using the data structures as modelled by the developers to solve different problem domains has given rise to a movement away from relational modelling and towards aggregate models; much of this is driven by Domain-Driven Design, the book by Eric Evans. An aggregate is a collection of data that we interact with as a unit. These units of data, or aggregates, form the boundaries for ACID operations with the database. Key-value, document, and column-family databases can all be seen as forms of aggregate-oriented database.

Aggregates make it easier for the database to manage data storage over clusters, since the unit of data now could reside on any machine and, when retrieved from the database, brings all the related data along with it. Aggregate-oriented databases work best when most data interaction is done with the same aggregate: for example, when there is a need to get an order and all its details, it is better to store the order as an aggregate object; but dealing with these aggregates to get item details across all the orders is not elegant. Aggregate-oriented databases make inter-aggregate relationships more difficult to handle than intra-aggregate relationships. Aggregate-ignorant databases are better when interactions

use data organized in many different formations. Aggregate-oriented databases often compute materialized views to provide data organized differently from their primary aggregates. This is often done with map-reduce computations, such as a map-reduce job to get items sold per day.

Distribution Models:

Aggregate-oriented databases make distribution of data easier, since the distribution mechanism has to move only the aggregate and does not have to worry about related data, as all the related data is contained in the aggregate. There are two styles of distributing data:

• Sharding: Sharding distributes different data across multiple servers, so each server acts as the single source for a subset of data.

• Replication: Replication copies data across multiple servers, so each bit of data can be found in multiple places.

Replication comes in two forms:

• Master-slave replication makes one node the authoritative copy that handles writes while slaves synchronize with the master and may handle reads.

• Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their copies of the data.

Master-slave replication reduces the chance of update conflicts, but peer-to-peer replication avoids loading all writes onto a single server and so avoids creating a single point of failure. A system may use either or both techniques. For example, the Riak database shards the data and also replicates it based on the replication factor.

CAP theorem:

In a distributed system, managing consistency (C), availability (A) and partition tolerance (P) is important. Eric Brewer put forth the CAP theorem, which states that in any distributed system we can choose only two of consistency, availability or partition tolerance. Many NoSQL databases try to provide options where the developer has


choices with which they can tune the database as per their needs. For example, consider Riak, a distributed key-value database. There are essentially three variables, r, w and n, where:

r = number of nodes that should respond to a read request before it is considered successful.
w = number of nodes that should respond to a write request before it is considered successful.
n = number of nodes on which the data is replicated, aka the replication factor.

In a Riak cluster with 5 nodes, we can tweak the r, w and n values to make the system very consistent by setting r=5 and w=5, but now we have made the cluster susceptible to network partitions, since any write will not be considered successful when any node is not responding. We can make the same cluster highly available for writes or reads by setting r=1 and w=1, but now consistency can be compromised, since some nodes may not have the latest copy of the data. The CAP theorem states that if you get a network partition, you have to trade off availability of data versus consistency of data. Durability can also be traded off against latency, particularly if you want to survive failures with replicated data.

NoSQL databases provide developers with a lot of options to choose from and with which to fine-tune the system to their specific requirements. It becomes much more important to understand how the data is going to be consumed by the system: is it read-heavy or write-heavy, is there a need to query data with random query parameters, will the system be able to handle inconsistent data, and so on. For a long time now we have been used to the default RDBMS, which comes with a standard set of features no matter which product is chosen, and with which there is no possibility of choosing some features over others. The availability of choice in NoSQL databases is both good and bad at the same time. Good because now we have the choice to design the system according to wider requirements. But bad because we have to get our choices right!

One example of a feature provided by default in an RDBMS is the transaction. We are so used to this feature that we have stopped thinking about what would happen when the database does not provide transactions. Most NoSQL databases do not provide transaction support by default, which means the developers have to think about how to implement them, asking questions such as:

• does every write have to have the safety of transactions?

• can the write be segregated into “critical that they succeed” and “it's okay if I lose this write” categories?

Sometimes deploying an external transaction manager such as ZooKeeper is also a possibility.

Types of NoSQL Databases:

NoSQL databases can broadly be categorized into four types.

Key-Value databases

Key-value stores are the simplest NoSQL data stores to use from an API perspective. The client can either get the value for the key, put a value for a key, or delete a key from the data store. The value is a blob that the data store just stores, without caring or knowing what's inside; it is the responsibility of the application to understand what was stored.


Since key-value stores always use primary-key access, they generally have great performance and can be easily scaled. Some of the popular key-value databases are Riak, Redis (often referred to as a Data Structure server), Memcached and its flavors, Berkeley DB, HamsterDB (especially suited for embedded use), Amazon DynamoDB (not open-source), Project Voldemort and Couchbase. Not all key-value databases are the same; there are major differences between these products. For example, Memcached data is not persistent whereas Riak data is. It is important to consider our requirements in choosing which database to use.

Consider implementing caching of user preferences. Doing this in Memcached means that if the node goes down all the data will be lost and will need to be refreshed from the source system. If we store the same data in Riak we may not need to worry about losing data, but we must also consider how to update stale data. These types of issues need to be thought through before choosing a key-value database.

Document Databases


Perhaps unsurprisingly, documents are the main concept in document databases. The database stores and retrieves documents, which can be XML, JSON, BSON, and so on. These documents are self-describing, hierarchical tree data structures which can consist of maps, collections, and scalar values. The documents stored are similar to each other but do not have to be exactly the same.

Document databases store documents in the value part of the key-value store; think about document databases as key-value stores where the value is examinable. Document databases such as MongoDB provide a rich query language and constructs such as databases and indexes, allowing for an easier transition from relational databases.


Some of the popular document databases we have seen are MongoDB, CouchDB, Terrastore, OrientDB, RavenDB – not forgetting the well-known and often reviled Lotus Notes that uses document storage.

Column Family Stores

Column-family databases store data in column families as rows that have many columns associated with a row key. Column families are groups of related data that is often accessed together. For a Customer, we would often access their Profile information at the same time, but not their Orders.

Each column family can be compared to a container of rows in an RDBMS table where the key identifies the row and the row consists of multiple columns. The difference is that various rows do not have to have the same columns, and a column can be added to any row at any time without having to add it to other rows. When a column consists of a map of columns, then we have a super column. A super column consists of a name and a value which is a map of columns. Think of a super column as a container of columns.

Cassandra is one of the popular column-family databases; there are others, such as HBase, Hypertable, and Amazon DynamoDB. Cassandra can be described as fast and easily scalable with write operations spread across the cluster. The cluster does not have a master node, so any read and write can be handled by any node in the cluster.

Graph Databases

Graph databases allow you to store entities and relationships between these entities. Entities are also known as nodes, which have properties. Think of a node as an instance of an object in the application. Relations are known as edges that can have properties. Edges have directional significance; nodes are organized by relationships, which allow you to find interesting patterns between the nodes. The organization of the graph lets the data be stored once and then interpreted in different ways based on relationships.

Usually, when we store a graph-like structure in an RDBMS, it's for a single type of relationship ("who is my manager" is a common example). Adding another relationship to the mix usually means a lot of schema changes and data movement, which is not the case when we are using graph databases. Similarly, in relational databases we model the graph beforehand based on the traversal we want; if the traversal changes, the data will have to change. In graph databases, traversing the joins or relationships is very fast. The relationship between nodes is not calculated at query time but is actually persisted as a relationship. Traversing persisted relationships is faster than calculating them for every query. Nodes can have different types of relationships between them, allowing you both to represent relationships between the domain entities and to have secondary relationships for things like category, path, time-trees, quad-trees for


spatial indexing, or linked lists for sorted access. Since there is no limit to the number and kind of relationships a node can have, they all can be represented in the same graph database. Relationships are first-class citizens in graph databases; most of the value of graph databases is derived from the relationships. Relationships don't only have a type, a start

node, and an end node, but can have properties of their own. Using these properties on the relationships, we can add intelligence to the relationship—for example, since when did they become friends, what is the distance between the nodes, or what aspects are shared between the nodes. These properties on the relationships can be used to query the graph.
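In Neo4j's Cypher, for example, such a query might look like the following sketch (the label, relationship name and properties are assumptions chosen purely for illustration):

// Find pairs of people who have been friends since before 2010
MATCH (a:Person)-[f:FRIENDS_WITH]->(b:Person)
WHERE f.since < 2010
RETURN a.name, b.name, f.since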

Since most of the power of graph databases comes from the relationships and their properties, a lot of thought and design work is needed to model the relationships in the domain that we are trying to work with. Adding new relationship types is easy; changing existing nodes and their relationships is similar to data migration, because these changes will have to be done on each node and each relationship in the existing data. There are many graph databases available, such as Neo4J, Infinite Graph, OrientDB, or FlockDB (which is a special case: a graph

database that only supports single-depth relationships or adjacency lists, where you cannot traverse more than one level deep for relationships).

Why choose a NoSQL database?

We've covered a lot of the general issues you need to be aware of to make decisions in the new world of NoSQL databases. It's now time to talk about why you would choose NoSQL databases for future development work. Here are some broad reasons to consider the use of NoSQL databases.



Why use a NoSQL database:

• To improve programmer productivity by using a database that better matches an application's needs.

• To improve data access performance via some combination of handling larger data volumes, reducing latency, and improving throughput.

It's essential to test your expectations about programmer productivity and/or performance before committing to using a NoSQL technology. Since most of the NoSQL databases are open source, testing them is a simple matter of downloading these products and setting up a test environment. Even if NoSQL cannot be adopted right now, designing the system using service encapsulation supports changing data storage technologies as needs and technology evolve. Separating parts of applications into services also allows you to introduce NoSQL into an existing application.

Choosing a NoSQL database

Given so much choice, how do we choose which NoSQL database to use? As described above, much depends on the system requirements; here are some general guidelines:

• Key-value databases are generally useful for storing session information, user profiles, preferences and shopping-cart data. We would avoid using key-value databases when we need to query by the data, have relationships between the data being stored, or need to operate on multiple keys at the same time.

• Document databases are generally useful for content management systems, blogging platforms, web analytics, real-time analytics and e-commerce applications. We would avoid using document databases for systems that need complex transactions spanning multiple operations, or queries against varying aggregate structures.


• Column-family databases are generally useful for content management systems, blogging platforms, maintaining counters, expiring usage, and heavy write volumes such as log aggregation. We would avoid using column-family databases for systems that are in early development or have changing query patterns.

• Graph databases are very well suited to problem spaces where we have connected data, such as social networks, spatial data, routing information for goods and money, and recommendation engines.

Schema-less ramifications

All NoSQL databases claim to be schema-less, which means there is no schema enforced by the database itself. Databases with strong schemas, such as relational databases, can be migrated by saving each schema change, plus its data migration, in a version-controlled sequence. Schemaless databases

still need careful migration, due to the implicit schema in any code that accesses the data. Schemaless databases can use the same migration techniques as databases with strong schemas. In schemaless databases we can also read data in a way that's tolerant to changes in the data's implicit schema, and use incremental migration to update data, thus allowing for zero-downtime deployments and making them more popular with 24x7 systems.

Conclusion

All the choice provided by the rise of NoSQL databases does not mean the demise of relational databases. We are entering an era of polyglot persistence: using different data storage technologies to handle varying data storage needs. Polyglot persistence can apply across an enterprise or within a single application.

Pramod Sadalage is principal consultant at ThoughtWorks where he enjoys the rare role of bridging the divide between database professionals and application developers. He is usually sent in to clients with particularly challenging data needs, which require new technologies and techniques. In the early 00's he developed techniques to allow relational databases to be designed in an evolutionary manner based on version-controlled schema migrations. He is the co-author of Refactoring Databases, co-author of NoSQL Distilled and continues to speak and write about the insights he and his clients learn.

Book Recommendations


Apache CouchDB™: The Definitive Introduction

Jan Lehnardt gives an overview of the Apache CouchDB database system - delving into the more technical details of CouchDB, but also explaining where CouchDB comes from, what problems it solves particularly well and why it is different from all other databases.

Introduction

Apache CouchDB is Open Source database management software published by the Apache Software Foundation. It is developed as a community project, with several commercial vendors supporting the core development as well as offering support and services. CouchDB is written in Erlang, a functional programming language with a focus on writing robust, fault-tolerant and highly concurrent applications. CouchDB uses HTTP as its main programming interface and JSON for data storage. It has a sophisticated replication feature that powers most of the more interesting use-cases for CouchDB. First released in 2005, CouchDB has been in development for nearly a decade and is used all over the world, by independent enthusiasts and industry giants alike. It has healthy developer and support communities and a steadily growing fan-base. Its main website is http://couchdb.apache.org.

What sets CouchDB apart from other databases?

Traditionally, databases are the single source of truth for an application or a set of applications. They manage data storage and access, data integrity, schemata, permissions and so on. Usually, the boundary for a database system is a single server: a single piece of hardware with global access to all storage, memory and CPU in a coherent fashion. Database systems like these include MySQL and PostgreSQL. With the advent of the web in the late 90s and its mass success in the 2000s, the requirements for website backends changed dramatically, and with them the ways people used these databases. The main requirements can be placed along two axes: reliability and capacity.

A common MySQL setup in 2002 consisted of a single database server. When a particular application became popular, or the source of business, reliability became a must. To address this, people started setting up MySQL replication from the single database, now called the "primary", to a "hot spare secondary" database. Should the primary database server crash or become otherwise unavailable, the hot spare secondary could be promoted to be the new primary. This can happen quickly, often with little or no interruption to the application.

On the other axis, websites and web apps of the early and mid-2000s had a particular access pattern: 85%-99% of all requests were read requests, i.e. requests that only retrieved existing information from the application, and thus the database. Only a small percentage of requests would actually create any new data in the database. At the same time, the number of requests for a website could easily exceed the resources of a single hardware server. This led to a plethora of solutions to the problem of using more hardware to serve the load. As a first tier, caching with things like memcached was (and still is) often deployed. And then, when this wouldn't suffice to keep up with the number of requests, people turned to MySQL's replication feature again to create what are called "read-only secondary" databases. These are databases that continuously read all changes going into the primary database server. The application can then direct all write requests to the primary server and balance read requests across the many read-only secondary servers, thus effectively distributing the load.

With this setup, some new problems come up, now that CPU, RAM and I/O no longer live in a single machine and a network is involved. With replication lag, it can take some time for a write request to arrive at a secondary, and if an application's user is unlucky, it might appear to them that the write request has not succeeded. But overall, this was fairly reliable


and good monitoring could help keep this and other issues at bay. In the late 2000s the web turned from a read-mostly medium into a full read-write web, and the above solutions started to show their limits. Companies like Google and Amazon started building custom in-house database systems designed to address these and other upcoming issues head-on. Both companies published notable papers about their work (Google's Bigtable paper and Amazon's Dynamo paper), and these are the foundation for many of the database systems that are today known as "NoSQL" or "Big Data" databases, including, but not limited to, Riak, Cassandra and HBase.

The Outlier: CouchDB

At the same time, but at first without being influenced by the above developments, CouchDB was started as an Open Source, from-scratch implementation of the database behind Lotus Notes (of all things!). In 2007 CouchDB added an indexing and querying system based on the MapReduce paper, but other than that, the design of CouchDB is mainly influenced by Lotus Notes. Before you stop reading in anger: nobody likes to use Lotus Notes the application, but its underlying database has some remarkable features that CouchDB inventor Damien Katz thought were worth preserving. If CouchDB shares little history with other databases of the same era or the era before, why the long introduction? Because it is easier to explain CouchDB's features in the context of the larger database ecosystem.

What makes CouchDB special?

CouchDB is more like git than MySQL. If you remember SVN, CVS, or similar centralised version control systems: for every interaction you had to talk to the server and wait for the result. svn log: send the request to the server, wait for the server to process the request, wait for the result to come back, display the result. Every. Time. In git, *snaps fingers*, interactions are instant. Why is that so? All operations are local. Your copy of the repository includes all the information the remote server has, and therefore there is no need to run every interaction over the network to some server.

In this view of the world, CouchDB works like git: you can have a copy of all your data on your local machine, as well as on a remote server, or multiple servers, and a continuous integration setup, spread all over the globe if you want. All powered by CouchDB's replication feature (while it has the same name as the feature in, say, MySQL, CouchDB's replication is significantly different, as we'll soon see). With this in mind, let's revisit the various scenarios for which other databases were first augmented, and with which they are now struggling:

• The single database server: like a classic MySQL setup, CouchDB supports this just fine. It is the default for many users.

• The primary-secondary setup, for hot failover or reliability (or both): CouchDB's replication lets us set up a hot failover easily. A standard HTTP proxy in front can manage the failover scenario.

• The primary-many-secondaries scenario for scaling out read requests: as trivial as before, nothing to see here.

So we've got all that covered, but with CouchDB we don't have to stop there. Say you have an office in London and a Customer Relationship Management (CRM) application that uses CouchDB. Everybody in the London office can access the application and its data at local LAN speeds. All is well. Now you open an office in Tokyo and the people there need access to the same CRM data. But making requests around the globe adds significant latency to each and every request (remember the SVN scenario above?) and your colleagues in Tokyo are quickly frustrated. In addition, if your London office network connection, or any network in between, has any issues, Tokyo is effectively cut off from the data they need to do their work with their customers. Luckily, you know that CouchDB has replication built in, and you set up an application and database server in the Tokyo office that people there can access over the local LAN. In the background, both CouchDB instances replicate changes to each other,


so that data added in Tokyo eventually makes its way to London, and vice versa.

All employees are productive and happy and the extent of your software configuration work can be summed up in these two simple curl commands:

# replicate all data in the "crm" database from London to Tokyo, continuously
curl -X POST https://tokyo.office/_replicate \
  -d '{"source":"https://london.office/crm", "target":"crm", "continuous":true}'

# replicate all data in the "crm" database from Tokyo to London, continuously
curl -X POST https://london.office/_replicate \
  -d '{"source":"https://tokyo.office/crm", "target":"crm", "continuous":true}'


When you open your New York office, you already know what to do, and you can make sure you don't let the people there go through the same painful experience as their colleagues in Tokyo. And you don't have to entrust your data to one of the cloud providers that may or may not take your data's confidentiality seriously. Before we continue with interesting use-cases like this, let's look at one of the major steps in CouchDB's development history.

BigCouch

BigCouch started as a fork of Apache CouchDB by the company Cloudant, who operate a big-data-as-a-service platform based on CouchDB. Their platform includes more things, but at its core sits BigCouch. After running the platform in production for a while, Cloudant decided to release its core, BigCouch, as an Open Source project. Fast forward a few years and BigCouch is now being merged into the main Apache CouchDB codebase. The upcoming CouchDB 2.0 release will include the full BigCouch feature set and make CouchDB a fully clusterable database solution. Why is this significant? Easy: the "C" in "CouchDB" stands for "Cluster" (the full name is "Cluster Of Unreliable Commodity Hardware"), and until BigCouch, CouchDB did not have any cluster management features built in. CouchDB's features, however, were carefully designed so that, should someone use CouchDB in a cluster (either behind a simple HTTP proxy or in a sophisticated system like BigCouch), its semantics would stay the same. The promise CouchDB made was that should you start on a single-machine setup and at some point outgrow the capacity of that machine, you could move to a cluster without having to rewrite your application. A big promise that can now be fulfilled. BigCouch is an implementation of the aforementioned Dynamo paper by Amazon. Dynamo defines a cluster and data management system that allows any number of machines to behave as if they were one, while handling orders of magnitude more data and requests

than a single server could handle, and on top of that be very resilient against server faults. In other words: BigCouch addresses both axes of the reliability and capacity spectrum. BigCouch achieves this by splitting each database into a number of shards. Each shard can live on a separate hardware server. Each server can return data from anywhere in the database: if the data is local to the server, it just returns it; if it lives on another server, it fetches it from there and returns it to the client as a proxy. In addition to sharding databases, BigCouch also keeps replicas of each shard on other server nodes in the cluster. In case a shard becomes unavailable through a network, software or hardware failure, the replica shards will be able to continue to serve requests in the face of one or more missing shards.
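As a hypothetical sketch of what this looks like to an operator: on a BigCouch or CouchDB 2.0 style cluster, database creation typically accepts a shard count and a replica count. The parameter names q and n and the values below are illustrative assumptions, not something a single-node CouchDB 1.x install will accept.

# Hypothetical sketch, assuming BigCouch-style database creation parameters:
# q = number of shards the database is split into,
# n = number of nodes that hold a copy of each shard.
curl -X PUT 'http://127.0.0.1:5984/my_database?q=8&n=3'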

Now that we’ve learned that CouchDB is more like git than other databases, and that it is designed to scale in a cluster with BigCouch, it is time for another ludicrous statement.

CouchDB is not just a database; it is a protocol

The promise of CouchDB, being able to store data close to where it is needed, is so attractive that people started porting CouchDB over to other languages and environments, to make sure they could benefit from its sophisticated replication feature. The most notable implementations of The Couch Replication Protocol are PouchDB, Couchbase Lite (née TouchDB), and Cloudant Sync for Mobile. PouchDB is implemented in JavaScript and is designed to run in a modern web browser (including mobile browsers). Couchbase Lite and Cloudant Sync come in two flavours, one for iOS written in Objective-C and one for Android written in Java, and both are meant to be embedded in native mobile applications. They are all Open Source projects separate from Apache CouchDB, but they share the same replication capabilities, although some of the implementation details that we explain for Apache CouchDB below differ in the various other projects.


Why would you want to have a database in your browser or on your phone? Well, you already do, but none of the existing databases in these places have a powerful replication feature built in, and adding one has a number of significant advantages: 1. With PouchDB, you can have a copy of your active user-data in your browser and you can treat it as a normal database. That means that all operations from your web application only ever talk to your in-browser PouchDB database; communication with the server happens asynchronously. The benefits here are manifold: because all end-user interaction only ever hits the local database, the user never has to wait for any action to take place; things happen immediately as far as they are concerned. This creates a significantly better user experience than having to wait for the network for every interaction. Both Amazon and Google have published studies that show that even 100ms of extra wait time turns people away from engaging with a website. Another benefit is that, if you are on a spotty wifi connection, or even on a phone in the subway, you can just keep using the app as if you were

online. Any changes you make, or that are made on the server side, are synchronised when you are online again. 2. With the iOS and Android implementations you get the same benefits, but for native mobile apps. Imagine having something that works like IMAP for email clients, but for any app: full access to all data locally, working synchronisation back and forth between multiple clients, and full offline functionality. In the mobile use-case the latency of the network is even higher than on WiFi and the connectivity is less consistent, even with the latest mobile broadband technologies. Waiting a second or two for every user interaction is frustrating at best. In addition, the radios on battery-powered devices are a huge drain on power when they are active. That's why mobile operating systems do all sorts of tricks to avoid having to power up the radio, and when the radio is running, make the most of it, so it doesn't have to be powered up again any time soon. With fully offline applications the radio does not have to be powered up a lot, let alone for every user interaction, resulting in significantly better


battery life and thus user experience, happier users and happier operators, who now get more out of their infrastructure. In short, having a database that implements The Couch Replication Protocol gives you the following advantages:

• Improved user experience through zero latency access to user data

• Network-independent app usage, apps work offline

• Massive savings in battery power

All of the above can be summarised as "Offline First", which is an initiative to promote the techniques and technologies with the above benefits.

The vision

All this is already very compelling, but things can go even further. Imagine a Couch-replication-capable database on every Linksys router, built into every phone and web browser: people would have better access to their data, and more control over it, in a world that is increasingly centralising resources and power around a few major corporations.

Now that we understand why CouchDB and other implementations of The Couch Replication Protocol have a number of compelling features for a modern computing world, let's have a look at how things work in detail.

Technical details

Fundamental to CouchDB are HTTP as the main access protocol and JSON as the data storage format. Let's start with these.

HTTP

HTTP is the most widely deployed end-user-visible protocol in existence. It is easy to understand, powerful, supported in programming environments everywhere, and comes with its own fleet of custom hardware and software that handles everything from serving, routing, proxying, monitoring and measuring to debugging it. Little other software is as ubiquitous as HTTP. The main way to do anything with CouchDB is via HTTP. Create a database: make an HTTP request; create some data: make an HTTP request; query your data: make an HTTP request; set up replication: make an HTTP request; configure the database: make an HTTP request. You get the idea. Just to give you a hint of what this looks like:

curl -X PUT http://127.0.0.1:5984/my_database

This creates a database from your command line, simple as that. A database is the bucket, the collection, for data. Where relational databases store tables and rows in a database, CouchDB stores documents in a database. A document contains both the structure and the data for a specific data item. That is why CouchDB is often classified as a document-oriented database. An easy way to think of a document from a programmer's perspective is as an object, or the serialisation of an object.

JSON Documents

CouchDB documents are encoded in JSON. JSON has the nice property that it doesn't try to be all things to all people. It is not a superset: not every data structure that every programming environment supports can be adequately represented in JSON. What makes JSON so nice is that it is a subset of the native types shared among all programming environments: numbers, booleans, strings, lists and hashes (and a few odds and ends). This makes JSON a great format for data interchange between different systems, because all you need is a JSON parser that translates JSON into the native types of a programming language, and that is considerably simpler than translation layers that would map all sorts of sophisticated things between different environments where they really wouldn't fit in. In addition, JSON is native to web programming, it is fairly concise and it compresses well, so it is a natural choice for web and mobile application programming. With CouchDB, you get JSON out of the box. Let's create a document:

curl -X PUT http://127.0.0.1:5984/my_database/my_document \
  -d '{"name": "Dave Grohl", "age": 42}'


CouchDB will happily store whatever JSON you send it. In that sense, CouchDB is a schemaless database by default. This helps with prototyping applications: one doesn't have to spend countless hours defining a data model upfront, you can just start programming, serialise your objects into JSON and store them in CouchDB. This also cuts out the middle layer known as Object Relational Mappers (ORMs). Superficially speaking, an ORM turns a relational database into something that feels natural to object-oriented programmers. With CouchDB, you get that same interface natively, leaving a whole class of problems behind you from the start. In addition, the source code of many popular ORMs is larger than CouchDB's source code. CouchDB also supports binary data storage. The mechanism is called attachments, and it works like email attachments: arbitrary binary data is stored under a name and with a corresponding content type in the _attachments member of a document. There is no size limit for documents or attachments.
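A minimal sketch of the standalone attachment API as it is commonly used: the file name, content type and revision value below are illustrative, and the command assumes a photo.jpg file in the current directory.

# Store a binary attachment on an existing document. The ?rev value must be
# the document's current revision (the one shown here is illustrative).
curl -X PUT \
  'http://127.0.0.1:5984/my_database/my_document/photo.jpg?rev=1-2e7eef663cf24b39ac342a6627ecb879' \
  -H 'Content-Type: image/jpeg' \
  --data-binary @photo.jpg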

Schema enforcement optional

Being able to store arbitrary data is of course a blessing when starting out, but further down the development cycle of an app you do want to be able to make sure that the database only allows writes of documents that have the properties you expect. For that reason, CouchDB supports optional schema enforcement. Sparing you a few details, all you have to do is provide CouchDB with a small JavaScript function that decides whether a document conforms to the expected standard or not:

function(doc) {
  if(!doc.name) {
    throw({
      "forbidden": "document must have a name property"
    });
  }
}

This function is run every time you attempt to write a new document or update an existing document in the database.

You can also load supporting libraries that do more declarative schema enforcement using JSON Schema, for example.

Changes, or "What happened since?"

Imagine a groupware application with a dashboard that includes a quick overview of what is currently happening. The dashboard needs information about what is currently going on, and it should update in real time as more information arrives. One of the more exciting features of CouchDB is the Changes Feed. Think of it as git log, but for your database. The CouchDB changes feed is a list of all documents in a database, sorted by most recent change. It is stored in a fast index structure and can efficiently answer the question "What happened since?" for any range of the database's history, be it from the beginning, or only the last 1000 changes made to the database. The changes feed is available over HTTP in a few different modes, and it enables some very interesting use cases:

1. Continuous mode: our dashboard can open a connection to the changes feed and CouchDB will just leave the connection open until a change occurs in the database. Then it sends a single line of JSON with information about the document that was just written to the dashboard. The dashboard can then update its internal data structures and end-user views to represent the new data. Other examples are email, where the changes feed could be used to implement email push, or push notifications for a mobile messaging app.

2. Continuous since: CouchDB understands itself as the main hub for your data. But it is a hub that is very easy to integrate with other pieces of software that want access to that data. On top of the information about new documents, document changes and deletions, the changes feed also includes a sequence number, a bit like an auto-increment integer that gets updated every time a change to the database occurs.


The changes feed is indexed by this sequence id, so asking CouchDB to send you the documents changed since the last time you talked to it is a very efficient operation. All you have to do is remember the last sequence id you received from CouchDB and use that as the since parameter for your next request. That way you can maintain a copy of the data in your database, for example for a backup, or a full-text search system, or whatever else you can envision (and people come up with the most remarkable things here), that allows for efficient delta updates, where you only need to request the data that changed since the last time you talked to CouchDB. In addition, this architecture is very resilient: in case the receiving end terminates or crashes for whatever reason, it can just pick up where it left off when it starts up again. Sequence ids are different from document ids or revision ids in that they are maintained per database and increased every time a change is made to that database.

3. The document state machine: another common pattern is to use a document to track the state of a multi-step operation, say sending an email. The steps could be: 1. the end-user initiates the email-send procedure; 2. the backend receives the user's intent and email details; 3. the sub-system responsible for sending email reserves the email for sending (so other parallel sub-systems don't send the email twice); 4. the sub-system responsible for sending the email attempts SMTP delivery; 5. the sub-system records the state (success or failure); 6. the frontend picks up the email send state and updates the user interface accordingly. All these discrete steps (and maybe there are more) can use a single document to ensure consistency of the operation (guaranteed sending, but no sending twice), and the changes feed can be used to loosely couple all the sub-systems required to perform the whole procedure. (Yes, this is a poor person's message queue, but for persistent queues it is not a bad one.) These are just a few examples of the things the changes feed enables.
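To make this concrete, here is a minimal sketch of querying the changes feed over HTTP; since, feed=continuous and include_docs are standard parameters, and the sequence value 42 is just an example.

# One-off request: everything that happened since sequence id 42.
curl 'http://127.0.0.1:5984/my_database/_changes?since=42'

# Continuous mode: the connection stays open and CouchDB streams one line of
# JSON per change as it happens - the dashboard example from above.
curl 'http://127.0.0.1:5984/my_database/_changes?feed=continuous&since=now&include_docs=true'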

Replication

We have already established that data synchronisation is what sets CouchDB apart from other databases. Let's now take some time to dive into how it all works.

To start, let's go back to how we write single documents:

curl -X PUT http://127.0.0.1:5984/my_database/my_document \
  --data-binary '{"name": "Dave Grohl", "age": 42}'

When we issue the above command, CouchDB replies with:

{
  "ok": true,
  "id": "my_document",
  "rev": "1-2e7eef663cf24b39ac342a6627ecb879"
}

What you see here is that the document in question was indeed received and committed to disk ("ok": true); we get back the id that we chose in the URL ("id":"my_document"; if this doesn't seem immediately useful to you, you can also POST a document to CouchDB and have its id auto-generated, then you'd see which one you got); and finally we get the revision of the document: "rev":"1-2e7eef663cf24b39ac342a6627ecb879"

A revision signifies a specific version of a document. In our case it is the first version (1-) plus a hash over the contents of that document. With every change to the document, we get a new revision. In order to make an update to our document, we must provide the existing revision to prove that we know what we are changing. So technically, the revision is an MVCC token that ensures that no client can accidentally overwrite any data they didn't mean to. Revisions are also what enable CouchDB replication. For example, during replication, if the target already has a document with the same id, CouchDB can compare revisions to see whether they are the same, whether they differ, or whether one revision is an ancestor of the other. But let's start at the beginning. Fundamentally, replication is an operation that involves a source database and a target database. The default mode for the operation is to take all documents that are in the source database and replicate them to the target


database. Once all documents from the source are on the target, replication stops; it is a unidirectional, one-off operation.

curl -X POST http://127.0.0.1:5984/_replicate \
  -d '{"source": "my_database", "target": "my_copy"}'

There are various other modes. If source and target have replicated before, replication is smart enough to only replicate the documents that were added, changed or deleted on the source since the last replication to the target (c.f. the changes feed). Replication can also be continuous: then it behaves like regular replication, but instead of stopping when it is done replicating all documents from the source, it keeps listening to the source database's changes feed (which we learned about above) for further documents, and replicates them as they appear in the feed. There are various further options, but the most important one here is filtered replication. It allows you to specify, again, a JavaScript function that gets to decide whether a document should be replicated or not. In fact, this function operates on the changes feed, so you can use it outside of replication as well. An example function that you would pass to CouchDB looks like this:

function(doc) {
  if(doc.type == 'horse') {
    return false;
  }
  return true;
}

This forbids documents that have a type of 'horse'. A type, by the way, is nothing CouchDB enforces out of the box, but it is a useful convention that many people have adopted. See validation functions above for how to enforce specific document formats. In a multiple-primary-databases situation, it would be nice if replication also worked backwards. And it does: we can simply start a second replication where we switch the source and the target, and CouchDB will do the right thing. CouchDB is smart enough not to keep replicating documents in a circle with this setup.
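As a hedged sketch of how such a filter might be wired up in practice: filter functions live in a design document and are referenced by "designdoc/filtername" from the replication request. The names _design/app and no_horses below are just examples, not anything the article prescribes.

# Store the filter function in a design document (names are illustrative).
curl -X PUT http://127.0.0.1:5984/my_database/_design/app \
  -H 'Content-Type: application/json' \
  -d '{"filters": {"no_horses": "function(doc, req) { return doc.type != \"horse\"; }"}}'

# Reference it when starting replication, so only documents that pass the
# filter are copied to the target.
curl -X POST http://127.0.0.1:5984/_replicate \
  -H 'Content-Type: application/json' \
  -d '{"source": "my_database", "target": "my_copy", "filter": "app/no_horses"}'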

Automatic conflict management

The avid reader is now sitting on the edge of their seat, biting their nails, bursting with the one big question that comes up when talking about systems with multiple primary databases: "What about conflicts?" What a relief, we finally get to ask this question! CouchDB prides itself on automatic conflict detection. It is powered by one more data structure that CouchDB maintains and that we haven't explored yet: the revision tree of each document. Instead of only storing the latest version of a document, CouchDB also stores a list of revisions (not the data, just the revision identifiers) in the order of their occurrence. When we now replicate a source and a target and there is a document on the target that has the same id as one of the documents in the source database, CouchDB can simply compare the revision trees to figure out whether the version on the target is the same as, or an ancestor of, the one on the source, and whether it can do a simple update or has to create a conflict. If CouchDB detects that a document is in a conflict state, it adds a new field to its JSON structure:

_conflicts: ['1-abc…', '1-def…']

In practice, CouchDB will keep both versions of the document and let us know which of the revisions are now in conflict. With that information, a client can go in and try to resolve the conflict by picking one of the two revisions and deleting the other, or by merging the two into a new, third, resolved version. It works very much like conflicts in version control systems, where you get the >>>>>>>>HEAD and <<<<<<<<VERSION markers that you have to resolve before continuing your work. In contrast to version control systems, where conflicts are marked up so that no compiler would accidentally accept them, CouchDB will arbitrarily but deterministically pick a winning revision that will be returned by default, should a client ask for the document. The determinism adds the


nice property that after a conflict replicates through a whole cluster, all nodes will respond with the same default revision, ensuring data consistency even in the face of conflicts. The conflict state is a natural state in CouchDB that is no more or less scary than any other, but it has to be dealt with, as keeping lots of conflicts around will make CouchDB less efficient over time. Client software is expected to add a basic conflict resolution mechanism. In theory, CouchDB could provide a few default cases here, but since how conflicts should be resolved depends on the application, the developers have shied away from this so far. The worst-case scenario is that applications and users lose data they previously assumed to be safe, and that is not something that fits with the philosophy of CouchDB. We have mentioned the changes feed and filter functions already. These allow you to create a real-time feed of all the conflicts that occur, so you can have the part of your application logic that deals with conflicts subscribe to it and handle conflicts as they come in. The filter function in question would look like this:

function(doc) {
  if(doc._conflicts) {
    // if the current document has a property _conflicts,
    // send it through the changes feed
    return true;
  } else {
    return false;
  }
}
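For completeness, a minimal sketch of one way a client might resolve a conflict by hand: fetch the document together with its conflicting revisions, keep the version you want, and delete the losing revision. The revision value below is illustrative.

# Fetch the document together with its conflicting revisions
# (the _conflicts array described above).
curl 'http://127.0.0.1:5984/my_database/my_document?conflicts=true'

# Resolve by deleting the losing revision (value is illustrative);
# the remaining revision becomes the undisputed winner.
curl -X DELETE \
  'http://127.0.0.1:5984/my_database/my_document?rev=1-def00000000000000000000000000000'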

Queries

A database that is just a key-value store is relatively straightforward to build, but it is limited in its applicability. To be useful to a wide variety of applications, it should support some mechanism for querying. Relational databases are entirely based around the SQL query model. In CouchDB, the querying system, called views, sits on top of the core database and makes use of the MapReduce paradigm and JavaScript functions to create indexes that provide access to the data in ways that are more useful to applications than the core data store alone. Using MapReduce here allows the

querying model to be clustered with the core data. More on that later. Querying CouchDB is a two-phase operation. First we create an index definition and second, we make requests against that index. As before, CouchDB allows you to write little JavaScript functions to do the work. Let's say we want a list of all documents in our database sorted by last name, the equivalent of SELECT lastname FROM people ORDER BY lastname in SQL. Assume we have the following documents in the database already:

{
  "_id": "fd3ca61a",
  "name": {"first": "Beyoncé", "last": "Knowles"},
  "birthday": "1981-09-04"
}
{
  "_id": "2c87bab4",
  "name": {"first": "Dave", "last": "Grohl"},
  "birthday": "1969-01-14"
}
{
  "_id": "0ce27d5f",
  "name": {"first": "Henry", "last": "Rollins"},
  "birthday": "1961-02-13"
}

View definitions are specified in special documents that CouchDB calls design documents. The only thing that makes them different from regular documents is that their id starts with _design/. To create an index that is sorted by last name, we need to write this JavaScript function:

function(doc) {
  emit(doc.name.last);
}

This function is called a map function, as it is run during the "Map" part of MapReduce, and we need to store it inside a design document under the name people/by_last_name. See the CouchDB documentation for the exact details. Now that everything is in place, we can query CouchDB:


curl http://127.0.0.1:5984/database/_design/people/_view/by_last_name

{"offset": 0, "rows": [
  {"key": "Grohl",   "value": null, "id": "2c87bab4"},
  {"key": "Knowles", "value": null, "id": "fd3ca61a"},
  {"key": "Rollins", "value": null, "id": "0ce27d5f"}
]}

We can now specify a multitude of query options to limit our result set to exactly what we need. The official documentation on views explains it all in painstaking detail: http://docs.couchdb.org/en/latest/couchapp/views/index.html. The "offset" is useful when paginating over a view result, but we won't cover this here, as the CouchDB documentation does a good job of explaining it.
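As a minimal sketch of how the pieces fit together: the map function above is stored under the standard "views" key of the design document, and the most common query options are passed as URL parameters. The database and view names match the running example; the key values are illustrative.

# Store the map function in the people design document.
curl -X PUT http://127.0.0.1:5984/database/_design/people \
  -d '{"views": {"by_last_name": {"map": "function(doc) { emit(doc.name.last); }"}}}'

# Typical query options: an exact key, and a key range with a row limit,
# with include_docs=true returning the full documents alongside the keys.
curl 'http://127.0.0.1:5984/database/_design/people/_view/by_last_name?key="Grohl"'
curl 'http://127.0.0.1:5984/database/_design/people/_view/by_last_name?startkey="K"&endkey="S"&limit=10&include_docs=true'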

Under the hood

What happens under the hood when you make that first HTTP request to a view after defining it in a design document is as follows. CouchDB's view engine notices that there is no index to answer this view yet, so it knows that it needs to create one. To do that, it opens the changes feed of the database that the view definition is stored in and reads the results from the changes feed one by one. For each result, it fetches the corresponding document from the database itself and applies our map function to it. Every time the emit(key, value) function is called, the view engine creates a key-value pair in the view's index. The index is stored in a file on disk that is separate from the main database file. When it is done building the index, it opens the index file, reads it from top to bottom and returns the result as we saw above. For every subsequent request to the view, the view engine can just open the index and read the result back to the client. Before it does, though, it checks, using the database's changes feed, whether there are any new updates or deletions in the database since the last time the index was queried, and if that's the case, the view engine incorporates these new changes into the index before returning the new index result to the client.

That means CouchDB indexes are built lazily, when they are queried, instead of when new data is inserted into the database, as is common in other database systems. The benefit here is twofold: 1. having many indexes doesn't slow down the insertion of new updates into the database, and 2. bulk-updating an index with new changes instead of one-by-one is a lot more space-, time- and computationally efficient. This also means that the very first query to a view that has been created on a database that already holds millions of documents can take a while to complete. This is the functional equivalent of a CREATE INDEX call in SQL.

Reduce

The above example only shows the "Map" part of MapReduce. The "Reduce" part allows you to do calculations on top of the result of the Map part. Reduce is purely optional: if the Map part does what you need, there is no need to bother with a Reduce. An easy example is the _count reduce, which simply counts all the rows of your view result, or your sub-selection of it. But more sophisticated things are possible. Assume these expense documents:

{ "_id": "85bf3910", "date": "2014-01-31", "amount": 20 }
{ "_id": "fia8japh", "date": "2014-01-31", "amount": 15 }
{ "_id": "peup0aec", "date": "2014-02-01", "amount": 25 }


{ "_id":"uvaivah6", "date": "2014-02-01", "amount": 35 } { "_id":"uthoit3i", "date": "2014-02-02", "amount": 75 } { "_id":"iumuzai4", "date": "2014-02-02", "amount": 10 } { "_id":"gei3tova", "date": "2014-02-02", "amount": 55 }

Imagine this map function and the built-in _sum reduce, which sums up a list of numeric values, like a SELECT SUM(amount) FROM table would in SQL:

function(doc) {
  emit(doc.date.split("-"), doc.amount);
}

The default result looks like this:

{"rows": [
  {"key": null, "value": 235}
]}

That is the total of all expenses together; so far so good. To make things more interesting, we can now apply the CouchDB view option group_level.

group_level=1:

{"rows": [
  {"key": ["2014"], "value": 235}
]}

Nothing different in the value, but we can see that the key is now populated and we are getting the total for all of 2014.

group_level=2:

{"rows": [
  {"key": ["2014","01"], "value": 35},
  {"key": ["2014","02"], "value": 200}
]}

With a group_level of 2 our results are grouped by the second element of the date array, i.e. we get the expenses grouped per month.

group_level=3:

{"rows": [
  {"key": ["2014","01","31"], "value": 35},
  {"key": ["2014","02","01"], "value": 60},
  {"key": ["2014","02","02"], "value": 140}
]}

With a group_level of 3 we get a grouping by the day of the expense. All the results above are created from the same index; they are all part of the index structure, with only a minimal part calculated at query time. Large collections of time-indexed data can be queried very efficiently from a single index and grouped in many ways.
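The requests behind the results above might look like the following sketch; the design document and view names (expenses/by_date) are illustrative, while group_level itself is a standard view query parameter.

# Assuming the map/_sum view is stored as _design/expenses/_view/by_date.
curl 'http://127.0.0.1:5984/database/_design/expenses/_view/by_date'                # grand total
curl 'http://127.0.0.1:5984/database/_design/expenses/_view/by_date?group_level=2'  # per month
curl 'http://127.0.0.1:5984/database/_design/expenses/_view/by_date?group_level=3'  # per day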

MapReduce and clustered queries

So far we have only looked at the case of a single CouchDB instance running on a single server, and you might be wondering why we are going through the hassle of learning MapReduce and how CouchDB uses it just to query some data. Remember back to the introduction of BigCouch, when we learned that while CouchDB before BigCouch/CouchDB 2.0 was a single-server system, all its features were carefully designed to retain their semantics should CouchDB be used in a clustered system. The reason CouchDB uses MapReduce queries is exactly that: it allows the execution of semantically equivalent queries on a single server and on a cluster of servers of any size. So whether one computer, 10 computers or 100 computers are used to produce a query result, the result is always the same.

Transactions

Another hot topic with databases is transactions. CouchDB does not support transactions in the common sense that you start one, do a bunch of operations, then end it, and only at the end do you know whether all operations succeeded or not. CouchDB's scope for operations is a single HTTP request that does a document read, a document write, or a view query, but there is no way to make these operations inter-dependent in any way. The reason is simple: as soon as replication kicks


in, which is after any of the individual requests has completed, there is no notion of a transaction, so replicating a set of documents as one piece is not something that CouchDB allows, and that's on purpose: writing such a system is very complex, error-prone, slow and not very scalable. Does that mean CouchDB cannot express what other databases do with transactions? Not quite: for the most common things, CouchDB can emulate transactions on top of its document storage and view semantics. For certain scenarios, a little more work on the client is required to get the same behaviour, but this exceeds the scope of this article. In 2007, Pat Helland, then Platform Architect at Microsoft and formerly of Amazon, wrote a seminal blog post titled "Accountants don't use erasers". The basic premise is that traditional computer science education treats transactions as the cornerstone of any financial application, while actual, real-world financial applications couldn't even function legally if they worked that way. In accounting, everything is written to a log. Want to transfer some money from A to B? It goes in the log. Your funds are sufficient and the receiving end exists? The money makes the move and a record goes in the log. Should your funds not suffice to complete the transfer, that new information also goes into the log. Instead of erasing the log entry that started the transfer, we add another one that records its failure state. If that weren't the case, auditing banks and other money trails would be near impossible. We can use that image to make transactions work in CouchDB. Instead of keeping a single document that holds the balance of a bank account, we simply record all the transactions that make up the balance. Consider these four documents:

{"amount": 200, "currency": "€"}
{"amount": -50, "currency": "€"}
{"amount": 150, "currency": "£", "conversion": 1.21}
{"amount": -100, "currency": "€"}

To get the balance, we create this map function:

function(doc) {
  var amount = doc.amount;
  if(doc.conversion) {
    amount = amount * doc.conversion;
  }
  // emit the (converted) amount as the value, so _sum can add it up
  emit(null, amount);
}

With a _sum reduce function and full grouping, we get this result:

{"rows": [
  {"key": null, "value": 231.5}
]}

Now, the way views work ensures that the result is always a consistent view of the balance. The index for a view lives outside of the main database file. In order to produce results that are consistent with the database, views use this procedure:

1. a request to a view is made;
2. the view engine looks up the current index and reads its sequence id;
3. the view engine asks the database engine to send all changes that happened since the sequence id that was recorded with the view index;
4. the view engine incorporates all document additions, changes and deletions into the view index and records the last sequence id;
5. the view engine returns the result of the view index request to the caller.

In addition, single document write, update or delete operations in CouchDB have ACID semantics. This isn't to claim that CouchDB is an ACID database, but the core storage operations adhere to the same principles, and views can give you a consistent view of your data, so as far as your application is concerned, CouchDB behaves like any other database you would expect. That way, CouchDB can guarantee that the view result is consistent with the database at the time the request was made. There are some options where you can trade result latency for accuracy, but for our transaction example, we use the default case.

Internals

This section is about CouchDB's internals. It explains how the various features in CouchDB are implemented and how the combination of them all makes for a resilient, fast and efficient database system.


Core data storage

CouchDB is a database management system: it can manage any number of logical databases. The only limitations are available disk space and file descriptors from the operating system. In fact, it is not uncommon to set up a database architecture with CouchDB where every single user gets their own database; CouchDB can handle the resulting potentially hundreds of thousands or millions of databases just fine. Each database is backed by a single file in the filesystem. All data that goes into the database goes into that file. Indexes for views use the same underlying storage mechanics, but each view gets its own file in the file system, separate from the main database file.

Both database files and view index files are operated in an append-only fashion. That means that any new data that comes into the database is appended to the end of the file. Even when documents are deleted, that information goes to the end of the file. The result is extreme resilience against data loss: once data has been committed to the file and that file has been flushed to the underlying disk hardware, CouchDB will never attempt to fully or partially overwrite that data. That means that in any error scenario (software crash, hardware crash, power outage, disk full etc.) CouchDB guarantees that previously committed data is still pristine. The only way problems can creep in is when the underlying disk or the file system corrupts the data, and even then CouchDB uses checksums to detect these errors. Data safety is one of the big design goals of CouchDB and the above design ensures a maximal degree of resilience. In addition, always operating at the end of the file allows the underlying storage layer to operate in large and convenient bulks without many seeks, and it turns out that this is the best-case scenario for both spinning disks and modern SSDs. The one trade-off that CouchDB makes here, in lieu of a more complex storage subsystem such as can be found in InnoDB, is the need for a process called compaction to reclaim disk space and clean up old document revisions. Compaction walks the changes feed of a database from the beginning and copies the most recent version of each document into a new file. Once done, it atomically swaps the old and the new file and then deletes the old file.
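A minimal sketch of triggering compaction by hand over HTTP; view indexes are compacted per design document, and the design document name below matches the earlier people example.

# Compact a database; CouchDB rewrites the file in the background and
# answers immediately with {"ok":true}.
curl -X POST http://127.0.0.1:5984/my_database/_compact \
  -H 'Content-Type: application/json'

# Compact the view index files belonging to the _design/people design document.
curl -X POST http://127.0.0.1:5984/my_database/_compact/people \
  -H 'Content-Type: application/json'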

B+-trees

One level up, both databases and view indexes use a B+-tree variant to manage the actual data storage. B+-trees are very wide and shallow: in CouchDB's case, a tree with a height of 3 can handle over 4 billion documents. There is no upper limit to the number of documents or the amount of data stored in a single one of CouchDB's B+-trees. The advantage of a wide tree is operational speed. The upper layers of the tree do not hold any actual user data (a function of the "+"-ness of the B-tree) and always fit in the file-system cache. So for any read or write, even with hundreds of billions of documents, CouchDB only needs a single disk seek to find the data for a document, or a place to write a new document.

Concurrency

CouchDB is implemented in Erlang, a functional programming language and virtual machine with a rich standard library that makes it easy to build large-scale, robust applications that support a high degree of concurrency. Erlang's heritage is the telecommunications industry, where a core application is telephone switches; to this day Erlang powers many major telecom phone and SMS exchanges. It turns out that the problems the telecommunications industry faced when it designed Erlang closely mirror the issues of the modern computing landscape: millions and billions of individual users, no possibility of maintenance windows for, e.g., software updates, and the requirement for extreme isolation: if a single user has an issue, it should not affect any of the other users using the system at the same time. Erlang is built to solve all of the above problems, and CouchDB makes full use of it.

JavaScript

CouchDB embraces JavaScript as a first-class language for in-database scripting tasks. It embeds Mozilla's SpiderMonkey engine. At the time of writing, there are a few experiments revolving around Google's V8 engine, as well as Node.js, as platforms for embedded scripting needs.


Plugins

CouchDB is extensible with a comprehensive plugin system that allows users to augment CouchDB's core features and semantics in any way they need. At this point, plugins need to be written in Erlang, and there are efforts underway to provide a common registry of plugins as well as a single-click installation process for end users. A plugin could be, for example, a secondary query engine like GeoCouch, a two-dimensional indexing and query engine that works much like views but is optimised for geo-spatial queries.

The Apache Software Foundation

CouchDB is developed, maintained and supported under the umbrella of the Apache Software Foundation (ASF). The ASF is an organisation with a strong focus on building healthy communities for producing Open Source software under the Apache 2.0 License. That means that CouchDB is available free of charge and can be used in conjunction with any commercial project. It also means that the development roadmap and management are not tied to any single corporation or person. A community of engineers, designers, documenters, project managers and community managers, who either work for companies that use or support CouchDB or work independently, works together in the open on the future of CouchDB. Anyone can follow the ups and downs of the project on the public mailing lists. As such, CouchDB is secured against any single vendor dominating the project, or a lock-in by a particular party with its own agenda. At the same time, the ASF provides a level playing field for commercial enterprises to enhance the larger CouchDB ecosystem on a strong, independent core.

Conclusion

CouchDB is different from all other databases, but it borrows a good number of well-known and well-understood paradigms to provide a unique feature set to its users. Core data storage in JSON, accessed over HTTP, makes for a modern data store with a flexible query engine. Replication allows data to live where it is needed, whether that is in a cluster spanning a number of data centres or on the phone of an end-user. Designed around data safety, CouchDB provides a very efficient data storage solution that is robust in the face of system faults, networking faults as well as sudden spikes in traffic. It is developed as an independent Open Source project under a liberal license.

Jan Lehnardt is the Vice President of Apache CouchDB at the Apache Software Foundation. He is the project's longest-standing contributor and started working on CouchDB in 2007. He's the co-author of CouchDB: The Definitive Guide. You can hire Jan and his team for your CouchDB support, consulting, training and services needs. He lives in Berlin and likes cycling, drumming and perfecting the craft of making pizza in his wood-burning oven. Apache CouchDB is developed by a dedicated team of contributors at the Apache Software Foundation. Follow him: @janl

Email: [email protected]

Hoodie: A CouchDB case-study - see hood.ie

Hoodie is an open source web framework that allows people with only minimal frontend web design skills to build fully featured web applications. Its core is a friendly API that encapsulates all backend complexities in an easy-to-use way. In a way, it does for frontend developers what Ruby on Rails did for backend developers: hide many recurring problems that applications have behind a common abstraction, in order to allow application builders to concentrate on what makes their apps special rather than re-inventing the millionth password-reset system. One of the core features of Hoodie is that it allows applications to function offline. That is, all end-user interaction only occurs with a client-side, in-browser database. Communication with a server happens asynchronously, and only when an internet connection is available and fast. This communication system is just CouchDB's replication mechanism, wrapped in a nice API; application developers don't need to worry about the details. Hoodie chose CouchDB for the backend specifically because it enables this Offline First design for applications. While the web version of Hoodie is furthest along, there are also ongoing efforts to port the frontend bits to iOS and Android for use in native apps. The Hoodie developers believe that Offline First application design is going to allow application developers to build superior user interaction experiences in the face of the slow rollout of mobile broadband, network contention in densely populated areas, or architecture that acts as a Faraday cage. While the technology to solve these problems exists, application developers are slow to adopt it, because it requires a rather massive re-thinking of how they build their apps. That is why Hoodie tries to reach user-experience experts, designers and frontend developers who otherwise couldn't build a full application, letting Hoodie take care of all their backend needs while giving them Offline First apps for free.


Clean Coding with Uncle Bob

Mark and Bob chew the cud over what it means to be a clean coder - and why it is important.

Mark: Hi Bob, thanks for coming to discuss your clean coding video series (see cleancoders.com). Perhaps we can chat about some of the themes of the first two episodes? Bob: Sure, that's fine.

Episode One

Mark: Bob, in the first episode of your clean coding series you discuss mainly the motivational issues behind adopting a clean coding approach - including how a supposed short-cut (short-termism in one's development approach) can turn out to be a long-cut, or worse, a complete dead end. Could you explain a little more about why you think this happens, where you've seen it, and what you think should be done about it?

Bob: Software development is a very young industry. It is young in two ways. First, the industry itself is only about 60 years old. Second, most software developers are less than 35 years old. The reason for that last statistic is that the population of software developers is doubling every decade or so, and the new folks entering the field are in their early 20s. What these two facts mean is that our industry hasn't had much time to understand itself and is composed of people who have relatively little experience. Of course this explains why our industry has no defined professional standards, nor any code of professional ethics. Those few programmers who have been in the industry long enough to form a reasonable set of professional standards and ethics are vastly outnumbered by the folks who have not. Therefore the only code of behavior we have is: To Rush.

It's easy to see why rushing is the norm. Programmers are expensive. In the US, the loaded rate for a programmer is $200K or more. So every line of deployed code costs a lot of money. This, in turn, leads managers and developers alike to try to increase the number of lines of code written per day; and the obvious way to do that is to rush. What managers and young programmers don't realize is that rushing slows them down by a huge factor. Older programmers have learned that they can write far more lines of code per day by taking their time and using a disciplined approach. It turns out that care is an accelerant, and rushing is a retardant. Of course all older professions know this, and know it well. Doctors, lawyers, mechanical engineers, electrical engineers, carpenters, bricklayers, you name it. These professions have learned, over the centuries, that the best way to go fast is to go well.

Mark: How in particular would you deal with the oft-stated argument that being first to market is everything?

Bob: I would remind everyone that Facebook wasn't first. Facebook wasn't even close to being first. Chrome wasn't first either. Minecraft wasn't first. Microsoft wasn't first. IBM wasn't first. Everybody thinks that being first is critical. It's not. Being better is. But the very question is based on the premise that getting to market first requires rushing. This is absurd. If you rush you will slow down. If you adopt a careful and disciplined approach, you will go fast. The way to get to market first is to do a good job. I completely reject the attitude that is prevalent in startup companies, which is, if I may paraphrase, "Make a mess, and make a million." That's nuts. While I agree that speed may be especially important to a startup, I also stress that speed is attained through care and discipline, not through overtime and rushing.


Episode Two

Mark: Episode 2 covers naming - naming of classes, variables, functions, enums, etc. In some respects it is quite surprising that you can get so much mileage, and have so many points to make, about such a seemingly 'simple' issue. It struck me that underpinning this whole section is the desire that the code should be self-evident - self-documenting - in what it does?

Bob: In the first decades of our industry we believed that code was so cryptic that we needed lots of external documentation to explain it. And we were right, because in those days code was cryptic. Indeed the words "code" and "cryptic" are synonyms. We called it code because it was a code. Back then names were limited in size. Fortran allowed names of 6 characters. PL/1 allowed 8. I worked on an assembler once that had a limit of 4. Basic had a limit of one letter and one number! So programs were codes indeed. When you are writing in a code you need a key to decipher that code. That key was comments and documents. We strongly urged programmers to spend significant amounts of time writing both. Without them, understanding the code was nearly impossible.

But things have changed. Our languages are far more expressive than they used to be. One of the factors that has increased expressiveness is that the limits on the length of names have been removed. This new expressiveness allows us to change our emphasis away from comments and documents and towards writing programs that are able to clearly express themselves. One of the most important disciplines in that change of emphasis is naming. Several years ago Tim Ottinger wrote a document on the Object Mentor website. It was simply known as Ottinger's naming rules. Of all the documents on our site, Tim's was the most downloaded. Chapter 2 of the Clean Code book is merely an expansion of that paper. Nowadays we name lots of things. We name files, directories, functions, variables, arguments, types, data structures, classes, namespaces, etc. Since we do so much of it, we probably ought to get good at it. Apart from the obvious role of names in self-documenting code, it turns out that choosing the right names is one of the best ways to help you design and partition the program properly. When naming something is hard, it's usually because you haven't broken the problem down properly.

Mark: Why haven't (the many) other naming conventions addressed this to the correct degree, in your opinion? Why did they get it wrong? Was the motivation different? If so, why?

Bob: Naming conventions are constrained and impacted by technology. Consider Hungarian notation (so called because it was invented by Charles Simonyi, the Hungarian chief architect at Microsoft during the DOS era). Charles invented a naming convention for C that encoded type information into the names of variables. So the variable pszName was a pointer to a zero-terminated string, and it represented the name of something. Why was this necessary? In the 1980s we didn't have IDEs. Our editors would not tell us the types of variables. And our compilers did not enforce those types. Back then, if you declared a function to take an integer argument, but you passed a floating point number to that function, the compiler would not report that as an error. The result was a runtime failure, and usually one that was very hard to debug. Hungarian notation was the obvious self-defense for the inadequate tooling of the day.

Nowadays, however, our IDEs report type information just by hovering over a name. Our compilers reject type conflicts (in statically typed languages). And so the old Hungarian notation is both redundant and obfuscating. It has changed from an asset to a liability. It has rightly been dropped by almost everyone - except, of course, for that damned 'I' in front of interface names. Just as an aside, why in the world do we loudly advertise our interfaces? The whole point of an interface is that you aren't supposed to know it is an interface. If we have to adorn some class with a prefix or a suffix, wouldn't it be better to adorn the concrete derivative - the class that no one is supposed to know about?
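A small, hedged illustration of the point: the first declaration encodes type information the tooling already knows (in the style of the pszName example above, transplanted into Java), while the second simply says what the value means. The names are invented for the example.

class CustomerRecord {
    // Hungarian-style: 'str' encodes the type, which the compiler and IDE
    // already enforce and display, so the prefix is redundant noise.
    String strCustomerName;

    // Intention-revealing: the name says what the value means and nothing more.
    String customerName;
}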


Mark: So can source code really be self-documenting - without the need for comments - do you believe?

Bob: No, not entirely. Nowadays code can be remarkably expressive, but it is not a human language; there are concepts that it cannot adequately express. For those, we should use comments. However, our attitude towards comments nowadays should be negative. Every comment we write represents a failure to express ourselves in code. Indeed, the only time we should write a comment is when we have exhausted all attempts to make the code speak for itself. Comments are failures, not successes. We should not congratulate ourselves for writing comprehensive comments. Rather, we should consider every comment to be a liability, and remove all but the utterly necessary ones. But what about Javadocs, or the documentation of public APIs? The same rule applies. The best public API is one that doesn't need a document. If, however, the API is so obscure that your users cannot divine its behavior from its name, argument list, and class, then by all means write a comment. But know that you have failed.

Mark: For clarity: in what way is a variable name that shows its intent better than a comment that also shows its intent - why is one better than the other?

Bob: Two ways. First, comments aren't compiled, and so they rot. Consider a variable named gallonsOfGas. Let's say there's a comment somewhere that refers to this variable. Now let's say we change the name of the variable to litersOfFuel. There's a real good chance that comment won't get changed, because the person who changed the variable name didn't see it. Of course that was just the most obvious way that comments rot. Comments rot for all kinds of reasons that we don't need to go into here. Just remember: the older a comment is, the greater the chance that it's lying to you.

The second way that a good variable name is superior to a comment is that the descriptive nature of the name appears throughout the code, not just where the variable is defined. For example, let's say we declare the following variable:

Date d; // publication date

Now scattered throughout the code are lines that use the variable 'd'. But the only place that tells us that it's a publication date is the comment in the declaration. If we change the name of the variable to 'publicationDate' then the meaning of the variable is shown everywhere the variable is used.
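Bob's two examples can be put side by side in a short sketch (the surrounding class is added only to make the fragment compile):

import java.util.Date;

class CommentRotExamples {
    // Comment-based: the meaning lives only here, at the declaration...
    Date d;                   // publication date

    // ...whereas a good name carries the meaning to every line that uses it.
    Date publicationDate;

    // Comment rot in action: the variable was renamed from gallonsOfGas, but
    // the comment was not updated, so it now lies to the reader.
    double litersOfFuel;      // gallons of gas
}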

Mark: It's interesting that you say that interfaces should not - for example - have the 'I' prefix (e.g. IAccount as the interface to an account), stating that the user of a class need not know whether it is abstract or concrete. This idea of hiding, or removing, the need to look into unnecessary detail and presenting names (function, method, ...) from the class user's perspective seems to be a common theme. Is that right?

Bob: I would strengthen your statement. The user of a class should not know whether that class is an interface, an abstract class, or a concrete class. That information is none of the user's business. Indeed, the author of the class may want to change that class from one to the other. So why would we ever advertise this in the name? That 'I' is just awful. Of course it is a common theme to hide information. We call it 'information hiding'.


It was a concept described more than forty years ago by David Parnas. In short: don't let any part of a program know anything it doesn't need to know. Do you need to know that a class is an interface in order to use it? No? Then drop that dumb 'I'. Here's a little experiment you can run. Walk into a modern software development shop and ask the programmers who David Parnas is. How much do you want to bet that no one will know? Why is David Parnas, the father of information hiding, the father of formal acceptance testing, the father of much of modern software design, unknown to most programmers?
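Taking the IAccount example from the question above, a minimal sketch of the two styles might look like this; the concrete class name is invented for illustration.

// Advertising the interface: every caller is told something that is none of its business.
interface IAccount {
    void deposit(long amountInCents);
}

// Hiding the information: callers depend on Account, and the adornment (if any)
// goes on the concrete derivative that no caller is supposed to know about.
interface Account {
    void deposit(long amountInCents);
}

class DatabaseAccount implements Account {
    private long balanceInCents;

    public void deposit(long amountInCents) {
        balanceInCents += amountInCents;
    }
}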


Mark: A final point on Episode 2 - you say that variable names should be shorter if their scope is very limited, but that method names should be longer in the same context. Could you briefly explain this apparent contradiction?

Bob: Consider the following code:

for (int i = 0; i < someLimit; i++)
    processElement(i);

There are two lines of code that know about the variable 'i'. Someone reading this program only has to remember the name 'i' for two lines of code. That's not hard to do. Indeed, the shorter the variable name, the easier that name is to remember. For example:

for (int indexOfElementToBeProcessed = 0;
     indexOfElementToBeProcessed < someLimit;
     indexOfElementToBeProcessed++)
    processElement(indexOfElementToBeProcessed);

This is much harder to read because the poor reader is forced to read through the variable name four times, and make sure that all four of those uses are the same. The annoyance involved can be so distracting that the poor reader misses the intent of the code because they are so busy being angry about the stupid variable name. On the other hand, a variable with a very long scope - like the global scope - should have a long name so that it can remind the reader what it is used for. The variable indexOfItemToBeProcessed is a perfectly acceptable variable if that variable is global, or widely known through a large class hierarchy.

Oddly, the rule is the exact opposite for functions, and for a remarkable reason. The more widely a function is known, the more general purpose that function is. Think about that; it's obviously true. The more widely it is known, the more users it will have. The more users, the more diverse those users, and the more general the function. When something is general, it has very few qualifiers. That's what general means. And in function names, qualifiers are adjectives. So, for example, the function named openBinaryFile is qualified by both the words binary and file. The scope of this function is reduced to just that code that needs to open binary files. Whereas the function named 'open' is much more general. It is not restricted to binary files. Indeed, it is not even restricted to files. So the larger the scope of a function, the more general that function will be, and the shorter its name must be.

Mark: Bob, thanks for spending the time with ObjectiveView today.

Bob: My pleasure.
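As a closing illustration of that last rule, here is a minimal, hedged sketch: the widely known operation carries the short, general name, while the narrowly known helper carries the qualifiers. The DataSource and BinaryFileSource names are invented for the example and do not come from the interview.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Widely known and completely general: the short name 'open' is not restricted
// to binary data, or even to files.
interface DataSource {
    InputStream open() throws IOException;
}

class BinaryFileSource implements DataSource {

    private final Path path;

    BinaryFileSource(Path path) {
        this.path = path;
    }

    public InputStream open() throws IOException {
        return openBinaryFile(path);
    }

    // Known only inside this class, so the qualifiers 'binary' and 'file'
    // belong in the name.
    private static InputStream openBinaryFile(Path path) throws IOException {
        return Files.newInputStream(path);
    }
}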

See the videos at http://cleancoders.com. Follow him: @unclebobmartin.

Robert Martin, known colloquially as "Uncle Bob", is an American software consultant and author. Martin has been a software professional since 1970 and an international software consultant since 1990. In 2001, he initiated the meeting of the group that created agile software development from extreme programming techniques. He is also a leading member of the software craftsmanship movement.