Quinton Anderson's Blog

software engineering, mostly

View on GitHub

Welcome to My GitHub Pages.

Welcome to my software pages; well, it's more like a single large page with lots of entries... I am a software engineer, and depending on the day I either think I am getting good at this, or that I know nothing. That doesn't stop me from writing software and writing about software. I will continue to grow this knowledge base in the hope that you find it useful.

Please take a look at some of my code:

You can also check out my LinkedIn profile or follow me on Twitter. I will be adding to my SlideShare in the near future.


Tutorial: Scalding REPL and Emacs

In this tutorial I will show you how to use the Scalding REPL and Emacs together to create a very productive environment for exploratory work with Scalding. The first question will be: why? Explicitly for cases where your dev environment is too "far" from the cluster and you need to interact with the data in order to build out your pipeline. In these cases you can't use an IDE; you need something deployed in your cluster, preferably lightweight, not requiring remote X sessions or anything brittle and crazy. Here a REPL is ideal, but a REPL on its own lacks a lot of functionality... A REPL is great for basic exploration of a language, but it isn't great for building up functionality in a way that is persistent. You need source files and a REPL, and you need to be able to move source between the two in an interactive way. In walks essh.

Emacs

Follow the instructions here, and install essh.

I then also add the following shortcuts to the global keymap:

(global-set-key (kbd "C-c C-v") 'pipe-line-to-shell)
(global-set-key (kbd "C-c C-j") 'pipe-region-to-shell)
(global-set-key (kbd "C-z") 'undo)

Scalding REPL

To use the Scalding REPL, follow the instructions that can be found here

Interactive workflow

The first step is to set up a split window: one half with the source files, and the other half with a bash session. Start by splitting the window with C-x 2, then select a window using C-x o. You can then open a bash session using M-x shell. In the bash session, open up the Scalding REPL. You can then "push" code to the REPL and develop a great interactive workflow from a remote SSH session.

When in the code window (files can be opened using C-x C-f or browsed using C-x d), simply execute the current line in the REPL using C-c C-v, or select a block using C-SPC and send it to the REPL using C-c C-j. It takes a little getting used to, but you will quickly come to appreciate the power of this interactive coding mode.
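To make this concrete, here is a small word-count sketch (my own example, assuming a hypothetical input file at data/input.txt rather than anything from a real cluster) of the kind of code you would build up in the source window and then push to the REPL line by line with C-c C-v, or block by block with C-c C-j:

import com.twitter.scalding._

// read lines, split them into words and count the occurrences of each word
val words = TypedPipe.from(TextLine("data/input.txt"))
  .flatMap(line => line.split("\\s+"))
  .filter(word => word.nonEmpty)

val counts = words
  .map(word => (word.toLowerCase, 1L))
  .sumByKey

Because each value stays bound in the REPL session, you can keep refining a pipe interactively and inspect intermediate results as you go.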


The Storm Cookbook

For those interested in Storm, I encourage you to purchase my book on Storm.

For those who have found their way here looking for the code bases from the Storm cookbook:


Strange Cycles in technical approach

Something that has always bothered me in IT is that we seem to end up doing things that have already been done in previous generations, and then branding them as new and innovative. Much like the fashion world, something is cool until it isn't, and then, just as people have forgotten about it, someone reintroduces it as the latest thing and it gets a second chance.

I was just watching some very good talks about architecture on InfoQ, the first one about the event sourcing pattern, the second relating to data architecture and the single system of record. In the case of event sourcing, Martin Thompson of LMAX was talking about their overall architecture. He was extremely upfront in saying that this approach isn't new at all; he had some interesting examples dating all the way back to the 1960s, where on that hardware they were achieving over 2,000 transactions per second. Most systems in this day and age, on massive hardware, can't achieve that. LMAX have adopted these approaches and are able to process extremely complex transactions, millions per second, on fairly low cost hardware, largely due to the event sourcing approach. Somewhere between then and now we "forgot" about this, built massively complex databases, spent huge amounts of money and reduced performance.

In the case of data, we went from a largely federated model, where a plurality of systems was standard, to one where a single system of record became popular, only to discover that it was brittle and created unintended couplings. So we are now discussing ways to go back to what we used to have.

Our industry is littered with these sorts of examples. Just think of your own experiences in IT: how many of the things you are doing now did you actually do over 10 years ago, in a slightly different way or by a different name?

I often wonder how much of what we invest our time and money in is simply a case of ignorance and a lack of critical thinking. Thick glasses, shoulder pads and permed hair, anyone?


Inventing the MVP from first principles

I was reading a blog discussion about Vagrant today, and there is a comment there that I found completely classic, from a guy by the name of Brandon Bloom:

“Forgive me for derailing the conversation, but I’ve been following this thread about the Cathedral & the Bazaar: http://news.ycombinator.com/item?id=4407188

In particular, I’ve found the discussion of autoconf to be fascinating. I think what fundamentally bothers me about Vagrant is that it feels like autoconf. By that, I mean it’s a system which 1) hides a ton of hard work that users don’t need to do to target multiple backends 2) accomplishes that hiding through complexity 3) further hides that complexity through an “easy” interface 4) generally just f***ing works most of the time 5) is completely maddening to deal with when it does break 6) ideally shouldn’t need to exist.

Anyway, I realize that ideals != reality. I also realize that I’ve got a learned behavior of reinventing the minimal viable tool from fundamentals to avoid dealing with the complexity of problems that I don’t have. And furthermore, I realize that my learned behavior is of limited applicability outside of narrow environments, such as my own startup where I own the full stack and make all the rules.”

Of course I find it classic because I often go through the process of re-inventing the MVP myself.


Hacking Vs Quality Validated Learning

On occasion I teach a certification from the Software Engineering Institute called the Personal Software Process (PSP). The fundamental concept put across is that, as software developers, we can apply good engineering practice to the way we develop software. This involves identifying measures that tell us something about the way we work; we can then use these measures to improve and ultimately become much better at what we do.

The approach definitely works in the class setting. The class learns how to measure their time, quality, schedule and the size of their outputs. The class does this by writing some small applications, and we collect the class data as we go with some helpful tooling (Process Dashboard, http://www.processdash.com/). The code quality always improves, drastically. The estimation variance always comes down, and the class always leaves with an appreciation of the fact that we can apply engineering practice.

The issue is always that these kinds of disciplines are extremely difficult to apply in a real world setting, partly because of ingrained habits, partly because of organizational and business drivers (I will post more on this soon), and partly because of poor change management. I will also shortly post a story of success and failure on this front, from when I tried to implement this approach at a large financial institution.

The focus of this post comes from a common question/criticism I receive around the process, which speaks to a more general understanding of quality versus hacking, driven by complexity and uncertainty.

The common comment in the class setting is that, when embarking on the prescribed estimation process, we need to know with a great level of certainty how we are going to implement the given solution. This is fine in the class setting, but in the real world we are often faced with problems that contain too much complexity or uncertainty. This can stem from poor requirements (obviously), poor domain understanding or even new technology. In these kinds of situations statistical estimation techniques fall apart because we don't have any good proxies on which to base our estimates. The default here is to fall back into hacking or prototyping as a means of learning, and so all good practice, measurement and quality goes out the window.

What I would like to put forward is that quality is still key in these kinds of situations for the following reasons:

In the lean manufacturing world they put forward the following premise: any activity that does not result in validated learning is waste, where validated learning typically means learning from actual customer behavior. This implies that we set up experiments with the explicit goal of discovering which of our assumptions are correct and which aren't. The more often we do this, the quicker we learn, improve, deliver and make more money. Now validated learning doesn't only come from customers; it can come from internal customers (operational users, etc...) or whichever party is the ultimate consumer of the thing you are trying to build. The agile community tries to embrace this concept too: reduce the time to learning.

What this means is that we have to apply the scientific method:

We can easily do this within software, but it means a slight mindset change from "I am going to build X" to "I am going to learn Y through X by exposing it to Z". E.g. I am going to build this UI and get the user to play with it ASAP. The key point for me is that this still requires process discipline (follow the scientific method, and measure) and must include quality as a key consideration, so that we get quality learnings and/or products.


Transaction Processing: Key concepts and considerations

Introduction

Given the extreme requirements on business solutions these days, the enterprise software ecosystem has become quite large and complex. I am always interested to see how technology divisions make their technology choices, from very strict process all the way through to pure emotion. There are roles that are supposed to be able to guide these kinds of processes (such as Chief Technology Officer) and there are many good commercial stacks available, from the likes of Oracle and IBM. As a technologist sitting on my Mac at home, I don't have the money or infrastructure to even evaluate most (if not all) of these stacks. And so I can either simply hope to gain key insight into building enterprise solutions in my day-to-day work, or I need to take another approach. Now, obviously I get exposed to these kinds of stacks in my day to day, but that isn't really the best place to learn, experiment and grow beyond the constraints of the current technology choices and project demands.

Unlike a few years ago, however, I can still experiment, learn and grow, thanks to open source and readily available cloud infrastructure such as Amazon EC2. I am not pushing Amazon, or open source. The economics of open source versus commercial software are well understood and widely documented, there are many drivers that may push you in either direction, and I believe that choice should be made in a mature, structured manner for your organization. But for my personal purposes I need a complete open source enterprise stack to play with.

With that out of the way, what is an enterprise stack? Well, that really depends on the particular industry, company and requirements. So for the purposes of this discussion I will constrain things to financial services, particularly a traditional bank. We can then suppose that, at the very least, the following list of open source solutions would need to be provisioned, integrated and managed in order to deliver some basic capabilities to the business units and customers that need them. For the purposes of this article we will also ignore COTS systems (pre-built payments systems, for example). We will ignore COTS firstly because the purpose of the article is to discuss building, not configuring, and secondly because there may just be quite a trend in the future towards build and away from buy (despite the current commonly accepted wisdom). Ok, so the open source capabilities:

And the list could go on for much longer.

What has always struck me is that the key element within this stack is always the least understood: core transaction processing. Everything hinges on a strong core transactional capability; everything else is there to provide operational support. But the key element is the ability to process transactions deterministically, at scale, and enable the core value propositions.

The purpose of this blog entry is to put forward my thoughts on what is required of a core transaction engine and what concepts need to be considered before defining its requirements or design. The bulk of the post will deal with technology-agnostic concepts, but I will of course put forward some open source options to consider later.

Definitions

Ok, before we dive into the details, let's just clear up the most important and most overloaded term...

What is a transaction?

Put simply, a transaction is a unit of related work. This is very vague, which is fine because it lets us tailor the meaning, depending on what we are doing. So, here are some transactions:

What is important to note here is that a transaction's boundaries vary from case to case, and many transactions can make up a transaction at a higher level of abstraction. E.g. in the case of a payment, the business-level transaction is potentially made up of the following "sub" transactions: the posting to the debit account, the posting to the credit account and finally the acknowledgement across whatever channel instructed the payment.
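As a purely illustrative sketch (the type and field names below are my own, not a prescribed model), that payment example can be seen as a business-level transaction composed of lower-level sub-transactions:

sealed trait SubTransaction
case class Posting(account: String, amount: BigDecimal) extends SubTransaction
case class Acknowledgement(channel: String, reference: String) extends SubTransaction

// a business-level payment made up of "sub" transactions
case class Payment(debit: Posting, credit: Posting, ack: Acknowledgement)

val payment = Payment(
  debit  = Posting("ACC-001", BigDecimal(-100)),          // posting to the debit account
  credit = Posting("ACC-002", BigDecimal(100)),           // posting to the credit account
  ack    = Acknowledgement("internet-banking", "PAY-42")  // acknowledgement back over the instructing channel
)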

So for the purposes of this post I will use the following convention:

Concepts to Consider

There are many things that one needs to consider when approaching transaction processing. The first and obvious one is a solid understanding of the particular problem domain you are working in. I don't want to harp on about requirements management, customer affinity, agility and domain knowledge here; suffice it to say that if you don't understand what you are trying to solve, you will end up with the wrong solution.

With a clear understanding in mind, the key issues to consider next are the functional requirements, data consistency and the non-functional requirements. The non-functional requirements are often overlooked, but here we need to understand the actual throughput, latency and availability requirements of the system (among other non-functionals).

Fact: The free ride is over

Despite Moore's law, the humble processor and hard disk can't keep up with the ever increasing demand on systems. We can no longer think of solution design in terms of single systems; we have to think in terms of distributed systems and massive concurrency. This is not only a factual constraint, it is also an economic consideration. Is Oracle RAC really required, or would a cluster of low cost hardware achieve more?

The CAP Theorem

The CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

  1. Consistency: all nodes see the same data at the same time (implies ACID-type semantics)
  2. Availability: guarantees that every request receives a response
  3. Partition tolerance: the system continues to function in the event of partial system failure or message loss

Popular approaches up until now have favored consistency; in fact they have made it an absolute focus, extending to trying to guarantee that no data can ever be lost or inconsistent. In practice such guarantees are impossible.

Essentially what the theorem means is that we have to sacrifice something if we want to achieve real scale and high availability.

ACID vs BASE

Ok, time for a little bit of technical transaction theory.

ACID stands for Atomic, Consistent, Isolated and Durable. This means that a transaction must be fully committed or fully rolled back; it can't be anywhere in between. It also means that transactions run independently of each other, and that once the technical transaction has committed, all copies of the data are consistent. This is the standard model we are used to as software developers, especially those who use relational databases and technologies like Java. Concepts like XA and two-phase commit are ACID implementations of transaction management.
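As a minimal sketch of these semantics at the single-database level (my own illustration using plain JDBC; the connection URL and the postings table are assumptions), a transfer either commits both postings or rolls both back:

import java.sql.{Connection, DriverManager}

val conn: Connection = DriverManager.getConnection("jdbc:postgresql://localhost/bank")
conn.setAutoCommit(false)  // start an explicit transaction

try {
  val stmt = conn.prepareStatement("insert into postings (account, amount) values (?, ?)")
  stmt.setString(1, "ACC-001")
  stmt.setBigDecimal(2, java.math.BigDecimal.valueOf(-100))
  stmt.executeUpdate()
  stmt.setString(1, "ACC-002")
  stmt.setBigDecimal(2, java.math.BigDecimal.valueOf(100))
  stmt.executeUpdate()
  conn.commit()     // atomic: both postings become visible together
} catch {
  case e: Exception =>
    conn.rollback() // atomic: neither posting is applied
    throw e
} finally {
  conn.close()
}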

XA and related technologies make our lives quite easy at design time, largely because of the number of mature frameworks available to achieve these things (the implementations themselves are mostly not simple at all, in fact quite complex). There are, however, a few inherent issues with these kinds of approaches:

It is largely for these reasons that typical enterprise stacks achieve only hundreds of transactions per second, despite massive investments in hardware. There are ways to achieve ACID semantics on technical transactions where the throughput is in the thousands or even millions of transactions per second, but more on that later on.

BASE is a nicely chosen term, coined to give us the old acid-versus-base comparison from chemistry. BASE in this context stands for Basically Available, Soft state, Eventually consistent. This represents a mindset change on a few fronts. Firstly, there is no concept of a rollback; it simply doesn't exist, which means that if something didn't happen we don't find out about it immediately in the way we would with, say, XA. This means that we have to compensate for failures in other ways, such as compensating transactions. These are created specifically for the business scenarios that matter. This sounds onerous, but in practice it isn't, because the technologies are built to be highly available and so these cases are rare; more importantly, we only deal with the few failures pessimistically. Whereas XA takes a pessimistic approach on every transaction (causing all that overhead), here we are forced to do so only when required, giving us huge performance gains. The second mindset change is that of eventual consistency. But before we dive into the consistency discussion, let's just talk about state quickly...
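To make the compensating-transaction idea concrete, here is a minimal sketch (my own illustration; the Posting type and the applyPosting function are hypothetical, not a real API). Instead of rolling back, a failure after a partial success is handled by issuing a new transaction that reverses the business effect of the step that already went through:

case class Posting(account: String, amount: BigDecimal)

// each posting is applied independently; there is no distributed rollback available
def applyPosting(posting: Posting): Boolean = {
  // send the posting to the ledger and report success or failure
  true
}

def transfer(from: String, to: String, amount: BigDecimal): Unit = {
  if (applyPosting(Posting(from, -amount))) {
    if (!applyPosting(Posting(to, amount))) {
      // the credit failed after the debit succeeded: issue a compensating
      // transaction that reverses the debit, rather than relying on rollback
      applyPosting(Posting(from, amount))
    }
  }
}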

State

What is state?

I often interview candidates for development positions and try to phrase questions around state to gauge their conceptual understanding in this space. There are so many different perspectives, largely because developers have come across the term in so many different contexts. In the old EJB days people discussed stateless and stateful session beans, people discuss state at a POJO level (and terms like immutability get thrown in), functional programmers love talking about immutable object exchange and the resulting gains, and when we start talking about long running concerns we have put the state into a file or a queue or a database.

So what is state? State is all of those things, or more correctly all of those things are state. State is essentially a piece of information that is left behind after a transaction has completed, with the intention that this information gets used in later transactions or other functionality (we could argue that all other functionality is a type of transaction, but I won't do that here).

It is the class member variable.

It is the database table entry.

It is the memento on the disk.

Etc...

So?

So, why do we have to be so concerned about state? Well, firstly, from a design principles perspective, encapsulation is about hiding implementation and state, so we have to understand what state is before we can achieve encapsulation, and without that we can't achieve low coupling. Secondly, we need to understand how state is accessed and updated, because this affects how we scale. Finally, we need to consider the consistency of the state: when changes occur in our state, who needs them, and by when?

Is consistency really required?

The term eventual consistency usually engenders some level of fear among technical and business users alike. How can different parts of our system have different versions of the data? Let's go through a use case and apply some critical thinking to it.

Let's start with user interactions, specifically when using a thick or rich client application in which the presentation is performed on the client machine or browser (AJAX, Flex/Flash, etc...). There are typically two interaction patterns between the client and server, namely synchronous and asynchronous. I would like to focus on asynchronous, given that synchronous isn't the best approach, but the point is applicable in either case. Consider the following diagram:

This is an extremely simplified example of how data might flow through some potential layers. The data will only exist on a single data node until the replication of said data has completed to the other data nodes, meaning that the system is in an inconsistent state for some period of time. Let's now understand the impact of that. The customer, being the important player in this scenario, isn't affected at all, because consistency is in place within the context of the session. The data is then going to be used by the Management Information System (MIS) for management reporting and by the FI system to post to the ledger. But neither of these tasks requires that the data become consistent immediately; when the data becomes available to them on their data nodes, it will be visible in the books and the management reports. So consistency doesn't really matter there.

There are of course some use cases where eventual consistency simply can't be tolerated, such as trading platforms or stock exchanges, but it must be noted again that the consistency is only required within the stateful component making the decisions. Other consumers of the data, such as billing and reporting, can consume the data eventually, in an appropriate structure.

Command Query Responsibility Segregation

I won't go into too much detail here, because people like Martin Fowler have done a much better job than I could at explaining it (http://martinfowler.com/bliki/CQRS.html). However, the key point for me is that we really need to consider the purpose of the model we are building, and not assume that we have to have one large shared model, especially when we take the eventual consistency considerations into account.

Just the right amount of consistency

Obviously you can get varying levels of consistency. Consistency in this landscape is also a very subjective thing, depending on where I am and what I am doing. Most modern cluster technologies will allow you to choose the level of consistency you would like to achieve for a given action:

Similar concepts can be applied to writes. The important point here is that we can tune our consistency, but whether we should or not depends on our problem domain, a keen understanding of it, and our architecture.
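As a minimal sketch of the arithmetic behind this kind of tuning (my own illustration of the standard quorum rule, not tied to any particular product): with a replication factor of n, a write quorum of w and a read quorum of r, a read is guaranteed to overlap the latest successful write whenever w + r > n.

def readsOverlapLatestWrite(n: Int, w: Int, r: Int): Boolean =
  w + r > n  // at least one replica in the read set has acknowledged the latest write

readsOverlapLatestWrite(n = 3, w = 2, r = 2)  // true  -> effectively consistent reads
readsOverlapLatestWrite(n = 3, w = 1, r = 1)  // false -> eventually consistent, but lower latency

Lowering w or r buys latency and availability at the cost of consistency, which is exactly the trade-off the problem domain has to drive.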

Concurrency considerations

How can we achieve scale?

Scale is achieved either through vertical or horizontal scaling. Vertical scaling involves deploying more processing on a single node, more threads or instances of a service; often this essentially means buying bigger hardware. Horizontal scaling involves deploying processes across many nodes in a cluster. This is achieved in two ways, depending on the state and the transactional semantics applied to it. Firstly, we can do a functional segregation, meaning that we divide the problem domain into portions and assign different portions to different nodes; think partitioning in the database world. This of course doesn't allow us to scale linearly (where, as load increases, you just add nodes and two nodes process twice as much as one). The second approach is to deploy the same functionality across many nodes, but this brings its own set of problems, and this is where we need to start being clever.
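As a minimal sketch of the functional-segregation idea (my own illustration; the Txn type and the account-based key are assumptions), each transaction is routed to a node based on a partition key, so unrelated keys can be processed in parallel while everything for one key stays on one node:

case class Txn(accountId: String, amount: BigDecimal)

// map a transaction to one of nodeCount nodes based on its partition key
def partitionFor(txn: Txn, nodeCount: Int): Int =
  math.abs(txn.accountId.hashCode % nodeCount)

val txns = Seq(Txn("ACC-1", BigDecimal(10)), Txn("ACC-2", BigDecimal(-30)), Txn("ACC-1", BigDecimal(5)))
val byNode = txns.groupBy(t => partitionFor(t, 4))
// all transactions for ACC-1 land on the same node, so per-key ordering is preserved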

Commutative Vs Associative

These terms come from mathematics and refer to properties of operators over numbers. An operation is commutative if ab == ba, meaning the order of the operands doesn't matter; multiplication of integers, for example, is commutative. Division is not: a/b != b/a, so the order of the operands matters. (Strictly speaking, associativity is a different property, about grouping rather than order: (ab)c == a(bc). What we care about below is order, i.e. whether operations commute.)

Simple enough, but what does this mean for us in the computer science world? When processing transactions we have to understand whether we are required to process them in a strict order, or whether they commute and the order doesn't matter. This understanding is born out of an understanding of the particular problem domain, and like most things it isn't only black and white; there is some middle ground. I will try to explain with some examples.

The first example is that of a stock exchange. A stock exchange involves holding a number of securities within groupings of financial instruments, which are subject to different regulatory constraints and are generally traded differently. Despite all the differences, the key issue is that everything in the market can potentially affect everything else in the market, because all the values of the shares are based on perception of their value (mostly based on demand, given certain drivers in the market). Perception can easily be influenced between securities, meaning that a security influences its own value based on its fundamentals, but traders will also evaluate it relative to other securities. There are even some instruments that are based on other instruments. The point here is that order is vitally important, at a global scale, for all transactions. The order in which bids are received is the order in which they must be processed, and all bids and orders can act on a central set of state, being the share's value. These transactions are therefore strictly order-dependent; they do not commute.

The second example is that of account opening. A core banking system must be able to accept account opening transactions. These contain all the required information to create a new account of some type. Requests will be received from many individuals via the online banking system. One request has no material impact on another, there is no relation of any kind between them, and as a result the order in which they are processed does not matter. These transactions are said to be commutative.

Many people mistakenly believe that certain transactions are order-dependent when they really aren't. Let's take the example of a bank account. A bank account is simply a "bucket" which knows its balance, its limits and maybe some interest concerns. Debits and credits to the account affect its balance, and because of this shared state many people assume that the deposits and withdrawals on that account must be strictly ordered, when they may not actually need to be. The key requirement for an account is that the balance is known at the end of the financial day so that accounting can be performed. Take the example where an account starts with a balance of 100. Two transactions are received simultaneously at the bank for processing: a deposit of 10 and a withdrawal of 30. That means that for some small period of time the balance could be either 110 or 70, but it will eventually become 80, and we need to understand that "eventually" is also some very small period of time. So by their very nature these transactions are commutative with respect to the end-of-day balance; they do not require a global ordering.
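A minimal sketch of that arithmetic (my own illustration): because addition commutes, the end-of-day balance is the same whichever order the deposit and the withdrawal are applied in, even though the intermediate balances differ.

val opening = BigDecimal(100)
val deltas = Seq(BigDecimal(10), BigDecimal(-30))  // a deposit and a withdrawal

// the intermediate balance depends on which transaction lands first...
val afterDepositFirst    = opening + deltas(0)  // 110
val afterWithdrawalFirst = opening + deltas(1)  // 70

// ...but the final balance does not, because + is commutative (and associative)
val finalAB = deltas.foldLeft(opening)(_ + _)          // 80
val finalBA = deltas.reverse.foldLeft(opening)(_ + _)  // 80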

The middle-ground example is that of currency trading, where the values of the currencies have no real bearing on each other for a given trade. A currency trading platform could be receiving many simultaneous trade requests; however, only the trades for the dollar must be ordered, and likewise the trades for the pound must be ordered, but they don't have to be ordered relative to each other. These transactions are said to be locally ordered, and they provide us with an opportunity for domain-based segregation and scaling.

Locking is evil

Locking involves protecting a shared piece of data or a resource to ensure that multiple threads or nodes can't affect it simultaneously, because if they do then the behavior is undefined (all those who have ever chased a race condition down 50 different alleys will appreciate what that means). Generally, multiple writers are problematic, but multiple readers are fine, provided of course that no one is currently writing. So common knowledge is that locks are a good approach to dealing with the concurrency problem of shared state, and many good frameworks have been created to deal with this, containing many good primitives to help make the implementation easier and more performant.

All that being said, locks are actually evil for a few reasons. Firstly, they kill performance. Secondly, if a process dies while holding a lock it is typically quite difficult to recover the system without any downtime. Finally, and again related to performance, over and above the locking itself being slow, if the process holding the lock is slow then everything slows down.

For me one of the best illustrations of a core understanding of concurrency concerns comes from the LMAX team that invented the Disruptor. They did some really impressive optimization of concurrency through this understanding, and I will go into their architecture a bit more later on. They did some detailed analysis of the impact of locking on performance, and I would like to quote them to illustrate the point:

“The Disruptor paper talks about an experiment we did.  The test calls a function incrementing a 64-bit counter in a loop 500 million times.  For a single thread with no locking, the test takes 300ms.  If you add a lock (and this is for a single thread, no contention, and no additional complexity other than the lock) the test takes 10,000ms.  That's, like, two orders of magnitude slower.  Even more astounding, if you add a second thread (which logic suggests should take maybe half the time of the single thread with a lock) it takes 224,000ms.  Incrementing a counter 500 million times takes nearly a thousand times longer when you split it over two threads instead of running it on one with no lock. “
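Here is a rough sketch of that kind of micro-benchmark (my own approximation, not the LMAX code; absolute numbers will vary wildly by machine and JVM): increment a counter in a tight loop with and without a lock and compare the wall-clock time.

object CounterTimings {
  def time(label: String)(body: => Unit): Unit = {
    val start = System.nanoTime()
    body
    println(s"$label: ${(System.nanoTime() - start) / 1000000} ms")
  }

  def main(args: Array[String]): Unit = {
    val iterations = 500000000L

    var unlocked = 0L
    time("single thread, no lock") {
      var i = 0L
      while (i < iterations) { unlocked += 1; i += 1 }
    }

    var locked = 0L
    val lock = new Object
    time("single thread, uncontended lock") {
      var i = 0L
      while (i < iterations) { lock.synchronized { locked += 1 }; i += 1 }
    }
  }
}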

Immutability

The final but vital consideration in concurrency is immutability. An immutable object/component is one whose state can't be modified after it has been created. This is important because it is therefore guaranteed to be safe to share in a concurrent environment, and it is a staple of functional languages.
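A minimal sketch (my own illustration): an immutable value can be shared freely between threads, and a "change" produces a new value rather than mutating shared state.

final case class AccountSnapshot(accountId: String, balance: BigDecimal)

val before = AccountSnapshot("ACC-001", BigDecimal(100))

// applying a credit returns a new value; before is untouched and can be
// read concurrently by any number of threads without locks
val after = before.copy(balance = before.balance + BigDecimal(10))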

Summary

So, in this post I have taken a look at just some of the concepts that one needs to consider when choosing or implementing a core transaction engine for an enterprise. My next post will include some conceptual approaches to implementing such a system and I will also identify some useful open source technologies to use.

References


Lean, Batch Sizes and the impact on Skills

I run a services team; we essentially build up technology and domain skills that we believe are valuable, and on-sell them. I am fairly new to this particular business, coming from a traditional delivery background of running an internal IT department (the other side of the supply chain, as it were). Traditionally this particular business focused on deep specialization in areas like SAP, SAS, TIBCO, etc... particular technology stacks where depth of skills was called for by the market, but also where particular delivery models were in place which lent themselves to waterfall-type delivery, characterized by large batch sizes.

I have been grappling with my current strategy to ensure that I can deliver more and new value, not only to our existing customer base, but also into different market segments and micro-verticals. The call from the market seems to be leaning towards more versatility of skills as opposed to simply narrow and deep, which is intuitively correct from my perspective, because that is how I staffed my teams previously. So I have been trying to understand the root cause, as it is germane to my positioning and strategy.

There is a technology element to it. We are finally getting to that promised land in IT where we can effectively combine a wide range of technologies and deliver business value quite quickly. This is largely enabled by open source and collaborative development. I can quickly whip up a solution based on Linux, Rails, Postgres, Hadoop, Cassandra and S3/Akka/Storm and have a complete solution that looks good, scales, etc... And with advancements in operational maturity (DevOps, Chef, Puppet, etc...) I can deliver it reliably, incrementally and rapidly. And for those of you who have COTS applications, these have largely become a combination of open source themselves, with a certain amount of domain functionality already in place, which is where the value lies. These will always exist, and should; however, the absolute range of technologies has increased immensely, but in a seemingly mature and sustainable way. One only has to look at the ThoughtWorks Technology Radar.

The conclusion I have come to is that range of technologies is only a part it, and probably a smaller part. There has always been a wide range of technical solutions to any problem, what has changed is the way in which we deliver. Driven by valid business drivers, Agile and Lean have emerged as the dominant approaches to delivery, and I for one totally subscribe to the Lean principles. One of the underlying principles is that of reducing batch sizes, get a product from start to finish, end to end as quickly as possible. Global efficiency is more important than local efficiency. This reduces waste and actually ends up delivering faster by combining a whole lot of local inefficiency into a global efficiency, where are large batch sizes (depending on your situation of course), tend to combine a whole lot of local efficiency into a global inefficiency. The purpose of the post if not to convince you of this of provide anymore background, the purpose of the post is to say that I think there is a correlation and causality between the skills profile being called for and this approach. As agile and lean is applied more and more, the skills profile of the team members has to change in the same way as the machines on the factory floor have to change. I believe the future software developer won't ever be an "Oracle Developer", or an SAP developer or any of those others. They are certainly going to be around for a while still, but I think their place is the market will reduce.