2015: More innovation, but still a year of transition

First things first: I could use this title for every year; it is an evergreen. For it to make sense there must be a specific context, and in this case the context is Big Data. We saw new ideas and many announcements in 2014, and in 2015 those ideas will take shape and early versions of innovative products will start flourishing.

Like many other people, I prepared some comments and opinions to post back in early January. Then, soon after the season’s break, I started flying around the world, and the daily routine kept me away from the blog for a while. So, better late than never, it is time for me to post my own predictions, for the joy of my usual 25 readers.

Small Data, Big Data, Any Data

The term Big Data is often misused. Many different architectures, objectives, projects and issues are filed under a label that deviates from its initial meaning. Everything today seems to be “Big Data”: whether you collect structured or unstructured information, documents, text or patterns, there is so much hype that every company and marketing department wants to associate its offering with Big Data.

Big Data is becoming the way to say that your organisation deals with a vast amount of information; it is turning into a synonym for Database. Marketing aside, there are reasons behind this misuse. We are literally inundated with data of all sorts, and we have been told that all this data has some value. Fact is, more and more organisations want to use this data, and in doing so they are pushing for the commoditisation of Big Data solutions.

There are valid reasons behind the commoditisation of Big Data. The first is that data is data and, big or small, it should be simple and easy to manage and use. If this is not the case, it is an issue that database providers should solve, and an opportunity for entrepreneurs to provide new products. Managers, users and administrators demand this commoditisation. They do not want to treat Big Data differently from any other data: they do not want a batch-only mode, a Lambda architecture or another complex setup. Many organisations need real-time analysis, small queries and transactions on the data they collect or generate.

Developers and devops have their say on Big Data too. They need more ways to access Hadoop and Lambda architectures. They long for the simplicity of the good old days of the LAMP stack, or for today’s agility of Node.js and MongoDB. They want to code faster, release often, and find and fix bugs in minutes (not weeks or months), on Big Data too.

In my humble opinion, the key point for Big Data in 2015 is the convergence towards Hadoop. Everything will be in some way related to Hadoop, whether it is the distributed file system, the map/reduce approach or other related technologies. Established Big Data vendors will create more interfaces, and other SQL and NoSQL products will reach the Hadoop haven by integrating their existing engines, creating more connectors, or providing hybrid architectures.
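For readers who have never touched the model directly, here is a deliberately tiny, single-process sketch of the map/reduce idea in plain Python (not Hadoop itself): map emits key/value pairs, a shuffle groups them by key, and reduce aggregates each group. All names and data below are made up for illustration.

```python
# A toy word count in the map/reduce style: map emits (key, value)
# pairs, a shuffle groups them by key, and reduce aggregates each group.
from collections import defaultdict

def map_phase(lines):
    # Map: one (word, 1) pair per word -- the classic word-count mapper.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all values by key (Hadoop does this between phases).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values of each group into a single result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["Big Data is big", "data is data"]
print(reduce_phase(shuffle(map_phase(docs))))
# -> {'big': 2, 'data': 3, 'is': 2}
```

In a real Hadoop cluster the same three steps run distributed: mappers and reducers on many nodes, with the shuffle moving data between them. The model is simple; the operational complexity discussed below comes from everything around it.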

The two big issues to tackle are on the administration side and on the user side. For administrators, Big Data architectures must be simple to provision, configure, deploy and, when needed, modify. For users, Big Data solutions must be simple to use for analysis or online applications. In both cases, the current equation is Big Data = Big Complexity.

Some predictions

A convergence towards Hadoop is inevitable. Even the most traditional companies in the DB world, like Oracle and Microsoft, are taking large steps in this direction. Here we are not talking about integration through adapters or loaders; we are referring to a deeper convergence, where Hadoop will be (in some way) part of the commercial products.

There will be more interfaces that allow developers to reuse their skills or existing code to work with Hadoop. This aspect will be interesting for ad hoc applications, but even more important for BI and Business Analytics vendors, who will integrate their tools with Hadoop with “minimal” effort. An evolution in this area will have the same impact that tools like Business Objects, Cognos and MicroStrategy had for data warehousing in the ’90s. Users will have the ability to consume data in a DIY fashion, saving money and ultimately bringing commoditisation to Big Data.

But we need more innovation to make Big Data a real commodity. We need more Hadoop as a service, something that is only starting this year. We need cloud-friendly, or “cloudified”, architectures. The natural distribution of the Lambda architecture fits the Cloud model well, but the issue now is to optimise performance and avoid unnecessary resource consumption in cloud-based Big Data infrastructures.
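As a reminder of why the Lambda architecture distributes so naturally, here is a minimal, hypothetical sketch of its query side: a batch view recomputed periodically, a speed view covering the events that arrived since the last batch run, and a serving layer that merges the two. The views, names and numbers are invented for illustration.

```python
# A toy serving layer for a Lambda-style architecture: query results
# merge a precomputed batch view with a small real-time "speed" view.
# All data here is invented for illustration.

batch_view = {"page_a": 10000, "page_b": 7500}  # recomputed, say, nightly
speed_view = {"page_a": 42, "page_c": 3}        # incremental, covers recent events

def query(page):
    # Merge both views so the answer is complete (batch) and fresh (speed).
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(query("page_a"))  # -> 10042 (10000 from batch + 42 from speed)
print(query("page_c"))  # -> 3 (new page, not yet in the batch view)
```

Each view can live on a different set of machines, which is exactly what makes the pattern cloud-friendly, and exactly what makes the resource bill grow if the layers are not tuned.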

Orchestration is the magic word for Big Data in 2015, and certainly for one or two more years to come. Too many moving parts create complex architectures that are difficult to manage. Orchestration tools will play the most important role in the commoditisation of Big Data infrastructures: projects will be delivered faster and in a more agile way, cutting costs and making the technology suitable for more uses.

The missing players

In this scenario, PostgreSQL, MySQL and a few others are sadly missing.

PostgreSQL still has a large number of enthusiasts and great developers who provide improvements, but big investments are missing. EnterpriseDB monetises migrations of costly Oracle-based applications to PostgreSQL. This is, in my opinion, a correct and pragmatic approach from a tactical business perspective. The support business around Postgres will go on for many years, but we should not expect much innovation in this area. We can see Postgres technology used in Greenplum and in Pivotal HAWQ, but that product falls more into the bucket of Hadoop adapters than that of a standard PostgreSQL engine.

MySQL is another player that is missing the boat. The great improvements made in MySQL 5.7, in WebScaleSQL and in MariaDB all move in one direction: serving the existing MySQL install base. It is as if the world stopped in 2006 and no new technologies had emerged since. Fact is, almost all developers have adopted Hadoop and NoSQL technologies for their new projects, leaving the MySQL ecosystem (as happens with Postgres) in business mainly to support existing installations.

Finally, the traditional NoSQL players are catching up. Not having a large install base allows these players to change direction faster, and sometimes drastically. DataStax leads the pack, adding Hadoop to its Enterprise solution based on Cassandra. MongoDB benefits from large investments that give the database more headroom in the long term. The first step for MongoDB has been the introduction of a new pluggable storage architecture; now we need to wait for the next step, towards a pluggable Hadoop engine. Couchbase and Basho/Riak maintain their position as servers that can be integrated with Hadoop, but Hadoop is not a component of their enterprise products.

Obviously, I may be completely wrong in my predictions, and in 12 months’ time we might see Hadoop concentrating on real Big Data and none of the missing players jumping on the bandwagon. Let’s just wait and see.

In the meantime, there is more to come in this area. The future of Big Data is very much connected to the Internet of Things, which will bring even more complexity, along with the need for real-time analytics combined with batch analysis. On top of everything, the orchestration of a large number of components is an essential piece of technology for Big Data and IoT. Without the right orchestration, devops will spend 80% of their time on operations and 20% on development, when it should be the other way around.
More to come in Jan 2016.