If one thing is universally true about New Year’s predictions, it’s that they are extrapolations of existing trends. Most published predictions about Big Data in 2014 have to do with enterprise adoption of various technologies because, among other reasons, those trends are more visible. The predictions in this post are a little different. They’re based on experience in the data analysis startup community (as opposed to the larger Big Data market), which is a little more hidden, and where some particularly exciting trends are emerging.
1. A shift in focus from infrastructure technology to analysis tools
Right now, the Big Data community is obsessed with database performance and data storage stats. Speakers at Big Data conferences love to introduce themselves as being from company X, where they log some number of giga-, tera-, or petabytes each day. It makes sense that this habit evolved: to the right audience (the Big Data community), storage and compute metrics are meaningful and universally understood. But it is also indicative of a larger focus on the companies and products that support Big Data infrastructure rather than on the individuals who actually perform analysis. And while new infrastructure technology can make Data Scientists’ code run faster, it doesn’t necessarily improve their output.
The understanding that human interaction with data is critical to producing useful and accurate work has gained some traction in the startup community, and it will gain broad traction in 2014. As a result, the Big Data community will grow to include as many analysts as computer scientists and technologists. I do want to draw a clear distinction between this and other predictions, though: I don’t believe that enabling “ordinary workers” to produce advanced models without training, as some have suggested, will yield positive outcomes. Instead, the greatest benefits will come from a shift toward software that leverages the abilities of those trained in quantitative disciplines (a rapidly growing group) to deliver more value, faster.
2. Investment in the right tools, not necessarily the hottest technology
The shift toward analysis has likely been spurred in part by fruitless technology investments. Nobody writes about failed tech rollouts, but they do happen. Ask around and you’ll hear plenty of stories involving a lot of money spent and very little insight delivered in return. And while some are pessimistic enough to predict higher rates of failure than success, it seems likely that buyers will start thinking more carefully about their specific applications before rushing into the latest tech. Put differently, we’re approaching the end of the “trial period” for a lot of Big Data technology; the next round of buyers will be better informed than the last.
There’s an interesting corollary here: existing tools are going to see more love. The continuing rise of columnar relational databases like Redshift and SQL-on-Hadoop engines like Hive and Impala exemplifies this trend. SQL is about as old as it gets in the data analysis world, yet it’s the primary analysis tool at many of Silicon Valley’s leading tech companies. Amazon’s recent addition of Impala to AWS is a strong indication that the rest of the market is coming around to this conclusion as well.
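To ground this, here’s a minimal sketch of the kind of analytical SQL that stays essentially the same across these engines. The table and column names are hypothetical, and Python’s built-in sqlite3 stands in for a real warehouse; against Redshift, Hive, or Impala you would submit the same query through their own drivers.

```python
import sqlite3

# Hypothetical events table; in practice this data would live in a
# columnar store like Redshift or behind a SQL-on-Hadoop engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "signup", 0.0), (1, "purchase", 9.99), (2, "purchase", 4.99)],
)

# The analysis itself is plain, decades-old SQL: the same GROUP BY
# aggregation could be pointed at a petabyte-scale cluster.
query = """
    SELECT action,
           COUNT(DISTINCT user_id) AS users,
           SUM(revenue) AS total_revenue
    FROM events
    GROUP BY action
    ORDER BY total_revenue DESC
"""
for row in conn.execute(query):
    print(row)  # ('purchase', 2, 14.98) then ('signup', 1, 0.0)
```

The point is less the query than the portability: an analyst who knows SQL can move between these systems without learning a new tool.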
In 2014, we’re likely to see companies make smarter choices about their Big Data projects, beginning with selecting technology that is appropriate for both its users and its use cases.
3. Unification of the data analysis community
This might surprise you: At the time of this writing, more GitHub users have starred d3.js, a data visualization library, than have starred Rails, a mainstay of the open source software community. So what are all of those people using d3.js for? It’s actually very difficult to tell. The Rails community is easy to follow: projects that use the framework are very frequently also hosted on GitHub. d3 is a little different, though. Members of the data visualization community publish their work in other places, like blogs or standalone sites. The problem is that the massive amount of output generated by this community is effectively inaccessible because it isn’t indexed in any intelligent way.
In 2014, we’re going to see the beginnings of a unified community for data analysis and visualization. It’s not going to happen quickly: the d3 community is only one segment within this group, and it alone is highly fragmented. But the number of people building and sharing custom visualizations and analyses is reaching a critical mass, and such consolidation is inevitable.
4. Standards will emerge in the data analysis community
When smart people form communities and work together, good things happen. As analysts congregate in 2014, early standards for data analysis will emerge.
We can already see the beginnings of this trend: Dat and Data Packages are good examples of emerging standards for distributing data and analysis. In some sense, though, these are both software standards; they require a software developer to implement them. This prediction is about analysis standards: collaboratively written, broadly applicable, modular scripts that solve common problems within this new community. This, too, is already beginning to happen; in 2014, that work will begin to centralize.
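To make the distinction concrete, here’s a minimal sketch of the software side: a descriptor of the kind Data Packages standardize, written out with Python’s json module. The package name, file path, and fields are all hypothetical, and the structure is a simplified reading of the datapackage.json convention rather than a complete implementation of the spec.

```python
import json

# Hypothetical descriptor following the Data Package convention of a
# datapackage.json file that travels alongside the data it describes.
descriptor = {
    "name": "city-populations",  # hypothetical package name
    "resources": [
        {
            "path": "data/populations.csv",  # hypothetical data file
            "schema": {
                "fields": [
                    {"name": "city", "type": "string"},
                    {"name": "year", "type": "integer"},
                    {"name": "population", "type": "integer"},
                ]
            },
        }
    ],
}

# Writing the descriptor to disk is all it takes to publish the
# package's metadata; any consumer can read it back just as simply.
with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```

Because the descriptor is plain JSON, it standardizes distribution without dictating analysis; the analysis standards predicted here would sit a level above, as shared scripts that consume packages like this one.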
There’s a corollary to this as well, and it’s a fitting one to end with because it’s more of a prediction for the next decade than the next year. As we’ve seen in the software community, centralized distribution and discussion of standards (in this example, frameworks) lead to refinement of those frameworks. More people try them, tweak them, and share modifications. Better frameworks see more widespread adoption and, eventually, people start to design new systems with those frameworks in mind. Bootstrap’s popularization of the 12-column grid, now a standard in modern web applications, is an excellent example of this.
Over time, we will see the same thing happen in the world of data analysis. Groups of smart people will figure out good solutions to common problems in analysis. Eventually, companies will build systems that take advantage of these standard solutions. It won’t happen in 2014, but when it does, the impact will be tremendous.