As companies strive to become more agile in today’s ever-changing business world, a common theme is getting data faster and, in turn, getting insights from that data faster. That’s where the notion of schema-free queries often comes in: all sorts of unstructured data goes into files in Hadoop (the Hadoop Distributed File System, queried with, e.g., Hive or Drill) or into SQL and NoSQL databases that support late binding. Late binding, to get on the same page, is the practice of transforming and binding data based on relationships at program runtime, as opposed to early binding, where transformations are done when data moves from source systems into the database.
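To make the distinction concrete, here is a minimal Python sketch. The event format, the field names and the function names are all assumptions made up for illustration; they do not come from any particular product. Early binding parses and types each record once, at load time, while late binding leaves the raw text alone and interprets it only when a query runs.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no schema enforced).
RAW_EVENTS = [
    '{"user": "a1", "amount": "19.99", "currency": "USD"}',
    '{"user": "b2", "amount": "5.00"}',
]


def early_bind(raw_line):
    """Early binding: transform and type the record once, at load time."""
    record = json.loads(raw_line)
    return {
        "user": str(record["user"]),
        "amount": float(record["amount"]),
        # Assumed default for illustration only.
        "currency": record.get("currency", "USD"),
    }


# The loaded table already has a fixed shape and fixed types.
LOADED_TABLE = [early_bind(line) for line in RAW_EVENTS]


def total_early():
    """Query against data that was bound when it was loaded."""
    return sum(row["amount"] for row in LOADED_TABLE)


def total_late():
    """Late binding: interpret the raw text only when the query runs."""
    return sum(float(json.loads(line)["amount"]) for line in RAW_EVENTS)
```

Both queries return the same total here. The difference is where the interpretation lives: in one load step that everyone shares, or inside every individual query.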
These databases or data stores often enable rapid exploration via schema-free queries. And, it’s true that rapid exploration is a key piece of any agile company’s foundation, just as it’s true that some corners of the technology world are evolving so quickly that having to slow down and put governance and forethought into data storage and structure can be the difference between success and failure.
But with schema-free queries, it also pays to be prudent. If you’re not careful, they can make your data dishonest.
The fact that data isn’t wrapped in governance is fine (and preferred) for just poking around. We opt for schema-free queries in the first place because a lot is changing around us and new data sources are emerging regularly. Schema-free is great for an initial prototype, but once we move past the prototype stage, the lack of schema quickly becomes a governance nightmare.
A Crumbling Analytics House Built on Schema-Free
Without governance, whatever you produce, whether it’s a dashboard or some metric read-out, could begin lying to you. This is the exact problem we faced in the mid-2000s during my tenure at eBay, when an entire experimentation platform, with hundreds of experiments built on late binding, was starting to fold like a house of cards. The incoming data started changing on us, and there were no controls or governance in place to catch the change.
It only takes one developer upstream going about his day-to-day work to change the meaning of a tag, thinking he is the only one using it. Once that happens, everything built with that data could produce slightly to completely different results. Plus, there is no lineage with schema-free queries, so you won’t even know that anything has been changed!
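A small Python sketch shows how quietly this can happen. The `price` tag, the sample events and the dollars-to-cents change are hypothetical, chosen only to illustrate the failure mode, not taken from the eBay platform.

```python
import json


def average_price(raw_lines):
    """Late-bound metric: reads the 'price' tag at query time, with no checks."""
    prices = [json.loads(line)["price"] for line in raw_lines]
    return sum(prices) / len(prices)


# Hypothetical events before the upstream change: 'price' is in dollars.
before = ['{"item": "x", "price": 10.0}', '{"item": "y", "price": 20.0}']

# After the change, the same tag carries cents. The structure is identical,
# so the query still runs and no error is raised.
after = ['{"item": "x", "price": 1000}', '{"item": "y", "price": 2000}']

avg_before = average_price(before)
avg_after = average_price(after)
```

The query itself never changed, yet the number it reports is now a hundred times larger, and nothing in the pipeline flagged it.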
Put simply, schema-free queries can quickly become a foundation for a house that crumbles after it’s built.
Don’t get me wrong: late binding is a must-have capability in today’s data infrastructure. We have long been working on getting more and more late binding features into our various products, the latest example being high-performance, binary JSON storage and processing natively within the Teradata database.
Building Trust in Your Data
While systems need to support both late and early binding, tight and loose coupling, the evolution towards schema (even if only for subsets of data) is a must-have step for any data product development process.
Schema is not just a nuisance. It’s not there to be painful; it’s there to control structure and actually reject mismatches along the way. It forces a different way of thinking about production quality than a free-flowing, unstructured lake that changes by the minute and is hard to rely on in terms of repeatability. Trust in repeatable and consistent results is key to the success of Big Data.
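As a rough sketch of what rejecting mismatches can look like, consider the Python below. The schema, the field names and the `SchemaMismatch` error are hypothetical and stand in for whatever your database or validation layer actually enforces.

```python
# Hypothetical schema: field name -> required Python type.
ORDER_SCHEMA = {"order_id": int, "price_usd": float, "country": str}


class SchemaMismatch(ValueError):
    """Raised when a record does not match the declared schema."""


def validate(record, schema=ORDER_SCHEMA):
    """Return the record unchanged if it matches the schema, else raise."""
    missing = schema.keys() - record.keys()
    extra = record.keys() - schema.keys()
    if missing or extra:
        raise SchemaMismatch(
            f"missing={sorted(missing)} unexpected={sorted(extra)}"
        )
    for field, expected in schema.items():
        if not isinstance(record[field], expected):
            raise SchemaMismatch(
                f"{field}: expected {expected.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return record


def load(records):
    """Split records into accepted rows and rejected rows with a reason."""
    accepted, rejected = [], []
    for record in records:
        try:
            accepted.append(validate(record))
        except SchemaMismatch as err:
            rejected.append((record, str(err)))
    return accepted, rejected
```

A renamed tag or a number that suddenly arrives as text ends up in the rejected list with a reason attached, instead of flowing silently into a report.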
The lesson is that you need to constantly check whether schema-less data is being used for production purposes. Similarly, the moment you find something in your data exploration, figure out which tags you’re using for the production-like environment, and make sure you have the ability to check on them. While there’s often value in getting to data quickly to uncover new things, there is also value in knowing that a particular tag has a certain meaning, especially once you make the move from exploration to production.
As part of the series of articles on the concept of the Sentient Enterprise, I have talked about the need for the Layered Data Architecture, a data classification framework that allows for the rapid and agile integration of unstructured or late binding data. The key to success is to properly classify all your incoming data as it is being accessed, used and relied on, and to elevate data elements from non-coupled to loosely coupled to tightly coupled status.
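One way to picture that elevation is the Python sketch below. The three levels mirror the wording above, but the usage counters and the thresholds are purely hypothetical rules of thumb and are not part of the Layered Data Architecture itself.

```python
from enum import IntEnum


class Coupling(IntEnum):
    """Coupling status of a data element, from least to most governed."""
    NON = 0    # raw, exploratory data with no schema
    LOOSE = 1  # some agreed structure, still allowed to change
    TIGHT = 2  # schema enforced, safe for repeatable production use


def classify(exploratory_reads, production_consumers):
    """Hypothetical rule: usage decides the status a data element should have."""
    if production_consumers > 0:
        # Anything a report, model or algorithm relies on gets a schema.
        return Coupling.TIGHT
    if exploratory_reads >= 10:  # assumed threshold, for illustration only
        return Coupling.LOOSE
    return Coupling.NON


def needs_elevation(current, exploratory_reads, production_consumers):
    """True when observed usage calls for a higher status than the current one."""
    return classify(exploratory_reads, production_consumers) > current
```

Running a check like `needs_elevation` on a regular basis is one way to notice that exploratory data has quietly become something production depends on.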
When we build algorithms, models, reports, or any other form of repeatable usage of data, we are obligated to have control and authority over the data behind it, so we can make sure it will continue to do what it claims to do.
Don’t let your data lie to you.
(Author):
Oliver Ratzesberger
Mr. Ratzesberger has a proven track record in executive management, as well as 20+ years of experience in analytics, large data processing and software engineering.
Oliver’s journey started with Teradata as a customer, driving innovation on its scalable technology base. His vision of how the technology could be applied to solve complex business problems led to him joining the company. At Teradata, he has been the architect of the strategy and roadmap, aimed at transformation. Under Oliver’s leadership, the company has challenged itself to become a cloud enabled, subscription business with a new flagship product. Teradata’s integrated analytical platform is the fastest growing product in its history, achieving record adoption.
During Oliver’s tenure at Teradata he has held the roles of Chief Operating Officer and Chief Product Officer, overseeing various business units, including go-to-market, product, services and marketing. Prior to Teradata, Oliver worked for both Fortune 500 and early-stage companies, holding positions of increasing responsibility in technology and software development, including leading the expansion of analytics during the early days of eBay.
A pragmatic visionary, Oliver frequently speaks and writes about leveraging data and analytics to improve business outcomes. His book with co-author Professor Mohanbir Sawhney, “The Sentient Enterprise: The Evolution of Decision Making,” was published in 2017 and was named to the Wall Street Journal Best Seller List. Oliver’s vision of the Sentient Enterprise is recognized by customers, analysts and partners as a leading model for bringing agility and analytic power to enterprises operating in a digital world.
Oliver is a graduate of Harvard Business School’s Advanced Management Program and earned his engineering degree in Electronics and Telecommunications from HTL Steyr in Austria.
He lives in San Diego with his wife and two daughters.