Q&A with Cloudera's Kirk Dunn and Charles Zedlewski
Cloudera had a big presence at the recent Strata Conference/Hadoop World in New York the week before Hurricane Sandy hit. The company made a big splash with its CDH4 release and the addition of a real-time query engine for Hadoop. Executives Kirk Dunn, COO, and Charles Zedlewski, vice president of products, sat down with FierceBigData editor Tim McElligott at the show to talk more generally about the big data space and the role of Hadoop.
What is the big difference between big data and previous efforts in analytics?
Dunn: The real opportunity for big data is the ability to analyze more data, more flexibly. The common framework for data analysis prior to Hadoop and big data was schema-on-write, which meant that as you were writing data into your data store, you were categorizing and organizing it as it came in. But now, in many cases, there are many more types and a larger volume of data, and it changes frequently, so the minute you write that schema and bring in the data and get ready to ask your question, the data has changed. So the fundamental premise for big data is to allow it to live organically in the system and not have to figure everything out up front.
In the old days, you had to know the question you wanted answered beforehand. In the big data world, you let the data tell you what the question is you should be asking. Fundamentally, that is the issue. But it also allows you to go from analyzing roughly 15 percent of your data to looking at 100 percent. You don't want to constrain the data going in because you're not sure what's going to change.
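The contrast Dunn draws with schema-on-write is often called schema-on-read. The following is a minimal sketch of that idea in Python, not Cloudera's or Hadoop's implementation: raw records are kept as they arrive, and a schema is applied only when a question is asked. The record fields, the `RAW_EVENTS` sample data and the `read_with_schema` helper are all hypothetical names used for illustration.

```python
import json

# Raw events are stored exactly as they arrive; no schema is enforced on write.
# The records and field names here are made-up illustration data.
RAW_EVENTS = [
    '{"ts": 1, "source": "trade", "symbol": "ABC", "qty": 100}',
    '{"ts": 2, "source": "netlog", "host": "gw-1", "latency_ms": 250}',
    '{"ts": 3, "source": "trade", "symbol": "XYZ", "qty": 40, "venue": "NY"}',
]


def read_with_schema(raw_lines, fields, where=None):
    """Project each raw record onto `fields` at read time (schema-on-read)."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        # Optional filter, evaluated against the full raw record.
        if where is not None and not where(record):
            continue
        # A field the record lacks becomes None instead of rejecting the record.
        rows.append({name: record.get(name) for name in fields})
    return rows


# Two different questions asked of the same untouched raw data.
trades = read_with_schema(
    RAW_EVENTS, ["ts", "symbol", "qty"],
    where=lambda r: r.get("source") == "trade",
)
slow_hops = read_with_schema(
    RAW_EVENTS, ["ts", "host", "latency_ms"],
    where=lambda r: r.get("latency_ms", 0) > 200,
)
```

Because nothing was discarded or reshaped on the way in, a new question (for example, one that needs the `venue` field) only requires a different `fields` list, not a reload of the data.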
Can you give an example of how the question might change?
Dunn: We had a financial services case where they wanted to correlate trading activity to network log analysis. They were looking at what was going on across the network and wanted to see how that correlated to anomalies on the trading floor. You would normally not correlate those different data types to draw a conclusion. If you did in the traditional mode, it would be very expensive and take a very long time. If the result proved inconclusive, you would have just wasted your time. Now you can do a random, ad hoc query and ask whatever question of the data you want, no matter how far-fetched or spot on. And you can do it 10 times faster for about a tenth of the cost.
Once you turn data analysis on its side like that, it opens up a whole series of workloads that previously were never contemplated, so you can imagine all sorts of things you can do when you don't have limits.
Have you heard the phrase "let the data find the data," and do you agree?
Dunn: I would say let the data instruct you. Data on its own is inert; it won't find something without you questioning it. There are unlimited possibilities for blending data types. Take supply chain data in a retail scenario. You might never think to ask what the impact is on products moving through the Eastern U.S. during hurricane season. But today you have that possibility.
Is there an art to asking the right questions of your data?
Dunn: Don't try to solve a problem that isn't already a business problem. In other words, if you are in retail and you're trying to increase the efficiency of your supply chain, don't try to figure out in the same vein which cardboard boxes are most sturdy. If it isn't a core principle for your business then don't go there. In that sense, there are constraints on big data regarding the things you would want to look at. That's where some people get lost. Though we create unlimited possibilities to ask questions, the right questions should be constrained by what kind of business you are running. Just because everything is possible doesn't mean that businesses should ask those everything-is-possible types of questions.
Why does it seem easy in a big data environment to analyze data across silos when the telecommunications industry couldn't get its systems to talk to each other after 15 years of effort?
Dunn: One of the big problems telecom faced with data is the time to respond. They have all this information, but by the time you're able to load it into a processing engine and get an answer out, you probably already have a number of bad customer experiences. The big problem is churn, and the quicker they can respond to an outage or delay or something that will influence churn, the greater impact they will have. In the old days, you would have a bunch of log data and crunch it, and by the time the answer came back, the problem had moved. So their ability to respond quickly is the issue.
What was your main message at the event?
Dunn: The main message is about extending the platform. The core of Hadoop was started years ago and Cloudera added consistency to it, added HBase to it and is now putting real-time into it as well as a management framework to make it enterprise ready. So it is no longer a black art where you need a whole room full of PhDs to understand how to write jobs and stitch together all the components of Hadoop.
Is the advertising market still the biggest user of your platform?
Zedlewski: No. It is an obvious one but not even close to being the biggest anymore. Four years ago a lot of our customers were ad tech companies. There are still a decent number, but in terms of our revenue, it is more driven by financial services, telecom, the federal government, retail and media. We are now starting to see more growth in utilities and pharmaceuticals. Utilities seem to be the next wave.
Dunn: And anything in the data warehouse where regular analytics is going on is the target. You can use the term data-driven environments. There are a lot of marketing companies out there that are not Fortune 200 or 500, but data is how they make their decisions and they are big data users.
Who leads the ecosystems in big data?
Zedlewski: That's an important point. There are really three different communities or ecosystems. One is the customers or the users, who are typically large enterprise customers. That ecosystem wants things out of Hadoop that are not different from what they want from any other platform: compelling functionality that is secure, scalable and predictable, plus support, all for a fair price. Next there is the open source ecosystem with about 17 or 18 different Apache projects that have something to do with Hadoop. This is made up of developers. Third is the vendor ecosystem. The beauty of this whole model is that we exist as a buffer between all three of these worlds, so we need to participate with each one of them. All these ecosystems are vital to our business.
What is the most interesting problem you've worked on with big data?
Dunn: That requires some thought because there are so many. Monsanto is a good one. It turns out they have very sophisticated measurement mechanisms on their Caterpillars (tractors) that do laser leveling on small sections of the cornfield and actually test soil chemical composition in each area to measure its propensity to not carry water well. They can then plant a field with drought-resistant seeds in a particular part of the field to maximize the yield of each crop. They use a high degree of technology and science to make sure they are optimizing the output of every square meter of field.
Zedlewski: I like how big data enables pretty much everything that has to do with maps. When you get in your car and use the navigation system and it tells you where to get a good cup of coffee, all that stuff is coming off a big data system. Increasingly, the ads and recommendations that come to you when you're in a particular place are a big data application. I don't think there's a single example anywhere of a digital map that is not actually backed by a Hadoop-style system.