A matter of statistical relevance versus certainty


It was one of those two-headed questions. At the TM Forum's Management World Americas event this week, I put a question to the big data experts there that made them look at me as if I had two heads.

The (rather long-winded) question was this: For years you have provided predictive analytics to communications companies, assuring them that the statistically meaningful samples in your analysis support accurate conclusions about their business or their customers, conclusions on which they sometimes take risky action. Now that databases are so huge, and processing power and memory so advanced, that you can analyze 100 percent of the data rather than a filtered subset, how can you say your recommendations will be better? And if you can, how much better? What happens to the value of a relevant sample?


In other words, is it worth an investment in big data to know that 73 percent rather than 75.6 percent of the people in Indiana who put hot sauce on their scrambled eggs are more likely to buy a Dodge truck over a Chevy? When does it become overkill to analyze 100 percent of a data set and what kind of improvement over the "old" method of sampling can be expected?
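For a rough sense of what a sample already buys, here is a minimal Python sketch of the textbook margin of error for a sample proportion. It assumes a simple random sample and the normal approximation at 95 percent confidence; the 75 percent figure and the sample sizes are illustrative, not data from any operator.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes simple random sampling and the normal approximation;
    z=1.96 corresponds to roughly 95 percent confidence.
    """
    return z * math.sqrt(p * (1.0 - p) / n)

# Illustrative sample sizes around an assumed 75 percent proportion.
for n in (100, 1_000, 10_000, 100_000):
    moe = margin_of_error(0.75, n)
    print(f"n={n:>7}: 75% +/- {moe * 100:.2f} points")
```

Under those assumptions, a sample of about a thousand already narrows the estimate to within roughly 2.7 points, which is about the size of the gap between 73 and 75.6 percent. The formula only covers random sampling error; it says nothing about bias in how the sample was filtered, and that is where full analysis may or may not help.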

Of the eight people I put that question to, one offered a specific example: the improvement a mobile operator saw in churn rates, directly related to the full analysis of a fraud data set. It was 14 percent better. If the operator's churn rate was two percent, it would have to weigh the costs and benefits carefully. If the churn rate was 20 percent, it wouldn't be much of a decision.
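The arithmetic behind that comparison can be sketched in a few lines of Python. This reads "14 percent better" as a relative reduction in the churn rate, which is an assumption, and the one-million subscriber base is a hypothetical figure chosen only to make the numbers concrete.

```python
def subscribers_retained(base, churn_rate, relative_improvement=0.14):
    """Extra subscribers kept per period if churn falls by a relative fraction.

    relative_improvement=0.14 assumes the reported "14 percent better"
    means a 14 percent relative reduction in the churn rate.
    """
    lost_before = base * churn_rate
    lost_after = base * churn_rate * (1.0 - relative_improvement)
    return lost_before - lost_after

BASE = 1_000_000  # hypothetical subscriber count, not from the example

for rate in (0.02, 0.20):
    saved = subscribers_retained(BASE, rate)
    print(f"churn {rate:.0%}: about {saved:,.0f} more subscribers retained")
```

On those assumptions, the same 14 percent improvement keeps about 2,800 subscribers at two percent churn and about 28,000 at 20 percent, so the case for the investment scales directly with the churn rate the operator starts from.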

Over the coming months, we hope to explore this idea of 100 percent analysis versus filtering and sampling for statistically relevant samples. If you have any suggestions or any examples of such improvements, please join the party and send me a note. We'll be looking for examples where 100 percent certainty is necessary or possible. -Tim