8 top challenges big data brings to statisticians

Guest post by Ronald L. Wasserstein, executive director of the American Statistical Association

Big data is "undoubtedly the greatest challenge and opportunity" facing the future of statistical sciences. That is the consensus of more than 100 prominent statistical scientists from around the world who gathered in London late last year to chart a course for the scientific field.

Big data was the most discussed trend at the Future of the Statistical Sciences Workshop held last November. The event was the capstone of the International Year of Statistics, celebrated worldwide in 2013. The workshop's key findings, including those focused on big data, are presented in a recently released report titled Statistics and Science: A Report of the London Workshop on the Future of the Statistical Sciences.

For statisticians, the report notes, big data introduces a whole different set of issues, including the following:

Data is not just big, but different. For statisticians, big data challenges some basic paradigms. One example is the "large p, small n" problem. Classical statistics provides methods to analyze data when the number of variables, or p, is small and the number of data points, or n, is large. In many big data applications the situation is reversed: p is far larger than n. Searching across that many variables makes it likely that some will appear significant by chance alone, which is known as the "look-everywhere effect." Statisticians have developed several technical advances to deal with this effect and extract a needle of meaningful information from a haystack of data. Yet more work is needed.
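
To make the "look-everywhere effect" concrete, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than something taken from the report: the function names, the choice of 20 data points and 1,000 variables, and the cutoff of 0.444 (roughly the 5 percent two-sided critical value for a correlation when n is 20). All of the variables are pure noise, so every "hit" is a false discovery.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def count_spurious(n=20, p=1000, threshold=0.444, seed=0):
    """Count noise variables whose correlation with a noise response
    exceeds the threshold in absolute value.

    n, p, threshold and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    response = [rng.gauss(0, 1) for _ in range(n)]
    hits = 0
    for _ in range(p):
        predictor = [rng.gauss(0, 1) for _ in range(n)]
        if abs(pearson(predictor, response)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    print("noise variables that look significant:", count_spurious())
```

With 1,000 unrelated variables and a 5 percent cutoff, about 50 of them would be expected to pass the test by chance, which is why corrections for multiple comparisons matter when p is large.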

Problems of scale. Many popular algorithms for statistical analysis do not scale up well and run hopelessly slowly on terabyte-scale data sets. Statisticians need to improve these algorithms or design new ones that trade off theoretical accuracy for speed.
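
One well-known way to trade exactness for speed is to compute on a fixed-size random sample drawn in a single pass over the data. The sketch below uses reservoir sampling (often called Algorithm R) to approximate a median without sorting or storing the full data set. It is an illustration of the general idea, not a method taken from the report, and the function names and the default sample size of 1,000 are assumptions.

```python
import random
import statistics


def reservoir_sample(stream, k, seed=0):
    """Return a uniform random sample of up to k items from an iterable
    of unknown length, reading it once and storing only k items."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def approximate_median(stream, k=1000, seed=0):
    """Estimate the median from a reservoir sample of size k.

    The result is approximate; a larger k costs more memory and
    gives a more accurate estimate.
    """
    return statistics.median(reservoir_sample(stream, k, seed))


if __name__ == "__main__":
    print(approximate_median(range(1_000_000)))
```

The exact median needs the whole data set in memory or a full sort, while this version needs memory for only k values, at the price of sampling error that shrinks as k grows.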

Time pressure. In the era of big data, time means everything. To address this issue, statisticians are adopting and adapting ideas from computer scientists. The objective in some cases may not be to deliver a perfect answer, but to deliver a good answer fast. Yet statisticians cannot stop thinking like statisticians. They understand uncertainty, and they know that predictions are better thought of as forecasts, which carry inherent uncertainty.
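
A small sketch of what "a good answer fast" with honest uncertainty can look like: estimate a mean from a random subsample and report an approximate 95 percent interval alongside it. The function name, the sample size of 1,000, and the normal-approximation multiplier of 1.96 are illustrative assumptions, and the interval ignores the finite-population correction.

```python
import math
import random
import statistics


def quick_mean_estimate(data, sample_size=1000, seed=0):
    """Estimate the mean of a list from a random subsample.

    Returns the estimate and an approximate 95 percent interval
    based on the normal approximation. sample_size must not exceed
    len(data).
    """
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)  # sampling without replacement
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    margin = 1.96 * std_err
    return mean, (mean - margin, mean + margin)


if __name__ == "__main__":
    gen = random.Random(1)
    # Hypothetical data: 50,000 draws with mean 100 and standard deviation 15.
    values = [gen.gauss(100, 15) for _ in range(50_000)]
    estimate, interval = quick_mean_estimate(values)
    print(estimate, interval)
```

Reporting the interval rather than the point estimate alone is the part that keeps the fast answer a forecast instead of a bare number.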

Different kinds of data. Big data is complex and comes in forms different from those statisticians are accustomed to--for instance, images or networks. This next-generation functional data requires the invention or importing of ideas from areas of mathematics outside what is conventionally thought of as statistics, such as geometry (for abstract shapes) or topology (for the spaces from which the data are sampled).

Privacy and confidentiality. This is the area of greatest public concern regarding big data, and statisticians cannot ignore it. There are several methods to anonymize data to protect personal information, but there is no such thing as perfect security. One of the most exciting trends in big data is the growth of collaboration between the statistics and cryptography communities.

Reinventing the wheel. Some collectors of big data--most notably, web companies--may not realize statisticians have generations of experience extracting information from data, as well as avoiding common fallacies.

Quality of data. An underrated service statisticians can provide to big data enterprises is to look at the quality of data with a skeptical eye. They can ask the following key questions: Are the data collected in a way that introduces bias? Are there missing or incomplete data? Are there different kinds of data? Statisticians not only know how to ask the right questions, but may have practical solutions already available.

A statistician by any other name. Big data has forced statisticians to confront a question of their own identity. The companies that work with big data are hiring people they call "data scientists." The exact meaning of this term is a matter of some debate; it seems like a hybrid of a computer scientist and a statistician.

This new job category brings both opportunity and risk to the statistics community. The value statisticians can bring to a big data enterprise is their ability to ask and answer such questions as these: Are the data representative? What is the nature of the uncertainty? It may be an uphill battle to convince collectors of big data that their data are subject to uncertainty and, more importantly, bias.

On the other hand, it is imperative for statisticians not to be such purists that they miss the important scientific developments of the 21st century. Perhaps statisticians will have to embrace a new identity. Alternatively, they might have to accept the idea of a more fragmented discipline in which standard practices and core knowledge differ from branch to branch.

Preparing statisticians for big data. Many statisticians are concerned that statistics graduate students may get shut out of the data science field. The job openings these students are applying for are not designated for "statisticians"; they are for "data scientists." Statistics graduate students want these jobs, but don't always get them. Employers want candidates who can write software that works and solve problems they didn't learn about in books. The perception is that newly minted statistics PhDs often don't have those abilities.

More computer science training seems to be a good idea, and it needs to go beyond simply learning more computer languages. The students need to learn how to produce software that is robust and timely.

"The advent of big data, data science, analytics, and the like requires that we as a discipline cannot sit idly by … but must be proactive in establishing both our role in and our response to the 'data revolution' and develop a unified set of principles that all academic units involved in research, training, and collaboration should be following," said workshop participant Marie Davidian. "We should be expending our energy to promote statistics as a discipline and to clarify its critical role in any data-related activity."

The bottom line for statisticians--in academia, industry, and government--is that big data should not be viewed only as a challenge. It is an opportunity for statisticians to re-evaluate their assumptions and bring new ideas to the forefront.

About the author: Ronald L. (Ron) Wasserstein is the executive director of the American Statistical Association (ASA). Prior to joining the ASA, Wasserstein was a mathematics and statistics department faculty member and administrator at Washburn University in Topeka, Kan., from 1984 to 2007. During his last seven years at the school, he served as the university's vice president for academic affairs. Wasserstein is a Fellow of the ASA and of the American Association for the Advancement of Science. He received the John Ritchie Alumni Award and the Muriel Clarke Student Life Award from Washburn University and the Manning Distinguished Service Award from the North American Association of Summer Schools.

For more:
- read the workshop report