“Big Data” is a buzzword that has some companies convinced they will find useful and profitable patterns in large datasets. A byproduct of the Big Data paradigm is that analysts and developers with little to no statistical background are now developing models.
Not everyone needs to be a professional statistician to find value in data. However, that doesn’t mean the data itself is so full of value and obvious answers that one can afford to sit back and let the value show itself.
Less Noise, More Signal
As someone who mines data on a daily basis, I understand the temptation to take a pattern one discovers in data, box it up and call it a product.
In the age of Big Data, though, the signal-to-noise ratio is low: most of what you collect is noise. (See Nate Silver’s bestseller, “The Signal and the Noise.”)
How NOT to Data Mine
You have a website that enables customers to search based on ZIP Code. You’ve collected each of these ZIP Codes, giving you millions of data points. Your thinking might go something like this:
Great! Look at all this data! We know our top search ZIP Codes and can now, using readily available American Community Survey data from the U.S. Census, define the type of people that use our service!
And herein lies the danger. You’ve just gone from data to solution without clearly thinking through the problem.
Careful Data Mining
First: Slow down. You are the domain expert or else you wouldn’t be in business. The data is not an omniscient being with all the answers.
Then: Ask yourself, What do you think you see in the data? Form a hypothesis and test it.
Remember that finding nothing is actually good, because it helps you narrow down what you should analyze in the future. It also prevents you from selling data “noise” to your customers dressed up as information.
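To make "form a hypothesis and test it" concrete, here is a minimal sketch of one possible test, a permutation test, written in plain Python. The numbers, the grouping by income and the function name are all hypothetical and chosen only for illustration; they are not Maponics data or methods. The hypothesis being checked is that ZIP Codes in one group generate more searches per 1,000 residents than ZIP Codes in another.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns an estimated p-value: how often a random relabeling of the
    pooled values produces a gap at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1

    # Add-one correction keeps the estimate from being exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical searches per 1,000 residents (made-up values).
high_income_zips = [12.1, 9.8, 11.4, 10.9, 13.0, 10.2]
other_zips = [10.5, 11.9, 9.7, 12.4, 10.8, 11.1]

p_value = permutation_test(high_income_zips, other_zips)
print(f"p-value: {p_value:.3f}")
```

With these made-up numbers the gap between the groups is small relative to the spread, so the test finds nothing. That is the useful outcome described above: one less pattern to chase, and one less piece of noise to package as a product.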

When data takes over. Image credit: xkcd
Maponics Spatial Data Experts
Here at Maponics, data comes in as torrential rivers and goes out as focused streams of information. Does that mean we lose or don’t account for all the data we see? Of course not.
As discussed above, much of that data is noise. We know that because we slow down, bring the right domain experts into one room and form our questions and hypotheses.
If we’re convinced of value in a data stream given evidence-based practices, then we begin wrangling the data (no dataset is ever clean) and developing the model. Output from the resulting model is then scrutinized.
A product only comes into being if the process from data to information is reliable (expected output every time) and valid (correct output every time).
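As a rough sketch of what those two checks could look like in code: the helpers below treat "reliable" as "the same input gives the same output on every run" and "valid" as "the output matches a known correct answer." Both readings are assumptions made for this example, as are the function names and the toy ZIP Code rule. This is not a description of Maponics' actual process.

```python
def is_reliable(model, inputs, runs=5):
    """True if every input yields the same output on each repeated run."""
    return all(
        len({model(x) for _ in range(runs)}) == 1
        for x in inputs
    )


def validity_rate(model, labeled_examples):
    """Share of examples where the output matches the known correct answer."""
    correct = sum(1 for x, expected in labeled_examples if model(x) == expected)
    return correct / len(labeled_examples)


def region_for_zip(zip_code):
    """Hypothetical toy model: a deliberately crude rule on the first digit."""
    return "Northeast" if zip_code[0] in "01" else "Other"


# Hand-labeled examples used as ground truth for this sketch.
labeled = [
    ("03755", "Northeast"),
    ("10001", "Northeast"),
    ("46202", "Other"),
    ("94105", "Other"),
]

reliable = is_reliable(region_for_zip, [z for z, _ in labeled])
valid_share = validity_rate(region_for_zip, labeled)
print(f"reliable: {reliable}, valid on {valid_share:.0%} of labeled examples")
```

A model can pass the first check and fail the second: a rule that always returns the same wrong answer is perfectly reliable and not valid at all, which is why both checks are needed.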
Data provides clues and indicators – and sometimes nothing useful at all. To get information out of that data, you must – using all the tools at your disposal – formulate hypotheses, all the while staying focused on the questions that create value for your customers.
And be a skeptic! To borrow a quote I enjoy, usually attributed to the statistician George Box: “All models are wrong but some are useful.”
Do you have particular strategies that help you glean useful information from data? Share them in a comment below. Also, take a look at Maponics geospatial datasets.
Aaron Burgess is a Spatial Data Specialist at Maponics. He is a former research fellow at Indiana University School of Medicine with a focus on GIS, informatics and statistics. He specializes in Natural Language Processing (NLP), machine learning and information retrieval from the web. Aaron earned his M.S. in Geographic Information Systems and B.A. in Geography from Indiana University-Purdue University at Indianapolis (IUPUI).