The Paradox of Big Data

My local gym plays the same radio station whenever I work out. They prefer a 1980s throwback station, which tells you something about the demographics of the patrons there. During my workouts, I have listened to more Gowan, Corey Hart, and Duran Duran than I ever wanted to hear after turning 18. I only listen to that station four times a week for an hour at a time, but I have a pretty good idea of what that radio station is all about. I can feel confident that I won’t hear any Mozart or Beyoncé while I’m at the gym.

My personal radio station of choice is the local public radio. On this station I can expect mostly topical talk shows, with some fairly tame satire or comedy, and only the occasional smattering of music thrown in. But I know there won’t be any Top 40 from the 1980s. Also, no Beyoncé.

You might wonder how I could be so confident about the content of these radio stations, when I only listen to them for a few hours every week.  I only have a small sample after all.  If I really want to know what kind of station they are, shouldn’t I listen all day, for several days?  Or even a few months?

The reason I don’t need to do that is that I have a representative sample of data from each of these stations. 

In this era of Big Data, data sampling is often poorly understood. Many people think that unless we are collecting all the data, the data will be of limited value. Or even no value.

I recently spoke with a prospect about doing some social media analysis to investigate what the citizens of their country were discussing on Twitter. This country has a population of almost 30 million people and a very high rate of engagement on social media. As a result, the volume of data was very large, and it would have been outrageously expensive to collect every single tweet over an extended period of time. To make the project more affordable, I proposed that we sample the Twitter conversations in 10-minute chunks, several times a day, for a couple of months. This would still give us huge amounts of data and would be more than sufficient to give valuable insights.
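To make the idea concrete, here is a rough Python sketch of what such a sampling schedule could look like. The specifics are assumptions for illustration only: four windows per day, a 60-day period, an arbitrary start date, and randomly placed start times (so the sample does not favour one time of day). The function name `sampling_windows` is hypothetical, and nothing here talks to Twitter; it only works out when collection would happen.

```python
import random
from datetime import datetime, timedelta

def sampling_windows(start_day, days, windows_per_day, window_minutes=10, seed=0):
    """Return (begin, end) pairs for randomly placed collection windows."""
    rng = random.Random(seed)
    slots_per_day = (24 * 60) // window_minutes  # non-overlapping slots in one day
    windows = []
    for d in range(days):
        day_start = start_day + timedelta(days=d)
        # Pick distinct slots so windows within a day never overlap.
        for slot in sorted(rng.sample(range(slots_per_day), windows_per_day)):
            begin = day_start + timedelta(minutes=slot * window_minutes)
            windows.append((begin, begin + timedelta(minutes=window_minutes)))
    return windows

# Assumed values: 60 days, 4 windows a day, starting on an arbitrary date.
schedule = sampling_windows(datetime(2024, 1, 1), days=60, windows_per_day=4)
covered = sum((end - begin).total_seconds() for begin, end in schedule)
total = 60 * 24 * 60 * 60  # seconds in the 60-day period
print(f"{len(schedule)} windows covering {covered / total:.1%} of the period")
```

Under these assumptions the schedule collects during roughly 3% of the period, which gives a sense of how much cheaper sampling can be than collecting everything.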

The prospect was taken aback. Not collect ALL the data? Surely that would leave too much information on the cutting room floor? In the end, they decided not to proceed because of this issue. They chose to have no data at all rather than a representative sample, because they felt that leaving out any data would make the project meaningless.

This reaction surprised me, because we know that representative samples work, and researchers rely on that principle all the time.

When scientists want to find out the water quality of a river, they take a sample from a few different places at different times.  They don’t try to collect all the water from the entire river.

When research firms perform opinion surveys, they don’t ask every single citizen about how they will vote.  They speak to a randomized sample.

These examples seem self-evident because it would plainly be impossible to collect data from the entire population, or the entire river.

However, when it comes to Big Data, the fact that it is possible to have all the data somehow makes us feel that we must use all of it. But the underlying principles of representative samples still apply.
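A small simulation can illustrate the principle. The sketch below is purely hypothetical: it invents a population of one million posts in which 30% mention some topic, draws a simple random sample of 1% of them, and compares the sample's estimate with the true figure. The margin of error uses the standard approximation for a proportion from a simple random sample; a time-window sample like the one described earlier is not a simple random sample, so treat the formula as a rough guide there.

```python
import math
import random

rng = random.Random(42)

# Hypothetical population: 1,000,000 posts, about 30% of which mention a topic.
population = [rng.random() < 0.30 for _ in range(1_000_000)]
true_share = sum(population) / len(population)

# Simple random sample of 1% of the posts.
sample = rng.sample(population, 10_000)
sample_share = sum(sample) / len(sample)

# Approximate 95% margin of error for a proportion from a simple random sample.
margin = 1.96 * math.sqrt(sample_share * (1 - sample_share) / len(sample))

print(f"true share:   {true_share:.3f}")
print(f"sample share: {sample_share:.3f} +/- {margin:.3f}")
```

With 10,000 sampled posts the margin of error works out to just under one percentage point, even though 99% of the data was never examined.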

This is one of the paradoxes of Big Data. We have more data available to us than ever before, but the sheer volume of it makes it difficult to turn that data into insights. We often get bogged down in just cataloging it all.

Here is the key point. Like anything else you do in your business, gathering insights from Big Data requires you to look at your constraints in terms of time and money. Everyone has constraints, and that’s OK. If you take a smart approach to sampling and analysis, chances are that you need less data than you think.

Image credit: Gregor Stoermchen (Creative Commons Commercial License)