You have just created data by clicking on this page. 2.5 exabytes (or 2.5 billion gigabytes) of data are generated each day, from our online browsing to the daily activity tracked by mobile sensors. To give you a sense of how much data that is, 1 exabyte is about 3,000 times the size of all the content in the Library of Congress. Up until 2012, humanity had created a total of only 5 exabytes of data. Then in 2013, with the explosion of the Internet, we created as much data in a single year as had ever been created before.
That is big data. Big data is everywhere, and it is growing exponentially.
Companies have been quick to harness the power of big data for various purposes, including improving the consumer experience through personalization and identifying high-risk customers. The personalized recommendations of Netflix and Amazon may come to mind. In one famous case, Target was able to predict a teenage girl's pregnancy and send relevant coupons to her home before she had revealed the pregnancy to her family. Credit card companies have used correlations in their data (e.g., between purchases of anti-scuff pads for furniture and good credit behavior) to identify high- and low-risk customers.
Until recently, researchers meticulously conducted studies with carefully thought-out methods, collecting small, representative samples and asking participants targeted questions. However, big data is changing the way we do research in the following ways:
1. Big data is effortlessly collected.
Big data draws on every available case (think millions), created spontaneously in real time, whereas small data relies on a limited, representative sample (sometimes fewer than 50 participants in total). Most participants in a big dataset probably don't even know they are contributing to research! Small data, by contrast, is time-consuming to collect, and participants typically know that they are answering specific questions or taking part in specific activities for the research.
2. Big data is data-driven.
Certainly, a person could delve into big data with a hypothesis, as they would with small data, but because of the immense amount of information it contains, researchers more often turn to advanced analytical methods such as machine learning to detect patterns and uncover as many insights as possible. With small data, researchers usually design a study to test a specific set of hypotheses.
3. Bigger isn't better.
Big data may sound awfully appealing. It does not need to be purposefully collected, and the concerns about having adequate power to detect effects that we face with small data disappear. What could go wrong? It turns out, a lot. Because of the sheer size of big data, there is a very high likelihood of obtaining significant findings merely by chance. Imagine that you are testing 10,000 correlations in your dataset versus just one. Your chances of a spurious finding multiply, to the point that a 95% confidence criterion is no longer usable. There are ways to adjust for such issues (like making the confidence criterion much more stringent), but this requires careful practices that researchers don't always follow. The practice of data dredging, or mining data for any possible correlations, is common. For example, the now-discontinued Google Flu Trends was hailed for predicting flu outbreaks from search terms up to two weeks faster than the CDC (which relied on hospital records). Then it failed massively in 2013. Its failure was traced to poor practices such as overfitting models with search terms, like "high school basketball," that were not truly related to flu outbreaks.
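To see how quickly chance findings pile up, here is a minimal sketch in Python. The 10,000 tests and 5% threshold mirror the example above; the assumption that the tests are independent and that every null hypothesis is actually true is mine.

```python
# Sketch: why a 95% confidence criterion breaks down when many tests are run.
# Assumes the tests are independent and that every null hypothesis is true.

alpha = 0.05        # conventional per-test false-positive rate (95% confidence)
n_tests = 10_000    # e.g., testing 10,000 correlations in a big dataset

# Probability of at least one spurious "significant" result by chance alone
familywise_error = 1 - (1 - alpha) ** n_tests
print(f"Chance of at least one false positive across {n_tests:,} tests: {familywise_error:.4f}")

# Expected number of false positives among the 10,000 tests
print(f"Expected false positives: {alpha * n_tests:.0f}")

# One common (and very stringent) adjustment: the Bonferroni correction,
# which divides the per-test threshold by the number of tests.
print(f"Bonferroni-adjusted threshold: {alpha / n_tests:.0e}")
```

With these numbers, at least one false positive is essentially guaranteed, roughly 500 are expected, and a finding would need p < 0.000005 to survive the Bonferroni correction.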
4. Big data can be unwieldy.
Can you imagine sifting through a dataset with millions of cases? It doesn't happen often. Currently, less than 0.5% of all data in the world has actually been analyzed. Big data is messy. Compared with small data, it is harder to organize, understand, and extract insights from. To analyze big data, researchers need technology that allows massive amounts of data to be processed quickly (e.g., parallel processing) as well as the appropriate skill set (e.g., training in statistics, machine learning, programming, and research methods).
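As one small illustration of what "parallel processing" can mean in practice, here is a sketch in Python using the standard multiprocessing module; the dataset, the chunk size, and the summarize_chunk function are all hypothetical stand-ins.

```python
# Sketch: processing a large dataset in parallel by splitting it into chunks.
# The data and the per-chunk computation below are placeholders.
from multiprocessing import Pool

def summarize_chunk(chunk):
    """Toy per-chunk computation: count cases and sum a numeric field."""
    return len(chunk), sum(chunk)

if __name__ == "__main__":
    # Stand-in for millions of cases; in practice the data would be read in
    # chunks from disk or a distributed store, not held in one Python list.
    data = list(range(1_000_000))
    chunk_size = 100_000
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool() as pool:  # one worker per CPU core by default
        partials = pool.map(summarize_chunk, chunks)

    total_n = sum(n for n, _ in partials)
    total_sum = sum(s for _, s in partials)
    print(total_n, total_sum)
```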
In addition, researchers face the challenges of storing such large amounts of data and keeping it secure and free of participants' identifying information. Data breaches are common. For instance, earlier this year, researchers scraped data from OkCupid and published it on the Open Science Framework, a public, collaborative data-sharing site, arguing that the data was already publicly available. The data included OkCupid users' usernames, profile information, and answers to personal questions (e.g., about drug use and sexual preferences), which rendered many of the participants identifiable and caused outrage.
5. Big data is correlational.
Big data lends itself to finding patterns and relationships; it is far more limited when it comes to testing causation, because manipulating variables and controlling for confounding factors is difficult at such a scale. That is not to say that experiments cannot be done with big data. For instance, A/B tests, or comparisons between alternatives such as different web interfaces, can be run relatively easily. However, small data generally provides a more suitable environment for experiments: the potential effects of changes can be tested before they are implemented, which is far less risky.
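For concreteness, here is a minimal sketch in Python of how such an A/B test might be analyzed with a two-proportion z-test; the visitor counts and conversion numbers are made up.

```python
# Sketch: a simple A/B test comparing conversion rates of two web interfaces.
# Visitor and conversion counts below are hypothetical.
from math import sqrt
from statistics import NormalDist

visitors_a, conversions_a = 10_000, 520   # interface A
visitors_b, conversions_b = 10_000, 585   # interface B

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b

# Two-proportion z-test: is B's conversion rate different from A's?
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided

print(f"A: {p_a:.2%}  B: {p_b:.2%}  z = {z:.2f}  p = {p_value:.4f}")
```

Of course, the caution from point 3 still applies: run enough of these tests and some will come out significant by chance.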
It is unquestionable that big data is in our future. But does that mean that small data is obsolete? Not at all. Whereas big data tells us, in great detail, what happens, small data can better tell us why. Both big data and small data have a place in research, and each has its unique strengths and weaknesses. Together, the two approaches can enhance our insight into how the world works.
How is big data changing the way you do research?