You’ve probably heard the term big data used more frequently in the last several years. Big data refers to the recent explosion in the amount and scope of available data. A lot of discussion about big data focuses on its technical challenges, often summarized as the 5 Vs – volume, velocity, variety, veracity and value.
Four of the Vs – volume, velocity, variety and veracity – are emerging at unprecedented rates. For example, the growth of online social media has enabled everyone to contribute their observations, experiences, and opinions as data. With a world population of 7.2 billion people, there exist 5.7 billion profiles on the top 21 social media sites. According to Eric Schmidt, executive chairman of Google, people create more data every two days than was created from the beginning of time to 2003.
In another example, the latest advances in sensing technology, used in environmental, industrial and transportation settings, make it feasible to deploy capable and highly affordable sensors to monitor phenomena and processes of interest at a new scale and density.
Unlocking the potential of the other V – value – is still in its early stages, and is now possible thanks to greater data accessibility. Recent technology developments have been key to releasing the potential of big data and turning it from simply “information” into a source of strategic knowledge. The essential art of data-driven thinking is analogous to using a fork and knife to enjoy a steak. For example, compare buying a house today to 20 years ago; today, buyers use the Internet to find specs online (instead of waiting for their agent to feed this information to them) and use search engines to find out if the house is located in a high-crime or a desirable neighbourhood.
Companies and organizations have started to facilitate the second use of data, often drawn from multiple sources, by ordinary people. A few years ago, Google demonstrated a successful second use of data. Its search log data recorded what questions users asked the search engine, and was originally collected to improve the search engine’s answers. Google used the same data to predict the spread of winter flu, since some frequently asked questions may be associated with possible winter flu victims.
While this practice is now feasible for big companies, can we enable everyone to make good second use of data? The good news is that more and more big data is starting to become publicly available, through initiatives such as the Open Data Project and the Population Data BC project.
Is big data really new? Since the very early days, human beings have been aware of the importance of comprehensive observations and information – essentially big data – as a super-power. What makes the difference now is that big data has become accessible to people.
Accessibility is only the first step. More serious challenges come from the ability to analyze big data, which is unfortunately a more difficult skillset to acquire. Background and domain knowledge play a central role in deriving useful information and knowledge from big data. A person without sufficient medical training cannot derive a meaningful analysis from population data, even if the data is available.
When it comes to big data analysis, we still have a long road ahead in engaging people with big data.
Marshall McLuhan said, “The medium is the message. … in the long run a medium’s content matters less than the medium itself in influencing how we think and act.” Big data is a current medium. Big data changes the way we live, the way we think, and the way we do business. Fundamentally, everyone in this world will become more and more deeply involved in producing and consuming data.
Of course, there are many opportunities in big data education. A few educational programs specifically on big data have started recently, many of them advanced programs that require technical capabilities. However, big data will change education even for kids. We should educate our kids about the value of data, the value of contributing data, and the value of analyzing data. In parallel, we should teach them the value of privacy and respect for privacy.
We are “datafied” by ourselves and characterized by data. Imagine: in 1,000 years, the historians of the future will have detailed information about our time by analyzing the tweets we send and read.
About the author
Jian Pei is a professor in computing science at Simon Fraser University. He is a Fellow of the Institute of Electrical and Electronics Engineers and ranks among the Top 10 most cited researchers worldwide in the field of data mining.