We are re-publishing this series of articles by Dr. Richers by popular demand!
This is the first part of a three-part series based on a talk given for the Institute of Practical Philosophy at Vancouver Island University in April 2015.
The views expressed in this series are my own.
Why talk about big data?
The topic of big data captures many of the issues we face in a time of rapid technological change. Some of the examples discussed here might immediately strike you as good applications of technology–they help human beings lead better lives–but others look like they infringe on basic rights without any recourse. Details about your life are now available to an extent that might make anyone concerned with individual privacy queasy. It is this range of issues surrounding big data, the presence of both beneficial and troubling uses, that makes the topic such an interesting starting point.
Big data and the habitual surveillance of our lives were interesting topics in themselves before all the commotion surrounding Bill C-51 started in Canada. Proposed mass surveillance legislation like Bill C-51 would not be possible without the technology underlying big data. Around the globe, legislation like Bill C-51 fits into the framework of a larger trend that changes or even undermines some of the basic notions of what it means to have individual privacy. Bill C-51 is very likely to become law in Canada, but the discussion of this trend is not dictated by the passage of any one set of laws. It is a conversation about individual privacy which we have only just begun.
There is also a practical aspect to all of this: Big data is not some magical black box but a very specific technology. This post covers just enough of the technology to give you an idea of how it might work and why big data lets you do things that were simply not possible with the technology of even a few short years ago. Overall, the message is this: Something has fundamentally changed in the last ten years and we are very unlikely to ever go back. So the question becomes: How do we cope with this change?
What is big data?
Before we get into some examples, let’s briefly talk about what big data is. Put simply, think of big data as living with, and doing something with, trillions of gigabytes of data. To put this figure into some perspective:
“In 2013, there were almost as many bits in the Digital Universe as stars in the physical universe”–IDC
The figures on which this claim is based are from an IDC report in 2010 and they represent what IDC calls the size of the ‘digital universe.’ The size of the digital universe is “the amount of digital information created and replicated in a year” (Source). Sometime in 2010, humanity surpassed 1 zettabyte for the first time ever, which is one trillion gigabytes or 10^21 bytes. In 2013, the date of the quote, the digital universe was 4.4 zettabytes in size.
Any claim like this needs to be taken with a grain of salt. Is it reasonable? The number of stars in the observable universe is somewhere between 10^22 and 10^24 (Source). Against that range, the claim in the quote is somewhat optimistic, but it holds up against the lower end of the range (4.4 × 10^21 bytes × 8 = 3.5 × 10^22 bits). Keep in mind that the observable universe is considerably smaller than the entire universe.
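The back-of-the-envelope arithmetic above can be checked in a few lines. The figures used here are the ones from the text (4.4 zettabytes in 2013, 10^22 to 10^24 stars); nothing else is assumed:

```python
# Back-of-the-envelope check of the "bits vs. stars" claim.
ZETTABYTE = 10**21  # bytes

digital_universe_bytes = 4.4 * ZETTABYTE   # IDC figure for 2013
digital_universe_bits = digital_universe_bytes * 8  # about 3.5e22 bits

# Estimated number of stars in the observable universe.
stars_low, stars_high = 10**22, 10**24

print(f"Bits in the 2013 digital universe: {digital_universe_bits:.2e}")
print(f"Within the star estimate range: "
      f"{stars_low <= digital_universe_bits <= stars_high}")
```

The bit count lands just above the lowest star estimate, which is why the quote is plausible but optimistic.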
In the image on the right, the small circle in red for 2009 represents the most data humankind had ever generated in a year, roughly 0.8 trillion gigabytes. In the 11 short years between 2009 and 2020, that figure will grow to 35 zettabytes, represented by the circle in orange. That is 44 times the amount of data generated in 2009, which in turn was itself the most data ever generated in a single year since the dawn of humanity. Just think about this sheer volume of data! And now compare the difference in scale to, say, the early 2000s, when the total annual volume of data was still ‘only’ in the petabyte range. What looked like a Herculean amount of data just a few short years ago is now absolutely dwarfed by the amount of data that will be generated annually by 2020.
As it turns out, the estimate was too low: In 2014, IDC revised the projected size of the digital universe upwards. The estimate for 2020 now sits at 44 zettabytes instead of 35 zettabytes (Source).
Every rain drop is counted, every movement, every change
How do you arrive at these massive volumes of data? The answer is simple: you collect absolutely everything. Data collection is everywhere, for everything, and it is growing rapidly.
For example: The twin engines on a Boeing 737 each generate about 20 TB of operational data an hour. A six-hour flight therefore generates 240 TB of data (20 TB/hour × 2 engines × 6 hours). One plane, one flight, and this figure does not include any data from other aircraft systems or any of the proposed live streaming of data. If you add it all up, in the US alone aircraft engine data might amount to 2.5 billion TB a year across all commercial flights (Source). This is a tremendous amount of data, generated by nothing other than operational data logging for aircraft engines.
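The per-flight arithmetic can be spelled out explicitly. The figures are the ones from the example above; the two-week total is the same multiplication carried forward, assuming one such flight per day:

```python
# The engine-data arithmetic from the Boeing 737 example.
TB_PER_ENGINE_HOUR = 20
ENGINES = 2
FLIGHT_HOURS = 6

per_flight_tb = TB_PER_ENGINE_HOUR * ENGINES * FLIGHT_HOURS
print(f"One six-hour flight: {per_flight_tb} TB")  # 240 TB

# Assuming one such flight a day, two weeks of logs add up quickly.
two_weeks_tb = 14 * per_flight_tb
print(f"Two weeks of flights: {two_weeks_tb} TB "
      f"({two_weeks_tb / 1000:.2f} PB)")  # 3360 TB, about 3.4 PB
```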
Aviation is just one example: We live in a world full of devices that generate data, and many are connected, though commercial aircraft strangely still have a ways to go in that regard: you have to physically retrieve the aircraft to get some critical operational data. In our connected world, there are many examples of devices that generate or consume large volumes of data. Sometime before 2010, the number of connected devices in the world exceeded the number of human beings on the planet, and we are on the way to 50 billion connected devices by 2020 (Source). Just think about this: There are more connected devices on this planet today than there are people. All of these devices either consume data, generate data, or do both. And it is not just devices that generate data; lots of software does, too. Things that generate data include:
- Devices that are part of the Internet of Things, such as sensors and smart meters
- Smartphones, tablets, and more recently smartwatches
- Distance sensors on cars, car black boxes that record engine vitals, speed, direction, and location
- Social media (think tweets or trending topics)
- Financial transactions and other business application data
- Metadata, for example for phone calls or in pictures (what, when, where, who)
“Data is the new oil”
If you collect everything, you will never run out of data, but you might well drown in it. So why do we collect all of this data? The answer is simple: Because data is valuable. The immediate corollary to this claim is that whilst data is valuable, it is not valuable by itself. You also need to do something with it, much like another resource, crude oil:
“Data is the new oil. Data is just like crude. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down, analysed for it to have value.”–Clive Humby
The thing you do with the data, in this case, is statistical analysis, which has become known as analytics and sometimes big data analytics. Much of what this talk covers relates to this statistical analysis and to doing new things you could not do before, because you had neither the requisite data nor, often, the tools to work with such sheer volumes of data.
Going back to the aviation example, suppose you have a plane with an intermittent issue in one of its systems that appears in the log data. How do you correlate a blip in the engine data from, say, two weeks ago with another blip in the data today? Of the roughly 3.4 petabytes of data you collected during that time (14 × 240 TB), which data matters? You need to find the needle in the haystack that might tell you a component is experiencing early failure. If you do not have the tools to find that needle, you might not find out until it is potentially too late. And speaking from experience with these volumes of data, intermittent issues are always the hardest to diagnose: they are exactly the kind of problem where you need data for everything and also, critically, the tools to analyze that data effectively.
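A toy illustration of that needle-in-the-haystack problem: flag readings in a parameter stream that deviate sharply from the recent baseline. The window size, threshold, and sample values here are all illustrative, not from any real aircraft system:

```python
# Flag readings that deviate sharply from the mean of the preceding window.
from statistics import mean, stdev

def find_blips(readings, window=5, threshold=3.0):
    """Return indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` readings."""
    blips = []
    for i in range(window, len(readings)):
        base = readings[i - window:i]
        mu, sigma = mean(base), stdev(base)
        if sigma > 0 and abs(readings[i] - mu) > threshold * sigma:
            blips.append(i)
    return blips

# A mostly steady signal with one intermittent spike at index 6.
signal = [100.1, 99.8, 100.0, 100.2, 99.9, 100.1, 250.0, 100.0, 99.9, 100.2]
print(find_blips(signal))  # [6]
```

The real challenge is doing something like this across petabytes rather than ten values, which is precisely where the new tooling comes in.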
A definition of big data
What sets big data apart from, say, any old data you collect? Big data is really an umbrella term that includes many different things, but there are certain generally agreed upon criteria that must be met. These criteria were first defined as the ‘three vees’:
- Volume – how much data you have to deal with
- Velocity – how fast data is coming in (or has to go out again after you process it)
- Variety – the different data types and variety of data sources
Much of big data is unstructured, in the sense that it does not fall neatly into the rows and columns you might be used to from a traditional relational database. Big data can readily include sensor data, video streams, and sound recordings, along with many other varieties of computer data.
Since the original three vees, the definition of big data has been updated somewhat by the research firm Gartner to reflect that new forms of processing–the tools already mentioned–are needed to cope with the volume of data:
“Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” (Source)
There are sometimes additional vees as well, such as veracity, which looks at how truthful or reliable your data is: You might have lots and lots of data, but if you cannot trust the data, you cannot perform your analysis in good faith.
New forms of processing
Now, let’s talk briefly about these “new forms of processing” mentioned in the definition by Gartner. What are they? The easiest way to draw the distinction is to compare big data processing to a traditional data warehouse. In a data warehouse, you clean and extract the data that you want to work with and then you store it in the data warehouse. You run queries against the data in languages such as SQL and you return the results of these queries to the user:
In big data stream processing, you might never store the data in a data warehouse before you do something with it. You manipulate and work with the data on the fly, in real time and perhaps even in memory. Data might still end up being stored on a file system somewhere, but in real-time analytics, you try to extract insights from the data as quickly as you can.
Note that this distinction is somewhat artificial: big data processing might well interact with a data warehouse on the fly, and there are other forms of big data processing that are not stream processing, such as batch analytics. Just keep in mind that the earlier claim that big data encompasses many things applies to the tools as well: there are many different ones.
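The contrast between the two models can be sketched with a running average standing in for “analytics”. This is a minimal illustration, not any particular product’s API: the warehouse-style function stores everything and queries it afterwards, while the stream-style class updates its answer one reading at a time without keeping the raw data:

```python
# Warehouse-style: store everything first, then query the stored data.
def batch_average(stored_readings):
    return sum(stored_readings) / len(stored_readings)

# Stream-style: update the result on the fly, one reading at a time,
# without retaining the raw readings.
class StreamingAverage:
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count  # incremental update
        return self.mean

readings = [3.0, 5.0, 7.0, 9.0]

stream = StreamingAverage()
for r in readings:
    latest = stream.update(r)

print(batch_average(readings), latest)  # both arrive at 6.0
```

Both approaches reach the same answer here; the difference is that the streaming version has a current answer after every single reading, which is the property that matters for the perishable insights discussed below.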
The perishable insight
The reason we need these new forms of processing is that previous methods are often just not fast enough when faced with such vast amounts of data. What matters here is speed: in the context of big data, taking one minute to arrive at an insight is often too late. By comparison, a traditional data warehouse query might take several hours. Insight at the speed of data is sometimes called a perishable insight, which is to say that an insight is not useful if it takes you too long to get there. Think of it as knowledge with a best-before date, much like milk: once the best-before date has passed, the milk may have gone bad and may well no longer be consumable as-is.
Examples of perishable insights include:
- Alerting you of an accident ahead before you get there–not useful if I tell you after you have become part of a multi-car pileup
- Offering a loyalty programme discount to a customer passing by a store location–not useful if I tell you after you get home
- Looking up when the next bus is coming on Translink in Vancouver–not useful after I have taken the bus already
Another good example of an insight that has perished is when you buy something on Amazon and for the next week or so you continue to get ads for the very thing you already bought. For some kinds of items, that might be useful–say, items in a series of collectible toys–but for something you buy only infrequently, such as appliances or electronic devices, these ads are more nuisance than help: I have already bought a washer-dryer and I do not really want another one.
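The best-before-date idea can be sketched as a freshness check on an insight. The class name and time-to-live values are illustrative:

```python
# A perishable insight modeled as a message with a best-before timestamp.
import time

class Insight:
    def __init__(self, message, ttl_seconds):
        self.message = message
        self.expires_at = time.monotonic() + ttl_seconds

    def is_fresh(self):
        """An insight is only worth acting on before its best-before date."""
        return time.monotonic() < self.expires_at

# Fresh: an accident warning is useful for the next minute or so.
accident_alert = Insight("Accident 2 km ahead", ttl_seconds=60)
print(accident_alert.is_fresh())  # True

# Perished: a bus arrival time from five minutes ago is worthless.
stale = Insight("Bus arriving now", ttl_seconds=-300)
print(stale.is_fresh())  # False
```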