As a starting point, here's Kirkpatrick defining Big Data: "[B]ig data is a term that has come into vogue only in the last couple of years, and it refers to the tremendous explosion in volume and velocity and variety of digital data that is being produced around the world. The statistics are somewhat astonishing: there was more data produced in 2011 alone than in all of the rest of human history combined back to the invention of the alphabet."
The May 2012 report offers this comment (footnotes and references to figures omitted): "The world is experiencing a data revolution, or “data deluge”. Whereas in previous generations, a relatively small volume of analog data was produced and made available through a limited number of channels, today a massive amount of data is regularly being generated and flowing from various sources, through different channels, every minute in today’s Digital Age. It is the speed and frequency with which data is emitted and transmitted on the one hand, and the rise in the number and variety of sources from which it emanates on the other hand, that jointly constitute the data deluge. The amount of available digital data at the global level grew from 150 exabytes in 2005 to 1200 exabytes in 2010. It is projected to increase by 40% annually in the next few years ... This rate of growth means that the stock of digital data is expected to increase 44 times between 2007 and 2020, doubling every 20 months."
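The growth figures in that passage can be checked with a little compound-growth arithmetic. Here is a minimal sketch in Python, assuming constant compound growth between the quoted endpoints; the function names are mine, not the report's.

```python
import math


def annual_growth_rate(start, end, years):
    """Constant annual growth rate that turns `start` into `end` over `years`."""
    return (end / start) ** (1 / years) - 1


def doubling_time_months(rate):
    """Months needed to double at a constant annual growth rate."""
    return 12 * math.log(2) / math.log(1 + rate)


# 150 exabytes in 2005 to 1,200 exabytes in 2010 (figures quoted above)
rate_2005_2010 = annual_growth_rate(150, 1200, 5)

# A 44-fold increase between 2007 and 2020 (figure quoted above)
rate_2007_2020 = annual_growth_rate(1, 44, 2020 - 2007)

# The projected 40% annual increase (figure quoted above)
projected_rate = 0.40

for label, rate in [
    ("2005-2010 observed", rate_2005_2010),
    ("2007-2020 implied by 44x", rate_2007_2020),
    ("projected 40% per year", projected_rate),
]:
    print(f"{label}: {rate:.1%} per year, "
          f"doubling every {doubling_time_months(rate):.1f} months")
```

Under that assumption, going from 150 to 1,200 exabytes in five years is an eightfold rise, or three doublings, which works out to one doubling every 20 months and annual growth of roughly 52%. A 44-fold increase over the 13 years from 2007 to 2020 implies a slower average of roughly 34% a year, and steady 40% annual growth would double the stock about every 25 months. So the quoted figures seem to describe somewhat different periods and rates rather than a single growth path.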
The flood of data relevant for development issues includes four categories, according to Global Pulse: 1) "Data exhaust" created by people's transactions with digital services, including web searches, purchases, and mobile phone use; 2) "Online information" available in news media and social media, as well as job postings and e-commerce sites; 3) Physical sensors that look at landscapes, traffic patterns, weather, earthquakes, light emissions, and much else; 4) Citizen reporting, when information is submitted by citizens through surveys, hotlines, updating of maps, and the like.
Of course, there are enormous challenges in dealing with Big Data, including privacy concerns, the sheer size of the datasets, how quickly they are expanding, and how to digest and interpret them. But the potential for understanding what is happening much more quickly is becoming apparent. As Kirkpatrick says: "[W]e now live in this hyper-connected world where information moves at the speed of light, and a crisis can be all around the world very, very quickly, but we’re still using two- to three-year-old statistics to make most policy decisions. The irony is, we’re swimming in this ocean of digital data, which is being produced for free all around us."
Private sector firms like Google are already using Big Data. Some findings from public sector and research studies include:
- A country's GDP can be estimated based on light emissions at night, as perceived by satellites.
- Outbreaks of flu or cholera or dengue fever can be identified much more quickly by looking at web searches. Another study used Twitter mentions of earthquakes as a way to get a faster response to quakes.
- One study was able to predict where people were at any time with greater than 90% accuracy based on cell-phone records showing past movements. Another study in developing countries could predict income with 90% accuracy based on how often people top off the air time on their mobile phones. Kirkpatrick says: "Even if you are looking at purely anonymized data on the use of mobile phones, carriers could predict your age to within in some cases plus or minus one year with over 70 percent accuracy. They can predict your gender with between 70 and 80 percent accuracy."
- A study in Indonesia was able to approximate a consumer price index for basic foods by looking at comments on social media. (Apparently, Jakarta produces more tweets than any other city in the world.) Other studies have sought evidence on food shortages or food price volatility by looking at social media.
I confess that the social scientist within me finds the research possibilities here to be fascinating. Kirkpatrick says: "Now think about this, this is astonishing: the ability to see in real time where beneficiaries are can allow us to understand exactly where the population is that we need to reach, and if you combine that with information on the size of air-time purchases, you can tell how much money these people have. You start to be able to extract basic demographic information, population movement, and behavior data from this information while fully protecting privacy in the process.
What we’re focused on now is working with mobile carriers around the world, including in Indonesia, to get access to archives of anonymized call records and purchase records, because what we do is essentially correlate that data with official statistics. You look at the movement patterns, the mobile service consumption patterns, the social-network patterns that you can derive from how people interact and compare that to food prices, fuel prices, unemployment rates, disease outbreaks, earthquakes, and look at how a population was affected. Or, you compare it to when a program was initiated in the field or when a policy initiative got off the ground: did it actually work? The potential for monitoring and evaluation here as well is quite remarkable."
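To make the "correlate that data with official statistics" step concrete, here is a minimal sketch in Python. It is not Global Pulse's method, only the simplest version of the idea: a Pearson correlation between two monthly series. The variable names and every number below are invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical monthly values, invented for illustration only:
# average size of air-time purchases, and an official food price index.
avg_topup_size = [5.0, 4.8, 4.1, 3.9, 3.5, 3.6]
food_price_index = [100, 102, 108, 111, 115, 114]

r = pearson(avg_topup_size, food_price_index)
print(f"correlation between top-up size and food prices: {r:.2f}")
```

In this made-up example the correlation is strongly negative: air-time purchases shrink as food prices rise. A real analysis would need far more data and some care about causation, but a pattern like that is the kind of signal the quote describes looking for.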
Moreover, Kirkpatrick describes the effort by Global Pulse to find a middle ground between concerns about privacy and access to Big Data: "Right now, the conversation around big data is very polarized. You might call it 'Germany vs. Mark Zuckerberg.' You have the very conservative prohibition against reuse without explicit permission that has become pervasive in the European Union; it’s a very guarded approach. At the opposite end of the spectrum, you have companies that live on big data, which are saying privacy is dead, profit is king. We’re trying to insert a third pole into this debate, which is to say, big data is a raw public good. But to do that we have to create a kind of R & D sandbox where we can experiment with it and learn how to use it safely."
At least to me, many of the existing efforts to use Big Data seem interesting, but relatively small potatoes. As the existing data increases 40-fold in the next few years, along with techniques and capabilities to digest and analyze that data, challenges and possibilities will probably emerge that I can't even imagine now. The May 2012 report quotes a comment from social technology guru Andreas Weigend: "[D]ata is the new oil; like oil, it must be refined before it can be used."