The Joy of Data (2016) - BBC - Documentary
It sometimes seems we're being deluged with data. Wave upon wave of news and messages. Submerged by step counts. Constantly bailing out to make room for more.
We buy it, surf it, occasionally drown in it and with modern technology quantify ourselves and everything else with it. Data is the new currency of our time. Data has become almost a magic word for...anything.
Crime and lunacy and literacy and religion and...drunkenness. You name it, somebody was gathering information about it. It offers the ability to be transformationally positive. It's, in one sense, just the reduction in uncertainty. So what exactly is data? How is it captured, stored, shared and made sense of? The engineers of the data age are people that most of us have never heard of, despite the fact that they brought about a technological and philosophical revolution, and created a digital world that the mind boggles to comprehend. This is the story of THE word of our times...
..how the constant flow of more and better data has transformed society... ..and is even changing our sense of ourselves. I can't believe this is my life now. So come on in, because the water's lovely. My name is Hannah Fry, I'm a mathematician, and I'd like to begin with a confession. I haven't always loved data.
The truth is mathematicians just don't really like data that much. And for most of my professional life I was quite happy sitting in a windowless room with my equations, describing the world around me. You can capture the arc of a perfect free kick or the beautiful aerodynamics of a race car. The mathematics of the real world is clean and ordered and elegant, everything that data absolutely isn't. There was one moment that helped to change my mind.
It was in 2011 when I came across a little game that a teenage Wikipedia user called Mark J had invented. Now, Mark noticed that if you hit the first link in the main text of any Wikipedia page and then do the same for the next page, a pattern emerges. So the page for data, for example, links through "set" to "maths" to "quantity" to "property" and then "philosophy", which after a few more links will loop back onto itself. Now, the page "egg" ends up in the same place, and even that famously philosophical boyband One Direction will take you all the way through to "philosophy", although you have to go through "science" to get there. The same goes for "fungi", or "hairspray", "marmalade", even "mice", "dust" and "socks". It was a very strange finding and it called for some statistics.
Another Wikipedia user, Il Mare, wrote a computer program to try and investigate this phenomenon. Now, he discovered, amazingly, that for almost 95% of Wikipedia pages, you will end up getting to "philosophy" eventually. Now, that's pretty cool, but how did it change my mind about data? Well, the pattern that Mark J discovered and the data that was captured and analysed, it revealed a hidden mathematical structure, because Wikipedia is just a network with loops and chains hidden all over the place and it's something that can be described beautifully using mathematics.
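The first-link game is easy to sketch in a few lines of Python. This is a toy version, following links through a small hand-made page graph rather than live Wikipedia, and the link table is invented for illustration:

```python
# A toy "first link" table: each page maps to the first link in its main
# text. These entries are hypothetical, standing in for real Wikipedia.
FIRST_LINK = {
    "data": "set",
    "set": "maths",
    "maths": "quantity",
    "quantity": "property",
    "property": "philosophy",
    "egg": "food",
    "food": "substance",
    "substance": "property",
    "philosophy": "knowledge",
    "knowledge": "philosophy",   # the chain loops back on itself
}

def chain(page, target="philosophy", max_hops=100):
    """Follow first links from `page` until we hit `target`,
    run out of links, or exceed max_hops."""
    path = [page]
    while page != target and page in FIRST_LINK and len(path) <= max_hops:
        page = FIRST_LINK[page]
        path.append(page)
    return path

print(chain("data"))  # ['data', 'set', 'maths', 'quantity', 'property', 'philosophy']
print(chain("egg"))   # ['egg', 'food', 'substance', 'property', 'philosophy']
```

A script like Il Mare's would do the same walk for every page and count how many paths end at "philosophy".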
For me this was the perfect example of how there are two parallel universes. There's the tangible, noisy, messy one, the one that you can see and touch and experience. But there's also the mathematical one, where I think the key to our understanding lies.
And data is the bridge between those two universes. Our understanding of everything from cities to crime, global trade, migration and even disease... ..it's all underpinned by data. Take this for example. Rural Wiltshire and a dairy farm gathering data from its cows wearing pedometers. We can't be out here 24-7.
The pedometers help us to have our eyes and ears everywhere. It turns out when cows go into heat they move around a lot more than normal. Constant monitoring of their steps and some background mathematics reveal the prime time for insemination. We'll be able to look at the data and within 24 hours there'll be a greater chance of getting her in calf. Data-driven farming is now big business, turning a centuries-old way of life into precision science.
Pretty much every industry you can think of now relies on data. We all agree that we are undergoing a major revolution in human history. The digital world replacing the analogue world. A world based on data, made of codes, rather than a world made of biological or physical matter. That is extraordinary. Why philosophy at this stage? Because when you face extraordinary challenges, the worst thing you can do is to get close to it. You need to take a long run-up.
The bigger the gap, the longer the run-up. And the run-up is called philosophy. In the spirit of taking a long run-up, we'll start with the word itself. "Data" is originally from the Latin "datum", meaning "that which is given". Data can be descriptions... ..counts, or measures... ..of anything... ..in any format. It's anything that when analysed becomes information, which in turn is the raw material for knowledge, the only true path to wisdom.
Look at the data on data. And before the scientific and industrial revolution, the word barely gets a look in, in English. But then it starts to appear in print as scientists and the state gather, observe and create more and more of it. This arrival of the age of data would change everything. Industrial Revolution Britain.
For Victorians, booming industry and the growth of major cities were changing both the landscape and daily life beyond recognition. Into this scene stepped an unlikely man of numbers, William Farr, one of the first people to manage data on an industrial scale. William Farr had a quite unusual upbringing, in that he was actually the son of a farm labourer but who had managed to get a medical education, which was really very unusual for someone of his class. Farr very quickly became absorbed in the study of statistics. He was particularly interested, as you might expect for somebody with medical training, in public health, life expectancy and about causes of death.
For anyone interested in statistics, there was only one place to be. Somerset House in London was home to the General Register Office, where, in 1839, Farr found his dream job. From up there in the north wing, William Farr, the apothecary, medical journalist and top statistician, would really rule the roost. Now, this place was almost like a factory. Here, they would collect, process and analyse vast amounts of data. So in would come the census returns, the records of every single birth, death and marriage in the country, and out would come the big picture, the usable information that could help inform policy and reform society.
I think it's sometimes difficult for us to remember just how little people knew in the early 19th century about the changes that Britain was going through. So when Farr did an analysis of population density and death rate, he was able to show that life expectancy in Liverpool was absolutely atrocious. It was far, far worse than the surrounding areas.
This came as a surprise to a lot of people who believed that Liverpool, a coastal town, was actually quite a salubrious place to live. At Somerset House, Farr spearheaded a revolution in the systematic collection of data to uncover the real picture of this changing society. Its scale and ambition was described in a newspaper at the time.
"In arched chambers of immense strength and extent "are, in many volumes, the "genuine certificates of upwards of 28 million persons "born into life, married or passed into the grave." Here, every person was recorded equally. A revolutionary idea.
"Here are to be found the records of nonentities, "side-by-side with those once learned in the law "or distinguished in "literature, art or science." But what really motivated William Farr was not just data collection, it was the possibility that data gathered could be analysed to help overcome society's greatest ill. Cholera was probably the most feared of all of the Victorian diseases. The terrifying thing was that you could wake up in the morning and feel absolutely fine, and then be dead by the evening. Between the 1830s and the 1860s, tens of thousands died in London alone.
The control of infectious diseases like cholera, which no-one fully understood, became the greatest public-health issue of the time. However great London might have looked back then, it would have smelled absolutely terrible. At that point, the Victorians didn't have really a great way of disposing of human waste, so it would have flowed down the gutters into open sewers and out into the Thames. Now, the city smelt so bad that it was pretty plausible that the foul air was responsible for carrying the disease. Farr collected a huge range of data during each cholera outbreak to try to identify what put people most at risk from the bad air.
He used income-tax data to try and measure the affluence of the different boroughs that were affected by cholera. He asked his friends at the Royal Observatory to provide data on the temperature and climatic conditions. But the one that he thought was most convincing was about the topography. It was about the elevation above the Thames. Using the data, Farr suggested a mathematical law of elevation.
Its equations described how cholera mortality falls the higher you live above the Thames. Now, he published his report in 1852, which the Lancet described as one of the most remarkable productions of type and pen in any age and country. The only problem was that Farr's work, although elegant and meticulous, was fundamentally flawed. Farr stuck to the prevailing theory that cholera was spread by air. Such is the power of the status quo.
But in 1866, 5,500 people died in just one square mile of London's East End, and that data made Farr change his mind. When Farr came to write his next report, the data told a different story which proved the turning point in combating the disease. The common factor among those who died was not elevation or air but sewage-contaminated drinking water. With this new report, Farr may seem to have contradicted much of his own work, but I think that this is the perfect example of what data can do. It provides that bridge essential to scientific discovery, from theory to proof, problem to solution. Good data, even in huge volumes, does not guarantee that you will arrive at the truth.
But, eventually, when the weight of the data tips the balance, even the strongest-held beliefs can be overcome. Of course, it was the weight of the data itself which, with the dawn of the 20th century, was becoming increasingly hard to manage. Data stored long form in things like census ledgers could take the best part of a decade to process, meaning the stats were often out of date.
When you're dealing with figures like these, it's one thing. But when you're counting the population like this it's quite a different matter. A deceptively simple solution got what's now called the information revolution under way, encoding data as holes punched in cards. These cards are passed over sorting machines, each of which handles 22,000 cards a minute.
By the 1950s, data processing and simple calculations were routinely mechanised, laying the groundwork for the next generation of data-processing machines. They would be put to pioneering work in a rather unlikely place. In a grand London dining hall, a group of men and women, many in their 80s and 90s, have gathered for a special work reunion. At its peak, their employer, J Lyons, purveyor of fine British tea and cakes, had hundreds of tea shops nationwide. There are hundreds of items of food.
All these in a varying quantity each day are delivered to a precise timetable to the tea shops. These people aren't former J Lyons bakers or tea-shop managers. They were hired for their mathematical skills. Lyons had a huge amount of data which has to be processed, often very low-value data. So, for example, the transaction from a tea shop would be a cup of tea. But each one had a voucher and had to be recorded, and had to go to the accounts for business reasons and for management reasons.
Every calculation you did, not only you had to do it twice, but you had to get it checked by someone else as well. The handling of these millions and millions of pieces of data, the storage of that data, are the key of the business problem. The Lyons team took the world by surprise when, in 1951, they unveiled the Lyons Electronic Office, or Leo for short. At this point, only a handful of computers existed, and they were used solely for scientific and military research, so a business computer was a radical reimagining of what this brand-new technology could be for.
Each manageress has a standing order depending on the day of the week. She speaks by telephone to head office, where her variations are taken quickly onto cards. What the girl hears, she punches.
The programme is fed first, laying down the sequence for the multiplicity of calculations Leo will perform. It was the first opportunity to process large volumes of clerical work, take all the hard work out of it, and put it on an automatic system. Before Leo, working out an employee's pay took an experienced clerk eight minutes, but with Leo that dropped to an astonishing one and a half seconds. It was all so exciting because we were breaking new ground the whole time. Absolutely everything which we did has never been done before.
By anybody anywhere. I don't think we realised the kind of transformation we were part of. The post-war years saw a boom in the application of this new computing technology. Leo ran on paper, tape and cards, but soon machines with magnetic tape and disks were developed, allowing for greater data storage and faster calculations.
As more businesses and institutions adopted these new machines, application of mathematics to a whole host of new, real-world challenges took off. And the word "data" went from relatively obscure to ubiquitous. "Data" has become almost a magic word for anything. The truth is that it is a kind of interface today between us and the rest of the world.
In fact, between us and ourselves, we understand our bodies in terms of data, we understand society in terms of data, we understand the physics of the universe in terms of data. The economy, social science, we play with data, so essentially it is what we interact with most regularly every day. Data underpins all human communication, regardless of the format. And it was the desire to communicate effectively and efficiently that led to one of the most important academic papers of the 20th century. "A Mathematical Theory of Communication" has justifiably been called the Magna Carta for the information age. It was written by a very young and bright employee of Bell Laboratories, the American centre for telecoms research that was founded by one of the inventors of the telephone, Alexander Graham Bell.
Now, this paper was written by Claude Shannon in 1948 and it would effectively lay out the theoretical framework for the data revolution that was just beginning. Those that knew him described Shannon as a lifelong puzzle solver and inventor. To define the correct path it registers the information in its memory.
Later, I can put him down in any part of the maze that he has already explored and he will be able to go directly to the goal without making a single false turn. During World War II he worked on data-encryption systems, including one used by Churchill and Roosevelt. But at Bell Labs, Claude Shannon was trying to solve the very civilian problem of noisy telephone lines. # There's a call, there's a call # There's a call for you # There's a call on the phone for you. # In that analogue world of 20th-century phones, your speech was converted into an electrical signal using a handset like this and then transmitted down a series of wires.
The voice signals would travel along the wire, be detected by the receiver at the other end and then be converted back into sound waves to reach the ear of whoever had picked up. The problem was, the further the electrical signal travelled down the line, the weaker it would get. PHONE LINE CRACKLES Eventually you couldn't even hear the conversation for the amount of noise on the line. And you could boost the signal but it would mean boosting the noise, too.
Shannon's genius idea was just as simple as it was beautiful. The breakthrough was converting speech into an incredibly simple code. ON PHONE: Hello? First the audio wave is detected, then sampled. Each point is assigned a code of ones and zeros and the resulting long string of digits can then be sent down the wire with the zeros as brief low-voltage signals and ones as brief bursts of high voltage.
From this code, the original audio can be cleanly reconstructed and regenerated at the other end. ON PHONE: Hello? Shannon was the first person to publish the name for these ones and zeros, the smallest possible pieces of information, and they are called bits or binary digits, and the real power of the bit and the mathematics behind it applies way beyond telephones. They offered a new way for everything, including text and pictures, to be encoded as ones and zeros. The possibility to store and share data digitally in the form of bits was clearly going to transform the world. If anyone has to be identified as the genius who developed the foundational science of mathematics for our age, that is certainly Claude Shannon.
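The sample-and-encode step described above can be sketched as a toy pulse-code scheme: sample a tone, turn each sample into a short binary code, and rebuild the waveform from the codes. The eight-bit code length and the 64-point tone are arbitrary choices for the demo, not anything specific to Shannon's system:

```python
import math

def encode(signal, bits=8):
    """Quantise samples in [-1, 1] to n-bit binary codes (a toy PCM)."""
    levels = 2 ** bits
    codes = []
    for s in signal:
        level = min(levels - 1, int((s + 1) / 2 * levels))
        codes.append(format(level, f"0{bits}b"))  # e.g. '10110010'
    return codes

def decode(codes, bits=8):
    """Rebuild the waveform from the bit codes."""
    levels = 2 ** bits
    return [(int(c, 2) + 0.5) / levels * 2 - 1 for c in codes]

# Sample one cycle of a tone at 64 points and round-trip it.
tone = [math.sin(2 * math.pi * t / 64) for t in range(64)]
rebuilt = decode(encode(tone))
# The reconstruction error is bounded by half a quantisation step.
print(max(abs(a - b) for a, b in zip(tone, rebuilt)))
```

Because the wire only ever carries clean ones and zeros, a weak signal can be regenerated perfectly at each stage, which is exactly what the analogue line could not do.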
Now, one thing has to be clarified, the theory developed by Shannon is about data transmission and it has nothing to do with meaning, truth, relevance, importance of the data transmitted. So it doesn't matter whether the zero and one represent an answer to, "Heads or tails?", or to the question, "Will you marry me?", for a theory of information it is data anyway, and if it is a 50-50 chance that you will or will not marry me, or that it is heads or tails, the amount of information, the Shannon information, communicated is the same. Shannon information is not information like you or I might think about it. Encoding any and every signal using just ones and zeros is a pretty remarkable breakthrough. However, Shannon also came up with a revolutionary bit of mathematics.
That equation there is the reason you can fit an entire HD movie on a flimsy bit of plastic or the reason why you can stream films online. I'll admit, it might not look too nice, but... don't get put off yet, because I'm going to explain how this equation works using Scrabble. Imagine that I created a new alphabet containing only the letter A. This bag would only have A tiles inside it and my chances of pulling out an A tile would be one.
You'd be completely certain of what was going to happen. Using Shannon's maths, the letter A contains zero bits of what's called Shannon information. Let's say then I got a little bit more creative, but not much, and had an alphabet with two letters, A and B, and equal numbers of both in this bag. Now my chances of pulling out an A are going to be a half and each letter contains one bit of Shannon information. Of course, when transmitting real messages, you'll use the full alphabet. But English, as with every other language, has some letters that are used more frequently than others.
If you take a quite common letter like H, which appears about 5.9% of the time, this will have a Shannon information of 4.1 bits. And incidentally, a Scrabble score of four. Of course, there are some much more exotic and rare letters, like Z, for instance, which appears about 0.07% of the time. That gives it 10.5 bits and a Scrabble score of ten.
Bits measure our uncertainty. If you're guessing a three-letter word and you know this letter is Z, it gives you a lot of information about what the word could be. But if you know it's H, because it is a more common letter with less information, you're more uncertain about the answer.
Now if you wrap up all that uncertainty together, you end up with this, the Shannon entropy. It's the sum of the probability of each symbol turning up times the number of bits in each symbol. And this very clever bit of insight and mathematics means that the code for any message can be quantified. Not every letter, or any other signal for that matter, needs to be encoded equally. The digital code behind a movie like this one of my dog, Molly, for example, can usually be compressed by up to 50% without losing any information. But there's a limit.
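The two quantities just described, the information in one symbol and the entropy of a whole alphabet, are a couple of lines of Python each. The letter frequencies are the ones quoted above:

```python
from math import log2

def info_bits(p):
    """Shannon information of a symbol with probability p: -log2(p)."""
    return -log2(p)

def entropy(probs):
    """Shannon entropy: the sum over symbols of p times -log2(p)."""
    return sum(p * info_bits(p) for p in probs if p > 0)

# The figures quoted above: H appears ~5.9% of the time, Z ~0.07%.
print(round(info_bits(0.059), 1))    # 4.1 bits
print(round(info_bits(0.0007), 1))   # 10.5 bits

# The Scrabble-bag examples: two equally likely letters carry one bit
# each, and a one-letter alphabet carries none.
print(entropy([0.5, 0.5]))           # 1.0
print(entropy([1.0]))                # 0.0
```

The entropy is the theoretical floor for compression: no code can average fewer bits per symbol than this, which is why Molly's movie compresses so far and no further.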
Compressing more might make it easier to share or download, but the quality can never be the same as the original. DOG BARKS You can't really overstate the impact that Shannon's work has had, because without it we wouldn't have JPEGs or Zip files or HD movies or digital communications. But it doesn't just stop there, because while the mathematics of information theory doesn't tell you anything about the meaning of data, it does begin to open up a possibility of how we can understand ourselves and our society, because pretty much anything and everything can be measured and encoded as data. We say that signals flow through human society, that people use signals to get things done, that our social life is, in many ways, the sending back and forth of signals. So what is a signal? It's, in one sense, just the reduction in uncertainty. What it means to receive a signal is to be less uncertain than you were before and so, another way to think of measuring or quantifying signal is in that change in uncertainty.
Using Shannon's mathematics to quantify signals is common in the world of complexity science. It's rather less familiar to historians. I love maths, I love its precision, I love its beauty. I absolutely love its certainty, and Simon can bring that mathematical worldview, that mathematical certainty, to what I work with.
The reason behind this remarkable marriage between history and science is the analysis of the largest single body of digital text ever collated about ordinary people. It's the Proceedings of London's Old Bailey, the central criminal court of England and Wales, which hosted close to 200,000 trials between 1674 and 1913. There are 127 million words of everyday speech in the mouths of orphans and women and servants and ne'er-do-wells, of criminals, certainly, but also people from every rank and station in society. And that made them unique.
What's exciting about the Old Bailey and the size of the dataset, the length and magnitude of it, is that not only can we detect a signal, but we are able to look at that signal's emergence over time. Shannon's mathematics can be used to capture the amount of information in every single word, and like the alphabet, the less you expect a word, the more bits of information it carries. Imagine that you walk into a courtroom at the time and you hear a single word, the question we ask is how much information does that word carry about the nature of the crime being tried? You hear the word "the". It's common across all trials and so gives you no bits of information. Most words you hear are poor signals of what's going on. But then you hear "purse".
It conveys real information. Then comes "coin", "grab" and "struck". The more rare a word, the more bits of information it carries, the stronger the signal becomes. One of the clearest signals that we see in the Old Bailey, one of the clearest processes that comes out, is something that is known as the civilising process.
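The question the researchers ask, how many bits a word carries about the kind of trial it was heard in, can be sketched by comparing how often a word turns up in one class of trial against how often it turns up overall. The counts below are invented for illustration, not taken from the Old Bailey corpus:

```python
from math import log2

# Hypothetical word counts: occurrences in violent-crime trials versus
# in all trials. The numbers are made up to illustrate the idea.
all_trials     = {"the": 50000, "purse": 300, "struck": 120}
violent_trials = {"the": 10000, "purse": 30,  "struck": 110}
total_all, total_violent = 100000, 20000

def signal_bits(word):
    """Bits of information the word gives about the trial being violent:
    log2( P(word | violent) / P(word) )."""
    p_word = all_trials[word] / total_all
    p_word_given_violent = violent_trials[word] / total_violent
    return log2(p_word_given_violent / p_word)

print(round(signal_bits("the"), 2))     # 0.0: heard everywhere, no signal
print(round(signal_bits("struck"), 2))  # positive: points to violence
print(round(signal_bits("purse"), 2))   # negative: points away from it
```

Run over the whole 127 million words, decade by decade, this kind of measure is what lets the signal's emergence be traced over time.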
It's an increasing sensitivity to, and attention to, the distinction between violent and nonviolent crime. If, for example, somebody hit you and stole your handkerchief, in the 18th-century context, in 1780, you would concentrate on the handkerchief. More worried about a few pence worth of dirty linen than the fact that somebody just broke your nose or cracked a rib. The fact that 100 years later, by 1880, every concern, every focus, both in terms of the words used in court, but also in terms of what people were brought to court for, focus on that broken nose and that cracked rib, speaks to a fundamental change in how we think about the world and how we think about how social relations work.
Look at the strongest word signals for violent crime across the period. In the 18th century, the age of highwaymen, words relating to property theft dominate. But by the 20th century, it's physical violence itself and the impact on the victim that carry the most weight.
That notion that one can trace change over time by looking at language and how it's used, who deploys it in what context, that I think gives this kind of work its real power. There are billions of words, there's all of Google Books, there's every printed newspaper, there is every speech made in Parliament, every sermon given at most churches. All of it is suddenly data and capable of being analysed. The rapid development of computers in the mid 20th century transformed our ability to encode, store and analyse data. It took a little longer for us to work out how to share it. This place is home to one of the most important UK scientific institutions, although it's one you've probably never heard of before.
But since 1900, this place has advanced all areas of physics, radio communications, engineering, materials science, aeronautics, even ship design. NPL, the National Physical Laboratory, in south-west London is where the first atomic clock was built and where radar and the Automatic Computing Engine, or ACE, were invented. The ACE computer was the brainchild of Alan Turing, who came to work here right after the Second World War. Now, Turing's contributions to the story of data are undoubtedly vast, but more important for our story is another person who worked here with Turing, someone who arguably is even less well known than this place, Donald Davies.
Davies worked on secret British nuclear weapons research during the war... ..later joining Turing at NPL, climbing the ranks to be put in charge of computing in 1966. As well as the new digital computers, Davies had a lifelong fascination with telephones and communication. His mother had worked in the Post Office telephone exchange, so even when he was a kid, he had a real understanding of how these phone calls were routed and rerouted through this growing network, and that was the perfect training for what was to follow.
What was Donald Davies like, then? He was a super boss because he was very approachable. Everybody realised he'd got huge intellect but not difficult with it. Very nice guy.
Davies' innovation was to develop, with his team, a way of sharing data between computers, a prototype network. Donald had spotted that there was a need to connect computers together and to connect people to computers, not by punch cards or paper tape or on a motorcycle, but over the wires, where you can move files or programs, or run a program remotely on another computer, and the telephone network is not really suited for that. In the pre-digital era, sending an encoded file along a telephone line meant that the line was engaged for as long as the transmission took. So the opportunity here was because we owned the site, 78 acres with some 50 buildings, we could build a network. Davies' team sidestepped the telephone problem by laying high-bandwidth data cables before instituting a new way of moving data around the network.
The technique he came up with was packet switching, the idea being that you take whatever it is you're going to send, you chop it up into uniform pieces, like having a standard envelope, and you put the pieces into the envelope and you post them off and they go separately through the network and get reassembled at the far end. To demonstrate this idea, Roger and I are convening NPL's first-ever packet-switching data-dash... ..which is a bit more complicated than your average sports-day event. The course is a data network. There are two computers, represented here as the start and finish signs.
Those computers are connected by a series of network cables and nodes. In our case, cables are lines of cones and the connecting nodes are Hula Hoops. Having built it, all we need now are some willing volunteers. And here they are. NPL's very own apprentices.
So welcome to our packet-switching sports day. We've got two teams, red and blue. 'Both teams are pretending to be data 'and they're going to have to race.' You're going to start over there where it says "start", kind of obvious, and you're trying to get through to the end as quickly as you possibly can. You can't just go anywhere, you have to go through these hoops to get to the finish line, these little nodes in our network. You're only allowed to travel along the lines of the cones, but only if there's nobody else along that line.
All clear? OK, there is one catch. All of you who are in the red team, we are going to tie your feet together. So you've got to travel round our network as one big chunk of data. Those of you who are in blue, you are allowed to travel on your own, so it's slightly easier. 'The objective is for both teams to deposit their beanbags 'in the goal in the right order, one to five.'
EXCITED CHATTER Get in the hoop! Get in the hoop! Bring out your competitive spirit here. We've got packets versus big chunks of data. I'm going to time you. Everyone ready? OK, over to you, Roger. TOOT! Remember, you can't go down the route until it's clear.
'The red and blue teams are exactly the same size, 'let's say five megabytes each. 'But their progress through the network is clearly very different.' THEY LAUGH OK, blues, you took 13 seconds, pretty impressive. Reds, 20 seconds.
That's a victory for the packet switchers. Well done, you guys! Well done, you guys. The impact that packet switching has had on the world, I mean, it sort of came from here and then spread out elsewhere. It did indeed, we gave the world packet switching, and the world, of course, being America, they took it on and ran with it. This little race, Donald Davies' packet switching, was adopted by the people that would go on to build the internet, and today, the whole thing still runs on this idea. Let's say I want to e-mail you a picture of Molly.
First, it will be broken up into over 1,000 data packets. Each one is stamped with the address of where it's from and where it's going to, which routers check to keep the packets moving. Regardless of the order they arrive, the image is reassembled, and there she is. This is quite a cool thing, right, that you've got one of the original creators of packet switching right here and you can ask him... Every time you're like... Well, do anything, really. "Why is my internet running so slowly?" THEY LAUGH Don't ask me! We've come a very long way in just a few decades.
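The chop-stamp-shuffle-reassemble round trip just described fits in a few lines. This is a toy model, with a three-byte "packet" so the pieces are visible, not a real network protocol:

```python
import random

MTU = 3  # a tiny packet payload, just so the demo stays readable

def packetise(data, src="me", dst="you"):
    """Chop data into numbered packets, each stamped with its addresses."""
    return [{"src": src, "dst": dst, "seq": i, "payload": data[i:i + MTU]}
            for i in range(0, len(data), MTU)]

def reassemble(packets):
    """Rebuild the message whatever order the packets arrived in."""
    ordered = sorted(packets, key=lambda p: p["seq"])
    return b"".join(p["payload"] for p in ordered)

message = b"a picture of Molly"
packets = packetise(message)
random.shuffle(packets)       # packets take different routes...
print(reassemble(packets))    # ...but the message comes back intact
```

The sequence number plays the role of the envelope label in Roger's description: the network is free to deliver the envelopes in any order it likes.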
Around 3.4 billion people now have access to the internet at home and there are around four times the number of phones and other data-sharing devices online, the so-called Internet of Things. Just by being alive in the 21st century with our phones, our tablets, our smart devices, all of us are familiar with data. Really embrace your inner nerd here, because every time you wander around looking at your screen, you are gobbling up and churning out absolutely tons of the stuff. Our relationship with data has really changed - it's no longer just for specialists, it's for everyone.
There's one city in the UK that's putting the sharing and real-time analysis of data at the heart of everything it does - Bristol. Using digital technology, we take the city's pulse. This data is the route to an open, smart, liveable city, a city where optical, wireless and mesh networks combine to create an open, urban canopy of connectivity. Taking the pulse of the city under a canopy of connectivity might sound a bit sci-fi, or like something from a broadband advert.
But hold on to your cynicism for a second, because Bristol is trying to build a new type of data-sharing network for its citizens. There's a city-centre area which now has next-generation or maybe the generation after next of superfast broadband and then that's coupled to a Wi-Fi network, as well. The question is, what can you do with it? We would have a wide area network of very simple Internet of Things sensing devices that just monitor a simple signal like air quality or traffic queued in a traffic jam. Once you've got all this network infrastructure, you can get an awful lot, a really huge amount of data arriving to you in real time.
What's happening here is a city-scale experiment to try and develop and test what's going to be called the programmable city of the future. It relies on Bristol's futuristic network, vast amounts of data from as many sensors as possible and a computer system that can simulate and effectively reprogram the city. The computer system can intervene. It could reroute traffic and we can actually radio out to individuals, so maybe they get a message on their smartphone or perhaps a wrist-mounted device, saying, "If you have asthma, perhaps you should get indoors."
Once you create that capacity for anything and everything in the city to be connected together, you can really start to re-imagine how a city might operate. We are starting to experiment with driverless cars and, in order for driverless cars to work, they have to be able to communicate with the city infrastructure. So, your car needs to speak to the traffic lights, the traffic lights need to speak to the car, the cars to speak to each other. All of that requires a completely different set of infrastructure.
Of course, as the amount of data a city can share grows, the computing power needed to do something useful with it must grow, too. And for that, we have the cloud. For example, imagine trying to analyse all of Bristol's traffic data, weather and pollution data on your home computer. It could take a year.
Well, you could reduce that to a day by getting 364 more computers, but that's expensive. A cheaper option is sharing the analysis with other computers over the internet, which Google worked out first, but they published the basics and now free software exists to help anyone do the same. Big online companies rent their spare computers for a few pence an hour. So, now anyone like me or you can do big data analytics quickly for a few quid. Such computing power is something we could never have dreamt of just a few years ago, but it will only fulfil its potential if we can share our own data in a safe and transparent way.
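That divide-and-conquer speed-up is the idea behind Google's MapReduce, whose basics they published and which free software such as Hadoop reimplemented. It can be caricatured with Python's standard library; the traffic "analysis" here is a made-up stand-in, and in the cloud the workers would be rented machines rather than local processes.

```python
# Toy map-reduce: split the data, analyse chunks in parallel, combine.
from multiprocessing import Pool

def analyse_chunk(readings):
    """Stand-in for a day's traffic analysis: count congested readings
    (speeds under 10 mph). The real workload would be far heavier."""
    return sum(1 for speed in readings if speed < 10)

def parallel_analyse(readings, workers=4):
    """Split the data into chunks, analyse them in parallel (the 'map'
    step), then combine the partial results (the 'reduce' step)."""
    size = max(1, len(readings) // workers)
    chunks = [readings[i:i + size] for i in range(0, len(readings), size)]
    with Pool(workers) as pool:
        partials = pool.map(analyse_chunk, chunks)
    return sum(partials)

if __name__ == "__main__":
    speeds = [3, 25, 8, 40, 5, 60, 9, 30]
    print(parallel_analyse(speeds))  # → 4 congested readings
```

The point of the published design was that the split-analyse-combine pattern is the same whether the workers are cores on one machine or 365 computers rented by the hour.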
If Bristol Council wanted to know where your car was at all times but could use that information to sort of minimise traffic jams, how would you feel about something like that? Er, I'm not sure if I'd particularly like it. I think it is up to me where I leave my car. I understand the idea of justifying it with all these great other ideas, but I still probably wouldn't like it very much. If they are using it for a better purpose, then yeah, but one should know how they are using it and why they'll be using it, for what purpose.
I'd like to imagine a world in which all the data that was retained was used for the greater good of mankind, but I can't imagine a circumstance like that in the world that we have today. We live in a modern society, where if you don't let your data out there, not in the public domain, but in a secure business domain, then you can't take part in society, really. Unsurprisingly, people are pretty wary about what happens to their data. We need to be careful that civil liberties are not eroded, because otherwise the technology is likely to be rejected. I think it's an area where we as a society have yet to fully understand what the correct way forward is and therefore it is very much a discussion.
It's not a lecture, it's not a code, it's one where we are co-producing and co-forming these sorts of rules with people in the city, in order to sort of help us work out what the right and wrong things to do are. It will be intriguing to watch Bristol grapple with the technological and ethical challenges of being our first data-centric city. In all these contexts, Internet of Things... ..new forms of health care, smart cities, what we're seeing is an increase in transparency. You can see through the body, you can see through the house, you can see through the city and the square, you can see through society.
Now, transparency may be good. Handled carefully, it can let us extract the value from those data to improve your lifestyle, your social interactions, the way in which your city works and so on. But that careful handling matters, because it's touching the ultimate nerve of what it means to be human. So how much data should you give away? Traffic management is one thing, but when it comes to health care, the stakes, the risks and benefits, are even higher. And in Bristol, with a project called Sphere, they're pushing the boundaries here, too.
The population is getting older, and an ageing population needs more intensive health care, but it's very difficult to pay for that health care in institutions, paying for nurses and doctors. So, the key insight of the Sphere team was that it's now possible to arrange, in a house, lots of small devices, where each device monitors a simple set of signals about what's going on in that house. There might be monitors for your heart rate or your temperature, but there might also be monitors that notice, as you're going up and down stairs, whether you're limping or not. They've invited me to go and spend a night in this very experimental house, but unfortunately, I'm not allowed to tell you where it is. The project is a live-in experiment and will soon roll out to 100 homes across Bristol.
It's a gigantic data challenge, overseen by Professor Ian Craddock. So, that's one up there, then? Yes, that's one of the video sensors and we have more sensors in the kitchen. We have another video camera in the hall and some environmental sensors, and a few more in here. The house can generate 3-D video, body position, location and movement data from a special wearable.
How much data are you collecting, then? So, when we scale from this house to 100 houses in Bristol, in total we'll be storing over two petabytes of data for the project. Lord. So, on my computer at home, I don't even have a terabyte hard drive and you're talking about 2,000 of those. Yes. I mean, you know, the interaction of people with their environment and with each other is a very complicated and very variable thing and that's why it is a very challenging area, especially for data analysts, machine learners, to make sense of this big mass of data. I'm happy to find out that the research doesn't call for cameras in the bedroom or bathroom, but I do have to be left entirely on my own for the night.
The very first thing I'm going to do is pour myself a nice bloody big glass of wine. There we go. So, that nice glass of wine that I'm enjoying isn't completely guilt-free, because I've got to admit to it to the University of Bristol. I have to keep a log of everything I do, so that the data from my stay can be labelled with what I actually got up to. In this way, I'll be helping the process of machine learning, teaching the team's computers how to automatically monitor things like cooking, washing and sleeping, signals in the data of normal behaviour.
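That labelling step is the heart of supervised machine learning: pair sensor data with what was actually happening, and a classifier can learn to label new data itself. Here is a minimal sketch using a one-nearest-neighbour rule; the features (overall motion level, kitchen humidity) and the numbers are invented for illustration, not SPHERE's own.

```python
# Toy supervised learning: labelled sensor snapshots train a
# 1-nearest-neighbour classifier for household activities.
# Each snapshot is (motion level, kitchen humidity), both 0..1 -- invented features.
labelled = [
    ((0.9, 0.8), "cooking"),
    ((0.7, 0.2), "washing"),
    ((0.1, 0.1), "sleeping"),
    ((0.8, 0.7), "cooking"),
]

def classify(snapshot):
    """Label a new snapshot with the activity of its nearest
    labelled example (squared Euclidean distance)."""
    def dist(example):
        return sum((a - b) ** 2 for a, b in zip(snapshot, example[0]))
    return min(labelled, key=dist)[1]

print(classify((0.85, 0.75)))  # → cooking
print(classify((0.05, 0.15)))  # → sleeping
```

The real system uses far richer features and models, but the principle is the same: the honesty of Hannah's activity log is what gives the machine its ground truth.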
In the interests of science. 'I was also asked to do some things that are less expected.' Oh! I spilled my drink. 'The team need to learn to detect out-of-the-ordinary behaviour, too, 'if they want to, one day, spot specific signs of ill health.'
Right, I'm going to run this back to the kitchen now. It's a fairly strange experience. I think the temperature sensors, the humidity sensors, the motion sensors, even the wearable I don't have a problem with at all.
For some reason the body position is the one that's getting me. On the flipside, though, I would go absolutely crazy to have this data. This is the most wonderful... My goodness me. Everything you could learn about humans. It would be so brilliant. One thing I wanted to do was to do something completely crazy just to see if they can spot it in the data. Just to kind of test them. OK, ready? I can't believe this is my life now.
'Anyone can get the data from my stay online if they fancy trying to find 'my below-the-radar escape. 'The man in charge of machine learning, Professor Peter Flach, 'has the first look.' Between nine and ten, you were cooking.
Correct. Then you went into the lounge. You had your meal in the lounge. You know what? I ate on the sofa.
And you were watching crap television. I was watching crap television? I've been found out. We didn't switch the crap-television sensor on. That's not on here, but OK.
So, you were in the lounge sort of until 11:30. Correct. Then you went upstairs, there's a very clear signal here. And then, from then on, there isn't a lot of movement. I was in bed. So, I guess you were in bed.
Sleeping. Normal activities, like cooking or being in bed, are relatively straightforward to spot. But what about the weird stuff? This is yesterday, again. I can see it. I can see the moment. You can see the moment? I can see it, yeah. There's something happening here which is sort of rather quick.
You've been in the lounge for quite a while and then, suddenly, there's a brief move to the kitchen here and then very quick cleaning up in the lounge. I wasted good wine on this experiment. Good wine? Humans are extraordinarily good at spotting most patterns. For machines, the task is much more challenging, but, once they've learned what to look for, they can do it tirelessly. I suppose, in the long run, if you are going to scale this up to more houses, you can't have people sifting through these graphs trying to find...
I mean, you have to train computers to do them. You have to train computers to do them. One challenge that we are facing is that our models, our machine learning classifiers and models, need to be robust against changes in layout, changes in personal behaviour, changes in the number of people that are in a house.
And maybe we are wildly optimistic about what it can do, but we are in the process of trying to find out what it can do, at what cost, at what... ..invasion into privacy, and then we can have a discussion about whether, as a society, we want this or not. If this type of technology rolls out, machines will be modelling us in mathematical terms and intervening to help keep us healthy in real time - and that's completely new.
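Spotting out-of-the-ordinary events like that sudden dash to the kitchen can be caricatured as flagging deviations from a learned baseline. This sketch uses a crude z-score outlier test on room dwell times; the thresholds and the "normal" data are invented for illustration.

```python
# Toy anomaly detection: flag dwell times far from the learned baseline.
from statistics import mean, stdev

# Seconds spent in a room before moving on, from "normal" evenings (invented)
normal_dwell = [1800, 2400, 900, 2100, 1500, 2700]

def is_anomalous(dwell_seconds, history=normal_dwell, z_threshold=2.0):
    """Flag a dwell time more than z_threshold standard deviations
    from the historical mean -- a crude outlier test."""
    mu, sigma = mean(history), stdev(history)
    return abs(dwell_seconds - mu) / sigma > z_threshold

print(is_anomalous(20))    # a 20-second dash to the kitchen → True
print(is_anomalous(1600))  # an ordinary stretch in the lounge → False
```

A deployed system would model far more than dwell times, but the shape of the task is this: learn what normal looks like, then watch tirelessly for what isn't.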
It's true that our fascination with machine, or artificial, intelligence is as old as computers themselves. Claude Shannon and Alan Turing both explored the possibilities of machines that could learn. But it's only today, with torrents of data and pattern-finding algorithms, that intelligent machines will realise their potential. You'll hear a lot of heady stuff about what's going to happen when we mix big data with artificial intelligence. A lot of people, understandably, are very anxious about it.
But, for me, despite how much the world has changed, the core challenge is the same as it always was. It doesn't matter if you are William Farr in Victorian London trying to understand cholera or in one of Bristol's wired-up houses, all you're trying to do is to understand patterns in the data using the language of mathematics. And machines can certainly help us to find those patterns, but it takes us to find the meaning in them.
We should be worried about what we're going to do with these smart technologies, not about the smart technologies in themselves. They are in our hands to shape our future. They will not shape our futures for us. In the blink of an eye, we have gone from a world where data, information and knowledge belonged only to the privileged few, to what we have now, where it doesn't matter if you're trying to work out where to go on holiday next or researching the best cancer treatments. Data has really empowered all of us. Now, of course, there are some concerns about big corporations hoovering up the data traces that we all leave behind in our everyday lives, but I, for one, am an optimist as well as a rationalist and I think that if we can marshal together the power of data, then the future lies in the hands of the many and not just the few.
And that, for me, is the real joy of data. MUSIC: Good Vibrations by The Beach Boys