IDEA Webinar: Physically Protecting Sensitive Data
LARS VILHUBER: Welcome to J-PAL's IDEA Handbook webinar series. I'm Lars Vilhuber, an economist at Cornell University and co-chair of J-PAL's IDEA Initiative. These webinars accompanied the release of the handbook on using administrative data for research and evidence-based policy funded by the Alfred P. Sloan Foundation. The handbook provides a series of technical guides and case studies on how to make administrative data more accessible for research and policy. Following this talk, we will have a live question and answer session.
You have the opportunity to enter any questions during the talk in the chat box below this video. Today, I'm doing something unusual. I'm introducing myself. I'll be presenting the chapter titled Physically Protecting Sensitive Data, which I co-wrote with my coauthor Jim Shen, who's also on this call. It's one of the technical guides in the handbook.
And without further ado, I'll hop right into the presentation. So the physically protecting the data part is one of the key elements of the five safes approach to protecting confidential data. In combination, all those five safes make the protection stronger. We'll focus today on one of the key contributors to the safe settings, namely how to physically protect data.
The physical protection is one of the key parameters that data custodians can and do regularly influence. And it's a parameter that is very dependent on the current state of technology. So what we're going to be presenting today is a snapshot. It should be seen as an indicator of where to look for information, whom to ask for information. It should not be seen as a state of the technology because tomorrow some elements are probably going to be obsolete already. The types of threats and interactions with the other safes will depend on each particular circumstance, so we might comment on some of those as we go through.
The key focus group of people interested in this particular presentation should not be IT specialists. They know far more about this than Jim and I do. But knowledge of the technological possibilities is important when negotiating access to administrative data or making administrative data available that don't currently have some existing access mechanism.
It's pushing the frontier and knowing how to combine various safeguards that is the key outcome here. What do we mean actually by physically protecting the data? Well, this is primarily in contrast to statistically or computationally protecting the data. My co-author Ian Schmutte already presented on balancing privacy and data usability in November. That's chapter 5 in the handbook.
And chapter 6, which you heard about last week from Alex Wood and her co-authors, discussed how to design access with differential privacy in mind. There are, of course, other ways of protecting the data as well through legal agreements that we've heard about as part of the series as well. What do we consider to be physical in this context? IT security measures are the obvious first one. But also, building security measures are eminently physical. But also something that is somewhat more abstract, such as the choice of locations.
In this overall assessment, there are various types of threats. So we can distinguish external and internal threats. External threats are the ones that most people will have in mind when they think of hacking or security breaches. These are adversarial actors, the archetypical external hacker who wants access to the data. Examples might include the Equifax breach, or other things.
They typically exploit technical vulnerabilities and may conduct social engineering attacks. The unintentional breach is when information gets lost because it was left unsecured. What you should do in these kinds of situations is part of many trainings on how to handle confidential data.
We'll be able to talk a bit about how to mitigate some of those. But of course, unintentional is unintentional and can amount to leaving an unsecured laptop or sheets of paper somewhere where somebody else could access them. Internal threats, finally, are unauthorized uses, or unauthorized means of accessing the data, by individuals who do have the authority to access the data, but do so in an unauthorized way. One example, for instance, is the Facebook-Cambridge Analytica scandal: folks who had legitimate access to the data, but used it in ways that were not originally foreseen or probably should not have happened.
So putting it all into context, what we're trying to do is connect researchers with data, or for an administrator, we're trying to connect the data to some researchers. And we can focus on various aspects of that process. First of all, those data are going to be stored somewhere. That data storage needs to be secured.
That storage can be with the administrator, with the original data provider, it can be with the researcher, and it can be with a third party in the cloud. So in assessing the various protections here, you need to think about physical media if those are being used. That might be attached storage within a laptop, within the server.
It might be removable storage: USB drives or CDs that one might choose to use to share or transfer the data. One does need to think about storage in cloud services, third-party data providers, whether they're proprietary or open source, and how the data is stored there. Here the talk is often of encryption at rest, and encryption in transfer to these folks.
So here, when we're talking about data storage, we're talking about encryption at rest. Is that a feature of the data provider? Is that a feature of the physical media that's there? And we also need to incorporate within the security access overall some measure of reliability.
What happens if we lose access to one part of that? Are there backups? What if something catastrophic happens to one of these data access points that we will discuss? You will also need to think about whether the backups are secured, and whether access to them is controlled and remains secure throughout the entire process. So throughout all of this, we will typically come back to encryption, and to think about it right away, the key message here is always encrypt. But we'll get back to that in a moment. In order to get data to researchers, one needs to transfer some element of the data. Often that has meant sending the data to the researcher.
But at a minimum, if you're working with a third-party cloud provider, you're also transmitting the data to that cloud. And in some fashion, you're also then transmitting the data from that cloud provider to the researcher. Which direction this goes depends on the setup, and there are many ways of setting that up. In the old days, that might have been sending around tapes, CDs, USB drives, something of the kind: a removable drive that can be used to send over the data.
Nowadays, it's more typical to encounter electronic transfers through encrypted network protocols or the aforementioned cloud services. Again, the key message here is never send the data unencrypted over any of those channels. Even if you're writing to a CD, it should be encrypted. And our chapter provides a number of links and examples of software that can be used to do that encryption. Some of it is built into the operating system.
Some of it is available for free in other ways. Now, one way to not transfer data is to give researchers access to the location where the data actually happens to be, at the data provider. So we'll call that data access.
Again, here we need to think about various elements of that path that need to be protected. There is network security. There are virtual private networks that can be used to exchange data over public networks as if they were directly connected on a private network. And of course, it goes without saying, we can also set up a private network. That was done much more frequently in the past. It is much more rarely seen nowadays. One can use very simple measures such as restricting which machines can access a particular network.
That can go from simple IP address restrictions, which are not fail-safe either, to more complex mechanisms of actually identifying some of the endpoints that are accessing the data. Again, here encryption is key, in particular, for virtual private networks. They fashion an encrypted channel over which the information is transmitted, whether we're talking about data transfers or access.
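As a rough illustration of the simple IP address restriction just mentioned, here is a minimal sketch in Python using only the standard library. The network ranges are documentation-reserved placeholders and the function name `is_allowed` is hypothetical; a real deployment would normally enforce this in a firewall or VPN gateway rather than in application code, and, as noted above, such a check is not fail-safe on its own.

```python
import ipaddress

# Hypothetical allowlist. These are documentation-reserved ranges
# (RFC 5737), standing in for, say, a campus network and one office PC.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside one of the allowed networks."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Anything that is not a valid IP address is rejected outright.
        return False
    return any(address in network for network in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for candidate in ["192.0.2.45", "203.0.113.9", "not-an-ip"]:
        print(candidate, is_allowed(candidate))
```

An address check like this only says where a connection appears to come from, which is why it is usually combined with the stronger endpoint identification and encrypted channels discussed here.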
We get to access in a moment. But there are minimum security requirements for all of these. When we mentioned data storage, full disk encryption is available either through software or through hardware. Many modern disk drives have the ability to have encryption enabled at the disk drive level regardless of what happens. There are memory sticks that allow you to encrypt.
Every time that the disk is unplugged, the data is encrypted, and unless you enter a password, it does not get decrypted at the destination. You can also encrypt individual files or a collection of files. Those might be virtual drives that show up as a file on disk. And once mounted through a particular piece of software, they become accessible. And once you've finished for the day or after some time has elapsed, those virtual drives lock down again. Again, there are several open source solutions for that.
Several commercial providers also provide mechanisms to do so. The next level, when you give researchers access to places where the data can be accessed, is to think about system isolation. The idea being that the researcher can access only what is allowed and trusted.
You might have a contract to access a particular data file. But not all the data files that a data provider might give to the researcher, or might provide access to. So you want to control for that. One simple way of doing that is by setting up independent systems. Researcher systems that are separate from the administrative systems that may be present for instance in an agency.
But the key element here is that users' access is isolated from each other and from other users on the system. So a researcher probably doesn't need access to all the administrative data systems, suggesting that you might want to set up a separate system. But even within your own systems, you might already have as an administrator various ways of isolating users from each other. And those mechanisms may be leveraged for researchers as well. Technical means of achieving that, again, are often built into the systems, but require experts to properly set them up and avoid that things leak around the edges.
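One of the built-in mechanisms for isolating users from each other is ordinary file permissions. The sketch below, which assumes a POSIX system such as Linux, restricts a file so that only its owner can read or write it and then checks the result. The helper names `restrict_to_owner` and `is_owner_only` are made up for this illustration, and the file holds placeholder text, not real data.

```python
import os
import stat
import tempfile


def restrict_to_owner(path: str) -> None:
    """Allow read and write for the file owner only (mode 0o600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


def is_owner_only(path: str) -> bool:
    """Return True if neither group nor others have any permission."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Create a throwaway file to demonstrate on.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"placeholder, not real data\n")
    demo_path = handle.name

os.chmod(demo_path, 0o644)          # start out readable by everyone
before = is_owner_only(demo_path)   # group/others can still read

restrict_to_owner(demo_path)
after = is_owner_only(demo_path)    # now limited to the owner

os.remove(demo_path)
print(before, after)
```

This is only the simplest layer. Permissions of this kind do not protect against administrators or against a stolen disk, which is where expert configuration and encryption come in.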
Data access controls are built into the file system. Every time that you've been prompted by your computer with "are you sure that you want to do this?", or asked to do so as administrator, that is an example of an access control embedded in the operating system. But there are much more sophisticated methods that can be set up. But this might also be a physical system isolation.
It used to be the case, for instance, that one way to spread out secure systems across, say, university offices was to have standalone computers sitting in a locked back-office room that a researcher walks into. And that system is the ultimate physical isolation. If the data cannot be transmitted over the network, you're fine.
You might still want to encrypt, say, the hard drives that are used in the system. And you still need to get data onto it using some of the mechanisms we've described before. In modern systems, you often scale out by having virtual systems. And there is a whole panoply of various tools to do so, whether those are called virtual machines, or virtual desktop infrastructure, or Docker, or chroot, et cetera.
They're present in many Windows and Linux systems. Finally, we're talking about connecting the researchers over all these various paths to a system. We're talking about virtual systems at the data provider, or maybe a physical computer sitting at the data provider. Then we may often think about remote desktop systems.
That's one of the more user-convenient ways to give researchers access to these. That does mean that we're talking primarily about accessing Windows and Linux computers. I have seen very few remote desktop solutions at data providers that actually allow you to access Mac computers.
But the researcher's computer can be a Mac, because it's using software to connect to another system. The software will be called remote desktop, or VDI, or Citrix, or one of the many implementations of this. Again, there are many options out there that range from free to quite expensive. And it all depends on what kind of knowledge you already have and what kind of funds you have to support these kinds of things.
And in our chapter, we point to a few examples. One of the key advantages of the remote desktop, as I pointed out when showing the graph originally, is that you don't have to worry about transferring data to researchers. You're only sending desktop pictures, more or less, to researchers. You also don't have to worry about storing data at any researcher site because everything stays with the data provider. The downside is that there's obviously a cost associated with it and there may be network issues.
It may not be convenient to connect halfway around the world when there is network lag between every key touch and every key reaction seen on the screen. The internet goes down, which we hope it won't do during this presentation, but certainly it happens. And at that point in time, researchers no longer have access for reasons that are outside of their control. In some instances, there are specialized thin clients: computers that are optimized for remote desktop. They're literally quite small. Literally the only thing that they do is connect to another computer and they do so in a variety of ways.
This little box at the bottom here will come up again in some of our other controls. It's used by the French remote access data system. I should mention also that there's another way to connect to remote systems, which is remote process or query systems. Some academic researchers might be used to these when they have a supercomputer on campus or things like that. On the supercomputer, it's meant to scale out.
But it's also possible to use such systems to control the access by researchers to the system. It makes the life of administrators somewhat easier because only code is sent. But it has the downside that it's really hard for the researcher to do interactive work. Researchers just might not be used to this kind of processing as well, so there's a learning curve.
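To make the idea of a remote process system a little more concrete, here is a toy sketch of a submission queue in Python. Everything in it is hypothetical: the class `JobQueue`, the per-user limit as the chosen policy, and the file names. Real systems use dedicated schedulers, so this only illustrates the principle that researchers send code, the code waits in line, and the administrator sets the rules of the line.

```python
from collections import deque
from typing import Deque, Optional, Tuple

Job = Tuple[str, str]  # (user, script name)


class JobQueue:
    """First-in, first-out queue with a cap on pending jobs per user."""

    def __init__(self, max_jobs_per_user: int) -> None:
        self.max_jobs_per_user = max_jobs_per_user
        self._pending: Deque[Job] = deque()

    def submit(self, user: str, script: str) -> bool:
        """Queue a script; refuse if the user is already at the limit."""
        queued = sum(1 for owner, _ in self._pending if owner == user)
        if queued >= self.max_jobs_per_user:
            return False
        self._pending.append((user, script))
        return True

    def position(self, user: str, script: str) -> Optional[int]:
        """Return the 1-based place in line, or None if not queued."""
        for index, job in enumerate(self._pending, start=1):
            if job == (user, script):
                return index
        return None

    def next_job(self) -> Optional[Job]:
        """Hand the oldest pending job to whoever executes it."""
        return self._pending.popleft() if self._pending else None


if __name__ == "__main__":
    queue = JobQueue(max_jobs_per_user=1)
    print(queue.submit("alice", "analysis.do"))   # accepted
    print(queue.submit("alice", "second.do"))     # refused, over the limit
    print(queue.submit("bob", "model.R"))         # accepted
    print(queue.position("bob", "model.R"))       # place in line
```

Changing `max_jobs_per_user`, or replacing the first-in, first-out order with something else, is the kind of parameter an administrator can tune.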
It is one way to implement job limits to balance many users across the limited resources that a data provider might have that is somewhat easier to manage than a live interactive desktop session. And if you've run into some of these before, you will know that a Remote Process system also has a queue. And you can find out where you are in the queue. You can manipulate parameters of the queue to be potentially more equitable, or have many more people complain about not having the immediate response. That's your choice. And finally, researchers don't sit in a void.
They sit somewhere when they work on these systems. So, in the simplest instance, there's a box around the user. That's called a room. There may be walls that need to be constrained or controlled in some fashion.
The federal research data centers, for instance, have very good and precise descriptions of how the physical room is to be controlled, how high or how secure the brick walls are, what kind of windows there are, and things like that. And those apply just as much to the data provider's storage location. In most circumstances, of course, the data provider will already have some secure access. But it's not out of the question to require that researchers also have some constraints on where they can access the data from. That might range anywhere from "we just need to know you're in your faculty office" to "you need to cross campus, walk into a locked room, leave your laptop outside, leave your phone outside," et cetera.
Specifying the type of physical access is one of the key tools often used. So this may involve a restricted-access, authorized-persons-only room in a location.
That location might be used at the same time to store the data for the local users even though an administrator might have yet another room within this locked environment. To do that, you can specify various other things. There is an interesting Scottish system that has been used in the UK called the SafePod, where essentially this whole gadget that you see at the bottom right is rolled into a library, is hooked up and is considered secure enough for access to the data. Because it's entirely controlled in its infrastructure by the data provider who then sends this out there to the researchers. So there is a huge variety of this.
Again, like I said, it can range anywhere from "it just needs to be a faculty office" to certain data providers who explicitly do not care about where you access the data, relying on other features of the protection mechanisms. Physical access also means that there may be physical access cards, in particular, to access rooms. But going back to the little box at the bottom, you might also need a card that you stick into a computer or a laptop in order to unlock access to the remote desktop client that then connects to the data. This gives you a measure to lock down who can at any point in time access it, so you can't have multiple simultaneous logins, or things like that as well. And you might have biometric authentication as well. Again, the French box at the bottom shows up because it has an integrated fingerprint reader.
You can certainly configure secure laptops as well that have fingerprint readers in addition to card readers, in addition to being locked down with secure hard drives, and all these other kinds of things. So all of these technical features can be combined to create one facet of the secure access protocols that allow researchers to access sensitive data that you might be interested in. They're combined with all the other tools that I mentioned earlier the other four safes to create an overall secure environment. Now, all of these options are bewildering potentially, and it helps to have a look at some examples. That's one of the many reasons why you'll find in the case studies some very detailed information as much as it can be on examples of the typical access mechanisms that exist out there.
So one of the many combinations that we see out there is, for instance, remote execution. We see that in some limited circumstances where data providers will provide a tabulation system, or a remote submission system that does some simple tasks that will satisfy some users, but not all. But is a first step to providing broader access to things that cannot simply be posted as simple reports on a website, or something like that.
Such remote execution systems may rely on automated services. There are interesting examples in Canada and Norway that we mention in our chapter. Or it may be that the code is actually sent to a queuing system where staff then execute it by hand and look at the output, as, for instance, happens in the German case. One of the frequently encountered secure access mechanisms is the physical data enclave. Sometimes you might encounter them as being called research data centers. Those might be at universities, or they might be within the data provider's offices themselves, where there is a researcher-accessible area called the research data center, where researchers walk in, get controlled on entry, et cetera.
That might be as simple as a room with a person controlling the access to it within the administration. In some circumstances, the data custodian might maintain the infrastructure and the control. In some cases, that's delegated to users, or researchers, or libraries, or things like that. Becoming more popular, at least as far as I can tell, are virtual data enclaves. The progress made in the past decades about how to secure access over the internet has been tremendous. It has also become more convenient because the internet has gotten faster.
And so there has been a move to allow researchers to remotely access and analyze data. Typically such systems replaced a pre-existing transfer system where the data would have been sent out to researchers' offices. This avoids all the hassle of having to track all the CDs that you sent out, controlling when they get back, and monitoring that. It's also easier to scale out. It's much easier to create a new virtual data enclave, or at least an additional node within the virtual data enclave, than to build an entire secure physical data enclave.
And finally, something that we still encounter quite often is researcher-provided infrastructure, where the researcher obtains custody of the data and then is committed to maintaining a secure local infrastructure for storage and analysis. We have examples for all of those in the handbook in the case studies. And all of them vary in how they have combined these tools. But in terms of the physical access modalities, these are the kinds of typical access mechanisms we're likely to encounter. So Jim and I originally sat down and tried to capture the key aspects of these various physical security mechanisms that we could sort of identify to highlight differences and similarities across the various case studies.
And we ultimately consolidated on five aspects of physical security. One, something that has a lot to do with user or researcher satisfaction here, is the level of researcher agency over analysis computers. What can a researcher actually do on the analysis computers that they have access to? The second one then is, where are those analysis computers? And typically, that also means where is the data at the same time? When they are separated, where are the access computers located? And what kind of security do such access locations have? Finally, we wanted to have a look at the range of analysis methods that are available to researchers on these systems.
And that, again, relates a lot to how satisfied researchers might be, what kind of collaboration between administrators and researchers can be envisioned to do this. For each of these aspects, we try to nail down three categories, three levels into which these can be classified. The coding is weakly aligned with how restrictive it might be on the researcher, and how much control the data provider exerts. So some of these are more along the lines of how much control somebody exerts, and some of them are more along the line of how restrictive it might be on some dimension. So researcher agency. We can consider a system where the researcher has very low agency.
They are limited to the software that data provider allows. They have no influence on that software or on the features directly. There might be medium agency where the researcher might request that things can be changed on their behalf on the system, or that some configuration options are available. And high agency is where the researcher may have a lot of control over what they can install on these computers, what kind of software they have, they might have administrative privileges on these computers, et cetera. Think of high agency. This is the researcher laptop.
And low agency, this is a Remote Process system. That is what it is, and it can do what it was designed to, and the researcher cannot really influence that. The location of the analysis computers then is one of the other tradeoffs. You can really see from some of the mechanisms that I've mentioned earlier where this might lead us. One is that the analysis computer and data remain with the data provider, giving the data provider the highest control over this and the researcher the lowest.
A third party might act as a data custodian. This gives a bit more flexibility. It doesn't fit nicely into an all-or-nothing kind of analysis, and we see this quite often, so we wanted at least the ability to highlight this.
And finally, the location might be with a researcher. As I said earlier, this might be the research computer sitting on the researcher's desk. But it might also be a research institution on campus that is run by the university employing the researcher rather than by some other third party. This is not always entirely clear as to how this is aligned.
For instance, there might be a third party, which provides access on campus. So its location is close to the researcher. But it's not run by something that the researcher can influence.
Access computers, then: as on the previous slide, when the analysis computer is moved on campus, the access computer may be exactly the same computer that the analysis is running on. But when the researcher needs to walk to a separate access location, or needs to use a different device, then there's separation between the two. The location of the access computers might be remote from the researcher. They might need to travel to the data custodian. So this is where the analysis computer and the access computer coincide, not at the researcher, but at the administrative data provider location.
Again, there might be a third party that manages these access computers. And the researcher might control the access computer, but not the analysis computer. The security of the access computers is what we're trying to capture here with the different mechanisms we might have. High security means that there are a lot of controls layered on top of how you can actually get to the access computer. This might be a scenario of secure rooms with hardening beyond standard locked doors.
There is a medium level of security. You might have locations that are restricted to approved researchers, such as a locked faculty office. And there's low security, which might mean that there are very few controls on the location of those access computers and where they are. This may be, for instance, when much of the restriction comes through behavioral restrictions in a data use agreement, with, for instance, no control at all over the access computers: the "any laptop will serve to use the software" scenario.
Finally, the range of analysis methods: they might be highly restricted. Think of a custom tabulation system, where you're limited to the tabulations that it can do. It can only do tabulations.
And while you can be extremely tricky and do some sort of OLS using only tabulations, you're going to be limited by what you can do on these kinds of things. Sometimes there are some small restrictions. You might have for instance, the ability to run most Stata code, or something like that. But you may not be able to run those commands that would allow you to see individual records. We call that limited restrictions.
And finally, unrestricted is where there is at least the potential to run any analysis on the analysis computer itself. Note that this is different from the agency over the computer: agency has more to do with changing what can be done on these computers, whereas the analysis methods available say something about what could potentially be run on these in the first place. OK. Some examples. In particular, an agency that deploys three different combinations of these systems is the research data center of the German Federal Employment Agency. They have on-site access, they have a job submission system, and they provide scientific use files.
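For readers who want to apply the five aspects to their own setting, the three levels of each can be written down as a small data structure. The sketch below is only one possible encoding. The class and field names are invented for this illustration, and the example profile is a hypothetical tabulation service, not a classification taken from the handbook's summary table.

```python
from dataclasses import dataclass
from enum import Enum


class Agency(Enum):
    LOW = "low"        # limited to what the data provider allows
    MEDIUM = "medium"  # researcher can request changes
    HIGH = "high"      # researcher controls installs and settings


class Location(Enum):
    DATA_PROVIDER = "data provider"
    THIRD_PARTY = "third party"
    RESEARCHER = "researcher"


class AccessSecurity(Enum):
    LOW = "low"        # few controls on where access happens
    MEDIUM = "medium"  # e.g. a locked faculty office
    HIGH = "high"      # hardened secure rooms


class AnalysisMethods(Enum):
    HIGHLY_RESTRICTED = "highly restricted"        # e.g. tabulations only
    LIMITED_RESTRICTIONS = "limited restrictions"  # most code, some commands blocked
    UNRESTRICTED = "unrestricted"


@dataclass(frozen=True)
class AccessProfile:
    """One access mechanism described along the five aspects."""
    name: str
    researcher_agency: Agency
    analysis_computer_location: Location
    access_computer_location: Location
    access_security: AccessSecurity
    analysis_methods: AnalysisMethods


# Hypothetical example for illustration only.
tabulation_service = AccessProfile(
    name="online tabulation service (hypothetical)",
    researcher_agency=Agency.LOW,
    analysis_computer_location=Location.DATA_PROVIDER,
    access_computer_location=Location.RESEARCHER,
    access_security=AccessSecurity.LOW,
    analysis_methods=AnalysisMethods.HIGHLY_RESTRICTED,
)

if __name__ == "__main__":
    print(tabulation_service)
```

Writing a proposed mechanism down in this form makes it easier to compare it side by side with the case studies and to see which dimension is being relaxed.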
And these provide various different levels of access that balance out the detail that's in the data and the kind of analysis that can be done with various controls on the physical access modalities. So the most detailed data is only available through on-site access. Researcher agency in this context is relatively low. The data location may be with a third party.
It's not with the researcher. The access security is high. There are various protocols that are followed in order to access the data in the first place. But there are relatively few restrictions on what analysis can be done. But it is limited to the extent that on these access computers, you can only run, say, Stata and R in some limited fashion. On the other side of that spectrum, they also offer scientific use files.
They provide much less detail. And so they feel comfortable relaxing some of these constraints. The data location is with the researcher.
There is some access security imposed on this because these are campus files. They need to be on university campuses on some secured computer. But typically, this means that any type of analysis can be run on these files that the data would support. Other examples include, for instance, the Ohio State system.
In their case, they send data to the researchers based on a data use agreement. In this case, the analysis methods are unrestricted. The access security is at a lower level than what, say, the German system has. It doesn't mean that it has no security because there are still restrictions and computer security controls that the user, the researcher, needs to maintain.
But it is, to some extent, a different combination of these factors, combined with data use agreements and other methods of restricting what can be done with these data. And finally, an example from New Brunswick that is one of our case studies as well. In their case, the university runs a data access system for other researchers within the Province on behalf of the Province of New Brunswick. This goes through secure rooms with access cards. You've seen these pictures before; those were the samples that we had there.
So the university serves as a data custodian, not just for their own researchers. In this case, they serve as a third party as well for other university researchers and for other researchers outside of academia, for instance, from other government agencies.
There are other examples in the handbook. Each chapter has a discussion of these access methods, in particular, of what they consider to be safe settings under the five safes framework. There's a summary table that Jim and I put together. That is our take on classifying them into a system, and you can see it in the chapter itself. So ultimately, what do data providers and researchers need to take away from this chapter? First, there are many solutions, not a single best solution, that balance high security with relatively broad accessibility and convenience for researchers.
Our case studies have at least two of those: the RDC at the IAB, the Federal Employment Agency, and the NB-IRDT are both solutions that have very high security with broad accessibility through some mechanism, where researchers go into a research data center. There are many examples of relatively simple but effective data access mechanisms, as implemented, say, by the Ohio data system and by the Stanford San Francisco team that also wrote a chapter for our case studies. Data providers can allow researchers some flexibility in some aspects while maintaining the overall security of the system. The IAB showcases how one can create multiple levels of security in the files that allow the agency itself to be comfortable putting the data into a variety of environments that are commensurate with the sensitivity of the data and the trust that they have in the researchers.
I should note that the Canadian RDC network is also currently working on implementing a much more fine-grained access mechanism that goes through many of these details and is explicitly framed within the five safes framework as well. Necessary aspects of data access mechanisms for researchers should always be considered in the context of all the five safe dimensions that we illustrate in the handbook. And that is why each of the case studies provides a detailed explanation along those five safe dimensions. At least in our opinion, one of the key indicators here of flexibility is the comfort level that an agency or data provider might have for enforcing data use agreements in the first place. This interacts with the trust that they may have in the various individuals signing those data use agreements, in the institutions that employ those researchers, et cetera. All of these physical access mechanisms have to be put into the context of the particular enforcement mechanism through the data use agreement.
If you consider that the most lenient data use agreement is public use and open data access, where you simply download some of the data, that data is typically the least sensitive because it has had many other mechanisms, for instance statistical disclosure avoidance, applied to it. On the other hand, the most detailed information on health or personal identifiers of people needs to be physically secured, statistically secured, and have a data use agreement that's also much more solid. So that's just an indication of how all of these are combined. So with that, my overview of our chapter ends. There's a lot more detail in the chapter itself.
And again, much of the detail comes from the case studies that are in the handbook that we link to from our chapter for explicit examples of how these can be implemented. One of the key takeaways here is, if you are a data provider or a researcher who wants to set up one of these access mechanisms when there is no such access mechanism originally there, read through the chapter and get some idea of what the possibilities are. And then go and talk to an IT security specialist, an IT specialist either at the university or at the provider, and get them in touch. And then be able to ask meaningful questions about: can we relax this dimension, what is the reputational cost, what is the risk cost, and what can be compensated for by other mechanisms such as data use agreements or statistical protections.