Hi, I'm really excited to be here at Microsoft Research Summit 2021. My name is Dawn Song; I'm a professor in computer science at UC Berkeley and also the founder of Oasis Labs. Today I will talk about "Towards Building a Responsible Data Economy."

As we all know, data is a key driver of the modern economy and the lifeblood of AI and machine learning. A huge amount of data is being collected every day; it is estimated that a significant portion of the EU's GDP comes from value derived from personal data, and the global data economy is growing at an exponential speed. However, a lot of this data is sensitive, and both individuals and organizations are facing unprecedented challenges in how sensitive data is being used.

First, individuals have lost control over how their data is being used. Oftentimes users' data is misused or sold without their knowledge and consent, and a lot of this data is sold under the premise that it has been anonymized. However, a large body of research has shown that anonymization doesn't adequately protect users' privacy. What you see on the screen is an example: a case study by The New York Times showing that, from anonymized mobile phone location datasets they were able to obtain, they could track the location of a Secret Service agent traveling with former President Trump, and hence also reveal information about the location and trajectory of the former president.

Data breaches also continue to plague businesses; many large-scale attacks have each stolen sensitive information about hundreds of millions, or even billions, of users. It is also becoming much more expensive and cumbersome for businesses to comply with privacy regulations such as CCPA and GDPR. And the bigger issue is that a lot of valuable data is locked up in data silos and cannot be utilized due to privacy concerns.

So today we are already facing huge challenges in how to utilize sensitive data, to the extent that these problems, if not addressed, will significantly hinder societal progress and undermine human values and fundamental rights. Thus we have an urgent need for a framework for a responsible data economy.

What do I mean by a responsible data economy? What are its goals and principles? First, we need to establish and enforce data rights; data rights form the foundation of a data economy and prevent misuse and abuse of data. We would also like to enable fair distribution of the value created from data, so that users can gain sufficient benefits from their data. And ultimately we would like to enable efficient data use to maximize social welfare and economic efficiency, so that data can be utilized in the best interests of individuals, organizations, governments, and society.

There are many unique challenges and complexities due to the nature of data. First, there is a natural tension between utility and privacy: on one hand we want to utilize data to extract value from it; on the other hand we need to protect the privacy of users' data. Data also has a unique property called non-rivalry, which distinguishes it from physical objects. With a physical object, for example, if I hold an apple in my hand then nobody else can be holding the same apple at the same time; this is called rivalry. Data is different: data is non-rival, meaning that if I hold a copy of a piece of data, somebody else can hold a copy of the same piece of data, and
they can utilize the data just as well as I can. Hence, due to this non-rivalry property, once data is given out and copied, one can never really take it back. Data also has properties such as data dependency and data externality: for example, processed data can reveal information about the original data, because the processed data has been computed from the original data; and data about one entity can reveal information about another entity. Because of these unique properties of data, we cannot simply copy concepts and methods from the analog world. Instead, we need to develop a new framework for a responsible data economy, and this requires a combination of technical and non-technical solutions. In particular, a framework for a responsible data economy needs at least three components: technical solutions, incentive models, and legal frameworks.

So first, let me talk about the technical solutions. Traditional solutions are insufficient. Traditionally, data encryption has mainly been used to protect data at rest or in transit, and data is not really protected while in use. In fact, data is either not used at all or, once used, is often copied, making it difficult to control future usage of the data. And, as I mentioned earlier, anonymizing data is often insufficient to protect users' data privacy. So we need to develop new technologies, in particular to protect data while it is in use. To achieve this, we need to address two main goals: first, we need to enable controlled use of data, so that data can be used without having to copy the raw data; second, we need to protect the computation outputs from leaking sensitive information about the original inputs.

Luckily, in the research community there have been huge advancements in developing new technologies to address these issues; I call these responsible data technologies. Given the complexity of the issue and its different aspects, we need to bring different component technologies together to enable responsible data use. In particular, here I list four example component technologies. Secure computing, such as using secure hardware, secure multi-party computation (MPC), and fully homomorphic encryption, keeps data confidential even while it is in use by an application. Differential privacy helps protect the computation output from leaking sensitive information. Federated learning utilizes secure computing and differential privacy to let data stay on the data owner's devices, without ever leaving those devices, while enabling distributed data analytics and model training across devices. And finally, a distributed ledger helps provide an immutable log of users' data rights and of how data has been utilized, to ensure that data usage is compliant.

So first, let me talk briefly about the first component technology: secure computing. The goal of secure computing is to protect the computation process from leaking sensitive information. There are different types of technologies for secure computing, including secure hardware as well as cryptography-based techniques such as secure multi-party computation and fully homomorphic encryption. The challenge for cryptography-based approaches is that the performance overheads can be very high, often orders of magnitude higher than native computation, and hence they can only support special-purpose computation.
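To give a flavor of what the cryptography-based approaches do, here is a toy sketch of additive secret sharing, one of the basic building blocks behind secure multi-party computation. It is purely illustrative (the party names and values are made up, and a real MPC protocol must also handle multiplication, malicious behavior, and more): three parties split their private inputs into random shares, exchange shares, and only the aggregate sum is ever reconstructed.

```python
import secrets

MODULUS = 2**61 - 1  # arithmetic is done modulo a fixed prime

def share(value: int, n_parties: int = 3) -> list[int]:
    """Split `value` into additive shares that individually look random."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]

def reconstruct(shares: list[int]) -> int:
    """Only the sum of all shares reveals the underlying value."""
    return sum(shares) % MODULUS

# Each party secret-shares its private input with the others (hypothetical values).
inputs = {"alice": 42, "bob": 7, "carol": 13}
all_shares = {name: share(v) for name, v in inputs.items()}

# Party i locally adds up the i-th share it received from everyone ...
local_sums = [sum(all_shares[name][i] for name in inputs) % MODULUS
              for i in range(3)]

# ... and only the aggregate (42 + 7 + 13 = 62) is reconstructed, never the inputs.
print(reconstruct(local_sums))  # -> 62
```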
In recent years, there have been huge advancements in improving the performance overhead of these cryptography-based methods, such as homomorphic encryption and multi-party computation. However, there is still a huge gap today between the performance these cryptographic approaches enable and what is really needed in practice for large-scale workloads. For example, using homomorphic encryption or MPC methods to train machine learning models on certain real-world datasets can still be orders of magnitude slower than native computation; training on ImageNet with VGG16, for instance, would take many years with these types of approaches.

Another type of approach is to use secure hardware. The advantage of secure hardware is that the performance overhead is much lower, oftentimes close to native performance, and hence it can support general-purpose computation. Secure hardware usually utilizes a combination of hardware and software mechanisms to provide an isolated execution environment, often called a secure enclave. When an application is running inside the secure enclave, nothing outside, including the operating system and other applications, is able to modify the data or the execution inside the enclave, or learn information about the data and execution running inside it. This provides integrity and confidentiality for computation in the secure enclave. A secure enclave also provides a capability called remote attestation, which enables a remote verifier to verify the initial state of the enclave and the program running inside it, and hence provides remote verification of the computation inside the enclave. Given these strong security capabilities, secure enclaves can serve as a cornerstone security primitive and enable a platform for building new security applications that couldn't be built otherwise with the same practical performance.

Given its importance, different hardware manufacturers have been developing different solutions for secure hardware; however, these are all closed source. To truly enable trustworthy secure enclaves, I strongly believe we need an open-source solution for secure hardware, which will help provide the security assurance needed for a truly trustworthy secure enclave. Towards this goal, at Berkeley we have been developing a project called Keystone, which provides the first open framework for building customizable secure hardware, also called trusted execution environments (TEEs). Keystone is built on top of RISC-V, an open instruction set architecture, and it utilizes the memory isolation capabilities provided by RISC-V. It enables a modular and flexible design with a minimal trusted computing base, which can better facilitate formal verification. In our experiments we show that with Keystone the performance overhead can be close to native performance, for example on various machine learning and other benchmarks. Today you can try out Keystone on the latest RISC-V boards, as well as on FPGAs and in emulation, and Keystone is now supported by the Confidential Computing Consortium, whose global members include leading technology companies and other entities.
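To make the remote-attestation step a bit more concrete, here is a highly simplified sketch. It is not the Keystone API or any vendor SDK; a keyed HMAC merely stands in for the hardware-rooted signature, and all names are hypothetical. The flow it illustrates: the hardware measures the enclave's initial code, signs the measurement, and a remote verifier checks both before provisioning a secret into the enclave.

```python
import hashlib
import hmac
import secrets

# Stand-in for a key fused into the hardware; a real verifier would instead check
# a signature chained to the manufacturer's root of trust.
DEVICE_KEY = secrets.token_bytes(32)

def measure(enclave_binary: bytes) -> bytes:
    """The 'hardware' hashes the enclave's initial code and data."""
    return hashlib.sha256(enclave_binary).digest()

def attest(enclave_binary: bytes) -> tuple[bytes, bytes]:
    """Produce an attestation report: (measurement, signature over it)."""
    m = measure(enclave_binary)
    return m, hmac.new(DEVICE_KEY, m, hashlib.sha256).digest()

def verify(report: tuple[bytes, bytes], expected_measurement: bytes) -> bool:
    """Remote verifier: is the report authentic, and is it the expected code?"""
    m, sig = report
    authentic = hmac.compare_digest(
        hmac.new(DEVICE_KEY, m, hashlib.sha256).digest(), sig)
    return authentic and hmac.compare_digest(m, expected_measurement)

enclave_code = b"...enclave program image..."      # hypothetical binary
expected = hashlib.sha256(enclave_code).digest()   # published by the developer

report = attest(enclave_code)
if verify(report, expected):
    print("attestation ok: safe to send the decryption key into the enclave")
```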
I strongly believe that in 10 years, secure computing will become commonplace; in 10 years, most chips will have secure enclave and secure execution environment capabilities; in 10 years, most computation will utilize secure enclaves; and in 10 years, hardware accelerators for cryptographic methods for secure computation will also be widely available.

So that's the first component technology, secure computing. Now let me briefly talk about the second component technology: differential privacy. To motivate why we need differential privacy, let me first illustrate with some recent case studies that we have done. In our recent work, in collaboration with researchers from Google and other places, we set out to explore the question: do your neural networks remember training data? And if yes, can an attacker exploit this to extract secrets from the original training data simply by querying the learned models?

In this case study we focus on language models. A language model is trained over a corpus of text, and when given a sequence of words or characters, it predicts the next word or character. Language models have seen wide application and adoption, for example in Smart Compose and so on. The question here is whether attackers, simply by querying the learned model, can actually learn sensitive information from the original training data. This slide shows a fun example in which the attacker, by querying the learned model, is able to identify and learn sensitive information about the revolution.

In our case study, we trained a language model on an email dataset called the Enron email dataset, which actually contained real users' credit card and social security numbers. Our work showed that an attacker, without knowing any details of the model, including its architecture and parameters, is able, simply by querying the learned model, to automatically extract users' credit card and social security numbers that were embedded in the original training data. This example illustrates that as we train machine learning models, it is important to take precautions to protect the privacy of users' data. In our work we also proposed a measure called exposure to quantify the degree of memorization, and it has been utilized, for example, in Google's Smart Compose to help improve privacy protection. In more recent work we extended this study to larger-scale language models such as GPT-2, and we developed new attacks illustrating that even for these large-scale language models similar issues appear: attackers, just by querying the GPT-2 model, are able to extract personally identifiable information that was in the training dataset.

So what can we do in this case? How can we train better machine learning models that provide better protection for users' data privacy? In our work we show that in certain cases we can utilize solutions such as differential privacy to help address the issue. For example, in our earlier work we showed that by training a differentially private language model instead of a vanilla language model, in certain cases we can still maintain similar utility for the model while significantly enhancing the privacy protection for users' data in these trained models. Differential privacy is a formal notion of privacy that helps protect users' data when data analytics results and machine learning models are computed from users' original data.
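As a rough illustration of how such a differentially private model can be trained, below is a minimal sketch in the style of DP-SGD (per-example gradient clipping plus calibrated Gaussian noise). This is a standard recipe rather than necessarily the exact method used in the work described above, and the linear-regression data and hyperparameters are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and model: linear regression with a single weight vector `w`.
X = rng.normal(size=(256, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=256)
w = np.zeros(5)

CLIP = 1.0          # per-example gradient clipping norm
NOISE_MULT = 1.1    # noise multiplier; drives the strength of the privacy guarantee
LR = 0.1
BATCH = 32

for step in range(200):
    idx = rng.choice(len(X), size=BATCH, replace=False)
    # Per-example gradients of the squared loss.
    grads = 2 * (X[idx] @ w - y[idx])[:, None] * X[idx]
    # Clip each example's gradient to norm at most CLIP.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads * np.minimum(1.0, CLIP / np.maximum(norms, 1e-12))
    # Sum, add Gaussian noise calibrated to the clipping norm, then average.
    noisy = grads.sum(axis=0) + rng.normal(scale=NOISE_MULT * CLIP, size=5)
    w -= LR * noisy / BATCH

print(w)  # close to the true weights; the overall epsilon depends on privacy accounting
```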
However, deploying differential privacy in practice has many challenges. In our work we have developed automated techniques to enable automatic rewriting of data analytics programs, embedding differential privacy mechanisms into data analytics queries so that one doesn't need to change the back end of the original database and the data analyst doesn't need to know anything about differential privacy. The rewritten data analytics query then automatically produces differentially private results, which makes it easy to deploy differential privacy while providing automatic protection for users' data privacy. In our recent work we have also developed a new higher-order language and type system to statically analyze programs and enforce differential privacy: in this language we are able to automatically compute privacy bounds and hence automatically prove differential privacy for, for example, machine learning algorithms. This work won a distinguished paper award at a top programming languages conference, OOPSLA, and some of our work has been deployed and piloted in industry.

In the interest of time, I won't be able to go into the details of all these different component technologies. When we combine them, including secure computing, differential privacy, federated learning, and distributed ledgers, we can build a secure distributed computing fabric that can serve as the platform for a responsible data economy. Here is a quick workflow showing how these component technologies, put together, can enable this platform. The data provider provides bundles of data together with the desired policy for how the data may be utilized; for example, the data may only be used for training differentially private machine learning models. The data is encrypted and bundled with the data-use policy, and this information is committed to a blockchain, which provides an immutable ledger of the data rights and data-use policy. Now, when an analyst or data consumer wants to use the data, the analyst submits the analytics or machine learning program they want to run on the data. An access controller takes the latest data-use policy for the data from the blockchain, together with the analyst's program. The access controller also has a static-analyzer component, which checks the program against the policy and outputs what we call a residual policy that governs the use of the computation results. If the static analyzer determines that the program is compliant with the data's use policy, it sends a proof of policy compliance to a distributed key manager. The access controller then instantiates a trusted execution environment, the encrypted data is loaded into it, and the distributed key manager sends the decryption key to the trusted execution environment, where the data is decrypted and the computation is performed. The computation results are then encrypted along with the residual policy, and this information is also committed to the blockchain, providing an immutable log of how the data has been utilized.
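Here is a very rough, hypothetical sketch of that data flow condensed into plain Python. None of the names come from an actual platform: a list stands in for the blockchain, a dictionary for the distributed key manager, and the "enclave" is an ordinary function, so this only illustrates the sequence of checks, not real security guarantees.

```python
from dataclasses import dataclass, field

@dataclass
class DataBundle:
    encrypted_data: bytes
    policy: set[str]              # e.g. {"dp_training_only"}

@dataclass
class Ledger:                     # stand-in for the blockchain
    entries: list[dict] = field(default_factory=list)
    def commit(self, record: dict) -> None:
        self.entries.append(record)

def static_check(program: dict, policy: set[str]) -> set[str] | None:
    """Return the residual policy if the program complies with the data policy."""
    if program["kind"] in policy:
        return {"no_further_sharing"}          # residual policy on the results
    return None

def run_in_tee(encrypted: bytes, key: bytes, program: dict) -> bytes:
    """Placeholder for decrypt-compute-re-encrypt inside an enclave."""
    return b"encrypted results of " + program["name"].encode()

ledger = Ledger()
key_manager = {"dataset-1": b"secret-key"}     # releases keys only after a compliance proof

bundle = DataBundle(b"...ciphertext...", {"dp_training_only"})
ledger.commit({"dataset": "dataset-1", "policy": sorted(bundle.policy)})

program = {"name": "train_dp_model", "kind": "dp_training_only"}
residual = static_check(program, bundle.policy)
if residual is not None:
    key = key_manager["dataset-1"]             # key released to the TEE, not the analyst
    results = run_in_tee(bundle.encrypted_data, key, program)
    ledger.commit({"dataset": "dataset-1", "used_by": program["name"],
                   "residual_policy": sorted(residual)})
```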
Depending on the residual policy, these encrypted results can then be further utilized: for example, the analyst can download the results, query the trained models, or run additional programs that satisfy the residual policy. The data provider, in turn, can inspect the blockchain to see how their data has been utilized.

Overall, this distributed secure computing platform can be used to support what I call data commons for decentralized data science. In this setting, data owners and data producers register their datasets, with policies specified, in a distributed data catalog. Data consumers and analysts can then search this distributed catalog, and as they find relevant data, they can write analytics and machine learning programs over these different datasets and data sources and submit the programs to the distributed platform I described earlier. This distributed execution platform then provides distributed secure computing while ensuring that the programs are compliant with the desired policies, enabling privacy-preserving utilization of data from distributed data sources. This approach can help reduce the friction of data use, remove data silos, and enforce strong security and privacy protection.

I strongly believe that in 10 years, data trusts and data commons will become predominant ways of utilizing diverse sources of data, enabling an ownership economy where users benefit from their data as owners and partners. In 10 years, data stewards, fiduciaries, and trustees will be a new class of entities important in the data ecosystem, managing and protecting users' data and growing its value. And in 10 years, huge economic value will be created through these new forms of data trusts and data commons, orders of magnitude higher than today's data marketplaces.

When we combine privacy computing and blockchain, we can also enable a new type of asset that I call data assets. Blockchain helps provide an immutable log of users' rights to data and of data-use policies, and also helps enforce that the use policies are satisfied; privacy computing ensures that data remains private during compute and cannot be reused without permission. This capsule of data and policies creates a new type of asset, data assets, which helps users both protect their data privacy and gain value from their data. With this approach, with data assets, we can unlock a new responsible data economy where users and businesses can maintain data rights and earn value from their data assets. In our recent work at Oasis Labs, we have been putting this approach into practice; in particular, we have developed technologies to help users maintain control of their genomic data while enabling that data to be utilized in a privacy-preserving way, and to enable users to gain value from their data as well.

So far I have focused on the first component of a responsible data economy, the technical solutions. Now I will briefly talk about the two other components: incentive models and legal frameworks. We need better incentive models, for example for how to determine and distribute the value of data; existing data valuation approaches are ad hoc and insufficient. In our recent work, we showed that we can model machine learning as a coalitional game and utilize the concept of the Shapley value to build a rigorous framework for distributing the value created by machine learning models to the data contributors. In the interest of time, I won't go through the details of the Shapley value; at a high level, the Shapley value computes, for each player, the expected marginal contribution of that player's data to the utility of the trained machine learning model, and we have developed efficient algorithms to compute Shapley values for models trained on large-scale data.
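As a small illustration of the Shapley-value idea (a generic Monte Carlo permutation-sampling estimator, not the efficient algorithms mentioned above), here is a sketch in which each data point's value is its average marginal contribution to a caller-supplied utility function, such as the validation accuracy of a model trained on that subset of data. The toy utility below is made up.

```python
import random
from typing import Callable

def shapley_monte_carlo(
    points: list[int],
    utility: Callable[[frozenset[int]], float],
    num_permutations: int = 200,
) -> dict[int, float]:
    """Estimate Shapley values by averaging marginal contributions
    over random orderings of the data points."""
    values = {p: 0.0 for p in points}
    for _ in range(num_permutations):
        order = points[:]
        random.shuffle(order)
        coalition: frozenset[int] = frozenset()
        prev_utility = utility(coalition)
        for p in order:
            coalition = coalition | {p}
            u = utility(coalition)
            values[p] += u - prev_utility      # marginal contribution of p
            prev_utility = u
    return {p: v / num_permutations for p, v in values.items()}

# Toy utility: points 0 and 1 are informative, point 2 adds little.
def toy_utility(subset: frozenset[int]) -> float:
    return 0.4 * (0 in subset) + 0.4 * (1 in subset) + 0.05 * (2 in subset)

print(shapley_monte_carlo([0, 1, 2], toy_utility))
# -> {0: 0.4, 1: 0.4, 2: 0.05}: value is split according to each point's contribution
```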
We also need to establish legal frameworks to help address some of the open challenges regarding data rights: for example, what are the rights, and who can exercise them? Individual property rights are a cornerstone of the modern economy; they helped establish modern economics and fueled centuries of significant growth. Today, however, we lack an adequate framework for data rights. Establishing data rights will allow individuals to derive value from their data, propel economic growth, and unlock new value. Diverse concepts and frameworks have been proposed in this space, and overall we need data-driven, technology-informed regulation: for example, how will advancements in responsible data technology influence and impact regulatory frameworks, and how can regulation help with faster and broader adoption of responsible data technology?

For the future of the internet, we must build a responsible data economy, and the 2020s is the decade for building a responsible data economy. Thank you.