What is differential privacy in machine learning (preview)?

Learn about differential privacy in machine learning and how it works.

As the amount of data that an organization collects and uses for analysis increases, so do concerns about privacy and security. Analyses require data. Typically, the more data used to train a model, the more accurate it is. When personal information is used in these analyses, it's especially important that the data remain private throughout its use.

How differential privacy works

Differential privacy is a set of systems and practices that help keep the data of individuals safe and private.

Differential privacy machine learning process

In traditional scenarios, raw data is stored in files and databases. When users analyze data, they typically work with the raw data directly. This is a concern because it might infringe on an individual's privacy. Differential privacy tries to address this problem by adding "noise," or randomness, to the data so that users can't identify any individual data point. At a minimum, such a system provides plausible deniability. As a result, the privacy of individuals is preserved with limited impact on the accuracy of the data.

In differentially private systems, data is shared through requests called queries. When a user submits a query for data, operations known as privacy mechanisms add noise to the requested data. Privacy mechanisms return an approximation of the data instead of the raw data. This privacy-preserving result appears in a report. Reports consist of two parts: the actual data computed, and a description of how the data was created.
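The noise-adding step described above can be sketched with the classic Laplace mechanism, the textbook example of a privacy mechanism for counting queries. This is a minimal illustration, not the SmartNoise implementation; the function names (`laplace_sample`, `private_count`) and the sample data are hypothetical.

```python
import math
import random

def laplace_sample(scale: float) -> float:
    """Draw one sample from a Laplace(0, scale) distribution via inverse CDF."""
    u = random.random() - 0.5  # uniform on (-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon: float) -> float:
    """Answer a counting query with the Laplace mechanism.

    A count has sensitivity 1 (adding or removing one person changes it
    by at most 1), so adding noise drawn from Laplace(0, 1/epsilon)
    satisfies epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_sample(1.0 / epsilon)

# Hypothetical raw data: ages of individuals in a dataset.
ages = [23, 37, 41, 29, 55, 62, 34]

# The query "how many people are 40 or older?" returns an approximation
# (true answer 3 plus noise) instead of the raw count.
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5)
```

The returned value is the "approximation of the data" mentioned above: close to the true count of 3, but randomized so that no single individual's presence can be confirmed from the result.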

Differential privacy metrics

Differential privacy tries to protect against the possibility that a user could produce an indefinite number of reports and eventually reveal sensitive data. A value known as epsilon measures how noisy, or private, a report is. Epsilon has an inverse relationship to noise and privacy: the lower the epsilon, the noisier (and more private) the data is.

Epsilon values are non-negative. Values below 1 provide full plausible deniability. Anything above 1 comes with a higher risk of exposing the actual data. When you implement a differentially private system, aim to produce reports with epsilon values between 0 and 1.

Another value directly correlated with epsilon is delta. Delta measures the probability that a report is not fully private. The higher the delta, the higher the epsilon. Because these values are correlated, epsilon is used more often.

Limit queries with a privacy budget

To ensure privacy in systems where multiple queries are allowed, differential privacy defines a rate limit known as a privacy budget. Privacy budgets prevent data from being recreated through multiple queries. A privacy budget is allocated an epsilon amount, typically between 1 and 3, to limit the risk of reidentification. As reports are generated, the privacy budget tracks the epsilon value of each individual report as well as the aggregate for all reports. After a privacy budget is spent or depleted, users can no longer access the data.
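The bookkeeping described above can be sketched as a small tracker. This assumes basic sequential composition (the epsilons of individual reports simply add up); the `PrivacyBudget` class is hypothetical and not part of any SmartNoise API.

```python
class PrivacyBudget:
    """Minimal privacy-budget tracker under basic sequential composition.

    Each report spends some epsilon; once the aggregate would exceed
    the allocated budget, further queries are refused.
    """

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> bool:
        """Try to spend `epsilon` from the budget; return False if depleted."""
        if self.spent + epsilon > self.total:
            return False  # budget depleted: the query is refused
        self.spent += epsilon
        return True

# A budget of 3 allows reports until their epsilons sum past 3:
budget = PrivacyBudget(total_epsilon=3.0)
assert budget.charge(1.0)       # first report: 1.0 spent
assert budget.charge(1.5)       # second report: 2.5 spent
assert not budget.charge(1.0)   # would reach 3.5: refused
```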

Reliability of data

Although the preservation of privacy should be the goal, there's a tradeoff when it comes to the usability and reliability of the data. In data analytics, accuracy can be thought of as a measure of the uncertainty introduced by sampling errors. This uncertainty tends to fall within certain bounds. From a differential privacy perspective, accuracy instead measures the reliability of the data, which is affected by the uncertainty introduced by the privacy mechanisms. In short, a higher level of noise or privacy translates to data with a lower epsilon, lower accuracy, and lower reliability.
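The privacy/reliability tradeoff can be quantified for the Laplace mechanism. Since Laplace noise with scale b satisfies P(|noise| > t) = exp(-t/b), the 95% error bound of a release follows directly from epsilon. This is a sketch under that standard tail bound; `laplace_error_bound` is an illustrative name, not a library function.

```python
import math

def laplace_error_bound(sensitivity: float, epsilon: float,
                        confidence: float = 0.95) -> float:
    """Half-width of the error interval for a Laplace-mechanism release.

    For Laplace noise with scale b = sensitivity / epsilon,
    P(|noise| > t) = exp(-t / b), so with probability `confidence`
    the noisy answer lies within b * ln(1 / (1 - confidence))
    of the true answer.
    """
    scale = sensitivity / epsilon
    return scale * math.log(1.0 / (1.0 - confidence))

# Halving epsilon (more privacy) doubles the noise scale, and with it
# the error bound, so the report becomes less reliable:
assert laplace_error_bound(1.0, 0.5) == 2 * laplace_error_bound(1.0, 1.0)
```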

Open-source differential privacy libraries

SmartNoise is an open-source project that contains different components for building global differentially private systems. SmartNoise is made up of the following top-level components:

  • SmartNoise Core library
  • SmartNoise SDK library

SmartNoise Core

The Core library includes the following privacy mechanisms for implementing a differentially private system:

  • Analysis: A graph description of arbitrary computations.
  • Validator: A Rust library that contains a set of tools for checking and deriving the necessary conditions for an analysis to be differentially private.
  • Runtime: The medium that executes the analysis. The reference runtime is written in Rust, but runtimes can be written in any computation framework, such as SQL or Spark, depending on your data needs.
  • Bindings: Language bindings and helper libraries for building analyses. Currently, SmartNoise provides Python bindings.

SmartNoise SDK

The system library provides the following tools and services for working with tabular and relational data:

  • Data Access: A library that intercepts and processes SQL queries and produces reports. This library is implemented in Python and supports the following ODBC and DBAPI data sources:
      • PostgreSQL
      • SQL Server
      • Spark
      • Presto
      • Pandas
  • Service: An execution service that provides a REST endpoint to serve requests or queries against shared data sources. The service is designed to allow composition of differential privacy modules that operate on requests containing different delta and epsilon values, also known as heterogeneous requests. This reference implementation accounts for the additional impact of queries on correlated data.
  • Evaluator: A stochastic evaluator that checks for privacy violations, accuracy, and bias. The evaluator supports the following tests:
      • Privacy test: Determines whether a report adheres to the conditions of differential privacy.
      • Accuracy test: Measures whether the reliability of a report falls within the upper and lower bounds at a 95% confidence level.
      • Utility test: Determines whether the confidence bounds of a report are close enough to the data while still maximizing privacy.
      • Bias test: Measures the distribution of reports for repeated queries to ensure they are not unbalanced.
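The idea behind a stochastic bias test like the one above can be sketched in a few lines: repeat a noisy release many times and check that the average stays close to the true answer. This is a toy illustration of the concept, not the SmartNoise evaluator; `laplace_sample` and `bias_test` are hypothetical names.

```python
import math
import random

def laplace_sample(scale: float) -> float:
    """Draw one sample from a Laplace(0, scale) distribution via inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def bias_test(true_value: float, epsilon: float,
              runs: int = 20000, tol: float = 0.1) -> bool:
    """Release the same value repeatedly with Laplace noise and check that
    the empirical mean of the reports stays close to the true value.
    An unbiased mechanism should pass; a skewed one should not."""
    scale = 1.0 / epsilon
    releases = [true_value + laplace_sample(scale) for _ in range(runs)]
    mean = sum(releases) / runs
    return abs(mean - true_value) < tol

random.seed(7)
# Over many repetitions, the Laplace mechanism's reports average out
# to the true answer, so the distribution is not unbalanced:
assert bias_test(10.0, epsilon=1.0)
```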

Next steps

Learn how to build a differentially private system in Azure Machine Learning.

To learn more about the components of SmartNoise, check out the GitHub repositories for SmartNoise Core, SmartNoise SDK, and SmartNoise samples.