Use differential privacy in Azure Machine Learning (preview)

Learn how to apply differential privacy best practices to Azure Machine Learning models by using the SmartNoise Python open-source libraries.

Differential privacy is the gold-standard definition of privacy. Systems that adhere to this definition of privacy provide strong assurances against a wide range of data reconstruction and reidentification attacks, including attacks by adversaries who possess auxiliary information. Learn more about how differential privacy works.


Install SmartNoise Python libraries

Standalone installation

The libraries are designed to work from distributed Spark clusters, and can be installed just like any other package.

The instructions below assume that your python and pip commands are mapped to python3 and pip3.

Use pip to install the SmartNoise Python packages.

pip install opendp-smartnoise

To verify that the packages are installed, launch a Python prompt and type:

import opendp.smartnoise.core
import opendp.smartnoise.sql

If the imports succeed, the libraries are installed and ready to use.
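The same check can be scripted, for example in a setup step that should report a missing package instead of stopping with a traceback. The following is a minimal sketch that uses only the Python standard library; the helper name is_importable is hypothetical and is not part of SmartNoise.

```python
import importlib.util


def is_importable(module_name):
    """Return True if module_name can be located on the current Python path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing parent package (for example "opendp") raises
        # ModuleNotFoundError, which is a subclass of ImportError.
        return False


for name in ("opendp.smartnoise.core", "opendp.smartnoise.sql"):
    status = "installed" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports "missing", rerun the pip command above in the same Python environment.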

Docker image

You can also use SmartNoise packages with Docker.

Pull the opendp/smartnoise image to use the libraries inside a Docker container that includes Spark, Jupyter, and sample code.

docker pull opendp/smartnoise:privacy

Once you've pulled the image, launch the Jupyter server:

docker run --rm -p 8989:8989 --name smartnoise-run opendp/smartnoise:privacy

This starts a Jupyter server at port 8989 on your localhost, with password pass@word99. Assuming you used the command line above to start the container with the name smartnoise-run, you can open a bash terminal in the Jupyter server by running:

docker exec -it smartnoise-run bash

The Docker instance clears all state on shutdown, so you will lose any notebooks you create in the running instance. To remedy this, you can bind mount a local folder to the container when you launch it:

docker run --rm -p 8989:8989 --name smartnoise-run --mount type=bind,source=/Users/your_name/my-notebooks,target=/home/privacy/my-notebooks opendp/smartnoise:privacy

Any notebooks you create under the my-notebooks folder will be stored in your local filesystem.

Perform data analysis

To prepare a differentially private release, you need to choose a data source, a statistic, and some privacy parameters that indicate the level of privacy protection.

This sample references the California Public Use Microdata (PUMS), representing anonymized records of citizen demographics:

import os
import sys
import numpy as np
import opendp.smartnoise.core as sn

data_path = os.path.join('.', 'data', 'PUMS_california_demographics_1000', 'data.csv')
var_names = ["age", "sex", "educ", "race", "income", "married", "pid"]

In this example, we compute the mean and the variance of the age. We use a total epsilon of 1.0, where epsilon is the privacy parameter, and spread this privacy budget across the two quantities we want to compute (0.65 for the mean and 0.35 for the variance). Learn more about privacy metrics.

with sn.Analysis() as analysis:
    # load data
    data = sn.Dataset(path = data_path, column_names = var_names)

    # get mean of age, spending 0.65 of the epsilon budget
    age_mean = sn.dp_mean(data = sn.cast(data['age'], type="FLOAT"),
                          privacy_usage = {'epsilon': .65},
                          data_lower = 0.,
                          data_upper = 100.,
                          data_n = 1000)
    # get variance of age, spending the remaining 0.35
    age_var = sn.dp_variance(data = sn.cast(data['age'], type="FLOAT"),
                             privacy_usage = {'epsilon': .35},
                             data_lower = 0.,
                             data_upper = 100.,
                             data_n = 1000)

print("DP mean of age: {0}".format(age_mean.value))
print("DP variance of age: {0}".format(age_var.value))
print("Privacy usage: {0}".format(analysis.privacy_usage))

The results look something like those below:

DP mean of age: 44.55598845931517
DP variance of age: 231.79044646429134
Privacy usage: approximate {
  epsilon: 1.0
}

There are some important things to note about this example. First, the Analysis object represents a data processing graph. In this example, the mean and variance are computed from the same source node. However, you can include more complex expressions that combine inputs with outputs in arbitrary ways.

The analysis graph includes data_upper and data_lower metadata, specifying the upper and lower bounds for ages. These values are used to precisely calibrate the noise to ensure differential privacy. They are also used in some handling of outliers or missing values.
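To illustrate why the bounds matter, the following sketch shows one common way to calibrate noise for a mean: clamp each value to the bounds, derive the sensitivity from the width of the bounds and the dataset size, and add Laplace noise scaled by sensitivity divided by epsilon. This is a simplified illustration on synthetic data, not SmartNoise's implementation. The function names are hypothetical, and the standard random module is not suitable for a real privacy release.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def clamp(value, lower, upper):
    """Force a value into [lower, upper] so one record has bounded influence."""
    return max(lower, min(upper, value))


def private_mean(values, lower, upper, epsilon, rng):
    """Illustrative Laplace-mechanism mean for a dataset of known size n."""
    n = len(values)
    clamped = [clamp(v, lower, upper) for v in values]
    # Replacing one record moves the clamped mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    scale = sensitivity / epsilon
    return sum(clamped) / n + laplace_noise(scale, rng), scale


rng = random.Random(0)
ages = [rng.randint(18, 90) for _ in range(1000)]  # synthetic data, not PUMS
noisy_mean, scale = private_mean(ages, lower=0.0, upper=100.0, epsilon=0.65, rng=rng)
print(f"noise scale: {scale:.4f}")
print(f"noisy mean of synthetic ages: {noisy_mean:.2f}")
```

Wider bounds or a smaller epsilon increase the noise scale, which is why bounds that are tight but still valid give more useful releases.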

Finally, the analysis graph keeps track of the total privacy budget spent.
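Budget tracking of this kind can be pictured with basic sequential composition, where the epsilon values of individual releases add up. The sketch below is a hypothetical accountant written for illustration. The class name BudgetTracker and its behavior of refusing releases that exceed the budget are assumptions of this example, and SmartNoise's own accounting may work differently.

```python
import math


class BudgetTracker:
    """Hypothetical accountant applying basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Record a release, refusing it if it would exceed the budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        new_total = self.spent + epsilon
        over = new_total > self.total_epsilon
        if over and not math.isclose(new_total, self.total_epsilon):
            raise RuntimeError("privacy budget exceeded")
        self.spent = new_total
        return self.spent


tracker = BudgetTracker(total_epsilon=1.0)
tracker.charge(0.65)  # mean of age
tracker.charge(0.35)  # variance of age
print(f"epsilon spent: {tracker.spent:.2f}")
```

The two charges mirror the split used in the example above: 0.65 for the mean and 0.35 for the variance, for a total of 1.0.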

You can use the library to compose more complex analysis graphs, with several mechanisms, statistics, and utility functions:

Statistics            Mechanisms    Utilities
Count                 Gaussian      Cast
Histogram             Geometric     Clamping
Mean                  Laplace       Digitize
Quantiles                           Filter
Sum                                 Imputation
Variance/Covariance                 Transform

See the data analysis notebook for more details.

Approximate utility of differentially private releases

Because differential privacy operates by calibrating noise, the utility of releases may vary depending on the privacy risk. Generally, the noise needed to protect each individual becomes negligible as sample sizes grow large, but it overwhelms the result for releases that target a single individual. Analysts can review the accuracy information for a release to determine how useful the release is:

with sn.Analysis() as analysis:
    # load data
    data = sn.Dataset(path = data_path, column_names = var_names)

    # get mean of age
    age_mean = sn.dp_mean(data = sn.cast(data['age'], type="FLOAT"),
                          privacy_usage = {'epsilon': .65},
                          data_lower = 0.,
                          data_upper = 100.,
                          data_n = 1000)

print("Age accuracy is: {0}".format(age_mean.get_accuracy(0.05)))

The result of that operation should look similar to that below:

Age accuracy is: 0.2995732273553991

This example computes the mean as above, and uses the get_accuracy function to request accuracy at an alpha of 0.05. An alpha of 0.05 represents a 95% interval, in that the released value will fall within the reported accuracy bounds about 95% of the time. In this example, the reported accuracy is 0.3, which means the released value will be within an interval of width 0.6, about 95% of the time. It is not correct to think of this value as an error bar, since the released value will fall outside the reported accuracy range at the rate specified by alpha, and values outside the range may be outside in either direction.

Analysts may query get_accuracy for different values of alpha to get narrower or wider confidence intervals, without incurring additional privacy cost.
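For Laplace noise, the relationship between alpha and accuracy has a simple closed form: noise with scale b exceeds a in absolute value with probability exp(-a / b), so the half-width at a given alpha is b * ln(1 / alpha). The sketch below applies that formula. The function name laplace_accuracy and the scale of 0.1 are assumptions made for illustration, and whether get_accuracy computes its result in exactly this way is not confirmed here.

```python
import math


def laplace_accuracy(scale, alpha):
    """Half-width a such that P(|Laplace(0, scale)| > a) = alpha."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be in (0, 1)")
    return scale * math.log(1.0 / alpha)


scale = 0.1  # assumed noise scale, chosen for illustration only
for alpha in (0.01, 0.05, 0.10):
    print(f"alpha={alpha:.2f}: accuracy={laplace_accuracy(scale, alpha):.4f}")
```

The calculation reads only the noise scale and alpha, never the data, which is consistent with repeated accuracy queries having no privacy cost.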

Generate a histogram

The built-in dp_histogram function creates differentially private histograms over any of the following data types:

  • A continuous variable, where the set of numbers has to be divided into bins
  • A boolean or dichotomous variable that can only take on two values
  • A categorical variable, where there are distinct categories enumerated as strings

Here is an example of an Analysis specifying bins for a continuous variable histogram:

income_edges = list(range(0, 100000, 10000))

with sn.Analysis() as analysis:
    data = sn.Dataset(path = data_path, column_names = var_names)

    income_histogram = sn.dp_histogram(
            sn.cast(data['income'], type='int', lower=0, upper=100),
            edges = income_edges,
            upper = 1000,
            null_value = 150,
            privacy_usage = {'epsilon': 0.5})

Because the individuals are disjointly partitioned among histogram bins, the privacy cost is incurred only once per histogram, even if the histogram includes many bins.
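The reasoning can be sketched in plain Python. Each record lands in exactly one bin, so adding or removing one record changes a single count by 1, and every bin can be perturbed with noise calibrated to the full epsilon. The sketch below swaps in Laplace noise for simplicity, assumes add-or-remove neighboring datasets, and uses synthetic data; the function names are hypothetical and this is not how dp_histogram is implemented.

```python
import bisect
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_histogram(values, edges, epsilon, rng):
    """Noisy counts for bins [edges[i], edges[i+1]).

    Values outside the edges are clamped into the first or last bin.
    """
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        idx = bisect.bisect_right(edges, v) - 1
        idx = max(0, min(n_bins - 1, idx))
        counts[idx] += 1
    # One record affects one bin by at most 1, so the whole histogram has
    # sensitivity 1 and each bin gets noise with scale 1 / epsilon.
    scale = 1.0 / epsilon
    noisy = [c + laplace_noise(scale, rng) for c in counts]
    # The exact counts are returned only for comparison in this demo;
    # a real release would publish the noisy counts alone.
    return counts, noisy


rng = random.Random(1)
incomes = [rng.randint(0, 120000) for _ in range(1000)]  # synthetic data
edges = list(range(0, 100000, 10000))
true_counts, noisy_counts = private_histogram(incomes, edges, epsilon=0.5, rng=rng)
for lo, hi, noisy in zip(edges, edges[1:], noisy_counts):
    print(f"[{lo}, {hi}): {noisy:.1f}")
```

Doubling the number of bins leaves the noise scale unchanged here, which is the practical meaning of paying the privacy cost once per histogram.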

For more on histograms, see the histograms notebook.

Generate a covariance matrix

SmartNoise offers three different functionalities with its dp_covariance function:

  • Covariance between two vectors
  • Covariance matrix of a matrix
  • Cross-covariance matrix of a pair of matrices

Here is an example of computing a scalar covariance:

with sn.Analysis() as analysis:
    wn_data = sn.Dataset(path = data_path, column_names = var_names)

    age_income_cov_scalar = sn.dp_covariance(
        left = sn.cast(wn_data['age'], type = "FLOAT"),
        right = sn.cast(wn_data['income'], type = "FLOAT"),
        privacy_usage = {'epsilon': 1.0},
        left_lower = 0.,
        left_upper = 100.,
        left_n = 1000,
        right_lower = 0.,
        right_upper = 500_000.,
        right_n = 1000)

For more information, see the covariance notebook.

Next steps