Configure Apache Ranger policies for Spark SQL in HDInsight with Enterprise Security Package
This article describes how to configure Apache Ranger policies for Spark SQL with Enterprise Security Package in HDInsight.
In this article, you learn how to:
- Create Apache Ranger policies.
- Verify the applied Ranger policies.
- Apply guidelines for setting Apache Ranger for Spark SQL.
Prerequisites
- An Apache Spark cluster in HDInsight version 5.1 with Enterprise Security Package
Connect to the Apache Ranger admin UI
From a browser, connect to the Ranger admin user interface by using the URL `https://ClusterName.azurehdinsight.cn/Ranger/`. Change `ClusterName` to the name of your Spark cluster.

Sign in by using your Microsoft Entra admin credentials. The Microsoft Entra admin credentials aren't the same as HDInsight cluster credentials or Linux HDInsight node Secure Shell (SSH) credentials.
Create domain users
For information on how to create `sparkuser` domain users, see Create an HDInsight cluster with ESP. In a production scenario, domain users come from your Microsoft Entra tenant.
Create a Ranger policy
In this section, you create two Ranger policies:
- An access policy for accessing `hivesampletable` from Spark SQL
- A masking policy for obfuscating the columns in `hivesampletable`
Create a Ranger access policy
Open the Ranger admin UI.
Under HADOOP SQL, select hive_and_spark.
On the Access tab, select Add New Policy.
Enter the following values:
| Property | Value |
| --- | --- |
| Policy Name | read-hivesampletable-all |
| database | default |
| table | hivesampletable |
| column | * |
| Select User | sparkuser |
| Permissions | select |

If a domain user isn't automatically populated for Select User, wait a few moments for Ranger to sync with Microsoft Entra ID.
Select Add to save the policy.
Open a Zeppelin notebook and run the following command to verify the policy:
```sql
%sql select * from hivesampletable limit 10;
```
Before the policy is applied, the query fails for sparkuser because no policy grants the user access to the table. After the policy is applied, the query returns the table data.
Create a Ranger masking policy
The following example shows how to create a policy to mask a column:
On the Masking tab, select Add New Policy.
Enter the following values:
| Property | Value |
| --- | --- |
| Policy Name | mask-hivesampletable |
| Hive Database | default |
| Hive Table | hivesampletable |
| Hive Column | devicemake |
| Select User | sparkuser |
| Access Types | select |
| Select Masking Option | Hash |

Select Save to save the policy.
Open a Zeppelin notebook and run the following command to verify the policy:
```sql
%sql select clientId, deviceMake from hivesampletable;
```

For sparkuser, the deviceMake column now returns hashed values instead of the original device names.
Note
By default, the policies for Hive and Spark SQL are common in Ranger.
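Because the policies are shared, the same query run through Hive (for example, from a Beeline session as sparkuser) should return the same masked devicemake values as the Spark SQL verification. A sketch:

```sql
-- Same verification query, run through Hive instead of Spark SQL.
-- With the shared masking policy in place, devicemake should come back
-- hashed here too for sparkuser.
select clientid, devicemake from hivesampletable limit 10;
```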
Apply guidelines for setting up Apache Ranger for Spark SQL
The following scenarios explore guidelines for creating an HDInsight 5.1 Spark cluster by using a new Ranger database and by using an existing Ranger database.
Scenario 1: Use a new Ranger database while creating an HDInsight 5.1 Spark cluster
When you use a new Ranger database to create a cluster, the relevant Ranger repo that contains the Ranger policies for Hive and Spark is created under the name hive_and_spark in the Hadoop SQL service on the Ranger database.
If you edit the policies, they're applied to both Hive and Spark.
Consider these points:
If you have two metastore databases with the same name used for both Hive (for example, DB1) and Spark (for example, DB1) catalogs:

- If Spark uses the Spark catalog (`metastore.catalog.default=spark`), the policies are applied to the DB1 database of the Spark catalog.
- If Spark uses the Hive catalog (`metastore.catalog.default=hive`), the policies are applied to the DB1 database of the Hive catalog.

From the perspective of Ranger, there's no way to differentiate between DB1 of the Hive and Spark catalogs.

In such cases, we recommend that you either:

- Use the Hive catalog for both Hive and Spark.
- Maintain different database, table, and column names for the Hive and Spark catalogs so that the policies aren't applied to databases across catalogs (see the sketch after this list).
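A minimal sketch of the naming recommendation, using hypothetical database names (`hive_sales` and `spark_sales` are illustrative, not objects on your cluster):

```sql
-- Hypothetical convention: make database names distinct per catalog so a
-- Ranger policy written for one catalog's database can never match a
-- same-named database in the other catalog.
CREATE DATABASE hive_sales;   -- create through Hive, so it lives in the Hive catalog
CREATE DATABASE spark_sales;  -- create through Spark SQL, so it lives in the Spark catalog
```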
If you use the Hive catalog for both Hive and Spark, consider the following example.
Let's say that you create a table named table1 through Hive with the current xyz user. It creates a Hadoop Distributed File System (HDFS) file named table1.db whose owner is the xyz user.
Now imagine that you start a Spark SQL session as the user abc. If you try to write anything to table1 in that session, the write is bound to fail because the table owner is xyz.
In such a case, we recommend that you use the same user in Hive and Spark SQL for updating the table. That user should have sufficient privileges to perform update operations.
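A minimal sketch of this failure mode (the table name comes from the example above; the INSERT statement and its values are illustrative):

```sql
-- Run as user xyz through Hive: creates table1, and the backing HDFS
-- storage ends up owned by xyz.
CREATE TABLE table1 (id INT, value STRING);

-- Run as user abc in a Spark SQL session: even if Ranger grants abc
-- write access, the write fails at the HDFS layer because abc doesn't
-- own the files that back table1.
INSERT INTO table1 VALUES (1, 'test');
```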
Scenario 2: Use an existing Ranger database (with existing policies) while creating an HDInsight 5.1 Spark cluster
When you create an HDInsight 5.1 cluster by using an existing Ranger database, a new Ranger repo is created again on this database with the name of the new cluster in this format: hive_and_spark.
Let's say that you have the policies defined in the Ranger repo already under the name oldclustername_hive on the existing Ranger database inside the Hadoop SQL service. You want to share the same policies in the new HDInsight 5.1 Spark cluster. To achieve this goal, use the following steps.
Note
A user who has Ambari admin privileges can perform configuration updates.
Open the Ambari UI from your new HDInsight 5.1 cluster.
Go to the Spark3 service, and then go to Configs.
Open the Advanced ranger-spark-security configuration. You can also open this configuration in /etc/spark3/conf by using SSH.
Edit two configurations (ranger.plugin.spark.service.name and ranger.plugin.spark.policy.cache.dir) to point to the old policy repo oldclustername_hive, and then save the configurations.
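In Ambari, you make this change on the Spark3 Configs page described earlier. In the XML file, the edited properties would look something like this sketch (the policy cache path is an assumption based on Ranger's usual `/etc/ranger/<service name>/policycache` layout; verify it on your cluster):

```xml
<!-- Sketch: ranger-spark-security.xml in /etc/spark3/conf, pointing the
     Spark plugin at the old policy repo. The cache path below assumes the
     usual /etc/ranger/<service name>/policycache convention. -->
<property>
    <name>ranger.plugin.spark.service.name</name>
    <value>oldclustername_hive</value>
</property>
<property>
    <name>ranger.plugin.spark.policy.cache.dir</name>
    <value>/etc/ranger/oldclustername_hive/policycache</value>
</property>
```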
Restart the Ranger and Spark services from Ambari.
Open the Ranger admin UI and select the edit button under the HADOOP SQL service.
For the oldclustername_hive service, add the rangersparklookup user to the policy.download.auth.users and tag.download.auth.users lists, and then select Save.
The policies are applied on databases in the Spark catalog. If you want to access the databases in the Hive catalog, change `metastore.catalog.default` from `spark` to `hive` in the Spark configuration.
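A sketch of that setting as an XML property (where exactly it lives depends on how your cluster manages Spark's Hive metastore configuration; on HDInsight, it's typically edited through the Spark configs in Ambari):

```xml
<!-- Sketch: switch Spark SQL from the Spark catalog to the Hive catalog.
     Set this in the Hive metastore client configuration that Spark uses. -->
<property>
    <name>metastore.catalog.default</name>
    <value>hive</value>
</property>
```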
Known issues
- Apache Ranger integration with Spark SQL doesn't work if the Ranger admin is down.
- In Ranger audit logs, hovering over the Resource column doesn't show the entire query that you ran.