Migrate on-premises Apache Hadoop clusters to Azure HDInsight - security and DevOps best practices

This article gives recommendations for security and DevOps in Azure HDInsight systems. It's part of a series that provides best practices to assist with migrating on-premises Apache Hadoop systems to Azure HDInsight.

Secure and govern cluster with Enterprise Security Package

The Enterprise Security Package (ESP) supports Active Directory-based authentication, multiuser support, and role-based access control. With the ESP option chosen, the HDInsight cluster is joined to the Active Directory domain, and the enterprise admin can configure role-based access control (RBAC) for Apache Hive security by using Apache Ranger. The admin can also audit data access by employees and any changes made to access control policies.

ESP is available on the following cluster types: Apache Hadoop, Apache Spark, Apache HBase, Apache Kafka, and Interactive Query (Hive LLAP).

Use the following steps to deploy a domain-joined HDInsight cluster:

  • Deploy Microsoft Entra ID by passing the Domain name.

  • Deploy Microsoft Entra Domain Services.

  • Create the required Virtual Network and subnet.

  • Deploy a VM in the Virtual Network to manage Microsoft Entra Domain Services.

  • Join the VM to the domain.

  • Install AD and DNS tools.

  • Have the Microsoft Entra Domain Services Administrator create an Organizational Unit (OU).

  • Enable LDAPS for Microsoft Entra Domain Services.

  • Create a service account in Microsoft Entra ID with delegated read and write admin permission to the OU. This service account can then join machines to the domain and place machine principals within the OU. It can also create service principals within the OU that you specify during cluster creation.

    Note

    The service account does not need to be an AD domain admin account.

  • Deploy HDInsight ESP cluster by setting the following parameters:

    Parameter: Description

    Domain name: The domain name that's associated with Microsoft Entra Domain Services.

    Domain user name: The service account in the Microsoft Entra Domain Services DC-managed domain that you created in the previous section, for example: hdiadmin@contoso.partner.onmschina.cn. This domain user will be the administrator of this HDInsight cluster.

    Domain password: The password of the service account.

    Organizational unit: The distinguished name of the OU that you want to use with the HDInsight cluster, for example: OU=HDInsightOU,DC=contoso,DC=onmicrosoft,DC=com. If this OU doesn't exist, the HDInsight cluster tries to create the OU using the privileges of the service account.

    LDAPS URL: For example, ldaps://contoso.partner.onmschina.cn:636.

    Access user group: The security groups whose users you want to sync to the cluster, for example: HiveUsers. To specify multiple user groups, separate them with a semicolon (';'). The groups must exist in the directory before you create the ESP cluster.
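
Before submitting a deployment, it can help to check the ESP parameters locally. The following is a minimal sketch in Python, not an HDInsight API: the `EspSettings` class and the `build_esp_settings` function are hypothetical names introduced for illustration, and the validation rules (an `ldaps://` scheme, an OU value that starts with `OU=`, at least one group) are assumptions drawn from the parameter descriptions above. The domain password is deliberately left out so that it isn't held in a plain object.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class EspSettings:
    """Hypothetical container for the ESP parameters described above."""
    domain_name: str
    domain_user_name: str
    organizational_unit: str
    ldaps_url: str
    access_user_groups: str  # semicolon-separated list of group names


def build_esp_settings(domain_name, domain_user_name, organizational_unit,
                       ldaps_url, groups):
    """Validate the inputs and return an EspSettings instance.

    Raises ValueError on the first problem found. The checks are
    illustrative assumptions, not rules enforced by HDInsight.
    """
    if "@" not in domain_user_name:
        raise ValueError("domain user name should look like user@domain")

    parsed = urlparse(ldaps_url)
    if parsed.scheme != "ldaps" or not parsed.hostname:
        raise ValueError("LDAPS URL must use the ldaps:// scheme")

    # An OU distinguished name is expected to begin with an OU= component.
    if not organizational_unit.upper().startswith("OU="):
        raise ValueError("organizational unit must be a distinguished name")

    cleaned = [g.strip() for g in groups if g.strip()]
    if not cleaned:
        raise ValueError("at least one access user group is required")
    if any(";" in g for g in cleaned):
        raise ValueError("group names must not contain ';'")

    # Multiple groups are joined with semicolons.
    return EspSettings(domain_name, domain_user_name, organizational_unit,
                       ldaps_url, ";".join(cleaned))
```

For example, calling `build_esp_settings` with the sample values shown above and the groups `["HiveUsers"]` returns an object whose `access_user_groups` field is the single string to supply for the access user group parameter.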

Implement end to end enterprise security

End to end enterprise security can be achieved using the following controls:

Private and protected data pipeline (perimeter-level security)

  • Perimeter-level security can be achieved through Azure Virtual Networks, Network Security Groups, and the Gateway service.

Authentication and authorization for data access

  • Create a domain-joined HDInsight cluster using Microsoft Entra Domain Services (Enterprise Security Package).

  • Use Ambari to provide role-based access to cluster resources for AD users.

  • Use Apache Ranger to set access control policies for Hive at the table, column, and row level.

  • SSH access to the cluster can be restricted to the administrator only.

Auditing

  • View and report all access to the HDInsight cluster resources and data.

  • View and report all changes to the access control policies.

Encryption

  • Transparent server-side encryption using Azure-managed keys or customer-managed keys.

  • In-transit encryption using client-side encryption, HTTPS, and TLS.
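
To make the Apache Ranger control listed above more concrete, the following Python sketch builds a dictionary shaped like a column-level Hive access policy. It is an illustration under assumptions: the field names follow the author's understanding of Ranger's public policy model and should be verified against the Ranger version on your cluster, and the function name `hive_column_policy` and all values passed to it are hypothetical. Row-level filters are configured through a separate policy type and are not shown.

```python
import json


def hive_column_policy(service, name, database, table, columns, groups):
    """Return a dict resembling a Ranger Hive policy that grants SELECT
    on specific columns to the given groups.

    Field names are assumed from Ranger's policy model; verify them
    against your Ranger version before use.
    """
    def resource(values):
        return {"values": list(values), "isExcludes": False,
                "isRecursive": False}

    return {
        "service": service,          # name of the Hive service in Ranger
        "name": name,                # policy name
        "isEnabled": True,
        "resources": {
            "database": resource([database]),
            "table": resource([table]),
            "column": resource(columns),
        },
        "policyItems": [{
            "groups": list(groups),
            "users": [],
            "accesses": [{"type": "select", "isAllowed": True}],
            "delegateAdmin": False,
        }],
    }


# Hypothetical example values; none of these names come from a real cluster.
policy = hive_column_policy(
    service="example_hive",
    name="sales-readonly-columns",
    database="sales",
    table="orders",
    columns=["order_id", "order_date"],
    groups=["HiveUsers"],
)
policy_json = json.dumps(policy, indent=2)
```

Keeping the policy as data makes it possible to review it in source control before an administrator applies it through the Ranger UI or API.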

Use monitoring & alerting

For more information, see the article: Azure Monitor Overview.

Upgrade clusters

Regularly upgrade to the latest HDInsight version to take advantage of the latest features. The following steps can be used to upgrade the cluster to the latest version:

  1. Create a new TEST HDInsight cluster using the latest available HDInsight version.
  2. Test on the new cluster to make sure that the jobs and workloads work as expected.
  3. Modify jobs, applications, or workloads as required.
  4. Back up any transient data stored locally on the cluster nodes.
  5. Delete the existing cluster.
  6. Create a cluster of the latest HDInsight version in the same virtual network subnet, using the same default data store and metastore as the previous cluster.
  7. Import any transient data that was backed up.
  8. Start jobs/continue processing using the new cluster.
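
The order of the steps above matters: the existing cluster should be deleted only after testing and backup have succeeded. The following Python sketch shows one way to encode that ordering in a runbook. It is an illustration only: the step names and the `run_upgrade` function are hypothetical, and each action is assumed to be a zero-argument callable that you supply and that returns True on success.

```python
# Step names mirror the eight upgrade steps listed above (hypothetical labels).
UPGRADE_STEPS = [
    "create_test_cluster",
    "test_workloads",
    "modify_workloads",
    "back_up_transient_data",
    "delete_existing_cluster",
    "create_new_cluster",
    "import_transient_data",
    "start_jobs",
]


def run_upgrade(actions):
    """Run the actions in UPGRADE_STEPS order and stop at the first failure.

    `actions` maps each step name to a zero-argument callable that
    returns True on success. Returns the list of steps that completed,
    so later steps (such as deleting the old cluster) never run after
    an earlier step has failed.
    """
    missing = [step for step in UPGRADE_STEPS if step not in actions]
    if missing:
        raise KeyError(f"no action supplied for: {', '.join(missing)}")

    completed = []
    for step in UPGRADE_STEPS:
        if not actions[step]():
            break
        completed.append(step)
    return completed
```

If the action for `test_workloads` returns False, `run_upgrade` returns `["create_test_cluster"]` and the existing cluster is left untouched.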

For more information, see the article: Upgrade HDInsight cluster to a new version.

Patch cluster operating systems

For more information, see the article: OS patching for HDInsight.

Post-Migration

  1. Remediate applications - Iteratively make the necessary changes to the jobs, processes, and scripts.
  2. Perform Tests - Iteratively run functional and performance tests.
  3. Optimize - Address any performance issues based on the above test results and then retest to confirm the performance improvements.

Next steps

Read more about HDInsight 4.0.