Azure Synapse Analytics terminology
This document guides you through the basic concepts of Azure Synapse Analytics.
Synapse workspace
A Synapse workspace is a securable collaboration boundary for performing cloud-based enterprise analytics in Azure. A workspace is deployed in a specific region, has an associated ADLS Gen2 account and file system (for storing temporary data), and resides under a resource group.
A workspace allows you to perform analytics with SQL and Apache Spark. Resources available for SQL and Spark analytics are organized into SQL and Spark pools.
Linked services
A workspace can contain any number of linked services: essentially, connection strings that define the connection information the workspace needs in order to reach external resources.
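To illustrate the idea, the JSON definition of a linked service for an ADLS Gen2 account has roughly the shape sketched below. Every name and URL here is a placeholder for illustration, not a real configuration:

```python
# Sketch of a linked service definition for an ADLS Gen2 account.
# All names and URLs below are placeholder values.
linked_service = {
    "name": "MyAdlsGen2LinkedService",   # hypothetical name
    "properties": {
        "type": "AzureBlobFS",           # connector type for ADLS Gen2
        "typeProperties": {
            # Endpoint of the storage account the workspace should reach
            "url": "https://myaccount.dfs.core.windows.net"
        },
    },
}

# The "connection string" nature of a linked service: everything needed
# to locate the external resource lives in one named definition.
print(linked_service["name"])
```

The workspace's activities then refer to this definition by name instead of repeating connection details.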
Synapse SQL
Synapse SQL is the ability to do T-SQL-based analytics in a Synapse workspace. Synapse SQL has two consumption models: dedicated and serverless. For the dedicated model, use dedicated SQL pools; a workspace can have any number of these pools. For the serverless model, use the serverless SQL pool; every workspace has one of these pools.
Inside Synapse Studio, you can work with SQL pools by running SQL scripts.
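As one concrete flavor of SQL script, a serverless SQL pool can query files in the linked ADLS Gen2 storage directly with T-SQL's OPENROWSET. A minimal sketch that builds such a statement (the storage path shown is a placeholder, not a real account):

```python
def openrowset_query(storage_url: str, file_format: str = "PARQUET",
                     top: int = 10) -> str:
    """Build a T-SQL OPENROWSET query for a serverless SQL pool.

    storage_url is the ADLS Gen2 path to query; the example value
    passed below is a placeholder.
    """
    return (
        f"SELECT TOP {top} *\n"
        f"FROM OPENROWSET(\n"
        f"    BULK '{storage_url}',\n"
        f"    FORMAT = '{file_format}'\n"
        f") AS rows;"
    )

# Query all Parquet files under a (placeholder) folder
query = openrowset_query(
    "https://myaccount.dfs.core.windows.net/mycontainer/data/*.parquet"
)
print(query)
```

Because the serverless pool reads files in place, no table has to be loaded before running a script like this.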
Note
Dedicated SQL pools in Azure Synapse are different from the dedicated SQL pool (formerly SQL DW). Not all features of the dedicated SQL pool in Azure Synapse workspaces apply to dedicated SQL pool (formerly SQL DW), and vice versa.
Apache Spark for Synapse
To use Spark analytics, create and use serverless Apache Spark pools in your Synapse workspace. When you start using a Spark pool, the workspace creates a Spark session to handle the resources associated with that session.
There are two ways within Synapse to use Spark:
- Spark notebooks for doing data science and engineering, using Scala, PySpark, C#, and Spark SQL.
- Spark job definitions for running batch Spark jobs using JAR files.
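A Spark job definition essentially captures what a batch submission needs: the main JAR, the entry-point class, and arguments. A sketch of such a payload (the field names follow the Livy-style batch request that Spark pools accept; every value below is a placeholder):

```python
# Sketch of a batch Spark job submission: main JAR, entry point,
# arguments, and resource settings. All values are placeholders.
spark_job = {
    "name": "nightly-aggregation",   # hypothetical job name
    "file": "abfss://jobs@myaccount.dfs.core.windows.net/etl.jar",
    "className": "com.example.etl.Main",   # JAR entry point
    "args": ["--date", "2024-01-01"],
    "driverMemory": "4g",
    "executorMemory": "4g",
    "numExecutors": 2,
}

print(spark_job["file"])
```

The workspace uses a definition like this to spin up a Spark session, run the job to completion, and release the pool's resources.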
SynapseML
SynapseML (previously known as MMLSpark) is an open-source library that simplifies the creation of massively scalable machine learning (ML) pipelines. It is an ecosystem of tools used to expand the Apache Spark framework in several new directions. SynapseML unifies several existing machine learning frameworks and new Microsoft algorithms into a single, scalable API that's usable across Python, R, Scala, .NET, and Java. To learn more, see the key features of SynapseML.
Pipelines
Pipelines are how Azure Synapse provides data integration, allowing you to move data between services and orchestrate activities.
- Pipelines are logical groupings of activities that together perform a task.
- Activities define the actions a pipeline performs on data, such as copying data, running a notebook, or running a SQL script.
- Data flows are a specific kind of activity that provides a no-code experience for data transformation, using Synapse Spark under the covers.
- Triggers execute a pipeline. A pipeline can be run manually or automatically (on a schedule, in a tumbling window, or in response to an event).
- Integration datasets are named views of data that simply point to or reference the data to be used in an activity as input or output. An integration dataset belongs to a linked service.
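Tying these terms together: a pipeline groups activities, an activity references integration datasets, and each dataset points back to a linked service. The sketch below mirrors the JSON shape of these resources; all names are illustrative, not a real, deployable configuration:

```python
# Sketch of how the pieces relate. All names are placeholders.
pipeline = {
    "name": "CopyThenTransform",
    "properties": {
        "activities": [
            {
                "name": "CopyRawData",
                "type": "Copy",              # copy-data activity
                "inputs": [{"referenceName": "RawCsvDataset"}],
                "outputs": [{"referenceName": "StagedParquetDataset"}],
            },
            {
                "name": "TransformWithNotebook",
                "type": "SynapseNotebook",   # runs a Spark notebook
                "dependsOn": [{
                    "activity": "CopyRawData",
                    "dependencyConditions": ["Succeeded"],
                }],
            },
        ]
    },
}

# An integration dataset: a named view over data, owned by a linked service.
dataset = {
    "name": "RawCsvDataset",
    "properties": {
        "linkedServiceName": {"referenceName": "MyAdlsGen2LinkedService"},
        "type": "DelimitedText",
    },
}

print(len(pipeline["properties"]["activities"]))
```

A trigger would then reference the pipeline by name and fire it manually, on a schedule, in a tumbling window, or on an event.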