Pipelines and activities - Azure Data Factory & Azure Synapse


APPLIES TO: Azure Data Factory and Azure Synapse Analytics

Tip

Try out Data Factory in Microsoft Fabric, an all-in-one analytics solution for enterprises. Microsoft Fabric covers everything from data movement to data science, real-time analytics, business intelligence, and reporting. Learn how to start a new trial for free!

Important

Support for Azure Machine Learning Studio (classic) will end on August 31, 2024. We recommend that you transition to Azure Machine Learning by that date.

As of December 1, 2021, you can't create new Machine Learning Studio (classic) resources (workspace and web service plan). Through August 31, 2024, you can continue to use the existing Machine Learning Studio (classic) experiments and web services. For more information, see:

  • Migrate to Azure Machine Learning from Machine Learning Studio (classic)
  • What is Azure Machine Learning?

Machine Learning Studio (classic) documentation is being retired and might not be updated in the future.

This article helps you understand pipelines and activities in Azure Data Factory and Azure Synapse Analytics and use them to construct end-to-end data-driven workflows for your data movement and data processing scenarios.

Overview

A Data Factory or Synapse Workspace can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. For example, a pipeline could contain a set of activities that ingest and clean log data, and then kick off a mapping data flow to analyze the log data. The pipeline allows you to manage the activities as a set instead of each one individually. You deploy and schedule the pipeline instead of the activities independently.

The activities in a pipeline define actions to perform on your data. For example, you can use a copy activity to copy data from SQL Server to Azure Blob Storage. Then, use a data flow activity or a Databricks Notebook activity to process and transform the data from Blob storage into an Azure Synapse Analytics pool on top of which business intelligence reporting solutions are built.

Azure Data Factory and Azure Synapse Analytics have three groupings of activities: data movement activities, data transformation activities, and control activities. An activity can take zero or more input datasets and produce one or more output datasets. The following diagram shows the relationship between pipeline, activity, and dataset:

(Diagram showing the relationship between pipeline, activity, and dataset.)

An input dataset represents the input for an activity in the pipeline, and an output dataset represents the output for the activity. Datasets identify data within different data stores, such as tables, files, folders, and documents. After you create a dataset, you can use it with activities in a pipeline. For example, a dataset can be an input or output dataset of a Copy activity or an HDInsight Hive activity. For more information about datasets, see the Datasets in Azure Data Factory article.
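
For illustration, a minimal dataset definition for a delimited text file in Blob storage might look like the following sketch. The dataset name, linked service name, and file location are placeholders, not part of this article's later samples:

{
    "name": "InputDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "adftutorial",
                "folderPath": "input",
                "fileName": "emp.txt"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}

An activity then refers to the dataset by name in its inputs or outputs list, as shown in the sample pipelines later in this article.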

Note

There's a default soft limit of a maximum of 80 activities per pipeline, which includes inner activities for containers.

Data movement activities

Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory supports the data stores listed in the table in this section. Data from any source can be written to any sink.

For more information, see the Copy Activity - Overview article.
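
As a preview of the JSON covered later in this article, a Copy activity nested in a pipeline's activities section might be sketched as follows; the activity and dataset names are placeholders, and the source and sink types depend on the connectors you use:

{
    "name": "CopyFromBlobToSql",
    "type": "Copy",
    "inputs": [
        { "name": "InputDataset" }
    ],
    "outputs": [
        { "name": "OutputDataset" }
    ],
    "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": { "type": "SqlSink" }
    }
}

A complete copy pipeline, including an activity policy, appears in the Sample copy pipeline section.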

Click a data store to learn how to copy data to and from that store.

For each data store, the connector article indicates whether the store is supported as a source, as a sink, and whether it's supported by the Azure integration runtime and the self-hosted integration runtime.

Azure
  • Azure Blob storage
  • Azure AI Search index
  • Azure Cosmos DB for NoSQL
  • Azure Cosmos DB for MongoDB
  • Azure Data Explorer
  • Azure Data Lake Storage Gen1
  • Azure Data Lake Storage Gen2
  • Azure Database for MariaDB
  • Azure Database for MySQL
  • Azure Database for PostgreSQL
  • Azure Databricks Delta Lake
  • Azure Files
  • Azure SQL Database
  • Azure SQL Managed Instance
  • Azure Synapse Analytics
  • Azure Table storage

Database
  • Amazon RDS for Oracle
  • Amazon RDS for SQL Server
  • Amazon Redshift
  • DB2
  • Drill
  • Google BigQuery
  • Greenplum
  • HBase
  • Hive
  • Apache Impala
  • Informix
  • MariaDB
  • Microsoft Access
  • MySQL
  • Netezza
  • Oracle
  • Phoenix
  • PostgreSQL
  • Presto
  • SAP Business Warehouse via Open Hub
  • SAP Business Warehouse via MDX
  • SAP HANA (sink supported only with the ODBC connector and the SAP HANA ODBC driver)
  • SAP table
  • Snowflake
  • Spark
  • SQL Server
  • Sybase
  • Teradata
  • Vertica

NoSQL
  • Cassandra
  • Couchbase (Preview)
  • MongoDB
  • MongoDB Atlas

File
  • Amazon S3
  • Amazon S3 Compatible Storage
  • File system
  • FTP
  • Google Cloud Storage
  • HDFS
  • Oracle Cloud Storage
  • SFTP

Generic protocol
  • Generic HTTP
  • Generic OData
  • Generic ODBC
  • Generic REST

Services and apps
  • Amazon Marketplace Web Service (Deprecated)
  • Concur (Preview)
  • Dataverse
  • Dynamics 365
  • Dynamics AX
  • Dynamics CRM
  • Google AdWords
  • HubSpot
  • Jira
  • Magento (Preview)
  • Marketo (Preview)
  • Microsoft 365
  • Oracle Eloqua (Preview)
  • Oracle Responsys (Preview)
  • Oracle Service Cloud (Preview)
  • PayPal (Preview)
  • QuickBooks (Preview)
  • Salesforce
  • Salesforce Service Cloud
  • Salesforce Marketing Cloud
  • SAP Cloud for Customer (C4C)
  • SAP ECC
  • ServiceNow
  • SharePoint Online List
  • Shopify (Preview)
  • Square (Preview)
  • Web table (HTML table)
  • Xero
  • Zoho (Preview)

Note

If a connector is marked Preview, you can try it out and give us feedback. If you want to take a dependency on preview connectors in your solution, contact Azure support.

Data transformation activities

Azure Data Factory and Azure Synapse Analytics support the following transformation activities that can be added either individually or chained with another activity.

For more information, see the data transformation activities article.

Data transformation activity | Compute environment
Data Flow | Apache Spark clusters managed by Azure Data Factory
Azure Function | Azure Functions
Hive | HDInsight [Hadoop]
Pig | HDInsight [Hadoop]
MapReduce | HDInsight [Hadoop]
Hadoop Streaming | HDInsight [Hadoop]
Spark | HDInsight [Hadoop]
ML Studio (classic) activities: Batch Execution and Update Resource | Azure VM
Stored Procedure | Azure SQL, Azure Synapse Analytics, or SQL Server
U-SQL | Azure Data Lake Analytics
Custom Activity | Azure Batch
Databricks Notebook | Azure Databricks
Databricks Jar Activity | Azure Databricks
Databricks Python Activity | Azure Databricks
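
As an illustration of how a transformation activity is expressed in a pipeline, a Databricks Notebook activity might be sketched as follows. The linked service name, notebook path, and parameter are assumed placeholders:

{
    "name": "TransformWithNotebook",
    "type": "DatabricksNotebook",
    "linkedServiceName": {
        "referenceName": "AzureDatabricksLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "notebookPath": "/Shared/process-logs",
        "baseParameters": {
            "inputPath": "raw/logs"
        }
    }
}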

Control flow activities

The following control flow activities are supported:

Control activity | Description
Append Variable | Adds a value to an existing array variable.
Execute Pipeline | Allows a Data Factory or Synapse pipeline to invoke another pipeline.
Filter | Applies a filter expression to an input array.
For Each | Defines a repeating control flow in your pipeline. This activity iterates over a collection and runs the specified activities in a loop, similar to a ForEach looping structure in programming languages.
Get Metadata | Retrieves the metadata of any data in a Data Factory or Synapse pipeline.
If Condition | Branches based on a condition that evaluates to true or false, providing the same functionality as an if statement in programming languages. It evaluates one set of activities when the condition is true and another set when the condition is false.
Lookup | Reads or looks up a record, table name, or value from any external source. This output can be referenced by succeeding activities.
Set Variable | Sets the value of an existing variable.
Until | Implements a Do-Until loop similar to the Do-Until looping structure in programming languages. It runs a set of activities in a loop until the condition associated with the activity evaluates to true. You can specify a timeout value for the Until activity.
Validation | Ensures a pipeline continues execution only if a reference dataset exists, meets specified criteria, or a timeout is reached.
Wait | Pauses the pipeline for the specified time before it continues with the execution of subsequent activities.
Web | Calls a custom REST endpoint from a pipeline. You can pass datasets and linked services to be consumed and accessed by the activity.
Webhook | Calls an endpoint and passes a callback URL. The pipeline run waits for the callback to be invoked before proceeding to the next activity.
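
For example, a ForEach activity that iterates over a pipeline parameter and runs an inner Copy activity for each item could be sketched as shown below; the parameter, activity names, and inner activity details are illustrative placeholders:

{
    "name": "CopyEachFile",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": false,
        "items": {
            "value": "@pipeline().parameters.fileNames",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "CopyOneFile",
                "type": "Copy",
                "typeProperties": { }
            }
        ]
    }
}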

Creating a pipeline with UI

  • Azure Data Factory
  • Synapse Analytics

To create a new pipeline, navigate to the Author tab in Data Factory Studio (represented by the pencil icon), then click the plus sign and choose Pipeline from the menu, and Pipeline again from the submenu.


Data Factory displays the pipeline editor, where you can find:

  1. All activities that can be used within the pipeline.
  2. The pipeline editor canvas, where activities appear when added to the pipeline.
  3. The pipeline configurations pane, including parameters, variables, general settings, and output.
  4. The pipeline properties pane, where the pipeline name, optional description, and annotations can be configured. This pane also shows any items related to the pipeline within the data factory.


Pipeline JSON

Here is how a pipeline is defined in JSON format:

{
    "name": "PipelineName",
    "properties": {
        "description": "pipeline description",
        "activities": [
        ],
        "parameters": {
        },
        "concurrency": <your max pipeline concurrency>,
        "annotations": [
        ]
    }
}

name (String, required)
Name of the pipeline. Specify a name that represents the action that the pipeline performs.
  • Maximum number of characters: 140
  • Must start with a letter, a number, or an underscore (_)
  • The following characters aren't allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\"

description (String, optional)
Text describing what the pipeline is used for.

activities (Array, required)
The activities section can have one or more activities defined within it. See the Activity JSON section for details about the activities JSON element.

parameters (List, optional)
The parameters section can have one or more parameters defined within the pipeline, making your pipeline flexible for reuse.

concurrency (Number, optional)
The maximum number of concurrent runs the pipeline can have. By default, there is no maximum. If the concurrency limit is reached, additional pipeline runs are queued until earlier ones complete.

annotations (Array, optional)
A list of tags associated with the pipeline.
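
For example, a pipeline that declares a parameter with a default value, a concurrency limit, and annotations might be sketched as follows; the names and values are illustrative placeholders:

{
    "name": "ParameterizedPipeline",
    "properties": {
        "description": "Copies files from a parameterized folder",
        "parameters": {
            "sourceFolder": {
                "type": "String",
                "defaultValue": "raw/logs"
            }
        },
        "concurrency": 2,
        "annotations": [ "ingestion" ],
        "activities": [ ]
    }
}

Within an activity, the parameter value can be referenced with an expression such as @pipeline().parameters.sourceFolder.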

Activity JSON

The activities section can have one or more activities defined within it. There are two main types of activities: Execution and Control Activities.

Execution activities

Execution activities include data movement and data transformation activities. They have the following top-level structure:

{
    "name": "Execution Activity Name",
    "description": "description",
    "type": "<ActivityType>",
    "typeProperties": {
    },
    "linkedServiceName": "MyLinkedService",
    "policy": {
    },
    "dependsOn": {
    }
}

The following table describes the properties in the activity JSON definition:

name (required)
Name of the activity. Specify a name that represents the action that the activity performs.
  • Maximum number of characters: 55
  • Must start with a letter, a number, or an underscore (_)
  • The following characters aren't allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\"

description (required)
Text describing what the activity is used for.

type (required)
Type of the activity. See the Data movement activities, Data transformation activities, and Control flow activities sections for different types of activities.

linkedServiceName (required for the HDInsight activity, the ML Studio (classic) Batch Scoring activity, and the Stored Procedure activity; optional for all others)
Name of the linked service used by the activity. An activity might require that you specify the linked service that links to the required compute environment.

typeProperties (optional)
Properties in the typeProperties section depend on each type of activity. To see type properties for an activity, click links to the activity in the previous section.

policy (optional)
Policies that affect the run-time behavior of the activity. This property includes a timeout and retry behavior. If it isn't specified, default values are used. For more information, see the Activity policy section.

dependsOn (optional)
This property is used to define activity dependencies, and how subsequent activities depend on previous activities. For more information, see Activity dependency.

Activity policy

Policies affect the run-time behavior of an activity, giving configuration options. Activity Policies are only available for execution activities.

Activity policy JSON definition

{
    "name": "MyPipelineName",
    "properties": {
        "activities": [
            {
                "name": "MyCopyBlobtoSqlActivity",
                "type": "Copy",
                "typeProperties": {
                    ...
                },
                "policy": {
                    "timeout": "00:10:00",
                    "retry": 1,
                    "retryIntervalInSeconds": 60,
                    "secureOutput": true
                }
            }
        ],
        "parameters": {
            ...
        }
    }
}

timeout (Timespan, optional)
Specifies the timeout for the activity to run. The default timeout is 12 hours; the minimum is 10 minutes.

retry (Integer, optional)
Maximum retry attempts. The default is 0.

retryIntervalInSeconds (Integer, optional)
The delay between retry attempts, in seconds. The default is 30 seconds.

secureOutput (Boolean, optional)
When set to true, output from the activity is treated as secure and isn't logged for monitoring. The default is false.

Control activity

Control activities have the following top-level structure:

{
    "name": "Control Activity Name",
    "description": "description",
    "type": "<ActivityType>",
    "typeProperties": {
    },
    "dependsOn": {
    }
}

name (required)
Name of the activity. Specify a name that represents the action that the activity performs.
  • Maximum number of characters: 55
  • Must start with a letter, a number, or an underscore (_)
  • The following characters aren't allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\"

description (required)
Text describing what the activity is used for.

type (required)
Type of the activity. See the Data movement activities, Data transformation activities, and Control flow activities sections for different types of activities.

typeProperties (optional)
Properties in the typeProperties section depend on each type of activity. To see type properties for an activity, click links to the activity in the previous section.

dependsOn (optional)
This property is used to define activity dependencies, and how subsequent activities depend on previous activities. For more information, see Activity dependency.

Activity dependency

Activity dependency defines how subsequent activities depend on previous activities, determining whether to continue executing the next task. An activity can depend on one or multiple previous activities with different dependency conditions.

The different dependency conditions are: Succeeded, Failed, Skipped, and Completed.

For example, if a pipeline has Activity A -> Activity B, the different scenarios that can happen are:

  • Activity B has a dependency condition on Activity A with succeeded: Activity B only runs if Activity A has a final status of succeeded.
  • Activity B has a dependency condition on Activity A with failed: Activity B only runs if Activity A has a final status of failed.
  • Activity B has a dependency condition on Activity A with completed: Activity B runs if Activity A has a final status of succeeded or failed.
  • Activity B has a dependency condition on Activity A with skipped: Activity B runs if Activity A has a final status of skipped. Skipped occurs in the scenario of Activity X -> Activity Y -> Activity Z, where each activity runs only if the previous activity succeeds. If Activity X fails, then Activity Y has a status of "Skipped" because it never executes. Similarly, Activity Z has a status of "Skipped" as well.

Example: Activity 2 depends on Activity 1 succeeding

{
    "name": "PipelineName",
    "properties": {
        "description": "pipeline description",
        "activities": [
            {
                "name": "MyFirstActivity",
                "type": "Copy",
                "typeProperties": {
                },
                "linkedServiceName": {
                }
            },
            {
                "name": "MySecondActivity",
                "type": "Copy",
                "typeProperties": {
                },
                "linkedServiceName": {
                },
                "dependsOn": [
                    {
                        "activity": "MyFirstActivity",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ]
            }
        ],
        "parameters": {
        }
    }
}

Sample copy pipeline

In the following sample pipeline, there is one activity of type Copy in the activities section. In this sample, the copy activity copies data from Azure Blob storage to a database in Azure SQL Database.

{
    "name": "CopyPipeline",
    "properties": {
        "description": "Copy data from a blob to Azure SQL table",
        "activities": [
            {
                "name": "CopyFromBlobToSQL",
                "type": "Copy",
                "inputs": [
                    {
                        "name": "InputDataset"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputDataset"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "BlobSource"
                    },
                    "sink": {
                        "type": "SqlSink",
                        "writeBatchSize": 10000,
                        "writeBatchTimeout": "60:00:00"
                    }
                },
                "policy": {
                    "retry": 2,
                    "timeout": "01:00:00"
                }
            }
        ]
    }
}

Note the following points:

  • In the activities section, there is only one activity whose type is set to Copy.
  • Input for the activity is set to InputDataset and output for the activity is set to OutputDataset. See the Datasets article for defining datasets in JSON.
  • In the typeProperties section, BlobSource is specified as the source type and SqlSink is specified as the sink type. In the data movement activities section, click the data store that you want to use as a source or a sink to learn more about moving data to and from that data store.

For a complete walkthrough of creating this pipeline, see Quickstart: create a Data Factory.

Sample transformation pipeline

In the following sample pipeline, there is one activity of type HDInsightHive in the activities section. In this sample, the HDInsight Hive activity transforms data from Azure Blob storage by running a Hive script file on an Azure HDInsight Hadoop cluster.

{
    "name": "TransformPipeline",
    "properties": {
        "description": "My first Azure Data Factory pipeline",
        "activities": [
            {
                "type": "HDInsightHive",
                "typeProperties": {
                    "scriptPath": "adfgetstarted/script/partitionweblogs.hql",
                    "scriptLinkedService": "AzureStorageLinkedService",
                    "defines": {
                        "inputtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
                        "partitionedtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
                    }
                },
                "inputs": [
                    {
                        "name": "AzureBlobInput"
                    }
                ],
                "outputs": [
                    {
                        "name": "AzureBlobOutput"
                    }
                ],
                "policy": {
                    "retry": 3
                },
                "name": "RunSampleHiveActivity",
                "linkedServiceName": "HDInsightOnDemandLinkedService"
            }
        ]
    }
}

Note the following points:

  • In the activities section, there is only one activity whose type is set to HDInsightHive.
  • The Hive script file, partitionweblogs.hql, is stored in the Azure Storage account (specified by the scriptLinkedService, called AzureStorageLinkedService), in the script folder in the adfgetstarted container.
  • The defines section is used to specify the runtime settings that are passed to the Hive script as Hive configuration values (for example, ${hiveconf:inputtable}, ${hiveconf:partitionedtable}).

The typeProperties section is different for each transformation activity. To learn about the type properties supported for a transformation activity, click the transformation activity in the Data transformation activities section.

For a complete walkthrough of creating this pipeline, see Tutorial: transform data using Spark.

Multiple activities in a pipeline

The previous two sample pipelines have only one activity in them. You can have more than one activity in a pipeline. If you have multiple activities in a pipeline and subsequent activities are not dependent on previous activities, the activities might run in parallel.

You can chain two activities by using activity dependency, which defines how subsequent activities depend on previous activities, determining whether to continue executing the next task. An activity can depend on one or more previous activities with different dependency conditions.
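
For instance, an activity that should run only after two upstream activities both succeed might declare its dependsOn section as in the following sketch; the activity names are illustrative placeholders:

"dependsOn": [
    {
        "activity": "CopySalesData",
        "dependencyConditions": [ "Succeeded" ]
    },
    {
        "activity": "CopyCustomerData",
        "dependencyConditions": [ "Succeeded" ]
    }
]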

Scheduling pipelines

Pipelines are scheduled by triggers. There are different types of triggers: the Scheduler trigger, which allows pipelines to be triggered on a wall-clock schedule, and the manual trigger, which triggers pipelines on demand. For more information about triggers, see the pipeline execution and triggers article.

To have a trigger kick off a pipeline run, you must include a pipeline reference to the particular pipeline in the trigger definition. Pipelines and triggers have a many-to-many relationship: multiple triggers can kick off a single pipeline, and the same trigger can kick off multiple pipelines. Once the trigger is defined, you must start it for it to begin triggering the pipeline. For more information about triggers, see the pipeline execution and triggers article.

For example, say you have a Scheduler trigger, "Trigger A," that you want to kick off your pipeline, "MyCopyPipeline." You define the trigger as shown in the following example:

Trigger A definition

{
    "name": "TriggerA",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            ...
        },
        "pipeline": {
            "pipelineReference": {
                "type": "PipelineReference",
                "referenceName": "MyCopyPipeline"
            },
            "parameters": {
                "copySourceName": "FileSource"
            }
        }
    }
}

Related content

See the following tutorials for step-by-step instructions for creating pipelines with activities:

  • Build a pipeline with a copy activity
  • Build a pipeline with a data transformation activity

To learn how to achieve CI/CD (continuous integration and delivery) using Azure Data Factory, see:

  • Continuous integration and delivery in Azure Data Factory