Develop and debug ETL pipelines with the multi-file editor in Lakeflow Declarative Pipelines

Important

This feature is in Beta.

This article describes using the multi-file editor in Lakeflow Declarative Pipelines to develop and debug ETL (extract, transform, and load) pipelines. The multi-file editor shows a pipeline as a set of files in the pipeline assets browser. You can edit the files and control the configuration of the pipeline and which files to include in one ___location.

For the default development experience using a single notebook in Lakeflow Declarative Pipelines, see Develop and debug ETL pipelines with a notebook in Lakeflow Declarative Pipelines.

Overview of the multi-file editor

The ETL pipeline multi-file editor has the following features:

  1. Pipeline asset browser: Create, delete, rename, and organize pipeline assets.
  2. Multi-file code editor with tabs: Work across multiple code files associated with a pipeline.
  3. Pipeline-specific toolbar: Enables pipeline configuration and has pipeline-level run actions.
  4. Interactive directed acyclic graph (DAG): Get an overview of your tables, open the data previews bottom bar, and perform other table-related actions.
  5. Data preview: Inspect the data of your streaming tables and materialized views.
  6. Table-level execution insights: Get execution insights for all tables or a single table in a pipeline. The insights refer to the latest pipeline run.
  7. Issues panel: Summarizes errors across all files in the pipeline and lets you navigate to where each error occurred inside a specific file. It complements the code-affixed error indicators.
  8. Selective execution: The code editor has features for step-by-step development, such as the ability to refresh only the tables in the current file using the Run file action, or to refresh a single table.
  9. Default pipeline folder structure: New pipelines include a predefined folder structure and sample code that you can use as a starting point for your pipeline.
  10. Simplified pipeline creation: Provide a name, catalog, and schema where tables should be created by default, and a pipeline is created using default settings. You can later adjust Settings from the pipeline editor toolbar.

Lakeflow Declarative Pipelines multi-file editor

Enable the multi-file editor

Note

You must first enable Pipelines multi-file developer experience for your workspace. See Manage Azure Databricks Previews for more information.

If your workspace uses the Compliance Security Profile, reach out to your Azure Databricks contact to try the feature.

You can enable the ETL pipeline multi-file editor in multiple ways:

  • When you create a new ETL pipeline, enable the multi-file editor in Lakeflow Declarative Pipelines with the ETL Pipeline editor toggle.

    Lakeflow Declarative Pipelines multi-file editor toggle on

    The advanced settings page for the pipeline is used the first time you enable the multi-file editor. The simplified pipeline creation window is used the next time you create a new pipeline.

  • For an existing pipeline, open a notebook used in a pipeline and enable the ETL Pipeline editor toggle in the header. You can also go to the pipeline monitoring page and click Settings to enable the multi-file editor.

After you have enabled the ETL Pipeline editor toggle, all ETL pipelines will use the multi-file editor by default. You can turn the ETL pipeline multi-file editor on and off from the editor.

Alternatively, you can enable the multi-file editor from user settings:

  1. Click your user badge in the upper-right area of your workspace and then click Settings and Developer.
  2. Enable Tabs for notebooks and files.
  3. Enable ETL Pipeline multi-file editor.

Create a new ETL pipeline

To create a new ETL pipeline using the multi-file editor, follow these steps:

  1. At the top of the sidebar, click Plus icon. New and then select Pipeline icon. ETL pipeline.

  2. At the top, you can give your pipeline a unique name.

  3. Just under the name, you can see the default catalog and schema that have been chosen for you. Change these to give your pipeline different defaults.

    The default catalog and the default schema are where datasets are read from or written to when you do not qualify them with a catalog or schema in your code. See Database objects in Azure Databricks for more information. For an example of how these defaults apply in code, see the sketch after these steps.

  4. Select how you want to create the pipeline by choosing one of the following options:

    • Start with sample code in SQL to create a new pipeline and folder structure, including sample code in SQL.
    • Start with sample code in Python to create a new pipeline and folder structure, including sample code in Python.
    • Start with a single transformation to create a new pipeline and folder structure, with a new blank code file.
    • Add existing assets to create a pipeline that you can associate with existing code files in your workspace.

    You can have both SQL and Python source code files in your ETL pipeline. When you create a new pipeline and choose a language for the sample code, that choice applies only to the sample code included in your pipeline by default.

  5. When you make your selection, you are redirected to the newly created pipeline.
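As a minimal sketch of the default catalog and schema behavior described in step 3, the following Python transformation uses an unqualified table name that resolves against those defaults. The dlt module is assumed to be available in pipeline source files, and the table name and the samples.tpch.orders dataset are illustrative:

import dlt

# If the defaults are, for example, catalog "main" and schema "etl_demo", this
# unqualified table name resolves to main.etl_demo.orders_bronze.
@dlt.table(comment="Raw orders read from a sample dataset")
def orders_bronze():
    # A fully qualified name such as samples.tpch.orders bypasses the defaults.
    return spark.read.table("samples.tpch.orders")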

The ETL pipeline is created with default settings.

You can adjust these settings from the pipeline toolbar or select Create advanced pipeline to provide your preferred settings. See Configure Lakeflow Declarative Pipelines for more information.

Alternatively, you can create an ETL pipeline from the workspace browser:

  1. Click Workspace in the left side panel.
  2. Select any folder, including Git folders.
  3. Click Create in the upper-right corner, and click ETL pipeline.

You can also create an ETL pipeline from the jobs and pipelines page:

  1. In your workspace, click Workflows icon. Jobs & Pipelines in the sidebar.
  2. Under New, click ETL Pipeline.

Open an existing ETL pipeline

To open an existing ETL pipeline in the multi-file editor, follow these steps:

  1. Click Workspace in the side panel.
  2. Navigate to a folder with source code files for your pipeline.
  3. Click the source code file to open the pipeline in the editor.

Open an existing ETL pipeline

You can also open an existing ETL pipeline in the following ways:

  • On the Recents page on the left sidebar, open a pipeline or a file configured as the source code for a pipeline.
  • In the pipeline monitoring page, click Edit pipeline.
  • On the Job Runs page in the left sidebar, click the Jobs & pipelines tab, click Kebab menu icon., and then click Edit pipeline.
  • When you create a new job and add a pipeline task, you can click open in new tab New window icon. when you choose a pipeline under Pipeline.
  • When editing a pipeline, you can click the name of the pipeline at the top of the asset browser to choose from a list of recently viewed pipelines.
  • If you open a source code file configured as source code for a different pipeline from the asset browser, a banner is shown at the top of the editor for that file, prompting you to open that associated pipeline. To open a source code file that is not part of the pipeline, choose All files at the top of the asset browser.

Pipeline assets browser

The multi-file pipeline editor has a special mode for the workspace browser sidebar, called the pipeline assets browser, which focuses the panel on the pipeline by default.

Click the pipeline name at the top of the browser to switch between recently viewed pipelines.

The asset browser has two tabs:

  • Pipeline: This is where you can find all files associated with the pipeline. You can create, delete, rename, and organize them into folders.
  • All files: All other workspace assets are available here.

Pipeline asset browser

You can have the following types of files in your pipeline:

  • Source code files: These files are part of the pipeline's source code definition, which can be seen in Settings. Databricks recommends always storing source code files inside the pipeline root folder; otherwise, they will be shown in an external file section at the bottom of the browser and have a less rich feature set.
  • Non-source code files: These files are stored inside the pipeline root folder but are not part of the pipeline source code definition.

Important

You must use the pipeline assets browser under the Pipeline tab to manage files and folders for your pipeline. This will update the pipeline settings correctly. Moving or renaming files and folders from your workspace browser or the All files tab will break the pipeline configuration, and you must then resolve this manually in Settings.

Root folder

The pipeline assets browser is anchored in a pipeline root folder. When you create a new pipeline, the pipeline root folder is created in your user home folder and named the same as the pipeline name.

You can change the root folder in the pipeline assets browser. This is useful if you created a pipeline in a folder and you later want to move everything to a different folder. For example, you created the pipeline in a normal folder, and you want to move the source code to a Git folder for version control.

  1. Click the Kebab menu icon. overflow menu for the root folder.
  2. Click Configure new root folder.
  3. Under Pipeline root folder, click Folder Icon and choose another folder as the pipeline root folder.
  4. Click Save.

Change pipeline root folder

In the Kebab menu icon. for the root folder, you can also click Rename root folder to rename the folder. From the same menu, you can click Move root folder to move the root folder, for example, into a Git folder.

You can also change the pipeline root folder in settings:

  1. Click Settings.
  2. Under Code assets, click Configure paths.
  3. Click Folder Icon to change the folder under Pipeline root folder.
  4. Click Save.

Note

If you change the pipeline root folder, the file list displayed by the pipeline assets browser will be affected, as the files in the previous root folder will now be shown as external files.

Existing pipeline with no root folder

An existing pipeline created in the default development experience using a single notebook in Lakeflow Declarative Pipelines won't have a root folder configured. Follow these steps to configure the root folder for your existing pipeline:

  1. In the pipeline assets browser, click Configure.
  2. Click Folder Icon to select the root folder under Pipeline root folder.
  3. Click Save.

No pipeline root folder

Default folder structure

When you create a new pipeline, a default folder structure is created. This is the recommended structure for organizing your pipeline source and non-source code files, as described below.

A small number of sample code files are created in this folder structure.

  • <pipeline_root_folder>: Root folder that contains all folders and files for your pipeline.
  • explorations: Non-source code files, such as notebooks, queries, and code files used for exploratory data analysis.
  • transformations: Source code files, such as Python or SQL code files with table definitions.
  • utilities: Non-source code files with Python modules that can be imported from other code files. If you choose SQL as the language for your sample code, this folder is not created.

You can rename the folders or change the structure to fit your workflow. To add a new source code folder, follow these steps:

  1. Click Add in the pipeline assets browser.
  2. Click Create pipeline source code folder.
  3. Enter a folder name and click Create.

Source code files

Source code files are part of the pipeline's source code definition. When you run the pipeline, these files are evaluated. Files and folders that are part of the source code definition have a special icon with a mini Pipeline icon superimposed.

To add a new source code file, follow these steps:

  1. Click Add in the pipeline assets browser.
  2. Click Transformation.
  3. Enter a Name for the file and select Python or SQL as the Language.
  4. Click Create.

You can also click Kebab menu icon. for any folder in the pipeline assets browser to add a source code file.

A transformations folder for source code is created by default when you create a new pipeline. This folder is the recommended ___location for pipeline source code, such as Python or SQL code files with pipeline table definitions.
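For example, a source code file in the transformations folder might look like the following minimal sketch. It assumes the dlt Python module is available in pipeline source files; the table names and the samples.tpch.orders dataset are illustrative:

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders ingested from a sample dataset")
def orders_raw():
    return spark.read.table("samples.tpch.orders")

@dlt.table(comment="Order counts aggregated by priority")
def orders_by_priority():
    # dlt.read resolves the dependency on orders_raw defined above, which is
    # how the dependency graph between tables is built.
    return (
        dlt.read("orders_raw")
        .groupBy("o_orderpriority")
        .agg(F.count("*").alias("order_count"))
    )

Running Run file on this file refreshes both tables; Run table refreshes only the selected one.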

Non-source code files

Non-source code files are stored inside the pipeline root folder but are not part of the pipeline source code definition. These files are not evaluated when you run the pipeline. Non-source code files cannot be external files.

You can use this for files related to your work on the pipeline that you'd like to store together with the source code. For example:

  • Notebooks that you use for ad hoc explorations executed on non-Lakeflow Declarative Pipelines compute outside the lifecycle of a pipeline.
  • Python modules that are not evaluated with your source code unless you explicitly import them inside your source code files.

To add a new non-source code file, follow these steps:

  1. Click Add in the pipeline assets browser.
  2. Click Exploration or Utility.
  3. Enter a Name for the file.
  4. Click Create.

You can also click Kebab menu icon. for the pipeline root folder or a non-source code file to add non-source code files to the folder.

When you create a new pipeline, the following folders for non-source code files are created by default:

  • explorations: This folder is the recommended ___location for notebooks, queries, dashboards, and other files that you run on non-Lakeflow Declarative Pipelines compute, as you would normally do outside of a pipeline's execution lifecycle. Important: Do not add these files as source code for the pipeline. The pipeline could give an error because these files will likely contain arbitrary non-Lakeflow Declarative Pipelines code.
  • utilities: This folder is the recommended ___location for Python modules that can be imported from other code files via direct imports expressed as from <filename> import, as long as their parent folder is hierarchically under the root folder.

You can also import Python modules located outside the root folder, but in that case, you must append the folder path to sys.path in your Python code:

import sys, os
sys.path.append(os.path.abspath('<alternate_path_for_utilities>/utilities'))
from utils import *
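When the utilities folder sits under the pipeline root folder, no sys.path change is needed. As a sketch, a hypothetical utilities/utils.py helper might look like this:

# utilities/utils.py (hypothetical helper module)
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def add_ingest_timestamp(df: DataFrame) -> DataFrame:
    # Append an ingest_time column recording when the row was processed.
    return df.withColumn("ingest_time", F.current_timestamp())

A source code file in the transformations folder can then use from utils import add_ingest_timestamp directly, without modifying sys.path.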

External files

The pipeline browser's External files section shows source code files outside the root folder.

To move an external file to the root folder, such as the transformations folder, follow these steps:

  1. Click Kebab menu icon. for the file in the assets browser and click Move.
  2. Choose the folder to which you want to move the file and click Move.

Files associated with multiple pipelines

A badge is shown in the file's header if a file is associated with more than one pipeline. The badge shows the number of associated pipelines and lets you switch to the other ones.

All files section

In addition to the Pipeline section, there is an All files section, where you can open any file in your workspace. Here you can:

  • Open files outside the root folder in a tab without leaving the multi-file editor.
  • Navigate to another pipeline's source code files and open them. This opens the file in the editor and gives you a banner with the option to switch focus in the editor to this second pipeline.
  • Move files to the pipeline's root folder.
  • Include files outside the root folder in the pipeline source code definition.

Run pipeline code

You have three options to run your pipeline code:

  1. Run all source code files in the pipeline: Click Run pipeline or Run pipeline with full table refresh to run all table definitions in all files defined as pipeline source code.

    Run pipeline

    You can also click Dry run to validate the pipeline without updating any data.

  2. Run the code in a single file: Click Run file or Run file with full table refresh to run all table definitions in the current file.

    Run file

  3. Run the code for a single table: Click Run table DLT Run Table Icon for a table definition in a source code file, and click Refresh table or Full refresh table.

    Run table

Directed acyclic graph (DAG)

After you have run or validated all source code files in the pipeline, you will see a directed acyclic graph (DAG) that shows the dependencies between your tables. Each node has different states along the pipeline lifecycle, such as validated, running, or error.

Directed acyclic graph (DAG)

You can toggle the graph on and off by clicking the graph icon in the right side panel. You can also maximize the graph. There are additional options at the bottom right, including zoom options, and Sliders icon. More options to display the graph in a vertical or horizontal layout.

Hovering over a node displays a toolbar with options, including refreshing the query. Right-clicking a node gives you the same options in a context menu.

Clicking a node shows the data preview and table definition. When you edit a file, the tables defined in that file are highlighted in the graph.

Data previews

The data preview section shows sample data for a selected table.

You will see a preview of the table's data when you click a node in the directed acyclic graph (DAG).

If no table is selected, go to the Tables section and click View data preview DLT View Data Preview Icon. If you have selected a table, click All tables to return to the list of all tables.

Execution insights

You can see table execution insights for the latest pipeline update in the panels at the bottom of the editor.

  • Tables: Lists all tables with their statuses and metrics. If you select one table, you see the metrics and performance for that table and a tab for the data preview.
  • Performance: Query history and profiles for all flows in this pipeline. You can access execution metrics and detailed query plans during and after execution. See Access query history for Lakeflow Declarative Pipelines for more information.
  • Issues panel: Click the panel to view a simplified view of errors and warnings for the pipeline. You can click an entry to see more details and then navigate to the place in the code where the error occurred. If the error is in a file other than the one currently displayed, you are redirected to the file where the error is. Click View details to see the corresponding event log entry for complete details. Click View logs to see the complete event log. Code-affixed error indicators are shown for errors associated with a specific part of the code. To get more details, click the error icon or hover over the red line. A pop-up with more information appears. You can then click Quick fix to reveal a set of actions to troubleshoot the error.
  • Event log: All events triggered during the last pipeline run. Click View logs or any entry in the issues tray.

Pipeline settings

To access the pipeline settings panel, click Settings in the toolbar or click Gear icon. in the mini card on the pipeline assets browser.

Pipeline settings

Event Log

The event log for your pipeline is not available until you set it up in Settings.

  1. Open Settings.
  2. Click the Chevron right icon. arrow next to Advanced settings.
  3. Click Edit advanced settings.
  4. Select Publish event log to metastore.
  5. Provide a name, catalog, and schema for the event log.
  6. Click Save.

Now your pipeline events will be published to the table you specified.
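Once published, the event log can be queried like any other table, for example from a notebook in the explorations folder. This is a minimal sketch; the three-level table name is illustrative, so substitute the catalog, schema, and name you provided in Settings:

# Query the published event log for flow progress events.
events = spark.sql("""
    SELECT timestamp, event_type, message
    FROM main.default.pipeline_event_log
    WHERE event_type = 'flow_progress'
    ORDER BY timestamp DESC
""")
events.show(truncate=False)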

Environment

You can create an environment for your source code by adding dependencies in Settings.

  1. Open Settings.
  2. Under Environment, click Edit environment.
  3. Select Plus icon. Add dependency to add a dependency, as if you were adding it to a requirements.txt file. For more information about dependencies, see Add dependencies to the notebook.

Databricks recommends that you pin the version with ==. See PyPI package.

The environment applies to all source code files in your pipeline.
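Once a dependency is added to the environment, any source code file in the pipeline can import it directly. The sketch below assumes a hypothetical pinned dependency holidays==0.57 was added under Edit environment; the package, version, and table name are illustrative:

# Assumes "holidays==0.57" (illustrative) was added as a pipeline dependency.
import dlt
import holidays

@dlt.table(comment="US public holidays for 2024, built from the holidays package")
def us_holidays_2024():
    us = holidays.UnitedStates(years=2024)
    rows = [(str(day), name) for day, name in us.items()]
    return spark.createDataFrame(rows, "holiday_date STRING, holiday_name STRING")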

Notifications

You can add notifications using the Legacy pipeline settings.

  1. Open Settings.
  2. At the bottom of the Pipeline settings panel, click Legacy pipeline settings.
  3. Under Notifications, click Add notification.
  4. Add one or more email addresses and select the events for which you want notifications sent.
  5. Click Add notification.

Limitations and known issues

See the following limitations and known issues for the ETL pipeline multi-file editor in Lakeflow Declarative Pipelines:

  1. The workspace browser sidebar will not focus on the pipeline if you start by opening a file in the explorations folder or a notebook, as these files or notebooks are not part of the pipeline source code definition.

    1. To enter the pipeline focus mode in the workspace browser, open a file associated with the pipeline.
  2. Data previews are not supported for regular views.

  3. Multi-table refreshes can only be performed from the pipeline monitoring page. Use the mini-card in the pipeline browser to navigate to that page.

  4. Run table DLT Run Table Icon can appear at an incorrect position due to line wrapping in your code.

  5. %pip install is not supported from files (the default asset type with the new editor). You can add dependencies in settings. See Environment.

    Alternatively, you can continue to use %pip install from a notebook associated with a pipeline, in its source code definition.

FAQ

  1. Why use files and not notebooks for source code?

    Notebooks' cell-based execution model is not compatible with Lakeflow Declarative Pipelines, so we had to turn off features or change their behavior, which led to confusion.

    In the ETL pipeline multi-file editor, the file editor is used as a foundation for a first-class editor for Lakeflow Declarative Pipelines. Features are targeted explicitly to Lakeflow Declarative Pipelines, like Run table DLT Run Table Icon, rather than overloading familiar features with different behavior.

  2. Can I still use notebooks as source code?

    Yes, you can. However, some features, such as Run table DLT Run Table Icon or Run file, will not be present.

    If you have an existing pipeline using notebooks, it will still work in the new editor. However, Databricks recommends switching to files for new pipelines.

  3. How can I add existing code to a newly created Pipeline?

    You can add existing source code files to a new pipeline. To add a folder with existing files, follow these steps:

    1. Click Settings.
    2. Under Source code, click Configure paths.
    3. Click Add path and choose the folder for the existing files.
    4. Click Save.

    You can also add individual files:

    1. Click All files in the pipeline assets browser.
    2. Navigate to your file, click Kebab menu icon., and click Include in pipeline.

    Consider moving these files to the pipeline root folder. If left outside the pipeline root folder, they will be shown in the External files section.

  4. Can I manage the Pipeline source code in Git?

    You can manage your pipeline source code in Git by choosing a Git folder when you initially create the pipeline. If you created the pipeline without version control, you can move your source code to a Git folder afterward. Databricks recommends using the editor action to move the entire root folder to a Git folder. This will update all settings accordingly. See Root folder.

    To move the root folder to a Git folder in the pipeline asset browser:

    1. Click Kebab menu icon. for the root folder.
    2. Click Move root folder.
    3. Choose a new ___location for your root folder and click Move.

    See the Root folder section for more information.

    After the move, you will see the familiar Git icon next to your root folder's name.

    Important

    To move the pipeline root folder, use the pipeline assets browser and the above steps. Moving it any other way will break the pipeline configurations, and you must manually configure the correct folder path in Settings.

  5. Can I have multiple Pipelines in the same root folder?

    You can, but Databricks recommends having only a single pipeline per root folder.

  6. When should I run a dry run?

    Run a dry run when you want to check your code for errors without updating any tables. Click Dry run to do this.

  7. When should I use temporary views, and when should I use materialized views in my code?

    Use temporary views when you do not want to materialize the data, for example, as an intermediate step in a sequence of steps that prepares the data before it is materialized using a streaming table or materialized view registered in the catalog. A sketch of this pattern follows below.
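    A minimal Python sketch of this pattern, assuming the dlt module and an illustrative samples.tpch.orders source dataset (the view and table names are hypothetical):

    import dlt
    from pyspark.sql import functions as F

    # Intermediate step: a temporary view is not materialized and is only
    # visible within the pipeline.
    @dlt.view(comment="Orders filtered to open status, used as a preparation step")
    def open_orders():
        return spark.read.table("samples.tpch.orders").filter(F.col("o_orderstatus") == "O")

    # Final step: a materialized table registered in the catalog.
    @dlt.table(comment="Open order counts by customer")
    def open_orders_by_customer():
        return (
            dlt.read("open_orders")
            .groupBy("o_custkey")
            .agg(F.count("*").alias("open_order_count"))
        )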