Example pipelines & datasets for the designer - Azure Machine Learning (2023)


Use the built-in examples in Azure Machine Learning designer to quickly get started building your own machine learning pipelines. The Azure Machine Learning designer GitHub repository contains detailed documentation to help you understand some common machine learning scenarios.


Prerequisites

  • An Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
  • An Azure Machine Learning workspace

Important

If you do not see graphical elements mentioned in this document, such as buttons in studio or designer, you may not have the right level of permissions to the workspace. Please contact your Azure subscription administrator to verify that you have been granted the correct level of access. For more information, see Manage users and roles.

Use sample pipelines

The designer saves a copy of each sample pipeline to your studio workspace. You can edit a pipeline to adapt it to your needs and save it as your own. Use the samples as starting points to jumpstart your projects.

Here's how to use a designer sample:

  1. Sign in to ml.azure.com, and select the workspace you want to work with.

  2. Select Designer.

  3. Select a sample pipeline under the New pipeline section.

    Select Show more samples for a complete list of samples.

  4. To run a pipeline, you first have to set a default compute target to run the pipeline on.

    1. In the Settings pane to the right of the canvas, select Select compute target.

    2. In the dialog that appears, select an existing compute target or create a new one. Select Save.


    3. Select Submit at the top of the canvas to submit a pipeline job.

    Depending on the sample pipeline and compute settings, jobs may take some time to complete. The default compute settings have a minimum node size of 0, which means that the designer must wait for compute resources to be allocated after the cluster has been idle. Repeated pipeline jobs take less time because the compute resources are already allocated. The designer also uses cached results for each component to further improve efficiency.

  5. After the pipeline finishes running, you can review the pipeline and view the output for each component to learn more. Use the following steps to view component outputs:

    1. Right-click the component in the canvas whose output you'd like to see.
    2. Select Visualize.

    Use the samples as starting points for some of the most common machine learning scenarios.

Regression

Explore these built-in regression samples.

Sample title | Description
Regression - Automobile Price Prediction (Basic) | Predict car prices using linear regression.
Regression - Automobile Price Prediction (Advanced) | Predict car prices using decision forest and boosted decision tree regressors. Compare models to find the best algorithm.
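The designer builds these pipelines visually, but the core idea of the basic sample can be sketched in a few lines of code. The following is a minimal, local sketch of linear-regression price prediction; scikit-learn and the synthetic "automobile" data are assumptions for illustration and are not part of the designer samples:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
# Hypothetical automobile features: engine size (liters) and horsepower.
engine_size = rng.uniform(1.0, 5.0, n)
horsepower = rng.uniform(60.0, 300.0, n)
# Assumed linear price relationship plus noise (illustrative only).
price = 5000 + 3000 * engine_size + 40 * horsepower + rng.normal(0, 500, n)

X = np.column_stack([engine_size, horsepower])
model = LinearRegression().fit(X, price)
print(round(model.score(X, price), 3))  # R^2 on the training data
```

The fitted coefficients recover the assumed price relationship; the advanced sample replaces this single learner with decision forest and boosted decision tree regressors and compares their metrics.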

Classification

Explore these built-in classification samples. You can learn more about the samples by opening the samples and viewing the component comments in the designer.

Sample title | Description
Binary Classification with Feature Selection - Income Prediction | Predict income as high or low, using a two-class boosted decision tree. Use Pearson correlation to select features.
Binary Classification with custom Python script - Credit Risk Prediction | Classify credit applications as high or low risk. Use the Execute Python Script component to weight your data.
Binary Classification - Customer Relationship Prediction | Predict customer churn using two-class boosted decision trees. Use SMOTE to sample biased data.
Text Classification - Wikipedia SP 500 Dataset | Classify company types from Wikipedia articles with multiclass logistic regression.
Multiclass Classification - Letter Recognition | Create an ensemble of binary classifiers to classify written letters.
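The income-prediction sample pairs Pearson-correlation feature selection with a boosted decision tree. A minimal local sketch of that combination follows; the synthetic data is made up, and scikit-learn's GradientBoostingClassifier stands in here for the designer's Two-Class Boosted Decision Tree component:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 500
# Two informative features (columns 0 and 1) and two pure-noise features.
X = rng.normal(size=(n, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Pearson-correlation feature selection: keep the two features whose
# absolute correlation with the label is highest.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
top2 = np.argsort(corr)[-2:]

X_train, X_test, y_train, y_test = train_test_split(
    X[:, top2], y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print(sorted(top2.tolist()), round(clf.score(X_test, y_test), 2))
```

The correlation filter picks out the two informative columns, and the boosted trees then learn the decision boundary from those columns alone.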

Computer vision

Explore these built-in computer vision samples. You can learn more about the samples by opening the samples and viewing the component comments in the designer.

Sample title | Description
Image Classification using DenseNet | Use computer vision components to build an image classification model based on PyTorch DenseNet.

Recommender

Explore these built-in recommender samples. You can learn more about the samples by opening the samples and viewing the component comments in the designer.

Sample title | Description
Wide & Deep based Recommendation - Restaurant Rating Prediction | Build a restaurant recommender engine from restaurant/user features and ratings.
Recommendation - Movie Rating Tweets | Build a movie recommender engine from movie/user features and ratings.
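The restaurant sample uses a Wide & Deep model; as a much simpler stand-in, the rating-prediction task itself can be sketched with a bias baseline (global mean plus user and item offsets). The tiny rating matrix below is invented for illustration and only loosely mirrors the 0-2 scale of the Restaurant Ratings dataset:

```python
import numpy as np

# Toy user x restaurant rating matrix (0-2 scale, np.nan = unrated).
R = np.array([
    [2.0, 1.0, np.nan, 0.0],
    [1.0, np.nan, 2.0, 1.0],
    [np.nan, 0.0, 1.0, 2.0],
])

mask = ~np.isnan(R)
global_mean = R[mask].mean()

# User and item biases: average deviation from the global mean.
user_bias = np.array([np.nanmean(row) - global_mean for row in R])
item_bias = np.array([np.nanmean(col) - global_mean for col in R.T])

def predict(u, i):
    """Predict an unobserved rating, clipped to the valid 0-2 range."""
    return float(np.clip(global_mean + user_bias[u] + item_bias[i], 0.0, 2.0))

print(round(predict(0, 2), 2))
```

A Wide & Deep model learns much richer interactions from restaurant/user features, but this baseline is a common sanity check that any real recommender should beat.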

Utility

Explore these built-in samples that demonstrate machine learning utilities and features. You can learn more about the samples by opening the samples and viewing the component comments in the designer.

Sample title | Description
Binary Classification using Vowpal Wabbit Model - Adult Income Prediction | Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning. This sample shows how to use a Vowpal Wabbit model to build a binary classification model.
Use custom R script - Flight Delay Prediction | Use a custom R script to predict if a scheduled passenger flight will be delayed by more than 15 minutes.
Cross Validation for Binary Classification - Adult Income Prediction | Use cross validation to build a binary classifier for adult income.
Permutation Feature Importance | Use permutation feature importance to compute importance scores for the test dataset.
Tune Parameters for Binary Classification - Adult Income Prediction | Use Tune Model Hyperparameters to find optimal hyperparameters to build a binary classifier.
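Permutation feature importance, used by the sample above, shuffles one feature at a time and measures how much the model's score degrades. A minimal local sketch with scikit-learn (the two-feature synthetic dataset is an assumption for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
n = 400
signal = rng.normal(size=n)   # drives the label
noise = rng.normal(size=n)    # unrelated to the label
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

clf = RandomForestClassifier(random_state=0).fit(X, y)
# Shuffle each column in turn and measure how much accuracy drops.
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```

The informative column gets a large importance score because permuting it destroys the model's accuracy, while the noise column scores near zero.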

Datasets

When you create a new pipeline in Azure Machine Learning designer, a number of sample datasets are included by default. These sample datasets are used by the sample pipelines in the designer homepage.

The sample datasets are available under the Datasets - Samples category in the component palette to the left of the canvas. You can use any of these datasets in your own pipeline by dragging it onto the canvas.

Adult Census Income Binary Classification dataset
A subset of the 1994 Census database, using working adults over the age of 16 with an adjusted income index of > 100.
Usage: Classify people using demographics to predict whether a person earns over 50K a year.
Related Research: Kohavi, R., Becker, B. (1996). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

Automobile price data (Raw)
Information about automobiles by make and model, including the price, features such as the number of cylinders and MPG, as well as an insurance risk score.
The risk score is initially associated with auto price. It is then adjusted for actual risk in a process known to actuaries as symboling. A value of +3 indicates that the auto is risky, and a value of -3 that it is probably safe.
Usage: Predict the risk score by features, using regression or multivariate classification.
Related Research: Schlimmer, J.C. (1987). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

CRM Appetency Labels Shared
Labels from the KDD Cup 2009 customer relationship prediction challenge (orange_small_train_appetency.labels).

CRM Churn Labels Shared
Labels from the KDD Cup 2009 customer relationship prediction challenge (orange_small_train_churn.labels).

CRM Dataset Shared
This data comes from the KDD Cup 2009 customer relationship prediction challenge (orange_small_train.data.zip).
The dataset contains 50K customers from the French Telecom company Orange. Each customer has 230 anonymized features, 190 of which are numeric and 40 are categorical. The features are very sparse.

CRM Upselling Labels Shared
Labels from the KDD Cup 2009 customer relationship prediction challenge (orange_large_train_upselling.labels).

Flight Delays Data
Passenger flight on-time performance data taken from the TranStats data collection of the U.S. Department of Transportation (On-Time).
The dataset covers the time period April-October 2013. Before uploading to the designer, the dataset was processed as follows:
- The dataset was filtered to cover only the 70 busiest airports in the continental US
- Canceled flights were labeled as delayed by more than 15 minutes
- Diverted flights were filtered out
- The following columns were selected: Year, Month, DayofMonth, DayOfWeek, Carrier, OriginAirportID, DestAirportID, CRSDepTime, DepDelay, DepDel15, CRSArrTime, ArrDelay, ArrDel15, Canceled

German Credit Card UCI dataset
The UCI Statlog (German Credit Card) dataset (Statlog+German+Credit+Data), using the german.data file.
The dataset classifies people, described by a set of attributes, as low or high credit risks. Each example represents a person. There are 20 features, both numerical and categorical, and a binary label (the credit risk value). High credit risk entries have label = 2, low credit risk entries have label = 1. The cost of misclassifying a low-risk example as high is 1, whereas the cost of misclassifying a high-risk example as low is 5.

IMDB Movie Titles
The dataset contains information about movies that were rated in Twitter tweets: IMDB movie ID, movie name, genre, and production year. There are 17K movies in the dataset. The dataset was introduced in the paper "S. Dooms, T. De Pessemier and L. Martens. MovieTweetings: a Movie Rating Dataset Collected From Twitter. Workshop on Crowdsourcing and Human Computation for Recommender Systems, CrowdRec at RecSys 2013."

Movie Ratings
The dataset is an extended version of the Movie Tweetings dataset. The dataset has 170K ratings for movies, extracted from well-structured tweets on Twitter. Each instance represents a tweet and is a tuple: user ID, IMDB movie ID, rating, timestamp, number of favorites for this tweet, and number of retweets of this tweet. The dataset was made available by A. Said, S. Dooms, B. Loni and D. Tikk for Recommender Systems Challenge 2014.

Weather Dataset
Hourly land-based weather observations from NOAA (merged data from 201304 to 201310).
The weather data covers observations made from airport weather stations, covering the time period April-October 2013. Before uploading to the designer, the dataset was processed as follows:
- Weather station IDs were mapped to corresponding airport IDs
- Weather stations not associated with the 70 busiest airports were filtered out
- The Date column was split into separate Year, Month, and Day columns
- The following columns were selected: AirportID, Year, Month, Day, Time, TimeZone, SkyCondition, Visibility, WeatherType, DryBulbFarenheit, DryBulbCelsius, WetBulbFarenheit, WetBulbCelsius, DewPointFarenheit, DewPointCelsius, RelativeHumidity, WindSpeed, WindDirection, ValueForWindCharacter, StationPressure, PressureTendency, PressureChange, SeaLevelPressure, RecordType, HourlyPrecip, Altimeter

Wikipedia SP 500 Dataset
Data is derived from Wikipedia (https://www.wikipedia.org/) based on articles of each S&P 500 company, stored as XML data.
Before uploading to the designer, the dataset was processed as follows:
- Extract text content for each specific company
- Remove wiki formatting
- Remove non-alphanumeric characters
- Convert all text to lowercase
- Known company categories were added
Note that for some companies an article could not be found, so the number of records is less than 500.

Restaurant Feature Data
A set of metadata about restaurants and their features, such as food type, dining style, and location.
Usage: Use this dataset, in combination with the other two restaurant datasets, to train and test a recommender system.
Related Research: Bache, K. and Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

Restaurant Ratings
Contains ratings given by users to restaurants on a scale from 0 to 2.
Usage: Use this dataset, in combination with the other two restaurant datasets, to train and test a recommender system.
Related Research: Bache, K. and Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

Restaurant Customer Data
A set of metadata about customers, including demographics and preferences.
Usage: Use this dataset, in combination with the other two restaurant datasets, to train and test a recommender system.
Related Research: Bache, K. and Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
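The German Credit Card dataset's asymmetric misclassification costs (1 for low risk predicted as high, 5 for high risk predicted as low) can be scored with a small cost matrix. A minimal sketch follows; the labels and predictions are made up for illustration:

```python
# Cost matrix from the German Credit Card dataset documentation:
# misclassifying low risk as high costs 1; high risk as low costs 5.
COST = {
    ("low", "low"): 0.0, ("low", "high"): 1.0,
    ("high", "low"): 5.0, ("high", "high"): 0.0,
}

def total_cost(y_true, y_pred):
    """Sum the misclassification cost over all (actual, predicted) pairs."""
    return sum(COST[(t, p)] for t, p in zip(y_true, y_pred))

# Hypothetical labels for six credit applicants.
y_true = ["low", "low", "high", "high", "low", "high"]
y_pred = ["low", "high", "high", "low", "low", "high"]
print(total_cost(y_true, y_pred))  # one low->high (1) + one high->low (5) = 6.0
```

Evaluating with such a cost matrix, rather than plain accuracy, reflects that approving a bad loan is far more expensive than rejecting a good one.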

Clean up resources

Important

You can use the resources that you created as prerequisites for other Azure Machine Learning tutorials and how-to articles.

(Video) Components in Azure Machine Learning Pipelines

Delete everything

If you don't plan to use anything that you created, delete the entire resource group so you don't incur any charges.

  1. In the Azure portal, select Resource groups on the left side of the window.

    Example pipelines & datasets for the designer - Azure Machine Learning (1)

  2. In the list, select the resource group that you created.

  3. Select Delete resource group.

Deleting the resource group also deletes all resources that you created in the designer.

Delete individual assets

In the designer where you created your experiment, delete individual assets by selecting them and then selecting the Delete button.

The compute target that you created automatically autoscales to zero nodes when it's not being used, which minimizes charges. If you want to delete the compute target entirely, you can do so from the Compute page in the studio.

You can unregister datasets from your workspace by selecting each dataset and selecting Unregister.



To delete a dataset, go to the storage account by using the Azure portal or Azure Storage Explorer and manually delete those assets.

Next steps

Learn the fundamentals of predictive analytics and machine learning with Tutorial: Predict automobile price with the designer.


