Question 1:
A data scientist is training a machine learning model using Azure Machine Learning with a large dataset stored in Azure Data Lake Storage Gen2. The training jobs often fail due to timeout errors when multiple nodes read data simultaneously. Which approach best reduces timeout failures while ensuring efficient high-throughput data access across the compute nodes?
A) Enable caching for Azure Machine Learning Dataset to store cached copies in remote storage
B) Convert all CSV files into Parquet format and mount the dataset directly in the training script
C) Copy the dataset to each compute node’s ephemeral SSD storage before training begins
D) Use a FileDataset but disable multiprocessing for the data-loading steps
Answer:
C
Explanation:
Data-loading efficiency is a core concern in the DP-100 exam, because training workloads in Azure Machine Learning often operate in distributed compute environments where network I/O can quickly become a bottleneck. Timeout errors typically occur when the underlying storage service cannot supply sufficient throughput to satisfy concurrent requests across many training processes. Azure Data Lake Storage Gen2 is highly scalable and performant, but when training jobs run on clusters with many nodes and each node launches a large number of parallel workers reading simultaneously, remote storage latency can accumulate. The DP-100 exam expects candidates to understand how to optimize I/O patterns to reduce contention, avoid throttling, and ensure stable throughput.
Copying the dataset to local ephemeral storage on each compute node is one of the most effective techniques for optimizing data access during training. Ephemeral disks on Azure ML compute clusters are typically NVMe-based and provide extremely high read throughput, often orders of magnitude faster than pulling files repeatedly from remote storage. By copying the dataset at the start of the job, the training script ensures that all workers—across CPU or GPU nodes—consume data from a local path with much lower latency. This eliminates the bandwidth contention problem that arises when every node repeatedly downloads or streams remote data during training. For deep learning workloads where each worker performs frequent small reads or shuffled access patterns, local SSD storage dramatically reduces bottlenecks, preventing timeout errors and enabling consistent performance across epochs.
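A minimal SDK v1 sketch of this pattern, assuming a registered FileDataset named training-data, a hypothetical src/train.py script, and an existing compute cluster named gpu-cluster; as_download() copies the files onto each node's local disk before the script starts:

```python
from azureml.core import Workspace, Dataset, ScriptRunConfig, Experiment

ws = Workspace.from_config()
ds = Dataset.get_by_name(ws, name="training-data")    # assumed registered FileDataset

# as_download() materializes the files on the node's local (ephemeral SSD) disk
# before train.py runs, instead of streaming them from ADLS Gen2 on every read.
src = ScriptRunConfig(
    source_directory="src",                           # hypothetical folder containing train.py
    script="train.py",
    arguments=["--data-dir", ds.as_download()],
    compute_target="gpu-cluster",                     # assumed existing GPU cluster
)

run = Experiment(ws, "local-ssd-training").submit(src)
```

Inside train.py, the --data-dir argument resolves to a local path, so every worker reads from node-local storage rather than from remote storage.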
Option A is incorrect because Azure ML Dataset caching stores cached copies in remote blob storage locations or managed data stores, not directly on local cluster disks. Remote caching reduces some overhead but does not address the high-parallelism issue that causes timeout failures when many workers simultaneously query remote storage. It also does not eliminate network dependencies.
Option B is partially correct because converting data to Parquet format often improves read efficiency due to columnar compression and optimized metadata. However, even Parquet files accessed over the network can still become a bottleneck at scale. Parquet improves efficiency but does not solve the root issue: remote I/O saturation caused by many simultaneous training processes.
Option D is incorrect because disabling parallelism reduces CPU and GPU utilization, causing training to run significantly slower. The DP-100 exam focuses on scaling efficiently rather than reducing performance to avoid errors. Removing multiprocessing simply avoids triggering the underlying bottleneck but does not address the core challenge.
Therefore, the best solution is option C because it moves the data physically closer to the computation. This minimizes network dependency, decreases latency, and ensures predictable access speeds. DP-100 exam questions emphasize understanding Azure ML compute architecture, storage I/O patterns, and distributed training challenges. Solutions that maximize throughput and minimize remote dependencies are ideal. Copying the dataset into node-local ephemeral SSDs aligns perfectly with these principles and directly prevents timeout issues during large-scale parallel training.
Question 2:
You are designing an ML pipeline in Azure Machine Learning that performs data preprocessing, feature extraction, model training, and model evaluation. You want each stage to run only when its inputs change to avoid unnecessary compute usage. What should you use to ensure pipeline components are reused efficiently?
A) Use pipeline caching with OutputFileDatasetConfig to store intermediate results
B) Disable reuse for all pipeline steps to guarantee consistent execution
C) Manually track file hashes and rerun only modified steps
D) Store all intermediate data in a local path within the training environment
Answer:
A
Explanation:
Azure Machine Learning pipelines support step reuse, a crucial concept emphasized heavily on the DP-100 exam. Efficient pipeline execution matters because many ML workflows involve costly compute steps such as large-scale preprocessing or model training that do not need to be repeated unless the underlying inputs change. Step reuse minimizes compute cost and execution time, especially when working with GPU clusters or long-running transformations. The DP-100 exam expects familiarity with how Azure ML tracks dependencies and how step outputs can be cached in datastores to enable incremental processing.
Option A is correct because OutputFileDatasetConfig is designed to persist intermediate results of pipeline steps. When used with pipeline caching, Azure ML automatically determines whether the step needs to be re-executed. It does this by checking the inputs, source code, and parameters. If nothing has changed, Azure ML intelligently reuses the pre-computed results stored in the datastore. This approach follows a functional pipeline model and facilitates reproducibility, efficiency, and versioning. It also cleanly supports distributed and large-scale processing by storing results in managed datastores.
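A hedged SDK v1 sketch of a reusable step, assuming a hypothetical featurize.py script and an existing cluster named cpu-cluster; the output is persisted to the default datastore and allow_reuse lets Azure ML skip the step when nothing has changed:

```python
from azureml.core import Workspace
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Persist the step's output to the workspace datastore so later runs can reuse it.
features = OutputFileDatasetConfig(destination=(datastore, "features"))

feature_step = PythonScriptStep(
    name="extract-features",
    script_name="featurize.py",        # hypothetical feature-extraction script
    source_directory="src",
    arguments=["--output", features],
    compute_target="cpu-cluster",      # assumed existing CPU cluster
    allow_reuse=True,                  # reuse cached output when inputs, code, and parameters are unchanged
)
```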
Option B is incorrect because disabling step reuse forces every pipeline stage to run each time, even when nothing has changed. This dramatically increases compute cost, slows down workflow execution, and defeats the purpose of building modular pipelines. The exam emphasizes cost efficiency and best practices; thus, unnecessary execution is discouraged.
Option C is incorrect because manually tracking file hashes or timestamps is error-prone and outside the intended use of Azure ML pipelines. Azure ML provides built-in dependency tracking, eliminating the need for manual checks. The exam stresses leveraging platform features rather than engineering custom solutions.
Option D is incorrect because storing intermediate data on local paths prevents step reuse across runs and across compute nodes. Local storage is ephemeral and tied to the compute instance. Once the job ends or the node is deallocated, the data disappears. Azure ML cannot detect or reuse intermediate outputs stored locally, making this approach unsuitable for pipeline caching.
Therefore, the best solution is option A because it allows Azure ML to automatically detect when inputs remain unchanged and reuse cached outputs. This results in faster, more cost-effective pipeline runs and aligns fully with DP-100 best practices.
Question 3:
You are running a distributed deep learning training job on Azure Machine Learning using a GPU cluster across multiple nodes. GPU utilization remains low, and profiling indicates the bottleneck is due to CPU-heavy data preprocessing steps performed during training. What is the best way to improve GPU utilization and accelerate training performance?
A) Increase the cluster node size to more powerful GPUs
B) Perform data preprocessing in a separate pipeline step and supply preprocessed data directly to the training step
C) Disable multiprocessing in the data-loading code
D) Reduce the batch size to decrease CPU overhead
Answer:
B
Explanation:
Improving GPU utilization is one of the most common tuning tasks expected on the DP-100 exam, especially when working with distributed training on Azure Machine Learning. In many deep learning workloads—particularly computer vision and large NLP tasks—each GPU expects a steady stream of input batches. If the CPU-based preprocessing operations cannot supply batches fast enough, the GPUs remain idle despite their computational power. This exact issue appears frequently when transformations such as image decoding, resizing, cropping, normalization, tokenization, or file reading are applied repeatedly during training rather than being precomputed. Azure ML compute clusters, while powerful, still separate CPU and GPU resources, meaning CPU-bound preprocessing can easily throttle GPU utilization when not optimized.
Option B is correct because performing data preprocessing as a separate pipeline step offloads the expensive CPU-dependent work prior to training. The recommended approach involves creating an Azure ML pipeline with steps such as data cleaning, transformation, augmentation, or feature extraction executed on CPU-optimized compute. These steps produce preprocessed datasets that are saved to Azure Blob storage, Data Lake Storage Gen2, or another Azure ML datastore. The training step then consumes this preprocessed data without needing to perform CPU-intensive operations, enabling GPU nodes to focus entirely on forward and backward passes. This dramatically improves GPU utilization because input bottlenecks are eliminated, and GPU cores can operate at or near their full compute capacity. The DP-100 exam strongly emphasizes this separation of preprocessing from training pipelines as part of efficient architecture design.
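A hedged SDK v1 sketch of this separation, assuming hypothetical preprocess.py and train.py scripts plus existing clusters named cpu-cluster and gpu-cluster; the CPU step writes preprocessed data to the datastore and the GPU step consumes it as an input:

```python
from azureml.core import Workspace
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
prepped = OutputFileDatasetConfig(destination=(ws.get_default_datastore(), "prepped-data"))

# CPU-heavy decoding, resizing, and tokenization run on a CPU-optimized cluster.
prep = PythonScriptStep(
    name="preprocess",
    script_name="preprocess.py",             # hypothetical preprocessing script
    source_directory="src",
    arguments=["--output", prepped],
    compute_target="cpu-cluster",
    allow_reuse=True,
)

# The GPU cluster only runs forward and backward passes on ready-to-use data.
train = PythonScriptStep(
    name="train",
    script_name="train.py",                  # hypothetical training script
    source_directory="src",
    arguments=["--data", prepped.as_input()],
    compute_target="gpu-cluster",
)

pipeline = Pipeline(workspace=ws, steps=[prep, train])
```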
Option A is incorrect because increasing GPU power (for example, switching from V100 to A100) does not address the bottleneck. If the GPUs are underfed because of preprocessing delays, utilization will remain low no matter how powerful they are. The exam expects candidates to recognize that the mismatch between CPU preprocessing throughput and GPU consumption rates must be solved through proper pipeline design, not by simply upgrading GPU hardware.
Option C is incorrect because disabling multiprocessing actually slows down preprocessing further. DataLoader multiprocessing (for example, workers in PyTorch or TensorFlow) is designed to speed up data preparation. Removing it exacerbates the bottleneck, further decreasing GPU utilization. The DP-100 exam tests understanding of multiprocessing and parallel I/O strategies rather than disabling them.
Option D is incorrect because reducing the batch size may slightly lower per-iteration CPU overhead, but it also reduces the GPU workload, causing GPUs to perform fewer operations per iteration. This typically worsens GPU underutilization rather than improving it. Smaller batches can also negatively impact gradient stability in many models.
Therefore, option B is the best solution because it shifts preprocessing into an earlier pipeline step, leaving the GPU-based training step optimized and free from CPU bottlenecks. This is fully aligned with Azure ML best practices, which promote separating heavy transformations into modular pipeline components to ensure cost efficiency, scalability, and resource specialization.
Question 4:
A data scientist is packaging a training script for Azure Machine Learning using a custom Docker image. The training job fails due to missing Python dependencies. What is the best way to ensure all required libraries are installed consistently across runs?
A) Install missing libraries manually inside the compute cluster during job execution
B) Use a curated Azure ML base image and specify dependencies in a conda environment file
C) Attach a requirements.txt file without modifying the environment
D) Use the default Ubuntu server image from Azure without creating a custom environment
Answer:
B
Explanation:
Dependency management is a critical component of Azure Machine Learning workflows, and the DP-100 exam tests it extensively. When packaging training scripts for Azure ML, the execution environment must be fully reproducible. This means that all Python packages, system libraries, and CUDA-level drivers must be available and properly configured within the Docker image or conda environment used to run the training job. Failures such as missing dependencies occur when necessary packages are not included during environment creation.
Option B is correct because using a curated Azure ML base image combined with a conda environment file provides a stable, reproducible structure for dependency management. Azure ML curated base images already include standard data science libraries, GPU drivers (for GPU images), and optimized system-level configurations. Adding a conda YAML file allows the data scientist to explicitly declare all required Python packages, precise versions, pip dependencies, and channels. Azure ML ensures deterministic environment creation using this YAML file, so every run uses the exact same package set, eliminating environment drift and missing dependencies. This approach is considered best practice and is highlighted throughout DP-100 learning materials.
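A hedged SDK v1 sketch, assuming a hypothetical conda.yml file that pins channels, conda packages, and pip packages; the base image name is illustrative of the Azure ML base images hosted on mcr.microsoft.com:

```python
from azureml.core import Workspace, Environment

ws = Workspace.from_config()

# conda.yml (hypothetical) declares channels, conda packages, and pip packages with
# pinned versions; Azure ML builds the environment image from it deterministically.
env = Environment.from_conda_specification(name="train-env", file_path="conda.yml")

# Start from an Azure ML base image (name shown is illustrative) rather than a bare OS image.
env.docker.base_image = "mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04"

# Registering the environment lets every run resolve the exact same build.
env.register(workspace=ws)
```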
Option A is incorrect because manually installing libraries inside a compute cluster during job execution breaks reproducibility and makes it nearly impossible to track library versions. It also increases job startup time and risks introducing installation failures or version conflicts during execution. The exam emphasizes automation and reproducibility, not ad-hoc manual interventions.
Option C is insufficient because attaching a requirements.txt file without including it in a defined environment configuration prevents Azure ML from guaranteeing that the environment is built correctly. Requirements files alone do not manage system dependencies, conda channels, or conflicting package versions. They also often lead to unstable builds on different machines or over time.
Option D is incorrect because using a plain Ubuntu server image does not provide any of the machine learning libraries needed for Azure ML workloads. You would need to install everything manually, which contradicts best practices. The DP-100 exam expects candidates to use Azure ML curated images or properly defined custom images—not generic system images.
Thus, option B is the correct solution because it ensures stable, consistent, repeatable dependency management using conda and Azure ML base images, which is crucial for successful pipeline execution and model training.
Question 5:
You are tuning hyperparameters for an Azure Machine Learning experiment using HyperDrive. The search is slow because some runs complete in minutes while others take much longer. What configuration helps maximize parallelism and reduce total runtime?
A) Set the max_total_runs to a low value to reduce job load
B) Use the Bandit early termination policy to stop poorly performing runs
C) Decrease the compute cluster size to force sequential execution
D) Disable early termination policies to ensure fairness across runs
Answer:
B
Explanation:
HyperDrive is Azure ML’s built-in hyperparameter tuning service, and the DP-100 exam requires deep understanding of how parallelism, early termination, and optimization algorithms interact. When tuning models, some hyperparameter combinations naturally perform poorly or converge slowly. Allowing all these runs to complete wastes compute and increases the overall tuning time. Early termination is the mechanism Azure ML provides to mitigate this.
Option B is correct because the Bandit policy monitors the primary metric and terminates underperforming runs relative to the best run so far. This prevents the compute cluster from wasting resources on poor configurations, freeing capacity for new trials and speeding up total search completion. Bandit is particularly effective with random or grid sampling, where runs have highly variable execution times; note that Bayesian sampling does not support early termination policies. The DP-100 exam emphasizes using early termination to improve efficiency and reduce Azure costs.
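A hedged sketch of a HyperDrive configuration with a Bandit policy, assuming an existing cluster named gpu-cluster and a training script that logs a metric named accuracy:

```python
from azureml.core import Workspace, Experiment, ScriptRunConfig
from azureml.train.hyperdrive import (
    HyperDriveConfig, BanditPolicy, RandomParameterSampling,
    PrimaryMetricGoal, choice, uniform,
)

ws = Workspace.from_config()
src = ScriptRunConfig(source_directory="src", script="train.py", compute_target="gpu-cluster")

sampling = RandomParameterSampling({
    "--learning-rate": uniform(1e-4, 1e-1),
    "--batch-size": choice(32, 64, 128),
})

# Terminate any run whose primary metric falls more than 10% below the best run so far,
# checked at every reported interval after an initial delay of five evaluations.
policy = BanditPolicy(slack_factor=0.1, evaluation_interval=1, delay_evaluation=5)

hd_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=sampling,
    policy=policy,
    primary_metric_name="accuracy",        # must match the name logged via run.log()
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=40,
    max_concurrent_runs=8,                 # parallel trials, bounded by the cluster size
)

run = Experiment(ws, "hyperdrive-tuning").submit(hd_config)
```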
Option A is incorrect because lowering max_total_runs does not increase parallelism; it simply narrows the search, reducing the ability to discover optimal hyperparameter sets. This conflicts with the exam’s emphasis on balancing runtime with model quality.
Option C is incorrect because decreasing the compute cluster size reduces concurrency, forcing more sequential processing. This increases total runtime dramatically. The exam highlights scaling parallel experiments using autoscaling compute clusters for faster results.
Option D is incorrect because disabling early termination wastes compute resources. Azure ML documentation and exam materials consistently highlight early termination as a best practice.
Therefore, option B is the best solution because Bandit early termination efficiently eliminates poor runs, increases cluster throughput, and accelerates overall hyperparameter search.
Question 6:
A team wants to deploy a real-time inference endpoint using Azure Machine Learning. The model depends on several external Python packages that take a long time to install at runtime. What should the team do to ensure fast, reliable deployment of the endpoint?
A) Install required packages dynamically within the scoring script
B) Build a custom Docker image with all dependencies preinstalled and register it in Azure ML
C) Use only the default Azure ML curated environment
D) Store dependencies in a mounted datastore for dynamic loading during inference
Answer:
B
Explanation:
Real-time inference endpoints require fast startup, stable environments, and minimal runtime dependency installation. The DP-100 exam stresses the importance of packaging scoring environments properly. When Python packages are installed dynamically at runtime, deployment becomes slow, unpredictable, and error-prone. In many enterprise environments, internet access may not be available for installation, further complicating dynamic configuration.
Option B is correct because building a custom Docker image ensures all required libraries, models, and system packages are present before deployment. Azure ML allows registering custom Docker images as environments, making deployment extremely consistent. The scoring container starts instantly because no installation occurs during startup. This practice is ideal for production services requiring low latency.
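A hedged SDK v1 sketch of registering a prebuilt image as an environment; the registry and image name are illustrative, and user_managed_dependencies tells Azure ML not to layer an extra conda build on top:

```python
from azureml.core import Workspace, Environment

ws = Workspace.from_config()

# The image (name is illustrative) is built and pushed ahead of time with every scoring
# dependency baked in, so nothing is installed when the endpoint container starts.
env = Environment(name="scoring-env")
env.docker.base_image = "myregistry.azurecr.io/scoring:1.0"
env.python.user_managed_dependencies = True   # use the image's Python environment as-is
env.register(workspace=ws)
```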
Option A is incorrect because installing packages dynamically in the scoring script causes significant delays, breaks reproducibility, and violates Azure ML best practices. This method is acceptable only in rare experimental cases.
Option C is incorrect because curated environments may not include specialized packages, GPU dependencies, or libraries required for custom workflows.
Option D is incorrect because storing dependencies in a datastore does not solve the installation problem; Python packages must be installed into the container’s runtime environment and cannot simply be “loaded” from mounted storage at inference time.
Thus, option B aligns with DP-100 guidelines for deploying reliable, production-ready inference services.
Question 7:
You are monitoring a deployed Azure Machine Learning online endpoint and notice that response latency is increasing over time. Logs show that the model is repeatedly loading large files for each request. What is the best solution to improve performance?
A) Increase the number of scoring replicas
B) Load the model and large assets once at startup inside the init() function
C) Reduce the number of requests sent to the endpoint
D) Use CPU-optimized compute instead of GPU compute
Answer:
B
Explanation:
Azure ML real-time endpoints rely on a scoring script with an init() and run() function. The init() function executes once when the container starts, and run() executes for every incoming request. The DP-100 exam focuses on the importance of loading heavy assets—such as models, embeddings, tokenizers, or lookup tables—inside init(). If large files are repeatedly loaded inside run(), latency increases and I/O bottlenecks worsen. This is exactly what the logs indicate in the scenario.
Option B is correct because loading the model once at startup ensures that inference requests are processed quickly. This significantly reduces overhead and stabilizes performance. Scenarios like this appear frequently on the DP-100 exam to test understanding of inference architecture.
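A hedged sketch of a scoring script following this pattern, assuming a registered model named my-model that was saved with joblib:

```python
import json
import joblib
from azureml.core.model import Model

model = None

def init():
    # Runs once when the container starts: load the model and any large assets here.
    global model
    model_path = Model.get_model_path("my-model")   # assumed registered model name
    model = joblib.load(model_path)

def run(raw_data):
    # Runs for every request: no file I/O, only fast in-memory inference.
    data = json.loads(raw_data)["data"]
    predictions = model.predict(data)
    return predictions.tolist()
```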
Option A is incorrect because scaling replicas only helps with throughput, not per-request latency. If each replica loads large files per request, latency remains high.
Option C is incorrect because reducing requests does not solve the root efficiency issue. Latency occurs even with a single request.
Option D is incorrect because the issue is not GPU vs CPU compute, but inefficient model loading.
Thus, option B is the correct and most efficient solution.
Question 8:
A data science team wants to track metrics, outputs, and intermediate artifacts for each training run in Azure Machine Learning. They need to compare runs, visualize metrics, and register the best model automatically. What Azure ML feature provides all these capabilities?
A) Azure Blob Storage with manual folder organization
B) Azure ML Experiments and Runs
C) Azure Monitor metrics workspace
D) Storing metrics directly in Application Insights
Answer:
B
Explanation:
Experiments and Runs are fundamental components of Azure Machine Learning and a major exam topic in DP-100. An experiment acts as a logical grouping of runs, while each run captures metrics, logs, artifacts, snapshots, scripts, and other metadata. The Azure ML Studio interface allows users to compare runs visually, track performance over time, register the best models, and analyze logs. These features help maintain reproducibility and auditability.
Option B is correct because Experiments and Runs provide a built-in system to log custom metrics through run.log(), upload artifacts, store models, compare parallel trials, and integrate HyperDrive searches. Using these features is strongly emphasized on the DP-100 exam.
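A hedged sketch of comparing runs in an experiment and registering the best model, assuming an experiment named churn-training whose runs log an accuracy metric and write outputs/model.pkl:

```python
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()
experiment = Experiment(ws, "churn-training")        # hypothetical experiment name

# Pick the completed run with the highest logged "accuracy" metric...
best_run = max(
    (r for r in experiment.get_runs() if r.get_status() == "Completed"),
    key=lambda r: r.get_metrics().get("accuracy", 0.0),
)

# ...and register its output model so it is versioned in the workspace.
best_run.register_model(model_name="churn-model", model_path="outputs/model.pkl")
```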
Option A is incorrect because manual organization in blob storage lacks automated logging, metric tracking, and comparison tools.
Option C is incorrect because Azure Monitor is designed for infrastructure and operational logs, not ML-specific metrics or artifacts.
Option D is incorrect because Application Insights is intended for application telemetry, not ML experiment tracking.
Thus, option B aligns with Azure ML best practices for experiment management, tracking, and evaluation.
Question 9:
A data scientist wants to run multiple training jobs in parallel on Azure Machine Learning using different subsets of the dataset. The requirement is to distribute training tasks across nodes without modifying the training script. Which Azure ML feature is most appropriate?
A) HyperDrive with a random sampling search space
B) ParallelRunStep to distribute processing across multiple nodes
C) PipelineData for storing intermediate results
D) AutoML to automate dataset partitioning
Answer:
B
Explanation:
ParallelRunStep is specifically designed for scalable parallel processing across Azure ML compute clusters. It applies the same script to different mini-batches or partitions of data across multiple nodes. The advantage is that the training or scoring script does not need to be modified; ParallelRunStep handles distribution automatically. This is a recurring concept on the DP-100 exam.
Option B is correct because ParallelRunStep partitions the dataset and executes multiple parallel tasks. It is ideal for large batch inference, distributed preprocessing, or embarrassingly parallel workloads.
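A hedged sketch of a ParallelRunStep configuration, assuming a registered FileDataset named training-subsets, an unmodified process.py entry script, and an existing cluster named cpu-cluster; the environment name is illustrative:

```python
from azureml.core import Workspace, Environment, Dataset
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

ws = Workspace.from_config()
input_ds = Dataset.get_by_name(ws, "training-subsets")            # assumed registered FileDataset
output = OutputFileDatasetConfig(destination=(ws.get_default_datastore(), "parallel-output"))

parallel_config = ParallelRunConfig(
    source_directory="src",
    entry_script="process.py",             # existing script, used as-is
    mini_batch_size="10",                  # files handed to each run() call for a FileDataset
    error_threshold=10,
    output_action="append_row",
    environment=Environment.get(ws, "AzureML-Minimal"),  # illustrative environment name
    compute_target="cpu-cluster",
    node_count=4,
    process_count_per_node=4,
)

step = ParallelRunStep(
    name="parallel-processing",
    parallel_run_config=parallel_config,
    inputs=[input_ds.as_named_input("subsets")],
    output=output,
)
```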
Option A is incorrect because HyperDrive executes separate hyperparameter tuning jobs, not data-parallel operations.
Option C is incorrect because PipelineData stores intermediate outputs but does not implement parallel task execution.
Option D is incorrect because AutoML does not automate dataset partitioning for parallel tasks.
Thus, option B best addresses the requirement of distributing tasks without modifying the training script.
Question 10:
You are deploying a batch inference pipeline that processes millions of records daily using Azure Machine Learning. The team wants to ensure scalability, automatic retries, and minimal overhead when processing large amounts of data. Which compute target is the most suitable for large-scale batch inference?
A) Azure ML Compute Instance
B) Local compute environment
C) Azure ML Compute Cluster
D) Single-node GPU VM
Answer:
C
Explanation:
Large-scale batch inference requires distributed, autoscaling compute resources. Azure ML Compute Clusters are specifically designed for scalable job execution, including batch inference, parallel scoring, preprocessing, and distributed training. The DP-100 exam emphasizes choosing the right compute target for workload type.
Option C is correct because compute clusters scale out across multiple nodes, provide automatic retries, can handle large workloads efficiently, and support ParallelRunStep and pipeline execution.
Option A is incorrect because Compute Instances are single-node machines intended for development, not large-scale production workloads.
Option B is incorrect because local compute cannot scale, is unreliable for production workloads, and lacks Azure ML orchestration features.
Option D is insufficient because a single GPU VM cannot handle millions of daily records; it lacks distributed processing capability.
Thus, option C is the most appropriate compute target for large-scale batch inference pipelines.
Question 11:
A data science team needs to train a model on Azure Machine Learning using a compute cluster that automatically scales based on workload. They notice that new nodes take several minutes to begin training because environment setup is repeated for each run. What is the best way to reduce startup time and ensure fast provisioning of nodes for large distributed training jobs?
A) Use a warm pool of pre-provisioned nodes by configuring the cluster’s min_nodes parameter
B) Allow Azure ML to build environments automatically during each job submission
C) Delete and recreate the compute cluster before every experiment
D) Disable autoscaling and run all workloads on a single-node VM
Answer:
A
Explanation:
Training jobs on Azure Machine Learning often rely on autoscaling compute clusters to optimize cost and efficiency. When new nodes are provisioned, Azure ML pulls the base Docker image, sets up the conda environment, installs dependencies, and downloads required assets before the node becomes active. This startup sequence can take several minutes, particularly for custom images or large environments. The DP-100 exam emphasizes understanding how to minimize provisioning delays to maximize training throughput, especially in distributed or large-scale experiments.
Option A is correct because configuring the cluster with a minimum number of nodes ensures that some nodes remain available at all times. These nodes stay in a “warm” state with the environment fully prepared, significantly reducing startup time. When a new job begins, Azure ML does not need to prepare nodes from scratch if adequate idle nodes already exist. This strategy is particularly effective for HyperDrive hyperparameter sweeps, distributed deep learning, or pipeline workloads where multiple steps execute sequentially or in parallel. Additionally, min_nodes allows compute to scale down after inactivity while still retaining a baseline of ready nodes, balancing availability and cost.
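A hedged SDK v1 sketch of provisioning a cluster with a warm pool; the VM size and node counts are illustrative:

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

# Keep two warm nodes available at all times while still autoscaling up to ten.
config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_NC6S_V3",             # illustrative GPU SKU
    min_nodes=2,                            # warm pool: these nodes stay provisioned and ready
    max_nodes=10,
    idle_seconds_before_scaledown=1800,     # release extra nodes after 30 minutes of inactivity
)

cluster = ComputeTarget.create(ws, name="gpu-cluster", provisioning_configuration=config)
cluster.wait_for_completion(show_output=True)
```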
Option B is incorrect because relying on environment builds during each job submission increases startup time. Azure ML environments should be pre-built and registered so jobs start quickly. The exam emphasizes registering environments to avoid repeated builds that slow down execution.
Option C is inefficient and contradictory to best practices. Recreating compute clusters wastes time and resources and does not address environment preparation delays. It can also introduce configuration inconsistencies and is never recommended by Azure ML architecture guidance.
Option D is incorrect because disabling autoscaling eliminates one of the biggest advantages of Azure ML compute clusters. Using a single VM not only increases cost for large workloads but severely limits parallelism and compute efficiency. Azure ML is designed for scalable workloads, and the DP-100 exam focuses on using clusters effectively rather than avoiding them.
Thus, option A is the best solution because it directly reduces node startup delays while preserving autoscaling and distributed compute advantages. Warm nodes enable faster experiment turnaround, consistent performance, and improved cost management, aligning with recommended Azure ML operational practices covered in the DP-100 exam.
Question 12:
You are creating an Azure Machine Learning training pipeline. One of the steps downloads large external data from a remote system via API. The team wants to avoid re-downloading this data during every pipeline run if previous data is still valid. Which approach provides the most efficient and automated reuse of previously downloaded data?
A) Save the data to PipelineData and rely on step reuse with output caching
B) Download the data manually and store it in a local compute directory
C) Add a sleep command at the beginning of the step to delay execution
D) Create a separate compute instance to store persistent data
Answer:
A
Explanation:
Azure Machine Learning pipelines support step reuse, which allows Azure ML to skip re-running steps when neither the inputs nor the logic change. This capability is a major focus on the DP-100 exam because efficient ML operations depend on preventing redundant computation. When downloading external data, re-running the download step each time wastes time, bandwidth, and compute resources. To avoid this, Azure ML provides the OutputFileDatasetConfig and PipelineData objects to persist outputs and enable caching.
Option A is correct because storing external data in PipelineData and enabling step reuse lets Azure ML automatically detect whether the download step requires re-execution. If the inputs remain unchanged, Azure ML retrieves the previously cached output from the datastore, saving extensive download time. Step reuse leverages pipeline versioning, environment consistency checking, and an internal cache mechanism. This approach also integrates cleanly with parallel, distributed, or multi-stage pipelines and supports full reproducibility.
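A hedged sketch of the download step using PipelineData, assuming a hypothetical download.py script and an existing cluster named cpu-cluster:

```python
from azureml.core import Workspace
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
raw_data = PipelineData("raw_data", datastore=ws.get_default_datastore())

download_step = PythonScriptStep(
    name="download-external-data",
    script_name="download.py",          # hypothetical script that calls the external API
    source_directory="src",
    arguments=["--output", raw_data],
    outputs=[raw_data],
    compute_target="cpu-cluster",
    allow_reuse=True,                   # cached output is reused when inputs and code are unchanged
)
```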
Option B is incorrect because storing data in local compute directories prevents pipeline-level reuse across runs. Compute node storage is ephemeral, meaning that data disappears after the node is deallocated. This violates reproducibility, reliability, and Azure ML’s design principles for pipelines.
Option C is incorrect because adding delays does nothing to avoid redundant work. It wastes compute resources and violates best practices for efficient ML operations. The DP-100 exam stresses avoiding unnecessary computation.
Option D is incorrect because compute instances are for development, not for data persistence. They are not intended to store production pipeline outputs and cannot integrate automatically with pipeline caching, making the approach unreliable and costly.
Thus, option A is the most efficient solution because it fully leverages Azure ML’s built-in caching and reuse system, maximizing pipeline efficiency and reducing unnecessary data transfers.
Question 13:
A team trains a model using Azure Machine Learning and registers it in the workspace. They want to deploy the model to a real-time endpoint with minimal downtime as they update versions frequently. Which Azure ML feature is best suited for managing and updating models safely in production?
A) Enable application insights logging
B) Use model versioning and blue-green deployment strategies
C) Delete old models automatically before each deployment
D) Register each new model in a different workspace
Answer:
B
Explanation:
Azure Machine Learning supports model versioning, deployment orchestration, and advanced rollout strategies to ensure safe updates. The DP-100 exam covers production deployment patterns, particularly strategies for updating endpoints without downtime or service disruptions. Blue-green, canary, and rolling deployments are essential techniques for deploying new versioned models while ensuring that clients continue to access a stable environment.
Option B is correct because Azure ML model versioning allows storing multiple versions of a model under a single registered model name. Combined with blue-green deployment, teams can deploy a new version alongside the current stable one, validate its performance, and switch traffic only when ready. This minimizes downtime, avoids service outage, and provides a rollback path. Azure ML’s endpoint management features also allow routing specific traffic percentages to new deployments for safe gradual rollout.
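Traffic splitting is exposed on managed online endpoints; a hedged sketch using the v2 SDK (azure-ai-ml), assuming an existing endpoint named churn-endpoint with deployments named blue (current) and green (new):

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# "blue" serves the current stable model version, "green" the newly registered one.
endpoint = ml_client.online_endpoints.get(name="churn-endpoint")   # assumed existing endpoint

# Route 10% of traffic to the new version; shift to 100% (or back to 0%) after validation.
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```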
Option A is important for monitoring but does not manage deployments or ensure safe updates.
Option C is incorrect because deleting old models removes rollback options, creates unnecessary risk, and contradicts ML lifecycle best practices.
Option D is incorrect because storing each model in a different workspace breaks version tracking and model lineage. It makes comparison, A/B testing, and deployment automation far more difficult.
Thus, option B is the correct choice because versioning and deployment strategies ensure safe, efficient, and reliable updates in production ML systems.
Question 14:
You are building a training pipeline that uses Azure ML’s ScriptRunConfig to execute Python scripts. During execution, you need to track custom metrics such as accuracy, loss curves, and evaluation summaries. What is the recommended way to log these metrics so they are captured and accessible in Azure ML Studio?
A) Write metrics to local text files only
B) Use run.log() or run.log_list() APIs within the training script
C) Log metrics only after the pipeline finishes
D) Submit metrics through Azure Resource Manager templates
Answer:
B
Explanation:
Azure ML Run objects provide a built-in mechanism for logging metrics, artifacts, images, and output files. The DP-100 exam repeatedly tests understanding of experiment tracking and the correct use of run.log() and run.log_list() inside training scripts. This method ensures that metrics are captured in real time, visualized within Studio, and available for comparison across runs. It also allows downstream components, such as HyperDrive, to use the logged primary metric for optimization.
Option B is correct because run.log() logs scalar values, while run.log_list() logs sequences such as loss curves. These log entries appear in Azure ML Studio under the experiment’s run history, enabling detailed comparisons and visualization.
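A hedged sketch of metric logging inside a training script; train_one_epoch is a hypothetical helper standing in for the real training loop:

```python
from azureml.core import Run

run = Run.get_context()              # resolves to the current run inside a submitted script

epoch_losses = []
for epoch in range(10):
    train_loss, val_acc = train_one_epoch(epoch)   # hypothetical training helper
    epoch_losses.append(train_loss)
    run.log("loss", train_loss)          # scalar metrics appear as charts in Studio
    run.log("accuracy", val_acc)         # the primary metric HyperDrive can optimize

run.log_list("loss_curve", epoch_losses)   # log an entire sequence in one call
```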
Option A is incorrect because local text files are not captured automatically by Azure ML unless explicitly uploaded, and they do not support real-time logging.
Option C is incorrect because logging should occur during training, not only at the end. Early termination policies, progress monitoring, and metric dashboards rely on incremental logs.
Option D is incorrect because ARM templates are for resource provisioning, not experiment tracking.
Thus, option B is the correct solution and aligns with Azure ML experiment tracking best practices.
Question 15:
A data scientist needs to run a PyTorch-based distributed training job across four nodes using Azure Machine Learning. They want Azure ML to automatically configure communication backends such as NCCL for GPU training. Which Azure ML feature should they use to manage distributed execution?
A) HyperDrive
B) DistributedRunConfig
C) PipelineData
D) ParallelRunStep
Answer:
B
Explanation:
Distributed training is a core concept in DP-100. Azure ML supports distributed frameworks such as PyTorch, TensorFlow, Horovod, and MPI through DistributedRunConfig. This configuration sets up the required processes, communication backends (NCCL for GPUs, Gloo for CPUs), environment variables, and worker orchestration.
Option B is correct because DistributedRunConfig handles distributed process groups, environment setup, master-worker initialization, and node coordination. When used with ScriptRunConfig, Azure ML launches the appropriate number of worker processes across nodes. It ensures correct backend selection for PyTorch, allowing efficient GPU communication.
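In recent v1 SDK releases this distributed configuration is supplied to ScriptRunConfig through distributed_job_config, for example with PyTorchConfiguration; a hedged sketch assuming a hypothetical train_ddp.py script and a four-node cluster named gpu-cluster:

```python
from azureml.core import Workspace, Experiment, ScriptRunConfig
from azureml.core.runconfig import PyTorchConfiguration

ws = Workspace.from_config()

# Four worker processes across four GPU nodes; Azure ML sets the environment variables
# PyTorch needs for its process group, with NCCL used as the GPU communication backend.
distributed_config = PyTorchConfiguration(process_count=4, node_count=4)

src = ScriptRunConfig(
    source_directory="src",
    script="train_ddp.py",                   # hypothetical DDP training script
    compute_target="gpu-cluster",            # assumed four-node GPU cluster
    distributed_job_config=distributed_config,
)

run = Experiment(ws, "distributed-pytorch").submit(src)
```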
Option A is incorrect because HyperDrive is for hyperparameter tuning, not distributed training.
Option C is incorrect because PipelineData stores intermediate outputs but provides no distributed execution functionality.
Option D is incorrect because ParallelRunStep distributes batch inference, not distributed training.
Thus, option B is the correct choice for configuring multi-node PyTorch distributed training in Azure ML.
Question 16:
A machine learning model deployed to an Azure ML online endpoint is experiencing inconsistent performance because different requests require different preprocessing logic. The team wants to simplify deployment and ensure consistent transformations. What should they do?
A) Preprocess the data inside the run() function differently for each request
B) Create a preprocessing pipeline step and persist transformed inputs for inference
C) Include preprocessing inside the model so the endpoint always receives raw data
D) Disable preprocessing completely and rely on the client application
Answer:
C
Explanation:
In production ML systems, consistent transformations are essential. Many model failures occur due to data mismatch between training and inference. The DP-100 exam highlights the importance of integrating preprocessing steps into the model or ensuring the scoring environment captures all necessary transformations. Deployments must be deterministic, stable, and free from client-dependent inconsistencies.
Option C is correct because embedding preprocessing logic into the model ensures that inference receives identical transformations regardless of client behavior. This is common in scikit-learn pipelines, ONNX preprocessing graphs, TensorFlow SavedModels with preprocessing layers, or PyTorch modules that include data normalization. Doing so minimizes risk, simplifies endpoint usage, and guarantees consistency.
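A hedged scikit-learn sketch of this idea; the column names and training data (X_train, y_train) are hypothetical:

```python
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Preprocessing travels with the model: the endpoint accepts raw columns and the exact
# same transformations run at training time and at inference time.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "balance"]),                # illustrative numeric columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]), # illustrative categorical column
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)                 # hypothetical training data
joblib.dump(model, "outputs/model.pkl")     # register this single artifact in Azure ML
```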
Option A is incorrect because applying different preprocessing logic for each request introduces inconsistency and increases latency.
Option B is inadequate for online inference because preprocessing must occur at request time, not as a batch step.
Option D is incorrect because delegating preprocessing to client applications removes model governance and increases error risk.
Thus, option C best ensures inference consistency and aligns with DP-100 deployment guidance.
Question 17:
You want to automate the retraining of a model when new data arrives in an Azure Blob Storage container. The retraining process must trigger automatically without manual intervention. Which Azure service provides the most appropriate automation capability?
A) Azure Monitor log alerts
B) Azure Event Grid triggering an Azure ML pipeline
C) Azure Application Insights alerts
D) Manual pipeline execution using Azure ML Studio
Answer:
B
Explanation:
The DP-100 exam includes data pipeline automation concepts. Azure Event Grid is specifically designed to trigger workflows based on events such as new files arriving in Blob Storage. When new data is available, Event Grid can invoke Azure ML pipelines via Logic Apps, Azure Functions, or custom handlers. This creates a fully automated retraining loop.
Option B is correct because Event Grid can publish blob creation events that trigger an Azure ML pipeline endpoint. The pipeline can then preprocess data, retrain the model, evaluate performance, and register a new version. This aligns with MLOps best practices.
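A hedged sketch of the Azure ML side of this loop: publishing the pipeline exposes a REST endpoint that the Event Grid handler (for example a Logic App or Azure Function reacting to BlobCreated events) can call; the step objects are assumed to be defined elsewhere:

```python
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline

ws = Workspace.from_config()
pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step, register_step])  # steps defined elsewhere

published = pipeline.publish(
    name="retraining-pipeline",
    description="Retrain when new data lands in Blob Storage",
)

print(published.endpoint)   # the REST URL the event handler invokes (with an AAD token) to start retraining
```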
Option A is incorrect because Azure Monitor logs operational events, not storage triggers.
Option C is incorrect because Application Insights monitors web apps, not ML training triggers.
Option D is manual and unsuitable for automation.
Thus, option B is the correct and scalable automation solution.
Question 18:
A team is performing batch inference on 20 million rows of data using Azure Machine Learning. They want to maximize throughput by distributing inference across multiple nodes. Which Azure ML feature is designed for large-scale parallel batch inference?
A) AutoML
B) ParallelRunStep
C) Local compute
D) Azure ML Compute Instance
Answer:
B
Explanation:
ParallelRunStep is a key DP-100 feature designed specifically for high-throughput batch inference. It partitions a dataset, runs inference scripts across multiple nodes, and aggregates output. It supports large jobs such as document scoring, image processing, and structured data inference.
Option B is correct because ParallelRunStep automatically distributes workload and scales across a compute cluster. It is optimized for large datasets and can execute tasks concurrently using many compute nodes.
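A hedged sketch of the entry script contract ParallelRunStep expects, assuming a registered model named scoring-model and FileDataset input (each mini-batch is a list of file paths):

```python
import joblib
import pandas as pd
from azureml.core.model import Model

model = None

def init():
    # Called once per worker process before any mini-batches are scored.
    global model
    model = joblib.load(Model.get_model_path("scoring-model"))   # assumed registered model

def run(mini_batch):
    # Called once per mini-batch; ParallelRunStep fans these calls out across nodes and
    # processes, then aggregates whatever each call returns.
    results = []
    for file_path in mini_batch:
        df = pd.read_csv(file_path)
        results.extend(model.predict(df).tolist())
    return pd.DataFrame({"prediction": results})
```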
Option A focuses on model training, not inference.
Option C lacks scalability and orchestration.
Option D is single-node and unsuitable for very large workloads.
Thus, the correct answer is B, aligning with Azure ML batch inference best practices.
Question 19:
A data scientist wants to compare multiple trained models and select the best one automatically based on a primary metric. They also want to store each model with metadata and access previous versions later. Which Azure ML feature supports this workflow?
A) Azure ML Model Registry
B) Azure Storage Queues
C) Application Insights
D) Azure Policy
Answer:
A
Explanation:
Model management is central to ML lifecycle governance and a major DP-100 topic. Azure ML Model Registry stores versioned models, allowing teams to compare models by metrics, tags, or lineage. HyperDrive and pipelines can automatically register the best model based on a primary metric.
Option A is correct because the Model Registry enables versioning, metadata storage, lineage tracking, and automated best-model registration.
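A hedged SDK v1 sketch of registering and listing model versions; the paths, names, and tags are illustrative:

```python
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()

# Registering under an existing name creates a new version automatically.
model = Model.register(
    workspace=ws,
    model_path="outputs/model.pkl",                 # hypothetical path to the trained model
    model_name="churn-model",
    tags={"framework": "sklearn", "auc": "0.91"},   # metadata used for later comparison
)

# Previous versions stay retrievable for comparison or rollback.
for m in Model.list(ws, name="churn-model"):
    print(m.name, m.version, m.tags)
```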
Option B is a queueing service with no ML model capabilities.
Option C logs telemetry but cannot store models.
Option D enforces Azure governance policies, unrelated to ML workflows.
Thus, option A is the correct solution.
Question 20:
You are evaluating a model’s fairness using Azure Machine Learning’s responsible AI tooling. You want to compare model performance across different demographic groups. Which tool should you use?
A) Error Analysis dashboard
B) Fairness Assessment dashboard
C) Interpretability widget
D) Data Labeling workspace
Answer:
B
Explanation:
Responsible AI is a major topic introduced in newer versions of DP-100. Azure ML includes several dashboards for fairness, explainability, and error analysis. The fairness dashboard specifically evaluates metrics across demographic groups, detects disparities, and computes fairness metrics such as demographic parity and equal opportunity.
Option B is correct because the Fairness Assessment dashboard is designed to measure fairness across subgroups and visualize disparities.
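The dashboard builds on the open-source Fairlearn package; a hedged sketch of the underlying computation, assuming y_true, y_pred, and a sensitive_features column are already available:

```python
from fairlearn.metrics import MetricFrame, demographic_parity_difference
from sklearn.metrics import accuracy_score, recall_score

# Per-group performance: the same breakdown the dashboard visualizes by subgroup.
frame = MetricFrame(
    metrics={"accuracy": accuracy_score, "recall": recall_score},
    y_true=y_true,                           # assumed ground-truth labels
    y_pred=y_pred,                           # assumed model predictions
    sensitive_features=sensitive_features,   # e.g. a pandas Series of demographic labels
)
print(frame.by_group)       # metric values per demographic group
print(frame.difference())   # largest gap between any two groups

# A single disparity number such as the demographic parity difference.
print(demographic_parity_difference(y_true, y_pred, sensitive_features=sensitive_features))
```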
Option A focuses on understanding model errors, not fairness.
Option C provides explanations but not fairness metrics.
Option D is unrelated; the Data Labeling workspace is used to create labeled datasets, not to evaluate fairness.
Thus, option B is the correct answer.