Question 101
A data pipeline must ensure that data written to Amazon S3 is immediately consistent for subsequent reads. What S3 consistency model applies?
A) S3 provides read-after-write consistency for all operations
B) S3 has eventual consistency only
C) S3 has no consistency guarantees
D) Consistency requires special configuration
Answer: A
Explanation:
Amazon S3 provides strong read-after-write consistency for all PUT and DELETE operations as of December 2020. After a successful write, subsequent reads immediately return the latest version of the object. This strong consistency applies to all S3 storage classes and regions without special configuration.
Strong consistency eliminates the need for retry logic or delays between write and read operations in data pipelines. You can write data to S3 and immediately trigger downstream processing without concerns about reading stale data. This simplifies pipeline architectures and reduces latency.
The consistency model applies to new objects, overwrites, and deletions. If you PUT a new object, an immediate GET returns that object. If you overwrite an existing object, subsequent reads reflect the new version. If you DELETE an object, subsequent reads return 404 errors immediately.
Prior to December 2020, S3 had eventual consistency for overwrite PUTs and DELETEs, but this limitation has been removed. All S3 operations now provide strong consistency, which is a significant improvement for data pipeline reliability and eliminates race conditions that previously required workarounds.
Claiming S3 has no consistency guarantees is incorrect. AWS documents clear consistency semantics for S3 operations. Understanding these guarantees is essential for building reliable data pipelines that depend on S3 storage.
Strong consistency is provided automatically for all S3 operations without special configuration or additional costs. You do not need to enable features or use special storage classes to get consistent reads after writes.
Question 102
A data engineer needs to implement a solution where data quality metrics are calculated and tracked over time. What approach enables monitoring data quality trends?
A) Store quality metrics in CloudWatch Metrics or time-series database
B) Calculate metrics once and forget
C) No quality tracking needed
D) Manual spreadsheet updates
Answer: A
Explanation:
Storing data quality metrics as CloudWatch custom metrics enables tracking trends over time with built-in visualization and alerting capabilities. After each data load or quality check, publish metrics like null percentages, row counts, and validation failure rates to CloudWatch. Dashboards show trends and anomalies automatically.
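A minimal boto3 sketch of publishing such metrics after a quality check; the namespace, metric names, and dataset dimension below are illustrative choices, not part of any AWS-defined schema.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_quality_metrics(dataset: str, null_pct: float, row_count: int, failed_rules: int) -> None:
    """Publish data quality metrics after a load; namespace and dimensions are hypothetical."""
    cloudwatch.put_metric_data(
        Namespace="DataPipeline/Quality",  # hypothetical custom namespace
        MetricData=[
            {"MetricName": "NullPercentage", "Value": null_pct, "Unit": "Percent",
             "Dimensions": [{"Name": "Dataset", "Value": dataset}]},
            {"MetricName": "RowCount", "Value": float(row_count), "Unit": "Count",
             "Dimensions": [{"Name": "Dataset", "Value": dataset}]},
            {"MetricName": "ValidationFailures", "Value": float(failed_rules), "Unit": "Count",
             "Dimensions": [{"Name": "Dataset", "Value": dataset}]},
        ],
    )

# Example call after a load: publish_quality_metrics("orders", null_pct=0.4, row_count=125_000, failed_rules=2)
```

With metrics flowing in, CloudWatch dashboards and alarms on these names provide the trend visualization and alerting described above.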
Alternatively, time-series databases like Amazon Timestream can store detailed quality metrics with high precision timestamps. This enables sophisticated analysis of quality trends, correlation with pipeline changes, and prediction of quality issues. Time-series databases excel at handling high-volume metric data efficiently.
Tracking metrics over time reveals whether quality is improving, degrading, or stable. Trends help identify root causes when quality degrades suddenly. Historical baselines enable anomaly detection where current metrics are compared against historical patterns to identify unusual quality issues.
Calculating metrics once without retention eliminates the ability to track trends or identify degradation over time. Without historical data, you cannot distinguish between chronic quality issues and new problems. Trend analysis requires persistent metric storage.
Claiming no quality tracking is needed ignores the reality that data quality impacts business decisions and user trust. Systematic quality tracking is essential for maintaining data product reliability. Without tracking, quality issues go unnoticed until they cause visible problems.
Manual spreadsheet updates do not scale and create gaps in metric collection. Automated metric collection ensures consistent, complete data for trend analysis. Manual processes are prone to errors and cannot provide real-time quality monitoring.
Question 103
A company needs to grant cross-account access to AWS Glue Data Catalog. What must be configured?
A) Resource policies on Data Catalog and IAM roles in accessing account
B) Make all catalogs public
C) No configuration needed
D) Disable all permissions
Answer: A
Explanation:
AWS Glue Data Catalog supports resource policies that grant access to principals in other AWS accounts. The catalog owner attaches a resource policy specifying which external accounts or roles can access specific databases and tables. The accessing account creates IAM roles with permissions to access the shared catalog resources.
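A hedged sketch of how the catalog owner might attach such a resource policy with boto3; the account IDs, region, and database name are placeholders.

```python
import json
import boto3

glue = boto3.client("glue")

# Hypothetical producer account 111111111111 sharing read access with consumer account 222222222222.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::222222222222:root"},
        "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetTables", "glue:GetPartitions"],
        "Resource": [
            "arn:aws:glue:us-east-1:111111111111:catalog",
            "arn:aws:glue:us-east-1:111111111111:database/shared_db",
            "arn:aws:glue:us-east-1:111111111111:table/shared_db/*",
        ],
    }],
}

# Attach the policy to the Data Catalog in the owning account.
glue.put_resource_policy(PolicyInJson=json.dumps(policy))
```

The consumer account still needs an IAM role whose policy allows the same glue:Get* actions on those catalog ARNs, which is the second half of the configuration.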
This cross-account configuration requires coordination between both accounts. The catalog owner grants access through resource policies, and users in the accessing account assume roles with permissions to use that access. All cross-account access is logged through CloudTrail for security auditing.
Resource policies can grant different permission levels to different accounts, enabling fine-grained access control. One account might have read-only access while another has full permissions. This flexibility supports various multi-account architectures including data mesh patterns.
Making catalogs public exposes metadata to the internet, violating security requirements. Public access is inappropriate for business data catalogs containing sensitive schema and naming information. Proper cross-account access maintains security while enabling controlled sharing.
Claiming no configuration is needed is incorrect. Cross-account access requires explicit configuration through resource policies and IAM roles. Without proper configuration, accounts cannot access each other’s catalog resources due to AWS’s default security boundaries.
Disabling all permissions prevents any access including legitimate cross-account use cases. Security requires granting minimum necessary permissions to authorized principals, not eliminating all access. Proper configuration enables secure sharing while maintaining access controls.
Question 104
A data pipeline processes files with unpredictable sizes ranging from kilobytes to gigabytes. What processing strategy handles this variability effectively?
A) Dynamic resource allocation based on file size
B) Use maximum resources for all files
C) Use minimum resources for all files
D) Process only files of specific size
Answer: A
Explanation:
Dynamic resource allocation adjusts compute resources based on file characteristics detected during processing. For small files, use minimal resources to avoid waste. For large files, allocate additional memory and CPU to ensure timely completion. This adaptive approach optimizes costs while maintaining performance.
AWS Lambda can process small files efficiently within its resource constraints. For large files, Lambda can trigger AWS Glue jobs or EMR steps with appropriate resource allocations. You can also use AWS Glue with auto-scaling DPUs that adjust resources during job execution based on workload.
Implementing file size checking before processing enables routing decisions. Small files under a threshold go to Lambda for quick, serverless processing. Large files route to Glue or EMR with resources scaled to file size. This hybrid approach provides optimal cost and performance for mixed workloads.
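As a sketch of this routing pattern, the Lambda handler below reads the object size from the S3 event and either processes the file inline or hands it to a Glue job; the size threshold and job name are illustrative.

```python
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

SIZE_THRESHOLD_BYTES = 100 * 1024 * 1024  # illustrative cutoff: ~100 MB

def handler(event, context):
    """Triggered by S3 events; small files are processed in Lambda, large files go to a Glue job."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"]["size"]

        if size <= SIZE_THRESHOLD_BYTES:
            process_small_file(bucket, key)
        else:
            glue.start_job_run(                      # hypothetical Glue job name
                JobName="large-file-etl",
                Arguments={"--input_path": f"s3://{bucket}/{key}"},
            )

def process_small_file(bucket: str, key: str) -> None:
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    # ... lightweight parsing and loading of the small object goes here ...
```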
Using maximum resources for all files wastes money when processing small files that do not require substantial compute power. Small files processed with large resource allocations drive up costs unnecessarily. Right-sizing resources to workload reduces costs significantly.
Using minimum resources for all files causes failures or extreme slowness when processing large files. Insufficient resources for large files leads to out-of-memory errors or processing that takes hours instead of minutes. Resources must match workload requirements.
Processing only files of specific sizes rejects valid data that should be processed. Business requirements typically mandate processing all incoming data regardless of size. Filtering by file size artificially limits pipeline functionality and loses valuable data.
Question 105
A data engineer needs to implement error handling in AWS Glue jobs that distinguishes between transient and permanent errors. What pattern should be used?
A) Retry logic with exponential backoff for transient errors, alerting for permanent errors
B) Never retry any errors
C) Retry all errors indefinitely
D) Ignore all errors completely
Answer: A
Explanation:
Distinguishing error types enables appropriate responses that maximize pipeline reliability. Transient errors like network timeouts or temporary service unavailability should trigger retries with exponential backoff. Permanent errors like data format violations or missing permissions should trigger alerts without infinite retries.
Implement error classification logic in Glue job exception handling. Network-related exceptions indicate transient issues warranting retry. Schema mismatches or missing files indicate permanent issues requiring human intervention. Retry transient errors with increasing delays between attempts to allow systems to recover.
After a maximum retry count for transient errors, escalate to permanent error handling with alerts. This prevents infinite retry loops while giving temporary issues time to resolve. Comprehensive error logging captures context for troubleshooting both error types.
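A minimal Python sketch of this classification-and-backoff pattern; the exception types used to mark transient versus permanent failures are placeholders for whatever a real job raises.

```python
import logging
import time

logger = logging.getLogger(__name__)

# Illustrative classification: network/service errors are transient, data or permission errors are permanent.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)

class PermanentPipelineError(Exception):
    """Raised for errors retries cannot fix (bad schema, missing file, denied access)."""

def run_with_retries(task, max_attempts: int = 5, base_delay_s: float = 2.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TRANSIENT_ERRORS as exc:
            if attempt == max_attempts:
                logger.error("Transient error persisted after %d attempts: %s", attempt, exc)
                raise PermanentPipelineError("retries exhausted") from exc
            delay = base_delay_s * (2 ** (attempt - 1))  # exponential backoff: 2s, 4s, 8s, ...
            logger.warning("Transient error on attempt %d, retrying in %.0fs: %s", attempt, delay, exc)
            time.sleep(delay)
        except PermanentPipelineError:
            logger.exception("Permanent error, alerting without retry")
            raise
```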
Never retrying any errors treats all failures as permanent, reducing reliability. Many errors are temporary and would succeed if retried. Network glitches, service restarts, and capacity constraints cause transient failures that self-resolve. Retry logic improves success rates significantly.
Retrying all errors indefinitely wastes resources and prevents identification of permanent issues. Data format errors will never succeed regardless of retry count. Infinite retries on permanent errors consume resources and delay problem detection.
Ignoring errors allows data loss and corruption to go unnoticed. Silent failures are extremely dangerous because they produce incomplete results without alerting anyone. All errors must be logged and handled appropriately to maintain data integrity.
Question 106
A company uses Amazon Redshift and wants to improve query performance for frequently accessed summary data. What optimization should be implemented?
A) Create materialized views for common aggregations
B) Delete all indexes
C) Disable query result caching
D) Remove all sort keys
Answer: A
Explanation:
Materialized views in Amazon Redshift store precomputed query results that can be refreshed periodically. For frequently accessed aggregations or complex joins, materialized views provide near-instant query response by avoiding repeated computation. Views can be refreshed incrementally to incorporate new data while minimizing refresh time.
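A hedged example of creating and refreshing such a view through the Redshift Data API; the cluster, database, user, table, and view names are illustrative.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Hypothetical aggregation of a sales fact table into a daily summary view.
CREATE_MV = """
CREATE MATERIALIZED VIEW daily_sales_mv
AUTO REFRESH NO
AS
SELECT sale_date, region, SUM(amount) AS total_amount, COUNT(*) AS order_count
FROM sales
GROUP BY sale_date, region;
"""

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster", Database="analytics",
    DbUser="etl_user", Sql=CREATE_MV,
)

# Refresh once per ETL cycle; Redshift refreshes incrementally where the view definition allows it.
redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster", Database="analytics",
    DbUser="etl_user", Sql="REFRESH MATERIALIZED VIEW daily_sales_mv;",
)
```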
Redshift’s query optimizer automatically uses materialized views to answer queries when applicable, even if queries do not directly reference the view. This automatic query rewriting accelerates queries transparently without application changes. Multiple materialized views can cover different query patterns.
For summary tables accessed by many users throughout the day, materialized views eliminate redundant aggregation computation. The view is refreshed once during ETL and used for many queries, distributing computation cost across numerous query executions. This dramatically improves user experience and reduces cluster load.
Deleting indexes would degrade performance rather than improving it. While Redshift does not use traditional indexes like transactional databases, it does use sort keys and zone maps for query optimization. Removing these would slow queries significantly.
Disabling query result caching forces repeated computation of identical queries. Result caching is an optimization that improves performance and reduces costs. Disabling it provides no benefits and degrades user experience.
Removing sort keys eliminates one of Redshift’s primary optimization mechanisms. Sort keys enable zone maps that allow Redshift to skip irrelevant data blocks. Without sort keys, queries must scan more data, increasing execution time and costs.
Question 107
A data pipeline must coordinate execution of Lambda functions, Glue jobs, and EMR steps in a complex workflow. What orchestration service should be used?
A) AWS Step Functions
B) Manual execution in sequence
C) No orchestration
D) Email-based coordination
Answer: A
Explanation:
AWS Step Functions provides state machine-based orchestration that coordinates multiple AWS services in complex workflows. Step Functions natively integrates with Lambda, Glue, EMR, and many other services through service integrations that require minimal code. Workflows are defined using JSON state machine definitions.
Step Functions handles error handling, retries, and state management automatically. You define which errors should trigger retries, how many attempts, and what to do after exhausting retries. The service maintains execution history showing exactly what happened during workflow execution.
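A minimal sketch of a state machine definition with a retried Glue task, expressed as a Python dictionary and registered with boto3; the job name, SNS topic, IAM role, and account IDs are placeholders.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Illustrative Amazon States Language definition: run a Glue job synchronously, retry on failure, then notify.
definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",  # waits for the Glue job to finish
            "Parameters": {"JobName": "transform-orders"},
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 30,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "Next": "NotifyOnSuccess",
        },
        "NotifyOnSuccess": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:111111111111:pipeline-events",
                "Message": "Pipeline completed",
            },
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="orders-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::111111111111:role/StepFunctionsPipelineRole",
)
```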
The visual workflow designer shows the flow graphically, and the console records execution history for each run. This visibility simplifies debugging and understanding complex pipelines. Step Functions scales automatically and, for Standard workflows, charges only for state transitions, making it cost-effective for workflows of any size.
Manual execution requires human intervention for each pipeline run and cannot handle errors gracefully. Manual orchestration does not scale and introduces human error risks. Automated orchestration is essential for reliable production pipelines.
Operating without orchestration means no systematic way to coordinate dependent tasks or handle failures. Complex pipelines require orchestration to ensure tasks execute in correct order with appropriate dependencies. Without orchestration, pipelines are fragile and unreliable.
Email-based coordination requires humans to manually trigger each step based on email notifications. This introduces delays, does not scale, and creates opportunities for errors. Automated orchestration eliminates human bottlenecks and ensures reliable execution.
Question 108
A data engineer needs to provide business users with a self-service portal to request access to datasets. What AWS service helps implement data catalog and access management?
A) AWS Lake Formation with data catalog and permissions management
B) Manual email approval process
C) Unrestricted access for everyone
D) No access request mechanism
Answer: A
Explanation:
AWS Lake Formation provides a data catalog with metadata about available datasets and centralized permissions management. Users can discover datasets through the catalog and request access through integrated workflows. Administrators approve requests and Lake Formation automatically grants appropriate permissions across integrated services.
The catalog includes descriptions, schemas, and data classifications that help users understand what data is available. Search and discovery features enable users to find relevant datasets without requiring knowledge of underlying storage locations. This self-service approach democratizes data access while maintaining governance.
Lake Formation’s permission management integrates with IAM, enabling fine-grained access control at database, table, and column levels. Approved access grants are enforced consistently across Athena, Redshift Spectrum, EMR, and other services. This eliminates the need to configure permissions separately in each tool.
Manual email approval processes do not scale and introduce significant delays. As user counts and dataset numbers grow, manual processes become bottlenecks. Email-based workflows also lack audit trails and systematic tracking of access approvals.
Providing unrestricted access to everyone violates data governance principles and security requirements. Different users have different access needs based on their roles. Unrestricted access exposes sensitive data and makes compliance impossible.
Operating without an access request mechanism requires users to directly contact administrators or guess at data locations. This ad-hoc approach is inefficient, frustrates users, and provides no systematic governance. Self-service portals with approval workflows balance accessibility with security.
Question 109
A company wants to minimize data transfer costs when moving data between S3 and other AWS services. What design principle should be followed?
A) Deploy resources in the same AWS region as the data
B) Use cross-region transfers for everything
C) Store data in multiple regions unnecessarily
D) Ignore data transfer costs
Answer: A
Explanation:
Deploying compute resources like EC2, EMR, Lambda, and Glue in the same region as S3 data eliminates cross-region data transfer charges. Within a region, data transfer between S3 and other services is generally free or minimal cost. This co-location principle significantly reduces operational costs.
For applications requiring high-volume data processing, cross-region transfer costs can exceed compute costs. A Glue job in us-east-1 processing data in eu-west-1 incurs substantial transfer fees. Deploying the Glue job in eu-west-1 eliminates these charges while likely improving performance due to reduced latency.
When data must be accessed from multiple regions, use S3 Cross-Region Replication to create regional copies. Applications then access local replicas, avoiding transfer charges during regular operations. Transfer costs are incurred once during replication rather than continuously during access.
Using cross-region transfers unnecessarily wastes money without providing benefits. Unless business requirements mandate multi-region deployment for disaster recovery or data residency, keeping data and compute in the same region is more cost-effective.
Storing data in multiple regions without justification multiplies storage costs and data transfer costs for maintaining synchronization. Multi-region storage should be driven by business requirements like compliance or availability, not implemented arbitrarily.
Ignoring data transfer costs can result in unexpectedly high AWS bills. Data transfer represents a significant portion of costs for data-intensive applications. Architecting to minimize unnecessary transfers is essential for cost optimization.
Question 110
A data pipeline processes personally identifiable information and must support data subject requests to delete individual records for GDPR compliance. What architecture enables this?
A) Partition data by user ID with lifecycle policies and catalog tracking
B) Store all data in immutable format
C) Refuse all deletion requests
D) Lose track of where user data is stored
Answer: A
Explanation:
Partitioning data by user ID creates a structure where each user’s data is isolated in specific S3 prefixes or files. When a deletion request arrives, you can identify and delete the specific partitions containing that user’s data. Maintaining catalog metadata about which partitions contain which users enables efficient deletion.
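A minimal sketch of physically deleting one user's partition, assuming a hypothetical prefix layout of events/user_id=<id>/ and an illustrative bucket name.

```python
import boto3

s3 = boto3.resource("s3")

def delete_user_partition(bucket_name: str, user_id: str) -> int:
    """Delete every object under a user-scoped prefix; the layout is an assumed convention."""
    bucket = s3.Bucket(bucket_name)
    prefix = f"events/user_id={user_id}/"
    deleted = 0
    for obj in bucket.objects.filter(Prefix=prefix):
        obj.delete()
        deleted += 1
    return deleted

# Example deletion request handling: delete_user_partition("datalake-raw", "12345")
```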
For append-only systems like data lakes, implement logical deletion by marking records as deleted and filtering them during queries. Maintain a deletion registry that query engines check to exclude deleted records. Physical deletion can occur during periodic compaction jobs that rewrite files without deleted records.
AWS Lake Formation and Glue can help track data lineage, showing where user data flows through your systems. This lineage information is essential for responding to deletion requests across multiple datasets. Comprehensive tracking ensures you find and delete all copies of user data.
Storing data in immutable format without deletion capabilities violates GDPR’s right to deletion. Organizations must be able to delete personal data upon request. Immutability is valuable for audit trails but must be balanced with regulatory requirements.
Refusing deletion requests violates GDPR and subjects organizations to substantial fines. The right to deletion is a fundamental GDPR requirement. Systems must be designed to support deletion even if it adds architectural complexity.
Losing track of where user data is stored makes compliance impossible. You cannot delete what you cannot find. Comprehensive data cataloging and lineage tracking are essential for supporting deletion requests and demonstrating compliance.
Question 111
A data engineer needs to optimize costs for Amazon Redshift by identifying and removing unused tables. What approach should be used?
A) Query system tables to identify tables with no recent access
B) Delete all tables randomly
C) Keep all tables forever regardless of use
D) No cost optimization needed
Answer: A
Explanation:
Amazon Redshift system tables track query execution history, including which tables are scanned. Query the STL_SCAN system table, joined to SVV_TABLE_INFO, to identify tables with no recent reads; because STL log tables retain only a few days of history, persist snapshots of these results on a schedule to cover longer review windows such as 90 or 180 days. Tables with no recent access are candidates for archival or deletion after validation with stakeholders.
Before deleting unused tables, export them to S3 using the UNLOAD command as a backup. This preserves data if it is needed later while freeing expensive Redshift storage. External tables in Redshift Spectrum can reference archived data, maintaining query access if required.
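A hedged sketch of both steps via the Redshift Data API; the cluster, schema, table, bucket, and role names are illustrative, and in practice the usage query would be scheduled and its results retained to span the review window.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Tables with no scan recorded in the retained system-table history surface first.
FIND_UNUSED_SQL = """
SELECT ti."schema", ti."table", MAX(s.starttime) AS last_scanned
FROM svv_table_info ti
LEFT JOIN stl_scan s ON s.tbl = ti.table_id
GROUP BY ti."schema", ti."table"
ORDER BY last_scanned NULLS FIRST;
"""

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster", Database="analytics",
    DbUser="admin_user", Sql=FIND_UNUSED_SQL,
)

# After stakeholder review, archive a candidate table to S3 before dropping it.
UNLOAD_SQL = """
UNLOAD ('SELECT * FROM legacy_schema.old_events')
TO 's3://analytics-archive/old_events/'
IAM_ROLE 'arn:aws:iam::111111111111:role/RedshiftUnloadRole'
FORMAT AS PARQUET;
"""

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster", Database="analytics",
    DbUser="admin_user", Sql=UNLOAD_SQL,
)
```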
Systematic analysis of table usage prevents accumulation of obsolete tables that consume storage and complicate maintenance. Regular review cycles identify unused tables before they become significant cost drivers. Combining usage analysis with stakeholder consultation ensures important tables are not inadvertently removed.
Deleting tables randomly risks removing critical data and breaking dependent queries and applications. Deletion decisions must be based on actual usage patterns and business requirements. Random deletion is reckless and unacceptable in production environments.
Keeping all tables forever accumulates storage costs as obsolete data persists. Development tables, failed experiments, and superseded datasets should be removed when no longer needed. Systematic cleanup maintains efficient operations and controls costs.
Claiming no cost optimization is needed ignores opportunities to reduce expenses without impacting functionality. Storage costs for unused tables compound over time. Regular optimization is a best practice for responsible cost management.
Question 112
A company needs to ensure that data loaded into Amazon Redshift maintains referential integrity between fact and dimension tables. What approach enforces this?
A) Application-level constraint checking before loading
B) Load data without any validation
C) Ignore referential integrity
D) Random data insertion
Answer: A
Explanation:
Amazon Redshift supports defining foreign key constraints but does not enforce them during data loading for performance reasons. Applications must validate referential integrity before loading by checking that foreign keys in fact tables match primary keys in dimension tables. This validation can be implemented in ETL jobs using joins to verify relationships.
AWS Glue jobs can perform pre-load validation by joining fact and dimension tables to identify orphaned records with foreign keys that do not match any dimension records. Invalid records can be logged, corrected, or rejected based on business rules. Only validated data proceeds to Redshift loading.
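A minimal PySpark sketch of this pre-load check in a Glue job, using a left anti join to isolate orphaned fact rows; the catalog database, table names, join key, and quarantine path are placeholders.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical staging tables registered in the Glue Data Catalog.
facts = glue_context.create_dynamic_frame.from_catalog(
    database="staging", table_name="sales_facts").toDF()
dims = glue_context.create_dynamic_frame.from_catalog(
    database="staging", table_name="customer_dim").toDF()

# Left anti join: fact rows whose customer_id has no matching dimension row.
orphans = facts.join(dims, on="customer_id", how="left_anti")

if orphans.count() > 0:
    # Quarantine orphans for review instead of loading them into Redshift.
    orphans.write.mode("overwrite").parquet("s3://datalake-quarantine/sales_orphans/")

# Only rows with valid references proceed to the Redshift load step.
valid = facts.join(dims.select("customer_id"), on="customer_id", how="inner")
```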
Implementing validation in ETL ensures clean data enters the data warehouse, maintaining integrity despite Redshift not enforcing constraints. This approach also provides opportunities to handle errors gracefully, logging issues for resolution rather than failing entire loads.
Loading data without validation allows orphaned records and broken relationships to corrupt the warehouse. Analytical queries joining facts to dimensions return incomplete results when referential integrity is violated. Data quality suffers and user trust erodes.
Ignoring referential integrity produces unreliable analytics where metrics do not accurately reflect business reality. Fact records without corresponding dimensions cannot be properly categorized or analyzed. Integrity is fundamental to data warehouse correctness.
Random data insertion guarantees corruption and makes analytics meaningless. Data warehouses require systematic, validated data loading that maintains relationships between tables. Random insertion violates all data management principles.
Question 113
A data pipeline must process data incrementally, loading only new records since the last successful run. What pattern tracks processing progress?
A) Maintain watermark or high-water mark in control table
B) Always process all data from beginning
C) Random record selection
D) No progress tracking
Answer: A
Explanation:
A watermark or high-water mark tracks the last successfully processed record by storing a timestamp, sequence number, or ID in a control table. Each pipeline run queries for records with values greater than the stored watermark. After successful processing, the watermark is updated to the maximum value processed.
This pattern enables efficient incremental loading by processing only new data. For time-series data, the watermark might be a timestamp representing the latest processed record time. For databases with auto-incrementing IDs, the watermark stores the maximum ID processed.
Control tables can be implemented in DynamoDB, RDS, or even S3. The control table stores job name, last run timestamp, watermark value, and status information. Atomic updates to the watermark ensure consistency even if jobs fail partway through processing.
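A minimal sketch of a DynamoDB-backed watermark, assuming a hypothetical pipeline_control table keyed by job_name; the conditional update keeps the watermark from moving backward.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
control = dynamodb.Table("pipeline_control")  # hypothetical control table, partition key: job_name

def get_watermark(job_name: str, default: str = "1970-01-01T00:00:00") -> str:
    item = control.get_item(Key={"job_name": job_name}).get("Item")
    return item["watermark"] if item else default

def update_watermark(job_name: str, new_watermark: str) -> None:
    # Conditional update so the watermark only ever advances.
    control.update_item(
        Key={"job_name": job_name},
        UpdateExpression="SET watermark = :w, last_run_status = :s",
        ConditionExpression="attribute_not_exists(watermark) OR watermark < :w",
        ExpressionAttributeValues={":w": new_watermark, ":s": "SUCCEEDED"},
    )

# Typical run: read the watermark, query the source WHERE updated_at > watermark,
# process the new records, then call update_watermark with the maximum value processed.
```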
Processing all data from the beginning on every run wastes resources and time. As historical data accumulates, full processing becomes impractical. Incremental processing scales efficiently as data volumes grow.
Random record selection provides no guarantee that all new records are processed or that records are not processed multiple times. Random selection cannot implement reliable incremental processing semantics. Deterministic watermark-based tracking is required.
Operating without progress tracking makes incremental processing impossible. You cannot determine which records are new versus already processed without tracking. This forces full reprocessing or risks missing new data.
Question 114
A data engineer needs to join a small lookup table with a large fact table in AWS Glue. What join strategy optimizes performance?
A) Broadcast join for the small lookup table
B) Sort-merge join for both tables
C) Load everything in memory
D) Avoid joins completely
Answer: A
Explanation:
Broadcast joins are optimal when one table is small enough to fit in memory on all worker nodes. The small lookup table is replicated to all nodes, and the large fact table is distributed across nodes. Each partition of the fact table can join with the local copy of the lookup table without shuffling data.
This strategy eliminates expensive shuffle operations that redistribute data across the network. Broadcasting a small table is far more efficient than shuffling a large table. For lookup tables with thousands of rows joined to fact tables with millions or billions of rows, broadcast joins provide dramatic performance improvements.
AWS Glue and Spark automatically use broadcast joins when the small table size is below a threshold. You can also explicitly hint broadcast joins using DataFrame APIs. Monitor job metrics to verify that broadcasts occur and shuffle volumes remain low.
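A short PySpark sketch of the explicit hint; the S3 paths and join key are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: a small product lookup table and a large sales fact table.
lookup = spark.read.parquet("s3://datalake-silver/product_lookup/")
facts = spark.read.parquet("s3://datalake-silver/sales_facts/")

# Explicit broadcast hint: the lookup table is copied to every executor,
# so each fact partition joins locally without a shuffle.
enriched = facts.join(broadcast(lookup), on="product_id", how="left")

enriched.write.mode("overwrite").parquet("s3://datalake-gold/sales_enriched/")
```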
Sort-merge joins require shuffling both tables, which is inefficient when one table is small. Both tables are redistributed across nodes by join key, incurring network overhead. This strategy makes sense for joining two large tables but not for small-large combinations.
Loading everything in memory simultaneously causes out-of-memory failures on large fact tables. Broadcast joins work because only the small table needs to fit in memory on each node. The large table streams through incrementally.
Avoiding joins completely when business logic requires them forces join operations into application code or later query stages. This shifts complexity and likely degrades performance. Joins should be performed using optimal strategies, not avoided.
Question 115
A company stores sensitive data in Amazon S3 and needs to ensure that only encrypted data can be uploaded. What S3 feature enforces this?
A) Bucket policies requiring encryption headers on PUT requests
B) Allow unencrypted uploads
C) Disable all encryption
D) Hope users encrypt voluntarily
Answer: A
Explanation:
S3 bucket policies can require that PUT requests include encryption headers like x-amz-server-side-encryption. Policies deny requests that do not specify encryption, preventing unencrypted data from being uploaded. This enforced encryption ensures all data is protected at rest without relying on user compliance.
Example bucket policy conditions check for the presence of server-side encryption headers and deny requests without them. This can enforce SSE-S3, SSE-KMS, or SSE-C encryption. Combining with default bucket encryption provides defense-in-depth where encryption occurs even if clients do not specify it.
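A hedged sketch of such a policy applied with boto3, following the commonly documented two-statement pattern (deny the wrong algorithm, deny a missing header); the bucket name and the required algorithm are placeholders.

```python
import json
import boto3

s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyWrongEncryptionAlgorithm",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::sensitive-data-bucket/*",
            "Condition": {"StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}},
        },
        {
            "Sid": "DenyMissingEncryptionHeader",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::sensitive-data-bucket/*",
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        },
    ],
}

s3.put_bucket_policy(Bucket="sensitive-data-bucket", Policy=json.dumps(policy))
```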
Enforcing encryption through bucket policies ensures consistent protection across all applications and users. Individual users cannot bypass encryption requirements. This centralized control is essential for maintaining security posture and compliance with regulations.
Allowing unencrypted uploads creates security gaps where sensitive data might be stored without protection. Relying on users to remember encryption is unreliable. Mandatory controls through bucket policies eliminate the possibility of unencrypted storage.
Disabling encryption when storing sensitive data violates security requirements and data protection regulations. Encryption at rest is a fundamental control for protecting confidential information. Disabling it exposes data to unauthorized access risks.
Hoping users voluntarily encrypt data without enforcement is not a security strategy. Voluntary compliance fails when users forget, misunderstand requirements, or use tools that do not support encryption. Mandatory enforcement through technical controls is required.
Question 116
A data engineer needs to implement a medallion architecture with bronze, silver, and gold data layers. What does each layer represent?
A) Bronze is raw data, silver is cleaned data, gold is aggregated business data
B) All layers contain identical data
C) Bronze only with no other layers
D) Random data in each layer
Answer: A
Explanation:
The medallion architecture organizes data lake layers by processing stage. Bronze layer contains raw data ingested from sources with minimal transformation, preserving original formats and all records. This layer serves as the system of record and enables reprocessing if downstream logic changes.
Silver layer contains cleaned, validated, and enriched data. Raw data from bronze is deduplicated, standardized, and quality-checked. Silver provides a trusted foundation for analytics, removing most data quality issues while maintaining detailed records. This layer often conforms data to common schemas.
Gold layer contains aggregated, denormalized, and business-oriented datasets optimized for consumption by analysts and BI tools. Gold tables are typically star or snowflake schemas with dimensions and facts. This layer provides the curated, performant datasets that business users query directly.
Each layer serves different purposes with different quality levels and structures. Bronze enables auditability and reprocessing, silver enables analysis on clean data, and gold enables performant business intelligence. This progressive refinement balances flexibility with usability.
Having identical data in all layers wastes storage and provides no value. The purpose of multiple layers is progressive refinement and optimization for different uses. Duplication without transformation serves no architectural purpose.
Using only bronze without downstream refinement forces analysts to work with raw, unclean data. This increases query complexity, reduces performance, and leads to inconsistent results as analysts apply different cleaning logic. Refined layers improve productivity and consistency.
Question 117
A data pipeline must handle schema evolution where source systems add new columns over time. What approach accommodates schema changes gracefully?
A) Use schema-on-read with flexible formats like Parquet or JSON
B) Reject any schema changes
C) Fail pipeline on new columns
D) Require manual code changes for each schema change
Answer: A
Explanation:
Schema-on-read allows tables to evolve without breaking existing queries. When new columns appear in source data, they are added to the schema automatically by crawlers or during query time. Existing queries that do not reference new columns continue working unchanged. Queries can optionally access new columns when needed.
Flexible formats like Parquet and JSON handle schema evolution naturally. Parquet stores schema metadata with data, allowing readers to understand structure without external definitions. Missing columns in older files return nulls, and new columns in newer files are available transparently. JSON is inherently flexible with optional fields.
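A short PySpark sketch of schema-on-read over evolving Parquet files; the path and column names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a dataset whose newer files carry extra columns.
events = (
    spark.read
    .option("mergeSchema", "true")   # union the schemas of older and newer Parquet files
    .parquet("s3://datalake-bronze/events/")
)

# Older files simply return NULL for columns they never contained.
events.select("event_id", "event_time", "new_optional_field").show(5)
```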
The AWS Glue Data Catalog retains previous table versions as schemas change, and the Glue Schema Registry adds explicit schema versioning with compatibility checks for streaming data. This versioning enables backward compatibility while supporting evolution.
Rejecting schema changes forces rigid pipelines that require downtime for schema updates. In dynamic environments where sources evolve continuously, rejection causes frequent pipeline failures. Graceful schema evolution maintains pipeline availability during source system changes.
Failing pipelines on new columns treats expected evolution as errors. Source systems naturally add features over time, adding corresponding data fields. Pipelines should adapt to this evolution rather than failing. Flexible schema handling maintains operational continuity.
Requiring manual code changes for each schema change does not scale and creates operational burden. As sources and change frequency increase, manual updates become bottlenecks. Automated schema evolution reduces operational overhead and accelerates time-to-insight for new data.
Question 118
A company wants to implement data quality gates that prevent poor quality data from propagating through the pipeline. Where should quality checks be implemented?
A) At ingestion before data enters the pipeline and between processing stages
B) Only at the end after all processing
C) No quality checks needed
D) Only in reporting tools
Answer: A
Explanation:
Implementing quality checks at ingestion catches problems early before they propagate and corrupt downstream data. Validating source data immediately prevents cascading failures through multiple pipeline stages. Early detection is cheaper to fix and prevents waste from processing bad data through the entire pipeline.
Quality gates between processing stages ensure each stage produces valid output before downstream stages consume it. This staged validation creates checkpoint boundaries that isolate problems. If a transformation introduces errors, they are caught before affecting subsequent processing rather than being discovered in final reports.
AWS Glue Data Quality can be integrated at multiple points in pipelines to create comprehensive quality gates. Configure quality rules that must pass before data proceeds to the next stage. Failed quality checks can halt processing, quarantine bad data, or trigger alerts based on severity.
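A minimal sketch of registering a DQDL ruleset against a staged table with boto3; the rules, database, and table names are illustrative, and a downstream step would run the evaluation and halt or quarantine on failure.

```python
import boto3

glue = boto3.client("glue")

# Illustrative DQDL rules acting as a quality gate for a staged table.
ruleset = """
Rules = [
    RowCount > 0,
    IsComplete "order_id",
    ColumnValues "status" in ["NEW", "SHIPPED", "CANCELLED"]
]
"""

glue.create_data_quality_ruleset(
    Name="orders-staging-gate",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "staging", "TableName": "orders"},
)
```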
Checking quality only at the end means bad data has already corrupted intermediate datasets and consumed processing resources. End-of-pipeline checks discover problems after maximum damage has occurred. Root cause analysis is harder when errors have propagated through multiple transformations.
Operating without quality checks allows errors to flow unchecked through systems, producing unreliable analytics. Users lose trust in data products when quality issues are frequent. Quality checks are essential infrastructure for reliable data pipelines.
Checking quality only in reporting tools shifts responsibility to end users and provides no protection for the pipeline itself. Reporting tools should present clean, validated data. Quality assurance is a pipeline responsibility, not a reporting responsibility.
Question 119
A data engineer needs to optimize S3 storage costs for log data that is accessed frequently for the first month, then rarely accessed afterward. What storage strategy should be used?
A) Use S3 Lifecycle policies to transition to S3 Standard-IA after 30 days
B) Keep all data in S3 Standard forever
C) Delete all data after 30 days
D) Manually move data monthly
Answer: A
Explanation:
S3 Lifecycle policies automatically transition objects between storage classes based on age. Configure a policy that keeps objects in S3 Standard for 30 days, then transitions them to S3 Standard-IA for infrequent access. As access frequency decreases further, subsequent transitions to Glacier tiers provide additional savings.
This tiered approach optimizes costs by matching storage class to access patterns. Frequent access during the first month justifies Standard storage costs. After 30 days when access becomes rare, the lower storage costs of Standard-IA outweigh retrieval fees for occasional access.
Lifecycle policies execute automatically without operational overhead. Once configured, S3 handles transitions continuously as objects age. This eliminates manual management and ensures consistent cost optimization across all objects.
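A minimal boto3 sketch of such a policy; the bucket name, prefix, tiers, and expiration window are illustrative.

```python
import boto3

s3 = boto3.client("s3")

# Transition log objects to Standard-IA at 30 days and Glacier at 90 days,
# and expire them only after an assumed two-year retention period.
s3.put_bucket_lifecycle_configuration(
    Bucket="app-log-archive",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-logs-by-age",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 730},
        }]
    },
)
```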
Keeping all data in S3 Standard forever incurs unnecessary costs for infrequently accessed data. Standard storage is optimized for frequent access, which is wasteful for archived logs. Tiering data by access patterns reduces costs significantly without sacrificing availability.
Deleting data after 30 days may violate retention requirements for logs. Compliance regulations often mandate log retention for months or years. Deletion should only occur after retention periods expire, not based solely on access patterns.
Manually moving data monthly requires operational effort and is prone to delays or errors. As data volumes grow, manual management becomes impractical. Automated lifecycle policies scale effortlessly and ensure timely transitions.
Question 120
A data pipeline processes data from multiple sources with different security classifications. How should access controls be implemented?
A) Use data classification tags and Lake Formation permissions
B) Give everyone access to all data
C) No access controls needed
D) Single password for all data
Answer: A
Explanation:
Data classification tags identify the sensitivity level of datasets using standardized labels like public, internal, confidential, or restricted. AWS Lake Formation uses these tags to enforce permissions, granting access only to users whose clearance level matches or exceeds the data classification. This tag-based security scales efficiently across large numbers of datasets.
Lake Formation’s attribute-based access control evaluates user attributes and data tags to make authorization decisions. Users with appropriate roles or clearances automatically gain access to correspondingly classified data. This dynamic approach adapts as data classifications change without requiring manual permission updates.
Tag-based security separates data classification from permission management. Data owners tag datasets based on sensitivity, and security administrators define policies mapping classifications to user attributes. This separation of concerns improves governance and reduces administrative burden.
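A hedged boto3 sketch of the tag-based model: define an LF-tag, attach it to a table, and grant access by tag expression; the tag values, database, table, and role ARN are placeholders.

```python
import boto3

lf = boto3.client("lakeformation")

# Define a classification tag and its allowed values.
lf.create_lf_tag(TagKey="classification", TagValues=["public", "internal", "confidential"])

# Tag a table according to its sensitivity.
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "sales", "Name": "orders"}},
    LFTags=[{"TagKey": "classification", "TagValues": ["internal"]}],
)

# Grant SELECT on all tables tagged public or internal to an analyst role.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111111111111:role/AnalystRole"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "classification", "TagValues": ["public", "internal"]}],
        }
    },
    Permissions=["SELECT"],
)
```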
Giving everyone access to all data violates the principle of least privilege and exposes sensitive information. Different security classifications exist because different data has different protection requirements. Unrestricted access makes security classifications meaningless.
Operating without access controls allows unauthorized users to access sensitive data, creating security breaches and compliance violations. Access controls are fundamental to data governance and security. Every production system requires appropriate access restrictions.
Using a single password for all data provides no individual accountability and cannot implement classification-based restrictions. Shared credentials are insecure and prevent audit trails from identifying who accessed what data. Individual credentials with role-based access are required.