Snowflake SnowPro Core Recertification (COF-R02) Exam Dumps and Practice Test Questions Set 8: Q141-160


Q141. A financial services company requires a data pipeline that can load millions of low-latency events per hour from a Kafka-based messaging system directly into Snowflake tables. The primary objectives are minimal data latency (seconds, not minutes) and efficient credit consumption. The data structure is JSON and is relatively consistent. Which data ingestion method provides the optimal balance of low latency and cost-effectiveness for this specific scenario?

A Using the Snowflake Kafka Connector configured to leverage Snowpipe.
B Building a custom application that micro-batches records and executes COPY INTO commands every minute.
C Using Snowpipe Streaming for direct, row-set ingestion from the Kafka application.
D Batching data in an S3 bucket and triggering a Snowpipe REST API call for every new file.

Answer: C

Explanation: Option C is the correct answer. Snowpipe Streaming introduces a new ingestion paradigm specifically designed for exceptionally low-latency, high-volume streaming scenarios. It bypasses the need for intermediate file creation in cloud storage. Instead, a lightweight client SDK (available for Java, which integrates with Kafka) writes row-sets directly into Snowflake tables. This minimizes the latency overhead associated with file creation, notification, and processing that is inherent in traditional Snowpipe (Option A and D). Because it writes row-sets instead of individual files, it avoids the per-file processing overhead, making it highly efficient for the described volume of “millions of events per hour.” It is the premier solution when latency must be measured in seconds.

Option A, using the Kafka Connector with Snowpipe, is a very common and robust pattern, but it is not the lowest latency solution. This connector automates the process of collecting messages from Kafka topics, staging them as files in an internal (or external) stage, and then triggering Snowpipe to load them. This process, while automated, introduces latency. It takes time to buffer messages, create a file, flush it to the stage, and for Snowpipe to pick up the notification and process the file. This typically results in latencies of minutes (e.g., 1-2 minutes) rather than the seconds achievable with Snowpipe Streaming.

Option B involves building a custom application to micro-batch data. This approach is highly inefficient and complex. Executing COPY INTO commands manually requires a running virtual warehouse. If a warehouse is run continuously to handle 1-minute micro-batches, the credit consumption will be substantial, as the warehouse consumes credits for every second it is active, even if it’s idle between batches. If the warehouse is resumed for each batch, the resume/suspend latency is added to the process, and the 60-second minimum billing for warehouse compute makes this financially punitive for frequent, small batches.

Option D is functionally similar to Option A but implies a different trigger mechanism. Using the Snowpipe REST API for every new file is a valid pattern, but if “millions of events per hour” are being batched into many small files, this can lead to high notification overhead. More importantly, it still relies on the file-based Snowpipe process, which, as established, has higher inherent latency than the direct streaming API. Snowpipe Streaming (Option C) is unequivocally the superior choice for high-volume, row-based, low-latency streaming.

Q142. A data science team is performing complex analytical queries on a 50TB table of customer transactions. Queries frequently filter on the TRANSACTION_TIMESTAMP column and the CUSTOMER_CATEGORY column. The CUSTOMER_CATEGORY column has very low cardinality (e.g., only 10 distinct values: ‘New’, ‘Bronze’, ‘Silver’, ‘Gold’, etc.), while TRANSACTION_TIMESTAMP has extremely high cardinality. To optimize query performance, a data architect decides to define a clustering key. Which clustering key definition would be the most effective and efficient?

A CLUSTER BY (TRANSACTION_TIMESTAMP, CUSTOMER_CATEGORY)
B CLUSTER BY (CUSTOMER_CATEGORY, TRANSACTION_TIMESTAMP)
C CLUSTER BY (TRANSACTION_TIMESTAMP)
D CLUSTER BY (CUSTOMER_CATEGORY)

Answer: B

Explanation: Option B provides the most logical and effective clustering strategy for this scenario. When defining a multi-column clustering key, Snowflake recommends ordering the keys from lowest cardinality to highest cardinality. The CUSTOMER_CATEGORY column has extremely low cardinality (10 distinct values), while TRANSACTION_TIMESTAMP has very high cardinality. By placing CUSTOMER_CATEGORY first, Snowflake will physically co-locate all records for ‘Bronze’ customers, then all records for ‘Silver’ customers, and so on. Within each of those large blocks, the data will then be sorted by TRANSACTION_TIMESTAMP. This drastically improves query performance for filters on CUSTOMER_CATEGORY because Snowflake’s micro-partition pruning can skip massive numbers of partitions that do not contain the requested category. Any subsequent filtering on TRANSACTION_TIMESTAMP will then be highly efficient within that already-pruned subset of data.
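
A minimal sketch of how this key could be defined, assuming the table is named CUSTOMER_TRANSACTIONS (the name is illustrative):

-- Order the key from lowest to highest cardinality
ALTER TABLE customer_transactions
  CLUSTER BY (customer_category, transaction_timestamp);

-- Clustering health can then be inspected with:
SELECT SYSTEM$CLUSTERING_INFORMATION(
  'customer_transactions',
  '(customer_category, transaction_timestamp)');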

Option A is incorrect because it reverses the recommended cardinality order. By clustering on TRANSACTION_TIMESTAMP first, the data is sorted chronologically. Because CUSTOMER_CATEGORY is a low-cardinality column, its values will be scattered randomly across almost every micro-partition. A query filtering WHERE CUSTOMER_CATEGORY = ‘Gold’ would find ‘Gold’ customers in partitions from January, February, March, etc. This ordering provides almost no pruning benefit for the CUSTOMER_CATEGORY column, defeating one of the primary goals of clustering.

Option C, clustering only by TRANSACTION_TIMESTAMP, would be effective for time-based range queries (e.g., WHERE TRANSACTION_TIMESTAMP BETWEEN ‘X’ AND ‘Y’). However, it completely ignores the other common filter predicate, CUSTOMER_CATEGORY. As explained for Option A, this would result in the low-cardinality category values being spread across all micro-partitions, leading to poor pruning and high I/O when filtering by category.

Option D, clustering only by CUSTOMER_CATEGORY, is suboptimal. While it would provide excellent pruning for filters on the category, it does nothing to organize the data within that category. A query for ‘Gold’ customers in the last 30 days (WHERE CUSTOMER_CATEGORY = ‘Gold’ AND TRANSACTION_TIMESTAMP > …) would force Snowflake to scan all micro-partitions containing ‘Gold’ customers, even those from years prior, just to find the recent records. Option B solves this by creating a “sorted index” on the timestamp within each category block, allowing for efficient pruning on both columns.

Q143. A marketing department needs to analyze customer data but must be prevented from seeing sensitive Personally Identifiable Information (PII) such as email addresses and phone numbers. A data governance team wants to implement a solution that automatically redacts this data for the MARKETING_ROLE but allows the HR_ROLE to see the original, unredacted data. The solution must be centrally managed and automatically apply to any query run by the MARKETING_ROLE, without requiring them to use a special view. Which Snowflake feature is designed for this exact requirement?

A Creating a secure view that uses a CASE statement to show NULL for PII columns based on the CURRENT_ROLE() function.
B Implementing a Row-Level Access Policy on the table.
C Implementing a Dynamic Data Masking (DDM) policy on the PII columns.
D Granting SELECT access to the MARKETING_ROLE only on the non-PII columns.

Answer: C

Explanation: Option C is the correct solution. Dynamic Data Masking (DDM) is a column-level security feature designed precisely for this use case. A masking policy is a schema-level object that defines how data should be transformed at query time based on specific conditions, most commonly the user’s active role. The administrator would create a policy (e.g., email_mask) that states: CASE WHEN CURRENT_ROLE() = ‘HR_ROLE’ THEN val ELSE ‘***REDACTED***’ END. This policy is then applied to the EMAIL_ADDRESS column. When a user with MARKETING_ROLE runs SELECT EMAIL_ADDRESS FROM customers, Snowflake’s query optimizer automatically and transparently rewrites the query to apply the mask, and the user sees only ‘***REDACTED***’. The user with HR_ROLE runs the exact same query and sees the actual data. This is centrally managed, secure, and transparent to the end-user.
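
A minimal sketch of such a policy, assuming a CUSTOMERS table with an EMAIL_ADDRESS column (object names are illustrative):

-- Masking policy that returns the real value only to HR_ROLE
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() = 'HR_ROLE' THEN val ELSE '***REDACTED***' END;

-- Attach the policy to the PII column; it is then enforced on every query automatically
ALTER TABLE customers MODIFY COLUMN email_address SET MASKING POLICY email_mask;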

Option A, creating a secure view, is an older method of achieving a similar result. The marketing team would be given access to v_customers_masked instead of the base table. This view would use a CASE statement with CURRENT_ROLE() to determine what to show. While functional, this approach has drawbacks. It proliferates the number of objects in the database (a view for every masked table) and can be difficult to manage at scale. Dynamic Data Masking is the modern, more robust, and centrally governed feature that replaces this pattern.

Option B, a Row-Level Access Policy, is the incorrect type of policy. As the name implies, row-level access policies are used to filter entire rows based on a user’s role. For example, a policy could be used to ensure a sales manager for the ‘WEST’ region can only see customer rows where REGION = ‘WEST’. It does not redact or mask data within a column; it hides the whole row, which is not the requirement here.

Option D, granting SELECT on specific columns, is a form of column-level security but fails the requirement. If the MARKETING_ROLE is denied access to the EMAIL_ADDRESS column entirely, they cannot query it. A SELECT * query would fail, and a SELECT EMAIL_ADDRESS query would fail. The requirement is not to deny access to the column, but to show a masked version of the data in that column, which is the specific function of Dynamic Data Masking.

Q144. An organization has multiple Snowflake accounts across different business units and cloud providers (e.g., AWS and Azure). The central data governance team needs to query metadata about all warehouses, users, and roles across all accounts from a single, centralized administrative account. Which Snowflake feature enables this cross-account, cross-cloud federated metadata querying?

A Creating a Data Exchange and inviting all accounts to be providers.
B Using the ORGANIZATION_USAGE schema.
C Replicating all local ACCOUNT_USAGE schemas to the central account.
D Querying the INFORMATION_SCHEMA in the central account.

Answer: B

Explanation: Option B is the correct answer. The ORGANIZATION_USAGE schema is a special, shared schema available only to accounts that have been designated as an “organization” administrator. This feature is designed specifically for centralized monitoring and governance across multiple accounts within the same organization, even if those accounts are in different regions or on different cloud platforms. It contains views (like WAREHOUSE_METERING_HISTORY, LOGIN_HISTORY, USER_HISTORY, etc.) that are pre-federated, meaning they automatically aggregate metadata from all accounts linked to the organization. This allows the central team to run a single query (e.g., SELECT * FROM ORGANIZATION_USAGE.WAREHOUSE_METERING_HISTORY) and get results from the AWS and Azure accounts simultaneously without any complex setup.

Option A, creating a Data Exchange, is for sharing data with external or internal consumers in a hub-and-spoke model. While one could technically create shares of ACCOUNT_USAGE data and list them in an exchange, this is not the intended, out-of-the-box solution for centralized administrative metadata monitoring. The ORGANIZATION_USAGE schema is built precisely for this administrative purpose.

Option C, replicating ACCOUNT_USAGE schemas, is overly complex and not the most direct solution. While you could set up replication for each account to push its ACCOUNT_USAGE data to the central account, this requires significant configuration for each account. The ORGANIZATION_USAGE feature (Option B) provides this functionality automatically with minimal setup, simply by defining the organization structure with Snowflake support. Replication is a powerful tool, but it’s not the specific feature designed for federated administrative metadata.

Option D is incorrect. The INFORMATION_SCHEMA is a schema that exists within each database. It contains metadata only for the objects within that specific database (e.g., tables, views, procedures). It has no visibility outside its own database, let alone outside its own account. The ACCOUNT_USAGE schema provides metadata at the account level, and the ORGANIZATION_USAGE schema provides metadata at the organization (multi-account) level.

Q145. A large raw data table (raw_events) stores JSON payloads in a VARIANT column. A business intelligence team needs to run frequent, high-performance queries that aggregate data from specific nested attributes within this JSON (e.g., variant_payload:device.type and variant_payload:user.region). The queries are running slowly due to the need to scan and parse the large JSON structures at query time. Which Snowflake feature should be implemented to dramatically accelerate these specific, predictable aggregate queries without changing the raw data table?

A Create a clustering key on the VARIANT column.
B Create a Materialized View that extracts and aggregates the nested attributes.
C Use the Search Optimization Service on the VARIANT column.
D Create a standard (non-materialized) view that flattens the VARIANT data.

Answer: B

Explanation: Option B is the correct solution. A Materialized View (MV) is a pre-computed data set. In this scenario, the MV would be defined to extract the specific nested attributes (device.type, user.region) from the VARIANT column and, ideally, pre-aggregate them (e.g., GROUP BY these attributes and COUNT(*) events). When the BI team runs a query that matches the MV’s definition, Snowflake’s query optimizer will automatically and transparently rewrite the query to read from the much smaller, pre-computed, and pre-structured MV instead of the massive raw VARIANT table. This eliminates the on-the-fly JSON parsing and scanning, providing a dramatic performance boost for these specific, repeatable queries. The MV is automatically maintained and refreshed by Snowflake as new data arrives in the base table.
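
A minimal sketch of such a materialized view over the raw_events table and variant_payload column named in the question (the view name and aggregation are illustrative; materialized views require Enterprise Edition or higher):

-- Pre-extract and pre-aggregate the nested attributes the BI queries group on
CREATE MATERIALIZED VIEW mv_event_counts AS
SELECT
  variant_payload:device.type::STRING AS device_type,
  variant_payload:user.region::STRING AS user_region,
  COUNT(*) AS event_count
FROM raw_events
GROUP BY variant_payload:device.type::STRING,
         variant_payload:user.region::STRING;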

Option A, creating a clustering key on a VARIANT column, is not supported. Clustering keys must be defined on standard structured columns, not semi-structured types. Even if it were, clustering organizes data storage but does not eliminate the need to parse the JSON at query time; it would just (hypothetically) co-locate similar JSON documents, which is not the primary bottleneck.

Option C, using the Search Optimization Service, is designed to accelerate point lookup queries (e.g., WHERE variant_payload:user.id = ‘xyz-123’). It builds a persistent data structure to optimize equality and substring searches. It is not designed to accelerate large-scale analytical aggregate queries, which is the stated requirement. The query profile is analytical (GROUP BY, COUNT), not a point lookup.

Option D, creating a standard view, provides no performance benefit. A standard view is simply a stored SELECT statement. When the BI team queries the view, Snowflake executes the view’s definition at query time. This means the raw VARIANT table would still be scanned, and the JSON flattened and parsed for every single query, which is the exact problem that needs to be solved. It offers logical simplicity but zero performance acceleration.

Q146. A data governance policy requires that all warehouse compute costs must be strictly segregated and billed back to the specific department (e.g., ‘Finance’, ‘Marketing’, ‘Sales’) that initiated the queries. Furthermore, if any department’s warehouse usage exceeds 10,000 credits in a calendar month, all subsequent queries from that department must be automatically suspended to prevent budget overruns. Which combination of Snowflake objects achieves both of these requirements?

A A separate resource monitor for each department, assigned to their respective warehouses, with a ‘Suspend’ action at 10,000 credits.
B A single, account-level resource monitor with a ‘Notify’ action at 10,000 credits.
C A virtual warehouse for each department, with the STATEMENT_TIMEOUT_IN_SECONDS parameter set to a low value.
D Using role-based access control (RBAC) to grant warehouse usage only to departmental roles.

Answer: A

Explanation: Option A is the correct and complete solution. Resource monitors are the specific Snowflake objects designed for managing and controlling warehouse credit consumption. To meet the requirements, one would first create a separate virtual warehouse for each department (e.g., FINANCE_WH, MARKETING_WH). Then, a separate resource monitor (e.g., FINANCE_MONITOR) would be created for each department. This monitor would be assigned to its corresponding warehouse (or warehouses). This assignment allows for the clear segregation and tracking of costs, as the monitor only tracks usage for the warehouses it’s assigned to. Finally, the monitor’s action would be set to SUSPEND (or SUSPEND_IMMEDIATE) when the credit_quota of 10,000 is reached within the specified interval (monthly). This directly addresses both requirements: segregation of cost tracking (per-monitor) and automatic suspension on budget overrun.
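
A minimal sketch for one department (the warehouse and monitor names are illustrative):

-- One monitor per department, suspending the warehouse at the 10,000-credit monthly quota
CREATE RESOURCE MONITOR finance_monitor
  WITH CREDIT_QUOTA = 10000
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 100 PERCENT DO SUSPEND;

-- Assign the monitor to the department's dedicated warehouse
ALTER WAREHOUSE finance_wh SET RESOURCE_MONITOR = finance_monitor;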

Option B is incorrect because a single, account-level resource monitor would track the total credits consumed by all warehouses in the account. It could not differentiate between ‘Finance’ and ‘Marketing’ and would suspend all queries for the entire account once the 10,000 credit limit is hit, which is not the desired behavior.

Option C is incorrect. The STATEMENT_TIMEOUT_IN_SECONDS parameter is a query governance control that automatically aborts a single query if it runs longer than the specified time. This is used to prevent runaway queries; it has absolutely no awareness of or control over cumulative credit consumption and cannot suspend a warehouse based on a monthly budget.

Option D is a necessary part of a good security model but does not solve the problem. Using RBAC to ensure only ‘Finance’ users can use the FINANCE_WH is a best practice. However, this only controls access; it does not monitor or control the cost or consumption of that warehouse. The resource monitor (Option A) is the missing piece that adds the financial governance and automatic suspension capabilities on top of the RBAC model.

Q147. A data operations team is analyzing the performance of a nightly batch load job. They notice that the COPY INTO command frequently encounters data type conversion errors and other data quality issues from the source Parquet files. The business requirement is to load all valid rows and automatically shunt only the erroneous rows into a separate “exceptions” table for later review, without failing the entire load operation. Which COPY INTO option is designed for this specific error-handling requirement?

A ON_ERROR = ABORT_STATEMENT
B VALIDATION_MODE = RETURN_ALL_ERRORS
C PURGE = TRUE
D ON_ERROR = ‘CONTINUE’

Answer: D

Explanation: Option D is the correct answer. The ON_ERROR = ‘CONTINUE’ copy option instructs the COPY command not to abort the operation when it encounters a data error (such as a type mismatch or a null in a non-nullable column). Instead, it skips the problematic rows and continues processing the remainder of the file, which directly fulfills the requirement to “load all valid rows” without failing the entire job. On its own, ON_ERROR = ‘CONTINUE’ only skips the invalid rows; to “shunt” the erroneous rows aside, the load is typically followed by a call to the VALIDATE table function, which returns the records rejected by a previous COPY job so that they can be inserted into an exceptions table for review. The INFORMATION_SCHEMA.COPY_HISTORY table function can also be consulted for per-load error counts. Either way, CONTINUE is the essential option that enables this workflow.
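
A minimal sketch of this workflow, assuming the stage's file format is already configured and that an exceptions table is built from the output of the most recent COPY job (stage and table names are illustrative):

-- Load all valid rows, skipping rows that fail conversion
COPY INTO target_table
FROM @raw_stage
ON_ERROR = 'CONTINUE';

-- Capture the records rejected by that COPY job in an exceptions table for review
CREATE OR REPLACE TABLE load_exceptions AS
SELECT * FROM TABLE(VALIDATE(target_table, JOB_ID => '_last'));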

Option A, ON_ERROR = ABORT_STATEMENT, is the default behavior. If any error is encountered in any file, the entire COPY operation is rolled back, and no data is loaded. This is the exact opposite of the stated requirement.

Option B, VALIDATION_MODE = RETURN_ALL_ERRORS, is a validation tool, not a load tool. When this mode is used, the COPY command simulates the load, parses the files, and returns a list of all errors it would have encountered. It does not actually load any data, valid or invalid. This is useful for testing a COPY command before running it, but it does not accomplish the goal of loading the good data.

Option C, PURGE = TRUE, is a data lifecycle management option. When set to TRUE, Snowflake automatically deletes the source files from the stage after they have been successfully loaded. This has no bearing on error handling. If the load fails (e.g., with the default ABORT_STATEMENT), the files are not purged. This option is unrelated to how data quality errors are processed during the load.

Q148. A development team is creating a new data application. They need to create a named, reusable SQL logic block that can dynamically construct and execute DDL (Data Definition Language) and DML (Data Manipulation Language) statements. For example, the logic needs to receive a table name as a parameter, then create a new table (CREATE TABLE …), and finally insert data into it (INSERT INTO …). Which Snowflake object should they create?

A A SQL User-Defined Function (UDF).
B A JavaScript User-Defined Function (UDF).
C A Snowflake Stored Procedure.
D A Secure View.

Answer: C

Explanation: Option C is the correct answer. Snowflake Stored Procedures are designed specifically for executing procedural logic, which includes dynamically building and executing SQL statements. A stored procedure can accept parameters (like a table name), use string concatenation or other logic to build a SQL command as a string, and then execute that command using functions like EXECUTE IMMEDIATE. Stored procedures can also execute DDL (CREATE, ALTER, DROP) and DML (INSERT, UPDATE, DELETE), and they can chain multiple statements together within a transaction. This perfectly matches the requirement. Stored procedures can be written in several languages, including JavaScript, Python, Java, and Snowflake Scripting (SQL).
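
A minimal sketch in Snowflake Scripting, assuming the caller supplies the table name (the logic is purely illustrative):

CREATE OR REPLACE PROCEDURE build_and_load(table_name STRING)
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
  ddl_stmt STRING;
  dml_stmt STRING;
BEGIN
  -- Build the DDL and DML as strings from the parameter, then execute them dynamically
  ddl_stmt := 'CREATE TABLE IF NOT EXISTS ' || table_name || ' (id INT, loaded_at TIMESTAMP_NTZ)';
  dml_stmt := 'INSERT INTO ' || table_name || ' SELECT 1, CURRENT_TIMESTAMP()';
  EXECUTE IMMEDIATE ddl_stmt;
  EXECUTE IMMEDIATE dml_stmt;
  RETURN 'Created and loaded ' || table_name;
END;
$$;

CALL build_and_load('demo_table');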

Option A, a SQL UDF, is incorrect. SQL UDFs are scalar or tabular functions designed to return a single value or a set of rows. They are intended to be used within a SELECT statement (e.g., SELECT my_sql_udf(column_name) FROM table). They are explicitly prohibited from executing DDL or DML statements and cannot perform transactions.

Option B, a JavaScript UDF, is also incorrect for the same fundamental reason as a SQL UDF. While JavaScript UDFs offer more complex procedural logic than SQL UDFs, they are still “functions” that are called within a query and are intended to return a value. They operate in a restricted environment and are not permitted to execute DML or DDL commands.

Option D, a Secure View, is completely unrelated. A view (secure or otherwise) is a stored SELECT query. It is a read-only object that provides a logical representation of data. It cannot execute DDL, cannot execute DML, and cannot accept parameters in the way a procedure does. Its purpose is data abstraction and security, not procedural execution.

Q149. A company shares a large production table (customer_data) with an external partner using Snowflake’s Secure Data Sharing. The partner, who has their own Snowflake account, needs to query this shared data and join it with their own local tables (partner_sales_data). The partner’s queries are running slowly. The partner’s data architect determines that query performance would improve significantly if a clustering key was defined on the shared customer_data table’s region_id column. Who is responsible for and capable of creating this clustering key?

A The partner (consumer) can define a clustering key on the shared table within their own account.
B The provider must define the clustering key on the original customer_data table in their account.
C The partner (consumer) must import the shared data into a new table in their account and then cluster that new table.
D Clustering keys cannot be used on shared tables by either the provider or the consumer.

Answer: B

Explanation: Option B is the correct answer. Secure Data Sharing is a read-only mechanism. The consumer (the partner) receives a live, read-only pointer to the provider’s data. The consumer cannot modify the shared data or its underlying physical structure in any way. This includes defining clustering keys, adding or dropping columns, or changing data. All physical data management, including clustering, remains the sole responsibility and capability of the data provider. If the consumer requires better performance via clustering, they must request that the provider apply clustering to the original source table in the provider’s account. This change will then be automatically and immediately reflected in the consumer’s query performance, as they are querying the exact same micro-partitions.
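
Concretely, the change the provider would make in their own account on the source table might look like this:

-- Run by the provider; the consumer's queries benefit transparently
ALTER TABLE customer_data CLUSTER BY (region_id);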

Option A is incorrect because, as stated, the consumer has no write or DDL privileges on the shared database or its objects. They cannot alter the shared table to add a clustering key.

Option C describes a workaround, not the solution. The partner could create a new table in their account (partner_customer_data), run a CREATE TABLE … AS SELECT … to copy all the shared data, and then cluster this new, locally owned table. However, this completely defeats the purpose of Secure Data Sharing. The data is no longer “live” (it’s a static copy), the partner now pays for storage of this duplicated data, and they must build a pipeline to keep this copy in sync. This is not the intended or proper use of the sharing feature. The correct solution is for the provider to cluster the source table.

Option D is false. Clustering keys work perfectly with shared tables and are a primary way for providers to offer high-performance data products to their consumers. The physical optimization (clustering) is done by the provider on the source table and is transparently leveraged by the consumer.

Q150. A user executes a complex query that takes over 30 minutes to run. A warehouse administrator needs to investigate why the query was slow. They want to see a graphical, step-by-step breakdown of the query’s execution plan, identify which operators spilled to remote storage, how many micro-partitions were scanned, and which join operator was the most expensive. Which Snowflake feature provides this detailed level of diagnostic information for a completed query?

A Querying the ACCOUNT_USAGE.QUERY_HISTORY view.
B Running EXPLAIN PLAN FOR <query>.
C The Query Profile in the Snowsight UI.
D The SYSTEM$EXPLAIN_PLAN_JSON function.

Answer: C

Explanation: Option C is the correct answer. The Query Profile is the primary graphical and statistical tool used for “query tuning” and diagnosing the performance of completed (or currently running) queries. It is accessible through the query history in the Snowsight UI. It provides a visual, node-based representation of the execution plan, showing the flow of data between different operators (e.g., TableScan, Join, Aggregate, Filter). For each operator, it provides detailed statistics, including the number of micro-partitions scanned, the number of rows produced, the time spent in that step, and, crucially, whether the operator “spilled” data to local or remote disk, which is a common source of slowness. This directly matches all parts of the administrator’s requirement.

Option A, querying the ACCOUNT_USAGE.QUERY_HISTORY view, provides high-level metadata about the query’s execution. It will show the query text, the user, the warehouse used, the total execution time, the number of bytes scanned, etc. However, it does not provide the step-by-step operator tree or details about “spilling.” It tells you that the query was slow, but not why (i.e., which specific join or aggregation was the bottleneck).

Option B, running EXPLAIN PLAN, is a static analysis tool. It shows the logical execution plan that Snowflake intends to use to run the query. It does not show the actual execution statistics from a completed run. It won’t tell you how many partitions were scanned or if spilling actually occurred. It’s a predictive tool, whereas Query Profile is a diagnostic tool for a query that has already run.

Option D, SYSTEM$EXPLAIN_PLAN_JSON, is simply a function that returns the same logical plan as EXPLAIN PLAN (Option B), but in a JSON format. It shares the same limitations: it is a logical plan, not an actual execution profile with runtime statistics.

Q151. A data architect is designing a table to store user clickstream data, which arrives as deeply nested JSON. The two most common query patterns are: 1) Point lookups to find a specific event by its event_id (e.g., WHERE event_payload:event_id = ‘…’). 2) Analytical queries that aggregate events by event_type (e.g., GROUP BY event_payload:event_type). The data must be stored in its raw JSON form in a VARIANT column. Which performance optimization features should be applied to the table to accelerate both query patterns?

A A clustering key on event_payload:event_type.
B The Search Optimization Service.
C A clustering key on event_payload:event_id and a Materialized View for event_type aggregates.
D The Search Optimization Service and a clustering key on an expression of event_payload:event_type.

Answer: D

Explanation: Option D is the most comprehensive and correct solution to address both requirements. The Search Optimization Service (SOS) is specifically designed to accelerate point-lookup queries (equality and substring searches) on semi-structured data. Enabling SOS on the VARIANT column will create a persistent search structure that allows Snowflake to find rows matching WHERE event_payload:event_id = ‘…’ without scanning the entire table, dramatically speeding up the first query pattern. Secondly, analytical queries that GROUP BY event_payload:event_type benefit from clustering. While you cannot cluster on the VARIANT column directly, you can create a clustering key on an expression, such as CLUSTER BY (event_payload:event_type::STRING). This will physically co-locate all rows of the same event_type in the same micro-partitions, allowing Snowflake to prune partitions efficiently for the GROUP BY query, which accelerates the second query pattern.
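
A minimal sketch of both optimizations applied together, assuming the table is named clickstream_events (the name is illustrative, and search optimization is an Enterprise Edition feature):

-- Accelerate point lookups on fields inside the VARIANT column
ALTER TABLE clickstream_events
  ADD SEARCH OPTIMIZATION ON EQUALITY(event_payload);

-- Accelerate the analytical pattern by clustering on the low-cardinality expression
ALTER TABLE clickstream_events
  CLUSTER BY (event_payload:event_type::STRING);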

Option A is incorrect because it only addresses the second query pattern (analytics). It does nothing to optimize the first query pattern (point lookups), which would still require a full table scan.

Option B is incorrect because it only addresses the first query pattern (point lookups). The Search Optimization Service is not designed for and does not accelerate large-scale analytical aggregation queries, which would still be slow.

Option C is incorrect. You cannot define a clustering key on event_payload:event_id if it’s an attribute inside a VARIANT column, unless you use an expression. More importantly, clustering is ineffective for high-cardinality values like a unique event_id. The data would be spread across all partitions, and clustering would provide no pruning benefit. A Materialized View could accelerate the GROUP BY query, but it adds maintenance overhead and storage costs. The combination in Option D (SOS + clustering on the low-cardinality expression) is a more direct and efficient way to optimize the base table for both patterns.

Q152. A company’s data retention policy mandates that all data in the TRANSACTIONS table must be recoverable for exactly 90 days after it is modified or deleted. After 90 days, the data must be immediately and permanently purged from all recovery systems and storage. The table is part of a database replication setup for business continuity. Which combination of settings correctly enforces this policy?

A Set DATA_RETENTION_TIME_IN_DAYS = 90 for the table and FAIL_SAFE_PERIOD_IN_DAYS = 0.
B Set DATA_RETENTION_TIME_IN_DAYS = 90 for the table. Fail-safe is automatically 7 days and cannot be disabled.
C Set DATA_RETENTION_TIME_IN_DAYS = 83 for the table and rely on the 7-day Fail-safe period.
D This policy cannot be implemented because Fail-safe is non-configurable and data participating in replication is retained indefinitely.

Answer: A

Explanation: Option A is the correct answer. The DATA_RETENTION_TIME_IN_DAYS parameter controls the Time Travel period; setting it to 90 ensures that data remains recoverable by users for exactly 90 days after it is modified or deleted. Fail-safe is a separate recovery period (7 days by default for permanent tables) that begins only after the Time Travel period ends, so with the default Fail-safe in place the data would actually remain recoverable for up to 97 days in total. Because the policy demands that data be permanently purged immediately after 90 days, the Fail-safe contribution must be eliminated as well. Time Travel is configurable per object, whereas Fail-safe is not normally user-configurable: transient and temporary tables have no Fail-safe period at all, and for permanent tables the 7-day default can generally only be changed through an arrangement with Snowflake. The only combination that satisfies both halves of the policy, exactly 90 days of recoverability followed by immediate purging, is Time Travel set to 90 together with a Fail-safe period of 0, which is what Option A describes.

Option B is incorrect because it ignores the Fail-safe period. If Time Travel is 90 days, the data would still be recoverable for an additional 7 days (total 97 days) via Fail-safe, which violates the “immediately…purged” requirement.

Option C is an incorrect calculation. Fail-safe follows Time Travel; it does not run concurrently. Setting Time Travel to 83 days would mean user-driven recovery is only possible for 83 days, and the total recovery period (including Fail-safe) would be 90 days. This meets the “purged after 90 days” part, but it fails the “recoverable for exactly 90 days” part, as users lose their Time Travel capability after only 83 days.

Option D is incorrect. Fail-safe is configurable (though often requires contacting Snowflake support to change from the default 7), and data is not retained indefinitely. The combination of Time Travel and Fail-safe defines the total retention.

Q153. A data engineering team maintains a large-scale data pipeline using Snowflake Streams and Tasks. A stream named source_stream captures changes on a table. A task named process_task runs every 5 minutes to consume this stream. The team observes that the process_task runs every 5 minutes, even when source_stream is empty, leading to unnecessary warehouse usage and a long history of “empty” task runs. How can the team modify the task to only run its SQL logic when the source_stream actually contains new data?

A Add WHERE METADATA$ACTION = ‘INSERT’ to the task’s SQL definition.
B Change the task’s SCHEDULE from ‘5 MINUTE’ to WHEN SYSTEM$STREAM_HAS_DATA(‘source_stream’).
C Add a WHEN SYSTEM$STREAM_HAS_DATA(‘source_stream’) clause to the task definition.
D Recreate the stream with the SHOW_INITIAL_ROWS = TRUE parameter.

Answer: C

Explanation: Option C is the correct solution. A Snowflake task definition includes an optional WHEN clause that specifies a boolean SQL expression. If this expression evaluates to TRUE, the task’s body (the BEGIN…END block) is executed. If it evaluates to FALSE, the task skips its execution for that scheduled run. The SYSTEM$STREAM_HAS_DATA function is designed precisely for this purpose. By adding WHEN SYSTEM$STREAM_HAS_DATA(‘source_stream’) to the process_task definition (while keeping the SCHEDULE = ‘5 MINUTE’), the task will wake up every 5 minutes as scheduled, evaluate the WHEN clause, and only if the stream has data will it resume a warehouse and execute its SQL logic. This prevents the cost and clutter of unnecessary runs.
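
A minimal sketch of the adjusted task definition (the warehouse and target table are illustrative):

CREATE OR REPLACE TASK process_task
  WAREHOUSE = etl_wh
  SCHEDULE = '5 MINUTE'
WHEN
  SYSTEM$STREAM_HAS_DATA('source_stream')
AS
  INSERT INTO target_table
  SELECT * FROM source_stream;  -- consuming the stream advances its offset

ALTER TASK process_task RESUME;  -- a newly created task starts in the suspended state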

Option A is incorrect. METADATA$ACTION is a metadata column within a stream, used inside the DML statement (e.g., in the MERGE command) to identify whether a row is an INSERT or DELETE. It has no bearing on the task’s execution condition and cannot be used in the WHEN clause in this way.

Option B is incorrect because it attempts to replace the SCHEDULE with the WHEN clause. A standalone task must have a SCHEDULE (only child tasks in a task graph use AFTER instead); the WHEN clause is an additional predicate evaluated at the scheduled time, not a replacement for the schedule.

Option D is incorrect. The SHOW_INITIAL_ROWS parameter is a stream creation parameter that determines whether the stream is populated with the current rows in the table at the moment of stream creation. It has no effect on how a task consumes the stream after it has been created.

Q154. A development team is using the Snowflake Python Connector to build an application. The application needs to load a large Pandas DataFrame (1 million rows) from the client machine’s memory directly into a Snowflake table named TARGET_TABLE. The team wants to achieve the highest possible throughput and avoid manually writing the DataFrame to a local CSV file and using PUT. Which Python Connector function is purpose-built for this high-performance, in-memory DataFrame load?

A snowflake.connector.connect().cursor().execute(“INSERT INTO …”) in a loop.
B snowflake.connector.pandas_tools.write_pandas()
C snowflake.connector.connect().cursor().executemany()
D snowflake.connector.connect().cursor().execute(“PUT …”) followed by execute(“COPY INTO …”).

Answer: B

Explanation: Option B is the correct and most efficient method. The snowflake.connector.pandas_tools module includes the write_pandas() function, which is highly optimized for this exact use case. This function intelligently handles the DataFrame “under the hood.” It serializes the DataFrame from memory (e.g., to Parquet), compresses it, automatically streams it to an internal stage (without the user needing to manage files), and then executes a COPY INTO command to load the staged data into the target table. This entire process is orchestrated by a single function call and provides bulk-loading performance, which is vastly superior to row-by-row insertion.

Option A is hideously inefficient. Executing an INSERT statement in a loop would perform one million separate network round-trips and one million single-row transactions. This is the slowest possible way to load data and would take hours, if it completed at all.

Option C, executemany(), is a significant improvement over Option A. It allows the connector to batch the INSERT statements, reducing network overhead. However, it is still performing INSERTs, which are not as efficient as the bulk-loading COPY command. The write_pandas() function is superior because it leverages the COPY command’s parallel, bulk-loading capabilities.

Option D describes the correct logical steps but is not the function the developer uses. The write_pandas() function performs these steps (staging and copying) automatically. A developer could manually save the DataFrame to a file, use the PUT command to upload it, and then run COPY INTO, but this is more complex, requires local disk I/O, and is precisely the manual work that the write_pandas() utility is designed to eliminate.

Q155. A database administrator (DBA) needs to grant a new role, ANALYST_ROLE, the ability to create new virtual warehouses in the account. The DBA wants to follow the principle of least privilege and not grant excessive permissions. Which single privilege, when granted to ANALYST_ROLE, will allow a user with that role to execute the CREATE WAREHOUSE command?

A GRANT USAGE ON ACCOUNT TO ROLE ANALYST_ROLE
B GRANT CREATE WAREHOUSE ON ACCOUNT TO ROLE ANALYST_ROLE
C GRANT MANAGE WAREHOUSES ON ACCOUNT TO ROLE ANALYST_ROLE
D GRANT OWNERSHIP ON WAREHOUSE <wh_name> TO ROLE ANALYST_ROLE

Answer: B

Explanation: Option B is the correct answer. In Snowflake’s access control model, the ability to create new objects at the account level (such as warehouses, users, roles, or databases) is governed by CREATE <object_type> privileges. To allow a role to create warehouses, the CREATE WAREHOUSE privilege must be granted at the account level. The syntax is GRANT CREATE WAREHOUSE ON ACCOUNT TO ROLE <role_name>. This grant is specific and adheres to the principle of least privilege, as it does not grant any other, more powerful permissions.

Option A, GRANT USAGE ON ACCOUNT, is not a valid privilege. Privileges like USAGE apply to specific object types, such as databases or warehouses, not to the account itself.

Option C, GRANT MANAGE WAREHOUSES, is a powerful global privilege. A role with this privilege can not only CREATE warehouses but also ALTER and DROP any warehouse in the account, even those it does not own. While this would allow the analyst to create a warehouse, it violates the principle of least privilege by granting far more power (alter/drop all) than was requested (create new).

Option D, GRANT OWNERSHIP ON WAREHOUSE, is used to transfer ownership of an existing warehouse to a new role. It cannot be used to grant creation privileges for new warehouses that do not yet exist.

Q156. An analytics query is performing a large, complex join between a 10TB FACT_SALES table and a 5GB DIM_STORE table. The Query Profile reveals that a significant amount of time is spent in a “TableScan” operator on the FACT_SALES table, and the “Pruning” statistic shows that 9,500 out of 10,000 total partitions were scanned. The join condition is fact.store_key = dim.store_key, and a common filter is dim.region = ‘North America’. The FACT_SALES table is currently not clustered. How can this query be optimized to significantly reduce the number of partitions scanned from the FACT_SALES table?

A Define a clustering key on the store_key column in the DIM_STORE table.
B Define a clustering key on the store_key column in the FACT_SALES table.
C Increase the size of the virtual warehouse to X-Large.
D Create a Materialized View that pre-joins the two tables.

Answer: B

Explanation: Option B is the correct optimization. The problem is that Snowflake is scanning 95% of the 10TB FACT_SALES table (the TableScan operator) because the data needed for the join is scattered across almost all micro-partitions. By defining a clustering key on the store_key (the join key) in the FACT_SALES table, Snowflake will physically co-locate all rows for a given store_key (e.g., all sales for ‘Store 101’) into the same micro-partitions. When a query filters on the dimension table (e.g., dim.region = ‘North America’), Snowflake can first identify the list of store_keys that are in North America and then use that list to prune the FACT_SALES table before the join. Because the fact table is now clustered by store_key, Snowflake can skip all partitions that do not contain the relevant store keys, drastically reducing the I/O and improving performance.
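
A sketch of the change, using the table and column named in the question:

-- Cluster the large fact table on the join key so partition pruning can take effect
ALTER TABLE fact_sales CLUSTER BY (store_key);

-- The effect can be checked over time with:
SELECT SYSTEM$CLUSTERING_DEPTH('fact_sales', '(store_key)');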

Option A is incorrect. Clustering the DIM_STORE table will have a negligible impact on performance. This table is only 5GB; it is tiny compared to the 10TB fact table and likely fits in the warehouse cache anyway. The performance bottleneck is the scan on the fact table, so the optimization must be applied to the fact table.

Option C, increasing the warehouse size, is a “brute force” method. This might make the query run faster by providing more compute resources to scan the 9,500 partitions, but it does not solve the underlying problem. It doesn’t reduce the amount of I/O; it just does the I/O more quickly. The more efficient, “smart” solution is to reduce the I/O via clustering (Option B).

Option D, creating a Materialized View, is a plausible but different optimization. An MV would be very fast, as the join would be pre-computed. However, it comes with its own trade-offs: it consumes storage for the materialized results, and it adds compute overhead for maintenance as the base tables change. Clustering (Option B) is a more direct optimization of the base table’s physical layout to improve join performance, which is often the first and best step.

Q157. A data provider wants to share a secure view (ANALYTICS.V_SALES_SUMMARY) with an external business partner. The business partner does not have their own Snowflake account and has no plans to become a Snowflake customer. The provider wants to give the partner read-only SQL access to this view without incurring any compute costs for the partner’s queries. Which type of account should the provider create for this partner?

A A full Snowflake account with a data share.
B A trial Snowflake account.
C A reader account.
D A separate user and role within the provider’s own account.

Answer: C

Explanation: Option C is the correct answer. A “reader account” is a specific type of Snowflake account designed for this exact use case: sharing data with a consumer who is not an existing Snowflake customer. The provider can create one or more reader accounts, which are owned by and billed to the provider. The provider then creates a share and grants access to this share to the reader account. The partner (consumer) can log in to the reader account, which looks and feels like a regular Snowflake account, but they can only access the data shared with them. They cannot create their own tables or load their own data. Crucially, all compute (virtual warehouse) and storage costs incurred by the reader account are billed back to the provider, which aligns with the provider’s requirement.
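
A minimal sketch of provisioning a reader account and attaching it to a share (all names, credentials, and the account identifier are illustrative):

-- Run by the provider: the reader account is owned by and billed to the provider
CREATE MANAGED ACCOUNT partner_reader
  ADMIN_NAME = 'partner_admin',
  ADMIN_PASSWORD = 'ChangeMe_Example_123',
  TYPE = READER;

-- Add the reader account to the existing share containing the secure view
ALTER SHARE sales_share ADD ACCOUNTS = <reader_account_locator>;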

Option A is incorrect because it assumes the partner has their own full account, which contradicts the prompt.

Option B, a trial account, is incorrect. A trial account is a temporary, 30-day trial for a new customer. It is not a mechanism for data sharing. At the end of 30 days, the account would be suspended, and it would not be billed to the provider.

Option D is a very bad practice. Creating a user for an external partner within the provider’s own account is a significant security risk. It co-mingles the partner’s activity with the provider’s internal operations and makes cost allocation and governance extremely difficult. Reader accounts were created specifically to provide a secure, isolated environment that avoids this.

Q158. A company has a virtual warehouse configured as a multi-cluster warehouse in Auto-scale mode. The configuration is: Minimum Clusters: 1, Maximum Clusters: 8. The scaling policy is set to ‘Economy’. The warehouse is currently running with 3 active clusters. A sudden, massive spike in concurrent queries arrives, which would require 6 clusters to handle. What will be Snowflake’s immediate response?

A Immediately start 3 new clusters to bring the total to 6.
B Immediately start 5 new clusters to bring the total to the maximum of 8.
C Place all new queries in a queue and not start any new clusters immediately.
D Shut down the 3 running clusters and restart the warehouse in Maximized mode.

Answer: C

Explanation: Option C is the correct answer. The key detail is the scaling policy: ‘Economy’. The ‘Economy’ policy is designed to conserve credits by being very conservative about starting new clusters. It prioritizes fully utilizing the currently running clusters. It will only start a new cluster if it calculates that there is enough queued query load to keep the new cluster busy for at least 5-6 minutes. A “sudden, massive spike” will first result in all new queries being queued. The ‘Economy’ policy will wait to see if this queue is sustained. It will not react “immediately” to the spike. This is the fundamental trade-off of the ‘Economy’ policy: you save credits, but you sacrifice performance and risk queuing during sudden bursts.
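
A sketch of the warehouse configuration described in the question (the warehouse name and size are illustrative; multi-cluster warehouses require Enterprise Edition or higher):

CREATE OR REPLACE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 8
  SCALING_POLICY = 'ECONOMY'  -- conserves credits; 'STANDARD' reacts faster to queuing
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE;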

Option A is incorrect. This describes the behavior of the ‘Standard’ scaling policy. The ‘Standard’ policy is designed to minimize queuing and will react very aggressively (within seconds or tens of seconds) to start new clusters as soon as a queue is detected.

Option B is incorrect. No scaling policy will automatically scale to the maximum defined limit unless the query load actually justifies that many clusters. It scales to meet the demand, not the limit.

Option D is nonsensical. A warehouse’s mode (Auto-scale vs. Maximized) is a static configuration and cannot be changed dynamically by the system in response to query load.

Q159. A developer needs to write a SQL query to parse a JSON array stored in a VARIANT column. The column V contains the following JSON: [ { “id”: “a”, “val”: 10 }, { “id”: “b”, “val”: 20 } ]. The developer wants to produce a relational table output with two rows, one for “a” and one for “b”. Which function or operator is required to “un-nest” or “pivot” the JSON array into separate rows?

A The FLATTEN function.
B The PARSE_JSON function.
C The OBJECT_CONSTRUCT function.
D The CHECK_JSON function.

Answer: A

Explanation: Option A is the correct answer. The FLATTEN function is a table function specifically designed to “un-nest” semi-structured data. When applied to a VARIANT column containing a JSON array, it will produce a new row for each element in that array. It is typically used with a LATERAL JOIN (or the comma-separated equivalent) to join the flattened results back to the source table. A query would look like: SELECT t.value:id::STRING AS id, t.value:val::INT AS val FROM my_table, LATERAL FLATTEN(input => V) t;. This would produce the exact relational output required.
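
The same query, formatted as a runnable sketch (assuming the source table is named my_table):

-- LATERAL FLATTEN turns each array element into its own output row
SELECT
  t.value:id::STRING AS id,
  t.value:val::INT AS val
FROM my_table,
     LATERAL FLATTEN(input => V) t;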

Option B, PARSE_JSON, is used to convert a VARCHAR (string) that contains JSON text into a VARIANT data type. The data in this question is already in a VARIANT column, so PARSE_JSON is not needed.

Option C, OBJECT_CONSTRUCT, is the opposite of FLATTEN. It is used to create a new OBJECT (JSON object) from key-value pairs, typically during a SELECT statement. It is for data construction, not data deconstruction.

Option D, CHECK_JSON, is a validation function. It simply checks if a VARCHAR contains valid JSON text and returns a boolean or error message. It does not parse or un-nest the data.

Q160. An account administrator is auditing role-based access control (RBAC) in their Snowflake account. They need to find a definitive list of all roles that have been granted directly to a specific user, JSMITH. They also need to see all roles that have been granted to other roles, forming a hierarchy. Which SHOW commands provide this information?

A SHOW GRANTS TO USER JSMITH and SHOW GRANTS OF ROLE <role_name>.
B SHOW ROLES and SHOW USERS.
C SHOW GRANTS ON ACCOUNT and SHOW GRANTS ON DATABASE.
D SHOW GRANTS TO ROLE <role_name> and SHOW GRANTS OF ROLE <role_name>.

Answer: A

Explanation: Option A correctly identifies the two commands needed to audit this. SHOW GRANTS TO USER JSMITH will list all roles that have been directly granted to that user principal. This answers the first part of the requirement. SHOW GRANTS OF ROLE <role_name> does the reverse: it shows all users and other roles that have been granted a specific role. By executing this command for each role, the administrator can map out the entire role hierarchy (which roles are granted to other roles).
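
For example (the role name is illustrative):

-- Roles granted directly to the user
SHOW GRANTS TO USER JSMITH;

-- Users and roles that have been granted a given role (maps the role hierarchy)
SHOW GRANTS OF ROLE ANALYST_ROLE;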

Option B is incorrect. SHOW ROLES just lists all roles that exist in the account. SHOW USERS just lists all users that exist. Neither command shows the relationships (grants) between them.

Option C is incorrect. SHOW GRANTS ON ACCOUNT (or ON DATABASE) shows the permissions (e.g., USAGE, CREATE SCHEMA) that have been granted to roles on that specific object. It does not show the grants of roles to users or to other roles.

Option D is incorrect because it confuses two similar commands. SHOW GRANTS TO ROLE <role_name> shows the privileges (like SELECT on a table or USAGE on a warehouse) granted to a role. SHOW GRANTS OF ROLE <role_name> shows the users and roles that have been given that role. This second command is useful, but the first command (SHOW GRANTS TO ROLE…) does not show which roles are granted to a user. Option A (SHOW GRANTS TO USER…) is the correct command for that specific purpose.

 
