Question 101:
What is the purpose of the collect_list() aggregation function?
A) To count list elements
B) To collect all values in a column into an array
C) To concatenate strings
D) To gather distributed data
Answer: B
Explanation:
The collect_list() aggregation function collects all values from a column into an array within each group when used with groupBy(), or into a single array for the entire DataFrame when used without grouping. Unlike the collect() action, which brings data to the driver, collect_list() is an aggregation function that creates array columns containing multiple values. It preserves duplicate values; element order reflects the order in which rows are processed and is not guaranteed to be deterministic after shuffles.
Usage example: df.groupBy("customer_id").agg(collect_list("product_id").alias("products")) creates an array of all products for each customer. The result is a DataFrame where each row contains a customer ID and an array of their product IDs. This is useful for creating nested structures, preparing data for array operations, or pivoting data from long to wide format.
A confuses collection with counting. C describes string concatenation functions. D describes the collect() action, not the aggregation function.
Understanding collect_list() is valuable for data restructuring and creating nested schemas. Common use cases include aggregating time-series data into arrays for each entity, collecting tags or categories for items, preparing data for machine learning features that expect array inputs, or creating JSON-compatible nested structures. The related collect_set() function is similar but removes duplicates and doesn’t guarantee order. Be cautious with collect_list() on columns with many values per group, as large arrays can cause memory issues. For very large arrays, consider whether you truly need all values or if sampling or limits would suffice. Collect_list() works well with explode() for reshaping data: explode flattens arrays into rows, while collect_list aggregates rows into arrays.
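A minimal PySpark sketch of this pattern, using made-up customer/product data; the column names are illustrative only:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame(
    [(1, "a"), (1, "b"), (2, "a"), (2, "a")],
    ["customer_id", "product_id"],
)

# collect_list keeps duplicates; collect_set drops them
per_customer = orders.groupBy("customer_id").agg(
    F.collect_list("product_id").alias("products"),
    F.collect_set("product_id").alias("distinct_products"),
)
per_customer.show(truncate=False)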
Question 102:
Which method is used to add a new column to a DataFrame?
A) addColumn()
B) withColumn()
C) insertColumn()
D) newColumn()
Answer: B
Explanation:
The withColumn() method is used to add a new column or replace an existing column in a DataFrame. This method takes two parameters: the column name and a Column expression defining its values. If the column name already exists, it’s replaced; otherwise, a new column is added. This is the primary method for DataFrame column transformations and feature engineering.
Usage example: df.withColumn("price_with_tax", col("price") * 1.1) adds a calculated column. The Column expression can use existing columns, literals, functions, and complex logic. You can chain multiple withColumn() calls for sequential transformations, though for many columns, select() might be more efficient. Because DataFrames are immutable, withColumn() returns a new DataFrame rather than modifying the original.
A, C, and D use incorrect method names that don’t exist in Spark’s DataFrame API.
Understanding withColumn() is fundamental to DataFrame transformations. It’s used extensively for data cleaning, type conversions, calculations, and feature engineering. Common patterns include deriving new features from existing columns, applying functions or UDFs, casting data types, and creating flags or categories based on conditions. When adding many columns, chaining withColumn() calls creates intermediate DataFrame objects; using select() with all transformations at once can be more efficient. WithColumn() works with any Column expression, including complex combinations of functions, when() conditionals, and user-defined functions. The operation is a transformation, not an action, so it’s lazily evaluated. Use withColumn() liberally to create readable, maintainable transformation pipelines.
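A short sketch contrasting chained withColumn() calls with an equivalent single select(); the data and column names are invented for illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("widget", 10.0), ("gadget", 25.0)], ["name", "price"])

# Add columns one at a time...
with_tax = (
    df.withColumn("price_with_tax", F.col("price") * 1.1)
      .withColumn("is_expensive", F.col("price") > 20)
)

# ...or express the same result in a single select
same_result = df.select(
    "*",
    (F.col("price") * 1.1).alias("price_with_tax"),
    (F.col("price") > 20).alias("is_expensive"),
)
with_tax.show()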
Question 103:
What is the purpose of the flatten() function?
A) To remove nested structures
B) To flatten arrays of arrays into a single array
C) To normalize data
D) To compress data structures
Answer: B
Explanation:
The flatten() function transforms arrays of arrays into a single flat array by concatenating all nested arrays. This function is useful when working with nested data structures where you need to collapse multiple levels of arrays into one dimension. It only works one level deep, flattening arrays of arrays into simple arrays, but doesn’t recursively flatten deeper nesting.
Usage example: df.withColumn("flat_array", flatten(col("array_of_arrays"))) converts a column like [[1,2], [3,4], [5,6]] into [1,2,3,4,5,6]. This is commonly needed when processing JSON data with nested arrays, combining results from multiple array operations, or preparing data for operations that expect flat arrays.
A is too general; flatten() specifically works on array structures. C and D mischaracterize the function’s purpose.
Understanding flatten() is valuable for working with complex nested data. It’s particularly useful in JSON processing where nested arrays are common. Flatten() complements other array functions like explode() which converts arrays to rows, and array() which creates arrays from columns. Common patterns include flattening after array transformations that produce nested arrays, preprocessing for array aggregations, or preparing data for machine learning that expects flat feature arrays. Flatten() only handles one level of nesting; for deeper nesting, you may need to apply it multiple times or use custom logic. The function preserves element order from the nested arrays. When working with arrays of different lengths, flatten() concatenates them all, potentially losing information about original groupings.
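A small sketch showing one level of flattening on invented data:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([[1, 2], [3, 4], [5, 6]],)], ["array_of_arrays"])

# [[1,2],[3,4],[5,6]] -> [1,2,3,4,5,6]; deeper nesting would need another flatten()
df.withColumn("flat_array", F.flatten(F.col("array_of_arrays"))).show(truncate=False)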
Question 104:
Which function is used to generate a UUID column?
A) uuid()
B) generate_uuid()
C) createUUID()
D) uniqueid()
Answer: A
Explanation:
The uuid() function generates a universally unique identifier (UUID) for each row in a DataFrame. This function creates random UUID strings in standard UUID format (e.g., “550e8400-e29b-41d4-a716-446655440000”), which are guaranteed to be unique with extremely high probability. UUIDs are useful for creating unique identifiers when no natural key exists and when you need globally unique IDs across different systems.
Usage example: df.withColumn("id", uuid()) adds a UUID column to each row. Unlike monotonically_increasing_id(), which generates numeric IDs with gaps and dependencies on partitioning, uuid() generates truly random, universally unique identifiers independent of data distribution. Each call to uuid() generates different UUIDs, so the generated IDs are non-deterministic across executions.
B, C, and D use incorrect function names that don’t exist in Spark SQL.
Understanding UUID generation is important for creating unique identifiers in distributed systems. UUIDs are ideal when you need identifiers that are globally unique across different systems, databases, or time periods. They’re particularly useful in microservices architectures, distributed databases, or when merging data from multiple sources. The random nature of UUIDs means they don’t provide ordering information unlike sequential IDs. For scenarios requiring ordered or consecutive IDs, consider monotonically_increasing_id() or row_number(). UUID generation has minimal performance overhead but produces string values that are larger than numeric IDs, consuming more storage and memory. UUIDs are not sortable in meaningful ways and don’t provide information about insertion order or relationships.
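A hedged sketch: depending on the Spark version, uuid() may not be exposed directly in pyspark.sql.functions, but the SQL built-in is always reachable through expr():

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# uuid() is non-deterministic: every execution produces different values
with_ids = df.withColumn("id", F.expr("uuid()"))
with_ids.show(truncate=False)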
Question 105:
What is the purpose of the cube() operation in Spark SQL?
A) To calculate cubic values
B) To create multi-dimensional aggregations with subtotals
C) To partition data into cubes
D) To compress data
Answer: B
Explanation:
The cube() operation creates multi-dimensional aggregations by computing aggregates for all possible combinations of specified grouping columns, including subtotals and a grand total. This generates a comprehensive result set that includes aggregate values for every combination of grouping columns, aggregates for each individual column, and an overall aggregate. The operation is particularly useful for OLAP-style reporting and business intelligence applications where you need multiple levels of aggregation in a single query.
For example, df.cube("region", "product").agg(sum("sales")) produces aggregates for region and product together, region alone, product alone, and a grand total. The result includes rows where some grouping columns are null, representing subtotal levels. This is much more efficient than running multiple separate groupBy queries for each combination, as cube() computes all combinations in a single pass through the data.
The cube operation is related to rollup(), but while rollup() creates hierarchical aggregations respecting column order, cube() creates all possible combinations regardless of hierarchy. For N grouping columns, cube() generates 2^N grouping combinations. This can produce large result sets with many rows, so use it judiciously. Common use cases include creating pivot tables, generating comprehensive sales reports with multiple dimensions, and preparing data for OLAP cubes. NULL values in result rows mark aggregation levels; use grouping() or grouping_id() to distinguish actual NULL data from these aggregation markers. Understanding cube() enables efficient multi-dimensional analysis without writing complex union queries. It's optimized by Spark's query planner to minimize data scanning and computation overhead.
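A compact sketch of cube() with grouping_id() to label the aggregation level; the region/product/sales data is invented:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sales = spark.createDataFrame(
    [("east", "A", 10), ("east", "B", 20), ("west", "A", 5)],
    ["region", "product", "sales"],
)

# 2^2 = 4 grouping combinations: (region, product), (region), (product), grand total
cubed = sales.cube("region", "product").agg(
    F.sum("sales").alias("total_sales"),
    F.grouping_id().alias("grouping_level"),  # separates subtotal rows from real NULLs
)
cubed.orderBy("grouping_level", "region", "product").show()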
Question 106:
Which method is used to write data to a JSON file?
A) df.write.json()
B) df.save.json()
C) df.export.json()
D) df.output.json()
Answer: A
Explanation:
The df.write.json() method writes DataFrame data to JSON files using the DataFrameWriter API. This method converts each row of the DataFrame into a JSON object and writes multiple JSON files based on the DataFrame’s partitioning. The method supports various options for controlling output format, compression, and write behavior. JSON is a common interchange format that’s human-readable and widely supported across different systems and programming languages.
When writing JSON, each row becomes a separate JSON object on its own line, creating JSON Lines format by default. You can specify options like compression codec, write mode (append, overwrite, error, ignore), and partitioning columns. For example, df.write.mode("overwrite").json(path) writes JSON files to the specified path, replacing any existing data. The number of output files depends on the DataFrame's partition count, which you can control using coalesce() or repartition() before writing.
JSON format is useful for data exchange between systems, especially when working with web services or applications expecting JSON input. However, JSON is less efficient than binary formats like Parquet for storage and processing. It’s human-readable, making it valuable for debugging and data inspection, but has larger file sizes and slower read/write performance compared to columnar formats. JSON naturally handles nested structures and arrays, making it suitable for semi-structured data. When writing large datasets for production analytics, consider Parquet instead. Use JSON primarily for small to medium datasets, data exchange scenarios, or when human readability is important. The write operation supports partitionBy() to organize output into directories based on column values.
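A brief write sketch; the output path is a placeholder, not a real location:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# Each row becomes one JSON object per line (JSON Lines format)
(df.coalesce(1)                        # control the number of output files
   .write
   .mode("overwrite")                  # append / overwrite / error / ignore
   .option("compression", "gzip")
   .json("/tmp/example_output/people_json"))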
Question 107:
What is the purpose of the array_contains() function?
A) To count array elements
B) To check if an array contains a specific value
C) To concatenate arrays
D) To sort array elements
Answer: B
Explanation:
The array_contains() function checks whether an array column contains a specific value, returning a boolean result. This function is essential for filtering and searching within array-typed columns, enabling queries that need to find rows where arrays include particular elements. It’s commonly used with semi-structured data, tags, categories, or any scenario where multiple values are stored in array columns.
Usage example: df.filter(array_contains(col("tags"), "spark")) returns rows where the tags array contains "spark". The function takes two arguments: the array column and the value to search for. It returns true if the value exists anywhere in the array, false otherwise. For null arrays or null search values, it returns null. This function is particularly valuable when working with JSON data or denormalized schemas where related values are stored as arrays rather than separate rows.
Array_contains() is more efficient than exploding arrays and filtering because it operates directly on the array structure without creating additional rows. Common use cases include filtering products by tags, finding records with specific attributes, searching through multi-valued fields, and implementing recommendation logic. The function performs equality checks, so it works with any data type that supports equality comparison. For more complex array operations, Spark provides additional functions like array_intersect(), array_union(), and array_except(). When filtering on multiple array values, you can combine array_contains() calls with AND/OR operators. For finding arrays that contain any of several values, consider using array_intersect() with a literal array. Understanding array functions enables efficient querying of nested and semi-structured data without expensive unnesting operations.
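A minimal filtering sketch on an invented tags column:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
posts = spark.createDataFrame(
    [("post1", ["spark", "sql"]), ("post2", ["python"])],
    ["id", "tags"],
)

# Keep rows whose tags array includes "spark"; no explode needed
posts.filter(F.array_contains("tags", "spark")).show(truncate=False)

# Combine checks with boolean operators for multi-value filters
posts.filter(
    F.array_contains("tags", "spark") | F.array_contains("tags", "python")
).show(truncate=False)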
Question 108:
Which method is used to get the first N rows of a DataFrame?
A) first()
B) head()
C) take()
D) All of the above
Answer: D
Explanation:
All three methods can retrieve the first N rows from a DataFrame, though they differ slightly in behavior and return types. The head() and take() methods are essentially equivalent, both returning an array of Row objects containing the first N rows. The first() method without arguments returns just the first row, not multiple rows. Understanding these methods helps choose the appropriate one for data inspection and sampling needs.
The take(n) method returns an array of the first n rows as Row objects to the driver program. For example, df.take(10) retrieves the first 10 rows. Similarly, df.head(10) does the same thing. These methods are actions that trigger computation and bring data from executors to the driver, so use them carefully with small values to avoid memory issues. The head() method without arguments (df.head()) returns the first row, equivalent to first(). These methods are useful for quick data inspection during development and debugging.
Unlike limit() which returns a DataFrame, these methods return local collections to the driver. They’re efficient for small samples because Spark only computes enough partitions to satisfy the request, stopping once N rows are found. Common use cases include previewing data during development, validating transformations on small samples, and extracting small result sets for further processing. For displaying formatted results in notebooks, use show() instead, which prints a table. When retrieving results for processing, consider whether you need a local collection (take/head) or a DataFrame (limit). Remember that the “first” rows returned depend on data partitioning and aren’t deterministic without ordering. For reproducible samples, sort data before using these methods.
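A quick comparison of the row-retrieval methods, using a generated range of IDs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100)

rows_take = df.take(5)    # list of Row objects returned to the driver
rows_head = df.head(5)    # equivalent list of Row objects
first_row = df.first()    # single Row, same as head() with no argument
small_df = df.limit(5)    # still a DataFrame, stays distributed

print(rows_take)
print(first_row)
small_df.show()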
Question 109:
What is the purpose of the when() function in Spark SQL?
A) To filter data by time
B) To create conditional expressions
C) To schedule operations
D) To validate data
Answer: B
Explanation:
The when() function creates conditional expressions for transforming column values based on conditions, similar to SQL CASE WHEN statements or if-else logic in programming. This function enables complex conditional logic within DataFrame operations, allowing different transformations based on column values or expressions. It’s fundamental for implementing business rules, data categorization, and feature engineering with conditional logic.
The when() function is used with otherwise() to create complete conditional expressions. The syntax is when(condition, value).when(another_condition, another_value).otherwise(default_value). For example, df.withColumn("category", when(col("age") < 18, "minor").when(col("age") < 65, "adult").otherwise("senior")) creates categories based on age ranges. Conditions are evaluated in order, with the first matching condition determining the result. The otherwise() clause provides a default value when no conditions match.
You can chain multiple when() clauses for complex multi-condition logic, and nest them for more sophisticated scenarios. Each when() accepts any boolean expression, including complex combinations of multiple columns and functions. The function works with all data types and can return different types for different conditions as long as they’re compatible. Common use cases include categorizing continuous values into bins, implementing business rules for data transformation, creating flags or indicators, handling missing values with conditional logic, and deriving features based on multiple column conditions. When() is more performant than UDFs for conditional logic because it’s optimized by Catalyst. For very complex conditional logic with many branches, consider whether the logic should be simplified or refactored. The function integrates seamlessly with other DataFrame operations and can be used in select(), withColumn(), filter(), and other contexts.
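A short sketch of the age-bucketing example from above, with made-up rows:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
people = spark.createDataFrame([("ann", 15), ("bob", 40), ("cid", 70)], ["name", "age"])

# Conditions are checked in order; otherwise() supplies the default
categorized = people.withColumn(
    "category",
    F.when(F.col("age") < 18, "minor")
     .when(F.col("age") < 65, "adult")
     .otherwise("senior"),
)
categorized.show()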
Question 110:
Which storage level stores data on disk only?
A) MEMORY_ONLY
B) DISK_ONLY
C) MEMORY_AND_DISK
D) OFF_HEAP
Answer: B
Explanation:
The DISK_ONLY storage level stores RDD or DataFrame data exclusively on disk without using any memory for caching. This storage level is appropriate when you have limited memory but sufficient disk space, and when you need to cache data that’s too large to fit in memory. While disk storage is much slower than memory, it’s still faster than recomputing data from scratch when the computation is expensive.
When you use persist(StorageLevel.DISK_ONLY), Spark writes RDD partitions to disk on each executor. Subsequent operations read from these disk files rather than recomputing the RDD through its lineage. This provides a tradeoff between memory usage and computation time. Reading from disk is slower than memory access but much faster than re-executing complex transformations, especially those involving shuffles or expensive operations. The disk storage includes serialization overhead, as data must be serialized before writing and deserialized when reading.
Use DISK_ONLY when memory is constrained but you have ample disk space, when cached data is large but recomputation is expensive, or when you need to cache many datasets simultaneously and can’t fit them all in memory. It’s also useful as a fallback strategy when MEMORY_AND_DISK configurations show that data frequently spills to disk anyway. Monitor the Storage tab in Spark UI to understand actual storage usage patterns. For most scenarios, MEMORY_AND_DISK is preferred as it uses memory when available and automatically spills to disk when necessary. Pure DISK_ONLY is less common but valuable in specific resource-constrained environments. Consider whether caching is necessary at all, as sometimes recomputation might be faster than disk I/O, especially for simple transformations.
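A minimal persistence sketch on synthetic data:

from pyspark import StorageLevel
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumn("squared", F.col("id") * F.col("id"))

# Cache partitions on executor disk only; nothing is kept in memory
df.persist(StorageLevel.DISK_ONLY)
df.count()                            # first action materializes the disk cache
df.groupBy().sum("squared").show()    # subsequent actions read from disk
df.unpersist()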
Question 111:
What is the purpose of the struct() function?
A) To validate data structures
B) To create structured columns from multiple columns
C) To define schemas
D) To organize partitions
Answer: B
Explanation:
The struct() function creates a structured column (StructType) by combining multiple columns into a single nested column. This function enables hierarchical data structures within DataFrames, allowing you to group related columns together logically. The resulting struct column can contain multiple fields of different types, creating nested schemas similar to objects in programming languages or nested JSON structures.
Usage example: df.withColumn("address", struct(col("street"), col("city"), col("state"), col("zipcode"))) creates a single address column containing four fields. You can access nested fields using dot notation: col("address.city") or using getField(): col("address").getField("city"). Struct columns are particularly useful when preparing data for systems expecting nested structures, organizing related fields together, or preparing data for complex array and map operations.
Creating struct columns helps organize related data logically, reduces the number of top-level columns, and facilitates operations on related fields as a unit. It’s commonly used when writing to formats like JSON or Parquet that support nested schemas, when interfacing with APIs expecting nested objects, or when implementing data models with natural hierarchical relationships. Struct columns can be nested arbitrarily deep, allowing complex hierarchical structures. You can also create arrays of structs, combining both structural patterns for sophisticated data modeling.
The inverse operation is using select with the asterisk notation (col(“struct_column.*”)) to flatten struct columns back into separate columns. Understanding struct() is important for working with semi-structured data, JSON processing, and creating efficient schemas that represent data relationships naturally. Struct columns are efficiently handled by Spark’s columnar storage formats like Parquet, which can read individual nested fields without loading entire struct columns.
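A small sketch showing struct creation, nested-field access, and flattening back out; the address fields are invented:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("1 Main St", "Springfield", "IL", "62701")],
    ["street", "city", "state", "zipcode"],
)

# Bundle related columns into one nested column
nested = df.select(F.struct("street", "city", "state", "zipcode").alias("address"))
nested.printSchema()
nested.select(F.col("address.city")).show()   # read one nested field
nested.select("address.*").show()             # flatten back to top-level columns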
Question 112:
Which method is used to perform approximate distinct count?
A) countDistinct()
B) approx_count_distinct()
C) estimateDistinct()
D) distinctCount()
Answer: B
Explanation:
The approx_count_distinct() function computes an approximate count of distinct values in a column efficiently using probabilistic algorithms, specifically HyperLogLog. This function provides much faster performance than exact distinct counts, especially on large datasets, by trading perfect accuracy for speed. It’s particularly valuable for large-scale analytics where approximate counts are acceptable and exact counts would be prohibitively expensive.
Usage example: df.agg(approx_count_distinct("user_id")).show() estimates the number of unique users. You can optionally specify a maximum relative standard deviation parameter to control accuracy: approx_count_distinct("user_id", 0.05) allows roughly 5% relative error. Lower values increase accuracy but require a larger sketch and more computation. The default accuracy is sufficient for most use cases, typically providing results within a few percent of exact counts.
The function uses HyperLogLog algorithm, which maintains a small, fixed-size sketch of the data rather than storing all distinct values. This makes it extremely memory-efficient and allows counting billions of distinct values with minimal overhead. The approximation becomes more accurate with larger datasets due to the statistical properties of HyperLogLog. Common use cases include counting unique users, distinct product views, unique IP addresses, or any scenario where exact counts aren’t critical but understanding scale is important.
For small datasets or when exact counts are required, use countDistinct() instead. Approx_count_distinct() shines with very large datasets, high-cardinality columns, or when counting distinct values across multiple columns. The function can be combined with groupBy() for approximate distinct counts per group. Understanding when approximate algorithms are acceptable enables significant performance improvements in big data analytics. The tradeoff between accuracy and performance makes this function essential for exploratory analysis and monitoring dashboards where perfect precision isn’t necessary.
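A sketch comparing the approximate and exact counts on generated data:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.range(100_000).select((F.col("id") % 5000).alias("user_id"))

events.agg(
    F.approx_count_distinct("user_id").alias("approx_users"),              # default rsd
    F.approx_count_distinct("user_id", 0.01).alias("tighter_estimate"),    # ~1% rsd
    F.countDistinct("user_id").alias("exact_users"),                       # exact but costlier
).show()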
Question 113:
What is the purpose of the array_union() function?
A) To join arrays
B) To combine arrays removing duplicates
C) To merge DataFrames
D) To concatenate arrays with duplicates
Answer: B
Explanation:
The array_union() function combines two array columns into a single array containing all unique elements from both arrays, removing duplicates. This function performs a set union operation on arrays, making it useful for merging collections while ensuring uniqueness. It’s particularly valuable when working with tags, categories, or any multi-valued attributes where you need to combine values without repetition.
Usage example: df.withColumn("all_tags", array_union(col("user_tags"), col("system_tags"))) creates a new array containing all unique tags from both sources. If an element appears in both input arrays, it appears only once in the result. The order of elements in the result array is not guaranteed. A null input array produces a null result.
Array_union() is one of several array manipulation functions in Spark. Related functions include array_intersect() which finds common elements, array_except() which finds elements in the first array not in the second, and array_distinct() which removes duplicates from a single array. These functions enable set operations on array columns without exploding arrays into rows, making them efficient for array manipulation.
Common use cases include combining tags or categories from multiple sources, merging feature sets for machine learning, consolidating multi-valued attributes after joins, and implementing set-based logic on collections. The function is particularly useful in data integration scenarios where different sources provide overlapping sets of values. For combining arrays without removing duplicates, use concat() function instead. Array_union() works with arrays of any comparable data type and is optimized by Spark’s query planner. Understanding array functions enables efficient manipulation of nested data structures without expensive array explosion and aggregation operations.
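A minimal sketch contrasting array_union() with related array functions, on invented tag arrays:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(["spark", "etl"], ["etl", "batch"])],
    ["user_tags", "system_tags"],
)

df.select(
    F.array_union("user_tags", "system_tags").alias("all_tags"),      # duplicates removed
    F.concat("user_tags", "system_tags").alias("all_with_dupes"),     # duplicates kept
    F.array_intersect("user_tags", "system_tags").alias("shared"),
).show(truncate=False)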
Question 114:
Which method is used to cast a column to a different data type?
A) convert()
B) cast()
C) asType()
D) changeType()
Answer: B
Explanation:
The cast() method converts a column to a different data type, enabling type transformations within DataFrame operations. This method is essential for ensuring columns have appropriate types for operations, correcting type inference issues, and preparing data for systems with specific type requirements. Cast() accepts either a data type object or a string representing the type name.
Usage example: df.withColumn("age", col("age").cast("integer")) or df.withColumn("age", col("age").cast(IntegerType())) converts the age column to integer type. Common casting operations include converting strings to numeric types, timestamps to dates, numeric types to strings, or adjusting precision of decimal types. When casting fails (like converting non-numeric strings to integers), the result is null rather than an error, allowing transformations to proceed with data quality issues flagged by nulls.
Cast() supports all Spark SQL data types including primitive types (IntegerType, DoubleType, StringType), date/time types (DateType, TimestampType), complex types (ArrayType, StructType, MapType), and DecimalType with specific precision and scale. String to numeric casting attempts to parse the string; string to date/timestamp casting requires proper format or uses default parsers. Numeric type conversions may lose precision when casting to types with less precision (double to integer truncates decimals).
Common use cases include correcting type inference from CSV or JSON files where Spark may infer incorrect types, preparing data for external systems with strict type requirements, converting between string representations and typed values, and adjusting numeric precision for calculations. Always validate casting results, especially when converting from strings, as invalid conversions produce nulls. For more control over string-to-date conversions, use to_timestamp() or to_date() with format specifications. Understanding type casting is fundamental to data engineering, ensuring data has correct types throughout processing pipelines.
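A short sketch showing a cast that silently produces NULL on bad input; the data is invented:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("42",), ("oops",)], ["age_str"])

converted = df.withColumn("age", F.col("age_str").cast(IntegerType()))  # or .cast("int")
converted.show()   # "oops" becomes NULL instead of raising an error

# Flag rows where the cast silently failed
converted.filter(F.col("age").isNull() & F.col("age_str").isNotNull()).show()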
Question 115:
What is the purpose of the expr() function?
A) To evaluate expressions
B) To create Column objects from SQL expression strings
C) To optimize queries
D) To validate syntax
Answer: B
Explanation:
The expr() function creates Column objects from SQL expression strings, allowing you to use SQL syntax within DataFrame operations. This function provides flexibility to write complex transformations using familiar SQL syntax rather than DataFrame method chains, bridging SQL and programmatic APIs. It’s particularly useful for complex expressions that are more readable in SQL format or when migrating SQL queries to DataFrame operations.
Usage example: df.withColumn("discount", expr("price * 0.1")) creates a calculated column using SQL expression syntax. You can use expr() for complex expressions: expr("CASE WHEN age < 18 THEN 'minor' ELSE 'adult' END") or expr("substring(name, 1, 3)"). The expression string can include SQL functions, operators, and conditional logic. The expr() call essentially compiles the SQL string into a Column object that integrates with DataFrame operations.
This function is valuable when you have existing SQL logic to port into Spark DataFrames, when SQL syntax is more concise or readable than method chains, or when dynamically constructing expressions from strings. Complex CASE statements, window functions, or nested function calls are often more readable as SQL expressions. However, programmatic DataFrame methods offer better IDE support, type checking, and refactoring capabilities.
Common use cases include porting legacy SQL logic to Spark, creating dynamic column transformations based on configuration, writing complex conditional logic more readably, and using SQL functions not yet available as DataFrame methods. The expr() function is parsed and optimized by Catalyst just like any other Column expression, so there’s no performance penalty. Combine expr() with DataFrame methods for flexible transformations that leverage both API strengths. Be cautious with dynamic expression construction to avoid SQL injection-style vulnerabilities if incorporating user input.
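A small sketch mixing expr() strings with regular column references; names and data are illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("ann", 15, 100.0), ("bob", 40, 250.0)], ["name", "age", "price"])

df.select(
    "name",
    F.expr("price * 0.1").alias("discount"),
    F.expr("CASE WHEN age < 18 THEN 'minor' ELSE 'adult' END").alias("category"),
    F.expr("substring(name, 1, 3)").alias("prefix"),
).show()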
Question 116:
Which function is used to calculate the length of a string column?
A) size()
B) length()
C) len()
D) count()
Answer: B
Explanation:
The length() function calculates the length of string columns, returning the number of characters in each string value. This function is fundamental for string analysis, validation, and filtering based on text length. It works with string-typed columns and returns integer values representing character counts, with null inputs producing null outputs.
Usage example: df.withColumn("name_length", length(col("name"))) adds a column showing the length of each name. This is commonly used for data validation (finding strings exceeding limits), filtering (selecting records with names longer than 10 characters), or analysis (understanding text field distributions). You can filter based on length: df.filter(length(col("description")) > 100) finds records with lengthy descriptions.
The length() function counts characters, not bytes, which is important for Unicode text where multi-byte characters each count as one character. For byte length, you might need to use different functions depending on requirements. Length() handles null values gracefully by returning null, allowing easy identification of missing values. It’s often combined with other string functions for comprehensive text processing.
Common use cases include validating data meets length requirements, identifying unusually short or long text entries that might indicate data quality issues, analyzing text field usage patterns, filtering records by description length, and creating features for machine learning based on text characteristics. Understanding string functions is essential for text processing pipelines. Related functions include trim() for removing whitespace, substring() for extracting portions of strings, and concat() for combining strings. Length calculations are computationally inexpensive and well-optimized by Spark. For finding lengths of arrays or maps rather than strings, use size() function instead.
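A brief sketch of length() for derivation and filtering, including NULL handling, on invented text:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("hello world",), ("hi",), (None,)], ["description"])

df.select("description", F.length("description").alias("n_chars")).show()  # NULL in, NULL out
df.filter(F.length("description") > 5).show()                              # keep longer text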
Question 117:
What is the purpose of the rlike() function?
A) To compare strings for equality
B) To match strings against regular expression patterns
C) To find similar strings
D) To search substrings
Answer: B
Explanation:
The rlike() function matches string columns against regular expression patterns, returning true for rows where the pattern matches and false otherwise. This function enables powerful pattern-based filtering and validation using regex syntax, making it essential for complex text searching and data quality checks. It’s equivalent to SQL’s RLIKE or REGEXP operators and supports full Java regular expression syntax.
Usage example: df.filter(col("email").rlike(r"^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,}$")) filters for valid email patterns (the dot before the top-level domain is escaped so it matches a literal period; a Python raw string avoids double-escaping). The regex pattern is specified as a string and can include character classes, quantifiers, anchors, and other regex features. Rlike() performs partial matching by default, meaning the pattern can match anywhere in the string unless anchors (^ for start, $ for end) are used.
This function is particularly useful for validating data formats like emails, phone numbers, postal codes, or identifiers following specific patterns. It’s also valuable for finding records matching complex search criteria, extracting records with specific patterns, implementing flexible filtering logic, and cleaning data by identifying malformed entries. Rlike() complements other pattern matching functions like like() which uses simpler SQL wildcards, and regexp_extract() which extracts matching substrings.
Common use cases include data validation during ETL, flexible search functionality, identifying patterns in log files or text data, and implementing business rules based on text patterns. Regular expressions can be complex and computationally expensive, so use them judiciously on large text fields. Test regex patterns thoroughly as they can be subtle and error-prone. For simple substring matching, contains() is more efficient. Understanding regex syntax is crucial for effective use of rlike(), enabling sophisticated text processing without UDFs. The function is case-sensitive by default; convert strings to lowercase for case-insensitive matching.
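A small sketch of regex filtering with rlike(); the email values and pattern are illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a@example.com",), ("not-an-email",)], ["email"])

# Anchored pattern; \. matches a literal dot
pattern = r"^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,}$"
df.filter(F.col("email").rlike(pattern)).show(truncate=False)

# Lower-case the column first for case-insensitive matching
df.filter(F.lower(F.col("email")).rlike(pattern)).show(truncate=False)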
Question 118:
Which method is used to compute covariance between two columns?
A) df.stat.cov()
B) df.covariance()
C) df.cov()
D) df.stat.covariance()
Answer: A
Explanation:
The df.stat.cov() method computes the sample covariance between two numeric columns in a DataFrame. Covariance measures how two variables change together, indicating whether they tend to increase or decrease simultaneously. Positive covariance suggests variables move in the same direction, negative covariance suggests opposite directions, and values near zero suggest little linear relationship. This statistical measure is fundamental for understanding relationships between variables in data analysis.
Usage example: covariance_value = df.stat.cov("age", "income") calculates covariance between age and income columns. The method returns a single numeric value representing the covariance. Unlike correlation which is normalized to [-1, 1], covariance values depend on the scale of the variables, making direct comparison across different variable pairs challenging. For standardized relationship measures, use correlation instead.
The stat property provides access to various statistical methods beyond covariance, including corr() for correlation, crosstab() for contingency tables, freqItems() for frequent items, and sampleBy() for stratified sampling. These methods enable comprehensive statistical analysis without collecting data to the driver. Covariance calculation requires two numeric columns and performs a distributed computation across all data.
Common use cases include exploratory data analysis to understand variable relationships, feature engineering for machine learning where understanding feature interactions matters, financial analysis for portfolio risk assessment, and validating assumptions about data relationships. Covariance is sensitive to data scale, so variables with larger numeric ranges dominate covariance values. For scale-independent relationship measures, prefer correlation. Understanding covariance helps in multivariate analysis, though correlation is typically more interpretable. Both measures only capture linear relationships; non-linear relationships may exist even with low covariance.
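A compact sketch computing covariance alongside correlation for context; the numbers are fabricated for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(25, 40000.0), (35, 62000.0), (45, 80000.0), (55, 90000.0)],
    ["age", "income"],
)

cov_value = df.stat.cov("age", "income")     # scale-dependent
corr_value = df.stat.corr("age", "income")   # normalized to [-1, 1]
print(f"covariance={cov_value:.1f}, correlation={corr_value:.3f}")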
Question 119:
What is the purpose of the crossJoin() method?
A) To perform inner joins
B) To create Cartesian product of two DataFrames
C) To join on multiple conditions
D) To merge DataFrames vertically
Answer: B
Explanation:
The crossJoin() method creates a Cartesian product of two DataFrames, combining every row from the first DataFrame with every row from the second DataFrame without any join condition. This operation produces a result with rows equal to the product of input row counts, making it potentially very large and expensive. Cross joins are occasionally necessary for specific algorithms but should be used cautiously due to their explosive growth characteristics.
Usage example: df1.crossJoin(df2) combines all rows from df1 with all rows from df2. If df1 has 100 rows and df2 has 50 rows, the result has 5000 rows. Unlike regular joins that match rows based on conditions, cross joins combine everything unconditionally. This is equivalent to a join without any ON clause in SQL or using cartesian() on RDDs.
Cross joins are necessary for certain operations like generating all combinations, creating comparison matrices, or implementing specific algorithmic patterns. However, they’re rarely needed in typical data processing and can cause severe performance issues or failures if used unintentionally. Common legitimate use cases include generating test data combinations, creating feature interactions for machine learning, computing distance matrices where every item must be compared with every other item, and implementing certain recommendation algorithms.
Always verify that cross join is truly necessary before using it. Often, proper join conditions can achieve desired results more efficiently. When cross join is unavoidable, filter data aggressively beforehand to minimize input sizes. Monitor result size carefully as it grows multiplicatively. The Spark UI will show extremely large shuffle writes for cross joins. Alternative patterns like broadcast variables or window functions might achieve similar results more efficiently. Understanding cross join behavior prevents accidental Cartesian products from missing join conditions, a common cause of job failures.
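A tiny sketch where the Cartesian product is intentional (all size/color combinations); inputs are kept deliberately small:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sizes = spark.createDataFrame([("S",), ("M",), ("L",)], ["size"])
colors = spark.createDataFrame([("red",), ("blue",)], ["color"])

# 3 x 2 = 6 rows; row counts multiply, so keep inputs small
combos = sizes.crossJoin(colors)
combos.show()
print(combos.count())   # 6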
Question 120:
Which function is used to extract day of month from a date?
A) day()
B) dayofmonth()
C) getDay()
D) extractDay()
Answer: B
Explanation:
The dayofmonth() function extracts the day of the month component from date or timestamp columns, returning an integer from 1 to 31. This function is part of Spark’s comprehensive date/time function library and enables temporal analysis, grouping, and filtering based on day-of-month values. It works with both Date and Timestamp data types, making it versatile for various temporal data processing needs.
Usage example: df.withColumn("day", dayofmonth(col("date_column"))) creates a column containing just the day portion of dates. This is frequently used alongside year() and month() for complete date parsing, or for analyzing patterns by day of month such as identifying billing cycles, paydays, or monthly trends. The function is useful for grouping transactions by day, filtering for specific days, or creating time-based features.
Spark provides comprehensive date/time extraction functions including year(), month(), dayofmonth(), dayofweek(), dayofyear(), hour(), minute(), and second(). These functions enable detailed temporal analysis without complex date arithmetic. Dayofmonth() specifically helps analyze monthly patterns, identify days with unusual activity, or create day-based aggregations. It’s commonly combined with groupBy() for daily aggregations within months.
Common use cases include analyzing sales patterns by day of month, implementing business logic triggered on specific days, creating temporal features for machine learning, and validating date values. When working with string dates, first convert using to_date() or to_timestamp() before applying extraction functions. For the day of the week, use dayofweek() instead, which returns 1 for Sunday through 7 for Saturday. Understanding date extraction functions is essential for time-series analysis and temporal data processing. These functions are optimized by Catalyst and more efficient than UDFs for date manipulation. Combine with date arithmetic functions like date_add() and date_sub() for comprehensive temporal operations.
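A short sketch extracting date parts after parsing string dates; the dates and column name are made up:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2024-03-15",), ("2024-03-01",)], ["date_str"])

parsed = df.withColumn("date", F.to_date("date_str"))   # convert string to DateType first
parsed.select(
    "date",
    F.dayofmonth("date").alias("day"),     # 1-31
    F.dayofweek("date").alias("dow"),      # 1 = Sunday ... 7 = Saturday
    F.month("date").alias("month"),
).show()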