Databricks Certified Generative AI Engineer Associate Exam Dumps and Practice Test Questions Set 3 Q 41-60


Question 41: 

A generative AI engineer needs to implement a RAG system that retrieves relevant context from a vector database before generating responses. Which component is responsible for converting user queries into vector embeddings?

A) Large language model

B) Embedding model

C) Vector database

D) Prompt template

Answer: B

Explanation:

The embedding model is responsible for converting user queries into vector embeddings in a RAG (Retrieval-Augmented Generation) system. This component transforms text into high-dimensional numerical vectors that capture semantic meaning, enabling similarity searches in vector databases to find relevant context documents.

Embedding models work by encoding text into fixed-size vector representations where semantically similar text produces similar vectors. In RAG systems, the same embedding model must be used for both indexing documents during ingestion and encoding user queries at runtime to ensure vector space consistency. When a user submits a query, the embedding model converts it into a vector, which is then used to search the vector database for documents with similar embeddings.

The embedding process is critical for RAG performance because high-quality embeddings determine retrieval accuracy. Popular embedding models include OpenAI’s text-embedding-ada-002, sentence-transformers models, and specialized domain-specific embedding models. These models are typically pre-trained on large text corpora and can capture semantic relationships, synonyms, and contextual meanings beyond simple keyword matching.

Once the query is embedded, the vector database performs similarity search using distance metrics like cosine similarity or Euclidean distance to find the most relevant documents. These retrieved documents are then passed as context to the large language model along with the original query, enabling the LLM to generate responses grounded in the retrieved information rather than relying solely on training data.
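
As a minimal sketch of this flow (assuming the sentence-transformers library and the all-MiniLM-L6-v2 model, neither of which is mandated here, plus an in-memory index in place of a real vector database):

```python
# Minimal sketch: embed a query with the same model used at ingestion time,
# then rank documents by cosine similarity. A production RAG system would
# query a managed vector database instead of an in-memory list.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Databricks Model Serving exposes REST endpoints for deployed models.",
    "Vector Search indexes embeddings for low-latency similarity queries.",
]
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

query = "How do I serve a model behind a REST API?"
query_embedding = embedder.encode(query, convert_to_tensor=True)

# Cosine similarity between the query vector and every document vector.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = scores.argmax().item()
print(documents[best], float(scores[best]))
```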

Option A is incorrect because the large language model generates responses but does not create query embeddings. Option C is wrong because the vector database stores and searches embeddings but does not create them. Option D is incorrect because prompt templates structure how information is presented to the LLM but do not generate embeddings.

Question 42: 

An engineer is fine-tuning a foundation model for a specific domain. Which technique adjusts only a small subset of model parameters while keeping most of the pre-trained weights frozen?

A) Full fine-tuning

B) Parameter-efficient fine-tuning (PEFT)

C) Transfer learning

D) Pre-training from scratch

Answer: B

Explanation:

Parameter-efficient fine-tuning (PEFT) adjusts only a small subset of model parameters while keeping most pre-trained weights frozen. This approach dramatically reduces computational requirements, memory usage, and training time compared to full fine-tuning while achieving comparable performance for domain-specific tasks.

PEFT encompasses several methods: LoRA (Low-Rank Adaptation), which adds trainable low-rank matrices alongside existing weight matrices; prefix tuning, which prepends trainable vectors to the attention keys and values at each layer; and adapter layers, which insert small trainable modules between frozen transformer layers. These methods typically update less than one percent of total model parameters, making fine-tuning feasible on consumer hardware.
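
As an illustration, a LoRA adapter can be attached to a Hugging Face causal LM with the peft library; the checkpoint name and rank below are arbitrary choices for the sketch, not values required by any exam objective:

```python
# Sketch: wrap a pre-trained causal LM with LoRA adapters via the peft library.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # any causal LM checkpoint

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection layer in GPT-2
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```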

The efficiency of PEFT stems from the hypothesis that task-specific adaptations lie in a low-dimensional subspace of the full parameter space. By constraining updates to this subspace, PEFT captures necessary task-specific knowledge without catastrophic forgetting of pre-trained capabilities. This is particularly valuable when working with large foundation models where full fine-tuning would require hundreds of gigabytes of GPU memory.

PEFT also provides practical benefits including faster training iterations enabling rapid experimentation, multiple task-specific adapters that can be swapped for different use cases without maintaining separate full models, and reduced storage requirements as only small adapter weights need to be saved. Organizations commonly use PEFT when adapting large language models for enterprise-specific terminology, industry jargon, or specialized reasoning patterns.

Option A is incorrect because full fine-tuning updates all model parameters, requiring significantly more computational resources. Option C is wrong because transfer learning is a broader concept that includes various adaptation methods, not specifically parameter-efficient approaches. Option D is incorrect because pre-training from scratch trains all parameters from random initialization rather than adapting existing models.

Question 43: 

A generative AI application needs to provide citations for information used in generated responses. Which RAG implementation feature enables tracking which documents contributed to the response?

A) Temperature sampling

B) Source attribution metadata

C) Prompt compression

D) Response caching

Answer: B

Explanation:

Source attribution metadata enables tracking which documents contributed to generated responses in RAG implementations, providing transparency and verifiability for generative AI applications. This feature maintains references to retrieved documents throughout the generation process, allowing the system to cite specific sources that informed the response.

Source attribution works by associating metadata with retrieved documents including document identifiers, titles, authors, URLs, page numbers, and timestamps. When the vector database returns relevant documents, this metadata is preserved alongside the content. After the LLM generates a response using the retrieved context, the system includes citations referencing the source documents, enabling users to verify information and explore original sources.

Implementation approaches include explicit citation formatting where the LLM is prompted to include inline citations in its response, post-processing attribution where the system analyzes which retrieved documents most influenced specific portions of the response, and confidence scoring that indicates how strongly each source contributed. Many RAG systems return both the generated response and a list of source documents with relevance scores.
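
A minimal sketch of this pattern, with the retriever and LLM client as placeholders rather than any specific vendor API:

```python
# Hypothetical sketch: keep document metadata alongside retrieved chunks and
# surface it as citations next to the generated answer.
def answer_with_citations(query, retriever, llm):
    hits = retriever.search(query, k=3)  # each hit: {"text", "title", "url", "score"}

    context = "\n\n".join(h["text"] for h in hits)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    answer = llm.generate(prompt)

    citations = [
        {"title": h["title"], "url": h["url"], "relevance": h["score"]}
        for h in hits
    ]
    return {"answer": answer, "sources": citations}
```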

Source attribution provides critical benefits for enterprise applications including compliance with information governance requirements, support for fact-checking and verification workflows, transparency that builds user trust in AI-generated content, and protection against hallucinations by grounding responses in verifiable sources. Legal, medical, and financial applications particularly benefit from robust source attribution capabilities.

Option A is incorrect because temperature sampling controls randomness in generation but does not track document sources. Option C is wrong because prompt compression reduces prompt size rather than enabling source tracking. Option D is incorrect because response caching improves performance but does not provide citation capabilities.

Question 44: 

An engineer needs to evaluate a generative AI model’s tendency to produce false or nonsensical information. Which evaluation metric specifically measures this behavior?

A) Perplexity

B) BLEU score

C) Hallucination rate

D) Token throughput

Answer: C

Explanation:

Hallucination rate specifically measures a generative AI model’s tendency to produce false or nonsensical information not supported by training data or provided context. This metric quantifies how often models generate plausible-sounding but factually incorrect content, which is a critical reliability concern for production applications.

Measuring hallucination rate involves comparing generated outputs against ground truth data or verified sources. For question-answering tasks, evaluators check whether answers contain fabricated facts, incorrect attributions, or unsupported claims. For summarization, they verify whether the summary introduces information not present in the source document. For RAG systems, evaluation confirms that responses stay faithful to retrieved context rather than incorporating model-generated fabrications.

Common evaluation approaches include human evaluation where experts assess factual accuracy, automated fact-checking using knowledge bases or search engines to verify claims, consistency checking where the model is queried multiple times and inconsistent answers indicate potential hallucinations, and attribution verification in RAG systems confirming that all claims can be traced to source documents.
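
However the judgments are produced, the metric itself is a simple aggregation; a sketch, assuming per-response support labels from human reviewers or an automated fact-checking step:

```python
# Sketch: compute a hallucination rate from labeled evaluation results.
def hallucination_rate(judgments):
    """judgments: list of dicts like {"response_id": ..., "is_supported": bool}"""
    if not judgments:
        return 0.0
    unsupported = sum(1 for j in judgments if not j["is_supported"])
    return unsupported / len(judgments)

evals = [
    {"response_id": 1, "is_supported": True},
    {"response_id": 2, "is_supported": False},  # fabricated claim found
    {"response_id": 3, "is_supported": True},
]
print(f"Hallucination rate: {hallucination_rate(evals):.1%}")  # 33.3%
```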

Reducing hallucination rates requires multiple strategies including instruction tuning that emphasizes factual accuracy, RAG systems that ground responses in retrieved evidence, confidence scoring that allows models to express uncertainty, temperature reduction that decreases sampling randomness, and prompt engineering that explicitly instructs models to acknowledge knowledge limitations. Monitoring hallucination rates in production ensures model outputs maintain acceptable accuracy levels.

Option A is incorrect because perplexity measures how well the model predicts text sequences rather than factual accuracy. Option B is wrong because BLEU score evaluates similarity to reference translations rather than measuring false information. Option D is incorrect because token throughput measures generation speed rather than output quality or accuracy.

Question 45: 

A generative AI engineer needs to implement conversation memory for a multi-turn chatbot. Which technique stores previous exchanges to maintain context across interactions?

A) Stateless generation

B) Conversation history management

C) Model quantization

D) Batch inference

Answer: B

Explanation:

Conversation history management stores previous exchanges to maintain context across interactions in multi-turn chatbot applications. This technique accumulates user messages and assistant responses throughout a conversation session, providing this history as context for subsequent generations to ensure coherent, contextually aware dialogue.

Conversation history management works by maintaining a conversation buffer that grows with each turn. When a user submits a new message, the system retrieves previous messages from the buffer and includes them in the prompt sent to the language model. The LLM generates a response considering the full conversation context rather than treating each message in isolation. This enables the assistant to reference previous topics, maintain consistent personas, and follow complex multi-turn reasoning.

Implementation considerations include managing conversation length to stay within token limits, using conversation summarization when history becomes too long, implementing sliding windows that retain recent messages while discarding older ones, and conversation pruning that removes less relevant exchanges. Many systems implement hybrid approaches maintaining detailed recent history with summarized earlier context.
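
A minimal sketch of a sliding-window buffer, using a crude word count as a stand-in for real tokenization:

```python
# Sketch: a conversation buffer that trims the oldest turns once the history
# exceeds a rough token budget. A real system would count tokens with the
# model's tokenizer rather than splitting on whitespace.
class ConversationMemory:
    def __init__(self, max_tokens=2000):
        self.max_tokens = max_tokens
        self.messages = []  # [{"role": "user"/"assistant", "content": ...}]

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        self._trim()

    def _trim(self):
        def size(msgs):
            return sum(len(m["content"].split()) for m in msgs)
        while len(self.messages) > 1 and size(self.messages) > self.max_tokens:
            self.messages.pop(0)  # drop the oldest turn

    def as_prompt_messages(self, system_instruction):
        return [{"role": "system", "content": system_instruction}] + self.messages
```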

Different storage strategies exist for conversation history including in-memory storage for short-lived sessions, database storage for persistent conversations across sessions, and embedding-based retrieval for long conversations where relevant historical context is retrieved based on current topic rather than including all history. Enterprise applications often persist conversation history for audit trails, quality monitoring, and continuous improvement.

Option A is incorrect because stateless generation treats each request independently without maintaining context across turns. Option C is wrong because model quantization reduces model size for efficiency rather than managing conversation context. Option D is incorrect because batch inference processes multiple requests simultaneously but does not maintain conversation state.

Question 46: 

An organization needs to deploy a generative AI model that processes sensitive customer data. Which privacy-preserving technique allows model inference without exposing raw data?

A) Model compression

B) Federated learning

C) Transfer learning

D) Prompt caching

Answer: B

Explanation:

Federated learning allows model training and inference on distributed data without exposing raw data to central servers, providing privacy-preserving capabilities for sensitive customer information. This technique trains models across multiple decentralized devices or servers holding local data samples, without exchanging the actual data.

Federated learning works by distributing model training across data sources. Each participant receives the current global model, trains it on their local data, and sends only model updates (gradients or weights) back to the central server. The central server aggregates these updates to improve the global model without ever accessing the raw training data. This approach satisfies privacy regulations while enabling model improvement from diverse data sources.
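
A conceptual sketch of one federated averaging (FedAvg) round, with model weights as NumPy arrays and the local training step left as a hypothetical helper:

```python
# Conceptual sketch of FedAvg: clients send only weight updates, never raw data.
import numpy as np

def client_update(global_weights, local_data, lr=0.01):
    # Placeholder for local training: in practice this is several epochs of
    # gradient descent on the client's private data.
    gradient = compute_local_gradient(global_weights, local_data)  # hypothetical helper
    return global_weights - lr * gradient

def federated_round(global_weights, clients):
    updates = [client_update(global_weights, data) for data in clients]
    # The server sees only the averaged weights, not any client's data.
    return np.mean(updates, axis=0)
```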

For generative AI applications, federated learning enables training on sensitive data like medical records across hospitals, financial data across banks, or personal communications across user devices. Each institution or device contributes to model improvement while maintaining data sovereignty and privacy. Differential privacy techniques can be combined with federated learning to provide mathematical privacy guarantees, adding noise to updates before transmission.

Challenges in federated learning include communication costs of transmitting model updates, heterogeneous data distributions across participants, handling participants with varying computational capabilities, and defending against adversarial attacks that attempt to extract information from model updates. Despite challenges, federated learning is increasingly adopted in healthcare, finance, and consumer applications requiring strong privacy protections.

Option A is incorrect because model compression reduces model size but does not address data privacy during processing. Option C is wrong because transfer learning adapts pre-trained models but does not provide privacy-preserving inference. Option D is incorrect because prompt caching improves performance but does not protect sensitive data.

Question 47: 

A generative AI engineer needs to implement a system that generates structured output like JSON objects. Which technique ensures the model produces valid structured data?

A) Temperature reduction

B) Constrained decoding with grammar rules

C) Prompt compression

D) Beam search

Answer: B

Explanation:

Constrained decoding with grammar rules ensures generative models produce valid structured data like JSON objects by restricting token generation to sequences that conform to predefined structural constraints. This technique guides the decoding process to only generate tokens that maintain valid syntax according to specified grammars or schemas.

Constrained decoding works by maintaining a parser state during generation that tracks valid next tokens according to the target structure. For JSON generation, the constraint system ensures brackets are balanced, commas appear in correct positions, keys are quoted strings, and values match expected types. At each generation step, the model’s token probability distribution is filtered to only include tokens that maintain structural validity.
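
A conceptual sketch of that loop; the model, tokenizer, and grammar checker here are hypothetical stand-ins, not the API of any particular constrained-decoding framework:

```python
# Sketch: at every decoding step, mask out tokens that would break the target
# structure by setting their logits to -inf before picking the next token.
import math

def constrained_generate(model, tokenizer, prompt, max_steps=200):
    tokens = tokenizer.encode(prompt)
    for _ in range(max_steps):
        logits = model.next_token_logits(tokens)             # hypothetical call
        valid_ids = allowed_next_tokens(tokens, tokenizer)   # e.g. from a JSON grammar
        masked = [l if i in valid_ids else -math.inf for i, l in enumerate(logits)]
        next_id = masked.index(max(masked))
        tokens.append(next_id)
        if tokenizer.is_end_of_structure(tokens):            # hypothetical stop check
            break
    return tokenizer.decode(tokens)
```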

Implementation approaches include context-free grammar constraints that define valid token sequences, JSON schema validation that enforces type constraints and required fields, and regular expression patterns for simpler structured outputs. Modern frameworks like Outlines, Guidance, and LMQL provide constraint specification languages and efficient constrained decoding algorithms that work with various language models.

Constrained decoding eliminates post-processing validation and retry loops that waste computation when models generate invalid structures. This is particularly valuable for function calling, API response generation, database query construction, and any scenario requiring reliable structured output. The technique maintains generation quality while guaranteeing syntactic correctness, making generative AI more suitable for production systems requiring programmatic output consumption.

Option A is incorrect because temperature reduction affects randomness but does not guarantee structural validity. Option C is wrong because prompt compression reduces prompt length rather than constraining output structure. Option D is incorrect because beam search explores multiple generation paths but does not enforce structural constraints.

Question 48: 

An engineer is implementing a RAG system and needs to split large documents into smaller chunks for embedding. Which factor most significantly impacts retrieval quality?

A) Chunk size and overlap strategy

B) Database connection pooling

C) GPU memory allocation

D) API rate limiting

Answer: A

Explanation:

Chunk size and overlap strategy most significantly impact retrieval quality in RAG systems when splitting large documents for embedding. These parameters determine how information is segmented, directly affecting whether relevant context can be retrieved and whether retrieved chunks contain sufficient information for the language model to generate accurate responses.

Chunk size represents the number of tokens or characters in each document segment. Smaller chunks (100-300 tokens) provide focused retrieval and reduce noise in retrieved context but may fragment information that should be understood together. Larger chunks (500-1000 tokens) preserve more context and relationships but may dilute relevance scoring and exceed LLM context windows when multiple chunks are retrieved. The optimal size depends on document structure and query patterns.

Overlap strategy addresses information fragmentation by creating overlapping chunks where consecutive segments share some content. For example, a 50-token overlap means the last 50 tokens of one chunk become the first 50 tokens of the next chunk. This prevents important information from being split at chunk boundaries and ensures concepts spanning boundaries appear completely in at least one chunk. Overlap typically ranges from 10 to 20 percent of chunk size.
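
A minimal sketch of fixed-size chunking with overlap, splitting on whitespace for simplicity (production pipelines usually split on model tokens or semantic boundaries):

```python
# Sketch: fixed-size chunking with overlap.
def chunk_text(text, chunk_size=300, overlap=50):
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 50-word overlap means the last 50 words of one chunk reappear at the start
# of the next, so a concept spanning a boundary stays intact in at least one chunk.
```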

Advanced chunking strategies include semantic chunking that splits at natural boundaries like paragraphs or sections, hierarchical chunking that maintains document structure, and sliding windows that create multiple overlapping views of the same content. Evaluation should measure retrieval precision and recall with representative queries to optimize chunking parameters for specific use cases.

Option B is incorrect because database connection pooling affects performance but not retrieval quality. Option C is wrong because GPU memory impacts processing speed rather than retrieval accuracy. Option D is incorrect because API rate limiting controls request frequency without affecting retrieval quality.

Question 49: 

A generative AI application needs to handle multiple concurrent user requests efficiently. Which serving pattern distributes requests across multiple model instances?

A) Sequential processing

B) Load balancing

C) Cache invalidation

D) Model quantization

Answer: B

Explanation:

Load balancing distributes requests across multiple model instances to handle concurrent user requests efficiently in generative AI applications. This serving pattern improves throughput, reduces latency, and provides fault tolerance by spreading workload across multiple replicas of the model running on different hardware resources.

Load balancing for generative AI works by maintaining a pool of model inference servers, each capable of handling generation requests independently. A load balancer receives incoming requests and distributes them to available instances based on various strategies including round-robin distribution, least-connections routing that directs requests to the servers with the fewest active connections, or weighted distribution that considers instance capabilities.
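
As a toy illustration of round-robin distribution (real deployments use a dedicated load balancer rather than client-side rotation, and the replica URLs below are made up):

```python
# Toy round-robin distribution of generation requests across model replicas.
import itertools
import requests

REPLICAS = [
    "http://model-replica-1:8000/generate",
    "http://model-replica-2:8000/generate",
    "http://model-replica-3:8000/generate",
]
_rotation = itertools.cycle(REPLICAS)

def route_request(payload):
    endpoint = next(_rotation)  # round-robin choice of replica
    return requests.post(endpoint, json=payload, timeout=30)
```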

Implementation considerations include health checking to detect and remove failed instances from rotation, session affinity for stateful conversations requiring consistent routing to the same instance, auto-scaling that adjusts the number of instances based on demand, and request queuing that buffers incoming requests during traffic spikes. Many deployments use Kubernetes with horizontal pod autoscaling or managed services like AWS Application Load Balancer.

Load balancing provides several benefits including horizontal scalability where capacity increases by adding more instances, high availability through redundancy where failures affect only a portion of capacity, zero-downtime deployments using rolling updates, and efficient resource utilization by distributing load evenly. This architecture pattern is essential for production generative AI applications serving large user bases.

Option A is incorrect because sequential processing handles requests one at a time without distributing load. Option C is wrong because cache invalidation manages cache freshness rather than distributing requests. Option D is incorrect because model quantization reduces model size but does not distribute requests across instances.

Question 50: 

An engineer needs to evaluate whether a generated text summary captures the key information from a source document. Which evaluation approach compares semantic similarity between generated and reference summaries?

A) Exact string matching

B) BERTScore

C) Token count comparison

D) Response time measurement

Answer: B

Explanation:

BERTScore compares semantic similarity between generated and reference summaries by leveraging contextualized embeddings from BERT-based models to evaluate how well the generated text captures meaning rather than requiring exact word matching. This evaluation metric is particularly effective for summarization tasks where different phrasings can convey the same information.

BERTScore works by computing embeddings for each token in both the generated and reference texts using a pre-trained BERT model. It then matches tokens between the two texts based on cosine similarity of their embeddings, computing precision by measuring how many generated tokens have similar reference tokens, recall by measuring how many reference tokens have similar generated tokens, and an F1 score combining both metrics.

The semantic matching capability makes BERTScore more robust than traditional metrics like BLEU or ROUGE that rely on n-gram overlap. BERTScore recognizes synonyms, paraphrases, and contextually equivalent expressions, providing more accurate quality assessment for summarization where perfect word matching is unrealistic. For example, it would correctly identify that “the company announced record profits” and “the firm reported unprecedented earnings” convey similar meaning.

BERTScore variants use different underlying models (BERT, RoBERTa, etc.) and can be fine-tuned for specific domains. The metric correlates well with human judgments of summary quality and is widely used for evaluating abstractive summarization, text generation, and machine translation. Implementations are available in libraries like HuggingFace Evaluate, making it accessible for production evaluation pipelines.
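
A short sketch using the Hugging Face evaluate library (the model backing the metric is downloaded on first use):

```python
# Sketch: computing BERTScore for a generated summary against a reference.
import evaluate

bertscore = evaluate.load("bertscore")
results = bertscore.compute(
    predictions=["the firm reported unprecedented earnings"],
    references=["the company announced record profits"],
    lang="en",
)
print(results["precision"], results["recall"], results["f1"])
```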

Option A is incorrect because exact string matching is too strict and misses semantically equivalent paraphrases. Option C is wrong because token count comparison only measures length without assessing content quality. Option D is incorrect because response time measures performance rather than output quality.

Question 51: 

A generative AI system needs to adjust response style based on user preferences. Which technique allows the model to adopt different personas or tones without retraining?

A) Model fine-tuning

B) Prompt engineering with system instructions

C) Architecture modification

D) Dataset augmentation

Answer: B

Explanation:

Prompt engineering with system instructions allows models to adopt different personas or tones without retraining by providing explicit instructions about desired behavior, style, and characteristics in the prompt. This technique leverages the instruction-following capabilities of modern language models to dynamically adjust output style based on textual guidance.

System instructions typically appear at the beginning of prompts and define the model’s role, personality, expertise level, and communication style. For example, instructions like “You are a friendly customer service representative who uses simple language and empathetic responses” or “You are a technical expert who provides detailed, precise answers with technical terminology” shape the model’s subsequent generations accordingly.
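
A minimal sketch of swapping personas at request time; the message structure follows the common system/user/assistant pattern, and `llm.chat` is a placeholder for whichever client your provider exposes:

```python
# Sketch: the same user question sent with two different system instructions.
PERSONAS = {
    "support": "You are a friendly customer service representative. Use simple "
               "language and an empathetic tone.",
    "expert":  "You are a technical expert. Answer precisely and use correct "
               "technical terminology.",
}

def ask(llm, persona, question):
    messages = [
        {"role": "system", "content": PERSONAS[persona]},
        {"role": "user", "content": question},
    ]
    return llm.chat(messages)  # hypothetical client call
```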

The flexibility of prompt-based persona control enables single models to serve multiple use cases with different style requirements. Customer service chatbots can adopt supportive tones, technical support systems can use precise language, educational applications can adjust complexity based on learner level, and creative writing assistants can match different literary styles. All these variations use the same underlying model with different system instructions.

Advanced prompt engineering includes few-shot examples demonstrating desired style, explicit constraints about what to avoid, formatting instructions for structured responses, and chain-of-thought prompts that guide reasoning patterns. Organizations often maintain libraries of tested system instructions for different use cases, enabling rapid deployment of new applications without model retraining.

Option A is incorrect because fine-tuning requires retraining the model, which contradicts the requirement for dynamic adjustment without retraining. Option C is wrong because architecture modification involves model design changes requiring extensive development. Option D is incorrect because dataset augmentation is a training-time technique rather than a deployment-time adaptation method.

Question 52: 

An organization needs to monitor a deployed generative AI model for performance degradation over time. Which metric helps detect when model outputs become less relevant or accurate?

A) GPU utilization

B) User feedback scores

C) Network bandwidth

D) Storage capacity

Answer: B

Explanation:

User feedback scores help detect when model outputs become less relevant or accurate over time by capturing direct quality assessments from end users interacting with the generative AI system. This human-in-the-loop monitoring provides ground truth data about model performance that automated metrics may miss.

User feedback mechanisms include explicit ratings where users score responses on quality scales, thumbs up/down buttons for binary quality indicators, detailed feedback forms capturing specific quality dimensions like relevance, accuracy, helpfulness, and coherence, and implicit signals like response revision rates or conversation abandonment. These signals aggregate into metrics tracking model performance trends over time.

Monitoring user feedback reveals various degradation patterns including concept drift where the model’s understanding becomes misaligned with evolving user needs, quality regression from production issues or configuration changes, edge case failures where specific query types consistently receive poor ratings, and bias emergence where certain user groups report lower satisfaction.

Production monitoring systems typically combine user feedback with automated metrics like response latency, error rates, and output length statistics. Alerting triggers when feedback scores drop below thresholds or show negative trends, prompting investigation and potential remediation through prompt updates, additional fine-tuning, or retrieval system improvements. Continuous feedback collection enables data-driven model improvement and ensures deployed systems maintain quality standards.
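
A minimal sketch of that alerting pattern, with storage and the alert channel left as placeholders:

```python
# Sketch: aggregate thumbs-up/down feedback over a rolling window and alert
# when the satisfaction score drops below a threshold.
from collections import deque

class FeedbackMonitor:
    def __init__(self, window=500, threshold=0.80):
        self.window = deque(maxlen=window)  # recent binary ratings
        self.threshold = threshold

    def record(self, thumbs_up: bool):
        self.window.append(1 if thumbs_up else 0)
        score = sum(self.window) / len(self.window)
        if len(self.window) == self.window.maxlen and score < self.threshold:
            self.alert(score)

    def alert(self, score):
        # Hook into an incident or chat tool in a real deployment.
        print(f"ALERT: satisfaction dropped to {score:.0%}")
```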

Option A is incorrect because GPU utilization measures infrastructure performance rather than output quality. Option C is wrong because network bandwidth affects delivery speed but not content relevance or accuracy. Option D is incorrect because storage capacity is an infrastructure metric unrelated to model output quality.

Question 53: 

A generative AI engineer needs to implement token-level access control to prevent the model from generating certain sensitive information. Which technique restricts specific tokens from being generated?

A) Prompt injection

B) Logit bias manipulation

C) Temperature scaling

D) Context window expansion

Answer: B

Explanation:

Logit bias manipulation restricts specific tokens from being generated by adjusting the probability distribution over the vocabulary before sampling, effectively implementing token-level access control. This technique modifies logits (pre-softmax scores) for specified tokens, either reducing their probability to near zero to prevent generation or increasing it to encourage specific outputs.

Logit bias works by identifying tokens representing sensitive information such as personal identifiable information, proprietary data, or inappropriate content, then applying negative bias values to their logits before the sampling step. A large negative bias (e.g., -100) effectively prevents the token from being selected during generation. Multiple tokens can be simultaneously biased, enabling comprehensive filtering.

Common use cases include preventing profanity generation, blocking generation of specific names or identifiers, avoiding prohibited topics or content, enforcing brand guidelines by suppressing competitor mentions, and implementing regulatory compliance by preventing regulated information disclosure. The technique provides fine-grained control without requiring model retraining or complex post-processing filters.

Implementation involves maintaining allow lists or block lists of token IDs, applying bias values before each generation step, and handling subword tokens appropriately, since sensitive words may be split across multiple tokens. Several LLM APIs, including OpenAI's, expose a logit bias parameter, making this technique accessible. However, sophisticated users might circumvent token-level controls through creative prompting, so logit bias should complement rather than replace other safety measures.
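
A sketch using the OpenAI Python SDK's logit_bias parameter; the token IDs below are placeholders and would need to be looked up with the model's tokenizer (for example tiktoken), since a single word may map to several token IDs:

```python
# Sketch: suppress specific token IDs at generation time with logit bias.
from openai import OpenAI

client = OpenAI()
BLOCKED_TOKEN_IDS = [12345, 67890]  # placeholder IDs for tokens to suppress

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the incident report."}],
    logit_bias={str(tid): -100 for tid in BLOCKED_TOKEN_IDS},  # -100 ≈ never sample
)
print(response.choices[0].message.content)
```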

Option A is incorrect because prompt injection is an attack technique rather than a control mechanism. Option C is wrong because temperature scaling affects randomness uniformly across all tokens rather than selectively restricting specific ones. Option D is incorrect because context window expansion increases input capacity but does not restrict token generation.

Question 54: 

An engineer is building a code generation system that needs to execute generated code safely. Which approach provides isolated execution environments?

A) Direct execution in production

B) Sandboxed containers

C) Database transactions

D) Prompt filtering

Answer: B

Explanation:

Sandboxed containers provide isolated execution environments for safely running generated code by creating restricted environments with limited access to system resources, network, and sensitive data. This approach enables code generation systems to execute and validate generated programs without risking the host system or production environment.

Sandboxed containers work by leveraging containerization technologies like Docker or specialized sandboxing solutions that create isolated process spaces. The sandbox defines resource limits including CPU time, memory allocation, disk space, and network access. Generated code executes within these boundaries, unable to access file systems, make network requests, or consume excessive resources beyond configured limits.

Security controls in code execution sandboxes include disabling network access to prevent data exfiltration or remote code execution, mounting read-only file systems to prevent unauthorized modifications, enforcing execution timeouts to prevent infinite loops, limiting system calls to prevent privilege escalation, and monitoring resource consumption to detect resource exhaustion attacks. Multi-layer sandboxing provides defense in depth.
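
A sketch of wiring those controls together with standard Docker CLI flags; the limits, image, and mount path are illustrative:

```python
# Sketch: run generated code inside a locked-down container (no network,
# read-only filesystem, CPU/memory caps, hard timeout).
import subprocess

def run_in_sandbox(code_path, timeout_s=10):
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",        # no data exfiltration or remote calls
        "--memory", "256m",
        "--cpus", "0.5",
        "--read-only",
        "--pids-limit", "64",
        "-v", f"{code_path}:/sandbox/script.py:ro",
        "python:3.11-slim",
        "python", "/sandbox/script.py",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
```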

Code generation systems typically execute generated code in sandboxes to validate correctness through test cases, demonstrate functionality to users safely, and extract execution results for further processing. This is essential for applications like interactive coding assistants, automated testing frameworks, and code explanation tools. The sandbox ensures malicious or buggy generated code cannot compromise system security or stability.

Option A is incorrect because direct execution in production exposes critical systems to potential security risks from untrusted generated code. Option C is wrong because database transactions provide data consistency rather than code execution isolation. Option D is incorrect because prompt filtering controls inputs but does not provide execution environment isolation.

Question 55: 

A generative AI application needs to process images along with text inputs. Which model architecture supports multimodal inputs?

A) Text-only transformer

B) Vision-language model (VLM)

C) Recurrent neural network

D) Decision tree

Answer: B

Explanation:

Vision-language models (VLMs) support multimodal inputs by processing both images and text through specialized architectures that encode visual information and textual information into compatible representations. These models enable applications requiring understanding of both modalities including image captioning, visual question answering, and multimodal document understanding.

VLM architectures typically consist of separate encoders for each modality followed by fusion mechanisms. The vision encoder (often a convolutional neural network or vision transformer) processes images into feature representations. The language encoder (typically a transformer) processes text into embeddings. Cross-attention mechanisms or multimodal fusion layers combine these representations, enabling the model to reason about relationships between visual and textual information.

Popular VLM architectures include CLIP which learns aligned image-text representations through contrastive learning, LLaVA which connects vision encoders to large language models, and GPT-4V which extends GPT-4 with vision capabilities. These models are trained on large datasets of image-text pairs, learning to associate visual concepts with language descriptions.
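
As a small illustration of the aligned image-text representations that CLIP learns (generative VLMs such as LLaVA add a language model on top of this kind of encoder), using Hugging Face transformers:

```python
# Sketch: score how well candidate captions match an image with CLIP.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chart.png")  # any local image
captions = ["a bar chart of quarterly revenue", "a photo of a cat"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```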

Applications of VLMs in generative AI include image-based question answering where users ask questions about image content, visual content generation with text guidance, document understanding for processing scanned documents or infographics containing both text and diagrams, and accessibility tools that describe visual content for visually impaired users. The multimodal capabilities significantly expand what generative AI systems can accomplish.

Option A is incorrect because text-only transformers cannot process image inputs directly. Option C is wrong because while RNNs can theoretically handle sequences, they are not designed for multimodal fusion. Option D is incorrect because decision trees are not used for processing complex multimodal data like images and text.

Question 56: 

An engineer needs to reduce the cost of serving a large language model. Which technique reduces model size while maintaining acceptable performance?

A) Increasing batch size

B) Model quantization

C) Adding more layers

D) Increasing context window

Answer: B

Explanation:

Model quantization reduces model size while maintaining acceptable performance by converting model weights and activations from high-precision formats like 32-bit floating point to lower-precision formats like 8-bit integers or 4-bit representations. This compression significantly decreases memory requirements, inference latency, and serving costs.

Quantization works by mapping the range of floating-point values to a smaller set of discrete values. Post-training quantization converts trained models without retraining, using techniques like calibration datasets to determine optimal quantization parameters. Quantization-aware training incorporates quantization simulation during training, enabling models to adapt to precision reduction and maintain better accuracy after quantization.

Common quantization schemes include INT8 quantization reducing model size by approximately 75 percent with minimal accuracy loss, INT4 quantization providing even greater compression for models robust to aggressive quantization, and mixed-precision quantization applying different precision levels to different layers based on sensitivity analysis. Specialized hardware like NVIDIA Tensor Cores accelerates quantized inference.

For large language models, quantization enables deployment on less expensive hardware, increases throughput by processing more requests per GPU, reduces cloud infrastructure costs, and enables edge deployment on resource-constrained devices. Libraries like bitsandbytes, GPTQ, and GGML provide quantization implementations supporting various precision levels and quantization schemes.
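
A sketch of loading a causal LM in 4-bit precision with bitsandbytes through transformers; the checkpoint name is illustrative, and 4-bit loading assumes a CUDA GPU with the bitsandbytes package installed:

```python
# Sketch: 4-bit (NF4) quantized loading of a causal LM.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```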

Option A is incorrect because increasing batch size improves throughput but does not reduce model size or per-request costs. Option C is wrong because adding layers increases model size and costs. Option D is incorrect because increasing context window increases computational requirements rather than reducing costs.

Question 57:

A generative AI system needs to maintain consistency across multiple generated responses for the same query. Which generation parameter should be adjusted?

A) Increase temperature

B) Decrease temperature

C) Increase top-p value

D) Disable prompt caching

Answer: B

Explanation:

Decreasing temperature maintains consistency across multiple generated responses for the same query by reducing randomness in the sampling process, making the model more likely to select high-probability tokens deterministically. Lower temperature values produce more focused, predictable outputs that vary less between generations.

Temperature controls the probability distribution over vocabulary tokens before sampling. Temperature values range from 0 to infinity, with 1.0 representing the model’s original probability distribution. Lower values (0.1-0.7) sharpen the distribution, dramatically increasing probabilities of top tokens while suppressing alternatives. Temperature 0 approximates greedy decoding, always selecting the highest probability token and producing nearly identical outputs for the same input.
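
The effect is easy to see numerically; a short sketch of temperature-scaled softmax:

```python
# Numeric illustration: logits are divided by the temperature before softmax,
# so low temperatures sharpen the distribution toward the top token.
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))  # roughly [0.63, 0.23, 0.14]
print(softmax_with_temperature(logits, 0.2))  # top token dominates (~0.99)
```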

Setting low temperature is appropriate for tasks requiring consistency and determinism including factual question answering where consistent correct answers are needed, structured data extraction where reliability matters more than creativity, function calling where specific API parameters must be generated correctly, and production systems where unpredictable variations could cause issues.

However, low temperature reduces diversity and creativity. Applications like creative writing, brainstorming, or generating varied examples benefit from higher temperature values that introduce randomness and explore alternative phrasings. Many systems allow users to adjust temperature based on their preferences, trading off consistency against creativity.

Option A is incorrect because increasing temperature introduces more randomness, reducing consistency. Option C is wrong because increasing top-p value expands the sampling pool, increasing variation. Option D is incorrect because disabling prompt caching affects performance but not generation consistency.

Question 58: 

An organization needs to ensure generated content complies with brand guidelines. Which technique validates output against predefined rules before returning responses to users?

A) Pre-generation filtering

B) Post-generation content moderation

C) Embedding normalization

D) Gradient descent

Answer: B

Explanation:

Post-generation content moderation validates output against predefined rules before returning responses to users, ensuring generated content complies with brand guidelines, safety standards, and regulatory requirements. This technique inspects completed generations, applying filters and classifiers to detect and handle non-compliant content.

Post-generation moderation systems typically include multiple components such as profanity filters detecting inappropriate language, safety classifiers identifying harmful content categories, brand consistency checkers validating terminology and tone, factual accuracy validators comparing claims against knowledge bases, and PII detectors identifying personally identifiable information that should be redacted.

The moderation pipeline processes generated outputs through these checks sequentially or in parallel. When violations are detected, the system can take various actions including blocking the response and returning an error message, regenerating with modified prompts or constraints, automatically editing the output to remove violations, or flagging for human review in ambiguous cases. Multi-tier moderation applies fast automated checks followed by slower but more sophisticated analysis.
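
A minimal sketch of such a check chain; the patterns below are illustrative only, and real moderation stacks combine rules with ML classifiers:

```python
# Sketch: simple post-generation checks that block, edit, or allow a response.
import re

BANNED_TERMS = re.compile(r"\b(damn|rival_brand_x)\b", re.IGNORECASE)
EMAIL_PII = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def moderate(response_text):
    if BANNED_TERMS.search(response_text):
        return {"action": "block", "reason": "banned terminology"}
    if EMAIL_PII.search(response_text):
        redacted = EMAIL_PII.sub("[REDACTED EMAIL]", response_text)
        return {"action": "edit", "text": redacted}
    return {"action": "allow", "text": response_text}
```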

Implementing effective moderation requires balancing false positives that frustrate users with legitimate requests against false negatives that allow policy violations. Continuous monitoring and improvement of moderation rules based on production data helps maintain this balance. Many organizations combine rule-based filters with machine learning classifiers trained on examples of acceptable and unacceptable content.

Option A is incorrect because pre-generation filtering operates on inputs rather than validating outputs against guidelines. Option C is wrong because embedding normalization is a technical operation for vector representations, not content validation. Option D is incorrect because gradient descent is a training algorithm unrelated to content compliance checking.

Question 59: 

A generative AI engineer needs to implement a system where the model can use external tools like calculators or search engines. Which technique enables models to decide when and how to invoke these tools?

A) Static prompt templates

B) Function calling/tool use

C) Model compression

D) Batch normalization

Answer: B

Explanation:

Function calling or tool use enables models to decide when and how to invoke external tools like calculators, search engines, databases, or APIs by generating structured function calls that systems can execute and return results back to the model. This technique extends language models with capabilities beyond text generation, enabling interaction with external systems.

Function calling works by providing the model with specifications of available functions including function names, descriptions, and parameter schemas. When a user query requires external capabilities, the model generates a function call in structured format (typically JSON) specifying which function to invoke and what parameters to pass. The application executes the function, obtains results, and provides them back to the model to incorporate into its response.

For example, when asked "What's the current temperature in New York and what is 15% of 80?", a model with function calling might generate two function calls: get_weather(location="New York") and calculate(expression="15% of 80"). The system executes these functions, returns results, and the model synthesizes a natural language response incorporating the information.
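
A sketch of the generic tool-use loop behind that example. The tool schema mirrors the common name/description/JSON-schema-parameters format used by major providers, but `llm.chat`, the continuation call, and the tool implementations are placeholders:

```python
# Sketch: declare tools, let the model request them, execute, and feed back results.
import json

TOOLS = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
    {
        "name": "calculate",
        "description": "Evaluate a simple arithmetic expression",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
]

def handle_tool_call(call):
    args = json.loads(call["arguments"])
    if call["name"] == "get_weather":
        return lookup_weather(args["location"])          # hypothetical helper
    if call["name"] == "calculate":
        return str(eval_expression(args["expression"]))  # hypothetical, sandboxed

def answer(llm, user_message):
    reply = llm.chat(messages=[{"role": "user", "content": user_message}], tools=TOOLS)
    while reply.get("tool_calls"):                       # model asked to use tools
        results = [handle_tool_call(c) for c in reply["tool_calls"]]
        reply = llm.continue_with_tool_results(reply, results)  # placeholder call
    return reply["content"]
```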

Function calling enables building sophisticated agents that can search databases, perform mathematical calculations, retrieve real-time information, control smart home devices, manage calendars, and interact with enterprise systems. This bridges the gap between language understanding and programmatic action execution. Major LLM providers including OpenAI, Anthropic, and open-source frameworks support function calling through standardized interfaces.

Option A is incorrect because static prompt templates provide fixed structures without dynamic tool invocation capabilities. Option C is wrong because model compression reduces model size but does not enable external tool interaction. Option D is incorrect because batch normalization is a training technique for neural networks, not a tool interaction method.

Question 60: 

An organization needs to track the provenance of training data used in their generative AI models for compliance purposes. Which practice documents dataset sources, preprocessing steps, and usage rights?

A) Model versioning

B) Data cards and dataset documentation

C) Hyperparameter logging

D) Inference caching

Answer: B

Explanation:

Data cards and dataset documentation document dataset sources, preprocessing steps, and usage rights, providing comprehensive provenance tracking for training data used in generative AI models. This practice creates structured documentation about dataset composition, collection methodology, licensing, intended uses, and known limitations.

Data cards (also called datasheets for datasets) include several key sections. The motivation section explains why the dataset was created and who funded its creation. Composition describes what data is included, how many instances exist, whether it contains personally identifiable information, and what the data represents. Collection process documents how data was acquired, who performed collection, what quality control was applied, and whether subjects consented.

Preprocessing and cleaning sections document transformations applied to raw data, what was removed or filtered, how missing values were handled, and whether any data augmentation occurred. Uses section specifies intended applications, tasks the dataset should and should not be used for, and any restrictions or licenses governing use. The limitations section acknowledges biases, gaps, or other issues that users should consider.
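
A machine-readable sketch of such a data card, stored alongside the dataset; the field names follow the sections described above and the values are purely illustrative:

```python
# Sketch: a minimal data card as structured metadata.
DATA_CARD = {
    "name": "support_tickets_2024",
    "motivation": "Fine-tune an internal support assistant.",
    "composition": {
        "num_instances": 120_000,
        "contains_pii": True,
        "pii_handling": "emails and phone numbers redacted before training",
    },
    "collection": {
        "source": "internal ticketing system export",
        "consent": "covered by employee and customer data agreements",
    },
    "preprocessing": ["deduplication", "language filtering (en)", "PII redaction"],
    "licensing": "internal use only; no redistribution",
    "intended_uses": ["instruction tuning", "retrieval corpus"],
    "prohibited_uses": ["re-identification of customers"],
    "limitations": ["skewed toward enterprise customers", "English only"],
}
```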

Maintaining rigorous dataset documentation supports several compliance objectives including demonstrating due diligence in data sourcing, respecting intellectual property and licensing terms, enabling bias audits by documenting data composition, supporting reproducibility by capturing preprocessing details, and facilitating impact assessments by tracking downstream uses. Regulatory frameworks increasingly require this transparency for AI systems.

Option A is incorrect because model versioning tracks model artifacts rather than documenting training data provenance. Option C is wrong because hyperparameter logging captures training configuration but not dataset documentation. Option D is incorrect because inference caching improves performance but does not address data provenance tracking.

 
