Q1. What is the temperature parameter in LLM generation? How does it affect output?
Temperature is a sampling hyperparameter that controls how random the model's output is. The logits are divided by the temperature T before softmax is applied, so T < 1 sharpens the distribution and T > 1 flattens it. Lower temperature (e.g., 0.1) makes the output close to deterministic, favoring the most likely tokens. Higher temperature (e.g., 1.2) increases randomness, allowing less likely tokens to be chosen. Many APIs accept values from 0 to 2; the exact range depends on the provider. Examples:
• Temperature = 0: Treated as greedy decoding (the division by zero is never performed); always picks the highest-probability token.
• Temperature = 0.5: Slightly creative but mostly focused.
• Temperature = 1: The model's original probability distribution, unchanged.
• Temperature = 1.5: Very high randomness; may produce incoherent output.
Use low temperature for factual Q&A, high temperature for creative writing or brainstorming.
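The scaling step can be illustrated with a minimal sketch in Python with NumPy. The function name temperature_softmax and the three example logits are made up for illustration, and treating T = 0 as greedy decoding is a common convention rather than something every library is guaranteed to do.

```python
import numpy as np

def temperature_softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities after dividing by the temperature."""
    logits = np.asarray(logits, dtype=np.float64)
    if temperature <= 0:
        # Assumption: T = 0 means greedy decoding, all mass on the top token.
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]  # made-up scores for three tokens
for t in (0.0, 0.5, 1.0, 1.5):
    print(t, np.round(temperature_softmax(logits, t), 3))
```

Running it should show the probability of the top token shrinking as the temperature rises, while the ranking of the tokens stays the same.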
Q2. What is Top‑p (nucleus sampling) and how does it differ from temperature?
Top‑p (nucleus sampling) selects the smallest set of tokens whose cumulative probability reaches the threshold p, then renormalizes and samples from that set only. Instead of considering all tokens, it keeps a variable number of tokens depending on how the probability mass is spread. For example, p=0.9 means the model considers only the most likely tokens that together make up 90% of the probability mass. Differences from temperature:
• Temperature reshapes the whole distribution; every token keeps a nonzero probability.
• Top‑p removes the low-probability tail, so the size of the candidate pool changes from step to step.
• They can be used together; typical combination: temperature 0.7–1.0, top‑p 0.9–1.0.
• Top‑p is often preferred because it naturally adapts to the shape of the distribution (e.g., a flat distribution yields many candidates, a peaked distribution yields few).
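As a rough illustration of how the nucleus adapts to the shape of the distribution, here is a minimal NumPy sketch. The helper top_p_filter and the two toy distributions are hypothetical, and real libraries differ slightly in how they treat the token that crosses the threshold.

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(probs)[::-1]          # token ids, most likely first
    cumulative = np.cumsum(probs[order])
    # Number of tokens needed for the running total to reach p
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()         # renormalize over the nucleus

peaked = [0.85, 0.10, 0.03, 0.02]  # toy distribution with one dominant token
flat = [0.25, 0.25, 0.25, 0.25]    # toy distribution with no clear winner
print(np.count_nonzero(top_p_filter(peaked, 0.9)))
print(np.count_nonzero(top_p_filter(flat, 0.9)))
```

With the same p, the peaked distribution keeps fewer candidates than the flat one, which is the adaptive behavior that a fixed cutoff cannot provide.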
Q3. What is Top‑k sampling and how does it work?
Top‑k sampling restricts the model to the k most likely tokens at each generation step. All other tokens are assigned zero probability, and the remaining probabilities are renormalized. For example, top‑k=50 means only the 50 highest-probability tokens can be chosen. This avoids very unlikely (often nonsensical) tokens. Top‑k is simpler than top‑p but less adaptive; the same k may be too permissive for some distributions (where only 10 tokens are plausible) and too restrictive for others (where 100 tokens are plausible). Typical values: top‑k=40 (the setting used for the published GPT-2 samples), top‑k=20–100 for various models. Top‑k is often combined with temperature and top‑p for finer control. Example:
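# Top-k sampling sketch (Python/NumPy); the input values are illustrative
import numpy as np
logits, k, temperature = np.array([2.0, 1.0, 0.5, -1.0]), 2, 0.8
top_idx = np.argsort(logits)[-k:]        # indices of the k highest logits
scaled = logits[top_idx] / temperature   # requires temperature > 0
probs = np.exp(scaled - scaled.max())    # softmax over the k candidates only
probs /= probs.sum()
next_token = np.random.choice(top_idx, p=probs)
Q4. How do temperature, top‑p, and top‑k interact? Give a practical tuning strategy.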
These parameters can be combined. Typical best practices:
• For deterministic tasks (fact extraction, code generation): temperature=0 (top‑p and top‑k then have no effect, because the top token is always chosen).
• For creative but coherent text: temperature=0.7–0.9, top‑p=0.9–1.0, top‑k=40–100.
• For highly random text (brainstorming): temperature=1.2–1.5, top‑p=0.95–1.0.
• Start with temperature only; if output is too repetitive, increase temperature or loosen top‑p.
• If output includes rare or odd tokens, reduce top‑k or top‑p.
• Some API documentation (e.g., OpenAI's) recommends adjusting either temperature or top‑p, not both, although setting both is allowed.
Strategy: Set temperature first to control randomness, then optionally set top‑p to trim the low-probability tail. For production, compare settings on a small validation set.
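The interaction can be sketched as a single sampling function. This is a minimal illustration, not any particular library's implementation: the name sample_next_token is hypothetical, and the order used here (temperature, then top-k, then top-p) is an assumption that mirrors a common convention.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Sample one token id; top_k=0 and top_p=1.0 disable those filters."""
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    if temperature <= 0:
        return int(np.argmax(logits))         # greedy: filters cannot change the result
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())     # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]           # token ids, most likely first
    if top_k > 0:
        order = order[:top_k]                 # keep the k most likely tokens
    if top_p < 1.0:
        # Cumulative mass over the tokens that survived top-k, renormalized
        cumulative = np.cumsum(probs[order]) / probs[order].sum()
        order = order[: int(np.searchsorted(cumulative, top_p)) + 1]
    kept = probs[order] / probs[order].sum()  # renormalize over the survivors
    return int(rng.choice(order, p=kept))

rng = np.random.default_rng(0)                # fixed seed for a repeatable demo
toy_logits = [3.0, 2.0, 1.0, -5.0]            # made-up scores for four tokens
print(sample_next_token(toy_logits, temperature=0))
print(sample_next_token(toy_logits, temperature=0.8, top_k=3, top_p=0.95, rng=rng))
```

Because both filters only remove tokens, applying them after temperature means a higher temperature can let more tokens into the top-p nucleus, which is one reason to change a single parameter at a time when tuning.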
Q5. Give scenario-based examples of when to adjust these parameters.
Scenario 1: A customer support chatbot that gives consistent, safe answers. Use temperature=0 (greedy).
Scenario 2: A creative story writer that needs novelty but not gibberish. Use temperature=0.8, top‑p=0.95.
Scenario 3: A code completion tool that must not hallucinate rare APIs. Use temperature=0.2, top‑k=20.
Scenario 4: A joke generator that should be surprising but still make sense. Use temperature=1.0, top‑p=0.9.
Scenario 5: A brainstorming session for product names, where very creative output is wanted. Use temperature=1.3, top‑p=1.0 (no nucleus filtering), and add a frequency penalty to avoid repetition.
Whenever temperature is above 0, test with multiple seeds, since the output varies from run to run.
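One way to keep such settings manageable is to store them as named presets. The sketch below is only an illustration: the SamplingPreset class and the preset names are hypothetical, and the frequency_penalty value of 0.5 is a placeholder, since the scenario above does not give a number.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingPreset:
    temperature: float
    top_p: float = 1.0               # 1.0 leaves nucleus sampling off
    top_k: int = 0                   # 0 leaves top-k off
    frequency_penalty: float = 0.0   # 0.0 applies no penalty

# Starting points taken from the scenarios above; tune them on your own data.
PRESETS = {
    "support_chatbot": SamplingPreset(temperature=0.0),
    "story_writer": SamplingPreset(temperature=0.8, top_p=0.95),
    "code_completion": SamplingPreset(temperature=0.2, top_k=20),
    "joke_generator": SamplingPreset(temperature=1.0, top_p=0.9),
    # frequency_penalty=0.5 is an illustrative placeholder, not a recommendation
    "brainstorming": SamplingPreset(temperature=1.3, top_p=1.0, frequency_penalty=0.5),
}

print(PRESETS["code_completion"])
```

Keeping presets in one place makes it easier to rerun the same prompts under each configuration when comparing outputs across seeds.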
