GenAI Overview
Generative AI (GenAI) is a branch of artificial intelligence that creates new content, including text, images, audio, video, and code, in response to user prompts. It does this by learning patterns and structures from vast datasets using advanced machine learning models such as large language models (LLMs), generative adversarial networks (GANs), and other deep learning architectures.
LLM Overview
Large Language Models (LLMs) are advanced computer programs that can read, understand, and generate human language, such as writing sentences or answering questions.
- What they do:
LLMs can perform tasks like writing essays, answering questions, summarizing information, translating languages, creating computer code, and having conversations, much as a human would.
- How they work:
- They are trained on huge collections of text (like books, articles, and web pages) to learn patterns and meanings in language.
- LLMs use deep learning, specifically a type of neural network called a transformer. Transformers pay attention to words and their relationships, helping the model understand context, grammar, and even some facts about the world.
- During training, LLMs guess the next word in sentences billions of times, gradually learning how language works (for example: “The cat is on the ___” → “mat”).
- Once trained, they can receive prompts (questions or instructions) and produce relevant, natural-sounding text by predicting what comes next or by following instructions.
- Why are they “large”?
They have billions or trillions of “parameters” (numbers the computer adjusts while learning), which lets them handle complex language tasks with high accuracy.
- Examples:
ChatGPT, Google Gemini (formerly Bard), and Microsoft Copilot are all assistants built on LLMs and tuned to have conversations and generate useful content.
In simple terms, LLMs are smart programs that learn from lots of text so they can talk, write, and answer questions nearly as well as people do.
What are the different types of models?
Base Models (Foundation Models):
- Large, pre-trained AI models with broad capabilities (text, image, audio, code).
- Serve as the starting point for more specialized models.
- Examples: GPT (for text), DALL-E (images), Stable Diffusion (images), AudioGen (text to audio), Whisper (audio to text).
Fine-tuned Models:
- Models adapted for specific tasks using domain data.
- Fine-tuning creates a specialized model (e.g., legal writing, code generation).
- Examples: CodeLlama (coding, based on LLaMA), CodeQwen (coding, based on Qwen).
Instruction-tuned Models:
- Trained to follow different instructions provided at runtime.
- Respond to prompts like “Translate this text” or “Write a summary”.
- Not built from scratch, but tuned from base models.
Multimodal Models:
- Can process and generate multiple types of data (text, images, etc.).
- Examples: OpenAI’s GPT-4o, Google Gemini (both accept input/output in varied formats).
Reasoning Models:
- Perform advanced reasoning by solving problems step-by-step.
- Can self-correct and handle complex, multi-step tasks.
- Examples: OpenAI’s o1, DeepSeek’s R1-Lite-Preview, Alibaba’s QwQ-32B-Preview, Sky-T1-32B-Preview, and F1-Preview.
LLM vs SLM:
- LLM (Large Language Model): Hundreds of billions of parameters; high computational resource needs; handle complex tasks (e.g., Llama-3.1 with 405B, PaLM with 540B).
- SLM (Small Language Model): Fewer parameters (3B–7B); more efficient for specific tasks; examples include Google’s Gemma, Microsoft’s Phi-3-mini/Phi-3-small.
Open-source vs Proprietary Models:
- Open-source: Freely accessible, can be run locally; examples include Meta-Llama, Qwen.
- Proprietary: Accessed through APIs or platforms, offered by companies like OpenAI (ChatGPT), Anthropic (Claude), Google (Gemini).
Model Size (7B, 13B, etc.):
Indicates the number of parameters (billions), with larger models generally capable of more complex tasks.
AI Platforms Offering Models:
- Major cloud providers: AWS, Azure, GCP.
- Specialized platforms: Replicate, Fireworks AI, Together AI.
- Local model hosting: Ollama (run models locally).
These pointers summarize the core types and categories of AI models currently prominent in the ecosystem.
Tokens, LLM Hallucination and Inferencing
Tokens
- Tokens are the smallest units of text – such as words, characters, or even parts of words—that an LLM (Large Language Model) uses to process language. Instead of handling whole sentences or paragraphs, the model breaks input down into these manageable pieces called tokens.
- For example, the sentence “The quick brown fox” might be split as [“The”, “quick”, “brown”, “fox”], or even into smaller pieces like [“Th”, “e”, ” “, “qu”, “ick”, ” “, “br”, “own”, ” “, “fox”], depending on how the model’s tokenizer has been designed (the sketch after this list shows one real tokenizer’s output).
- During both learning and output generation, LLMs operate by predicting one token at a time, building text sequentially.
- The number of tokens a model can process at once (its context window) is limited, which constrains both input size and output length.
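To see tokenization in practice, the short sketch below uses the open-source tiktoken library (an assumption: it is installed via pip, and the “cl100k_base” encoding is chosen purely for illustration; other models ship different tokenizers and will split the same text differently).

```python
# Tokenize a sentence with tiktoken and show the resulting pieces.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by several OpenAI models

text = "The quick brown fox"
token_ids = enc.encode(text)                  # a list of integer token IDs
pieces = [enc.decode([t]) for t in token_ids]

print(token_ids)
print(pieces)  # typically one piece per word here, e.g. ['The', ' quick', ' brown', ' fox']
```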
Inferencing
- Inferencing (or inference) is the process of using a trained LLM to generate output, such as answering a question, writing code, or summarizing text.
- It starts with input tokens from your prompt, and the model predicts the most probable next token – repeating this step to form sentences or longer outputs.
- This process is not guaranteed to be identical each time, especially when parameters such as “temperature,” “top_k,” or “top_p” make the output more creative or varied by allowing randomness in the choice of the next token (see the sampling sketch after this list).
- Inferencing speed is measured via metrics like Tokens Per Second (TPS) and Time to First Token (TTFT).
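To make the effect of these parameters concrete, here is a minimal, self-contained sketch of how a sampler might pick the next token from a model’s probability distribution. The candidate tokens and their scores are made up for illustration; a real model produces scores (logits) over tens of thousands of tokens.

```python
import math
import random

# Hypothetical scores (logits) a model might assign to candidate next tokens
# after the prompt "The cat is on the ...".
logits = {"mat": 4.0, "roof": 2.5, "table": 2.0, "moon": 0.5}

def sample_next_token(logits, temperature=1.0, top_p=1.0):
    # Temperature rescales the logits: low values sharpen the distribution
    # (more deterministic), high values flatten it (more random).
    scaled = {tok: score / temperature for tok, score in logits.items()}
    total = sum(math.exp(s) for s in scaled.values())
    probs = {tok: math.exp(s) / total for tok, s in scaled.items()}

    # Top-p (nucleus) sampling: keep only the smallest set of tokens whose
    # cumulative probability reaches top_p, then sample from that set.
    kept, cumulative = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens, weights = zip(*kept)
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample_next_token(logits, temperature=0.2, top_p=1.0))  # almost always "mat"
print(sample_next_token(logits, temperature=1.5, top_p=0.9))  # more varied output
```

Running the first call repeatedly almost always yields “mat”, while the second call varies, which is exactly the behaviour these parameters control in a real LLM.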
Hallucinations
- Hallucination in LLMs means generating text that sounds confident and plausible but is factually incorrect, nonsensical, or not directly supported by any provided data.
- For example, if asked about an obscure topic, an LLM might invent a source or details that don’t exist.
- Hallucinations happen because the LLM is designed to predict the next token based on patterns in training data, not to check for factual accuracy.
- Reducing hallucinations is important in high-stakes areas (such as healthcare or law) and involves improving the quality of training data and, often, adding human checks and references.
Prompts and Prompt Engineering
A prompt is the input provided to a large language model (LLM) to generate a response. It can be a question, a statement, or a set of instructions. The quality and clarity of the prompt significantly impact the quality of the output. The art of crafting effective prompts is known as prompt engineering.
Components of an Effective Prompt
An effective prompt often includes several key components, which the sketch after this list assembles into a single prompt:
- Instruction: A clear and specific task the model needs to perform (e.g., “Summarize the following article,” “Write a poem about the ocean”).
- Context: Background information or specific details to help the model understand the request better (e.g., “The article is about the history of artificial intelligence,” “The poem should be in the style of Edgar Allan Poe”).
- Input Data: The information the model needs to process, such as text, code, or a list of items.
- Output Indicator: A statement specifying the desired format or style of the output (e.g., “The summary should be in bullet points,” “The poem should have four stanzas”).
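As a minimal sketch of these components in code (the article text, wording, and variable names below are placeholders, not taken from any real system), note how the four parts are simply concatenated into one prompt string, with triple quotes acting as delimiters around the input data:

```python
# Building a prompt from the four components: instruction, context,
# input data, and output indicator. All values here are illustrative.
article = """Artificial intelligence research began in the 1950s and has
since grown into a field spanning machine learning, robotics, and more."""

prompt = (
    "Summarize the following article.\n"                      # Instruction
    "The article is about the history of AI.\n"               # Context
    f'Article: """{article}"""\n'                              # Input data (delimited)
    "The summary should be in exactly three bullet points."   # Output indicator
)

print(prompt)  # This string would be sent to an LLM as the prompt.
```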
Prompting Techniques
Different prompting techniques can be used to elicit specific and more accurate responses from LLMs.
- Zero-shot Prompting:
This is the most basic form of prompting, where you provide the model with a task and no examples. The model relies solely on its pre-trained knowledge to generate a response.
- Example: “Translate the following English sentence to French: ‘Hello, how are you?'”
- Few-shot Prompting:
This technique involves providing the model with a few examples of the desired input-output pairs before giving it the actual task. This helps the model understand the pattern and desired format.
- Example:
“The capital of France is Paris.
The capital of Japan is Tokyo.
The capital of Italy is Rome.
The capital of Germany is?”
- Chain-of-Thought (CoT) Prompting:
CoT prompting encourages the model to break down a complex problem into intermediate steps before providing the final answer. This is particularly effective for multi-step reasoning tasks and helps the model think through the problem logically.
- Example:
Prompt: “The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.”
- Correct Response (with CoT): The odd numbers are 9, 15, and 1. Their sum is 25. 25 is not an even number. So the answer is no.
- Incorrect Response (without CoT): The model might jump to a wrong conclusion without showing its steps.
- Role-playing Prompting:
Assigning a specific persona or role to the model can guide its response in a particular style or tone.
- Example: “You are a seasoned marketing expert. Draft a compelling social media post announcing the launch of a new product.”
- Instruction-based Prompting:
This technique involves providing explicit instructions on what the model should or should not do, often using keywords like “MUST,” “DO NOT,” “AVOID,” etc.
- Example: “Summarize the article in five bullet points. DO NOT use jargon.”
- Self-Correction Prompting:
This technique involves asking the model to first generate a response and then critique or improve its own answer based on specific criteria.
- Example: “Answer the following question. After you’ve answered, review your response for clarity and accuracy and provide a revised version if necessary.”
Prompting Guidelines
- Be Specific and Clear: Ambiguous prompts lead to ambiguous results. Clearly state your goal and provide all necessary information.
- Iterate and Refine: The first prompt is rarely perfect. Experiment with different phrasings, examples, and techniques to see what works best.
- Provide Constraints: Tell the model what to avoid or what format to follow. This helps in controlling the output.
- Use Delimiters: When providing context, use delimiters like triple quotes “””, XML tags <tag>, or other markers to separate the instructions from the input data. This helps the model understand which part is the instruction and which is the data.
- Break Down Complex Tasks: For complicated requests, divide them into smaller, manageable steps. This is a form of CoT and improves accuracy.
- Know Your Audience: Adjust the prompt based on who the final output is for. A prompt for a technical audience will differ from one for a general audience.
The Importance of Play and Experimentation
The most crucial part of prompt engineering is to play around with your prompts. There is no one-size-fits-all solution. A prompt that works perfectly for one use case might fail for another.
- Test Different Phrasings: Change a few words to see if it improves the output.
- Adjust Temperature and Other Parameters: Most LLM APIs allow you to control parameters like temperature (creativity/randomness) and top_p (diversity). Experimenting with these can significantly change the output (see the sketch after this list).
- Combine Techniques: Use a mix of techniques. For example, combine few-shot examples with a role-playing persona for a more targeted response.
- Analyze Errors: When the model gives a wrong or unhelpful response, analyze why. Did the prompt lack context? Was it too vague? Use this analysis to refine your next prompt.
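As one illustration of adjusting parameters and combining techniques, the sketch below sets a role-playing persona together with explicit sampling parameters using OpenAI’s Python SDK. It assumes the openai package (v1+) is installed and an OPENAI_API_KEY environment variable is set; the model name is illustrative, and other providers expose similar parameters under similar names.

```python
# Calling a chat model with explicit sampling parameters and a persona.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        # Role-playing: the system message sets the persona.
        {"role": "system", "content": "You are a seasoned marketing expert."},
        {"role": "user", "content": "Draft a two-sentence social media post "
                                    "announcing the launch of a new product."},
    ],
    temperature=0.9,  # higher = more creative/random wording
    top_p=0.9,        # restricts sampling to the most probable tokens
)

print(response.choices[0].message.content)
```

Re-running this with temperature=0.1 tends to produce much more repetitive, conservative wording, which is a quick way to feel the effect of the parameter.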
By understanding these techniques and guidelines and, most importantly, by embracing a mindset of continuous experimentation, you can effectively leverage the power of large language models to achieve your desired outcomes.
Function/Tool Calling
Function and tool calling are key capabilities that enable Large Language Models (LLMs) like GPT to go beyond just generating text—they let these AI models interact with external systems, APIs, or software to perform real-world tasks or retrieve up-to-date information.
What is Function and Tool Calling?
- Function calling (also known as tool calling) allows an LLM to identify when it needs to use an external function or tool to fulfill a user request. Instead of only generating a text answer, the model produces a structured request specifying which function to call and with what parameters.
- A function or tool is a pre-defined piece of code or an API that performs a specific task, like fetching weather data, querying a database, sending an email, or executing calculations.
- The LLM doesn’t execute the function itself. Instead, it outputs a structured description (typically JSON) indicating the function name and arguments. Your application then interprets this response, calls the actual function/tool, and sends the result back to the model.
How It Works (Typical Workflow)
- Define Functions/Tools: Developers provide the LLM with descriptions and schemas for the functions/tools it can call. This includes names, input parameters, and expected outcomes.
- Model Input: The user gives the model a prompt or question.
- Function Call Decision: The LLM analyzes the prompt and decides if it needs to call a function to answer correctly or perform an action.
- Function Call Output: The model outputs a structured function call with the name and parameters.
- Execution: The application receives this response, executes the specified function/tool, and obtains real-world data or performs an action accordingly.
- Response Completion: The function’s output is fed back to the LLM, which then completes its natural language response incorporating the new information. The sketch below walks through this loop in code.
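Below is a minimal sketch of this workflow using OpenAI’s Python SDK as one concrete example. The tool schema format shown is OpenAI-specific (other providers use similar but not identical conventions), get_weather is a stand-in implementation, and error handling (for instance, the case where the model answers directly without calling a tool) is omitted for brevity.

```python
import json
from openai import OpenAI

client = OpenAI()

# 1. Define the tool: name, description, and a JSON Schema for its parameters.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

def get_weather(location: str) -> str:
    # Stand-in implementation; a real version would call a weather API.
    return json.dumps({"location": location, "temp_c": 18, "sky": "cloudy"})

messages = [{"role": "user", "content": "What's the weather in Paris?"}]

# 2-4. The model decides whether to call a tool and returns a structured call.
first = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
tool_call = first.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)   # e.g. {"location": "Paris"}

# 5. The application executes the tool and appends the result to the conversation.
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": tool_call.id,
                 "content": get_weather(**args)})

# 6. The model completes its natural-language answer using the tool output.
final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
print(final.choices[0].message.content)
```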
Why Use Function/Tool Calling?
- Access real-time or specific data: LLMs are trained on static datasets and don’t have real-time knowledge. Function calling lets the model access fresh data (e.g., current weather, stock prices).
- Perform precise or complex actions: Some tasks like calculations, database queries, or booking appointments are better handled by dedicated code than by text generation alone.
- Reduce hallucinations: By grounding responses in factual data retrieved through functions, overall accuracy improves.
- Extend capabilities: Seamlessly connects language models with external applications and services to create powerful AI agents and assistants.
Examples
- Asking a chatbot, “What’s the weather in Paris?” triggers a call to a get_weather(location="Paris") function, and the model uses the returned weather data in its reply.
- Scheduling a meeting by calling a calendar API.
- Fetching user’s account details from a database to answer a support query.
In essence, function and tool calling bridge the language understanding of LLMs with actionable, real-world functionality, enabling AI systems to be more practical, accurate, and interactive.
Basics of RAG
Retrieval-Augmented Generation (RAG) is an AI technique that enhances the capabilities of large language models (LLMs) by integrating an external information retrieval step before generating their output. This approach allows the model to refer to up-to-date, authoritative, or domain-specific data beyond what was available in its original training, improving response accuracy, relevance, and reducing hallucinations (incorrect or fabricated information).
High-Level Overview of RAG
- Core Idea: Instead of relying solely on pre-trained knowledge, RAG first retrieves relevant information from external sources—such as databases, documents, or APIs—based on the user’s query.
- Process:
- Retrieval: The user’s query is converted into a vector representation and used to search a vector database storing external data as embeddings (numerical representations of text or other content).
- Augmentation: The retrieved relevant data is combined with the original query to form an augmented prompt.
- Generation: The LLM processes this enriched prompt to generate a more accurate, context-aware, and up-to-date response. (A minimal code sketch of these three steps appears after the Benefits list below.)
- Benefits:
- Access to real-time or proprietary data without retraining the model.
- Greater control over information sources, leading to transparency and trust.
- Cost-effective way to improve accuracy compared to fine-tuning or retraining large models.
- Helps reduce hallucinations by grounding answers in sourced information.
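Here is a minimal, self-contained sketch of the retrieve-augment-generate loop. The documents, query, and bag-of-words “embedding” are stand-ins chosen so the example runs without external services; a production system would use a learned embedding model and a vector database instead.

```python
# Minimal RAG sketch: retrieve the most relevant document for a query,
# then build an augmented prompt for the LLM.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in "embedding": lowercase word counts. Real systems use dense
    # vectors from embedding models (e.g. sentence-transformers or hosted APIs).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm CET.",
]

query = "What is the refund policy for returns?"

# Retrieval: rank stored documents by similarity to the query embedding.
best_doc = max(documents, key=lambda d: cosine(embed(query), embed(d)))

# Augmentation: combine the retrieved context with the original question.
augmented_prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{best_doc}\n\n"
    f"Question: {query}"
)

# Generation: this augmented prompt would now be sent to an LLM.
print(augmented_prompt)
```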
Key Terminologies Used in RAG
- Embedding Model: Converts raw data (documents, texts) into vector embeddings that capture semantic meaning, making it easier for retrieval systems to find relevant information.
- Retriever: A system component (like a semantic search engine) that matches the user query’s embedding against stored data embeddings to find the most relevant documents or content.
- Vector Database: A specialized database that stores embeddings for fast similarity searches based on vector distances.
- Augmented Prompt: The enhanced input to the LLM that includes both the user query and the retrieved relevant information, providing richer context to generate better responses.
- Reranker (optional): A component that ranks retrieved documents by relevance to the query, helping select the best documents for augmentation.
- Generation: The step where the LLM creates its response by incorporating the augmented information, blending external knowledge with its pre-trained capabilities.
- Hallucination: When the LLM produces inaccurate or fabricated details; RAG helps mitigate this by grounding outputs in retrieved information.
- External Data: Any data not part of the original training set, like company documents, APIs, or the latest news, which can be updated continuously to keep RAG responses current.
In essence, RAG is a hybrid AI framework that combines the strengths of information retrieval systems and generative language models to produce more accurate, relevant, and reliable AI-generated content, especially in scenarios that require access to fresh or domain-specific knowledge.
This page is part of the companion pages for my book Agentic AI Demystified. The other pages are available here.