LLMs as Judges: Using Large Language Models to Evaluate AI-Generated Text

Imagine you’re a teacher grading thousands of essays, or a company evaluating customer service responses generated by AI. How do you determine which responses are good, which are bad, and which need improvement? This is one of the biggest challenges in artificial intelligence today. Traditionally, researchers have used mathematical formulas, metrics such as BLEU and ROUGE, to automatically score text. Think of these like spell-checkers: they can catch obvious errors, but they can’t tell whether a piece of writing is truly engaging, accurate, or helpful. These traditional methods often miss the qualities that make text genuinely good: Does it flow naturally? Is it factually correct? Does it actually answer the question asked? ...
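To make that contrast concrete, here is a minimal sketch, assuming nothing beyond the Python standard library: a crude word-overlap score standing in for BLEU/ROUGE-style metrics, next to a rubric prompt of the kind an LLM judge would receive. The rubric wording and the example answers are illustrative placeholders, not taken from the post.

```python
# Illustrative only: a surface-overlap score vs. an LLM-judge rubric prompt.
# Nothing here is the post's implementation; all names are placeholders.

def unigram_overlap(candidate: str, reference: str) -> float:
    """Fraction of reference words that also appear in the candidate.
    Rewards word overlap, not meaning, factuality, or fluency."""
    cand = set(candidate.lower().split())
    ref = reference.lower().split()
    return sum(w in cand for w in ref) / len(ref) if ref else 0.0

def build_judge_prompt(question: str, answer: str) -> str:
    """Rubric-style prompt you would send to a chat model acting as judge."""
    return (
        "Rate the answer from 1 (poor) to 5 (excellent) on accuracy, "
        "helpfulness, and fluency, then briefly justify the score.\n"
        f"Question: {question}\nAnswer: {answer}"
    )

if __name__ == "__main__":
    reference = "You can reset your password from the account settings page."
    answer = "Open your profile preferences and pick the change-password option."
    # A correct but differently worded answer scores low on overlap alone,
    # which is exactly the gap an LLM judge is meant to close.
    print(f"overlap score: {unigram_overlap(answer, reference):.2f}")
    print(build_judge_prompt("How do I reset my password?", answer))
```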

September 27, 2025 · 36 min · 7573 words · Anoop Maurya

Building AI Agents from Scratch: Understanding the Core Components Behind the Magic

Introduction As a data scientist and AI engineer, I’ve spent countless hours working with various agentic frameworks like LangGraph, AutoGen, and CrewAI. While these tools are incredibly powerful and make our lives easier, I often found myself wondering: what actually happens under the hood? What makes an AI agent tick? How do all these pieces come together to create something that can think, plan, and act autonomously? That curiosity led me down a fascinating rabbit hole, and eventually to building an AI agent completely from scratch: no frameworks, no abstractions, just pure Python and a deep dive into the fundamental building blocks that make agents work. ...
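For a sense of the loop the post builds up to, here is a toy think-act-observe cycle in pure Python. It is not the author’s code: `ask_model` is a stub where a real agent would call a chat model, and the single `calculator` tool is a made-up placeholder.

```python
# Toy agent loop: the model "thinks" (picks a tool or a final answer),
# the agent "acts" (runs the tool), and the result is fed back as an observation.
# ask_model and calculator are hypothetical stand-ins, not the post's code.

def calculator(expression: str) -> str:
    """Toy tool: evaluate a basic arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}))  # fine for a demo, not for untrusted input

TOOLS = {"calculator": calculator}

def ask_model(history: list[str]) -> str:
    """Stub for an LLM call; a real agent would send `history` to a chat model."""
    if any(line.startswith("Observation:") for line in history):
        return "FINAL: 4"            # pretend the model read the tool result
    return "calculator: 2 + 2"       # pretend the model chose a tool

def run_agent(task: str, max_steps: int = 5) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        reply = ask_model(history)                  # think
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL:").strip()
        tool_name, _, argument = reply.partition(": ")
        history.append(f"Observation: {TOOLS[tool_name](argument)}")  # act, then observe
    return "no answer within the step budget"

print(run_agent("What is 2 + 2?"))  # -> 4
```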

September 21, 2025 · 8 min · 1660 words · Anoop Maurya