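Learning Agent Development: Building a Targeted Web Information Collection Agent Without a Framework

The language of choice is TypeScript, and the goal of this post is to build, from scratch, an Agent that searches the web automatically and collects targeted information. I won't spend much time on the TypeScript configuration files or environment variables; the focus is on hand-rolling the AI Agent itself.

The Agent's logic is simple: the user asks a question, and the Agent opens a browser, searches the web, clicks links, extracts content, and finally organizes everything into a complete answer for the user.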

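Project Initialization

Tech Stack

Technology	Purpose
TypeScript	Development language
OpenAI SDK	Calling the LLM API
Playwright	Browser automation
Zod	Defining tool parameters
zod-to-json-schema	Generating JSON Schema automatically
dotenv	Managing environment variables
tsx	Running TypeScript source files directly

Dependencies

Two commands, one for runtime dependencies and one for development dependencies: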

powershell

pnpm add dotenv openai zod zod-to-json-schema

powershell

pnpm add -D @types/node playwright tsx typescript

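In this project I connect directly to a local Chrome browser over the CDP protocol, so there is no need to download the Chromium build that Playwright ships for testing. The following PowerShell command starts Chrome with port 9777 open for remote debugging: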

powershell

Start-Process "chrome.exe" -ArgumentList "--remote-debugging-port=9777", "--user-data-dir=$env:LOCALAPPDATA\chrome-debug-profile", "--new-window", "about:blank"

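Environment Variables

Create a .env file. The only environment variables this project needs are the ones for the LLM, such as the API key; other parameters are hard-coded in the Agent for now.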

env

OPENAI_API_KEY=sk-your-api-key-here
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4o
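
The rest of the code reads these variables straight from process.env, so a missing key would only surface at the first API call. Below is a minimal sketch of failing fast at startup instead. requireEnv and readEnv are hypothetical helper names that are not part of the original project, and the env object is passed in explicitly so the sketch runs without a real .env file.

```typescript
// Hypothetical helpers (not in the original project): validate env vars up front.
type Env = Record<string, string | undefined>;

// Returns the trimmed value, or throws if the variable is missing or blank.
function requireEnv(name: string, env: Env): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Returns the trimmed value, or the fallback if the variable is missing or blank.
function readEnv(name: string, fallback: string, env: Env): string {
  const value = env[name]?.trim();
  return value ? value : fallback;
}

// Usage with an in-memory object standing in for process.env.
const demoEnv: Env = { OPENAI_API_KEY: "sk-your-api-key-here", OPENAI_MODEL: "" };
const demoKey = requireEnv("OPENAI_API_KEY", demoEnv);
const demoModel = readEnv("OPENAI_MODEL", "gpt-4o", demoEnv);
```

In the project itself you would presumably pass process.env after dotenv has loaded, and call requireEnv("OPENAI_API_KEY", process.env) before starting the loop.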

TypeScript Compiler Configuration

The tsconfig.json file in the project root should look familiar to anyone who writes a lot of frontend code:

json

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "ES2022",
    "moduleResolution": "bundler",
    "lib": ["ES2022", "DOM"],
    "outDir": "dist",
    "rootDir": "src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "resolveJsonModule": true,
    "declaration": true,
    "declarationMap": true,
    "sourceMap": true
  },
  "include": ["src"],
  "exclude": ["node_modules", "dist"]
}

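Agent Scaffolding

Entry File

The project entry file is src/index.ts.

This entry-level Agent only does plain question-and-answer with the user, and the tool calls running in the background don't need to be shown, so a CLI is enough and there is no need to build a frontend page. For that reason I put the interaction logic directly in the entry file.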

typescript

import "dotenv/config";
import * as readline from "node:readline/promises";

async function main() {
  console.log("Kyuu - Web Search & Information Collection Agent\n");

  // const page = await launchBrowser();
  // console.log("📕Browser started\n");

  // const tools = getToolDefinitions();
  // console.log(`🔧Registered ${tools.length} tools: ${tools.map((t) => t.name).join(", ")}\n`);

  const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
  });

  while (true) {
    const question = await rl.question("❓Enter your question (type exit to quit):\n> ");

    // Trim first so that " exit " also quits
    if (question.trim().toLowerCase() === "exit") break;
    if (!question.trim()) continue;

    console.log("\n🚀Agent is running...\n");

    try {
      // Run the Agent to get the answer
      // console.log(`\n📋 Final answer:\n\n${answer}\n`);
    } catch (err) {
      console.error("❌Execution failed:", err);
    }
  }

  rl.close();
  // await closeBrowser();
  console.log("\n📕Browser closed. Goodbye!");
}

main().catch((err) => {
  console.error("❌Fatal Error:", err);
  // closeBrowser().then(() => process.exit(1));
});

That is the run logic of the entry point. Next we need to write each module.

TypeScript Type Definitions

Here is the final version of src/types/types.ts:

typescript

import type { ChatCompletionMessageToolCall } from "openai/resources/chat/completions";

/** Tool definition: describes a capability the AI can call */
export interface ToolDefinition {
  /** Name of the tool */
  name: string;
  /** What the tool does; helps the model decide when to call it */
  description: string;
  /** Parameter definition of the tool (usually in JSON Schema format) */
  parameters: Record<string, unknown>;
  /** The function that actually executes the tool */
  execute: (args: Record<string, unknown>) => Promise<string>;
}

/** Information about one tool call */
export interface ToolCall {
  /** Unique identifier of the tool call */
  id: string;
  /** Name of the tool being called */
  name: string;
  /** Argument object passed to the tool */
  arguments: Record<string, unknown>;
}

/** Message role type */
export type MessageRole = "system" | "user" | "assistant" | "tool";

/** Chat message definition */
export interface Message {
  /** Role of the message sender */
  role: MessageRole;
  /** Text content of the message */
  content: string | null;
  /** Tool calls issued by the model */
  tool_calls?: ChatCompletionMessageToolCall[];
  /** For a tool result message, the ID of the tool call it answers */
  tool_call_id?: string;
  /** The model's internal reasoning content */
  reasoning_content?: string;
}

/** Agent configuration */
export interface AgentConfig {
  /** Name of the model to use */
  model: string;
  /** System prompt that defines the Agent's behavior and goal */
  systemPrompt: string;
  /** Maximum number of steps the Agent may run, to prevent infinite loops */
  maxSteps: number;
}

/** Result of a single Agent step */
export interface StepResult {
  /** Kind of step result: a tool call or a text response */
  type: "tool_call" | "text_response";
  /** Parsed tool calls (present when type is tool_call) */
  toolCalls?: ToolCall[];
  /** Final text response (present when type is text_response) */
  textContent?: string;
}

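Creating the Module Directories

Create the src/agent, src/browser, src/tools, and src/types directories. They hold the Agent core module, the browser management module, the Agent tool modules, and the TypeScript type definitions, respectively.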

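Browser Management Module

This is the Agent's foundation; every Tool developed later depends on the browser. In this project we use Playwright to connect to a local Chrome browser over the CDP protocol. On one hand, we don't need to download Playwright's test Chromium, which saves disk space. On the other, Chrome keeps login state and other data in the profile we connect to, which leaves room to extend the Agent later as our needs grow.

Starting Chrome with Remote Debugging Port 9777 from PowerShell

One command is enough. We could also edit the chrome.exe shortcut properties so the port is open by default, but there's little point. The command points --user-data-dir at a dedicated chrome-debug-profile directory instead of your everyday Windows user profile, so your usual logins and extensions are not carried over; on first launch it is effectively a clean Chrome.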

powershell

Start-Process "chrome.exe" -ArgumentList "--remote-debugging-port=9777", "--user-data-dir=$env:LOCALAPPDATA\chrome-debug-profile", "--new-window", "about:blank"

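The core of this module is to get a usable Chrome browser ready before the Agent starts working and to release it once the Agent is done. So we only need three core functions:

- launchBrowser(): connects to the existing browser over CDP and returns the single page object the Agent uses for searching, clicking, scraping, and so on

- getPage(): returns the current page

- closeBrowser(): closes the browser connection and clears the module-level variables to avoid leaks (for a browser attached over CDP, Playwright's browser.close() disconnects rather than quitting Chrome)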

typescript

import { chromium, Browser, BrowserContext, Page } from "playwright";

const CDP_PORT = 9777;
let browser: Browser | null = null;
let context: BrowserContext | null = null;
let page: Page | null = null;

export async function launchBrowser(): Promise<Page> {
  if (page) return page;

  browser = await chromium.connectOverCDP(`http://localhost:${CDP_PORT}`);

  const contexts = browser.contexts();
  context = contexts.length > 0 ? contexts[0] : await browser.newContext({
    viewport: { width: 1280, height: 720 },
    locale: "zh-CN",
  });

  const pages = context.pages();
  page = pages.length > 0 ? pages[0] : await context.newPage();

  return page;
}

export function getPage(): Page {
  if (!page) throw new Error("Browser not launched. Call launchBrowser() first.");
  return page;
}

export async function closeBrowser(): Promise<void> {
  if (browser) {
    await browser.close();
    browser = null;
  }
  context = null;
  page = null;
}

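Agent Module

When you first get into Agent development, the most confusing part is "why can an Agent act on its own?" Strip away the fancy concepts and an Agent is essentially a while loop. In a traditional program, code runs in a fixed order: call API A, then write to database B, or branch on some condition. In an Agent, "what to do next" is decided by the LLM at run time.

An Agent is just a somewhat bigger text-completion machine. It looks "smarter" than a plain LLM because we wrote a while loop that keeps feeding it feedback from the environment (web page content, in this project). With more, fresher, and more accurate information, it naturally makes better decisions and gives better answers.

The ReAct Pattern

The Agent follows a loop known as ReAct (Reason + Act):

Think: based on the current situation, the LLM decides what to do next.

Act: the LLM tells our code "please call tool XX for me", and our code executes it.

Observe: our code returns the tool's result (for example, the text of a web page) to the LLM, and we go back to the Think step. This repeats until the LLM decides "I have enough information now" (or the loop hits its step limit) and it is time to summarize everything and answer the user.

That is exactly what the runAgent() function we are about to write does.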
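
To make the three phases concrete before any API is involved, here is a minimal synchronous sketch of the loop. It is not the runAgent() from this project: the LLM is replaced by a scripted policy function, and Decision, Policy, runReActLoop, and the fake search_web tool are all names assumed for illustration.

```typescript
// A scripted stand-in for the LLM: each call returns the next decision.
type Decision =
  | { kind: "act"; tool: string; input: string }
  | { kind: "answer"; text: string };

type Policy = (history: string[]) => Decision;
type ToolFn = (input: string) => string;

function runReActLoop(
  question: string,
  policy: Policy,
  tools: Record<string, ToolFn>,
  maxSteps: number
): { answer: string; history: string[] } {
  const history: string[] = [`user: ${question}`];

  for (let step = 1; step <= maxSteps; step++) {
    // Think: the policy (the LLM in the real Agent) picks the next move.
    const decision = policy(history);

    if (decision.kind === "answer") {
      history.push(`assistant: ${decision.text}`);
      return { answer: decision.text, history };
    }

    // Act: our code runs the requested tool.
    const tool = tools[decision.tool];
    const observation = tool ? tool(decision.input) : `Tool "${decision.tool}" not found.`;

    // Observe: the result is appended so the next Think step can see it.
    history.push(`act: ${decision.tool}(${decision.input})`);
    history.push(`observe: ${observation}`);
  }

  return { answer: "Step limit reached without an answer.", history };
}

// Hypothetical policy: search once, then answer from the observation.
const scriptedPolicy: Policy = (history) => {
  const lastObservation = [...history].reverse().find((h) => h.startsWith("observe: "));
  return lastObservation
    ? { kind: "answer", text: `Based on ${lastObservation.slice("observe: ".length)}` }
    : { kind: "act", tool: "search_web", input: "Ultraman Tiga plot" };
};

const demoTools: Record<string, ToolFn> = {
  search_web: (query) => `3 results for "${query}"`,
};

const demoRun = runReActLoop("What is the plot of Ultraman Tiga?", scriptedPolicy, demoTools, 5);
```

The control flow never hard-codes "search, then answer"; it only asks the policy what to do next and feeds back what happened, which is the part that carries over to the real Agent.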

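The Agent Main Loop

The Agent core module lives in src/agent/agent.ts. Before showing the main loop, there are a few things to think through.

messages

An LLM has no memory (it is stateless). You can't just tell it "open the second link", because it has no idea which links you mean. So we create a messages array and log everything from the start of the conversation to now: the system prompt, the user's question, the LLM's thinking, and the results returned by tools. Every request to the LLM sends this complete log.

messages is not the same as the context of a chat window. It is not a series of conversational turns between us and the Agent; it is the working record the Agent builds up while driving the LLM to gather and consolidate information.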

typescript

const messages: Message[] = [
  { role: "system", content: systemPrompt },
  { role: "user", content: userMessage },
];
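
As a rough illustration of how this log grows, the sketch below plays out one tool round by hand. The types are simplified local copies of the Message shape so the snippet stands alone, and the id "call_1" and the tool output text are made up; real ids come from the API.

```typescript
// Simplified local copies of the article's Message shape, so the sketch is self-contained.
type Role = "system" | "user" | "assistant" | "tool";

interface SketchToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

interface SketchMessage {
  role: Role;
  content: string | null;
  tool_calls?: SketchToolCall[];
  tool_call_id?: string;
}

const transcript: SketchMessage[] = [
  { role: "system", content: "You are a web research agent." },
  { role: "user", content: "What is the plot of Ultraman Tiga?" },
];

// Round 1: the model asks for a tool instead of answering.
transcript.push({
  role: "assistant",
  content: null,
  tool_calls: [
    {
      id: "call_1", // hypothetical id; real ids come from the API
      type: "function",
      function: { name: "search_web", arguments: JSON.stringify({ query: "Ultraman Tiga plot" }) },
    },
  ],
});

// Our code runs the tool and records the result under the same id.
transcript.push({ role: "tool", content: "(search results text)", tool_call_id: "call_1" });

// The next request would send all four entries again, in this order.
const rolesInOrder = transcript.map((m) => m.role).join(" -> ");
```

The tool message carries the same id as the assistant's tool call, which is how the result gets matched to the request that triggered it.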

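while loop and maxSteps

As mentioned, the Agent needs a while loop to gather and consolidate information round by round, because solving one problem may take several steps. For a question like "the plot of Ultraman Tiga", it might need to: search -> extract the page -> find the content insufficient -> search again. We can't know in advance how many steps it needs, so a loop lets it drive itself forward.

This loop also needs a maxSteps setting, the maximum number of iterations. One reason is that an LLM sometimes fixates on a dead end or gets stuck in a loop (for example, repeatedly searching for a term that returns nothing; the user's prompt may itself be inaccurate). The other is that API calls cost money and time. Without maxSteps the program could run forever, which no developer wants, especially now that many frontier LLM APIs are far from cheap.

maxSteps is essentially a safety guardrail against endless exploration: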

typescript

const warnAt = maxSteps - 3;
const forceAnswerAt = maxSteps - 1;
if (stepCount === warnAt) {
  messages.push({
    role: "user",
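    content: "You have already run several searches. You must now compile a complete answer from the information gathered so far. Do not run new searches or open new pages.",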
  });
}
const activeTools = stepCount >= forceAnswerAt ? [] : tools;
const response = await sendToLLM(messages, activeTools);

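As the step count approaches the limit, we can warn the LLM, which usually pushes it to stop exploring and start summarizing. But an LLM is not a person: it may ignore the warning, or push back and insist it still needs more information. So right before the limit we "confiscate its tools" by emptying the tool list passed to the LLM (activeTools = []). With no tools available, the LLM can only output a text answer.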
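
To see exactly when each guardrail kicks in, the two thresholds can be pulled into a small pure function and tabulated. This is only a sketch using the same maxSteps - 3 and maxSteps - 1 arithmetic; planStep and StepPlan are names assumed here, not part of the project.

```typescript
interface StepPlan {
  /** Whether the "stop searching" reminder is injected at this step */
  injectWarning: boolean;
  /** Whether the tool list is still passed to the LLM at this step */
  toolsEnabled: boolean;
}

// Same arithmetic as the snippet above: warn at maxSteps - 3, drop tools from maxSteps - 1.
function planStep(stepCount: number, maxSteps: number): StepPlan {
  const warnAt = maxSteps - 3;
  const forceAnswerAt = maxSteps - 1;
  return {
    injectWarning: stepCount === warnAt,
    toolsEnabled: stepCount < forceAnswerAt,
  };
}

// Tabulate steps 1..15 for maxSteps = 15.
const schedule = Array.from({ length: 15 }, (_, i) => ({
  step: i + 1,
  ...planStep(i + 1, 15),
}));
```

With maxSteps = 15, the warning is injected once at step 12, tools stay available through step 13, and steps 14 and 15 run without tools. One thing the arithmetic implies: for maxSteps of 3 or less, warnAt is 0 or negative, so the warning never fires and only the tool cutoff applies.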

Complete Main Loop Code

With that groundwork done, we can write the Agent main loop. The LLM-call code is covered in the next section:

typescript

// Types live in src/types/types.ts, one level up from src/agent
import type { Message, StepResult, ToolDefinition } from "../types/types.js";
import { sendToLLM } from "./llm.js";

export async function runAgent(
  userMessage: string,
  tools: ToolDefinition[],
  systemPrompt: string,
  onStep: (step: StepResult) => void
): Promise<string> {
  const messages: Message[] = [
    { role: "system", content: systemPrompt },
    { role: "user", content: userMessage },
  ];

  const maxSteps = 15;
  let stepCount = 0;

  while (stepCount < maxSteps) {
    stepCount++;

    const warnAt = maxSteps - 3;
    const forceAnswerAt = maxSteps - 1;

    if (stepCount === warnAt) {
      messages.push({
        role: "user",
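        content: "You have already run several searches. You must now compile a complete answer from the information gathered so far. Do not run new searches or open new pages.",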
      });
    }

    const activeTools = stepCount >= forceAnswerAt ? [] : tools;

    const response = await sendToLLM(messages, activeTools);

    if (response.toolCalls.length > 0 && stepCount < forceAnswerAt) {
      onStep({ type: "tool_call", toolCalls: response.toolCalls });

      const assistantMsg: Message = {
        role: "assistant",
        content: response.textContent,
        reasoning_content: response.reasoningContent || undefined,
        tool_calls: response.toolCalls.map((tc) => ({
          id: tc.id,
          type: "function" as const,
          function: {
            name: tc.name,
            arguments: JSON.stringify(tc.arguments),
          },
        })),
      };
      messages.push(assistantMsg);

      for (const toolCall of response.toolCalls) {
        const tool = tools.find((t) => t.name === toolCall.name);
        if (!tool) {
          messages.push({
            role: "tool",
            content: `Tool "${toolCall.name}" not found.`,
            tool_call_id: toolCall.id,
          });
          continue;
        }

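        // A tool may throw (bad arguments, page timeout); report it to the LLM instead of aborting the run.
        let result: string;
        try {
          result = await tool.execute(toolCall.arguments);
        } catch (err) {
          result = `Tool "${toolCall.name}" failed: ${err instanceof Error ? err.message : String(err)}`;
        }
        messages.push({
          role: "tool",
          content: result,
          tool_call_id: toolCall.id,
        });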
      }
    } else {
      const finalText = response.textContent || "The Agent did not return a valid response.";
      onStep({ type: "text_response", textContent: finalText });
      return finalText;
    }
  }

  const lastResort = await sendToLLM(messages, []);
  return lastResort.textContent || "The Agent could not produce a valid answer. Please try again.";
}

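LLM Module

The LLM module lives in src/agent/llm.ts. It does just two things: it makes the network request to the cloud LLM, and it smooths over the differences between our own data structures and the ones the LLM API requires.

Instantiating the SDK

First we instantiate the client from OpenAI's official SDK. Most models today offer an OpenAI-compatible API, so swapping the baseURL is usually enough to point this code at another model without touching the logic.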

typescript

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: process.env.OPENAI_BASE_URL || "https://api.openai.com/v1",
});

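Converting Message Structures

We need to convert our simplified ToolDefinition (the ones registered in registry.ts) into the nested structure the OpenAI API expects. The API requires the tool list to be an array whose items carry type: "function" and a nested function object. We don't want to write that nesting every time we define a tool in our own project, so we define a flat ToolDefinition and wrap it in one place here.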

typescript

function toolsToOpenAIFormat(tools: ToolDefinition[]) {
  return tools.map((t) => ({
    type: "function" as const,
    function: {
      name: t.name,
      description: t.description,
      parameters: t.parameters,
    },
  }));
}

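Next, we need to filter and assemble the message bodies sent to the LLM. Our messages fall into two broad kinds, those involving tool calls and those that don't; the first user message, for example, only has role and content. Passing undefined fields straight to the OpenAI API can sometimes trigger errors under strict validation, so each optional field is attached only when it exists.

The code also handles reasoning_content specifically. If you don't use DeepSeek, you don't need this field. DeepSeek, however, required me to pass the previous round's reasoning content back unchanged, otherwise the request failed. I only ran into this because I was using a DeepSeek model XD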

typescript

function messageToOpenAIFormat(m: Message): Record<string, unknown> {
  const msg: Record<string, unknown> = {
    role: m.role,
    content: m.content,
  };
  // ... attach tool_calls, tool_call_id, reasoning_content when present
  return msg;
}

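After that, we need to parse the arguments of the tool calls the LLM requests. This step translates the LLM's message into the Agent's message. The function arguments the LLM returns are always a JSON-formatted string (for example "{ \"query\": \"Tiga\" }"), not a JS object you can use directly. So we need a function that calls JSON.parse() to turn it into a real object, which agent.ts can then pass straight to the tool function.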

typescript

function parseToolCalls(toolCalls?: ChatCompletionMessageToolCall[]): ToolCall[] {
  if (!toolCalls) return [];

  return toolCalls.map((tc) => ({
    id: tc.id,
    name: tc.function.name,
    arguments: JSON.parse(tc.function.arguments), // <--- the key step
  }));
}
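
One caveat with a bare JSON.parse() is that it throws on malformed input, and a model can occasionally emit truncated or non-object arguments. The sketch below shows one possible way to turn that into a value the caller can inspect. safeParseArguments and ParsedArgs are hypothetical names; the code above does not do this.

```typescript
type ParsedArgs =
  | { ok: true; value: Record<string, unknown> }
  | { ok: false; error: string };

// Parses a tool-call argument string without throwing.
function safeParseArguments(raw: string): ParsedArgs {
  try {
    const value: unknown = JSON.parse(raw);
    // Tool arguments are expected to be a JSON object, not an array or a primitive.
    if (typeof value !== "object" || value === null || Array.isArray(value)) {
      return { ok: false, error: "Tool arguments must be a JSON object." };
    }
    return { ok: true, value: value as Record<string, unknown> };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, error: `Invalid JSON in tool arguments: ${reason}` };
  }
}

const good = safeParseArguments('{ "query": "Tiga" }');
const truncated = safeParseArguments('{ "query": "Ti');
const notAnObject = safeParseArguments("[1, 2]");
```

On failure, the error string could be sent back as the tool result so the model gets a chance to retry with valid arguments, rather than the whole run crashing.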

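sendToLLM

All the conversions in the previous section serve one purpose: communication between the Agent and the LLM. So we can write one function that combines them, so the Agent calls just this function instead of repeating the conversions in the main logic. Also, the response returned by the OpenAI API is a large, deeply nested object (token statistics, system fingerprint, and so on), and our Agent doesn't care about most of it. sendToLLM works like a sieve that pulls out and returns only the three essentials:

- What it said (textContent)

- Which tools it wants to call (toolCalls)

- Its reasoning (reasoningContent)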

typescript

export async function sendToLLM(
  messages: Message[],
  tools: ToolDefinition[]
): Promise<{
  textContent: string | null;
  toolCalls: ToolCall[];
  reasoningContent: string | null;
}> {
  const response = await client.chat.completions.create({
    model: process.env.OPENAI_MODEL || "gpt-4o",
    messages: messages.map(messageToOpenAIFormat) as any,
    tools: tools.length > 0 ? toolsToOpenAIFormat(tools) : undefined,
    tool_choice: tools.length > 0 ? "auto" : undefined,
    temperature: 0.3,
  });

  const choice = response.choices[0];
  const msg = choice.message as any;

  return {
    textContent: msg.content,
    toolCalls: parseToolCalls(msg.tool_calls),
    reasoningContent: msg.reasoning_content || null,
  };
}

Complete LLM Module Code

typescript

import OpenAI from "openai";
import type { ToolDefinition, Message, ToolCall } from "../types/types";
import type { ChatCompletionMessageToolCall } from "openai/resources/chat/completions";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: process.env.OPENAI_BASE_URL || "https://api.openai.com/v1",
});

function toolsToOpenAIFormat(tools: ToolDefinition[]) {
  return tools.map((t) => ({
    type: "function" as const,
    function: {
      name: t.name,
      description: t.description,
      parameters: t.parameters,
    },
  }));
}

function messageToOpenAIFormat(m: Message): Record<string, unknown> {
  const msg: Record<string, unknown> = {
    role: m.role,
    content: m.content,
  };
  if (m.tool_calls) msg.tool_calls = m.tool_calls;
  if (m.tool_call_id) msg.tool_call_id = m.tool_call_id;
  if (m.reasoning_content) msg.reasoning_content = m.reasoning_content;
  return msg;
}

function parseToolCalls(toolCalls?: ChatCompletionMessageToolCall[]): ToolCall[] {
  if (!toolCalls) return [];

  return toolCalls.map((tc) => ({
    id: tc.id,
    name: (tc as any).function.name,
    arguments: JSON.parse((tc as any).function.arguments),
  }));
}

export async function sendToLLM(
  messages: Message[],
  tools: ToolDefinition[]
): Promise<{
  textContent: string | null;
  toolCalls: ToolCall[];
  reasoningContent: string | null;
}> {
  const response = await client.chat.completions.create({
    model: process.env.OPENAI_MODEL || "gpt-4o",
    messages: messages.map(messageToOpenAIFormat) as any,
    tools: tools.length > 0 ? toolsToOpenAIFormat(tools) : undefined,
    tool_choice: tools.length > 0 ? "auto" : undefined,
    temperature: 0.3,
  });

  const choice = response.choices[0];
  const msg = choice.message as any;

  return {
    textContent: msg.content,
    toolCalls: parseToolCalls(msg.tool_calls),
    reasoningContent: msg.reasoning_content || null,
  };
}

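Tools Module

Before we start, let's be clear about what counts as a "tool" in Agent development. In the project we're writing, is Browser a tool? No. At the code level, browser is a domain category, while search_web and click_element are the actual Tools registered with the LLM.

Give the model "freedom to compose": if we hard-wire "search, click, extract" into one super tool called search_and_extract, the model loses flexibility. If it clicks through and lands on a CAPTCHA page, it can't call a solve_captcha tool, because we fixed the flow. With atomic tools, the model decides for itself: search_web -> find the target -> click_element -> see that it's a long article -> scroll_page -> extract_content.

Reduce the model's "cognitive load": the more complex a tool is, the more parameters it needs. If a super tool takes 10 parameters, the model can easily get some wrong (hallucination). Atomic tools usually take only 1-2 parameters (click_element only needs text), so calls are more likely to succeed.

Next, the three-part structure of Agent Tool development:

Zod Schema: tells the LLM the parameter format

Tool Definition: tells the LLM the tool's name, purpose, and parameters

Execute Function: actually performs the operation

No particular framework mandates this, but it is arguably the current best practice. The convention follows from OpenAI's Function Calling API specification. Take the following code: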

typescript

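import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

// Zod Schema (defines the parameters)
export const clickSchema = z.object({ text: z.string().describe("Text to click") });

// Tool Definition (the manual shown to the model)
export const clickDefinition = {
  name: "click_element",
  description: "Click a link or button on the page that matches the given text...",
  parameters: zodToJsonSchema(clickSchema),
};

// Execute Function (the actual execution logic)
export async function executeClick(args: Record<string, unknown>) { /* Playwright code */ }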

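The model can't necessarily "read" the TypeScript that implements a tool, so we have to tell it in plain language: "here is a tool, this is its name, this is what it does, and these are its parameters." OpenAI requires each entry in the tool list you send to look like this: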

json

{
  "type": "function",
  "function": {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
      "type": "object",
      "properties": {
        "location": { "type": "string", "description": "The city and state" }
      },
      "required": ["location"]
    }
  }
}

Our Tool Definition exists to produce this JSON structure.
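
As a concrete illustration, here is roughly what the click_element tool would look like on the wire, with the parameters schema written by hand rather than generated. FlatTool and toOpenAITool are names assumed for this sketch, and the exact schema a generator emits may contain additional fields.

```typescript
// Flat, project-side shape of a tool (without the execute function).
interface FlatTool {
  name: string;
  description: string;
  parameters: Record<string, unknown>;
}

// Wraps a flat tool in the nested { type: "function", function: {...} } envelope.
function toOpenAITool(tool: FlatTool) {
  return {
    type: "function" as const,
    function: {
      name: tool.name,
      description: tool.description,
      parameters: tool.parameters,
    },
  };
}

// Hand-written JSON Schema for a single required string parameter named "text".
const clickTool: FlatTool = {
  name: "click_element",
  description: "Click a link or button whose text matches the given string.",
  parameters: {
    type: "object",
    properties: {
      text: { type: "string", description: "Text of the link or button to click" },
    },
    required: ["text"],
    additionalProperties: false,
  },
};

// The JSON that would be placed in the request's tools array.
const wirePayload = JSON.stringify(toOpenAITool(clickTool), null, 2);
```

Even for one string parameter the schema takes a handful of nested lines, which gives a sense of how quickly this grows for larger tools.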

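Look at the OpenAI format above once more: its parameters field must be in a format called JSON Schema. Writing nested JSON Schema by hand is painful and error-prone. Zod, meanwhile, is one of the most popular validation libraries in the TypeScript community. With zodToJsonSchema, TypeScript type inference, runtime argument validation, and the JSON Schema sent to the model all come from a single definition.

It's not just OpenAI; other vendors define similar conventions, for example: https://sdk.vercel.ai/docs/foundations/tools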

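Tool Template (Using click as the Example)

File location: src/tools/browser/click.ts.

Our Agent's run logic involves several simulated page interactions such as click, search, and scroll. I won't walk through each one here; the click tool serves as the template, a small exercise in the three-part Tool structure described above: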

typescript

import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";
import { getPage } from "../../browser/browser.js";

export const clickSchema = z.object({
  text: z.string().describe("Text of the link or button to click"),
});

export const clickDefinition = {
  name: "click_element",
  description: "Click a link or button on the page that matches the given text. Used to open a target page from the search results.",
  parameters: zodToJsonSchema(clickSchema) as Record<string, unknown>,
};

export async function executeClick(args: Record<string, unknown>): Promise<string> {
  // Validate the LLM-supplied arguments at run time; throws if `text` is missing or not a string
  const { text } = clickSchema.parse(args);

  const page = getPage();

  // Filter by text instead of interpolating it into the selector, so quotes in `text` cannot break it
  const link = page.locator("a, button").filter({ hasText: text }).first();
  if ((await link.count()) === 0) {
    return `No clickable element containing "${text}" was found.`;
  }

  // Wait for a possible new tab while clicking; resolves to null if none opens within 10 s
  const [newPage] = await Promise.all([
    page.context().waitForEvent("page", { timeout: 10000 }).catch(() => null),
    link.click(),
  ]);

  if (newPage) {
    await newPage.waitForLoadState("domcontentloaded");
    const title = await newPage.title();
    return `Opened in a new tab: ${title}`;
  }

  await page.waitForLoadState("domcontentloaded");
  await page.waitForTimeout(1000);
  const title = await page.title();
  return `Clicked "${text}"; current page: ${title}`;
}

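Tool Registration

We've written several Tools, plus a src/tools/browser/index.ts that gathers them for convenient importing. Is that enough? Can we just import the Tools directly in agent.ts and use them? Sure we can, but first look at the code that approach produces: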

typescript

import { executeSearch } from "../tools/browser/search.js";
import { executeNavigate } from "../tools/browser/navigate.js";
import { executeClick } from "../tools/browser/click.js";
// ... import every other tool

// In the main loop of agent.ts:
for (const toolCall of response.toolCalls) {
  let result = "";

  // A pile of if-else (or a switch) is needed to hard-code the name-to-function mapping
  if (toolCall.name === "search_web") {
    result = await executeSearch(toolCall.arguments);
  } else if (toolCall.name === "navigate_to_url") {
    result = await executeNavigate(toolCall.arguments);
  } else if (toolCall.name === "click_element") {
    result = await executeClick(toolCall.arguments);
  } else if (toolCall.name === "extract_content") {
    result = await executeExtract(toolCall.arguments);
  } else if (toolCall.name === "scroll_page") {
    result = await executeScroll(toolCall.arguments);
  } else {
    result = `Tool ${toolCall.name} not found`;
  }

  // Push result back into messages...
}

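That is rather ugly. agent.ts is the core; it should only care about the think-act-observe loop, yet now it has to know every concrete tool and even import each tool's file. Suppose tomorrow we want to add a new tool, take_screenshot: besides creating screenshot.ts, we'd also have to go edit the core agent.ts and add another branch to this bloated if-else. Once the project grows, this turns into a nightmare :(

So we can bring in a classic design pattern, the Registry Pattern, whose core idea is to manage the binding between "name" and "execution logic" in one place.

Here is src/tools/registry.ts: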

typescript

import type { ToolDefinition } from "../types/types.js";
import {
  searchDefinition, executeSearch,
  navigateDefinition, executeNavigate,
  clickDefinition, executeClick,
  extractDefinition, executeExtract,
  scrollDefinition, executeScroll,
} from "./browser/index.js";

const toolMap = new Map<string, ToolDefinition>();

function register(def: Omit<ToolDefinition, "execute">, execute: ToolDefinition["execute"]) {
  toolMap.set(def.name, { ...def, execute });
}

register(searchDefinition, executeSearch);
register(navigateDefinition, executeNavigate);
register(clickDefinition, executeClick);
register(extractDefinition, executeExtract);
register(scrollDefinition, executeScroll);

export function getToolDefinitions(): ToolDefinition[] {
  return Array.from(toolMap.values());
}

export function getToolDefinition(name: string): ToolDefinition | undefined {
  return toolMap.get(name);
}

export async function executeTool(name: string, args: Record<string, unknown>): Promise<string> {
  const tool = toolMap.get(name);
  if (!tool) return `Tool "${name}" not found.`;
  return tool.execute(args);
}

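In essence, this keeps all registered tools in a Map<string, ToolDefinition>. Thanks to the registry, the Agent doesn't need to know about any concrete tool: it takes ToolDefinition[] from the registry and feeds it to the LLM, and once the LLM returns a tool name, the registry finds the matching execute function and runs it. This follows the Dependency Inversion Principle: the high-level module (Agent) does not depend on low-level modules (concrete Tool implementations) but on an abstraction (the ToolDefinition interface).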
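
For comparison, the same idea can be packaged as a small class. This is a sketch of one possible variation, not the registry.ts above: ToolRegistry and RegisteredTool are assumed names, the duplicate-name check and the try/catch around execution are additions, and take_screenshot is a placeholder tool used only to exercise it.

```typescript
interface RegisteredTool {
  name: string;
  description: string;
  parameters: Record<string, unknown>;
  execute: (args: Record<string, unknown>) => Promise<string>;
}

class ToolRegistry {
  private readonly tools = new Map<string, RegisteredTool>();

  // Binds a tool name to its definition; rejects accidental duplicates.
  register(tool: RegisteredTool): void {
    if (this.tools.has(tool.name)) {
      throw new Error(`Tool "${tool.name}" is already registered.`);
    }
    this.tools.set(tool.name, tool);
  }

  // Everything the Agent needs to hand to the LLM.
  list(): RegisteredTool[] {
    return Array.from(this.tools.values());
  }

  // Looks up a tool by name and runs it, reporting failures as text.
  async execute(name: string, args: Record<string, unknown>): Promise<string> {
    const tool = this.tools.get(name);
    if (!tool) return `Tool "${name}" not found.`;
    try {
      return await tool.execute(args);
    } catch (err) {
      return `Tool "${name}" failed: ${err instanceof Error ? err.message : String(err)}`;
    }
  }
}

// Hypothetical tool used only to exercise the registry.
const registry = new ToolRegistry();
registry.register({
  name: "take_screenshot",
  description: "Placeholder tool for this sketch.",
  parameters: { type: "object", properties: {} },
  execute: async () => "screenshot saved",
});

const registeredNames = registry.list().map((t) => t.name);
```

Adding the new tool touched only the registration call; a loop that dispatches through registry.execute(toolCall.name, toolCall.arguments) would not change at all.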