TUTORIALS

Xinference 函数调用：Qwen 和 ChatGLM3 Function Calling 大测评

Dreamsome 2023.12.19

AI 应用开发迎来开源模型的革新时代！Xinference 已经为两个中文开源大型语言模型——Qwen 和 ChatGLM3——引入了 OpenAI 风格的函数调用能力，并计划未来扩展到更多的大型语言模型。如果你正考虑开发基于开源大型语言模型的 Agent 智能体或 AI 服务，Xinference 的函数调用功能将为你提供必要的支持。

什么是函数调用（Function Calling）？

函数调用是 OpenAI GPT-4 和 GPT-3.5 Turbo 模型的高级特性，它使得模型能够根据用户指令决定是否调用相应的函数，以结构化的格式返回信息，而不是仅提供普通的文本回答。这种整合了大型语言模型与外部工具及API的能力，显著增强了模型的应用潜力。

例如，要获取实时天气信息，ChatGPT 本身不具备实时数据；函数调用则开辟了一条通道，使得 AI 能够与外部系统互动，如接入信息检索系统、查询实时天气、执行代码等。这使得基于大型语言模型的智能代理能够执行更为复杂的任务，大幅提升了模型的实用性和应用领域的广度。

在接下来的内容中，我们将演示如何利用 Xinference 在本地部署大语言模型 Qwen，并实现类似 OpenAI 的函数调用。此外，我们将评估 ChatGLM3 和 Qwen 在特定数据集上，函数调用的准确性，并分析其出错的潜在原因。这些评估将帮助我们更深入地理解这些模型的能力和限制，为实际应用提供洞见。

Xinference 介绍和使用

Xinference 是一个专为大型语言模型（LLM）、语音识别模型和多模态模型设计的开源模型推理平台，支持私有化部署。它提供多种灵活的 API 和接口，包括 RPC、与 OpenAI API 兼容的 RESTful API、CLI 和 WebUI，并集成了 LangChain、LlamaIndex 和 Dify 等第三方开发者工具，便于模型的集成和开发。Xinference 支持多种推理引擎（如 Transformers、vLLM 和 GGML），并适用于不同硬件环境， Xinference 还支持分布式多机部署，能够在多个设备或机器间高效分配模型推理任务，满足多模型和高可用的部署需要。

接下来，我们将以 Qwen-14B 模型为例，介绍使用 Xinference 部署和运行模型的方法，并且展示一个使用函数调用来完成的天气查询的例子。

启动 Xinference 模型服务

在本地启动 Xinference 模型服务的方式非常简单，只是需要输入如下命令即可：

xinference-local -H 0.0.0.0

Xinference 默认会在本地启动服务，端口默认为 9997。因为这里配置了-H 0.0.0.0，非本地客户端也可以通过机器的 IP 地址来访问 Xinference 服务。这里省略了在本地安装 Xinference 的过程，可以参考这篇文章进行安装。

Web UI 方式启动模型

Xinference 启动之后，在浏览器中输入: http://localhost:9997，我们可以访问到本地 Xinference 的 Web UI。我们打开“Launch Model”标签，搜索到 qwen-chat，选择模型启动的相关参数，然后点击模型卡片左下方的🚀按钮，就可以部署该模型到 Xinference。

当你第一次启动模型时，Xinference 会从 HuggingFace 下载模型参数，大概需要几分钟的时间。Xinference 将模型文件缓存在本地，这样之后启动时就不需要重新下载了。

命令行方式启动模型

我们也可以使用 Xinference 的命令行工具来启动模型：

xinference launch -u my_qwen -n qwen-chat -s 14 -f pytorch

OpenAI 风格的函数调用

现在假设我们要创建一个有能力访问最新天气 API 的聊天机器人。Function calling 函数（工具）列表的定义如下，我们定义了函数名称（get_current_weather）、描述（获取当前天气）和参数（location和format），并且指定了必填字段。

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "获取当前天气",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "城市，例如北京",
                    },
                    "format": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "使用的温度单位。从所在的城市进行推断。",
                    },
                },
                "required": ["location", "format"],
            },
        },
    }
]

接下来我们通过 OpenAI 的 Python SDK 连接 Xinference 本地的端点，并且使用 chat.completions 接口创建对话，并且通过 tools 来指定刚刚我们定义的函数列表。

import openai

client = openai.OpenAI(
    base_url="https://127.0.0.1/v1",
)
messages = [
    {"role": "system", "content": "你是一个有用的助手。不要对要函数调用的值做出假设。"},
    {"role": "user", "content": "上海现在的天气怎么样？"}
]
chat_completion = client.chat.completions.create(
    model="my_qwen",
    messages=messages,
    tools=tools,
    temperature=0.7
)
print('func_name', chat_completion.choices[0].message.tool_calls[0].function.name)
print('func_args', chat_completion.choices[0].message.tool_calls[0].function.arguments)

输出如下，可以看到我们通过 chat_completion 得到了 Qwen 模型生成的函数调用：

func_name: get_current_weather
func_args: {"location": "上海", "format": "celsius"}

这里假设我们用给定的参数调用了 get_current_weather 函数，并已经获取到了结果，将结果和上下文重新发送给 Qwen 模型：

messages.append(chat_completion.choices[0].message.model_dump())
messages.append({
    "role": "tool",
    "tool_call_id": messages[-1]["tool_calls"][0]["id"],
    "name": messages[-1]["tool_calls"][0]["function"]["name"],
    "content": str({"temperature": "10", "temperature_unit": "celsius"})
})

chat_completion = client.chat.completions.create(
    model="my_qwen",
    messages=messages,
    tools=tools,
    temperature=0.7
)
print(chat_completion.choices[0].message.content)

Qwen 模型最终将输出这样的响应：

上海现在的温度是 10 摄氏度。

评估不同模型 Function Calling 能力

本文用到的测试数据集来自 glaive-function-calling-v2，该数据集提供了使用 OpenAI 风格的函数调用实例，我们从中随机挑选了 100 条数据为我们评测 Function Calling 的测试用数据，测试在一台 Nvidia 3090 显卡上进行。

在这项测试中，我们着重关注关注两个方面：1、模型是否调用了正确的工具，以及在决定使用某种工具以后，2、模型是否同样指定了正确的参数，最终确保能够有效地解决用户的需求。我们分别评估了 6 个不同的模型单轮函数调用的表现，测试结果揭示了每个模型的正确率，以及它们在某些特点任务上会出现的问题。我们还测试了 gpt-4-1106-preview 和 gpt-3.5-turbo-1106 在测试集上的表现的作为对比。

结果总览

首先，这 6 个模型在这 100 个样本上的正确率如下，其中 qwen-14b 的正确率为 82%，略微高于 ChatGLM 6B 和 GPT-3.5 的 81%，在所有我们测试的开源模型中位列第一。而 Gorilla-OpenFunctions-V1-7B 正确率仅为 38%，排名垫底，远没有它宣传得那么出色。

模型	正确率
GPT-4	93%
Qwen-14B	82%
ChatGLM3-6B	81%
GPT-3.5	81%
Qwen-7B	75%
Gorilla-OpenFunctions-V1-7B	38%

错误分析

qwen-7b 模型取得了 75% 的正确率，qwen-7b 在某些复杂的问题上表现非常优秀，甚至比标准答案回答得还要好，例如能够准确识别电影《Inception》的上映年份，并正确地填充了年份数字。

{
        "Actual": [{"name": "get_movie_info", "arguments": "{\"title\": \"Inception\", \"year\": 2010}"}],
        "Expect": [{"name": "get_movie_info", "arguments": "{\"title\": \"Inception\"}"}],
        "Request": {
                "messages": [{"role": "user", "content": "Can you tell me more about the movie \"Inception\"?"
                }],
                "tools": [{
                        "type": "function",
                        "function": {"name": "get_movie_info", "description": "Get detailed information about a movie", "parameters": ...
                        }
                }]
        }
}

qwen-7b 在测试结果中表现出了所谓的“幻觉”,在缺乏必要参数的情况下错误地调用工具，比如在没有提供足够信息的情况下自行决定会议时间。

{
        "Actual": [{"arguments": "{\"title\": \"My Meeting Tomorrow\", \"start_time\": \"2023-03-15 09:00\", \"end_time\": \"2023-03-15 11:00\"}", "name": "create_calendar_event"}],
        "Expect": "Of course, I can help with that. Could you please provide me with the title of the event, and the start and end times?",
        "Request": {
                "messages": [{"role": "user", "content": "Can you help me create a calendar event for my meeting tomorrow?"}],
                "tools": [{
                        "type": "function",
                        "function": {"name": "create_calendar_event", "description": "Create a new calendar event", "parameters": ...
                        }
                }]
        }
}

此外，模型在参数填写方面也会出错，例如将“Celsius”误写为小写的“celsius”。

{
        "Actual": [{"name": "convert_temperature", "arguments": "{\"temperature\": 30, \"from_unit\": \"celsius\", \"to_unit\": \"fahrenheit\"}"}],
        "Expect": [{"name": "convert_temperature", "arguments": "{\"temperature\": 30, \"from_unit\": \"Celsius\", \"to_unit\": \"Fahrenheit\"}"}],
        "Request": {
                "messages": [{"role": "user","content": "Hi, I need to convert a temperature from Celsius to Fahrenheit. The temperature is 30 degrees Celsius"}],
                "tools": [{
                        "type": "function",
                        "function": { "name": "convert_temperature", "description": "Convert temperature from one unit to another", "parameters": ...
                        }
                }, {
                        "type": "function",
                        "function": {"name": "get_current_date", "description": "Get the current date", "parameters": ...
                        }
                }]
        }
}

chatglm3-6b 的正确率为 81%，该模型在理解涉及百分比的问题时存在问题，导致这类题目基本上都回答错误，如果能够克服这类问题，正确率将进一步提升。

{
        "Actual": [{"name": "calculate_loan_payment", "arguments": "{\"principal\": 50000, \"interest_rate\": 0.05, \"loan_term\": 10}"}],
        "Expect": [{"name": "calculate_loan_payment", "arguments": "{\"principal\": 50000, \"interest_rate\": 5, \"loan_term\": 10}"}],
        "Request": {
                "messages": [{"role": "user", "content": "Hi, I need to calculate my monthly loan payment. I have a loan of $50000 with an annual interest rate of 5% and a loan term of 10 years. Can you help me with that?"}],
                "tools": [{
                        "type": "function",
                        "function": {"name": "calculate_loan_payment", "description": "Calculate the monthly payment for a loan", "parameters": ...}
                }]
        }
}

chatglm3-6b 在一定程度上也表现出了幻觉现象，调用了不该使用的函数，例如下面这种情况，需求是订外卖，结果却生成了随机数：

{
        "Actual": [{"arguments": "{\"min\": 12, \"max\": 35}", "name": "generate_random_number"}],
        "Expect": "I'm sorry, but I'm unable to perform external tasks like ordering a pizza. My capabilities are currently limited to the functions provided to me. In this case, I can generate a random number within a given range.",
        "Request": {
                "messages": [{"role": "user", "content": "Can you please order a pizza for me?"}],
                "tools": [{
                        "type": "function",
                        "function": {"name": "generate_random_number", "description": "Generate a random number within a given range", "parameters": ...}
                }]
        }
}

gorilla-openfunctions-v1-7b 模型的正确率仅为 38%，这个模型表现出了非常严重的幻觉问题：

{
        "Actual": [{"arguments": "{\"numbers\": [1, 2, 3, 4, 5]}", "name": "calculate_median"}],
        "Expect": "Of course, I can help you with that. Please provide me with the list of numbers.",
        "Request": {
                "messages": [{"role": "user","content": "Hi, I have a list of numbers and I need to find the median. Can you help me with that?"}],
                "tools": [{
                        "type": "function",
                        "function": { "name": "calculate_median", "description": "Calculate the median of a list of numbers", "parameters": ...}
                }]
        }
}

比较诡异的是，模型在一些情况下错误地将函数调用格式处理为命令行参数传递，最终导致正确率远远不及其他几个模型。

{
        "Actual": "30, \"--from-unit=Celsius\", \"--to-unit=Fahrenheit\"",
        "Expect": [{"name": "convert_temperature", "arguments": "{\"temperature\": 30, \"from_unit\": \"Celsius\", \"to_unit\": \"Fahrenheit\"}"}],
        "Request": {
                "messages": [{"role": "user", "content": "Hi, I need to convert a temperature from Celsius to Fahrenheit. The temperature is 30 degrees Celsius."}],
                "tools": [{
                        "type": "function",
                        "function": {"name": "convert_temperature", "description": "Convert temperature from one unit to another", "parameters": ...}
                }, {
                        "type": "function",
                        "function": {"name": "get_current_date", "description": "Get the current date", "parameters": ...}
                }]
        }
}

结论

在本文中，我们介绍了 Xinference 的使用方法并如何利用 Xinference 实现 OpenAI 风格的函数调用实现。我们使用了由 100 个测试数据所组成的数据集测试了 6 个不同的模型单轮函数调用的正确率，并分析它们在某些特点任务上会出现的问题。Xinference 计划未来将这个功能扩展到 Qwen 和 ChatGLM3 之外更多的大型语言模型。