Llama Stack API (Experimental)#
Warning
Support for the Llama Stack API in NIM is experimental!
The Llama Stack API is a comprehensive set of interfaces developed by Meta for ML developers building applications on top of Llama foundation models. The API aims to standardize interactions with Llama models, streamline the developer experience, and foster innovation across the Llama ecosystem. Llama Stack covers every component of the model lifecycle, including inference, fine-tuning, evaluation, and synthetic data generation.
With the Llama Stack API, developers can easily integrate Llama models into their applications, take advantage of tool-calling capabilities, and build sophisticated AI systems. This document outlines how to use the Python bindings for the Llama Stack API, focusing on chat completions and tool use.
For complete API documentation and source code, visit the Llama Stack GitHub repository.
Installation#
To get started with the Llama Stack API, install the necessary packages. You can do this with pip:
pip install llama-toolchain llama-models llama-agentic-system
These packages provide the core functionality for working with the Llama Stack API.
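To confirm that the packages installed correctly, one option is to query their versions with importlib.metadata (a standard Python mechanism, not part of the Llama Stack API):
# Optional sanity check that the packages are installed.
from importlib.metadata import version

for pkg in ("llama-toolchain", "llama-models", "llama-agentic-system"):
    print(pkg, version(pkg))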
Common Components#
The following examples store shared components in a file named inference.py. This file contains the InferenceClient class and utility functions used across the different examples. Here are the contents of inference.py:
import json
from typing import Union, Generator

import requests

from llama_toolchain.inference.api import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ChatCompletionResponseStreamChunk
)


class InferenceClient:
    def __init__(self, base_url: str):
        self.base_url = base_url

    def chat_completion(self, request: ChatCompletionRequest) -> Generator[Union[ChatCompletionResponse, ChatCompletionResponseStreamChunk], None, None]:
        url = f"{self.base_url}/inference/chat_completion"
        payload = json.loads(request.json())

        response = requests.post(
            url,
            json=payload,
            headers={"Content-Type": "application/json"},
            stream=request.stream
        )

        if response.status_code != 200:
            raise Exception(f"Error: HTTP {response.status_code} {response.text}")

        if request.stream:
            # Streaming responses arrive as server-sent events prefixed with "data: "
            for line in response.iter_lines():
                if line:
                    line = line.decode('utf-8')
                    if line.startswith('data: '):
                        data = json.loads(line[6:])
                        yield ChatCompletionResponseStreamChunk(**data)
        else:
            response_data = response.json()
            # Handle potential None values in tool_calls
            if 'completion_message' in response_data and 'tool_calls' in response_data['completion_message']:
                tool_calls = response_data['completion_message']['tool_calls']
                if tool_calls is not None:
                    for tool_call in tool_calls:
                        if 'arguments' in tool_call and tool_call['arguments'] is None:
                            tool_call['arguments'] = ''  # Replace None with empty string
            yield ChatCompletionResponse(**response_data)


def process_chat_completion(response: Union[ChatCompletionResponse, ChatCompletionResponseStreamChunk]):
    if isinstance(response, ChatCompletionResponse):
        print("Response content:", response.completion_message.content)
        if response.completion_message.tool_calls:
            print("Tool calls:")
            for tool_call in response.completion_message.tool_calls:
                print(f"  Tool: {tool_call.tool_name}")
                print(f"  Arguments: {tool_call.arguments}")
    elif isinstance(response, ChatCompletionResponseStreamChunk):
        print(response.event.delta, end='', flush=True)
        if response.event.stop_reason:
            print(f"\nStop reason: {response.event.stop_reason}")
Basic Usage#
Use these common components in the following basic usage example:
from inference import InferenceClient, process_chat_completion
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
from llama_models.llama3.api.datatypes import SamplingParams


def chat():
    client = InferenceClient("http://0.0.0.0:8000/experimental/ls")

    message = UserMessage(content="Explain the concept of recursion in programming.")
    request = ChatCompletionRequest(
        model="meta/llama-3.1-70b-instruct",
        messages=[message],
        stream=False,
        sampling_params=SamplingParams(
            max_tokens=1024
        )
    )

    for response in client.chat_completion(request):
        process_chat_completion(response)


if __name__ == "__main__":
    chat()
Streaming Responses#
For streaming responses, use the same structure:
from inference import InferenceClient, process_chat_completion
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
from llama_models.llama3.api.datatypes import SamplingParams


def stream_chat():
    client = InferenceClient("http://0.0.0.0:8000/experimental/ls")

    message = UserMessage(content="Write a short story about a time-traveling scientist.")
    request = ChatCompletionRequest(
        model="meta/llama-3.1-70b-instruct",
        messages=[message],
        stream=True,
        sampling_params=SamplingParams(
            max_tokens=1024
        )
    )

    for response in client.chat_completion(request):
        process_chat_completion(response)


if __name__ == "__main__":
    stream_chat()
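process_chat_completion prints each delta as it arrives. If you need the complete text after the stream finishes, one option is to accumulate the deltas yourself. The collect_stream helper below is an illustrative sketch (not part of inference.py) and assumes a plain text response; non-string deltas are skipped:
from inference import InferenceClient
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
from llama_models.llama3.api.datatypes import SamplingParams


def collect_stream(client: InferenceClient, request: ChatCompletionRequest) -> str:
    # Concatenate the text deltas from each streamed chunk into a single string.
    parts = []
    for chunk in client.chat_completion(request):
        delta = chunk.event.delta
        if isinstance(delta, str):  # Skip non-text deltas (e.g. tool call fragments)
            parts.append(delta)
    return "".join(parts)


if __name__ == "__main__":
    client = InferenceClient("http://0.0.0.0:8000/experimental/ls")
    request = ChatCompletionRequest(
        model="meta/llama-3.1-70b-instruct",
        messages=[UserMessage(content="Write a haiku about recursion.")],
        stream=True,
        sampling_params=SamplingParams(max_tokens=128)
    )
    print(collect_stream(client, request))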
Tool Calling#
The Llama Stack API supports tool calling, which allows the model to interact with external functions.
Important
Unlike the OpenAI API, the Llama Stack API only supports the tool choices "auto", "required", or None.
from inference import InferenceClient, process_chat_completion
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage, ToolDefinition, ToolParamDefinition
from llama_models.llama3.api.datatypes import SamplingParams, ToolChoice

weather_tool = ToolDefinition(
    tool_name="get_current_weather",
    description="Get the current weather for a location",
    parameters={
        "location": ToolParamDefinition(
            param_type="string",
            description="The city and state, e.g. San Francisco, CA",
            required=True
        ),
        "unit": ToolParamDefinition(
            param_type="string",
            description="The temperature unit (celsius or fahrenheit)",
            required=True
        )
    }
)


def tool_calling_example():
    client = InferenceClient("http://0.0.0.0:8000/experimental/ls")

    message = UserMessage(content="Get me the weather in New York City, NY.")
    request = ChatCompletionRequest(
        model="meta/llama-3.1-8b-instruct",
        messages=[message],
        tools=[weather_tool],
        tool_choice=ToolChoice.auto,
        sampling_params=SamplingParams(
            max_tokens=200
        )
    )

    for response in client.chat_completion(request):
        process_chat_completion(response)


if __name__ == "__main__":
    tool_calling_example()
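The model only emits the tool call; executing it is up to your application. The sketch below shows one way to dispatch the returned tool_name and arguments to a local Python function. The get_current_weather implementation is a hypothetical stand-in, and the sketch assumes a custom tool whose name is a plain string; in a full tool-use loop you would also send the function's result back to the model in a follow-up request:
import json


def get_current_weather(location: str, unit: str) -> str:
    # Hypothetical stand-in; replace with a real weather lookup.
    return json.dumps({"location": location, "unit": unit, "temperature": 22})


# Map tool names (as declared in ToolDefinition) to local implementations.
LOCAL_TOOLS = {"get_current_weather": get_current_weather}


def dispatch_tool_call(tool_call) -> str:
    # tool_call.tool_name and tool_call.arguments come from
    # response.completion_message.tool_calls (see process_chat_completion).
    args = tool_call.arguments
    if isinstance(args, str):  # Arguments may arrive as a JSON string or a dict
        args = json.loads(args) if args else {}
    return LOCAL_TOOLS[tool_call.tool_name](**args)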