Audio2Face-3D Microservice#

Bidirectional gRPC Communication#


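The A2F-3D microservice receives its data over a bidirectional streaming gRPC. The data consists of:

  1. An audio stream header, which contains information about the upcoming audio data as well as face parameters, post-processing options, blendshape parameters, and emotion parameters.

  2. Audio data, together with emotion data with timecodes that indicate when each emotion starts to be applied.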
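
The ordering described above (one header first, then the audio data) can be sketched as a Python generator. This is only an illustration of the message order: `StreamHeader`, `AudioChunk`, and `request_stream` are hypothetical stand-ins, not the message types defined in the Audio2Face-3D protocol buffers, and the one-second chunk size is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Iterator, Union


@dataclass
class StreamHeader:
    """Hypothetical stand-in for the audio stream header."""
    sample_rate_hz: int
    channels: int
    bits_per_sample: int


@dataclass
class AudioChunk:
    """Hypothetical stand-in for one audio data message."""
    pcm_bytes: bytes


def request_stream(
    pcm: bytes, sample_rate_hz: int, chunk_seconds: float = 1.0
) -> Iterator[Union[StreamHeader, AudioChunk]]:
    """Yield the header once, then the mono 16-bit PCM audio in chunks."""
    yield StreamHeader(sample_rate_hz=sample_rate_hz, channels=1, bits_per_sample=16)
    bytes_per_second = sample_rate_hz * 2  # mono, 2 bytes per sample
    # Keep the chunk size even so that a 16-bit sample is never split.
    step = max(2, int(bytes_per_second * chunk_seconds) // 2 * 2)
    for start in range(0, len(pcm), step):
        yield AudioChunk(pcm[start:start + step])
```

With a real client, a generator of this shape would typically be passed as the request iterator of the streaming call, with the hypothetical classes replaced by the generated protobuf messages.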

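The sample runtime configuration file for gRPC interaction documents the parameters that can be sent as part of a gRPC request. You can find it in our GitHub repository. A sample is also provided below.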


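This is a runtime configuration file. Although it looks similar to the deployment-time configuration file, the two should not be confused: they differ in key casing (snake_case versus camelCase) and in structure. See our examples for guidance on A2F-3D NIM Manual Container Deployment and Configuration.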

# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# See the License for the specific language governing permissions and
# limitations under the License.

# Parameters related to the face
  # Controls the range of motion on the upper regions of the face
  # Min value: 0.0
  # Max value: 2.0
  upperFaceStrength: 1
  # Applies temporal smoothing to the upper face motion
  # Min value: 0.0
  # Max value: 0.1
  upperFaceSmoothing: 0.001
  # Controls the range of motion on the lower regions of the face
  # Min value: 0.0
  # Max value: 2.0
  lowerFaceStrength: 1.2
  # Applies temporal smoothing to the lower face motion
  # Min value: 0.0
  # Max value: 0.1
  lowerFaceSmoothing: 0.006
  # Determines the boundary between the upper and lower regions of the face
  # Min value: 0.0
  # Max value: 1.0
  faceMaskLevel: 0.6
  # Determines how smoothly the upper and lower face regions blend on the boundary
  # Min value: 0.001
  # Max value: 0.5
  faceMaskSoftness: 0.0085
  # Controls the range of motion of the skin.
  # Min value: 0.0
  # Max value: 2.0
  skinStrength: 1.0
  # Adjusts the default pose of eyelid open-close
  # Min value: -1.0
  # Max value: 1.0
  eyelidOpenOffset: 0.06
  # Adjusts the default pose of lip close-open
  # Min value: -0.02
  # Max value: 0.2
  lipOpenOffset: -0.02

# contains multipliers and offsets to be applied on the inference result
# For more information about blendshapes see:
  enable_clamping_bs_weight: False
    EyeBlinkLeft: 1.0
    EyeLookDownLeft: 0.0
    EyeLookInLeft: 0.0
    EyeLookOutLeft: 0.0
    EyeLookUpLeft: 0.0
    EyeSquintLeft: 1.0
    EyeWideLeft: 1.0
    EyeBlinkRight: 1.0
    EyeLookDownRight: 0.0
    EyeLookInRight: 0.0
    EyeLookOutRight: 0.0
    EyeLookUpRight: 0.0
    EyeSquintRight: 1.0
    EyeWideRight: 1.0
    JawForward: 0.7
    JawLeft: 0.2
    JawRight: 0.2
    JawOpen: 0.8
    MouthClose: 0.3
    MouthFunnel: 1.0
    MouthPucker: 1.0
    MouthLeft: 0.2
    MouthRight: 0.2
    MouthSmileLeft: 1.2
    MouthSmileRight: 1.2
    MouthFrownLeft: 0.5
    MouthFrownRight: 0.5
    MouthDimpleLeft: 0.8
    MouthDimpleRight: 0.8
    MouthStretchLeft: 0.05
    MouthStretchRight: 0.05
    MouthRollLower: 0.8
    MouthRollUpper: 0.5
    MouthShrugLower: 1.0
    MouthShrugUpper: 0.4
    MouthPressLeft: 0.8
    MouthPressRight: 0.8
    MouthLowerDownLeft: 0.8
    MouthLowerDownRight: 0.8
    MouthUpperUpLeft: 0.8
    MouthUpperUpRight: 0.8
    BrowDownLeft: 1.2
    BrowDownRight: 1.2
    BrowInnerUp: 1.3
    BrowOuterUpLeft: 0.8
    BrowOuterUpRight: 0.8
    CheekPuff: 0.2
    CheekSquintLeft: 1.0
    CheekSquintRight: 1.0
    NoseSneerLeft: 0.8
    NoseSneerRight: 0.8
    TongueOut: 0.0
    EyeBlinkLeft: 0.0
    EyeLookDownLeft: 0.0
    EyeLookInLeft: 0.0
    EyeLookOutLeft: 0.0
    EyeLookUpLeft: 0.0
    EyeSquintLeft: 0.0
    EyeWideLeft: 0.0
    EyeBlinkRight: 0.0
    EyeLookDownRight: 0.0
    EyeLookInRight: 0.0
    EyeLookOutRight: 0.0
    EyeLookUpRight: 0.0
    EyeSquintRight: 0.0
    EyeWideRight: 0.0
    JawForward: 0.0
    JawLeft: 0.0
    JawRight: 0.0
    JawOpen: 0.0
    MouthClose: 0.0
    MouthFunnel: 0.0
    MouthPucker: 0.0
    MouthLeft: 0.0
    MouthRight: 0.0
    MouthSmileLeft: 0.0
    MouthSmileRight: 0.0
    MouthFrownLeft: 0.0
    MouthFrownRight: 0.0
    MouthDimpleLeft: 0.0
    MouthDimpleRight: 0.0
    MouthStretchLeft: 0.0
    MouthStretchRight: 0.0
    MouthRollLower: 0.0
    MouthRollUpper: 0.0
    MouthShrugLower: 0.0
    MouthShrugUpper: 0.0
    MouthPressLeft: 0.0
    MouthPressRight: 0.0
    MouthLowerDownLeft: 0.0
    MouthLowerDownRight: 0.0
    MouthUpperUpLeft: 0.0
    MouthUpperUpRight: 0.0
    BrowDownLeft: 0.0
    BrowDownRight: 0.0
    BrowInnerUp: 0.0
    BrowOuterUpLeft: 0.0
    BrowOuterUpRight: 0.0
    CheekPuff: 0.0
    CheekSquintLeft: 0.0
    CheekSquintRight: 0.0
    NoseSneerLeft: 0.0
    NoseSneerRight: 0.0
    TongueOut: 0.0

# Parameter related to temporal smoothing of emotion
# Expected value range (0, inf]
live_transition_time: 0.0001

# Parameter that sets the beginning emotion in A2E engine
# Can provide any subset of the following params
# Set 'enable_preferred_emotion=false' when testing this
  amazement: 1.0
  anger: 0.0
  # cheekiness: 0.0
  disgust: 0.0
  fear: 1.0
  # grief: 0.0
  # joy: 0.0
  outofbreath: 0.0
  pain: 1.0
  sadness: 0.0

# Parameters related to the post-processing of emotions.
  # Increases the spread between emotion values by pushing them higher or lower
  emotion_contrast: 1.0
  # Coefficient for smoothing emotions over time
  live_blend_coef: 0.7
  # Tells the A2F pipeline whether to use the emotion weights defined under this section
  enable_preferred_emotion: false
  # Sets the strength of the preferred emotion (if it is loaded) relative to generated emotions
  preferred_emotion_strength: 0.5
  # Sets the strength of emotions relative to neutral emotion
  emotion_strength: 0.6
  # Sets a firm limit on the quantity of emotion sliders engaged by A2E - emotions with highest weight will be prioritized
  max_emotions: 3

# Manual emotion input
# Below is a list of `emotion with timecode` entries
# You can add more of them if needed.
# An emotion with timecode is an emotion that is applied at a specific timecode
# Emotions must be between 0.0 and 1.0
# Timecodes must be >= 0.0
  # this first emotion with timecode will apply joy at the very beginning of the
  # audio clip
    time_code: 0.0
      amazement: 0.0
      anger: 0.0
      cheekiness: 0.0
      disgust: 0.0
      fear: 0.0
      grief: 0.0
      joy: 1.0
      outofbreath: 0.0
      pain: 0.0
      sadness: 0.0
  # this second emotion with timecode will apply fear after 1 second of audio in the
  # audio clip
    time_code: 1.0
      amazement: 0.0
      anger: 0.0
      cheekiness: 0.0
      disgust: 0.0
      fear: 1.0
      grief: 0.0
      joy: 0.0
      outofbreath: 0.0
      pain: 0.1
      sadness: 0.0
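
The constraints stated in the comments of the sample above (emotion values between 0.0 and 1.0, timecodes greater than or equal to 0.0) can be checked before a request is sent. The following is a minimal sketch; `validate_emotion_with_timecode` is a hypothetical helper, not part of the microservice or its sample client, and the set of emotion names is simply copied from the sample configuration.

```python
from typing import Dict, List

# Emotion names as they appear in the sample runtime configuration.
EMOTION_NAMES = frozenset({
    "amazement", "anger", "cheekiness", "disgust", "fear",
    "grief", "joy", "outofbreath", "pain", "sadness",
})


def validate_emotion_with_timecode(time_code: float, emotions: Dict[str, float]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the entry is valid."""
    problems = []
    if time_code < 0.0:
        problems.append(f"time_code {time_code} must be >= 0.0")
    for name, value in emotions.items():
        if name not in EMOTION_NAMES:
            problems.append(f"unknown emotion '{name}'")
        elif not 0.0 <= value <= 1.0:
            problems.append(f"{name}={value} must be between 0.0 and 1.0")
    return problems
```

For example, the second entry of the sample (`time_code: 1.0` with `fear: 1.0` and `pain: 0.1`) passes this check, while a negative timecode or a value such as `joy: 1.5` would be reported.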

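Currently, only mono 16-bit PCM audio is supported. Any sample rate is accepted, but 16 kHz is recommended for best performance. The minimum and maximum accepted sample rates are defined in the microservice's configuration file, but good results are not guaranteed outside the default values.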
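
As an illustration of the format requirement above, a WAV file can be checked with Python's standard `wave` module before its audio is streamed. This is a sketch under the assumption that the input is a WAV file; `check_wav` is a hypothetical helper, and the 16 kHz value is used only as the recommended rate, not as a hard limit.

```python
import wave
from typing import List

RECOMMENDED_SAMPLE_RATE_HZ = 16000


def check_wav(source) -> List[str]:
    """Check a WAV file (path or binary file object) against the expected format.

    Returns a list of findings; an empty list means mono 16-bit PCM at 16 kHz.
    """
    findings = []
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 1:
            findings.append(f"expected mono audio, found {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:
            findings.append(f"expected 16-bit samples, found {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() != RECOMMENDED_SAMPLE_RATE_HZ:
            # Other rates are accepted by the service; this is only a hint.
            findings.append(f"sample rate is {wav.getframerate()} Hz, 16000 Hz is recommended")
    return findings
```

A stereo or 24-bit file would need to be converted before streaming, whereas a different sample rate is reported here only as a recommendation.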


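The A2F-3D microservice outputs the following over the bidirectional streaming gRPC:

  1. An animation header, which contains information about blendshape names, audio output format, and so on.

  2. Blendshape data with timecodes, synchronized with the audio data.

For a detailed description of the gRPC protocol buffers, see the gRPC protocol buffers section.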


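The Audio2Face-3D microservice runs 30 inferences per second of audio.

This means that the output of Audio2Face-3D must be played back at 30 FPS.

However, the processing speed of Audio2Face-3D is not limited to 30 inferences per second. For example, if the Audio2Face-3D logs report 300 FPS for a stream, then 10 seconds of audio (300 frames / 30 FPS) are processed per second of compute time.

Therefore, you need to buffer the data received from Audio2Face-3D and play it back at 30 FPS. The high output rate helps prevent stutter caused by network instability.

Currently, Audio2Face-3D prints the FPS logs to stdout twice per second.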
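
The arithmetic and the buffering described above can be sketched as follows. This is a minimal illustration, not the client shipped with Audio2Face-3D: `PlaybackBuffer` and `audio_seconds_per_wall_second` are hypothetical names, and a frame can be any object that holds the blendshape data for one timecode.

```python
from collections import deque

PLAYBACK_FPS = 30.0  # Audio2Face-3D produces 30 frames per second of audio


def audio_seconds_per_wall_second(logged_fps: float) -> float:
    """Seconds of audio processed per second of compute time for a logged FPS value."""
    return logged_fps / PLAYBACK_FPS


class PlaybackBuffer:
    """Store frames as fast as they arrive and release them at 30 FPS."""

    def __init__(self) -> None:
        self._pending = deque()
        self._played = 0

    def push(self, frame) -> None:
        """Add one received frame to the buffer."""
        self._pending.append(frame)

    def pop_due(self, elapsed_seconds: float) -> list:
        """Return the buffered frames that are due after `elapsed_seconds` of playback."""
        due_total = int(elapsed_seconds * PLAYBACK_FPS)
        due = []
        while self._pending and self._played < due_total:
            due.append(self._pending.popleft())
            self._played += 1
        return due
```

A render loop would call `pop_due` with the time elapsed since playback started, so frames that arrived in a burst are still shown at the intended rate.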




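Audio2Face-3D outputs blendshapes. For more information, see the ARKit blendShape documentation.

Audio2Face-3D does not animate head, tongue, or eye movement.

The following blendshape values are always 0 in the Audio2Face-3D output: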

  • EyeLookDownRight

  • EyeLookInRight

  • EyeLookOutRight

  • EyeLookUpRight

  • EyeLookDownLeft

  • EyeLookInLeft

  • EyeLookOutLeft

  • EyeLookUpLeft

  • TongueOut

  • HeadRoll

  • HeadPitch

  • HeadYaw
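
Because the blendshapes listed above are always 0, a client may choose to skip them, for example when another system drives eye or head motion. The sketch below assumes that blendshape names and the weights of one frame are available as two parallel sequences; that layout and the `active_weights` helper are hypothetical and not taken from the protocol buffers.

```python
from typing import Dict, Sequence

# Blendshapes that are always 0 in the Audio2Face-3D output (from the list above).
ALWAYS_ZERO = frozenset({
    "EyeLookDownRight", "EyeLookInRight", "EyeLookOutRight", "EyeLookUpRight",
    "EyeLookDownLeft", "EyeLookInLeft", "EyeLookOutLeft", "EyeLookUpLeft",
    "TongueOut", "HeadRoll", "HeadPitch", "HeadYaw",
})


def active_weights(names: Sequence[str], weights: Sequence[float]) -> Dict[str, float]:
    """Map blendshape names to weights, dropping the ones that are always 0."""
    return {name: weight for name, weight in zip(names, weights) if name not in ALWAYS_ZERO}
```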


The definition of the MouthClose blendshape deviates from the standard ARKit version: the shape includes the opening of the jaw.



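Although this unidirectional type of communication is supported, we recommend that first-time users of Audio2Face-3D use the bidirectional endpoint.

In this mode, the bidirectional streaming gRPC is split into two client-streaming gRPCs, with Audio2Face-3D acting as both a server and a client.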


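The A2F-3D microservice receives its data over a client-streaming RPC. The data consists of:

  1. An audio stream header, which contains information about the upcoming audio data as well as face parameters, post-processing options, blendshape parameters, and emotion parameters.

  2. Audio data, together with emotion data with timecodes that indicate when each emotion starts to be applied.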


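The A2F-3D microservice sends its output data over a client-streaming RPC. The output data consists of:

  1. An animation header, which contains information about blendshape names, audio output format, and so on.

  2. Blendshape data with timecodes, synchronized with the audio data.

For a detailed description of the gRPC protocol buffers, see the gRPC protocol buffers section.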


See the dedicated Configuration page.