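Audio2Face-3D Microservice

Bidirectional gRPC Communication

Input

The A2F-3D microservice receives its data over a bidirectional streaming gRPC. The data consists of:

  1. An audio stream header, which carries information about the upcoming audio data along with face parameters, post-processing options, blendshape parameters, and emotion parameters.

  2. Audio data, together with emotion data with timecodes that mark when each emotion starts to be applied.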
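
The ordering described above (one header, then audio chunks carrying optional timed emotions) can be sketched as a Python generator. This is only an illustration: `AudioStreamHeader`, `AudioWithEmotion`, their fields, and `request_stream` are hypothetical stand-ins, not the actual protobuf messages or client API of the microservice. Attaching all timed emotions to the first chunk is likewise an arbitrary choice made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple, Union


@dataclass
class AudioStreamHeader:
    """Hypothetical stand-in for the header message (not the real protobuf type)."""
    sample_rate_hz: int
    face_parameters: Dict[str, float] = field(default_factory=dict)
    blendshape_parameters: Dict[str, float] = field(default_factory=dict)
    post_processing_parameters: Dict[str, float] = field(default_factory=dict)
    emotion_parameters: Dict[str, float] = field(default_factory=dict)


@dataclass
class AudioWithEmotion:
    """Hypothetical stand-in for one audio chunk plus optional timed emotions."""
    audio_buffer: bytes
    # Each entry is (time_code_seconds, {emotion_name: weight}).
    emotions: List[Tuple[float, Dict[str, float]]] = field(default_factory=list)


def request_stream(
    header: AudioStreamHeader,
    pcm: bytes,
    chunk_bytes: int,
    emotions: List[Tuple[float, Dict[str, float]]],
) -> Iterator[Union[AudioStreamHeader, AudioWithEmotion]]:
    """Yield the header first, then the audio split into fixed-size chunks."""
    yield header
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        # Illustrative choice: send every timed emotion along with the first chunk.
        yield AudioWithEmotion(chunk, emotions if start == 0 else [])
```

A real client would pass an iterator like this to the generated gRPC stub for the bidirectional endpoint, using the message types from the published protocol buffers instead of these dataclasses.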

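The sample runtime configuration file for gRPC interaction documents the different parameters that can be sent as part of a gRPC request. You can find it in our GitHub repository. A sample is also provided below.

Warning

This is a runtime configuration file. Although it looks similar to the deployment-time configuration file, the two must not be confused: they differ in casing (snake_case vs. camelCase) and in structure. See our examples for guidance on A2F-3D NIM Manual Container Deployment and Configuration.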

config_james.yml
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Parameters related to the face
face_parameters:
  # Controls the range of motion on the upper regions of the face
  # Min value: 0.0
  # Max value: 2.0
  upperFaceStrength: 1
  # Applies temporal smoothing to the upper face motion
  # Min value: 0.0
  # Max value: 0.1
  upperFaceSmoothing: 0.001
  # Controls the range of motion on the lower regions of the face
  # Min value: 0.0
  # Max value: 2.0
  lowerFaceStrength: 1.2
  # Applies temporal smoothing to the lower face motion
  # Min value: 0.0
  # Max value: 0.1
  lowerFaceSmoothing: 0.006
  # Determines the boundary between the upper and lower regions of the face
  # Min value: 0.0
  # Max value: 1.0
  faceMaskLevel: 0.6
  # Determines how smoothly the upper and lower face regions blend on the boundary
  # Min value: 0.001
  # Max value: 0.5
  faceMaskSoftness: 0.0085
  # Controls the range of motion of the skin.
  # Min value: 0.0
  # Max value: 2.0
  skinStrength: 1.0
  # Adjusts the default pose of eyelid open-close
  # Min value: -1.0
  # Max value: 1.0
  eyelidOpenOffset: 0.06
  # Adjusts the default pose of lip close-open
  # Min value: -0.02
  # Max value: 0.2
  lipOpenOffset: -0.02

# contains multipliers and offsets to be applied on the inference result
# For more information about blendshapes see:
# https://developer.apple.com/documentation/arkit/arfaceanchor/blendshapelocation
blendshape_parameters:
  enable_clamping_bs_weight: False
  multipliers:
    EyeBlinkLeft: 1.0
    EyeLookDownLeft: 0.0
    EyeLookInLeft: 0.0
    EyeLookOutLeft: 0.0
    EyeLookUpLeft: 0.0
    EyeSquintLeft: 1.0
    EyeWideLeft: 1.0
    EyeBlinkRight: 1.0
    EyeLookDownRight: 0.0
    EyeLookInRight: 0.0
    EyeLookOutRight: 0.0
    EyeLookUpRight: 0.0
    EyeSquintRight: 1.0
    EyeWideRight: 1.0
    JawForward: 0.7
    JawLeft: 0.2
    JawRight: 0.2
    JawOpen: 0.8
    MouthClose: 0.3
    MouthFunnel: 1.0
    MouthPucker: 1.0
    MouthLeft: 0.2
    MouthRight: 0.2
    MouthSmileLeft: 1.2
    MouthSmileRight: 1.2
    MouthFrownLeft: 0.5
    MouthFrownRight: 0.5
    MouthDimpleLeft: 0.8
    MouthDimpleRight: 0.8
    MouthStretchLeft: 0.05
    MouthStretchRight: 0.05
    MouthRollLower: 0.8
    MouthRollUpper: 0.5
    MouthShrugLower: 1.0
    MouthShrugUpper: 0.4
    MouthPressLeft: 0.8
    MouthPressRight: 0.8
    MouthLowerDownLeft: 0.8
    MouthLowerDownRight: 0.8
    MouthUpperUpLeft: 0.8
    MouthUpperUpRight: 0.8
    BrowDownLeft: 1.2
    BrowDownRight: 1.2
    BrowInnerUp: 1.3
    BrowOuterUpLeft: 0.8
    BrowOuterUpRight: 0.8
    CheekPuff: 0.2
    CheekSquintLeft: 1.0
    CheekSquintRight: 1.0
    NoseSneerLeft: 0.8
    NoseSneerRight: 0.8
    TongueOut: 0.0
  offsets:
    EyeBlinkLeft: 0.0
    EyeLookDownLeft: 0.0
    EyeLookInLeft: 0.0
    EyeLookOutLeft: 0.0
    EyeLookUpLeft: 0.0
    EyeSquintLeft: 0.0
    EyeWideLeft: 0.0
    EyeBlinkRight: 0.0
    EyeLookDownRight: 0.0
    EyeLookInRight: 0.0
    EyeLookOutRight: 0.0
    EyeLookUpRight: 0.0
    EyeSquintRight: 0.0
    EyeWideRight: 0.0
    JawForward: 0.0
    JawLeft: 0.0
    JawRight: 0.0
    JawOpen: 0.0
    MouthClose: 0.0
    MouthFunnel: 0.0
    MouthPucker: 0.0
    MouthLeft: 0.0
    MouthRight: 0.0
    MouthSmileLeft: 0.0
    MouthSmileRight: 0.0
    MouthFrownLeft: 0.0
    MouthFrownRight: 0.0
    MouthDimpleLeft: 0.0
    MouthDimpleRight: 0.0
    MouthStretchLeft: 0.0
    MouthStretchRight: 0.0
    MouthRollLower: 0.0
    MouthRollUpper: 0.0
    MouthShrugLower: 0.0
    MouthShrugUpper: 0.0
    MouthPressLeft: 0.0
    MouthPressRight: 0.0
    MouthLowerDownLeft: 0.0
    MouthLowerDownRight: 0.0
    MouthUpperUpLeft: 0.0
    MouthUpperUpRight: 0.0
    BrowDownLeft: 0.0
    BrowDownRight: 0.0
    BrowInnerUp: 0.0
    BrowOuterUpLeft: 0.0
    BrowOuterUpRight: 0.0
    CheekPuff: 0.0
    CheekSquintLeft: 0.0
    CheekSquintRight: 0.0
    NoseSneerLeft: 0.0
    NoseSneerRight: 0.0
    TongueOut: 0.0

# Parameter related to temporal smoothing of emotion
# Expected value range (0, inf)
live_transition_time: 0.0001

# Parameter that sets the beginning emotion in A2E engine
# Can provide any subset of the following params
# Set 'enable_preferred_emotion=false' when testing this
beginning_emotion:
  amazement: 1.0
  anger: 0.0
  # cheekiness: 0.0
  disgust: 0.0
  fear: 1.0
  # grief: 0.0
  # joy: 0.0
  outofbreath: 0.0
  pain: 1.0
  sadness: 0.0

# Parameters related to the post-processing of emotions.
post_processing_parameters:
  # Increases the spread between emotion values by pushing them higher or lower
  emotion_contrast: 1.0
  # Coefficient for smoothing emotions over time
  live_blend_coef: 0.7
  # Tells the A2F pipeline whether to use the emotion weights defined in this section
  enable_preferred_emotion: false
  # Sets the strength of the preferred emotion (if it is loaded) relative to generated emotions
  preferred_emotion_strength: 0.5
  # Sets the strength of emotions relative to neutral emotion
  emotion_strength: 0.6
  # Sets a firm limit on the quantity of emotion sliders engaged by A2E - emotions with highest weight will be prioritized
  max_emotions: 3

# Manual emotion input
# Below is a list of `emotion with timecode` entries
# You can add more of them if needed.
# An emotion with timecode is an emotion that is applied at a specific timecode
# Emotions must be between 0.0 and 1.0
# Timecodes must be >= 0.0
emotion_with_timecode_list:
  # this first emotion with timecode will apply joy at the very beginning of the
  # audio clip
  emotion_with_timecode1:
    time_code: 0.0
    emotions:
      amazement: 0.0
      anger: 0.0
      cheekiness: 0.0
      disgust: 0.0
      fear: 0.0
      grief: 0.0
      joy: 1.0
      outofbreath: 0.0
      pain: 0.0
      sadness: 0.0
  # this second emotion with timecode will apply fear after 1 second of audio in the
  # audio clip
  emotion_with_timecode2:
    time_code: 1.0
    emotions:
      amazement: 0.0
      anger: 0.0
      cheekiness: 0.0
      disgust: 0.0
      fear: 1.0
      grief: 0.0
      joy: 0.0
      outofbreath: 0.0
      pain: 0.1
      sadness: 0.0
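
The value ranges documented in the comments of config_james.yml can be checked on the client side before a request is sent. The following is a minimal sketch, assuming the configuration has already been loaded into plain Python dictionaries. The function names are hypothetical and not part of any A2F-3D API, and reporting unknown parameter names is a choice of this sketch (it helps catch snake_case/camelCase mix-ups).

```python
# Ranges copied from the comments in config_james.yml: name -> (min, max).
FACE_PARAMETER_RANGES = {
    "upperFaceStrength": (0.0, 2.0),
    "upperFaceSmoothing": (0.0, 0.1),
    "lowerFaceStrength": (0.0, 2.0),
    "lowerFaceSmoothing": (0.0, 0.1),
    "faceMaskLevel": (0.0, 1.0),
    "faceMaskSoftness": (0.001, 0.5),
    "skinStrength": (0.0, 2.0),
    "eyelidOpenOffset": (-1.0, 1.0),
    "lipOpenOffset": (-0.02, 0.2),
}


def validate_face_parameters(params):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    for name, value in params.items():
        if name not in FACE_PARAMETER_RANGES:
            errors.append(f"unknown face parameter: {name}")
            continue
        low, high = FACE_PARAMETER_RANGES[name]
        if not low <= value <= high:
            errors.append(f"{name}={value} is outside [{low}, {high}]")
    return errors


def validate_timed_emotions(entries):
    """Check emotion weights (0.0 to 1.0) and timecodes (>= 0.0)."""
    errors = []
    for key, entry in entries.items():
        if entry["time_code"] < 0.0:
            errors.append(f"{key}: time_code must be >= 0.0")
        for emotion, weight in entry["emotions"].items():
            if not 0.0 <= weight <= 1.0:
                errors.append(f"{key}: {emotion}={weight} is outside [0.0, 1.0]")
    return errors
```

With the values from the sample file above, both functions return an empty list.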

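Currently, only mono 16-bit PCM audio is supported. Arbitrary sample rates are accepted, but 16 kHz is recommended for best performance. The minimum and maximum accepted sample rates are defined in the microservice's configuration file, but good results are not guaranteed outside the default values.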
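
If your source audio is stored as a WAV file, the format constraint above (mono, 16-bit PCM) can be checked with Python's standard `wave` module before streaming. This is a sketch under that assumption. `describe_wav`, `is_supported`, and `make_silence` are hypothetical helper names, and the sample-rate limits are deliberately not checked because they depend on the microservice configuration.

```python
import io
import wave


def describe_wav(data: bytes):
    """Return (channels, bits_per_sample, sample_rate_hz) of a WAV byte string."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()


def is_supported(data: bytes) -> bool:
    """True if the WAV data is mono and 16-bit, as the text above requires."""
    channels, bits, _rate = describe_wav(data)
    return channels == 1 and bits == 16


def make_silence(sample_rate: int, channels: int = 1,
                 sample_width: int = 2, seconds: float = 0.1) -> bytes:
    """Build a short silent WAV in memory, useful for trying the checks."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        frames = int(sample_rate * seconds)
        wav.writeframes(b"\x00" * frames * channels * sample_width)
    return buf.getvalue()
```

For example, `is_supported(make_silence(16000))` is true, while a stereo clip built with `make_silence(16000, channels=2)` is rejected.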

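Output

The A2F-3D microservice outputs the following over the bidirectional streaming gRPC:

  1. An animation header, which carries information about the blendshape names, the audio output format, and so on.

  2. Blendshape data with timecodes, synchronized with the audio data.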
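
On the receiving side, the header-then-frames layout above can be consumed as sketched below. `AnimationHeader`, `AnimationFrame`, and their fields are hypothetical stand-ins rather than the real protobuf messages, and the sketch assumes that each frame lists its values in the same order as the names announced in the header.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass
class AnimationHeader:
    """Hypothetical stand-in for the animation header message."""
    blendshape_names: List[str]


@dataclass
class AnimationFrame:
    """Hypothetical stand-in for one timed blendshape frame."""
    time_code: float
    values: List[float]


def named_frames(stream: Iterable) -> Iterator[Tuple[float, Dict[str, float]]]:
    """Read the header once, then pair every frame's values with the names."""
    messages = iter(stream)
    header = next(messages, None)
    if not isinstance(header, AnimationHeader):
        raise ValueError("expected the animation header as the first message")
    for frame in messages:
        # Assumption: values are ordered like header.blendshape_names.
        yield frame.time_code, dict(zip(header.blendshape_names, frame.values))
```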

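For a detailed description of the gRPC protocol buffers, see the gRPC Protocol Buffers section.

Frame Rate

The Audio2Face-3D microservice performs 30 inferences per second of audio.

This means that the output data of Audio2Face-3D must be played back at 30 FPS.

However, Audio2Face-3D is not limited to computing 30 inferences per second. For example, if the Audio2Face-3D log reports 300 FPS for a stream, then 10 seconds of audio (300 frames / 30 FPS) are processed every second.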
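
The arithmetic in this example can be written out as two small helpers. The function names are hypothetical; the only fact taken from the text is the fixed rate of 30 inferences per second of audio.

```python
INFERENCES_PER_AUDIO_SECOND = 30  # fixed rate stated in the text above


def audio_seconds_per_wall_second(logged_fps: float) -> float:
    """Seconds of audio processed per second of real time."""
    return logged_fps / INFERENCES_PER_AUDIO_SECOND


def processing_time_seconds(clip_seconds: float, logged_fps: float) -> float:
    """Approximate time needed to process a clip at the logged FPS."""
    return clip_seconds * INFERENCES_PER_AUDIO_SECOND / logged_fps
```

At a logged 300 FPS, `audio_seconds_per_wall_second(300)` gives 10.0, so a 60-second clip would take about 6 seconds to process.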

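Therefore, when you receive data from Audio2Face-3D, you need to buffer it and play it back at 30 FPS. The high output rate helps prevent stutter caused by network instability.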
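
One way to buffer faster-than-real-time output and release it at 30 FPS is sketched below. `PlaybackBuffer` is a hypothetical helper, not part of Audio2Face-3D. It assumes that frame k is due k/30 seconds after playback starts, and the caller supplies the elapsed playback time (for example, from `time.monotonic()`).

```python
from collections import deque

PLAYBACK_FPS = 30  # playback rate required by the text above


class PlaybackBuffer:
    """Queue incoming frames and hand them out on a 30 FPS schedule."""

    def __init__(self):
        self._frames = deque()
        self._played = 0  # number of frames already released

    def push(self, frame):
        """Store a frame as soon as it arrives from the stream."""
        self._frames.append(frame)

    def pop_due(self, elapsed_seconds: float):
        """Return the frames whose playback slot has arrived by now."""
        # Frame 0 is due at t=0, so floor(t * 30) + 1 frames are due at time t.
        # The small epsilon guards against floating-point rounding.
        due_total = int(elapsed_seconds * PLAYBACK_FPS + 1e-9) + 1
        out = []
        while self._frames and self._played < due_total:
            out.append(self._frames.popleft())
            self._played += 1
        # If the buffer ran dry, late frames are released as soon as they arrive.
        return out
```

A render loop would call `pop_due` once per tick and apply the last returned frame to the character.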

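Currently, Audio2Face-3D prints the FPS log to stdout twice per second.

Note

Although a higher streaming rate may introduce a brief initial delay because of the extra buffering required, this delay does not persist during playback.

Blendshapes

Audio2Face-3D outputs blendshapes. For more information, see the ARKit blendShapes documentation.

Audio2Face-3D does not animate head, tongue, or eye movement.

The following blendshape values are always 0 in the Audio2Face-3D output: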

  • EyeLookDownRight

  • EyeLookInRight

  • EyeLookOutRight

  • EyeLookUpRight

  • EyeLookDownLeft

  • EyeLookInLeft

  • EyeLookOutLeft

  • EyeLookUpLeft

  • TongueOut

  • HeadRoll

  • HeadPitch

  • HeadYaw
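
Because these channels never carry motion, a client may choose to drop them before driving a rig, or to fill them in from another source such as a separate eye or head animation system. A minimal sketch, assuming a frame is represented as a name-to-weight dictionary; `ALWAYS_ZERO` and `drop_always_zero` are hypothetical names:

```python
# Blendshapes that the list above documents as always 0 in the output.
ALWAYS_ZERO = frozenset({
    "EyeLookDownRight", "EyeLookInRight", "EyeLookOutRight", "EyeLookUpRight",
    "EyeLookDownLeft", "EyeLookInLeft", "EyeLookOutLeft", "EyeLookUpLeft",
    "TongueOut", "HeadRoll", "HeadPitch", "HeadYaw",
})


def drop_always_zero(frame):
    """Return a copy of the frame without the channels that are always 0."""
    return {name: weight for name, weight in frame.items()
            if name not in ALWAYS_ZERO}
```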

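Note

The definition of the MouthClose blendshape deviates from the standard ARKit version. The shape includes the opening of the jaw.

Legacy Communication

Note

Although this type of communication is supported, we recommend that first-time users of Audio2Face-3D use the bidirectional endpoint.

In this mode, the bidirectional streaming gRPC is split into two client-streaming gRPCs, with Audio2Face-3D acting as both server and client.

Input

The A2F-3D microservice receives its data through a client-streaming RPC. The data consists of:

  1. An audio stream header, which carries information about the upcoming audio data along with face parameters, post-processing options, blendshape parameters, and emotion parameters.

  2. Audio data, together with emotion data with timecodes that mark when each emotion starts to be applied.

Output

The A2F-3D microservice outputs its data as a client-streaming RPC. The output data consists of:

  1. An animation header, which carries information about the blendshape names, the audio output format, and so on.

  2. Blendshape data with timecodes, synchronized with the audio data.

For a detailed description of the gRPC protocol buffers, see the gRPC Protocol Buffers section.

Configuration

See the dedicated Configuration page.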