Configuration models

DataConfig

Bases: BaseModel, Generic[DataModuleT], ABC

Base class for all data configurations.

This class defines the interface for all data configurations. It is used to define the data module that will be used in the training loop.

Source code in bionemo/llm/run/config_models.py, lines 64-96
class DataConfig(BaseModel, Generic[DataModuleT], ABC):
    """Base class for all data configurations.

    This class is used to define the interface for all data configurations. It is used to define the data module that
    will be used in the training loop.
    """

    micro_batch_size: int = 8
    result_dir: str | pathlib.Path = "./results"
    num_dataset_workers: int = 0
    seq_length: int = 128

    @field_serializer("result_dir")
    def serialize_paths(self, value: pathlib.Path) -> str:  # noqa: D102
        return serialize_path_or_str(value)

    @field_validator("result_dir")
    def deserialize_paths(cls, value: str) -> pathlib.Path:  # noqa: D102
        return deserialize_str_to_path(value)

    @abstractmethod
    def construct_data_module(self, global_batch_size: int) -> DataModuleT:
        """Construct the data module from the configuration. Cannot be defined generically."""
        ...

    def custom_model_validator(self, global_cfg: "MainConfig") -> "MainConfig":
        """Use custom implementation of this method to define the things inside global_config.

        The following expression will always be true:

        global_cfg.data_config == self
        """
        return global_cfg
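
As a hedged illustration of how this interface is typically implemented, the sketch below defines a concrete subclass. MyDataModule, its constructor arguments, and the import path are assumptions for illustration only, not part of this API.

# Hypothetical sketch of a concrete DataConfig. MyDataModule stands in for a
# real Lightning-style data module and is not part of this module.
from bionemo.llm.run.config_models import DataConfig


class MyDataModule:  # placeholder data module for the sketch
    def __init__(self, **kwargs):
        self.kwargs = kwargs


class MyDataConfig(DataConfig):  # in real code you would parametrize DataConfig[YourDataModule]
    train_path: str = "/data/train.csv"  # illustrative extra field

    def construct_data_module(self, global_batch_size: int) -> MyDataModule:
        # Build the data module from the validated fields of this config.
        return MyDataModule(
            train_path=self.train_path,
            micro_batch_size=self.micro_batch_size,
            global_batch_size=global_batch_size,
            seq_length=self.seq_length,
            num_workers=self.num_dataset_workers,
        )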

construct_data_module(global_batch_size) abstractmethod

Construct the data module from the configuration. This cannot be defined generically.

Source code in bionemo/llm/run/config_models.py, lines 84-87
@abstractmethod
def construct_data_module(self, global_batch_size: int) -> DataModuleT:
    """Construct the data module from the configuration. Cannot be defined generically."""
    ...

custom_model_validator(global_cfg)

Override this method with a custom implementation to define the contents of global_config.

The following expression will always be true:

global_cfg.data_config == self

Source code in bionemo/llm/run/config_models.py, lines 89-96
def custom_model_validator(self, global_cfg: "MainConfig") -> "MainConfig":
    """Use custom implementation of this method to define the things inside global_config.

    The following expression will always be true:

    global_cfg.data_config == self
    """
    return global_cfg
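
As a hedged sketch, a subclass might use this hook for cross-config checks, for example keeping the model's seq_length consistent with the data config. The field names are taken from the classes documented on this page; the check itself is purely illustrative.

# Hypothetical sketch: a cross-config check inside a DataConfig subclass.
# At this point global_cfg.data_config is self, so both halves are available.
def custom_model_validator(self, global_cfg: "MainConfig") -> "MainConfig":
    if global_cfg.bionemo_model_config.seq_length != self.seq_length:
        raise ValueError(
            "bionemo_model_config.seq_length must match data_config.seq_length"
        )
    return global_cfg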

ExperimentConfig

Bases: BaseModel

Configuration class for setting up and managing experiment parameters.

Attributes:

- save_every_n_steps (int): Number of steps between saving checkpoints.
- result_dir (str | Path): Directory where results will be saved.
- experiment_name (str): Name of the experiment.
- restore_from_checkpoint_path (Optional[str]): Path to restore from a checkpoint. Note: this does not invoke the checkpoint callback as expected.
- save_last_checkpoint (bool): Flag to save the last checkpoint. Default is True.
- metric_to_monitor_for_checkpoints (str): Metric to monitor for saving top-k checkpoints. Default is "reduced_train_loss".
- save_top_k (int): Number of top checkpoints to save based on the monitored metric. Default is 2.
- create_tensorboard_logger (bool): Flag to create a TensorBoard logger. Default is False.
- create_checkpoint_callback (bool): Flag to create a ModelCheckpoint callback. Default is True.

Source code in bionemo/llm/run/config_models.py, lines 344-376
class ExperimentConfig(BaseModel):
    """Configuration class for setting up and managing experiment parameters.

    Attributes:
        save_every_n_steps (int): Number of steps between saving checkpoints.
        result_dir (str | pathlib.Path): Directory where results will be saved.
        experiment_name (str): Name of the experiment.
        restore_from_checkpoint_path (Optional[str]): Path to restore from a checkpoint. Note: This does not invoke the checkpoint callback as expected.
        save_last_checkpoint (bool): Flag to save the last checkpoint. Default is True.
        metric_to_monitor_for_checkpoints (str): Metric to monitor for saving top-k checkpoints. Default is "reduced_train_loss".
        save_top_k (int): Number of top checkpoints to save based on the monitored metric. Default is 2.
        create_tensorboard_logger (bool): Flag to create a TensorBoard logger. Default is False.
        create_checkpoint_callback (bool): Flag to create a ModelCheckpoint callback
    """

    save_every_n_steps: int
    result_dir: str | pathlib.Path
    experiment_name: str
    # NOTE: restore_from_checkpoint_path does not invoke the checkpoint callback in the way we'd like. Avoid using.
    restore_from_checkpoint_path: Optional[str]
    save_last_checkpoint: bool = True
    metric_to_monitor_for_checkpoints: str = "reduced_train_loss"
    save_top_k: int = 2
    create_tensorboard_logger: bool = False
    create_checkpoint_callback: bool = True

    @field_serializer("result_dir")
    def serialize_paths(self, value: pathlib.Path) -> str:  # noqa: D102
        return serialize_path_or_str(value)

    @field_validator("result_dir")
    def deserialize_paths(cls, value: str) -> pathlib.Path:  # noqa: D102
        return deserialize_str_to_path(value)
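
A minimal instantiation sketch; the values are illustrative and the import path is assumed from the source location above.

from bionemo.llm.run.config_models import ExperimentConfig

exp_cfg = ExperimentConfig(
    save_every_n_steps=100,
    result_dir="./results/demo_run",      # stored as pathlib.Path via deserialize_paths
    experiment_name="demo_experiment",
    restore_from_checkpoint_path=None,
)
assert exp_cfg.save_top_k == 2            # default
dumped = exp_cfg.model_dump()             # result_dir serialized back to a plain string
assert isinstance(dumped["result_dir"], str)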

ExposedModelConfig

Bases: BaseModel, Generic[ModelConfigT], ABC

BioNeMo model configuration class; wraps TransformerConfig and friends.

This class is used to define the interface for all model configurations. It is Exposed to guard against ill-typed or poorly defined fields in the underlying configuration objects. ModelConfigT declares the associated type of the underlying config (most commonly a BioBertGenericConfig, but it could also be a TransformerConfig or something similar). Children should try to expose the minimal set of fields necessary for users to configure the model, while keeping the more esoteric configuration private to the underlying ModelConfigT.

Source code in bionemo/llm/run/config_models.py, lines 99-258
class ExposedModelConfig(BaseModel, Generic[ModelConfigT], ABC):
    """BioNeMo model configuration class, wraps TransformerConfig and friends.

    This class is used to define the interface for all model configurations. It is **Exposed** to guard against ill-typed
    or poorly defined fields in the underlying configuration objects. `ModelConfigT` declares the associated type of the
    underlying config (most commonly a BioBertGenericConfig, but could also be a TransformerConfig or something similar).
    Children should try to expose the minimal set of fields necessary for the user to configure the model while keeping
    the more esoteric configuration private to the underlying ModelConfigT.
    """

    # Restores weights from a pretrained checkpoint
    initial_ckpt_path: Optional[str] = None
    # Does not attempt to load keys with these prefixes (useful if you attached extra parameters and still want to load a set of weights)
    initial_ckpt_skip_keys_with_these_prefixes: List[str] = field(default_factory=list)

    # Pydantic stuff to allow arbitrary types + validators + serializers
    class Config:  # noqa: D106
        arbitrary_types_allowed = True

    def model_class(self) -> Type[ModelConfigT]:
        """Returns the underlying model class that this config wraps."""
        raise NotImplementedError

    def custom_model_validator(self, global_cfg: "MainConfig") -> "MainConfig":
        """Use custom implementation of this method to define the things inside global_config.

        The following expression will always be true:

        global_cfg.bionemo_model_config == self
        """
        return global_cfg

    def exposed_to_internal_bionemo_model_config(self) -> ModelConfigT:
        """Converts the exposed dataclass to the underlying Transformer config.

        The underlying ModelConfigT may both be incomplete and unserializable. We use this transformation as a way to
        hide fields that are either not serializable by Pydantic or that we do not want to expose.
        """
        cls: Type[ModelConfigT] = self.model_class()
        model_dict = {}
        for attr in self.model_fields:
            if attr not in model_dict and attr in cls.__dataclass_fields__:
                model_dict[attr] = getattr(self, attr)

        # Now set fp16 and bf16 based on the precision for the underlying TransformerConfig=>ParallelConfig
        #   the only constraint is that both must not be true.
        model_dict["bf16"] = self.pipeline_dtype == dtypes.precision_to_dtype["bf16-mixed"]
        model_dict["fp16"] = self.pipeline_dtype == dtypes.precision_to_dtype["16-mixed"]
        result = cls(**model_dict)

        return result

    # NOTE: See PrecisionTypes for a list of valid literals that may be deserialized.
    params_dtype: torch.dtype
    pipeline_dtype: torch.dtype
    autocast_dtype: torch.dtype

    num_layers: int = 6
    hidden_size: int = 256
    ffn_hidden_size: int = 512
    num_attention_heads: int = 4
    seq_length: int = 512
    fp32_residual_connection: bool = False
    hidden_dropout: float = 0.02
    init_method_std: float = 0.02
    kv_channels: Optional[int] = None
    apply_query_key_layer_scaling: bool = False
    make_vocab_size_divisible_by: int = 128
    masked_softmax_fusion: bool = True
    fp16_lm_cross_entropy: bool = False
    gradient_accumulation_fusion: bool = False
    layernorm_zero_centered_gamma: bool = False
    layernorm_epsilon: float = 1.0e-12
    activation_func: Callable[[torch.Tensor, Any], torch.Tensor] = F.gelu
    qk_layernorm: bool = False
    apply_residual_connection_post_layernorm: bool = False
    bias_activation_fusion: bool = True
    bias_dropout_fusion: bool = True
    get_attention_mask_from_fusion: bool = False
    attention_dropout: float = 0.1
    share_embeddings_and_output_weights: bool = True
    enable_autocast: bool = False
    nemo1_ckpt_path: Optional[str] = None
    biobert_spec_option: BiobertSpecOption = BiobertSpecOption.bert_layer_with_transformer_engine_spec

    @field_serializer("biobert_spec_option")
    def serialize_spec_option(self, value: BiobertSpecOption) -> str:  # noqa: D102
        return value.value

    @field_validator("biobert_spec_option", mode="before")
    def deserialize_spec_option(cls, value: str) -> BiobertSpecOption:  # noqa: D102
        return BiobertSpecOption(value)

    @field_validator("activation_func", mode="before")
    @classmethod
    def validate_activation_func(cls, activation_func: str) -> Callable:
        """Validates the activation function, assumes this function exists in torch.nn.functional.

        For custom activation functions, use the CUSTOM_ACTIVATION_FUNCTIONS dictionary in the module. This method
        validates the provided activation function string and returns a callable function based on the validation
        context using the provided validator in the base class.

        Args:
            activation_func (str): The activation function to be validated.
            context (ValidationInfo): The context for validation.

        Returns:
            Callable: A callable function after validation.

        See Also:
            CUSTOM_ACTIVATION_FNS
        """
        func = getattr(torch.nn.functional, activation_func.lower(), None)
        if func is None and activation_func in CUSTOM_ACTIVATION_FNS:
            func = CUSTOM_ACTIVATION_FNS[activation_func]
            return func
        elif func is None:
            raise ValueError(
                f"activation_func must be a valid function in `torch.nn.functional`, got {activation_func=}"
            )
        else:
            return func

    @field_serializer("activation_func")
    def serialize_activation_func(self, v: Callable[[torch.Tensor, Any], torch.Tensor]) -> str:
        """Serializes a given activation function to its corresponding string representation.

        By default, all activation functions from `torch.nn.functional` are serialized to their name. User defined
        activation functions should also be defined here with a custom mapping in CUSTOM_ACTIVATION_FNS defined at the
        top of this file. This allows our Pydantic model to serialize and deserialize the activation function.

        Args:
            v (Callable[[torch.Tensor, Any], torch.Tensor]): The activation function to serialize.

        Returns:
            str: The name of the activation function if it is a standard PyTorch function,
                 or the corresponding serialization key if it is a custom activation function.

        Raises:
            ValueError: If the activation function is not supported.
        """
        func_name = v.__name__
        func = getattr(torch.nn.functional, func_name, None)
        if func is not None:
            return func_name
        elif func in REVERSE_CUSTOM_ACTIVATION_FNS:
            return REVERSE_CUSTOM_ACTIVATION_FNS[func]  # Get the serialization key
        else:
            raise ValueError(f"Unsupported activation function: {v}")

    @field_validator("params_dtype", "pipeline_dtype", "autocast_dtype", mode="before")
    @classmethod
    def precision_validator(cls, v: dtypes.PrecisionTypes) -> torch.dtype:
        """Validates the precision type and returns the corresponding torch dtype."""
        return dtypes.get_autocast_dtype(v)

    @field_serializer("params_dtype", "pipeline_dtype", "autocast_dtype")
    def serialize_dtypes(self, v: torch.dtype) -> dtypes.PrecisionTypes:
        """Serializes the torch dtype to the corresponding precision type."""
        return dtypes.dtype_to_precision[v]
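
As a hedged sketch of the subclassing pattern, the toy dataclass below stands in for the real Megatron/BioBERT config a subclass would normally wrap; it is not a BioNeMo class. Only the fields it shares with the exposed config (plus bf16/fp16) are transferred by exposed_to_internal_bionemo_model_config.

# Hypothetical sketch; ToyInternalConfig is NOT a real BioNeMo/Megatron config.
from dataclasses import dataclass

from bionemo.llm.run.config_models import ExposedModelConfig


@dataclass
class ToyInternalConfig:
    num_layers: int = 6
    hidden_size: int = 256
    seq_length: int = 512
    bf16: bool = False
    fp16: bool = False


class ToyExposedConfig(ExposedModelConfig):  # real code would parametrize ExposedModelConfig[...]
    def model_class(self):
        return ToyInternalConfig


cfg = ToyExposedConfig(
    params_dtype="bf16-mixed",    # strings are turned into torch dtypes by precision_validator
    pipeline_dtype="bf16-mixed",
    autocast_dtype="bf16-mixed",
)
internal = cfg.exposed_to_internal_bionemo_model_config()
# internal is a ToyInternalConfig with the shared fields copied and bf16=True, fp16=False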

custom_model_validator(global_cfg)

Override this method with a custom implementation to define the contents of global_config.

The following expression will always be true:

global_cfg.bionemo_model_config == self

Source code in bionemo/llm/run/config_models.py, lines 122-129
def custom_model_validator(self, global_cfg: "MainConfig") -> "MainConfig":
    """Use custom implementation of this method to define the things inside global_config.

    The following expression will always be true:

    global_cfg.bionemo_model_config == self
    """
    return global_cfg

exposed_to_internal_bionemo_model_config()

Converts the exposed dataclass to the underlying Transformer config.

The underlying ModelConfigT may be both incomplete and unserializable. This transformation is used to hide fields that are either not serializable by Pydantic or that we do not want to expose.

Source code in bionemo/llm/run/config_models.py, lines 131-149
def exposed_to_internal_bionemo_model_config(self) -> ModelConfigT:
    """Converts the exposed dataclass to the underlying Transformer config.

    The underlying ModelConfigT may both be incomplete and unserializable. We use this transformation as a way to
    hide fields that are either not serializable by Pydantic or that we do not want to expose.
    """
    cls: Type[ModelConfigT] = self.model_class()
    model_dict = {}
    for attr in self.model_fields:
        if attr not in model_dict and attr in cls.__dataclass_fields__:
            model_dict[attr] = getattr(self, attr)

    # Now set fp16 and bf16 based on the precision for the underlying TransformerConfig=>ParallelConfig
    #   the only constraint is that both must not be true.
    model_dict["bf16"] = self.pipeline_dtype == dtypes.precision_to_dtype["bf16-mixed"]
    model_dict["fp16"] = self.pipeline_dtype == dtypes.precision_to_dtype["16-mixed"]
    result = cls(**model_dict)

    return result

model_class()

Returns the underlying model class that this config wraps.

Source code in bionemo/llm/run/config_models.py, lines 118-120
def model_class(self) -> Type[ModelConfigT]:
    """Returns the underlying model class that this config wraps."""
    raise NotImplementedError

precision_validator(v) classmethod

Validates the precision type and returns the corresponding torch dtype.

Source code in bionemo/llm/run/config_models.py, lines 249-253
@field_validator("params_dtype", "pipeline_dtype", "autocast_dtype", mode="before")
@classmethod
def precision_validator(cls, v: dtypes.PrecisionTypes) -> torch.dtype:
    """Validates the precision type and returns the corresponding torch dtype."""
    return dtypes.get_autocast_dtype(v)

serialize_activation_func(v)

Serializes a given activation function to its corresponding string representation.

By default, all activation functions from torch.nn.functional are serialized to their name. User-defined activation functions should also be registered here via a custom mapping in CUSTOM_ACTIVATION_FNS, defined at the top of this file. This allows the Pydantic model to serialize and deserialize the activation function.

Parameters:

- v (Callable[[Tensor, Any], Tensor]): The activation function to serialize. Required.

Returns:

- str: The name of the activation function if it is a standard PyTorch function, or the corresponding serialization key if it is a custom activation function.

Raises:

- ValueError: If the activation function is not supported.

Source code in bionemo/llm/run/config_models.py, lines 222-247
@field_serializer("activation_func")
def serialize_activation_func(self, v: Callable[[torch.Tensor, Any], torch.Tensor]) -> str:
    """Serializes a given activation function to its corresponding string representation.

    By default, all activation functions from `torch.nn.functional` are serialized to their name. User defined
    activation functions should also be defined here with a custom mapping in CUSTOM_ACTIVATION_FNS defined at the
    top of this file. This allows our Pydantic model to serialize and deserialize the activation function.

    Args:
        v (Callable[[torch.Tensor, Any], torch.Tensor]): The activation function to serialize.

    Returns:
        str: The name of the activation function if it is a standard PyTorch function,
             or the corresponding serialization key if it is a custom activation function.

    Raises:
        ValueError: If the activation function is not supported.
    """
    func_name = v.__name__
    func = getattr(torch.nn.functional, func_name, None)
    if func is not None:
        return func_name
    elif func in REVERSE_CUSTOM_ACTIVATION_FNS:
        return REVERSE_CUSTOM_ACTIVATION_FNS[func]  # Get the serialization key
    else:
        raise ValueError(f"Unsupported activation function: {v}")

serialize_dtypes(v)

Serializes the torch dtype to the corresponding precision type.

Source code in bionemo/llm/run/config_models.py, lines 255-258
@field_serializer("params_dtype", "pipeline_dtype", "autocast_dtype")
def serialize_dtypes(self, v: torch.dtype) -> dtypes.PrecisionTypes:
    """Serializes the torch dtype to the corresponding precision type."""
    return dtypes.dtype_to_precision[v]

validate_activation_func(activation_func) classmethod

Validates the activation function, assuming the function exists in torch.nn.functional.

For custom activation functions, use the CUSTOM_ACTIVATION_FUNCTIONS dictionary in the module. This method validates the provided activation-function string and returns a callable, based on the validation context, using the validator provided in the base class.

Parameters:

- activation_func (str): The activation function to be validated. Required.
- context (ValidationInfo): The context for validation. Required.

Returns:

- Callable: A callable function after validation.

See also:

- CUSTOM_ACTIVATION_FNS

Source code in bionemo/llm/run/config_models.py, lines 192-220
@field_validator("activation_func", mode="before")
@classmethod
def validate_activation_func(cls, activation_func: str) -> Callable:
    """Validates the activation function, assumes this function exists in torch.nn.functional.

    For custom activation functions, use the CUSTOM_ACTIVATION_FUNCTIONS dictionary in the module. This method
    validates the provided activation function string and returns a callable function based on the validation
    context using the provided validator in the base class.

    Args:
        activation_func (str): The activation function to be validated.
        context (ValidationInfo): The context for validation.

    Returns:
        Callable: A callable function after validation.

    See Also:
        CUSTOM_ACTIVATION_FNS
    """
    func = getattr(torch.nn.functional, activation_func.lower(), None)
    if func is None and activation_func in CUSTOM_ACTIVATION_FNS:
        func = CUSTOM_ACTIVATION_FNS[activation_func]
        return func
    elif func is None:
        raise ValueError(
            f"activation_func must be a valid function in `torch.nn.functional`, got {activation_func=}"
        )
    else:
        return func
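
To make the string-to-callable round trip concrete, here is a minimal standalone Pydantic model that uses the same validator/serializer pattern. It is a sketch of the pattern only, not BioNeMo code.

# Standalone sketch of the activation_func validate/serialize pattern.
from typing import Any, Callable

import torch
import torch.nn.functional as F
from pydantic import BaseModel, field_serializer, field_validator


class ActivationDemo(BaseModel):
    activation_func: Callable[..., torch.Tensor] = F.gelu

    @field_validator("activation_func", mode="before")
    @classmethod
    def _resolve(cls, v: Any) -> Callable:
        if callable(v):
            return v
        func = getattr(F, str(v).lower(), None)  # "gelu" -> F.gelu, "silu" -> F.silu, ...
        if func is None:
            raise ValueError(f"unknown activation function: {v}")
        return func

    @field_serializer("activation_func")
    def _dump(self, v: Callable) -> str:
        return v.__name__  # back to a plain string for YAML/JSON


demo = ActivationDemo(activation_func="silu")
assert demo.activation_func is F.silu
assert demo.model_dump()["activation_func"] == "silu"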

MainConfig

Bases: BaseModel, Generic[ExModelConfigT, DataConfigT]

Main configuration class for BioNeMo. Every serialized config that is a valid MainConfig should be runnable.

This class is used to define the main configuration for BioNeMo. It defines the minimal pieces of configuration needed to execute a training job with the NeMo2 training API. It accepts two generic type parameters, which users must define in their own execution environment.

Additionally, this class assumes that the ExposedModelConfig and DataConfig configurations may have custom validators implemented that operate on the entire MainConfig. This avoids the need for type-based conditionals inside this class, while still allowing custom global validation logic to be implemented in the underlying classes. For example, some models may want to restrict their data module's seq_length to a certain value.

Parameters:

- data_config: Generic config type that contains instructions on instantiating the required DataModule. Required.
- parallel_config: The parallel configuration for the model. Required.
- training_config: The training configuration for the model. Required.
- bionemo_model_config: Generic ExposedModelConfig type. This class hides extra configuration parameters in the underlying model configuration as well as providing. Required.
- optim_config: The optimizer/scheduler configuration for the model. Required.
- experiment_config: The experiment configuration for the model. Required.
- wandb_config: Optional; the wandb configuration for the model. Default is None.
Source code in bionemo/llm/run/config_models.py, lines 385-437
class MainConfig(BaseModel, Generic[ExModelConfigT, DataConfigT]):
    """Main configuration class for BioNeMo. All serialized configs that are a valid MainConfig should be Runnable.

    This class is used to define the main configuration for BioNeMo. It defines the minimal pieces of configuration
    to execution a training job with the NeMo2 training api. It accepts two generic type parameters which users
    must define in their own environment for execution.

    Additionally, this class assumes that the configs for ExposedModelConfig and DataConfig may have custom validators
    implemented that operate on the entire MainConfig. This prevents the need from type based conditionals inside this
    class while still allowing for custom validation global logic to be implemented in the underlying classes. For example,
    some models may want to restrict their Datamodules seq_length to a certain value.


    Args:
        data_config: Generic config type that contains instructions on instantiating the required DataModule.
        parallel_config: The parallel configuration for the model.
        training_config: The training configuration for the model.
        bionemo_model_config: Generic ExposedModelConfig type. This class hides extra configuration parameters in the
            underlying model configuration as well as providing
        optim_config: The optimizer/scheduler configuration for the model.
        experiment_config: The experiment configuration for the model.
        wandb_config: Optional, the wandb configuration for the model.
    """

    data_config: DataConfigT
    parallel_config: ParallelConfig
    training_config: TrainingConfig
    bionemo_model_config: ExModelConfigT
    optim_config: OptimizerSchedulerConfig
    experiment_config: ExperimentConfig
    wandb_config: Optional[WandbConfig] = None

    @model_validator(mode="after")
    def validate_master_config(self) -> "MainConfig":
        """Validates the master configuration object."""
        self.bionemo_model_config.seq_length = self.data_config.seq_length
        return self

    @model_validator(mode="after")
    def run_bionemo_model_config_model_validators(self) -> "MainConfig":
        """Runs the model validators on the bionemo_model_config."""
        return self.bionemo_model_config.custom_model_validator(self)

    @model_validator(mode="after")
    def run_data_config_model_validators(self) -> "MainConfig":
        """Runs the model validators on the data_config."""
        return self.data_config.custom_model_validator(self)

    @model_validator(mode="after")
    def validate_checkpointing_setting(self) -> "MainConfig":
        """Validates the master configuration object."""
        self.training_config.enable_checkpointing = self.experiment_config.create_checkpoint_callback
        return self
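
As a hedged end-to-end sketch, a serialized YAML file whose top-level keys mirror the fields above can be deserialized by parametrizing MainConfig with one's own subclasses. ToyExposedConfig and MyDataConfig below are the illustrative subclasses sketched earlier on this page, and the file name is made up.

# Hypothetical sketch: loading a MainConfig from YAML.
import yaml

from bionemo.llm.run.config_models import MainConfig

with open("main_config.yaml") as f:   # top-level keys: data_config, parallel_config, ...
    raw = yaml.safe_load(f)

main_cfg = MainConfig[ToyExposedConfig, MyDataConfig](**raw)

# The mode="after" validators have now run: seq_length was copied from
# data_config to the model config, enable_checkpointing was synced with
# create_checkpoint_callback, and both custom_model_validator hooks were applied.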

run_bionemo_model_config_model_validators()

Runs the model validators on the bionemo_model_config.

Source code in bionemo/llm/run/config_models.py, lines 423-426
@model_validator(mode="after")
def run_bionemo_model_config_model_validators(self) -> "MainConfig":
    """Runs the model validators on the bionemo_model_config."""
    return self.bionemo_model_config.custom_model_validator(self)

run_data_config_model_validators()

Runs the model validators on the data_config.

Source code in bionemo/llm/run/config_models.py, lines 428-431
@model_validator(mode="after")
def run_data_config_model_validators(self) -> "MainConfig":
    """Runs the model validators on the data_config."""
    return self.data_config.custom_model_validator(self)

validate_checkpointing_setting()

Validates the master configuration object.

Source code in bionemo/llm/run/config_models.py, lines 433-437
@model_validator(mode="after")
def validate_checkpointing_setting(self) -> "MainConfig":
    """Validates the master configuration object."""
    self.training_config.enable_checkpointing = self.experiment_config.create_checkpoint_callback
    return self

validate_master_config()

Validates the master configuration object.

Source code in bionemo/llm/run/config_models.py, lines 417-421
@model_validator(mode="after")
def validate_master_config(self) -> "MainConfig":
    """Validates the master configuration object."""
    self.bionemo_model_config.seq_length = self.data_config.seq_length
    return self

OptimizerSchedulerConfig

Bases: BaseModel

Configuration for the optimizer and learning rate scheduler.

Attributes:

- lr (float): Learning rate for the optimizer. Default is 1e-4.
- optimizer (str): Type of optimizer to use. Default is "adam".
- interval (str): Interval for updating the learning rate scheduler. Default is "step".
- monitor (str): Metric to monitor for learning rate adjustments. Default is "val_loss".
- warmup_steps (int): Number of warmup steps for use with the warmup-annealing learning rate scheduler. Default is 0.
- lr_scheduler (Literal['warmup_anneal', 'cosine']): Type of learning rate scheduler to use. Default is 'warmup_anneal'. Note: this is likely to change.
- max_steps (Optional[int]): max_steps used in the optimizer. Defaults to None, which uses max_steps from TrainingConfig.

Source code in bionemo/llm/run/config_models.py, lines 318-341
class OptimizerSchedulerConfig(BaseModel):
    """Configuration for the optimizer and learning rate scheduler.

    Attributes:
        lr (float): Learning rate for the optimizer. Default is 1e-4.
        optimizer (str): Type of optimizer to use. Default is "adam".
        interval (str): Interval for updating the learning rate scheduler. Default is "step".
        monitor (str): Metric to monitor for learning rate adjustments. Default is "val_loss".
        interval (str): Interval for updating the learning rate scheduler. Default is "step".
        monitor (str): Metric to monitor for learning rate adjustments. Default is "val_loss".
        warmup_steps (int): Number of warmup steps for use with the warmup annealing learning rate scheduler. Default is 0.
        lr_scheduler (Literal['warmup_anneal', 'cosine']): Type of learning rate scheduler to use. Default is 'warmup_anneal'. NOTE this is likely to change.
        max_steps (Optional[int]): max_steps used in optimizer. Default to None which uses max_steps from TrainingConfig.
    """

    lr: float = 1e-4
    optimizer: str = "adam"
    interval: str = "step"
    monitor: str = "val_loss"
    cosine_rampup_frac: float = 0.01
    cosine_hold_frac: float = 0.05
    warmup_steps: int = 0
    lr_scheduler: Literal["warmup_anneal", "cosine"] = "warmup_anneal"
    max_steps: Optional[int] = None
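
A short sketch of a cosine schedule. Note that cosine_rampup_frac and cosine_hold_frac appear in the fields but not in the docstring above; the comments below are an interpretation of their names, not documented behavior.

from bionemo.llm.run.config_models import OptimizerSchedulerConfig

optim_cfg = OptimizerSchedulerConfig(
    lr=4e-4,
    lr_scheduler="cosine",
    cosine_rampup_frac=0.01,  # presumably the fraction of steps spent ramping up
    cosine_hold_frac=0.05,    # presumably the fraction of steps held at peak lr
    max_steps=None,           # fall back to TrainingConfig.max_steps
)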

ParallelConfig

Bases: BaseModel

ParallelConfig is a configuration class for setting up parallelism in model training.

Attributes:

- tensor_model_parallel_size (int): The size of tensor model parallelism. Default is 1.
- pipeline_model_parallel_size (int): The size of pipeline model parallelism. Default is 1.
- accumulate_grad_batches (int): The number of batches to accumulate gradients over. Default is 1.
- ddp (Literal['megatron']): The distributed data parallel method to use. Default is "megatron".
- remove_unused_parameters (bool): Whether to remove unused parameters. Default is True.
- num_devices (int): The number of devices to use. Default is 1.
- num_nodes (int): The number of nodes to use. Default is 1.

Methods:

- validate_devices(): Validates the number of devices based on the tensor and pipeline model parallel sizes.

Source code in bionemo/llm/run/config_models.py, lines 261-290
class ParallelConfig(BaseModel):
    """ParallelConfig is a configuration class for setting up parallelism in model training.

    Attributes:
        tensor_model_parallel_size (int): The size of the tensor model parallelism. Default is 1.
        pipeline_model_parallel_size (int): The size of the pipeline model parallelism. Default is 1.
        accumulate_grad_batches (int): The number of batches to accumulate gradients over. Default is 1.
        ddp (Literal["megatron"]): The distributed data parallel method to use. Default is "megatron".
        remove_unused_parameters (bool): Whether to remove unused parameters. Default is True.
        num_devices (int): The number of devices to use. Default is 1.
        num_nodes (int): The number of nodes to use. Default is 1.

    Methods:
        validate_devices(): Validates the number of devices based on the tensor and pipeline model parallel sizes.
    """

    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1
    accumulate_grad_batches: int = 1
    ddp: Literal["megatron"] = "megatron"
    remove_unused_parameters: bool = True
    num_devices: int = 1
    num_nodes: int = 1

    @model_validator(mode="after")
    def validate_devices(self):
        """Validates the number of devices based on the tensor and pipeline model parallel sizes."""
        if self.num_devices < self.tensor_model_parallel_size * self.pipeline_model_parallel_size:
            raise ValueError("devices must be divisible by tensor_model_parallel_size * pipeline_model_parallel_size")
        return self

validate_devices()

Validates the number of devices based on the tensor and pipeline model parallel sizes.

Source code in bionemo/llm/run/config_models.py, lines 285-290
@model_validator(mode="after")
def validate_devices(self):
    """Validates the number of devices based on the tensor and pipeline model parallel sizes."""
    if self.num_devices < self.tensor_model_parallel_size * self.pipeline_model_parallel_size:
        raise ValueError("devices must be divisible by tensor_model_parallel_size * pipeline_model_parallel_size")
    return self
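
A quick sketch of the device-count check. When num_devices is smaller than the tensor-parallel times pipeline-parallel product, Pydantic raises a ValidationError (a ValueError subclass) wrapping the message from validate_devices.

from pydantic import ValidationError

from bionemo.llm.run.config_models import ParallelConfig

# 2 (TP) x 2 (PP) fits on 8 devices.
ok = ParallelConfig(tensor_model_parallel_size=2, pipeline_model_parallel_size=2, num_devices=8)

try:
    ParallelConfig(tensor_model_parallel_size=2, pipeline_model_parallel_size=2, num_devices=2)
except ValidationError as err:
    print(err)  # wraps the "devices must be divisible by ..." message raised above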

TrainingConfig

Bases: BaseModel

TrainingConfig is a configuration class for training models.

Attributes:

- max_steps (int): The maximum number of training steps.
- limit_val_batches (int | float): The number of validation batches to use. Can be a fraction or a count.
- val_check_interval (int): The interval (in steps) at which to run validation.
- precision (Literal['32', 'bf16-mixed', '16-mixed']): The precision to use for training. Default is "bf16-mixed".
- accelerator (str): The type of accelerator to use for training. Default is "gpu".
- gc_interval (int): The interval of global steps at which to run synchronized garbage collection. Useful for synchronizing garbage collection when performing distributed training. Default is 0.
- include_perplexity (bool): Whether to include perplexity in the validation logs. Default is False.
- enable_checkpointing (bool): Whether to enable checkpointing and configure a default ModelCheckpoint callback when there is no user-defined ModelCheckpoint. Corresponds to the parameter of the same name in pl.Trainer. Default is True.

Source code in bionemo/llm/run/config_models.py, lines 293-315
class TrainingConfig(BaseModel):
    """TrainingConfig is a configuration class for training models.

    Attributes:
        max_steps (int): The maximum number of training steps.
        limit_val_batches (int | float): The number of validation batches to use. Can be a fraction or a count.
        val_check_interval (int): The interval (in steps) at which to check validation.
        precision (Literal["32", "bf16-mixed", "16-mixed"], optional): The precision to use for training. Defaults to "bf16-mixed".
        accelerator (str, optional): The type of accelerator to use for training. Defaults to "gpu".
        gc_interval (int, optional): The interval of global steps at which to run synchronized garbage collection. Useful for synchronizing garbage collection when performing distributed training. Defaults to 0.
        include_perplexity (bool, optional): Whether to include perplexity in the validation logs. Defaults to False.
        enable_checkpointing (bool, optional): Whether to enable checkpointing and configure a default ModelCheckpoint callback if there is no user-defined ModelCheckpoint. Corresponds to the same parameter name in pl.Trainer
    """

    max_steps: int
    limit_val_batches: int | float  # Because this can be a fraction or a count...
    val_check_interval: int
    precision: Literal["32", "bf16-mixed", "16-mixed"] = "bf16-mixed"
    accelerator: str = "gpu"
    # NOTE: VERY important for distributed training performance.
    gc_interval: int = 0
    include_perplexity: bool = False
    enable_checkpointing: bool = True
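
A minimal sketch; only the three fields without defaults are required, and the values are illustrative.

from bionemo.llm.run.config_models import TrainingConfig

training_cfg = TrainingConfig(
    max_steps=50_000,
    limit_val_batches=1.0,   # a float is a fraction of the validation set, an int a batch count
    val_check_interval=500,
    precision="bf16-mixed",  # default, shown for clarity
)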

deserialize_str_to_path(path)

General-purpose deserialization for string/path objects. Since YAML has no native representation for pathlib.Path, paths are serialized to strings. Import this method as a @field_validator.

Source code in bionemo/llm/run/config_models.py, lines 49-51
def deserialize_str_to_path(path: str) -> pathlib.Path:
    """General purpose deserialize for string/path objects. Since YAML has no native representation for pathlib.Path, we serialize to strings. Import this method as a @field_validator."""
    return pathlib.Path(path)

serialize_path_or_str(path)

General-purpose serialization for string/path objects. Since YAML has no native representation for pathlib.Path, paths are serialized to strings. Import this method as a @field_serializer.

Source code in bionemo/llm/run/config_models.py, lines 54-61
def serialize_path_or_str(path: str | pathlib.Path) -> str:
    """General purpose serialization for string/path objects. Since YAML has no native representation for pathlib.Path, we serialize to strings. Import this method as a @field_serializer."""
    if isinstance(path, pathlib.Path):
        return str(path)
    elif isinstance(path, str):
        return path
    else:
        raise ValueError(f"Expected str or pathlib.Path, got {type(path)}")