Remote

FTPRemoteResource dataclass

Bases: RemoteResource

Source code in bionemo/llm/utils/remote.py
class FTPRemoteResource(RemoteResource):  # noqa: D101
    def download_resource(self, overwrite=False) -> str:
        """Downloads the resource to its specified fully_qualified_dest name.

        Returns: the fully qualified destination filename.
        """
        self.exists_or_create_destination_directory()

        if not self.check_exists() or overwrite:
            request.urlretrieve(self.url, self.fully_qualified_dest_filename)

        self.check_exists()
        return self.fully_qualified_dest_filename

download_resource(overwrite=False)

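Downloads the resource to the path given by fully_qualified_dest_filename. The download is skipped when the file already exists and passes check_exists(), unless overwrite is True.

Returns: the fully qualified destination filename.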
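
The download-if-missing pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the helper name `download_if_missing` is hypothetical, the existence test is a plain `os.path.exists` rather than a checksum comparison, and the demo uses a local `file://` URL so that it runs without network access.

```python
import os
import tempfile
from pathlib import Path
from urllib import request


def download_if_missing(url: str, dest: str, overwrite: bool = False) -> str:
    """Fetch `url` to `dest` unless the file is already present (hypothetical helper)."""
    # Make sure the destination folder exists before writing into it.
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if overwrite or not os.path.exists(dest):
        # urlretrieve accepts ftp://, http(s):// and file:// URLs.
        request.urlretrieve(url, dest)
    return dest


# Demo: copy a local file through a file:// URL into a not-yet-existing folder.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.txt"
    src.write_text("hello")
    dest = os.path.join(tmp, "cache", "copy.txt")
    result = download_if_missing(src.as_uri(), dest)
    copied = Path(result).read_text()
```

Calling the helper a second time with the same arguments returns immediately without fetching again; passing `overwrite=True` forces a fresh download.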

Source code in bionemo/llm/utils/remote.py
def download_resource(self, overwrite=False) -> str:
    """Downloads the resource to its specified fully_qualified_dest name.

    Returns: the fully qualified destination filename.
    """
    self.exists_or_create_destination_directory()

    if not self.check_exists() or overwrite:
        request.urlretrieve(self.url, self.fully_qualified_dest_filename)

    self.check_exists()
    return self.fully_qualified_dest_filename

RemoteResource dataclass

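Responsible for downloading remote files, along with optional processing of the downloaded files for downstream use cases.

Each object is invoked either through its constructor (which sets up the destination and checksum) or through a pre-configured class method. download_resource() contains the core functionality: downloading the file at url to the fully qualified filename. Class methods can be used to further configure this process.

Receives

A file, its checksum, a destination directory, and a root directory.

The dataclass then provides:
- fully qualified destination folder (property)
- fully qualified destination file (property)
- check_exists()
- download_resource()

The two properties form the fully qualified destination folder and the fully qualified path of the file. The remaining steps all live in the download routine: check that the fully qualified destination folder exists and create it otherwise, download the file, then checksum the download.

Postprocessing steps should be separate methods with their own configuration.

Example usage

The following will download and preprocess the prepackaged resources.

GRCh38Ensembl99ResourcePreparer().prepare()
Hg38chromResourcePreparer().prepare()
GRCh38p13_ResourcePreparer().prepare()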

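Attributes

Name Type Description
dest_directory str

The directory in which to place the file once the download completes. The file ends up at {dest_directory}/{dest_filename} under root_directory.

dest_filename str

The desired name of the file once the download completes.

checksum Optional[str]

Checksum associated with the file located at url. If set to None, check_exists only checks for the existence of {dest_directory}/{dest_filename}.

url Optional[str]

URL of the file to download.

root_directory str | PathLike

The base directory. The fully qualified path is formed by joining root_directory, dest_directory, and dest_filename.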
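
As a minimal sketch of how these attributes combine, the snippet below mirrors the two path properties using only the standard library. The function name `fully_qualified_path` and the directory values are illustrative assumptions, not part of the library.

```python
import os
from pathlib import Path


def fully_qualified_path(root_directory, dest_directory: str, dest_filename: str) -> str:
    """Join the three components the same way the two properties do (hypothetical helper)."""
    # Folder: root_directory / dest_directory
    folder = Path(root_directory) / dest_directory
    # File: folder joined with dest_filename; os.path.join returns a str here.
    return os.path.join(folder, dest_filename)


# Hypothetical values chosen only for illustration.
example = fully_qualified_path("/tmp/bionemo_cache", "genomes", "file.tar.gz")
```

Because the folder is built with `Path` and the file with `os.path.join`, the folder property yields a `Path` object while the filename property yields a plain string.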

Source code in bionemo/llm/utils/remote.py
@dataclass
class RemoteResource:
    """Responsible for downloading remote files, along with optional processing of downloaded files for downstream usecases.

    Each object is invoked through either its constructor (setting up the destination and checksum), or through a pre-configured class method.
    `download_resource()` contains the core functionality, which is to download the file at `url` to the fully qualified filename. Class methods
    can be used to further configure this process.

    Receive:
        a file, its checksum, a destination directory, and a root directory

        Our dataclass then provides some useful things:
            - fully qualified destination folder (property)
            - fully qualified destination file (property)
            - check_exists()
            - download_resource()

        Form the fully qualified destination folder.
        Create a fully qualified path for the file

        (all lives in the download routine)
        Check that the fq destination folder exists, otherwise create it
        Download the file.
        Checksum the download.
        Done.

        Postprocessing should be their own method with their own configuration.

    Example usage:
        >>> # The following will download and preprocess the prepackaged resources.
        >>> GRCh38Ensembl99ResourcePreparer().prepare()
        >>> Hg38chromResourcePreparer().prepare()
        >>> GRCh38p13_ResourcePreparer().prepare()


    Attributes:
        dest_directory: The directory to place the desired file upon completing the download. Should have the form {dest_directory}/{dest_filename}
        dest_filename: The desired name for the file upon completing the download.
        checksum: checksum associated with the file located at url. If set to None, check_exists only checks for the existence of `{dest_directory}/{dest_filename}`
        url: URL of the file to download
        root_directory: the bottom-level directory, the fully qualified path is formed by joining root_directory, dest_directory, and dest_filename.
    """

    checksum: Optional[str]
    dest_filename: str
    dest_directory: str
    root_directory: str | os.PathLike = BIONEMO_CACHE_DIR
    url: Optional[str] = None

    @property
    def fully_qualified_dest_folder(self):  # noqa: D102
        return Path(self.root_directory) / self.dest_directory

    @property
    def fully_qualified_dest_filename(self):
        """Returns the fully qualified destination path of the file.

        Example:
            /tmp/my_folder/file.tar.gz
        """
        return os.path.join(self.fully_qualified_dest_folder, self.dest_filename)

    def exists_or_create_destination_directory(self, exist_ok=True):
        """Checks that `fully_qualified_dest_folder` exists; if it does not, the directory is created (or fails).

        exist_ok: Passed to `os.makedirs`; if True, an already existing folder is not treated as an error.
        """
        os.makedirs(self.fully_qualified_dest_folder, exist_ok=exist_ok)

    @staticmethod
    def get_env_tmpdir():
        """Convenience method that exposes the environment TMPDIR variable."""
        return os.environ.get("TMPDIR", "/tmp")

    def download_resource(self, overwrite=False) -> str:
        """Downloads the resource to its specified fully_qualified_dest name.

        Returns: the fully qualified destination filename.
        """
        self.exists_or_create_destination_directory()

        if not self.check_exists() or overwrite:
            logging.info(f"Downloading resource: {self.url}")
            with requests.get(self.url, stream=True) as r, open(self.fully_qualified_dest_filename, "wb") as fd:
                r.raise_for_status()
                for bytes in r:
                    fd.write(bytes)
        else:
            logging.info(f"Resource already exists, skipping download: {self.url}")

        self.check_exists()
        return self.fully_qualified_dest_filename

    def check_exists(self):
        """Returns true if `fully_qualified_dest_filename` exists and the checksum matches `self.checksum`"""  # noqa: D415
        if os.path.exists(self.fully_qualified_dest_filename):
            with open(self.fully_qualified_dest_filename, "rb") as fd:
                data = fd.read()
                result = md5(data).hexdigest()
            if self.checksum is None:
                logging.info("No checksum provided, filename exists. Assuming it is complete.")
                matches = True
            else:
                matches = result == self.checksum
            return matches

        return False

fully_qualified_dest_filename property

Returns the fully qualified destination path of the file.

Example

/tmp/my_folder/file.tar.gz

check_exists()

Returns True if fully_qualified_dest_filename exists and its MD5 checksum matches self.checksum. If self.checksum is None, the existence of the file is sufficient.
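
The check can be illustrated with a small standalone sketch. It is a variant rather than a copy of the method: the hypothetical helper `file_matches` hashes the file in chunks instead of reading it into memory at once, and it skips hashing entirely when no checksum is given.

```python
import hashlib
import os
import tempfile
from typing import Optional


def file_matches(path: str, checksum: Optional[str]) -> bool:
    """Return True if `path` exists and, when a checksum is given, its MD5 matches."""
    if not os.path.exists(path):
        return False
    if checksum is None:
        # No checksum provided: existence alone counts as success.
        return True
    digest = hashlib.md5()
    with open(path, "rb") as fd:
        # Read 1 MiB at a time so large files are never fully held in memory.
        for chunk in iter(lambda: fd.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == checksum


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as fd:
        fd.write(b"hello")
    expected = hashlib.md5(b"hello").hexdigest()
    ok = file_matches(path, expected)            # correct checksum
    bad = file_matches(path, "0" * 32)           # wrong checksum
    no_checksum = file_matches(path, None)       # existence only
    missing = file_matches(os.path.join(tmp, "absent.bin"), expected)
```

Chunked hashing matters mostly for multi-gigabyte downloads, where reading the whole file with a single `read()` call can exhaust memory.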

Source code in bionemo/llm/utils/remote.py
def check_exists(self):
    """Returns true if `fully_qualified_dest_filename` exists and the checksum matches `self.checksum`"""  # noqa: D415
    if os.path.exists(self.fully_qualified_dest_filename):
        with open(self.fully_qualified_dest_filename, "rb") as fd:
            data = fd.read()
            result = md5(data).hexdigest()
        if self.checksum is None:
            logging.info("No checksum provided, filename exists. Assuming it is complete.")
            matches = True
        else:
            matches = result == self.checksum
        return matches

    return False

download_resource(overwrite=False)

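Downloads the resource to the path given by fully_qualified_dest_filename. The download is skipped when the file already exists and passes check_exists(), unless overwrite is True.

Returns: the fully qualified destination filename.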

Source code in bionemo/llm/utils/remote.py
def download_resource(self, overwrite=False) -> str:
    """Downloads the resource to its specified fully_qualified_dest name.

    Returns: the fully qualified destination filename.
    """
    self.exists_or_create_destination_directory()

    if not self.check_exists() or overwrite:
        logging.info(f"Downloading resource: {self.url}")
        with requests.get(self.url, stream=True) as r, open(self.fully_qualified_dest_filename, "wb") as fd:
            r.raise_for_status()
            for bytes in r:
                fd.write(bytes)
    else:
        logging.info(f"Resource already exists, skipping download: {self.url}")

    self.check_exists()
    return self.fully_qualified_dest_filename

exists_or_create_destination_directory(exist_ok=True)

Checks that fully_qualified_dest_folder exists; if it does not, the directory is created (or the call fails).

exist_ok: passed through to os.makedirs. If True, an already existing folder is not treated as an error.
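
The effect of the flag can be seen in a short standard-library demonstration. The directory names are placeholders created inside a temporary folder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "a", "b")

    # First call creates both the intermediate and the leaf directory.
    os.makedirs(target, exist_ok=True)
    # Second call is a no-op because exist_ok=True tolerates an existing folder.
    os.makedirs(target, exist_ok=True)

    # With exist_ok=False the same call raises FileExistsError.
    try:
        os.makedirs(target, exist_ok=False)
        raised = False
    except FileExistsError:
        raised = True

    created = os.path.isdir(target)
```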

Source code in bionemo/llm/utils/remote.py
def exists_or_create_destination_directory(self, exist_ok=True):
    """Checks that `fully_qualified_dest_folder` exists; if it does not, the directory is created (or fails).

    exist_ok: Passed to `os.makedirs`; if True, an already existing folder is not treated as an error.
    """
    os.makedirs(self.fully_qualified_dest_folder, exist_ok=exist_ok)

get_env_tmpdir() staticmethod

Convenience method that exposes the TMPDIR environment variable, falling back to /tmp when it is not set.
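
A standalone sketch of the lookup and its fallback follows. The function is redefined here so the snippet runs on its own, and the value `/scratch/tmp` is a hypothetical setting used only for the demonstration.

```python
import os


def get_env_tmpdir() -> str:
    """Return $TMPDIR if set, otherwise "/tmp"."""
    return os.environ.get("TMPDIR", "/tmp")


# Remember the current setting so the demo leaves the environment unchanged.
original = os.environ.get("TMPDIR")

os.environ["TMPDIR"] = "/scratch/tmp"  # hypothetical value
with_var = get_env_tmpdir()

del os.environ["TMPDIR"]
without_var = get_env_tmpdir()

# Restore whatever was there before.
if original is not None:
    os.environ["TMPDIR"] = original
```

For comparison, the standard library's `tempfile.gettempdir()` also consults the `TEMP` and `TMP` variables and platform-specific defaults, so it may be the more portable choice outside POSIX systems.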

Source code in bionemo/llm/utils/remote.py
@staticmethod
def get_env_tmpdir():
    """Convenience method that exposes the environment TMPDIR variable."""
    return os.environ.get("TMPDIR", "/tmp")