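Responsible for downloading remote files, along with optional processing of the downloaded files for downstream use cases.
Each object is invoked either through its constructor (which sets up the destination and checksum) or through a pre-configured class method. download_resource() contains the core functionality, which is to download the file at url to the fully qualified filename. Class methods can be used to further configure this process.
Receives:
a file, its checksum, a destination directory, and a root directory
The dataclass then provides:
- the fully qualified destination folder (property)
- the fully qualified destination file (property)
- check_exists()
- download_resource()
The download routine handles all of the following steps:
- form the fully qualified destination folder
- create a fully qualified path for the file
- check that the fully qualified destination folder exists, and create it otherwise
- download the file
- checksum the download
Postprocessing steps should be separate methods with their own configuration.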
Example usage
The following will download and preprocess the prepackaged resources.
GRCh38Ensembl99ResourcePreparer().prepare()
Hg38chromResourcePreparer().prepare()
GRCh38p13_ResourcePreparer().prepare()
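Attributes
| Name | Type | Description |
| --- | --- | --- |
| dest_directory | str | The directory in which to place the file once the download completes. The resulting path should have the form {dest_directory}/{dest_filename}. |
| dest_filename | str | The desired name for the file once the download completes. |
| checksum | Optional[str] | Checksum associated with the file located at url. If set to None, check_exists only checks for the existence of {dest_directory}/{dest_filename}. |
| url | Optional[str] | URL of the file to download. |
| root_directory | str \| PathLike | The bottom-level directory. The fully qualified path is formed by joining root_directory, dest_directory, and dest_filename. |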
Source code in bionemo/llm/utils/remote.py
@dataclass
class RemoteResource:
    """Responsible for downloading remote files, along with optional processing of downloaded files for downstream use cases.

    Each object is invoked through either its constructor (setting up the destination and checksum), or through a pre-configured class method.
    `download_resource()` contains the core functionality, which is to download the file at `url` to the fully qualified filename. Class methods
    can be used to further configure this process.

    Receive:
        a file, its checksum, a destination directory, and a root directory

        Our dataclass then provides some useful things:
            - fully qualified destination folder (property)
            - fully qualified destination file (property)
            - check_exists()
            - download_resource()

        Form the fully qualified destination folder.
        Create a fully qualified path for the file

        (all lives in the download routine)
        Check that the fq destination folder exists, otherwise create it
        Download the file.
        Checksum the download.
        Done.

        Postprocessing should be their own method with their own configuration.

    Example usage:
        >>> # The following will download and preprocess the prepackaged resources.
        >>> GRCh38Ensembl99ResourcePreparer().prepare()
        >>> Hg38chromResourcePreparer().prepare()
        >>> GRCh38p13_ResourcePreparer().prepare()

    Attributes:
        dest_directory: The directory to place the desired file upon completing the download. Should have the form {dest_directory}/{dest_filename}
        dest_filename: The desired name for the file upon completing the download.
        checksum: checksum associated with the file located at url. If set to None, check_exists only checks for the existence of `{dest_directory}/{dest_filename}`
        url: URL of the file to download
        root_directory: the bottom-level directory, the fully qualified path is formed by joining root_directory, dest_directory, and dest_filename.
    """

    checksum: Optional[str]
    dest_filename: str
    dest_directory: str
    root_directory: str | os.PathLike = BIONEMO_CACHE_DIR
    url: Optional[str] = None

    @property
    def fully_qualified_dest_folder(self):  # noqa: D102
        return Path(self.root_directory) / self.dest_directory

    @property
    def fully_qualified_dest_filename(self):
        """Returns the fully qualified destination path of the file.

        Example:
            /tmp/my_folder/file.tar.gz
        """
        return os.path.join(self.fully_qualified_dest_folder, self.dest_filename)

    def exists_or_create_destination_directory(self, exist_ok=True):
        """Checks that the `fully_qualified_dest_folder` exists, if it does not, the directory is created (or fails).

        exist_ok: Tries to create `fully_qualified_dest_folder` if it doesn't already exist.
        """
        os.makedirs(self.fully_qualified_dest_folder, exist_ok=exist_ok)

    @staticmethod
    def get_env_tmpdir():
        """Convenience method that exposes the environment TMPDIR variable."""
        return os.environ.get("TMPDIR", "/tmp")

    def download_resource(self, overwrite=False) -> str:
        """Downloads the resource to its specified fully_qualified_dest name.

        Returns: the fully qualified destination filename.
        """
        self.exists_or_create_destination_directory()

        if not self.check_exists() or overwrite:
            logging.info(f"Downloading resource: {self.url}")
            with requests.get(self.url, stream=True) as r, open(self.fully_qualified_dest_filename, "wb") as fd:
                r.raise_for_status()
                for chunk in r:
                    fd.write(chunk)
        else:
            logging.info(f"Resource already exists, skipping download: {self.url}")

        self.check_exists()
        return self.fully_qualified_dest_filename

    def check_exists(self):
        """Returns true if `fully_qualified_dest_filename` exists and the checksum matches `self.checksum`"""  # noqa: D415
        if os.path.exists(self.fully_qualified_dest_filename):
            with open(self.fully_qualified_dest_filename, "rb") as fd:
                data = fd.read()
                result = md5(data).hexdigest()
            if self.checksum is None:
                logging.info("No checksum provided, filename exists. Assuming it is complete.")
                matches = True
            else:
                matches = result == self.checksum
            return matches

        return False
fully_qualified_dest_filename
property
Returns the fully qualified destination path of the file.
Example
/tmp/my_folder/file.tar.gz
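The path composition can be sketched outside the class. This is a minimal illustration that assumes the same joining logic as the class source (`Path(root_directory) / dest_directory`, then `os.path.join` with `dest_filename`); the helper name `compose_dest_path` and the sample values are hypothetical.

```python
import os
from pathlib import Path


def compose_dest_path(root_directory, dest_directory, dest_filename):
    """Hypothetical helper mirroring the two path properties of the class."""
    # Counterpart of fully_qualified_dest_folder: a pathlib.Path.
    folder = Path(root_directory) / dest_directory
    # Counterpart of fully_qualified_dest_filename: os.path.join returns a str.
    return os.path.join(folder, dest_filename)


path = compose_dest_path("/tmp", "my_folder", "file.tar.gz")
print(path)
```

Note that with pathlib an absolute `dest_directory` replaces `root_directory` entirely, so `dest_directory` is best given as a relative path.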
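check_exists()
Returns true if fully_qualified_dest_filename exists and its checksum matches self.checksum. If self.checksum is None, only the existence of the file is checked.
Source code in bionemo/llm/utils/remote.py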
def check_exists(self):
    """Returns true if `fully_qualified_dest_filename` exists and the checksum matches `self.checksum`"""  # noqa: D415
    if os.path.exists(self.fully_qualified_dest_filename):
        with open(self.fully_qualified_dest_filename, "rb") as fd:
            data = fd.read()
            result = md5(data).hexdigest()
        if self.checksum is None:
            logging.info("No checksum provided, filename exists. Assuming it is complete.")
            matches = True
        else:
            matches = result == self.checksum
        return matches

    return False
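The listing above reads the whole file into memory before hashing it. For large downloads, a chunked variant is a common alternative. The following is a sketch, not the library's implementation: it assumes MD5 hex digests as in the source, and the names `md5_of_file` and `file_matches` are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Hypothetical helper: MD5 hex digest computed in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fd:
        # iter() with a sentinel stops at the empty bytes object returned at EOF.
        for chunk in iter(lambda: fd.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_matches(path, checksum=None):
    """Hypothetical stand-in for check_exists(): existence plus optional MD5 check."""
    if not os.path.exists(path):
        return False
    if checksum is None:
        # No checksum given: existence alone counts as success.
        return True
    return md5_of_file(path) == checksum


# Demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as fd:
        fd.write(b"hello")
    expected = hashlib.md5(b"hello").hexdigest()
    results = (
        file_matches(sample, expected),     # checksum matches
        file_matches(sample, "0" * 32),     # checksum differs
        file_matches(sample),               # no checksum, file exists
        file_matches(sample + ".missing"),  # file does not exist
    )
print(results)
```

Skipping the hash entirely when no checksum is supplied also avoids reading the file at all in that case.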
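download_resource(overwrite=False)
Downloads the resource to its specified fully qualified destination filename. The download is skipped when check_exists() already returns true, unless overwrite is set to True.
Returns: the fully qualified destination filename.
Source code in bionemo/llm/utils/remote.py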
def download_resource(self, overwrite=False) -> str:
    """Downloads the resource to its specified fully_qualified_dest name.

    Returns: the fully qualified destination filename.
    """
    self.exists_or_create_destination_directory()

    if not self.check_exists() or overwrite:
        logging.info(f"Downloading resource: {self.url}")
        with requests.get(self.url, stream=True) as r, open(self.fully_qualified_dest_filename, "wb") as fd:
            r.raise_for_status()
            for chunk in r:
                fd.write(chunk)
    else:
        logging.info(f"Resource already exists, skipping download: {self.url}")

    self.check_exists()
    return self.fully_qualified_dest_filename
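The skip-or-overwrite flow can be illustrated without network access. The sketch below swaps `requests` for the standard library's `urllib.request` and reads from a `file://` URL so that it runs offline; it omits checksum verification, and the helper name `fetch` and all file names are hypothetical.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch(url, dest_folder, dest_filename, overwrite=False):
    """Hypothetical helper: download url into dest_folder/dest_filename."""
    os.makedirs(dest_folder, exist_ok=True)
    dest = os.path.join(dest_folder, dest_filename)
    if overwrite or not os.path.exists(dest):
        # Stream the response body to disk instead of loading it into memory.
        with urllib.request.urlopen(url) as response, open(dest, "wb") as fd:
            shutil.copyfileobj(response, fd)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.txt"
    source.write_bytes(b"first version")
    target_folder = os.path.join(tmp, "cache", "my_folder")

    first = fetch(source.as_uri(), target_folder, "copy.txt")
    content_after_first = Path(first).read_bytes()

    # The source changes, but the existing copy is kept unless overwrite=True.
    source.write_bytes(b"second version")
    fetch(source.as_uri(), target_folder, "copy.txt")
    content_after_skip = Path(first).read_bytes()

    fetch(source.as_uri(), target_folder, "copy.txt", overwrite=True)
    content_after_overwrite = Path(first).read_bytes()
```

In the listing above, the result of the final check_exists() call is not used, so callers that need to confirm the checksum after a download may want to call check_exists() themselves.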
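exists_or_create_destination_directory(exist_ok=True)
Checks that fully_qualified_dest_folder exists; if it does not, the directory is created (or the call fails).
exist_ok: passed through to os.makedirs. If True, no error is raised when fully_qualified_dest_folder already exists.
Source code in bionemo/llm/utils/remote.py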
def exists_or_create_destination_directory(self, exist_ok=True):
    """Checks that the `fully_qualified_dest_folder` exists, if it does not, the directory is created (or fails).

    exist_ok: Tries to create `fully_qualified_dest_folder` if it doesn't already exist.
    """
    os.makedirs(self.fully_qualified_dest_folder, exist_ok=exist_ok)
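The effect of exist_ok can be observed directly on os.makedirs, which the method delegates to. A minimal sketch using a temporary directory; the folder names are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "root", "my_folder")

    # Creates intermediate directories as needed.
    os.makedirs(folder, exist_ok=True)
    created = os.path.isdir(folder)

    # A second call is a no-op when exist_ok=True.
    os.makedirs(folder, exist_ok=True)

    # With exist_ok=False, an existing directory raises FileExistsError.
    try:
        os.makedirs(folder, exist_ok=False)
        raised = False
    except FileExistsError:
        raised = True
```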
get_env_tmpdir()
staticmethod
Convenience method that exposes the TMPDIR environment variable, falling back to /tmp when it is not set.
Source code in bionemo/llm/utils/remote.py
@staticmethod
def get_env_tmpdir():
    """Convenience method that exposes the environment TMPDIR variable."""
    return os.environ.get("TMPDIR", "/tmp")
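The lookup and its fallback can be checked in isolation. The sketch below re-declares the function at module level for illustration; the path `/scratch/tmp` is a hypothetical value, and the original environment is restored afterwards.

```python
import os


def get_env_tmpdir():
    """Same lookup as the listed static method: TMPDIR, else "/tmp"."""
    return os.environ.get("TMPDIR", "/tmp")


saved = os.environ.get("TMPDIR")
try:
    os.environ["TMPDIR"] = "/scratch/tmp"  # hypothetical path, for illustration only
    with_var = get_env_tmpdir()

    del os.environ["TMPDIR"]
    without_var = get_env_tmpdir()
finally:
    # Restore whatever value was set before the demonstration.
    if saved is not None:
        os.environ["TMPDIR"] = saved
```

The standard library's tempfile.gettempdir() is a related option that also consults TEMP and TMP and checks that the directory is usable.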