BioSequenceDataset
BioSequenceDataset loads data from and saves data to a sequence file.
kedro_datasets.biosequence.BioSequenceDataset
BioSequenceDataset(
    *,
    filepath,
    load_args=None,
    save_args=None,
    credentials=None,
    fs_args=None,
    metadata=None,
)
Bases: AbstractDataset[list, list]
Examples:
Using the Python API:
>>> import tempfile
>>> from io import StringIO
>>> from pathlib import Path
>>>
>>> from Bio import SeqIO
>>> from kedro_datasets.biosequence import BioSequenceDataset
>>>
>>> # Parse two FASTA records from an in-memory string.
>>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n"
>>> raw_data = []
>>> for record in SeqIO.parse(StringIO(data), "fasta"):
...     raw_data.append(record)
...
>>>
>>> # tmp_path is not defined in this snippet; a temporary directory is assumed here.
>>> tmp_path = Path(tempfile.mkdtemp())
>>> dataset = BioSequenceDataset(
...     filepath=tmp_path / "ls_orchid.fasta",
...     load_args={"format": "fasta"},
...     save_args={"format": "fasta"},
... )
>>> dataset.save(raw_data)
>>> sequence_list = dataset.load()
>>> assert raw_data[0].id == sequence_list[0].id
>>> assert raw_data[0].seq == sequence_list[0].seq
Parameters:
- filepath (str | PathLike) – Filepath in POSIX format to the sequence file, prefixed with a protocol like s3://. If no prefix is provided, the file protocol (local filesystem) is used. The prefix can be any protocol supported by fsspec.
- load_args (dict[str, Any] | None, default: None) – Options for parsing sequence files with Biopython SeqIO.parse().
- save_args (dict[str, Any] | None, default: None) – Options for writing sequence files with Biopython SeqIO.write(), including the file format. E.g. {"format": "fasta"}.
- credentials (dict[str, Any] | None, default: None) – Credentials required to access the underlying filesystem. E.g. for GCSFileSystem it should look like {"token": None}.
- fs_args (dict[str, Any] | None, default: None) – Extra arguments to pass to the underlying filesystem class constructor (e.g. {"project": "my-project"} for GCSFileSystem), as well as arguments to pass to the filesystem's open method through the nested keys open_args_load and open_args_save. All available arguments for open are listed at https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.open. All defaults are preserved, except mode, which is set to r when loading and to w when saving.
- metadata (dict[str, Any] | None, default: None) – Any arbitrary metadata. This is ignored by Kedro, but may be consumed by users or external plugins.
Note: Here you can find all supported file formats: https://biopython.org/wiki/SeqIO
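The fs_args description mixes two kinds of settings in one dictionary: arguments for the filesystem constructor, and arguments for open nested under open_args_load and open_args_save. The following is a minimal sketch of how such a dictionary could be separated. The helper name split_fs_args and the encoding values are hypothetical, chosen for illustration only; this is not the dataset's actual implementation.

```python
from __future__ import annotations

from copy import deepcopy
from typing import Any


def split_fs_args(
    fs_args: dict[str, Any] | None,
) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any]]:
    """Hypothetical helper: separate constructor arguments from open() arguments."""
    # Copy first so the caller's dictionary is left untouched.
    remaining = deepcopy(fs_args) or {}
    open_args_load = remaining.pop("open_args_load", {})
    open_args_save = remaining.pop("open_args_save", {})
    return remaining, open_args_load, open_args_save


# Example values are assumptions for illustration.
fs_args = {
    "project": "my-project",                  # filesystem constructor argument
    "open_args_load": {"encoding": "utf-8"},  # forwarded to open() when loading
    "open_args_save": {"encoding": "utf-8"},  # forwarded to open() when saving
}

constructor_args, load_open_args, save_open_args = split_fs_args(fs_args)
```

With this layout, everything outside the two nested keys is treated as a constructor argument, so a key such as project never reaches open.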
Source code in kedro_datasets/biosequence/biosequence_dataset.py
DEFAULT_FS_ARGS
class-attribute instance-attribute
DEFAULT_FS_ARGS = {
    "open_args_save": {"mode": "w"},
    "open_args_load": {"mode": "r"},
}
_fs_open_args_load
instance-attribute
_fs_open_args_load = {
    **self.DEFAULT_FS_ARGS.get("open_args_load", {}),
    **(_fs_open_args_load or {}),
}
_fs_open_args_save
instance-attribute
_fs_open_args_save = {
    **self.DEFAULT_FS_ARGS.get("open_args_save", {}),
    **(_fs_open_args_save or {}),
}
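The two attributes above combine DEFAULT_FS_ARGS with any user-supplied open_args_load and open_args_save. Assuming the combination is a plain dictionary merge in which user values take precedence, the effect can be sketched as follows. The function name merge_open_args and the example values are hypothetical.

```python
from __future__ import annotations

from typing import Any

# Defaults as documented for this dataset.
DEFAULT_FS_ARGS: dict[str, dict[str, Any]] = {
    "open_args_save": {"mode": "w"},
    "open_args_load": {"mode": "r"},
}


def merge_open_args(key: str, user_args: dict[str, Any] | None) -> dict[str, Any]:
    """Hypothetical helper: layer user-supplied open() arguments over the defaults."""
    # Later entries win, so user values override the default mode if given.
    return {**DEFAULT_FS_ARGS.get(key, {}), **(user_args or {})}


load_open_args = merge_open_args("open_args_load", {"encoding": "utf-8"})
save_open_args = merge_open_args("open_args_save", None)
overridden = merge_open_args("open_args_save", {"mode": "a"})
```

Under this assumption, the default mode is used unless the caller supplies one explicitly, and any other open argument passes through unchanged.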
_describe
_describe()
Source code in kedro_datasets/biosequence/biosequence_dataset.py
_exists
_exists()
Source code in kedro_datasets/biosequence/biosequence_dataset.py
_release
_release()
Source code in kedro_datasets/biosequence/biosequence_dataset.py
invalidate_cache
invalidate_cache()
Invalidate underlying filesystem caches.
Source code in kedro_datasets/biosequence/biosequence_dataset.py
load
load()
Source code in kedro_datasets/biosequence/biosequence_dataset.py
save
save(data)
Source code in kedro_datasets/biosequence/biosequence_dataset.py