pypdf.PDFDataset
kedro_datasets_experimental.pypdf.PDFDataset
PDFDataset(
    *,
    filepath,
    load_args=None,
    credentials=None,
    fs_args=None,
    metadata=None
)
Bases: AbstractDataset[NoReturn, list[str]]
PDFDataset loads data from PDF files using an underlying
filesystem (e.g. local, S3, GCS). It uses pypdf to read PDF files and extract their text.
This is a read-only dataset: saving is not supported.
Examples:
Using the YAML API:
my_pdf_document:
  type: pypdf.PDFDataset
  filepath: data/01_raw/document.pdf

password_protected_pdf:
  type: pypdf.PDFDataset
  filepath: data/01_raw/protected.pdf
  load_args:
    password: "pass123"  # pragma: allowlist secret

s3_pdf:
  type: pypdf.PDFDataset
  filepath: s3://your_bucket/document.pdf
  credentials: dev_s3
Using the Python API:
>>> from kedro_datasets_experimental.pypdf import PDFDataset
>>>
>>> dataset = PDFDataset(filepath="data/document.pdf")
>>> pages = dataset.load()
>>> # pages is a list of strings, one per page
>>> assert isinstance(pages, list)
>>> assert all(isinstance(page, str) for page in pages)
Parameters:
- filepath (str) – Filepath in POSIX format to a PDF file prefixed with a protocol like s3://. If prefix is not provided, file protocol (local filesystem) will be used. The prefix should be any protocol supported by fsspec.
- load_args (dict[str, Any] | None, default: None) – Pypdf options for loading PDF files (arguments passed into pypdf.PdfReader). Here you can find all available arguments: https://pypdf.readthedocs.io/en/stable/modules/PdfReader.html All defaults are preserved, except "strict", which is set to False. Common options include:
    - password (str): Password for encrypted PDFs
    - strict (bool): Whether to raise errors on malformed PDFs (default: False)
- credentials (dict[str, Any] | None, default: None) – Credentials required to get access to the underlying filesystem. E.g. for GCSFileSystem it should look like {"token": None}.
- fs_args (dict[str, Any] | None, default: None) – Extra arguments to pass into underlying filesystem class constructor (e.g. {"project": "my-project"} for GCSFileSystem), as well as to pass to the filesystem's open method through nested keys open_args_load and open_args_save. Here you can find all available arguments for open: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.open All defaults are preserved.
- metadata (dict[str, Any] | None, default: None) – Any arbitrary metadata. This is ignored by Kedro, but may be consumed by users or external plugins.
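The handling of load_args and fs_args described in the parameter list can be sketched in plain Python. This is an illustration under stated assumptions, not the dataset's actual source: the helper name resolve_args and the constant DEFAULT_LOAD_ARGS are hypothetical, and dropping open_args_save is an assumption made here only because the dataset is read-only. The documented facts it relies on are that strict defaults to False and that open_args_load is a nested key of fs_args.

```python
from __future__ import annotations

from copy import deepcopy
from typing import Any

# Assumption: strict=False is the only default that differs from pypdf's own.
DEFAULT_LOAD_ARGS: dict[str, Any] = {"strict": False}


def resolve_args(
    load_args: dict[str, Any] | None = None,
    fs_args: dict[str, Any] | None = None,
) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any]]:
    """Hypothetical helper: split user arguments into three groups.

    Returns (reader arguments, filesystem constructor arguments,
    arguments for the filesystem's open call when loading).
    """
    # User-supplied load_args override the defaults.
    reader_args = {**DEFAULT_LOAD_ARGS, **(load_args or {})}

    # Copy first so pop() does not mutate the caller's dictionary.
    fs_ctor_args = deepcopy(fs_args) if fs_args else {}
    open_args_load = fs_ctor_args.pop("open_args_load", {})
    # Assumption: open_args_save is discarded because saving is unsupported.
    fs_ctor_args.pop("open_args_save", None)
    return reader_args, fs_ctor_args, open_args_load


reader_args, fs_ctor_args, open_args_load = resolve_args(
    load_args={"password": "pass123"},
    fs_args={"project": "my-project", "open_args_load": {"mode": "rb"}},
)
print(reader_args)     # {'strict': False, 'password': 'pass123'}
print(fs_ctor_args)    # {'project': 'my-project'}
print(open_args_load)  # {'mode': 'rb'}
```

Passing load_args={"strict": True} to this helper would override the default, which matches the documented statement that strict is only a default and not a fixed setting.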
Source code in kedro_datasets_experimental/pypdf/pdf_dataset.py
_describe
_describe()
_exists
_exists()
Check if the PDF file exists.
Returns:
- bool – True if the file exists, False otherwise.
_invalidate_cache
_invalidate_cache()
Invalidate underlying filesystem caches.
_release
_release()
Release any cached filesystem information.
load
load()
Loads data from a PDF file.
Returns:
- list[str] – A list of strings, where each string contains the text extracted from one page.
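The one-string-per-page return shape can be illustrated with a short sketch. It is not the dataset's source code: extract_page_texts and pages_containing are hypothetical helpers, and FakePage is a stand-in so the example runs without a PDF file. With pypdf installed, the pages attribute of a pypdf.PdfReader could be passed to extract_page_texts instead, since pypdf page objects expose extract_text().

```python
from __future__ import annotations

from typing import Any, Iterable


def extract_page_texts(pages: Iterable[Any]) -> list[str]:
    """Return one string per page; each page needs an extract_text() method."""
    # `or ""` keeps the result a list of str even if a page yields no text.
    return [page.extract_text() or "" for page in pages]


def pages_containing(texts: list[str], keyword: str) -> list[int]:
    """Return 1-based page numbers whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [i for i, text in enumerate(texts, start=1) if needle in text.lower()]


class FakePage:
    """Stand-in for a pypdf page object (hypothetical, for illustration only)."""

    def __init__(self, text: str) -> None:
        self._text = text

    def extract_text(self) -> str:
        return self._text


texts = extract_page_texts(
    [FakePage("Intro to Kedro"), FakePage(""), FakePage("kedro datasets")]
)
print(pages_containing(texts, "kedro"))  # [1, 3]
```

Because load() returns a plain list of strings, helpers like pages_containing can be applied directly to its result in a downstream node; pages with no extractable text simply appear as empty strings in this sketch.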
save
save(data)
Saving to PDFDataset is not supported.
Parameters:
- data (NoReturn) – Data to save.

Raises:
- DatasetError – Always raised as saving is not supported.
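Calling code that treats datasets generically may want to guard against this error. The sketch below shows the pattern with local stand-ins so it runs without Kedro installed: the DatasetError class defined here only mimics the exception named above, and ReadOnlyStub and try_save are hypothetical names, not part of kedro_datasets_experimental.

```python
from __future__ import annotations


class DatasetError(Exception):
    """Local stand-in for Kedro's DatasetError (keeps the sketch dependency-free)."""


class ReadOnlyStub:
    """Hypothetical dataset-like object that can be loaded but not saved."""

    def load(self) -> list[str]:
        return ["page one text", "page two text"]

    def save(self, data: object) -> None:
        # Mirrors the documented behaviour: saving always fails.
        raise DatasetError("Saving is not supported for this read-only dataset.")


def try_save(dataset: ReadOnlyStub, data: object) -> str | None:
    """Return the error message if saving fails, otherwise None."""
    try:
        dataset.save(data)
    except DatasetError as exc:
        return str(exc)
    return None


message = try_save(ReadOnlyStub(), ["some text"])
print(message)  # Saving is not supported for this read-only dataset.
```

In a real pipeline the simpler design is to never list a PDFDataset entry as a node output, so that save is never called in the first place.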