This repository was archived by the owner on Jul 31, 2023. It is now read-only.

Commit f4650ca

cfezequiel, mbernico, and lc0 authored
Release/2.0 (#56)

* Update check_tfrecords to use new dataset load function.
* Add tfrecord_dir to create_tfrecords output.
* Restructure test image directory to match expected format.
* Feature/dataclass (#44)
* Added data classes for types.
* Checking in progress.
* Checking in more changes.
* Converted types to classes and refactored schema into OO pattern.
* Changed OrderedDict import to support py3.6.
* Changed OrderedDict import to support py3.6.
* Updated setup.py for version.
* fixing setup.py
* Patched requirements and setup.
* Addressed comments in code review.
* Addressed code comments round 2.
* refactored IMAGE_CSV_SCHEMA.
* Merged check_test.py from dev

Co-authored-by: Carlos Ezequiel <[email protected]>

* Feature/structured data tutorial (#45)
* Converted types to classes and refactored schema into OO pattern.
* Add tutorial on structured data conversion. This changes types.FloatInput to use tf.float32 for its feature_spec attribute to address potential incompatibility with using tf.float64 type in TensorFlow Transform.

Co-authored-by: Mike Bernico <[email protected]>

* Update structured data tutorial to use output dir.
* Clarify need for proper header when using create_tfrecords. Fixes #47.
* Clean up README and update image directory notebook.
* Feature/test image dir (#49)
* Restructure test image directory to match expected format.
* Clean up README and update image directory notebook.
* Fix minor issues
* Add an explicit error message for missing train split
* Configure automated tests for Jupyter notebooks.
* Add convert_and_load function. Also refactor create_tfrecords to convert.
* Refactor check and common modules to utils.
* Add test targets for py files and notebooks.
* Feature/convert and load (#55)
* Add convert_and_load function. Also refactor create_tfrecords to convert.
* Refactor check and common modules to utils.
* Add test targets for py files and notebooks.
* Update version in setup.py and release notes.
* Fix issues with GCS path parsing.

Co-authored-by: Mike Bernico <[email protected]>
Co-authored-by: Sergii Khomenko <[email protected]>
1 parent 747dabe commit f4650ca

38 files changed (+2333, -742 lines)

.github/workflows/python-cicd.yml

Lines changed: 7 additions & 4 deletions
@@ -7,7 +7,6 @@ on: [push]
 
 jobs:
   build:
-
     runs-on: ubuntu-latest
     strategy:
       matrix:
@@ -23,10 +22,14 @@ jobs:
       run: |
         python -m pip install --upgrade pip
         if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+
+    - name: Run all tests
+      run: |
+        export PYTHONPATH="$GITHUB_WORKSPACE"
+        make test
+
     - name: Lint with pylint
       run: |
         make pylint
 
-    - name: Run tests
-      run: |
-        make test
+

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -1,3 +1,6 @@
+build/
+dist/
+tfrecorder.egg-info
 .idea/
 .ipynb_checkpoints/
 .vscode/

Makefile

Lines changed: 9 additions & 4 deletions
@@ -1,12 +1,17 @@
-all: init test pylint
+all: init testnb test pylint
 
 init:
 	pip install -r requirements.txt
 
-test:
+test: test-nb test-py
+
+test-py:
 	nosetests --with-coverage -v --cover-package=tfrecorder
 
+test-nb:
+	ls -1 samples/*.ipynb | grep -v '^.*Dataflow.ipynb' | xargs py.test --nbval-lax -p no:python
+
 pylint:
-	pylint tfrecorder
+	pylint -j 0 tfrecorder
 
-.PHONY: all init test pylint
+.PHONY: all init test pylint

README.md

Lines changed: 95 additions & 94 deletions
@@ -9,7 +9,7 @@ TFRecorder can convert any Pandas DataFrame or CSV file into TFRecords. If your
 [Release Notes](RELEASE.md)
 
 ## Why TFRecorder?
-Using the TFRecord storage format is important for optimal machine learning pipelines and getting the most from your hardware (in cloud or on prem). The TFRecorder project started inside [Google Cloud AI Services](https://cloud.google.com/consulting) when we realized we were writing TFRecord conversion code over and over again.
+Using the TFRecord storage format is important for optimal machine learning pipelines and getting the most from your hardware (in cloud or on prem). The TFRecorder project started inside [Google Cloud AI Services](https://cloud.google.com/consulting) when we realized we were writing TFRecord conversion code over and over again.
 
 When to use TFRecords:
 * Your model is input bound (reading data is impacting training time).
@@ -71,7 +71,7 @@ df.tensorflow.to_tfr(output_dir='/my/output/path')
 
 Google Cloud Platform Dataflow workers need to be supplied with the tfrecorder
 package that you would like to run remotely. To do so first download or build
-the package (a python wheel file) and then specify the path the the file when
+the package (a python wheel file) and then specify the path the file when
 tfrecorder is called.
 
 Step 1: Download or create the wheel file.
@@ -109,7 +109,7 @@ Using Python interpreter:
 ```python
 import tfrecorder
 
-tfrecorder.create_tfrecords(
+tfrecorder.convert(
     source='/path/to/data.csv',
     output_dir='gs://my/bucket')
 ```
@@ -126,10 +126,9 @@ tfrecorder create-tfrecords \
 ```python
 import tfrecorder
 
-tfrecorder.create_tfrecords(
+tfrecorder.convert(
     source='/path/to/image_dir',
-    output_dir='gs://my/bucket',
-)
+    output_dir='gs://my/bucket')
 ```
 
 The image directory should have the following general structure:
@@ -159,7 +158,7 @@ images/
 
 ### Loading a TF Dataset from TFRecord files
 
-You can load a TensorFlow dataset from TFRecord files generated by TFRecorder
+You can load a TensorFlow dataset from TFRecord files generated by TFRecorder
 on your local machine.
 
 ```python
@@ -175,8 +174,9 @@ Using Python interpreter:
 ```python
 import tfrecorder
 
-tfrecorder.check_tfrecords(
-    file_pattern='/path/to/tfrecords/train*.tfrecord.gz',
+tfrecorder.inspect(
+    tfrecord_dir='/path/to/tfrecords/',
+    split='TRAIN',
     num_records=5,
     output_dir='/tmp/output')
 ```
@@ -187,16 +187,17 @@ representing the images encoded into TFRecords.
 Using the command line:
 
 ```bash
-tfrecorder check-tfrecords \
-  --file_pattern=/path/to/tfrecords/train*.tfrecord.gz \
+tfrecorder inspect \
+  --tfrecord-dir=/path/to/tfrecords/ \
+  --split='TRAIN' \
   --num_records=5 \
   --output_dir=/tmp/output
 ```
 
 ## Default Schema
 
-If you don't specify an input schema, TFRecorder expects data to be in the same format as
-[AutoML Vision input](https://cloud.google.com/vision/automl/docs/prepare).
+If you don't specify an input schema, TFRecorder expects data to be in the same format as
+[AutoML Vision input](https://cloud.google.com/vision/automl/docs/prepare).
 This format looks like a Pandas DataFrame or CSV formatted as:
 
 | split | image_uri | label |
@@ -205,139 +206,139 @@ This format looks like a Pandas DataFrame or CSV formatted as:
 
 where:
 * `split` can take on the values TRAIN, VALIDATION, and TEST
-* `image_uri` specifies a local or Google Cloud Storage location for the image file.
-* `label` can be either a text based label that will be integerized or integer
+* `image_uri` specifies a local or Google Cloud Storage location for the image file.
+* `label` can be either a text-based label that will be integerized or integer
 
 ## Flexible Schema
 
-TFRecorder's flexible schema system allows you to use any schema you want for your input data. To support any input data schema, provide a schema map to TFRecorder. A TFRecorder schema_map creates a mapping between your dataframe column names and their types in the resulting
-TFRecord.
+TFRecorder's flexible schema system allows you to use any schema you want for your input data.
 
-### Creating and using a schema map
-A schema map is a Python dictionary that maps DataFrame column names to [supported
-TFRecorder types.](#Supported-types)
+For example, the default image CSV schema input can be defined like this:
+```python
+import pandas as pd
+import tfrecorder
+from tfrecorder import input_schema
+from tfrecorder import types
 
-For example, the default image CSV input can be defined like this:
+image_csv_schema = input_schema.Schema({
+    'split': types.SplitKey,
+    'image_uri': types.ImageUri,
+    'label': types.StringLabel
+})
 
-```python
-from tfrecorder import schema
+# You can then pass the schema to `tfrecorder.create_tfrecords`.
 
-image_csv_schema = {
-    'split': schema.split_key,
-    'image_uri': schema.image_uri,
-    'label': schema.string_label
-}
+df = pd.read_csv(...)
+df.tensorflow.to_tfr(
+    output_dir='gs://my/bucket',
+    schema_map=image_csv_schema,
+    runner='DataflowRunner',
+    project='my-project',
+    region='us-central1')
 ```
-Once created a schema_map can be sent to TFRecorder.
+
+### Flexible Schema Example
+
+Imagine that you have a dataset that you would like to convert to TFRecords that
+looks like this:
+
+| split | x | y | label |
+|-------|-------|------|-------|
+| TRAIN | 0.32 | 42 |1 |
+
+You can use TFRecorder as shown below:
 
 ```python
 import pandas as pd
-from tfrecorder import schema
 import tfrecorder
+from tfrecorder import input_schema
+from tfrecorder import types
+
+# First create a schema map
+schema = input_schema.Schema({
+    'split': types.SplitKey,
+    'x': types.FloatInput,
+    'y': types.IntegerInput,
+    'label': types.IntegerLabel,
+})
+
+# Now call TFRecorder with the specified schema_map
 
 df = pd.read_csv(...)
 df.tensorflow.to_tfr(
     output_dir='gs://my/bucket',
-    schema_map=schema.image_csv_schema,
+    schema=schema,
     runner='DataflowRunner',
     project='my-project',
     region='us-central1')
 ```
+After calling TFRecorder's `to_tfr()` function, TFRecorder will create an Apache beam pipeline, either locally or in this case
+using Google Cloud's Dataflow runner. This beam pipeline will use the schema map to identify the types you've associated with
+each data column and process your data using [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started) and TFRecorder's image processing functions to convert the data into into TFRecords.
 
 ### Supported types
-TFRecorder's schema system supports several types, all listed below. You can use
-these types by referencing them in the schema map. Each type informs TFRecorder how
-to treat your DataFrame columns. For example, the schema mapping
-`my_split_key: schema.SplitKeyType` tells TFRecorder to treat the column `my_split_key` as
-type `schema.SplitKeyType` and create dataset splits based on it's contents.
 
-#### schema.ImageUriType
-* Specifies the path to an image. When specified, TFRecorder
-will load the specified image and store the image as a [base64 encoded](https://docs.python.org/3/library/base64.html)
-[tf.string](https://www.tensorflow.org/tutorials/load_data/unicode) in the key 'image'
-along with the height, width, and image channels as integers using they keys 'image_height', 'image_width', and 'image_channels'.
-* A schema can contain only one imageUriType
+TFRecorder's schema system supports several types.
+You can use these types by referencing them in the schema map.
+Each type informs TFRecorder how to treat your DataFrame columns.
+
+#### types.SplitKey
 
-#### schema.SplitKeyType
 * A split key is required for TFRecorder at this time.
 * Only one split key is allowed.
-* Specifies a split key that TFRecorder will use to partition the
+* Specifies a split key that TFRecorder will use to partition the
 input dataset on.
 * Allowed values are 'TRAIN', 'VALIDATION, and 'TEST'
 
-Note: If you do not want your data to be partitioned please include a split_key and
-set all rows to TRAIN.
+Note: If you do not want your data to be partitioned, include a column with
+`types.SplitKey` and set all the elements to `TRAIN`.
+
+#### types.ImageUri
+
+* Specifies the path to an image. When specified, TFRecorder
+will load the specified image and store the image as a [base64 encoded](https://docs.python.org/3/library/base64.html)
+[tf.string](https://www.tensorflow.org/tutorials/load_data/unicode) in the key 'image'
+along with the height, width, and image channels as integers using the keys 'image_height', 'image_width', and 'image_channels'.
+* A schema can contain only one imageUri column
+
+#### types.IntegerInput
 
-#### schema.IntegerInputType
 * Specifies an int input.
 * Will be scaled to mean 0, variance 1.
 
-#### schema.FloatInputType
+#### types.FloatInput
+
 * Specifies an float input.
 * Will be scaled to mean 0, variance 1.
 
-#### schema.CategoricalInputType
+#### types.CategoricalInput
+
 * Specifies a string input.
 * Vocabulary computed and output integerized.
 
-#### schema.IntegerLabelType
+#### types.IntegerLabel
+
 * Specifies an integer target.
 * Not transformed.
 
-#### schema.StringLabelType
+#### types.StringLabel
+
 * Specifies a string target.
 * Vocabulary computed and *output integerized.*
 
-### Flexible Schema Example
-
-Imagine that you have a dataset that you would like to convert to TFRecords that
-looks like this:
-
-| split | x | y | label |
-|-------|-------|------|-------|
-| TRAIN | 0.32 | 42 |1 |
-
-You can use TFRecorder as shown below:
-
-```python
-import pandas as pd
-import tfrecorder
-from tfrecorder import schema
-
-# First create a schema map
-schema_map = {
-    'split':schema.SplitKeyType,
-    'x':schema.FloatInputType,
-    'y':schema.IntegerInputType,
-    'label':schema.IntegerLabelType
-}
-
-# Now call TFRecorder with the specified schema_map
-
-df = pd.read_csv(...)
-df.tensorflow.to_tfr(
-    output_dir='gs://my/bucket',
-    schema_map=schema_map,
-    runner='DataflowRunner',
-    project='my-project',
-    region='us-central1')
-```
-After calling TFRecorder's to_tfr() function, TFRecorder will create an Apache beam pipeline, either locally or in this case
-using Google Cloud's Dataflow runner. This beam pipeline will use the schema map to identify the types you've associated with
-each data column and process your data using [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started) and TFRecorder's image processing functions to convert the data into into TFRecords.
-
 ## Contributing
 
-Pull requests are welcome. Please see our [code of conduct](docs/code-of-conduct.md) and [contributing guide](docs/contributing.md).
+Pull requests are welcome.
+Please see our [code of conduct](docs/code-of-conduct.md) and [contributing guide](docs/contributing.md).
 
 ## Why TFRecorder?
-Using the TFRecord storage format is important for optimal machine learning pipelines and getting the most from your hardware (in cloud or on prem).
+
+Using the TFRecord storage format is important for optimal machine learning pipelines and getting the most from your hardware (in cloud or on prem).
 
 TFRecords help when:
 * Your model is input bound (reading data is impacting training time).
 * Anytime you want to use tf.Dataset
 * When your dataset can't fit into memory
 
-
-In our work at [Google Cloud AI Services](https://cloud.google.com/consulting) we wanted to help our users spend their time writing AI/ML applications, and spend less time converting data.
-
+Need help with using AI in the cloud?
+Visit [Google Cloud AI Services](https://cloud.google.com/consulting).

RELEASE.md

Lines changed: 8 additions & 0 deletions
@@ -1,3 +1,11 @@
+# Release 2.0
+
+* Changes `create_tfrecords` and `check_tfrecords` to `convert` and `inspect` respectively
+* Adds `convert_and_load` function
+* Changes flexible schema to use `dataclasses`
+* Adds automated testing for notebooks
+* Minor fixes and usability improvements
+
 # Hotfix 1.1.3
 
 * Adds note regarding DataFrame header specification in README.md.
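
The new `convert_and_load` entry point is announced in these release notes but not demonstrated in the updated README. Below is a minimal sketch of how it might be called, assuming the argument names mirror `convert` (`source`, `output_dir`) and that the function returns the converted splits as TensorFlow datasets; the exact signature may differ.

```python
import tfrecorder

# Hedged sketch: convert a CSV to TFRecords and load the resulting splits
# in one call. Argument names and the return type are assumptions based on
# the release notes above, not a confirmed signature.
datasets = tfrecorder.convert_and_load(
    source='/path/to/data.csv',
    output_dir='/tmp/tfrecords')

# If the result maps split names to tf.data.Dataset objects, the TRAIN
# split could be fed directly to a Keras model, e.g.:
# model.fit(datasets['TRAIN'], epochs=5)
```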

requirements.txt

Lines changed: 3 additions & 0 deletions
@@ -12,3 +12,6 @@ jupyter >= 1.0.0
 tensorflow >= 2.3.1
 pyarrow <0.18,>=0.17
 frozendict >= 1.2
+dataclasses >= 0.5;python_version<"3.7"
+nbval >= 0.9.6
+pytest >= 6.1.1

0 commit comments