
Commit 6c83b47

tf-transform-team authored and elmer-garduno committed
Project import generated by Copybara.
PiperOrigin-RevId: 177494698
1 parent 6bd7c13 commit 6c83b47

19 files changed: +620 −208 lines changed

README.md

Lines changed: 13 additions & 17 deletions
@@ -35,18 +35,12 @@ an explicit dependency on TensorFlow as a package. See [TensorFlow
 documentation](https://www.tensorflow.org/install/) for more information on
 installing TensorFlow.
 
-tf.Transform does though have a dependency on the Google Cloud Dataflow
-distribution of Apache Beam.
-Apache Beam is the package used to run distributed pipelines. Apache Beam is
-able to run pipelines in multiple ways, depending on the "runner" used. While
-Apache Beam is an open source package, currently the only distribution on PyPI
-is the Cloud Dataflow distribution. This package can run beam pipelines locally,
-or on Google Cloud Dataflow.
-
-When a base package for Apache Beam (containing no runners) is available, the
-tf.Transform package will depend only on this base package, and users will be
-able to install their own runners. tf.Transform will attempt to be as
-independent from the specific runner as possible.
+tf.Transform does though have a dependency on the GCP distribution of Apache
+Beam. Apache Beam is the framework used to run distributed pipelines. Apache
+Beam is able to run pipelines in multiple ways, depending on the "runner" used,
+and the "runner" is usually provided by a distribution of Apache
+Beam. With the GCP distribution of Apache Beam, one can run beam pipelines
+locally, or on Google Cloud Dataflow.
 
 Note: If you clone tf.Transform's implementation and samples from GitHub's
 `master` branch (as opposed to using the released implementation and samples
@@ -60,11 +54,13 @@ a comprehensive list, meaning other combinations may also work, but these are
 the combinations tested by our testing framework and by the team before
 releasing a new version.
 
-|tensorflow-transform |tensorflow|apache-beam[gcp]|
-|--------------------------------------------------------------------------------|----------|----------------|
-|[GitHub master](https://github.com/tensorflow/transform/blob/master/RELEASE.md) |nightly |latest (2.x) |
-|[0.3.0](https://github.com/tensorflow/transform/blob/v0.3.0/RELEASE.md) |1.3 |2.1.1 |
-|[0.1.10](https://github.com/tensorflow/transform/blob/v0.1.10/RELEASE.md) |1.0 |2.0.0 |
+|tensorflow-transform |tensorflow |apache-beam[gcp]|
+|--------------------------------------------------------------------------------|--------------|----------------|
+|[GitHub master](https://github.com/tensorflow/transform/blob/master/RELEASE.md) |nightly (1.x) |2.2.0 |
+|[0.3.1](https://github.com/tensorflow/transform/blob/v0.3.1/RELEASE.md) |1.3 |2.1.1 |
+|[0.3.0](https://github.com/tensorflow/transform/blob/v0.3.0/RELEASE.md) |1.3 |2.1.1 |
+|[0.1.10](https://github.com/tensorflow/transform/blob/v0.1.10/RELEASE.md) |1.0 |2.0.0 |
+
 ## Getting Started
 
 For instructions on using tf.Transform see the [getting started
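
The paragraph added to the README above notes that, with the GCP distribution of Apache Beam, pipelines can run either locally or on Google Cloud Dataflow. A minimal sketch of how the two modes differ (not part of this commit; the project and bucket names are placeholders, and these options would be passed to the `beam.Pipeline` that wraps a tf.Transform pipeline):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Local execution uses the DirectRunner that ships with apache-beam[gcp].
local_options = PipelineOptions(['--runner=DirectRunner'])

# Distributed execution submits the same pipeline to Google Cloud Dataflow;
# the project and GCS locations below are placeholders.
dataflow_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',
    '--temp_location=gs://my-bucket/tmp',
    '--staging_location=gs://my-bucket/staging',
])
```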

RELEASE.md

Lines changed: 22 additions & 10 deletions
@@ -1,29 +1,41 @@
-# Current Release
+# Current version (not yet released; still in development)
 
 ## Major Features and Improvements
-* Add a combine_analyzer() that supports user provided combiner, conforming to
+* Added a combine_analyzer() that supports user provided combiner, conforming to
 beam.CombineFn(). This allows users to implement custom combiners
 (e.g. median), to complement analyzers (like min, max) that are
 prepackaged in TFT.
+* Quantiles Analyzer (`tft.quantiles`).
 
 ## Bug Fixes and Other Changes
-
+* Depends on `apache-beam[gcp]>=2.2,<3`.
+* Fixes some KeyError issues that appeared in certain circumstances when one
+would call AnalyzeAndTransformDataset (due to a now-fixed Apache Beam [bug]
+(https://issues.apache.org/jira/projects/BEAM/issues/BEAM-2966)).
 * Allow all functions that accept and return tensors, to accept an optional
 name scope, in line with TensorFlow coding conventions.
-
 * Update examples to construct input functions by hand instead of using helper
 functions.
-
 * Change scale_by_min_max/scale_to_0_1 to return the average(min, max) of the
 range in case all values are identical.
-
 * Added export of serving model to examples.
+* Use "core" version of feature columns (tf.feature_column instead of
+tf.contrib) in examples.
+* A few bug fixes and improvements for coders regarding Python 3.
 
 ## Breaking changes
 
+* Requires pre-installed TensorFlow >= 1.4.
+* No longer distributing a WHL file in PyPI. Only doing a source distribution
+which should however be compatible with all platforms (ie you are still able
+to `pip install tensorflow-transform` and use `requirements.txt` or `setup.py`
+files for environment setup).
 * Some functions now introduce a new name scope when they did not before so the
 names of tensors may change. This will only affect you if you directly lookup
 tensors by name in the graph produced by tf.Transform.
+* Various Analyzer Specs (_NumericCombineSpec, _UniquesSpec, _QuantilesSpec) are
+now private. Analyzers are accessible only via the top-level TFT functions (
+min, max, sum, size, mean, var, uniques, quantiles).
 
 # Release 0.3.1
 
@@ -38,7 +50,7 @@ in input_fn_maker there is now also a `*_serving_input_receiver_fn()`.
 vocabulary (as generated by `tft.uniques`) to several different columns.
 * Provide a source distribution tar `tensorflow-transform-X.Y.Z.tar.gz`.
 
-## Breaking changes
+## Breaking Changes
 * The default prefix for `tft.string_to_int` `vocab_filename` changed from
 `vocab_string_to_int` to `vocab_string_to_int_uniques`. To make your pipelines
 resilient to implementation details please set `vocab_filename` if you are using
@@ -57,12 +69,12 @@ the generated vocab_filename on a downstream component.
 use multi-threaded workers.
 * Performance optimizations in ExampleProtoCoder.
 * Depends on `apache-beam[gcp]>=2.1.1,<3`.
-* Depends on `protobuf>=3.3.0<4`.
+* Depends on `protobuf>=3.3<4`.
 * Depends on `six>=1.9,<1.11`.
 
-## Breaking changes
+## Breaking Changes
 * Requires pre-installed TensorFlow >= 1.3.
-* Removed `tft.map` use `tft.apply_function` instead (as needed).
+* Removed `tft.map` use `tft.apply_function` instead (as needed).
 * Removed `tft.tfidf_weights` use `tft.tfidf` instead.
 * `beam_metadata_io.WriteMetadata` now requires a second `pipeline` argument
 (see examples).
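
The `combine_analyzer()` entry in the release notes above accepts a user-provided combiner conforming to `beam.CombineFn()`. As a rough illustration (a hypothetical median combiner, not code from this commit), such a combiner implements the four standard Beam methods:

```python
import apache_beam as beam
import numpy as np


class MedianCombineFn(beam.CombineFn):
  """Illustrative combiner: buffers all inputs and emits their median.

  Buffering every value is fine for a sketch but not for very large datasets.
  """

  def create_accumulator(self):
    return []

  def add_input(self, accumulator, element):
    accumulator.append(element)
    return accumulator

  def merge_accumulators(self, accumulators):
    merged = []
    for accumulator in accumulators:
      merged.extend(accumulator)
    return merged

  def extract_output(self, accumulator):
    return np.median(accumulator) if accumulator else float('nan')
```

A combiner along these lines is what would be handed to `combine_analyzer()` to run as a full-pass analysis over the dataset, complementing the prepackaged min/max analyzers.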

examples/census_example.py

Lines changed: 5 additions & 16 deletions
@@ -30,14 +30,12 @@
 from apache_beam.io import tfrecordio
 from tensorflow.contrib import learn
 from tensorflow.contrib import lookup
-from tensorflow.contrib.layers import feature_column
 from tensorflow.contrib.learn.python.learn.utils import input_fn_utils
 
 from tensorflow_transform.beam import impl as beam_impl
 from tensorflow_transform.beam.tft_beam_io import transform_fn_io
 from tensorflow_transform.coders import csv_coder
 from tensorflow_transform.coders import example_proto_coder
-from tensorflow_transform.saved import input_fn_maker
 from tensorflow_transform.saved import saved_transform_io
 from tensorflow_transform.tf_metadata import dataset_metadata
 from tensorflow_transform.tf_metadata import dataset_schema
@@ -126,7 +124,7 @@ def preprocessing_fn(inputs):
 
 # For the label column we provide the mapping from string to index.
 def convert_label(label):
-table = lookup.string_to_index_table_from_tensor(['>50K', '<=50K'])
+table = lookup.index_table_from_tensor(['>50K', '<=50K'])
 return table.lookup(label)
 outputs[LABEL_KEY] = tft.apply_function(convert_label, inputs[LABEL_KEY])
 
@@ -233,11 +231,6 @@ def input_fn():
 os.path.join(working_dir, filebase + '*'),
 batch_size, transformed_feature_spec, tf.TFRecordReader)
 
-# Apply convert_scalars_to_vectors to avoid errors where feature columns
-# do not accept scalars but require length-1 vectors.
-transformed_features = input_fn_maker.convert_scalars_to_vectors(
-transformed_features)
-
 # Extract features and label from the transformed tensors.
 transformed_labels = transformed_features.pop(LABEL_KEY)
 
@@ -276,10 +269,6 @@ def serving_input_fn():
 os.path.join(working_dir, transform_fn_io.TRANSFORM_FN_DIR),
 raw_features))
 
-# Apply convert_scalars_to_vectors since this was done in training.
-transformed_features = input_fn_maker.convert_scalars_to_vectors(
-transformed_features)
-
 return input_fn_utils.InputFnOps(transformed_features, None, default_inputs)
 
 return serving_input_fn
@@ -300,15 +289,15 @@ def train_and_evaluate(working_dir, num_train_instances=NUM_TRAIN_INSTANCES,
 """
 
 # Wrap scalars as real valued columns.
-real_valued_columns = [feature_column.real_valued_column(key)
+real_valued_columns = [tf.feature_column.numeric_column(key, shape=())
 for key in NUMERIC_FEATURE_KEYS]
 
 # Wrap categorical columns. Note the combiner is irrelevant since the input
 # only has one value set per feature per instance.
 one_hot_columns = [
-feature_column.sparse_column_with_integerized_feature(
-key, bucket_size=bucket_size, combiner='sum')
-for key, bucket_size in zip(CATEGORICAL_FEATURE_KEYS, BUCKET_SIZES)]
+tf.feature_column.categorical_column_with_identity(
+key, num_buckets=num_buckets)
+for key, num_buckets in zip(CATEGORICAL_FEATURE_KEYS, BUCKET_SIZES)]
 
 estimator = learn.LinearClassifier(real_valued_columns + one_hot_columns)
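
The switch above to `tf.feature_column.categorical_column_with_identity` works because the example's `preprocessing_fn` has already turned the string-valued categorical features into small integer ids. A simplified paraphrase of that preprocessing (not the literal code in the file; the key lists below are placeholders standing in for the constants defined in census_example.py):

```python
import tensorflow_transform as tft

# Placeholders; census_example.py defines the full lists of feature keys.
NUMERIC_FEATURE_KEYS = ['age', 'hours-per-week']
CATEGORICAL_FEATURE_KEYS = ['education', 'occupation']


def preprocessing_fn(inputs):
  """Simplified sketch of the census preprocessing."""
  outputs = {}
  # Numeric features are rescaled to [0, 1]; the resulting scalars are what
  # tf.feature_column.numeric_column(key, shape=()) wraps at training time.
  for key in NUMERIC_FEATURE_KEYS:
    outputs[key] = tft.scale_to_0_1(inputs[key])
  # String-valued categorical features become dense integer ids, which is
  # what tf.feature_column.categorical_column_with_identity consumes.
  for key in CATEGORICAL_FEATURE_KEYS:
    outputs[key] = tft.string_to_int(inputs[key])
  return outputs
```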

examples/sentiment_example.py

Lines changed: 5 additions & 18 deletions
@@ -29,12 +29,10 @@
 from apache_beam.io import textio
 from apache_beam.io import tfrecordio
 from tensorflow.contrib import learn
-from tensorflow.contrib.layers import feature_column
 from tensorflow.contrib.learn.python.learn.utils import input_fn_utils
 from tensorflow_transform.beam import impl as beam_impl
 from tensorflow_transform.beam.tft_beam_io import transform_fn_io
 from tensorflow_transform.coders import example_proto_coder
-from tensorflow_transform.saved import input_fn_maker
 from tensorflow_transform.saved import saved_transform_io
 from tensorflow_transform.tf_metadata import dataset_metadata
 from tensorflow_transform.tf_metadata import dataset_schema
@@ -261,11 +259,6 @@ def input_fn():
 os.path.join(working_dir, filebase + '*'),
 batch_size, transformed_feature_spec, tf.TFRecordReader)
 
-# Apply convert_scalars_to_vectors to avoid errors where feature columns
-# do not accept scalars but require length-1 vectors.
-transformed_features = input_fn_maker.convert_scalars_to_vectors(
-transformed_features)
-
 # Extract features and label from the transformed tensors.
 transformed_labels = transformed_features.pop(LABEL_KEY)
 
@@ -304,10 +297,6 @@ def serving_input_fn():
 os.path.join(working_dir, transform_fn_io.TRANSFORM_FN_DIR),
 raw_features))
 
-# Apply convert_scalars_to_vectors since this was done in training.
-transformed_features = input_fn_maker.convert_scalars_to_vectors(
-transformed_features)
-
 return input_fn_utils.InputFnOps(transformed_features, None, default_inputs)
 
 return serving_input_fn
@@ -327,15 +316,13 @@ def train_and_evaluate(working_dir,
 The results from the estimator's 'evaluate' method
 """
 # Unrecognized tokens are represented by -1, but
-# sparse_column_with_integerized_feature uses the mod operator to map integers
+# categorical_column_with_identity uses the mod operator to map integers
 # to the range [0, bucket_size). By choosing bucket_size=VOCAB_SIZE + 1, we
 # represent unrecognized tokens as VOCAB_SIZE.
-review_column = feature_column.sparse_column_with_integerized_feature(
-REVIEW_KEY,
-bucket_size=VOCAB_SIZE + 1,
-combiner='sum')
-weighted_reviews = feature_column.weighted_sparse_column(review_column,
-REVIEW_WEIGHT_KEY)
+review_column = tf.feature_column.categorical_column_with_identity(
+REVIEW_KEY, num_buckets=VOCAB_SIZE + 1)
+weighted_reviews = tf.feature_column.weighted_categorical_column(
+review_column, REVIEW_WEIGHT_KEY)
 
 estimator = learn.LinearClassifier([weighted_reviews])
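
For readers unfamiliar with the core column, `weighted_categorical_column` pairs each token id from the identity column with a per-token weight read from a second feature (here the weight under `REVIEW_WEIGHT_KEY`) instead of an implicit weight of 1. A tiny self-contained sketch with placeholder feature keys and vocabulary size:

```python
import tensorflow as tf

VOCAB_SIZE = 20000  # placeholder; the example defines its own constant

# Integer token ids in [0, VOCAB_SIZE]; id VOCAB_SIZE is the bucket for
# unrecognized tokens, as described in the comment above.
review_column = tf.feature_column.categorical_column_with_identity(
    'review', num_buckets=VOCAB_SIZE + 1)

# Each token id is weighted by the value found under the parallel
# 'review_weight' feature (e.g. a tf-idf weight produced by tft.tfidf).
weighted_reviews = tf.feature_column.weighted_categorical_column(
    review_column, 'review_weight')
```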

examples/simple_example.py

Lines changed: 4 additions & 8 deletions
@@ -61,15 +61,11 @@ def preprocessing_fn(inputs):
 }))
 
 with beam_impl.Context(temp_dir=tempfile.mkdtemp()):
-transform_fn = (
-(raw_data, raw_data_metadata)
-| beam_impl.AnalyzeDataset(preprocessing_fn))
-transformed_dataset = (
-((raw_data, raw_data_metadata), transform_fn)
-| beam_impl.TransformDataset())
+transformed_dataset, transform_fn = ( # pylint: disable=unused-variable
+(raw_data, raw_data_metadata) | beam_impl.AnalyzeAndTransformDataset(
+preprocessing_fn))
 
-# pylint: disable=unused-variable
-transformed_data, transformed_metadata = transformed_dataset
+transformed_data, transformed_metadata = transformed_dataset # pylint: disable=unused-variable
 
 pprint.pprint(transformed_data)

getting_started.md

Lines changed: 3 additions & 6 deletions
@@ -119,12 +119,9 @@ raw_data = [
 ]
 
 raw_data_metadata = ...
-transform_fn = (
-(raw_data, raw_data_metadata)
-| beam_impl.AnalyzeDataset(preprocessing_fn))
-transformed_dataset = (
-((raw_data, raw_data_metadata), transform_fn)
-| beam_impl.TransformDataset())
+transformed_dataset, transform_fn = (
+(raw_data, raw_data_metadata) | beam_impl.AnalyzeAndTransformDataset(
+preprocessing_fn, tempfile.mkdtemp()))
 transformed_data, transformed_metadata = transformed_dataset
 ```
 
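
Filling in the pieces the snippet above elides, a complete minimal pipeline in the new `AnalyzeAndTransformDataset` style might look like this. This is a sketch in the spirit of examples/simple_example.py, not text from this commit; the feature name and preprocessing are illustrative, and the schema is built with `dataset_schema.ColumnSchema` in the style of this era's examples:

```python
import pprint
import tempfile

import tensorflow as tf
import tensorflow_transform as tft
from tensorflow_transform.beam import impl as beam_impl
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema


def preprocessing_fn(inputs):
  # A full-pass analyzer: min and max are computed over the whole dataset.
  return {'x_scaled': tft.scale_to_0_1(inputs['x'])}


raw_data = [{'x': 1.0}, {'x': 2.0}, {'x': 3.0}]
raw_data_metadata = dataset_metadata.DatasetMetadata(dataset_schema.Schema({
    'x': dataset_schema.ColumnSchema(
        tf.float32, [], dataset_schema.FixedColumnRepresentation()),
}))

with beam_impl.Context(temp_dir=tempfile.mkdtemp()):
  transformed_dataset, transform_fn = (
      (raw_data, raw_data_metadata)
      | beam_impl.AnalyzeAndTransformDataset(preprocessing_fn))

transformed_data, transformed_metadata = transformed_dataset
pprint.pprint(transformed_data)
```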

setup.py

Lines changed: 23 additions & 4 deletions
@@ -17,20 +17,20 @@
 from setuptools import setup
 
 # Tensorflow transform version.
-__version__ = '0.3.1'
+__version__ = '0.4.0dev'
 
 
 def _make_required_install_packages():
 return [
-# apache-beam[gcp] <= 2.1.0 has an issue with importing the six library.
-'apache-beam[gcp]>=2.1.1,<3',
+'apache-beam[gcp]>=2.2,<3',
 
-# Protobuf libraries <= 3.2 contain some map-related data corruption bugs
+# Protobuf libraries < 3.3 contain some map-related data corruption bugs
 # (b/35874111).
 'protobuf>=3.3,<4',
 
 # Six 1.11.0 incompatible with apitools.
 'six>=1.9,<1.11',
+
 ]
 
 
@@ -40,8 +40,27 @@ def _make_required_install_packages():
 author='Google Inc.',
 author_email='[email protected]',
 license='Apache 2.0',
+classifiers=[
+'Development Status :: 4 - Beta',
+'Intended Audience :: Developers',
+'Intended Audience :: Education',
+'Intended Audience :: Science/Research',
+'License :: OSI Approved :: Apache Software License',
+'Operating System :: OS Independent',
+'Programming Language :: Python',
+'Programming Language :: Python :: 2',
+'Programming Language :: Python :: 2.7',
+'Programming Language :: Python :: 2 :: Only',
+'Topic :: Scientific/Engineering',
+'Topic :: Scientific/Engineering :: Artificial Intelligence',
+'Topic :: Scientific/Engineering :: Mathematics',
+'Topic :: Software Development',
+'Topic :: Software Development :: Libraries',
+'Topic :: Software Development :: Libraries :: Python Modules',
+],
 namespace_packages=[],
 install_requires=_make_required_install_packages(),
+python_requires='>=2.7,<3',
 packages=find_packages(),
 include_package_data=True,
 description='A library for data preprocessing with TensorFlow',

tensorflow_transform/analyzers.py

Lines changed: 44 additions & 0 deletions
@@ -425,6 +425,50 @@ def bucket_dtype(self):
 return tf.float32
 
 
+def quantiles(x, num_buckets, epsilon, name=None):
+"""Computes the quantile boundaries of a `Tensor` over the whole dataset.
+
+Quantile boundaries are computed using approximate quantiles,
+and error tolerance is specified using `epsilon`. The boundaries divide the
+input tensor into approximately equal `num_buckets` parts.
+See go/squawd for details, and how to control the error due to approximation.
+
+Args:
+x: An input `Tensor` or `SparseTensor`.
+num_buckets: Values in the `x` are divided into approximately equal-sized
+buckets, where the number of buckets is num_buckets.
+epsilon: Error tolerance, typically a small fraction close to zero
+(e.g. 0.01). Higher values of epsilon increase the quantile approximation
+error, and hence result in more unequal buckets, but could improve
+performance and resource consumption. Some measured results on memory
+consumption: for epsilon = 0.001, the amount of memory for each buffer to
+hold the summary for 1 trillion input values is ~25000 bytes. If epsilon is
+relaxed to 0.01, the buffer size drops to ~2000 bytes for the same input
+size. If we use a strict epsilon value of 0, the buffer size is the same
+size as the input, because the intermediate stages have to remember every
+input and the quantile boundaries can be found only after an equivalent of
+a full sort of the input. The buffer size also determines the amount of
+work in the different stages of the beam pipeline; in general, larger
+epsilon results in fewer and smaller stages, and less time. For more
+performance trade-offs see also http://web.cs.ucla.edu/~weiwang/paper/SSDBM07_2.pdf
+name: (Optional) A name for this operation.
+
+Returns:
+The bucket boundaries represented as a list, with num_buckets - 1 elements.
+See bucket_dtype() above for the type of the bucket boundaries.
+"""
+
+with tf.name_scope(name, 'quantiles'):
+spec = _QuantilesSpec(epsilon, num_buckets)
+quantile_boundaries = Analyzer(
+[x], [(spec.bucket_dtype, [1, None], False)], spec,
+'quantiles').outputs[0]
+
+# The quantile boundaries are of the form
+# [np.array(first, <num_buckets - 1 boundaries>, last)].
+# Drop the first and last quantile boundaries, so that we end up with
+# num_buckets - 1 boundaries, and hence num_buckets buckets.
+return quantile_boundaries[0:1, 1:-1]
 
 
 class _CombinerSpec(object):
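
A minimal usage sketch for the new `tft.quantiles` analyzer (not code from this commit; the feature name, bucket count, and epsilon are arbitrary):

```python
import tensorflow_transform as tft


def preprocessing_fn(inputs):
  # Approximate decile boundaries of 'income' over the whole dataset, with a
  # 1% error tolerance; larger epsilon means smaller summaries and cheaper
  # pipeline stages, as described in the docstring above.
  income_boundaries = tft.quantiles(inputs['income'], num_buckets=10,
                                    epsilon=0.01)
  # The result is a [1, num_buckets - 1] tensor of boundaries that can then
  # be used to bucketize the feature with ordinary TensorFlow ops.
  return {'income_bucket_boundaries': income_boundaries}
```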
