Commit a87a5a3

Merge pull request #19 from tensorflow/tft-0.1.10

Project import generated by Copybara. PiperOrigin-RevId: 157835649

2 parents: 3206f45 + 3703673

13 files changed: +1258 −322 lines

RELEASE.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+# Release 0.1.10
+
+## Major Features and Improvements
+* Add json-example serving input functions to TF.Transform.
+* Add variance analyzer to tf.transform.
+
+## Bug Fixes and Other Changes
+* Remove duplication in output of `tft.tfidf`.
+* Ensure ngrams output dense_shape is greater than or equal to 0.
+* Alters the behavior and interface of tensorflow_transform.mappers.ngrams.
+* Use `apache-beam[gcp] >=2,<3`
+* Making TF Parallelism runner-dependent.
+* Fixes issue with csv serving input function.
+
+## Deprecations
+* `tft.map` will be removed on version 0.2.0, see the `examples` directory for
+  instructions on how to use `tft.apply_function` instead (as needed).
+* `tft.tfidf_weights` will be removed on version 0.2.0, use `tft.tfidf` instead.
+
+# Release 0.1.9
+
+## Major Features and Improvements
+* Refactor internals to remove Column and Statistic classes
+
+## Bug Fixes and Other Changes
+* Remove collections from graph to avoid warnings
+* Return float32 from tfidf_weights
+* Update tensorflow_transform to use tf.saved_model APIs.
+* Add default values on example proto coder.

examples/census_example.py

Lines changed: 2 additions & 1 deletion
@@ -116,7 +116,8 @@ def preprocessing_fn(inputs):
   def convert_label(label):
     table = lookup.string_to_index_table_from_tensor(['>50K', '<=50K'])
     return table.lookup(label)
-  outputs[LABEL_COLUMN] = tft.map(convert_label, inputs[LABEL_COLUMN])
+  outputs[LABEL_COLUMN] = tft.apply_function(convert_label,
+                                             inputs[LABEL_COLUMN])

   return outputs
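The diff above swaps `tft.map` for `tft.apply_function` around `convert_label`, which is not a pure function (it initializes a lookup table). Independent of TensorFlow, the label mapping it implements can be sketched in plain Python; the helper name `convert_label_py` is hypothetical, for illustration only:

```python
# Plain-Python sketch (not tf.Transform) of the label mapping that
# convert_label implements via a lookup table.
LABEL_VOCAB = ['>50K', '<=50K']  # index 0 -> '>50K', index 1 -> '<=50K'

def convert_label_py(labels):
    """Map a batch of label strings to their fixed integer indices."""
    table = {term: index for index, term in enumerate(LABEL_VOCAB)}
    return [table[label] for label in labels]

print(convert_label_py(['<=50K', '>50K']))  # -> [1, 0]
```

Fixing the vocabulary by hand (rather than learning it with an analyzer) guarantees which index each label receives in the trained model.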

examples/sentiment_example.py

Lines changed: 4 additions & 4 deletions
@@ -140,13 +140,13 @@ def preprocessing_fn(inputs):
   """Preprocess input columns into transformed columns."""
   review = inputs[REVIEW_COLUMN]

-  review_tokens = tft.map(lambda x: tf.string_split(x, DELIMITERS),
-                          review)
+  review_tokens = tf.string_split(review, DELIMITERS)
   review_indices = tft.string_to_int(review_tokens, top_k=VOCAB_SIZE)
   # Add one for the oov bucket created by string_to_int.
-  review_weight = tft.tfidf_weights(review_indices, VOCAB_SIZE + 1)
+  review_bow_indices, review_weight = tft.tfidf(review_indices,
+                                                VOCAB_SIZE + 1)
   return {
-      REVIEW_COLUMN: review_indices,
+      REVIEW_COLUMN: review_bow_indices,
       REVIEW_WEIGHT: review_weight,
       LABEL_COLUMN: inputs[LABEL_COLUMN]
   }
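The new `tft.tfidf` mapper returns a pair of deduplicated bag-of-words indices and their TF-IDF weights, where `tft.tfidf_weights` returned only weights. The shape of that output can be sketched in plain Python using one common TF-IDF formulation (term frequency times log inverse document frequency); the exact smoothing tf.Transform applies may differ, and the function name `tfidf_sketch` is hypothetical:

```python
import math

def tfidf_sketch(docs_as_indices, vocab_size):
    """Return (deduplicated term indices, tf-idf weights) per document."""
    num_docs = len(docs_as_indices)
    # Document frequency of each term index over the whole corpus.
    df = [0] * vocab_size
    for doc in docs_as_indices:
        for term in set(doc):
            df[term] += 1
    results = []
    for doc in docs_as_indices:
        terms = sorted(set(doc))  # deduplicated, like the bag-of-words output
        weights = [
            (doc.count(t) / len(doc)) * math.log(num_docs / df[t])
            for t in terms
        ]
        results.append((terms, weights))
    return results

for indices, weights in tfidf_sketch([[0, 0, 1], [1, 2]], vocab_size=3):
    print(indices, [round(w, 3) for w in weights])
```

Deduplicating the indices is what the release note "Remove duplication in output of `tft.tfidf`" refers to: each term appears once per document, carrying its weight.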

examples/simple_example.py

Lines changed: 2 additions & 3 deletions
@@ -32,11 +32,10 @@ def preprocessing_fn(inputs):
   x = inputs['x']
   y = inputs['y']
   s = inputs['s']
-  x_centered = tft.map(lambda x, mean: x - mean, x, tft.mean(x))
+  x_centered = x - tft.mean(x)
   y_normalized = tft.scale_to_0_1(y)
   s_integerized = tft.string_to_int(s)
-  x_centered_times_y_normalized = tft.map(lambda x, y: x * y,
-                                          x_centered, y_normalized)
+  x_centered_times_y_normalized = (x_centered * y_normalized)
   return {
       'x_centered': x_centered,
       'y_normalized': y_normalized,
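What the updated `preprocessing_fn` computes can be sketched without TensorFlow: analyzers (mean, min/max, vocabulary) run over the full dataset, then ordinary arithmetic transforms each value. This is a plain-Python sketch, not the tf.Transform implementation; in particular, `tft.string_to_int` assigns indices by frequency, while this sketch uses alphabetical order for determinism:

```python
def preprocess(xs, ys, ss):
    """Plain-Python analog of simple_example's preprocessing_fn."""
    x_mean = sum(xs) / len(xs)                  # analog of tft.mean
    y_min, y_max = min(ys), max(ys)             # analyzers behind tft.scale_to_0_1
    vocab = {s: i for i, s in enumerate(sorted(set(ss)))}  # analog of string_to_int
    return {
        'x_centered': [x - x_mean for x in xs],
        'y_normalized': [(y - y_min) / (y_max - y_min) for y in ys],
        's_integerized': [vocab[s] for s in ss],
        'x_centered_times_y_normalized': [
            (x - x_mean) * ((y - y_min) / (y_max - y_min))
            for x, y in zip(xs, ys)
        ],
    }

out = preprocess([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], ['b', 'a', 'b'])
print(out['x_centered'])    # [-1.0, 0.0, 1.0]
print(out['y_normalized'])  # [0.0, 0.5, 1.0]
```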

getting_started.md

Lines changed: 55 additions & 63 deletions
@@ -11,35 +11,27 @@ aspects of the usage of tf.Transform.
 ## Defining a Preprocessing Function

 The most important concept of tf.Transform is the "preprocessing function". This
-is a logical description of a transformation of a dataset. The dataset is
-conceptualized as a dictionary of columns, and the preprocessing function is
-defined by two basic mechanisms:
-
-1) Applying `tft.map`, which takes a user-defined function that accepts and
-returns tensors. Such a function can use any TensorFlow operation to construct
-the output tensors from the inputs. The remaining arguments of `tft.map` are the
-columns that the function should be applied to. The number of columns provided
-should equal the number of arguments to the user-defined function. Like the
-Python `map` function, `tft.map` applies the user-provided function to the
-elements in the columns specified. Each row is treated independently, and the
-output is a column containing the results (but see the note on batching at the
-end of this section).
-
-2) Applying any of the tf.Transform provided "analyzers". Analyzers are
-functions that accept one or more `Column`s and return some summary statistic
-for the input column or columns. A statistic is like a column except that it
-only has a single value. An example of an analyzer is `tft.min` which computes
-the minimum of a column. Currently tf.Transform provides a fixed set of
-analyzers, but this will be extensible in future versions.
-
-In fact, `tft.map` can also accept statistics, which is how statistics are
-incorporated into the user-defined pipeline. By combining analyzers and
-`tft.map`, users can flexibly create pipelines for transforming their data. In
-particular, users should define a "preprocessing function" which accepts and
-returns columns.
-
-The following preprocessing function transforms each of three columns in
-different ways, and combines two of the columns.
+is a logical description of a transformation of a dataset. The preprocessing
+function accepts and returns a dictionary of tensors (in this guide, "tensors"
+generally means `Tensor`s or `SparseTensor`s). There are two kinds of functions
+that can be used to define the preprocessing function:
+
+1) Any function that accepts and returns tensors. These add TensorFlow
+operations to the graph that transform raw data into transformed data.
+
+2) Any of the tf.Transform provided "analyzers". Analyzers also accept and
+return tensors, but unlike typical TensorFlow functions they don't add
+operations to the graph. Instead, they cause tf.Transform to compute a
+full-pass operation outside of TensorFlow, using the input tensor values over
+the full dataset to generate a constant tensor that gets returned as the
+output. For example, `tft.min` computes the minimum of a tensor over the whole
+dataset. Currently tf.Transform provides a fixed set of analyzers, but this
+will be extensible in future versions.
+
+By combining analyzers and regular TensorFlow functions, users can flexibly
+create pipelines for transforming their data. The following preprocessing
+function transforms each of three features in different ways, and combines two
+of the features.

 ```
 import tensorflow as tf
@@ -49,11 +41,10 @@ def preprocessing_fn(inputs):
   x = inputs['x']
   y = inputs['y']
   s = inputs['s']
-  x_centered = tft.map(lambda x, mean: x - mean, x, tft.mean(x))
+  x_centered = x - tft.mean(x)
   y_normalized = tft.scale_to_0_1(y)
   s_integerized = tft.string_to_int(s)
-  x_centered_times_y_normalized = tft.map(lambda x, y: x * y,
-                                          x_centered, y_normalized)
+  x_centered_times_y_normalized = x_centered * y_normalized
   return {
       'x_centered': x_centered,
       'y_normalized': y_normalized,
@@ -62,32 +53,29 @@ def preprocessing_fn(inputs):
   }
 ```

-`x`, `y` and `s` are local variables that represent input columns, that are
-declared for code brevity. The first new column to be constructed, `x_centered`,
-is constructed by composing `tft.map` and `tft.mean`. `tft.mean(x)` returns a
-statistic representing the mean of the column `x`. The lambda passed to
-`tft.map` is simply subtraction, where the first argument is the column `x` and
-the second is the statistic `tft.mean(x)`. Thus `x_centered` is the column `x`
+`x`, `y` and `s` are `Tensor`s that represent input features. The first new
+tensor to be constructed, `x_centered`, is constructed by applying `tft.mean`
+to `x` and subtracting this from `x`. `tft.mean(x)` returns a tensor
+representing the mean of the tensor `x`. Thus `x_centered` is the tensor `x`
 with the mean subtracted.

-The second new column is `y_normalized`, created in a similar manner but using
+The second new tensor is `y_normalized`, created in a similar manner but using
 the convenience method `tft.scale_to_0_1`. This method does something similar
 under the hood to what is done to compute `x_centered`, namely computing a max
 and min and using these to scale `y`.

-The column `s_integerized` shows an example of string manipulation. In this
+The tensor `s_integerized` shows an example of string manipulation. In this
 simple case we take a string and map it to an integer. This too uses a
-convenience function, where the analyzer that is applied computes the unique
-values taken by the column, and the map uses these values as a dictionary to
-convert to an integer.
+convenience function, `tft.string_to_int`. This function uses an analyzer to
+compute the unique values taken by the input strings, and then uses TensorFlow
+ops to convert the input strings to indices in the table of unique values.

-The final column shows that it is possible to use `tft.map` not only to
-manipulate a single column but also to combine columns.
+The final tensor shows that it is possible to use TensorFlow operations to
+create new features by combining tensors.

-Note that `Column`s are not themselves wrappers around data. Rather they are
-placeholders used to construct a definition of the user's logical pipeline. In
-order to apply such a pipeline to data, we rely on a concrete implementation of
-the tf.Transform API. The Apache Beam implementation provides `PTransform`s that
+The preprocessing function defines a pipeline of operations on a dataset. In
+order to apply such a pipeline, we rely on a concrete implementation of the
+tf.Transform API. The Apache Beam implementation provides `PTransform`s that
 apply a user's preprocessing function to data. The typical workflow of a
 tf.Transform user will be to construct a preprocessing function, and then
 incorporate this into a larger Beam pipeline, ultimately materializing the data
@@ -100,13 +88,14 @@ tf.Transform is to provide the TensorFlow graph for preprocessing that can be
 incorporated into the serving graph (and optionally the training graph),
 batching is also an important concept in tf.Transform.

-While it is not obvious from the example above, the user defined function passed
-to `tft.map` will be passed tensors representing *batches*, not individual
-instances, just as will happen during training and serving with TensorFlow. This
-is only the case for inputs that are `Column`s, not `Statistic`s. Thus the
-actual tensors used in the `tft.map` for `x_centered` are 1) a rank 1 tensor,
-representing a batch of values from the column `x`, whose first dimension is the
-batch dimension; and 2) a rank 0 tensor representing the mean of that column.
+While it is not obvious from the example above, the user defined preprocessing
+function will be passed tensors representing *batches*, not individual
+instances, just as will happen during training and serving with TensorFlow. On
+the other hand, analyzers perform a computation over the whole dataset and
+return a single value, not a batch of values. Thus `x` is a `Tensor` of shape
+`(batch_size,)` while `tft.mean(x)` is a `Tensor` of shape `()`. The
+subtraction `x - tft.mean(x)` involves broadcasting where the value of
+`tft.mean(x)` is subtracted from every element of the batch represented by `x`.

 ## The Canonical Beam Implementation
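The batching note in the hunk above can be sketched in plain Python (no TensorFlow): the analyzer result is a single scalar computed over the full dataset, which is then subtracted from every element of each batch, just as broadcasting subtracts a shape-`()` tensor from a shape-`(batch_size,)` tensor.

```python
# "x" arrives in batches; "tft.mean(x)" is one scalar over the whole dataset.
full_dataset = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
dataset_mean = sum(full_dataset) / len(full_dataset)   # shape () analog: 3.5

batches = [full_dataset[0:3], full_dataset[3:6]]       # shape (batch_size,) analog
x_centered = [[x - dataset_mean for x in batch] for batch in batches]
print(x_centered)  # [[-2.5, -1.5, -0.5], [0.5, 1.5, 2.5]]
```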

@@ -322,17 +311,20 @@ def preprocessing_fn(inputs):
   def convert_label(label):
     table = lookup.string_to_index_table_from_tensor(['>50K', '<=50K'])
     return table.lookup(label)
-  outputs[LABEL_COLUMN] = tft.map(convert_label, inputs[LABEL_COLUMN])
+  outputs[LABEL_COLUMN] = tft.apply_function(
+      convert_label, inputs[LABEL_COLUMN])

   return outputs
 ```

-One difference from the previous example is that we convert the outputs from
-scalars to single element vectors. This allows the data to be correctly read
-during training. Also for the label column, we manually specify the mapping from
-string to index so that ">50K" gets mapped to 0 and "<=50K" gets mapped to 1.
-This is useful so that we know which index in the trained model corresponds to
-which label.
+One difference from the previous example is that for the label column, we
+manually specify the mapping from string to index so that ">50K" gets mapped to
+0 and "<=50K" gets mapped to 1. This is useful so that we know which index in
+the trained model corresponds to which label. We cannot apply the function
+`convert_label` directly to its arguments because tf.Transform needs to know
+about the `Table` defined in `convert_label`. That is, `convert_label` is not
+a pure function but involves table initialization. For such functions, we use
+`tft.apply_function` to wrap the function application.

 The `raw_data` variable represents a `PCollection` containing data in the same
 format as the list `raw_data` from the previous example, and the use of the

setup.py

Lines changed: 2 additions & 4 deletions
@@ -17,14 +17,12 @@
 from setuptools import setup

 # Tensorflow transform version.
-__version__ = '0.1.9'
+__version__ = '0.1.10'


 def _make_required_install_packages():
   return [
-      # Using >= for better integration tests. During release this is
-      # automatically changed to a ==.
-      'apache-beam[gcp] == 0.6.0',
+      'apache-beam[gcp]>=2,<3',
   ]

tensorflow_transform/analyzers.py

Lines changed: 39 additions & 9 deletions
@@ -45,20 +45,20 @@ class Analyzer(object):

   Args:
     inputs: The inputs to the analyzer.
-    output_shapes_and_dtype: List of pairs of (shape, dtype) for each output.
+    output_dtypes_and_shapes: List of pairs of (dtype, shape) for each output.
     spec: A description of the computation to be done.

   Raises:
     ValueError: If the inputs are not all `Tensor`s.
   """

-  def __init__(self, inputs, output_shapes_and_dtypes, spec):
+  def __init__(self, inputs, output_dtypes_and_shapes, spec):
     for tensor in inputs:
       if not isinstance(tensor, tf.Tensor):
         raise ValueError('Analyzers can only accept `Tensor`s as inputs')
     self._inputs = inputs
-    self._outputs = [tf.placeholder(shape, dtype)
-                     for shape, dtype in output_shapes_and_dtypes]
+    self._outputs = [tf.placeholder(dtype, shape)
+                     for dtype, shape in output_dtypes_and_shapes]
     self._spec = spec
     tf.add_to_collection(ANALYZER_COLLECTION, self)

@@ -131,7 +131,7 @@ def min(x, reduce_instance_dims=True):  # pylint: disable=redefined-builtin
   dimension and outputs a `Tensor` of the same shape as the input.

   Returns:
-    A `Tensor`.
+    A `Tensor`. Has the same type as `x`.
   """
   return _numeric_combine(x, NumericCombineSpec.MIN, reduce_instance_dims)

@@ -146,7 +146,7 @@ def max(x, reduce_instance_dims=True):  # pylint: disable=redefined-builtin
   dimension and outputs a vector of the same shape as the output.

   Returns:
-    A `Tensor`.
+    A `Tensor`. Has the same type as `x`.
   """
   return _numeric_combine(x, NumericCombineSpec.MAX, reduce_instance_dims)

@@ -161,7 +161,7 @@ def sum(x, reduce_instance_dims=True):  # pylint: disable=redefined-builtin
   dimension and outputs a vector of the same shape as the output.

   Returns:
-    A `Tensor`.
+    A `Tensor`. Has the same type as `x`.
   """
   return _numeric_combine(x, NumericCombineSpec.SUM, reduce_instance_dims)

@@ -176,7 +176,7 @@ def size(x, reduce_instance_dims=True):
   dimension and outputs a vector of the same shape as the output.

   Returns:
-    A `Tensor`.
+    A `Tensor`. Has the same type as `x`.
   """
   with tf.name_scope('size'):
     # Note: Calling `sum` defined in this module, not the builtin.

@@ -193,14 +193,44 @@ def mean(x, reduce_instance_dims=True):
   dimension and outputs a vector of the same shape as the output.

   Returns:
-    A `Tensor` containing the mean.
+    A `Tensor` containing the mean. If `x` is floating point, the mean will
+    have the same type as `x`. If `x` is integral, the output is cast to
+    float32 for int8 and int16 and float64 for int32 and int64 (similar to the
+    behavior of tf.truediv).
   """
   with tf.name_scope('mean'):
     # Note: Calling `sum` defined in this module, not the builtin.
     return tf.divide(
         sum(x, reduce_instance_dims), size(x, reduce_instance_dims))


+def var(x, reduce_instance_dims=True):
+  """Computes the variance of the values of a `Tensor` over the whole dataset.
+
+  Uses the biased variance (0 delta degrees of freedom), as given by
+  sum((x - mean(x))**2) / length(x).
+
+  Args:
+    x: A `Tensor`.
+    reduce_instance_dims: By default collapses the batch and instance dimensions
+      to arrive at a single scalar output. If False, only collapses the batch
+      dimension and outputs a vector of the same shape as the output.
+
+  Returns:
+    A `Tensor` containing the variance. If `x` is floating point, the variance
+    will have the same type as `x`. If `x` is integral, the output is cast to
+    float32 for int8 and int16 and float64 for int32 and int64 (similar to the
+    behavior of tf.truediv).
+  """
+  with tf.name_scope('var'):
+    # Note: Calling `mean`, `sum`, and `size` as defined in this module, not
+    # the builtins.
+    x_mean = mean(x, reduce_instance_dims)
+    # x_mean will be float32 or float64, depending on type of x.
+    squared_deviations = tf.square(tf.cast(x, x_mean.dtype) - x_mean)
+    return mean(squared_deviations, reduce_instance_dims)
+
+
 class UniquesSpec(object):
   """Operation to compute unique values."""
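The new `var` analyzer computes the mean of squared deviations from the mean, i.e. the biased (population) variance. A plain-Python check of that formula, mirroring the structure of `var` (compute the mean, square the deviations, take the mean again):

```python
import statistics

def biased_var(values):
    """Population variance: mean of squared deviations from the mean."""
    m = sum(values) / len(values)                  # analog of mean(x)
    squared_deviations = [(v - m) ** 2 for v in values]
    return sum(squared_deviations) / len(values)   # mean of squared deviations

data = [1.0, 2.0, 3.0, 4.0]
print(biased_var(data))            # 1.25
print(statistics.pvariance(data))  # 1.25 (same: 0 delta degrees of freedom)
```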
