@@ -11,35 +11,27 @@ aspects of the usage of tf.Transform.
## Defining a Preprocessing Function

The most important concept of tf.Transform is the "preprocessing function". This
- is a logical description of a transformation of a dataset. The dataset is
- conceptualized as a dictionary of columns, and the preprocessing function is
- defined by two basic mechanisms:
-
- 1) Applying `tft.map`, which takes a user-defined function that accepts and
- returns tensors. Such a function can use any TensorFlow operation to construct
- the output tensors from the inputs. The remaining arguments of `tft.map` are the
- columns that the function should be applied to. The number of columns provided
- should equal the number of arguments to the user-defined function. Like the
- Python `map` function, `tft.map` applies the user-provided function to the
- elements in the columns specified. Each row is treated independently, and the
- output is a column containing the results (but see the note on batching at the
- end of this section).
-
- 2) Applying any of the tf.Transform provided "analyzers". Analyzers are
- functions that accept one or more `Column`s and return some summary statistic
- for the input column or columns. A statistic is like a column except that it
- only has a single value. An example of an analyzer is `tft.min`, which computes
- the minimum of a column. Currently tf.Transform provides a fixed set of
- analyzers, but this will be extensible in future versions.
-
- In fact, `tft.map` can also accept statistics, which is how statistics are
- incorporated into the user-defined pipeline. By combining analyzers and
- `tft.map`, users can flexibly create pipelines for transforming their data. In
- particular, users should define a "preprocessing function" which accepts and
- returns columns.
-
- The following preprocessing function transforms each of three columns in
- different ways, and combines two of the columns.
+ is a logical description of a transformation of a dataset. The preprocessing
+ function accepts and returns a dictionary of tensors (in this guide, "tensors"
+ generally means `Tensor`s or `SparseTensor`s). There are two kinds of functions
+ that can be used to define the preprocessing function:
+
+ 1) Any function that accepts and returns tensors. These add TensorFlow
+ operations to the graph that transform raw data into transformed data.
+
+ 2) Any of the tf.Transform provided "analyzers". Analyzers also accept and
+ return tensors, but unlike typical TensorFlow functions they don't add
+ operations to the graph. Instead, they cause tf.Transform to compute a
+ full-pass operation outside of TensorFlow, using the input tensor values over
+ the full dataset to generate a constant tensor that is returned as the output.
+ For example, `tft.min` computes the minimum of a tensor over the whole dataset.
+ Currently tf.Transform provides a fixed set of analyzers, but this will be
+ extensible in future versions.
+
+ By combining analyzers and regular TensorFlow functions, users can flexibly
+ create pipelines for transforming their data. The following preprocessing
+ function transforms each of three features in different ways, and combines two
+ of the features.

```
import tensorflow as tf
@@ -49,11 +41,10 @@ def preprocessing_fn(inputs):
  x = inputs['x']
  y = inputs['y']
  s = inputs['s']
- x_centered = tft.map(lambda x, mean: x - mean, x, tft.mean(x))
+ x_centered = x - tft.mean(x)
  y_normalized = tft.scale_to_0_1(y)
  s_integerized = tft.string_to_int(s)
- x_centered_times_y_normalized = tft.map(lambda x, y: x * y,
-                                         x_centered, y_normalized)
+ x_centered_times_y_normalized = x_centered * y_normalized
  return {
      'x_centered': x_centered,
      'y_normalized': y_normalized,
@@ -62,32 +53,29 @@ def preprocessing_fn(inputs):
  }
```

- `x`, `y` and `s` are local variables that represent input columns, that are
- declared for code brevity. The first new column to be constructed, `x_centered`,
- is constructed by composing `tft.map` and `tft.mean`. `tft.mean(x)` returns a
- statistic representing the mean of the column `x`. The lambda passed to
- `tft.map` is simply subtraction, where the first argument is the column `x` and
- the second is the statistic `tft.mean(x)`. Thus `x_centered` is the column `x`
+ `x`, `y` and `s` are `Tensor`s that represent input features. The first new
+ tensor, `x_centered`, is constructed by applying `tft.mean` to `x` and
+ subtracting the result from `x`. `tft.mean(x)` returns a tensor representing
+ the mean of the tensor `x`. Thus `x_centered` is the tensor `x`
with the mean subtracted.

- The second new column is `y_normalized`, created in a similar manner but using
+ The second new tensor is `y_normalized`, created in a similar manner but using
the convenience method `tft.scale_to_0_1`. This method does something similar
under the hood to what is done to compute `x_centered`, namely computing a max
and min and using these to scale `y`.
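As a rough illustration of that min/max scaling, here is a hypothetical pure-Python sketch (not the actual `tft.scale_to_0_1` implementation; in tf.Transform the min and max come from analyzer passes over the full dataset):

```python
# Hypothetical sketch of scaling values to the [0, 1] range.
# In tf.Transform the min and max would be computed by analyzers over
# the full dataset; here we compute them over a plain list of floats.
def scale_to_0_1_sketch(values):
    lo, hi = min(values), max(values)  # stand-in for the analyzer step
    return [(v - lo) / (hi - lo) for v in values]

print(scale_to_0_1_sketch([1.0, 2.0, 3.0]))  # [0.0, 0.5, 1.0]
```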

- The column `s_integerized` shows an example of string manipulation. In this
+ The tensor `s_integerized` shows an example of string manipulation. In this
simple case we take a string and map it to an integer. This too uses a
- convenience function, where the analyzer that is applied computes the unique
- values taken by the column, and the map uses these values as a dictionary to
- convert to an integer.
+ convenience function, `tft.string_to_int`. This function uses an analyzer to
+ compute the unique values taken by the input strings, and then uses TensorFlow
+ ops to convert the input strings to indices in the table of unique values.
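Conceptually there are two phases: a full-pass vocabulary computation, then a per-instance lookup. A hypothetical pure-Python sketch (the real `tft.string_to_int` builds its table with TensorFlow ops and may order the vocabulary differently):

```python
# Hypothetical sketch of tft.string_to_int's two phases.
def string_to_int_sketch(column):
    vocab = sorted(set(column))                  # "analyzer": full pass over the data
    table = {s: i for i, s in enumerate(vocab)}  # vocabulary table
    return [table[s] for s in column]            # per-instance lookup

print(string_to_int_sketch(['b', 'a', 'b', 'c']))  # [1, 0, 1, 2]
```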

- The final column shows that it is possible to use `tft.map` not only to
- manipulate a single column but also to combine columns.
+ The final feature shows that it is possible to use TensorFlow operations to
+ create new features by combining tensors.

- Note that `Column`s are not themselves wrappers around data. Rather they are
- placeholders used to construct a definition of the user's logical pipeline. In
- order to apply such a pipeline to data, we rely on a concrete implementation of
- the tf.Transform API. The Apache Beam implementation provides `PTransform`s that
+ The preprocessing function defines a pipeline of operations on a dataset. In
+ order to apply such a pipeline, we rely on a concrete implementation of the
+ tf.Transform API. The Apache Beam implementation provides `PTransform`s that
apply a user's preprocessing function to data. The typical workflow of a
tf.Transform user will be to construct a preprocessing function, and then
incorporate this into a larger Beam pipeline, ultimately materializing the data
@@ -100,13 +88,14 @@ tf.Transform is to provide the TensorFlow graph for preprocessing that can be
incorporated into the serving graph (and optionally the training graph),
batching is also an important concept in tf.Transform.

- While it is not obvious from the example above, the user defined function passed
- to `tft.map` will be passed tensors representing *batches*, not individual
- instances, just as will happen during training and serving with TensorFlow. This
- is only the case for inputs that are `Column`s, not `Statistic`s. Thus the
- actual tensors used in the `tft.map` for `x_centered` are 1) a rank 1 tensor,
- representing a batch of values from the column `x`, whose first dimension is the
- batch dimension; and 2) a rank 0 tensor representing the mean of that column.
+ While it is not obvious from the example above, the user-defined preprocessing
+ function will be passed tensors representing *batches*, not individual
+ instances, just as will happen during training and serving with TensorFlow. On
+ the other hand, analyzers perform a computation over the whole dataset and
+ return a single value, not a batch of values. Thus `x` is a `Tensor` of shape
+ `(batch_size,)` while `tft.mean(x)` is a `Tensor` of shape `()`. The
+ subtraction `x - tft.mean(x)` involves broadcasting, where the value of
+ `tft.mean(x)` is subtracted from every element of the batch represented by `x`.
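This broadcasting is the standard elementwise broadcasting of TensorFlow (and NumPy). A minimal NumPy sketch of the shapes involved, with `np.mean` standing in for the `tft.mean` analyzer:

```python
import numpy as np

# A batch of values for feature x: shape (batch_size,).
x = np.array([1.0, 2.0, 3.0, 4.0])

# An analyzer result is a single scalar of shape (), computed over the
# full dataset; np.mean stands in for tft.mean here.
mean = np.mean(x)

# Broadcasting subtracts the scalar from every element of the batch.
x_centered = x - mean
print(x_centered)  # [-1.5 -0.5  0.5  1.5]
```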
## The Canonical Beam Implementation
@@ -322,17 +311,20 @@ def preprocessing_fn(inputs):
  def convert_label(label):
    table = lookup.string_to_index_table_from_tensor(['>50K', '<=50K'])
    return table.lookup(label)
- outputs[LABEL_COLUMN] = tft.map(convert_label, inputs[LABEL_COLUMN])
+ outputs[LABEL_COLUMN] = tft.apply_function(
+     convert_label, inputs[LABEL_COLUMN])

  return outputs
```

- One difference from the previous example is that we convert the outputs from
- scalars to single element vectors. This allows the data to be correctly read
- during training. Also for the label column, we manually specify the mapping from
- string to index so that ">50K" gets mapped to 0 and "<=50K" gets mapped to 1.
- This is useful so that we know which index in the trained model corresponds to
- which label.
+ One difference from the previous example is that for the label column, we
+ manually specify the mapping from string to index so that ">50K" gets mapped to
+ 0 and "<=50K" gets mapped to 1. This is useful so that we know which index in
+ the trained model corresponds to which label. We cannot apply the function
+ `convert_label` directly to its arguments because tf.Transform needs to know
+ about the `Table` defined in `convert_label`. That is, `convert_label` is not
+ a pure function but involves table initialization. For such functions, we use
+ `tft.apply_function` to wrap the function application.
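The resulting mapping is fixed by the order of the vocabulary list passed to the table constructor. A hypothetical pure-Python sketch of what `convert_label`'s lookup table encodes (the real code performs this lookup with a TensorFlow table op):

```python
# Hypothetical sketch: each label string maps to its index in the
# vocabulary list given to string_to_index_table_from_tensor.
LABEL_VOCAB = ['>50K', '<=50K']

def convert_label_sketch(label):
    return LABEL_VOCAB.index(label)

print(convert_label_sketch('>50K'))   # 0
print(convert_label_sketch('<=50K'))  # 1
```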

The `raw_data` variable represents a `PCollection` containing data in the same
format as the list `raw_data` from the previous example, and the use of the