
Commit 6c83b47

tf-transform-team authored and elmer-garduno committed
Project import generated by Copybara.
PiperOrigin-RevId: 177494698
1 parent 6bd7c13 commit 6c83b47

19 files changed: +620 −208 lines changed

README.md

Lines changed: 13 additions & 17 deletions
@@ -35,18 +35,12 @@ an explicit dependency on TensorFlow as a package. See [TensorFlow
 documentation](https://www.tensorflow.org/install/) for more information on
 installing TensorFlow.
 
-tf.Transform does though have a dependency on the Google Cloud Dataflow
-distribution of Apache Beam.
-Apache Beam is the package used to run distributed pipelines. Apache Beam is
-able to run pipelines in multiple ways, depending on the "runner" used. While
-Apache Beam is an open source package, currently the only distribution on PyPI
-is the Cloud Dataflow distribution. This package can run beam pipelines locally,
-or on Google Cloud Dataflow.
-
-When a base package for Apache Beam (containing no runners) is available, the
-tf.Transform package will depend only on this base package, and users will be
-able to install their own runners. tf.Transform will attempt to be as
-independent from the specific runner as possible.
+tf.Transform does though have a dependency on the GCP distribution of Apache
+Beam. Apache Beam is the framework used to run distributed pipelines. Apache
+Beam is able to run pipelines in multiple ways, depending on the "runner" used,
+and the "runner" is usually provided by a distribution of Apache
+Beam. With the GCP distribution of Apache Beam, one can run beam pipelines
+locally, or on Google Cloud Dataflow.
 
 Note: If you clone tf.Transform's implementation and samples from GitHub's
 `master` branch (as opposed to using the released implementation and samples
@@ -60,11 +54,13 @@ a comprehensive list, meaning other combinations may also work, but these are
 the combinations tested by our testing framework and by the team before
 releasing a new version.
 
-|tensorflow-transform |tensorflow|apache-beam[gcp]|
-|--------------------------------------------------------------------------------|----------|----------------|
-|[GitHub master](https://github.com/tensorflow/transform/blob/master/RELEASE.md) |nightly |latest (2.x) |
-|[0.3.0](https://github.com/tensorflow/transform/blob/v0.3.0/RELEASE.md) |1.3 |2.1.1 |
-|[0.1.10](https://github.com/tensorflow/transform/blob/v0.1.10/RELEASE.md) |1.0 |2.0.0 |
+|tensorflow-transform |tensorflow |apache-beam[gcp]|
+|--------------------------------------------------------------------------------|--------------|----------------|
+|[GitHub master](https://github.com/tensorflow/transform/blob/master/RELEASE.md) |nightly (1.x) |2.2.0 |
+|[0.3.1](https://github.com/tensorflow/transform/blob/v0.3.1/RELEASE.md) |1.3 |2.1.1 |
+|[0.3.0](https://github.com/tensorflow/transform/blob/v0.3.0/RELEASE.md) |1.3 |2.1.1 |
+|[0.1.10](https://github.com/tensorflow/transform/blob/v0.1.10/RELEASE.md) |1.0 |2.0.0 |
+
 ## Getting Started
 
 For instructions on using tf.Transform see the [getting started
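
The paragraph added to the README above notes that, with the GCP distribution of Apache Beam, pipelines can run either locally or on Google Cloud Dataflow. A minimal sketch of how the two modes differ (not part of this commit; the project and bucket names are placeholders, and these options would be passed to the `beam.Pipeline` that wraps a tf.Transform pipeline):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Local execution uses the DirectRunner that ships with apache-beam[gcp].
local_options = PipelineOptions(['--runner=DirectRunner'])

# Distributed execution submits the same pipeline to Google Cloud Dataflow;
# the project and GCS locations below are placeholders.
dataflow_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',
    '--temp_location=gs://my-bucket/tmp',
    '--staging_location=gs://my-bucket/staging',
])
```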

RELEASE.md

Lines changed: 22 additions & 10 deletions
@@ -1,29 +1,41 @@
-# Current Release
+# Current version (not yet released; still in development)
 
 ## Major Features and Improvements
-* Add a combine_analyzer() that supports user provided combiner, conforming to
+* Added a combine_analyzer() that supports user provided combiner, conforming to
 beam.CombineFn(). This allows users to implement custom combiners
 (e.g. median), to complement analyzers (like min, max) that are
 prepackaged in TFT.
+* Quantiles Analyzer (`tft.quantiles`).
 
 ## Bug Fixes and Other Changes
-
+* Depends on `apache-beam[gcp]>=2.2,<3`.
+* Fixes some KeyError issues that appeared in certain circumstances when one
+would call AnalyzeAndTransformDataset (due to a now-fixed Apache Beam [bug]
+(https://issues.apache.org/jira/projects/BEAM/issues/BEAM-2966)).
 * Allow all functions that accept and return tensors, to accept an optional
 name scope, in line with TensorFlow coding conventions.
-
 * Update examples to construct input functions by hand instead of using helper
 functions.
-
 * Change scale_by_min_max/scale_to_0_1 to return the average(min, max) of the
 range in case all values are identical.
-
 * Added export of serving model to examples.
+* Use "core" version of feature columns (tf.feature_column instead of
+tf.contrib) in examples.
+* A few bug fixes and improvements for coders regarding Python 3.
 
 ## Breaking changes
 
+* Requires pre-installed TensorFlow >= 1.4.
+* No longer distributing a WHL file in PyPI. Only doing a source distribution
+which should however be compatible with all platforms (ie you are still able
+to `pip install tensorflow-transform` and use `requirements.txt` or `setup.py`
+files for environment setup).
 * Some functions now introduce a new name scope when they did not before so the
 names of tensors may change. This will only affect you if you directly lookup
 tensors by name in the graph produced by tf.Transform.
+* Various Analyzer Specs (_NumericCombineSpec, _UniquesSpec, _QuantilesSpec) are
+now private. Analyzers are accessible only via the top-level TFT functions (
+min, max, sum, size, mean, var, uniques, quantiles).
 
 # Release 0.3.1
 
@@ -38,7 +50,7 @@ in input_fn_maker there is now also a `*_serving_input_receiver_fn()`.
 vocabulary (as generated by `tft.uniques`) to several different columns.
 * Provide a source distribution tar `tensorflow-transform-X.Y.Z.tar.gz`.
 
-## Breaking changes
+## Breaking Changes
 * The default prefix for `tft.string_to_int` `vocab_filename` changed from
 `vocab_string_to_int` to `vocab_string_to_int_uniques`. To make your pipelines
 resilient to implementation details please set `vocab_filename` if you are using
@@ -57,12 +69,12 @@ the generated vocab_filename on a downstream component.
 use multi-threaded workers.
 * Performance optimizations in ExampleProtoCoder.
 * Depends on `apache-beam[gcp]>=2.1.1,<3`.
-* Depends on `protobuf>=3.3.0<4`.
+* Depends on `protobuf>=3.3<4`.
 * Depends on `six>=1.9,<1.11`.
 
-## Breaking changes
+## Breaking Changes
 * Requires pre-installed TensorFlow >= 1.3.
-* Removed `tft.map` use `tft.apply_function` instead (as needed).
+* Removed `tft.map` use `tft.apply_function` instead (as needed).
 * Removed `tft.tfidf_weights` use `tft.tfidf` instead.
 * `beam_metadata_io.WriteMetadata` now requires a second `pipeline` argument
 (see examples).
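
The `combine_analyzer()` entry in the release notes above accepts a user-provided combiner conforming to `beam.CombineFn()`. As a rough illustration (a hypothetical median combiner, not code from this commit), such a combiner implements the four standard Beam methods:

```python
import apache_beam as beam
import numpy as np


class MedianCombineFn(beam.CombineFn):
  """Illustrative combiner: buffers all inputs and emits their median.

  Buffering every value is fine for a sketch but not for very large datasets.
  """

  def create_accumulator(self):
    return []

  def add_input(self, accumulator, element):
    accumulator.append(element)
    return accumulator

  def merge_accumulators(self, accumulators):
    merged = []
    for accumulator in accumulators:
      merged.extend(accumulator)
    return merged

  def extract_output(self, accumulator):
    return np.median(accumulator) if accumulator else float('nan')
```

A combiner along these lines is what would be handed to `combine_analyzer()` to run as a full-pass analysis over the dataset, complementing the prepackaged min/max analyzers.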

examples/census_example.py

Lines changed: 5 additions & 16 deletions
@@ -30,14 +30,12 @@
 from apache_beam.io import tfrecordio
 from tensorflow.contrib import learn
 from tensorflow.contrib import lookup
-from tensorflow.contrib.layers import feature_column
 from tensorflow.contrib.learn.python.learn.utils import input_fn_utils
 
 from tensorflow_transform.beam import impl as beam_impl
 from tensorflow_transform.beam.tft_beam_io import transform_fn_io
 from tensorflow_transform.coders import csv_coder
 from tensorflow_transform.coders import example_proto_coder
-from tensorflow_transform.saved import input_fn_maker
 from tensorflow_transform.saved import saved_transform_io
 from tensorflow_transform.tf_metadata import dataset_metadata
 from tensorflow_transform.tf_metadata import dataset_schema
@@ -126,7 +124,7 @@ def preprocessing_fn(inputs):
 
 # For the label column we provide the mapping from string to index.
 def convert_label(label):
-table = lookup.string_to_index_table_from_tensor(['>50K', '<=50K'])
+table = lookup.index_table_from_tensor(['>50K', '<=50K'])
 return table.lookup(label)
 outputs[LABEL_KEY] = tft.apply_function(convert_label, inputs[LABEL_KEY])
 
@@ -233,11 +231,6 @@ def input_fn():
 os.path.join(working_dir, filebase + '*'),
 batch_size, transformed_feature_spec, tf.TFRecordReader)
 
-# Apply convert_scalars_to_vectors to avoid errors where feature columns
-# do not accept scalars but require length-1 vectors.
-transformed_features = input_fn_maker.convert_scalars_to_vectors(
-transformed_features)
-
 # Extract features and label from the transformed tensors.
 transformed_labels = transformed_features.pop(LABEL_KEY)
 
@@ -276,10 +269,6 @@ def serving_input_fn():
 os.path.join(working_dir, transform_fn_io.TRANSFORM_FN_DIR),
 raw_features))
 
-# Apply convert_scalars_to_vectors since this was done in training.
-transformed_features = input_fn_maker.convert_scalars_to_vectors(
-transformed_features)
-
 return input_fn_utils.InputFnOps(transformed_features, None, default_inputs)
 
 return serving_input_fn
@@ -300,15 +289,15 @@ def train_and_evaluate(working_dir, num_train_instances=NUM_TRAIN_INSTANCES,
 """
 
 # Wrap scalars as real valued columns.
-real_valued_columns = [feature_column.real_valued_column(key)
+real_valued_columns = [tf.feature_column.numeric_column(key, shape=())
 for key in NUMERIC_FEATURE_KEYS]
 
 # Wrap categorical columns. Note the combiner is irrelevant since the input
 # only has one value set per feature per instance.
 one_hot_columns = [
-feature_column.sparse_column_with_integerized_feature(
-key, bucket_size=bucket_size, combiner='sum')
-for key, bucket_size in zip(CATEGORICAL_FEATURE_KEYS, BUCKET_SIZES)]
+tf.feature_column.categorical_column_with_identity(
+key, num_buckets=num_buckets)
+for key, num_buckets in zip(CATEGORICAL_FEATURE_KEYS, BUCKET_SIZES)]
 
 estimator = learn.LinearClassifier(real_valued_columns + one_hot_columns)
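
The switch above to `tf.feature_column.categorical_column_with_identity` works because the example's `preprocessing_fn` has already turned the string-valued categorical features into small integer ids. A simplified paraphrase of that preprocessing (not the literal code in the file; the key lists below are placeholders standing in for the constants defined in census_example.py):

```python
import tensorflow_transform as tft

# Placeholders; census_example.py defines the full lists of feature keys.
NUMERIC_FEATURE_KEYS = ['age', 'hours-per-week']
CATEGORICAL_FEATURE_KEYS = ['education', 'occupation']


def preprocessing_fn(inputs):
  """Simplified sketch of the census preprocessing."""
  outputs = {}
  # Numeric features are rescaled to [0, 1]; the resulting scalars are what
  # tf.feature_column.numeric_column(key, shape=()) wraps at training time.
  for key in NUMERIC_FEATURE_KEYS:
    outputs[key] = tft.scale_to_0_1(inputs[key])
  # String-valued categorical features become dense integer ids, which is
  # what tf.feature_column.categorical_column_with_identity consumes.
  for key in CATEGORICAL_FEATURE_KEYS:
    outputs[key] = tft.string_to_int(inputs[key])
  return outputs
```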

examples/sentiment_example.py

Lines changed: 5 additions & 18 deletions
@@ -29,12 +29,10 @@
 from apache_beam.io import textio
 from apache_beam.io import tfrecordio
 from tensorflow.contrib import learn
-from tensorflow.contrib.layers import feature_column
 from tensorflow.contrib.learn.python.learn.utils import input_fn_utils
 from tensorflow_transform.beam import impl as beam_impl
 from tensorflow_transform.beam.tft_beam_io import transform_fn_io
 from tensorflow_transform.coders import example_proto_coder
-from tensorflow_transform.saved import input_fn_maker
 from tensorflow_transform.saved import saved_transform_io
 from tensorflow_transform.tf_metadata import dataset_metadata
 from tensorflow_transform.tf_metadata import dataset_schema
@@ -261,11 +259,6 @@ def input_fn():
 os.path.join(working_dir, filebase + '*'),
 batch_size, transformed_feature_spec, tf.TFRecordReader)
 
-# Apply convert_scalars_to_vectors to avoid errors where feature columns
-# do not accept scalars but require length-1 vectors.
-transformed_features = input_fn_maker.convert_scalars_to_vectors(
-transformed_features)
-
 # Extract features and label from the transformed tensors.
 transformed_labels = transformed_features.pop(LABEL_KEY)
 
@@ -304,10 +297,6 @@ def serving_input_fn():
 os.path.join(working_dir, transform_fn_io.TRANSFORM_FN_DIR),
 raw_features))
 
-# Apply convert_scalars_to_vectors since this was done in training.
-transformed_features = input_fn_maker.convert_scalars_to_vectors(
-transformed_features)
-
 return input_fn_utils.InputFnOps(transformed_features, None, default_inputs)
 
 return serving_input_fn
@@ -327,15 +316,13 @@ def train_and_evaluate(working_dir,
 The results from the estimator's 'evaluate' method
 """
 # Unrecognized tokens are represented by -1, but
-# sparse_column_with_integerized_feature uses the mod operator to map integers
+# categorical_column_with_identity uses the mod operator to map integers
 # to the range [0, bucket_size). By choosing bucket_size=VOCAB_SIZE + 1, we
 # represent unrecognized tokens as VOCAB_SIZE.
-review_column = feature_column.sparse_column_with_integerized_feature(
-REVIEW_KEY,
-bucket_size=VOCAB_SIZE + 1,
-combiner='sum')
-weighted_reviews = feature_column.weighted_sparse_column(review_column,
-REVIEW_WEIGHT_KEY)
+review_column = tf.feature_column.categorical_column_with_identity(
+REVIEW_KEY, num_buckets=VOCAB_SIZE + 1)
+weighted_reviews = tf.feature_column.weighted_categorical_column(
+review_column, REVIEW_WEIGHT_KEY)
 
 estimator = learn.LinearClassifier([weighted_reviews])
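
For readers unfamiliar with the core column, `weighted_categorical_column` pairs each token id from the identity column with a per-token weight read from a second feature (here the weight under `REVIEW_WEIGHT_KEY`) instead of an implicit weight of 1. A tiny self-contained sketch with placeholder feature keys and vocabulary size:

```python
import tensorflow as tf

VOCAB_SIZE = 20000  # placeholder; the example defines its own constant

# Integer token ids in [0, VOCAB_SIZE]; id VOCAB_SIZE is the bucket for
# unrecognized tokens, as described in the comment above.
review_column = tf.feature_column.categorical_column_with_identity(
    'review', num_buckets=VOCAB_SIZE + 1)

# Each token id is weighted by the value found under the parallel
# 'review_weight' feature (e.g. a tf-idf weight produced by tft.tfidf).
weighted_reviews = tf.feature_column.weighted_categorical_column(
    review_column, 'review_weight')
```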

examples/simple_example.py

Lines changed: 4 additions & 8 deletions
@@ -61,15 +61,11 @@ def preprocessing_fn(inputs):
 }))
 
 with beam_impl.Context(temp_dir=tempfile.mkdtemp()):
-transform_fn = (
-(raw_data, raw_data_metadata)
-| beam_impl.AnalyzeDataset(preprocessing_fn))
-transformed_dataset = (
-((raw_data, raw_data_metadata), transform_fn)
-| beam_impl.TransformDataset())
+transformed_dataset, transform_fn = ( # pylint: disable=unused-variable
+(raw_data, raw_data_metadata) | beam_impl.AnalyzeAndTransformDataset(
+preprocessing_fn))
 
-# pylint: disable=unused-variable
-transformed_data, transformed_metadata = transformed_dataset
+transformed_data, transformed_metadata = transformed_dataset # pylint: disable=unused-variable
 
 pprint.pprint(transformed_data)

getting_started.md

Lines changed: 3 additions & 6 deletions
@@ -119,12 +119,9 @@ raw_data = [
 ]
 
 raw_data_metadata = ...
-transform_fn = (
-(raw_data, raw_data_metadata)
-| beam_impl.AnalyzeDataset(preprocessing_fn))
-transformed_dataset = (
-((raw_data, raw_data_metadata), transform_fn)
-| beam_impl.TransformDataset())
+transformed_dataset, transform_fn = (
+(raw_data, raw_data_metadata) | beam_impl.AnalyzeAndTransformDataset(
+preprocessing_fn, tempfile.mkdtemp()))
 transformed_data, transformed_metadata = transformed_dataset
 ```
 
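
Filling in the pieces the snippet above elides, a complete minimal pipeline in the new `AnalyzeAndTransformDataset` style might look like this. This is a sketch in the spirit of examples/simple_example.py, not text from this commit; the feature name and preprocessing are illustrative, and the schema is built with `dataset_schema.ColumnSchema` in the style of this era's examples:

```python
import pprint
import tempfile

import tensorflow as tf
import tensorflow_transform as tft
from tensorflow_transform.beam import impl as beam_impl
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema


def preprocessing_fn(inputs):
  # A full-pass analyzer: min and max are computed over the whole dataset.
  return {'x_scaled': tft.scale_to_0_1(inputs['x'])}


raw_data = [{'x': 1.0}, {'x': 2.0}, {'x': 3.0}]
raw_data_metadata = dataset_metadata.DatasetMetadata(dataset_schema.Schema({
    'x': dataset_schema.ColumnSchema(
        tf.float32, [], dataset_schema.FixedColumnRepresentation()),
}))

with beam_impl.Context(temp_dir=tempfile.mkdtemp()):
  transformed_dataset, transform_fn = (
      (raw_data, raw_data_metadata)
      | beam_impl.AnalyzeAndTransformDataset(preprocessing_fn))

transformed_data, transformed_metadata = transformed_dataset
pprint.pprint(transformed_data)
```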

setup.py

Lines changed: 23 additions & 4 deletions
@@ -17,20 +17,20 @@
 from setuptools import setup
 
 # Tensorflow transform version.
-__version__ = '0.3.1'
+__version__ = '0.4.0dev'
 
 
 def _make_required_install_packages():
 return [
-# apache-beam[gcp] <= 2.1.0 has an issue with importing the six library.
-'apache-beam[gcp]>=2.1.1,<3',
+'apache-beam[gcp]>=2.2,<3',
 
-# Protobuf libraries <= 3.2 contain some map-related data corruption bugs
+# Protobuf libraries < 3.3 contain some map-related data corruption bugs
 # (b/35874111).
 'protobuf>=3.3,<4',
 
 # Six 1.11.0 incompatible with apitools.
 'six>=1.9,<1.11',
+
 ]
 
 
@@ -40,8 +40,27 @@ def _make_required_install_packages():
 author='Google Inc.',
 author_email='[email protected]',
 license='Apache 2.0',
+classifiers=[
+'Development Status :: 4 - Beta',
+'Intended Audience :: Developers',
+'Intended Audience :: Education',
+'Intended Audience :: Science/Research',
+'License :: OSI Approved :: Apache Software License',
+'Operating System :: OS Independent',
+'Programming Language :: Python',
+'Programming Language :: Python :: 2',
+'Programming Language :: Python :: 2.7',
+'Programming Language :: Python :: 2 :: Only',
+'Topic :: Scientific/Engineering',
+'Topic :: Scientific/Engineering :: Artificial Intelligence',
+'Topic :: Scientific/Engineering :: Mathematics',
+'Topic :: Software Development',
+'Topic :: Software Development :: Libraries',
+'Topic :: Software Development :: Libraries :: Python Modules',
+],
 namespace_packages=[],
 install_requires=_make_required_install_packages(),
+python_requires='>=2.7,<3',
 packages=find_packages(),
 include_package_data=True,
 description='A library for data preprocessing with TensorFlow',

tensorflow_transform/analyzers.py

Lines changed: 44 additions & 0 deletions
@@ -425,6 +425,50 @@ def bucket_dtype(self):
 return tf.float32
 
 
+def quantiles(x, num_buckets, epsilon, name=None):
+"""Computes the quantile boundaries of a `Tensor` over the whole dataset.
+
+Quantile boundaries are computed using approximate quantiles,
+and error tolerance is specified using `epsilon`. The boundaries divide the
+input tensor into approximately equal `num_buckets` parts.
+See go/squawd for details, and how to control the error due to approximation.
+
+Args:
+x: An input `Tensor` or `SparseTensor`.
+num_buckets: Values in the `x` are divided into approximately equal-sized
+buckets, where the number of buckets is num_buckets.
+epsilon: Error tolerance, typically a small fraction close to zero
+(e.g. 0.01). Higher values of epsilon increase the quantile approximation
+error, and hence result in more unequal buckets, but could improve
+performance and resource consumption. Some measured results on memory
+consumption: for epsilon = 0.001, the amount of memory for each buffer to
+hold the summary for 1 trillion input values is ~25000 bytes. If epsilon is
+relaxed to 0.01, the buffer size drops to ~2000 bytes for the same input
+size. If we use a strict epsilon value of 0, the buffer size is the same
+size as the input, because the intermediate stages have to remember every
+input and the quantile boundaries can be found only after an equivalent of
+a full sort of the input. The buffer size also determines the amount of
+work in the different stages of the beam pipeline; in general, larger
+epsilon results in fewer and smaller stages, and less time. For more
+performance trade-offs see also http://web.cs.ucla.edu/~weiwang/paper/SSDBM07_2.pdf
+name: (Optional) A name for this operation.
+
+Returns:
+The bucket boundaries represented as a list, with num_buckets - 1 elements.
+See bucket_dtype() above for the type of the bucket boundaries.
+"""
+
+with tf.name_scope(name, 'quantiles'):
+spec = _QuantilesSpec(epsilon, num_buckets)
+quantile_boundaries = Analyzer(
+[x], [(spec.bucket_dtype, [1, None], False)], spec,
+'quantiles').outputs[0]
+
+# The quantile boundaries are of the form
+# [np.array(first, <num_buckets - 1 boundaries>, last)].
+# Drop the first and last quantile boundaries, so that we end up with
+# num_buckets - 1 boundaries, and hence num_buckets buckets.
+return quantile_boundaries[0:1, 1:-1]
 
 
 class _CombinerSpec(object):
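
A minimal usage sketch for the new `tft.quantiles` analyzer (not code from this commit; the feature name, bucket count, and epsilon are arbitrary):

```python
import tensorflow_transform as tft


def preprocessing_fn(inputs):
  # Approximate decile boundaries of 'income' over the whole dataset, with a
  # 1% error tolerance; larger epsilon means smaller summaries and cheaper
  # pipeline stages, as described in the docstring above.
  income_boundaries = tft.quantiles(inputs['income'], num_buckets=10,
                                    epsilon=0.01)
  # The result is a [1, num_buckets - 1] tensor of boundaries that can then
  # be used to bucketize the feature with ordinary TensorFlow ops.
  return {'income_bucket_boundaries': income_boundaries}
```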
