Commit 70557c6

committed
default partitioner
1 parent c8f8c7b commit 70557c6

2 files changed

+60
-52
lines changed

snooty.toml

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ artifact-id-2-13 = "mongo-spark-connector_2.13"
2121
artifact-id-2-12 = "mongo-spark-connector_2.12"
2222
spark-core-version = "3.3.1"
2323
spark-sql-version = "3.3.1"
24+
mdb-server = "MongoDB Server"
2425

2526
[substitutions]
2627
copy = "unicode:: U+000A9"

source/batch-mode/batch-read-config.txt

Lines changed: 59 additions & 52 deletions
@@ -7,7 +7,7 @@ Batch Read Configuration Options
77
.. contents:: On this page
88
:local:
99
:backlinks: none
10-
:depth: 1
10+
:depth: 2
1111
:class: singlecol
1212

1313
.. facet::
@@ -178,26 +178,81 @@ dividing the data into partitions, you can run transformations in parallel.
178178
This section contains configuration information for the following
179179
partitioners:
180180

181+
- :ref:`AutoBucketPartitioner <conf-autobucketpartitioner>`
181182
- :ref:`SamplePartitioner <conf-samplepartitioner>`
182183
- :ref:`ShardedPartitioner <conf-shardedpartitioner>`
183184
- :ref:`PaginateBySizePartitioner <conf-paginatebysizepartitioner>`
184185
- :ref:`PaginateIntoPartitionsPartitioner <conf-paginateintopartitionspartitioner>`
185186
- :ref:`SinglePartitionPartitioner <conf-singlepartitionpartitioner>`
186-
- :ref:`AutoBucketPartitioner <conf-autobucketpartitioner>`
187187

188188
.. note:: Batch Reads Only
189189

190190
Because the data-stream-processing engine produces a single data stream,
191191
partitioners do not affect streaming reads.
192192

193+
.. _conf-autobucketpartitioner:
194+
195+
``AutoBucketPartitioner`` Configuration
196+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
197+
198+
The ``AutoBucketPartitioner`` is the default partitioner configuration and uses
199+
the :manual:`$bucketAuto </reference/operator/aggregation/bucketAuto/>`
200+
aggregation stage to paginate the data. By using this configuration,
201+
you can partition the data across single or multiple fields, including nested
202+
fields.
203+
204+
.. note:: Compound Keys
205+
206+
The ``AutoBucketPartitioner`` configuration requires {+mdb-server+} version
207+
7.0 or higher to support compound keys.
208+
209+
To use this configuration, set the ``partitioner`` configuration option to
210+
``com.mongodb.spark.sql.connector.read.partitioner.AutoBucketPartitioner``.
211+
212+
.. list-table::
213+
:header-rows: 1
214+
:widths: 35 65
215+
216+
* - Property name
217+
- Description
218+
219+
* - ``partitioner.options.partition.fieldList``
220+
- The list of fields to use for partitioning. The value can be either a single field
221+
name or a list of comma-separated fields.
222+
223+
**Default:** ``_id``
224+
225+
* - ``partitioner.options.partition.chunkSize``
226+
- The average size (MB) for each partition. Smaller partition sizes
227+
create more partitions containing fewer documents.
228+
Because this configuration uses the average document size to determine the number of
229+
documents per partition, partitions might not be the same size.
230+
231+
**Default:** ``64``
232+
233+
* - ``partitioner.options.partition.samplesPerPartition``
234+
- The number of samples to take per partition.
235+
236+
**Default:** ``100``
237+
238+
* - ``partitioner.options.partition.partitionKeyProjectionField``
239+
- The field name to use for a projected field that contains all the
240+
fields used to partition the collection.
241+
We recommend changing the value of this property only if each document already
242+
contains the ``__idx`` field.
243+
244+
**Default:** ``__idx``
245+
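As an illustration of the ``AutoBucketPartitioner`` properties listed above, the following minimal Python sketch assembles them into a dictionary of read options. The helper name ``auto_bucket_read_options``, the example field names, and the commented-out ``spark`` session are assumptions made for this example and are not part of the commit. Only the option keys, the class name, and the default values come from the table.

```python
# Fully qualified partitioner class name, as given in the documentation text.
AUTO_BUCKET_PARTITIONER = (
    "com.mongodb.spark.sql.connector.read.partitioner.AutoBucketPartitioner"
)

# Common prefix of the partitioner properties listed in the table.
OPTION_PREFIX = "partitioner.options.partition."


def auto_bucket_read_options(field_list=("_id",), chunk_size_mb=64,
                             samples_per_partition=100):
    """Hypothetical helper: build read options for the AutoBucketPartitioner.

    The defaults mirror the documented ones (_id, 64, 100).
    """
    if not field_list:
        raise ValueError("field_list must contain at least one field name")
    if chunk_size_mb <= 0 or samples_per_partition <= 0:
        raise ValueError("sizes and sample counts must be positive")
    return {
        "partitioner": AUTO_BUCKET_PARTITIONER,
        # A single field name or a comma-separated list of fields.
        OPTION_PREFIX + "fieldList": ",".join(field_list),
        # Average partition size in MB.
        OPTION_PREFIX + "chunkSize": str(chunk_size_mb),
        OPTION_PREFIX + "samplesPerPartition": str(samples_per_partition),
    }


# Example: partition on two fields, one of them nested (illustrative names).
options = auto_bucket_read_options(["customer.id", "orderDate"], chunk_size_mb=32)

# Assumed usage with an existing SparkSession named `spark` whose connection
# settings are configured elsewhere:
# df = spark.read.format("mongodb").options(**options).load()
```

Partitioning on more than one field, as in this example, is the compound-key case that the note above ties to MongoDB Server 7.0 or higher.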
193246
.. _conf-mongosamplepartitioner:
194247
.. _conf-samplepartitioner:
195248

196249
``SamplePartitioner`` Configuration
197250
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
198251

199-
``SamplePartitioner`` is the default partitioner configuration. This configuration
200-
lets you specify a partition field, partition size, and number of samples per partition.
252+
The ``SamplePartitioner`` configuration is similar to the
253+
:ref:`AutoBucketPartitioner <conf-autobucketpartitioner>`
254+
configuration, but does not use the ``$bucketAuto`` aggregation stage. This configuration lets you specify a partition field,
255+
partition size, and number of samples per partition.
201256

202257
To use this configuration, set the ``partitioner`` configuration option to
203258
``com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner``.
@@ -328,54 +383,6 @@ The ``SinglePartitionPartitioner`` configuration creates a single partition.
328383
To use this configuration, set the ``partitioner`` configuration option to
329384
``com.mongodb.spark.sql.connector.read.partitioner.SinglePartitionPartitioner``.
330385

331-
.. _conf-autobucketpartitioner:
332-
333-
``AutoBucketPartitioner`` Configuration
334-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
335-
336-
The ``AutoBucketPartitioner`` configuration is similar to the
337-
:ref:`SamplePartitioner <conf-samplepartitioner>`
338-
configuration, but uses the :manual:`$bucketAuto </reference/operator/aggregation/bucketAuto/>`
339-
aggregation stage to paginate the data. By using this configuration,
340-
you can partition the data across single or multiple fields, including nested fields.
341-
342-
To use this configuration, set the ``partitioner`` configuration option to
343-
``com.mongodb.spark.sql.connector.read.partitioner.AutoBucketPartitioner``.
344-
345-
.. list-table::
346-
:header-rows: 1
347-
:widths: 35 65
348-
349-
* - Property name
350-
- Description
351-
352-
* - ``partitioner.options.partition.fieldList``
353-
- The list of fields to use for partitioning. The value can be either a single field
354-
name or a list of comma-separated fields.
355-
356-
**Default:** ``_id``
357-
358-
* - ``partitioner.options.partition.chunkSize``
359-
- The average size (MB) for each partition. Smaller partition sizes
360-
create more partitions containing fewer documents.
361-
Because this configuration uses the average document size to determine the number of
362-
documents per partition, partitions might not be the same size.
363-
364-
**Default:** ``64``
365-
366-
* - ``partitioner.options.partition.samplesPerPartition``
367-
- The number of samples to take per partition.
368-
369-
**Default:** ``100``
370-
371-
* - ``partitioner.options.partition.partitionKeyProjectionField``
372-
- The field name to use for a projected field that contains all the
373-
fields used to partition the collection.
374-
We recommend changing the value of this property only if each document already
375-
contains the ``__idx`` field.
376-
377-
**Default:** ``__idx``
378-
379386
Specifying Properties in ``connection.uri``
380387
-------------------------------------------
381388
