3966 update glossary #4164


Open · wants to merge 14 commits into base: main
49 changes: 21 additions & 28 deletions .github/workflows/check-build.yml
@@ -16,15 +16,15 @@ jobs:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        check_type: [spellcheck, kbcheck, md-lint]
        check_type: [spellcheck, kbcheck, md-lint, glossary-check]
    steps:
      # Add setup steps per check here
      - uses: actions/checkout@v4
      - name: Install Aspell
        if: matrix.check_type == 'spellcheck'
        run: sudo apt-get update && sudo apt-get install -y aspell aspell-en
      - name: Set up Python
        if: matrix.check_type == 'kbcheck'
        if: matrix.check_type == 'kbcheck' || matrix.check_type == 'glossary-check'
        run: |
          curl -Ls https://astral.sh/uv/install.sh | sh
          uv clean
@@ -39,31 +39,25 @@ jobs:
        run: yarn add -D markdownlint-cli2

      # Run the checks here
      - name: Run checks
        id: check_step
        run: |
          if [[ "${{ matrix.check_type }}" == "spellcheck" ]]; then
            yarn check-spelling
            exit_code=$?
          elif [[ "${{ matrix.check_type }}" == "kbcheck" ]]; then
            yarn check-kb
            exit_code=$?
          elif [[ "${{ matrix.check_type }}" == "md-lint" ]]; then
            yarn check-markdown
            exit_code=$?
          fi

          if [[ $exit_code -ne 0 ]]; then
            echo "::error::${{ matrix.check_type }} check failed. See logs for details."
            exit 1
          fi
      - name: Run spellcheck
        if: matrix.check_type == 'spellcheck'
        run: yarn check-spelling

      - name: Set check status
        if: steps.check_step.outcome != 'success'
        uses: actions/github-script@v6
        with:
          script: |
            core.setFailed('${{ matrix.check_type }} check failed.');
      - name: Run KB check
        if: matrix.check_type == 'kbcheck'
        run: yarn check-kb

      - name: Run markdown lint
        if: matrix.check_type == 'md-lint'
        run: yarn check-markdown

      - name: Run glossary check
        if: matrix.check_type == 'glossary-check'
        run: |
          echo "Extracting glossary from markdown..."
          python3 scripts/glossary/extract-glossary-terms.py
          echo "Checking glossary coverage..."
          python3 scripts/glossary/wrap-glossary-terms.py --check || echo "::warning::Glossary check found unwrapped terms (non-blocking)"

  check_overall_status:
    needs: stylecheck
@@ -74,5 +68,4 @@ jobs:
        if: needs.stylecheck.result != 'success'
        run: |
          echo "::error::One or more of the style checks failed."
          exit 1

          exit 1
24 changes: 12 additions & 12 deletions docs/best-practices/partitioning_keys.mdx
@@ -12,12 +12,12 @@ import partitions from '@site/static/images/bestpractices/partitions.png';
import merges_with_partitions from '@site/static/images/bestpractices/merges_with_partitions.png';

:::note A data management technique
Partitioning is primarily a data management technique, not a query optimization tool. While it can improve performance in specific workloads, it should not be the first mechanism used to accelerate queries; the partitioning key must be chosen carefully, with a clear understanding of its implications, and applied only when it aligns with data life cycle needs or well-understood access patterns.
Partitioning is primarily a data management technique, not a query optimization tool. While it can improve performance in specific workloads, it should not be the first mechanism used to accelerate queries; the ^^partitioning key^^ must be chosen carefully, with a clear understanding of its implications, and applied only when it aligns with data life cycle needs or well-understood access patterns.
:::

In ClickHouse, partitioning organizes data into logical segments based on a specified key. This is defined using the `PARTITION BY` clause at table creation time and is commonly used to group rows by time intervals, categories, or other business-relevant dimensions. Each unique value of the partitioning expression forms its own physical partition on disk, and ClickHouse stores data in separate parts for each of these values. Partitioning improves data management, simplifies retention policies, and can help with certain query patterns.
In ClickHouse, partitioning organizes data into logical segments based on a specified key. This is defined using the `PARTITION BY` clause at table creation time and is commonly used to group rows by time intervals, categories, or other business-relevant dimensions. Each unique value of the partitioning expression forms its own physical partition on disk, and ClickHouse stores data in separate ^^parts^^ for each of these values. Partitioning improves data management, simplifies retention policies, and can help with certain query patterns.

For example, consider the following UK price paid dataset table with a partitioning key of `toStartOfMonth(date)`.
For example, consider the following UK price paid dataset table with a ^^partitioning key^^ of `toStartOfMonth(date)`.

```sql
CREATE TABLE uk.uk_price_paid_simple_partitioned
-- ... (remainder of the table definition collapsed in the diff)
```

@@ -40,28 +40,28 @@ The ClickHouse server first splits the rows from the example insert with 4 rows

For a more detailed explanation of partitioning, we recommend [this guide](/partitions).

With partitioning enabled, ClickHouse only [merges](/merges) data parts within, but not across, partitions. This is sketched below for the example table above:
With partitioning enabled, ClickHouse only [merges](/merges) data ^^parts^^ within, but not across, partitions. This is sketched below for the example table above:

<Image img={merges_with_partitions} size="md" alt="Partitions" />
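
You can observe this behavior directly. As a sketch (assuming the example table above exists and holds data), the following query lists the active parts in each partition; because merges never cross partition boundaries, every part belongs to exactly one partition:

```sql
-- List active parts per partition; merges only ever combine parts
-- that share the same partition value
SELECT partition, name, rows
FROM system.parts
WHERE database = 'uk'
  AND table = 'uk_price_paid_simple_partitioned'
  AND active
ORDER BY partition, name;
```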

## Applications of partitioning {#applications-of-partitioning}

Partitioning is a powerful tool for managing large datasets in ClickHouse, especially in observability and analytics use cases. It enables efficient data life cycle operations by allowing entire partitions, often aligned with time or business logic, to be dropped, moved, or archived in a single metadata operation. This is significantly faster and less resource-intensive than row-level delete or copy operations. Partitioning also integrates cleanly with ClickHouse features like TTL and tiered storage, making it possible to implement retention policies or hot/cold storage strategies without custom orchestration. For example, recent data can be kept on fast SSD-backed storage, while older partitions are automatically moved to cheaper object storage.
Partitioning is a powerful tool for managing large datasets in ClickHouse, especially in observability and analytics use cases. It enables efficient data life cycle operations by allowing entire partitions, often aligned with time or business logic, to be dropped, moved, or archived in a single metadata operation. This is significantly faster and less resource-intensive than row-level delete or copy operations. Partitioning also integrates cleanly with ClickHouse features like ^^TTL^^ and tiered storage, making it possible to implement retention policies or hot/cold storage strategies without custom orchestration. For example, recent data can be kept on fast SSD-backed storage, while older partitions are automatically moved to cheaper object storage.
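
As a sketch of what such life cycle operations look like (the partition value and the `cold` volume are assumptions, not part of the example above):

```sql
-- Drop an entire month of data in a single, cheap metadata operation
ALTER TABLE uk.uk_price_paid_simple_partitioned DROP PARTITION '2023-01-01';

-- Move partitions older than 12 months to a hypothetical 'cold' volume
-- (requires a storage policy that defines such a volume)
ALTER TABLE uk.uk_price_paid_simple_partitioned
    MODIFY TTL date + INTERVAL 12 MONTH TO VOLUME 'cold';
```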

While partitioning can improve query performance for some workloads, it can also negatively impact response time.

If the partitioning key is not in the primary key and you are filtering by it, you may see an improvement in query performance with partitioning. See [here](/partitions#query-optimization) for an example.
If the ^^partitioning key^^ is not in the ^^primary key^^ and you are filtering by it, you may see an improvement in query performance with partitioning. See [here](/partitions#query-optimization) for an example.
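
For instance, a query that filters on the partitioning key only has to read the parts of the matching partitions. A minimal sketch against the example table; `EXPLAIN indexes = 1` shows which partitions and parts are pruned:

```sql
-- Only the parts of the December 2020 partition need to be scanned
EXPLAIN indexes = 1
SELECT count()
FROM uk.uk_price_paid_simple_partitioned
WHERE date >= '2020-12-01' AND date < '2021-01-01';
```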

Conversely, if queries need to read across partitions, performance may be negatively impacted due to a higher number of total parts. For this reason, users should understand their access patterns before considering partitioning as a query optimization technique.
Conversely, if queries need to read across partitions, performance may be negatively impacted due to a higher number of total ^^parts^^. For this reason, users should understand their access patterns before considering partitioning as a query optimization technique.

In summary, users should primarily think of partitioning as a data management technique. For an example of managing data, see ["Managing Data"](/observability/managing-data) from the observability use-case guide and ["What are table partitions used for?"](/partitions#data-management) from Core Concepts - Table partitions.

## Choose a low-cardinality partitioning key {#choose-a-low-cardinality-partitioning-key}
## Choose a low-cardinality ^^partitioning key^^ {#choose-a-low-cardinality-partitioning-key}

Importantly, a higher number of parts will negatively affect query performance. ClickHouse will therefore respond to inserts with a [“too many parts”](/knowledgebase/exception-too-many-parts) error if the number of parts exceeds specified limits either in [total](/operations/settings/merge-tree-settings#max_parts_in_total) or [per partition](/operations/settings/merge-tree-settings#parts_to_throw_insert).
Importantly, a higher number of ^^parts^^ will negatively affect query performance. ClickHouse will therefore respond to inserts with a [“too many parts”](/knowledgebase/exception-too-many-parts) error if the number of ^^parts^^ exceeds specified limits either in [total](/operations/settings/merge-tree-settings#max_parts_in_total) or [per partition](/operations/settings/merge-tree-settings#parts_to_throw_insert).
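
The effective thresholds can be inspected on the server itself; a quick sketch:

```sql
-- Current limits that trigger the "too many parts" error
SELECT name, value
FROM system.merge_tree_settings
WHERE name IN ('max_parts_in_total', 'parts_to_throw_insert');
```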

Choosing the right **cardinality** for the partitioning key is critical. A high-cardinality partitioning key - where the number of distinct partition values is large - can lead to a proliferation of data parts. Since ClickHouse does not merge parts across partitions, too many partitions will result in too many unmerged parts, eventually triggering the “Too many parts” error. [Merges are essential](/merges) for reducing storage fragmentation and optimizing query speed, but with high-cardinality partitions, that merge potential is lost.
Choosing the right **cardinality** for the ^^partitioning key^^ is critical. A high-cardinality ^^partitioning key^^ - where the number of distinct partition values is large - can lead to a proliferation of data ^^parts^^. Since ClickHouse does not merge ^^parts^^ across partitions, too many partitions will result in too many unmerged ^^parts^^, eventually triggering the “Too many ^^parts^^” error. [Merges are essential](/merges) for reducing storage fragmentation and optimizing query speed, but with high-cardinality partitions, that merge potential is lost.

By contrast, a **low-cardinality partitioning key** - with fewer than 100 to 1,000 distinct values - is usually optimal. It enables efficient part merging, keeps metadata overhead low, and avoids excessive object creation in storage. In addition, ClickHouse automatically builds MinMax indexes on partition columns, which can significantly speed up queries that filter on those columns. For example, filtering by month when the table is partitioned by `toStartOfMonth(date)` allows the engine to skip irrelevant partitions and their parts entirely.
By contrast, a **low-cardinality ^^partitioning key^^** - with fewer than 100 to 1,000 distinct values - is usually optimal. It enables efficient part merging, keeps metadata overhead low, and avoids excessive object creation in storage. In addition, ClickHouse automatically builds MinMax indexes on partition columns, which can significantly speed up queries that filter on those columns. For example, filtering by month when the table is partitioned by `toStartOfMonth(date)` allows the engine to skip irrelevant partitions and their ^^parts^^ entirely.
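
Before settling on a key, it can help to measure the cardinality a candidate expression would produce. In this sketch, `street` is a stand-in for a high-cardinality column and is an assumption about the table's schema:

```sql
-- Distinct partition values a candidate key would create; aim well below ~1,000
SELECT
    uniqExact(toStartOfMonth(date)) AS monthly_partitions, -- low cardinality: good
    uniqExact(street)               AS street_partitions   -- high cardinality: avoid
FROM uk.uk_price_paid_simple_partitioned;
```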

While partitioning can improve performance in some query patterns, it's primarily a data management feature. In many cases, querying across all partitions can be slower than using a non-partitioned table due to increased data fragmentation and more parts being scanned. Use partitioning judiciously, and always ensure that the chosen key is low-cardinality and aligns with your data life cycle policies (e.g., retention via TTL). If you're unsure whether partitioning is necessary, you may want to start without it and optimize later based on observed access patterns.
While partitioning can improve performance in some query patterns, it's primarily a data management feature. In many cases, querying across all partitions can be slower than using a non-partitioned table due to increased data fragmentation and more ^^parts^^ being scanned. Use partitioning judiciously, and always ensure that the chosen key is low-cardinality and aligns with your data life cycle policies (e.g., retention via ^^TTL^^). If you're unsure whether partitioning is necessary, you may want to start without it and optimize later based on observed access patterns.