
Releases: apache/druid

druid-29.0.1

03 Apr 05:02

Druid 29.0.1

Apache Druid 29.0.1 is a patch release that fixes some issues in the Druid 29.0.0 release.

Bug fixes

  • Added type verification for INSERT and REPLACE to validate that strings and string arrays aren't mixed #15920
  • Concurrent replace now allows pending Peon segments to be upgraded using the Supervisor #15995
  • Changed the targetDataSource attribute to return a string containing the name of the datasource. This reverts the breaking change introduced in Druid 29.0.0 for INSERT and REPLACE MSQ queries #16004 #16031
  • Decreased the size of the distribution Docker image #15968
  • Fixed an issue with SQL-based ingestion where string inputs, such as from CSV, TSV, or string-value fields in JSON, are ingested as null values when they are typed as LONG or BIGINT #15999
  • Fixed an issue where a web console-generated Kafka supervisor spec has flattenSpec in the wrong location #15946
  • Fixed an issue with filters on expression virtual column indexes incorrectly considering values null in some cases for expressions that map null values to non-null values #15959
  • Fixed an issue where the data loader crashes if the incoming data can't be parsed #15983
  • Improved DOUBLE type detection in the web console #15998
  • Web console-generated queries now only set the context parameter arrayIngestMode to array when you explicitly opt in to use arrays #15927
  • The web console now displays the results of an MSQ query that writes to an external destination through the EXTERN function #15969

Incompatible changes

Changes to targetDataSource in EXPLAIN queries

Druid 29.0.1 includes a breaking change that restores the behavior of targetDataSource to its state in Druid 28.0.0 and earlier, which differs only from Druid 29.0.0. In 29.0.0, targetDataSource returns a JSON object that includes the datasource name. In all other versions, targetDataSource returns a string containing the name of the datasource.

If you're upgrading from any version other than 29.0.0, there is no change in behavior.

If you are upgrading from 29.0.0, this is an incompatible change.

#16004

Dependency updates

  • Updated PostgreSQL JDBC Driver version to 42.7.2 #15931

Credits

@abhishekagarwal87
@adarshsanjeev
@AmatyaAvadhanula
@clintropolis
@cryptoe
@dependabot[bot]
@ektravel
@gargvishesh
@gianm
@kgyrtkirk
@LakshSingla
@somu-imply
@techdocsmith
@vogievetsky

Druid 29.0.0

21 Feb 05:51

Apache Druid 29.0.0 contains over 350 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 67 contributors.

See the complete set of changes for additional details, including bug fixes.

Review the upgrade notes before you upgrade to Druid 29.0.0.
If you are upgrading across multiple versions, see the Upgrade notes page, which lists upgrade notes for the most recent Druid versions.

# Important features, changes, and deprecations

This section contains important information about new and existing features.

# MSQ export statements (experimental)

Druid 29.0.0 adds experimental support for export statements to the MSQ task engine. This allows query tasks to write data to an external destination through the EXTERN function.

#15689
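As a rough sketch of the export syntax, assuming S3 as the destination (the bucket, prefix, datasource, and column names below are illustrative placeholders, not values from this release):

```sql
-- Export query results to an external destination through EXTERN (sketch).
INSERT INTO
  EXTERN(S3(bucket => 'my-export-bucket', prefix => 'druid/exports'))
AS CSV
SELECT channel, COUNT(*) AS edit_count
FROM wikipedia
GROUP BY channel
```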

# SQL PIVOT and UNPIVOT (experimental)

Druid 29.0.0 adds experimental support for the SQL PIVOT and UNPIVOT operators.

The PIVOT operator carries out an aggregation and transforms rows into columns in the output. The following is the general syntax for the PIVOT operator:

PIVOT (aggregation_function(column_to_aggregate)
  FOR column_with_values_to_pivot
  IN (pivoted_column1 [, pivoted_column2 ...])
)

The UNPIVOT operator transforms existing column values into rows. The following is the general syntax for the UNPIVOT operator:

UNPIVOT (values_column 
  FOR names_column
  IN (unpivoted_column1 [, unpivoted_column2 ... ])
)
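As a hedged illustration of the PIVOT syntax above, using a hypothetical sales datasource (the datasource and column names are illustrative only):

```sql
-- Produce one aggregated output column per pivoted country value (sketch).
SELECT *
FROM (SELECT city, country, revenue FROM sales)
PIVOT (SUM(revenue) FOR country IN ('US', 'JP'))
```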

# Range support in window functions (experimental)

Window functions (experimental) now support ranges where both endpoints are unbounded or are the current row. Ranges work in strict mode, which means that Druid will fail queries that aren't supported. You can turn off strict mode for ranges by setting the context parameter windowingStrictValidation to false.

The following example shows a window expression with RANGE frame specifications:

(ORDER BY c)
(ORDER BY c RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
(ORDER BY c RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)

#15703 #15746

# Improved INNER joins

Druid now supports arbitrary join conditions for INNER join. Any sub-conditions that can't be evaluated as part of the join are converted to a post-join filter. Improved join capabilities allow Druid to more effectively support applications like Tableau.

#15302

# Improved concurrent append and replace (experimental)

You no longer have to manually determine the task lock type for concurrent append and replace (experimental) with the taskLockType task context. Instead, Druid can now determine it automatically for you. You can use the context parameter "useConcurrentLocks": true for individual tasks and datasources or enable concurrent append and replace at a cluster level using druid.indexer.task.default.context.

#15684
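A minimal sketch of the cluster-level option described above, set in the Overlord runtime properties (the value is exactly the context parameter named in this section):

```properties
# Enable automatic lock-type selection for all tasks by default.
druid.indexer.task.default.context={"useConcurrentLocks": true}
```

For individual tasks, the same "useConcurrentLocks": true entry goes in the task's context instead.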

# First and last aggregators for double, float, and long data types

Druid now supports first and last aggregators for the double, float, and long types in native and MSQ ingestion specs and in MSQ queries. Previously, they were only supported in native queries. For more information, see First and last aggregators.

#14462

Additionally, the following functions can now return numeric values:

  • EARLIEST and EARLIEST_BY
  • LATEST and LATEST_BY

You can use these functions as aggregators at ingestion time.

#15607

# Support for logging audit events

Added support for logging audit events and improved coverage of audited REST API endpoints.
To log audit events, set the config druid.audit.manager.type to log in both the Coordinator and Overlord runtime properties, or in common.runtime.properties. When you set druid.audit.manager.type to sql, audit events are persisted to the metadata store instead.
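The two modes described above boil down to a one-line config choice:

```properties
# Log-based audit manager (Coordinator and Overlord, or common.runtime.properties):
druid.audit.manager.type=log

# Alternatively, persist audit events to the metadata store:
# druid.audit.manager.type=sql
```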

In both cases, Druid audits the following events:

  • Coordinator
    • Update load rules
    • Update lookups
    • Update coordinator dynamic config
    • Update auto-compaction config
  • Overlord
    • Submit a task
    • Create/update a supervisor
    • Update worker config
  • Basic security extension
    • Create user
    • Delete user
    • Update user credentials
    • Create role
    • Delete role
    • Assign role to user
    • Set role permissions

#15480 #15653

Also fixed an issue with the basic auth integration test by not persisting logs to the database.

#15561

# Enabled empty ingest queries

The MSQ task engine now allows empty ingest queries by default. Previously, ingest queries that produced no data would fail with the InsertCannotBeEmpty MSQ fault.
For more information, see Empty ingest queries in the upgrade notes.

#15674 #15495

In the web console, you can use a toggle to control whether an ingestion fails if the ingestion query produces no data.

#15627

# MSQ support for Google Cloud Storage

The MSQ task engine now supports Google Cloud Storage (GCS). You can use durable storage with GCS. See Durable storage configurations for more information.

#15398
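A hedged config sketch for GCS durable storage, assuming the property names follow the same druid.msq.intermediate.storage.* pattern used for other durable storage backends (the bucket and prefix are placeholders; check Durable storage configurations for the exact keys):

```properties
druid.msq.intermediate.storage.enable=true
druid.msq.intermediate.storage.type=google
druid.msq.intermediate.storage.bucket=my-gcs-bucket
druid.msq.intermediate.storage.prefix=msq-intermediate
```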

# Experimental extensions

Druid 29.0.0 adds the following extensions.

# DDSketch

A new DDSketch extension is available as a community contribution. The DDSketch extension (druid-ddsketch) provides support for approximate quantile queries using the DDSketch library.

#15049

# Spectator histogram

A new histogram extension is available as a community contribution. The Spectator-based histogram extension (druid-spectator-histogram) provides approximate histogram aggregators and percentile post-aggregators based on Spectator fixed-bucket histograms.

#15340

# Delta Lake

A new Delta Lake extension is available as a community contribution. The Delta Lake extension...


Druid 28.0.1

21 Dec 11:24

Description

Apache Druid 28.0.1 is a patch release that fixes some issues in the 28.0.0 release. See the complete set of changes for additional details.

# Notable Bug fixes

  • #15405 Makes the start-druid script more robust
  • #15402 Fixes a query caching bug for groupBy queries with multiple post-aggregation metrics
  • #15430 Fixes task failures during a rolling upgrade caused by the new task action RetrieveSegmentsToReplaceAction, which is not available on the Overlord until the Overlord itself is upgraded
  • #15500 Fixes a bug in NullFilter, which is commonly used in the newly default SQL-compatible mode

# Credits

Thanks to everyone who contributed to this release!

@cryptoe
@gianm
@kgyrtkirk
@LakshSingla
@vogievetsky

Druid 28.0.0

15 Nov 10:19

Apache Druid 28.0.0 contains over 420 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 57 contributors.

See the complete set of changes for additional details, including bug fixes.

Review the upgrade notes and incompatible changes before you upgrade to Druid 28.0.0.

# Important features, changes, and deprecations

In Druid 28.0.0, we have made substantial improvements to querying to make the system more ANSI SQL compatible. This includes changes in handling NULL and boolean values as well as boolean logic. At the same time, the Apache Calcite library has been upgraded to the latest version. While we have documented known query behavior changes, please read the upgrade notes section carefully. Test your application before rolling out to broad production scenarios while closely monitoring the query status.

# SQL compatibility

Druid continues to make SQL query execution more consistent with how standard SQL behaves. However, there are feature flags available to restore the old behavior if needed.

# Three-valued logic

Druid native filters now observe SQL three-valued logic (true, false, or unknown) instead of Druid's classic two-state logic by default, when the following default settings apply:

  • druid.generic.useThreeValueLogicForNativeFilters = true
  • druid.expressions.useStrictBooleans = true
  • druid.generic.useDefaultValueForNull = false

#15058
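To illustrate the practical effect of three-valued logic (the datasource t and column country are hypothetical): comparisons with NULL evaluate to unknown, so NULL rows match neither a predicate nor its negation.

```sql
-- Rows where country IS NULL match neither of these:
SELECT COUNT(*) FROM t WHERE country = 'FR'
SELECT COUNT(*) FROM t WHERE country <> 'FR'

-- To count NULL rows alongside the negation, test for NULL explicitly:
SELECT COUNT(*) FROM t WHERE country <> 'FR' OR country IS NULL
```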

# Strict booleans

druid.expressions.useStrictBooleans is now enabled by default.
Druid now handles booleans strictly using 1 (true) or 0 (false).
Previously, true and false could be represented either as true and false or as 1 and 0, respectively.
In addition, Druid now returns a null value for boolean comparisons like true && NULL.

If you don't explicitly configure this property in runtime.properties, clusters now use LONG types for any ingested boolean values and in the output of boolean functions for transformations and query time operations.
For more information, see SQL compatibility in the upgrade notes.

#14734

# NULL handling

druid.generic.useDefaultValueForNull is now disabled by default.
Druid now differentiates between empty records and null records.
Previously, Druid might treat empty records as empty or null.
For more information, see SQL compatibility in the upgrade notes.

#14792

# SQL planner improvements

Druid uses Apache Calcite for SQL planning and optimization. Starting in Druid 28.0.0, the Calcite version has been upgraded from 1.21 to 1.35. This upgrade brings in many bug fixes in SQL planning from Calcite.

# Dynamic parameters

As part of the Calcite upgrade, the behavior of type inference for dynamic parameters has changed. To avoid type inference issues, explicitly CAST all dynamic parameters as a specific data type in SQL queries. For example, use:

SELECT (1 * CAST(? AS DOUBLE))/2 AS tmp

Do not use:

SELECT (1 * ?)/2 AS tmp

# Async query and query from deep storage

Query from deep storage is no longer an experimental feature. When you query from deep storage, more data is available for queries without having to scale your Historical services to accommodate more data. To benefit from the space saving that query from deep storage offers, configure your load rules to unload data from your Historical services.

# Support for multiple result formats

Query from deep storage now supports multiple result formats.
Previously, the /druid/v2/sql/statements/ endpoint only supported results in the object format. Now, results can be written in any format specified in the resultFormat parameter.
For more information on result parameters supported by the Druid SQL API, see Responses.

#14571

# Broadened access for queries from deep storage

Users with the STATE permission can interact with status APIs for queries from deep storage. Previously, only the user who submitted the query could use those APIs. This enables the web console to monitor the running status of the queries. Users with the STATE permission can access the query results.

#14944

# MSQ queries for realtime tasks

The MSQ task engine can now include real time segments in query results. To do this, use the includeSegmentSource context parameter and set it to REALTIME.

#15024
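For example, a request payload carrying this context parameter might look like the following sketch (the query itself is a placeholder):

```json
{
  "query": "SELECT channel, COUNT(*) FROM wikipedia GROUP BY channel",
  "context": {
    "includeSegmentSource": "REALTIME"
  }
}
```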

# MSQ support for UNION ALL queries

You can now use the MSQ task engine to run UNION ALL queries with UnionDataSource.

#14981

# Ingest from multiple Kafka topics to a single datasource

You can now ingest streaming data from multiple Kafka topics to a datasource using a single supervisor.
You configure the topics for the supervisor spec using a regex pattern as the value for topicPattern in the IO config. If you add new topics to Kafka that match the regex, Druid automatically starts ingesting from those new topics.

If you enable multi-topic ingestion for a datasource, downgrading will cause the Supervisor to fail.
For more information, see Stop supervisors that ingest from multiple Kafka topics before downgrading.

#14424
#14865
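A sketch of the relevant portion of a Kafka supervisor IO config, with topicPattern set in place of a fixed topic (the pattern and broker address are illustrative):

```json
{
  "type": "kafka",
  "consumerProperties": {
    "bootstrap.servers": "localhost:9092"
  },
  "topicPattern": "metrics-.*"
}
```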

# SQL UNNEST and ingestion flattening

The UNNEST function is no longer experimental.

Druid now supports UNNEST in SQL-based batch ingestion and query from deep storage, so you can flatten arrays easily. For more information, see UNNEST and Unnest arrays within a column.

You no longer need to include the context parameter enableUnnest: true to use UNNEST.

#14886

# Recommended syntax for SQL UNNEST

The recommended syntax for SQL UNNEST has changed. We recommend using CROSS JOIN instead of commas for most queries to prevent issues with precedence. For example, use:

SELECT column_alias_name1 FROM datasource CROSS JOIN UNNEST(source_ex...

Druid 27.0.0

11 Aug 09:28

Apache Druid 27.0.0 contains over 316 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 50 contributors.

See the complete set of changes for additional details, including bug fixes.

Review the upgrade notes and incompatible changes before you upgrade to Druid 27.0.0.

# Highlights

# New Explore view in the web console (experimental)

The Explore view is a simple, stateless, SQL-backed data exploration view in the web console. It lets users explore data in Druid with point-and-click interaction and visualizations (instead of writing SQL and looking at a table). This can provide faster time-to-value for a user new to Druid and can allow a Druid veteran to quickly chart some data that they care about.


The Explore view is accessible from the More (...) menu in the header.

#14540

# Query from deep storage (experimental)

Druid now supports querying segments that are stored only in deep storage. When you query from deep storage, more data is available for queries without necessarily having to scale your Historical processes to accommodate it. To take advantage of the potential storage savings, make sure you configure your load rules to not load all your segments onto Historical processes.

Note that at least one segment of a datasource must be loaded onto a Historical process so that the Broker can plan the query. It can be any segment though.

For more information, see the following:

#14416 #14512 #14527

# Schema auto-discovery and array column types

Type-aware schema auto-discovery is now generally available. Druid can determine the schema for the data you ingest rather than you having to manually define the schema.

As part of the type-aware schema discovery improvements, array column types are now generally available. Druid can determine the column types for your schema and assign them to these array column types when you ingest data using type-aware schema auto-discovery with the auto column type.

For more information about this feature, see the following:

# Smart segment loading

The Coordinator is now much more stable and user-friendly. In the new smartSegmentLoading mode, it dynamically computes values for several configs which maximize performance.

The Coordinator can now prioritize load of more recent segments and segments that are completely unavailable over load of segments that already have some replicas loaded in the cluster. It can also re-evaluate decisions taken in previous runs and cancel operations that are no longer needed. Moreover, move operations started by segment balancing do not compete with the load of unavailable segments, thus reducing the reaction time for changes in the cluster and speeding up segment assignment decisions.

Additionally, leadership changes have less impact now, and the Coordinator doesn't get stuck even if re-election happens while a Coordinator run is in progress.

Lastly, the cost balancer strategy performs much better now and is capable of moving more segments in a single Coordinator run. These improvements were made by borrowing ideas from the cachingCost strategy. We recommend using cost instead of cachingCost since cachingCost is now deprecated.

For more information, see the following:

#13197 #14385 #14484

# New query filters

Druid now supports the following filters:

  • Equality: Use in place of the selector filter. It never matches null values.
  • Null: Match null values. Use in place of the selector filter.
  • Range: Filter on ranges of dimension values. Use in place of the bound filter. It never matches null values.

Note that Druid's SQL planner uses these new filters in place of their older counterparts by default whenever druid.generic.useDefaultValueForNull=false or if sqlUseBoundAndSelectors is set to false on the SQL query context.

Unlike the previous selector and bound filters, which only support strings, you can use the new filters to filter equality and ranges on ARRAY columns as well.

For more information, see Query filters.

#14542
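As a sketch, the native forms of the three filters look roughly like this (the column names are illustrative; consult the Query filters reference for the exact fields):

```json
[
  { "type": "equality", "column": "country", "matchValueType": "STRING", "matchValue": "FR" },
  { "type": "null", "column": "country" },
  { "type": "range", "column": "price", "matchValueType": "DOUBLE", "lower": 10.0, "upper": 20.0 }
]
```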

# Guardrail for subquery results

Users can now add a guardrail to prevent a subquery's results from exceeding a set number of bytes by setting druid.server.http.maxSubqueryBytes in the Broker's config or maxSubqueryBytes in the query context. This guardrail is recommended over row-based limiting.

This feature is experimental for now and defaults back to row-based limiting in case it fails to get the accurate size of the results consumed by the query.

#13952

# Added a new OSHI system monitor

Added a new OSHI system monitor (OshiSysMonitor) to replace SysMonitor. The new monitor has wider support for different machine architectures, including ARM instances. We recommend switching to the new monitor. SysMonitor is now deprecated and will be removed in a future release.

#14359

# Java 17 support

Druid now fully supports Java 17.

#14384

# Hadoop 2 deprecated

Support for Hadoop 2 is now deprecated. It will be removed in a future release.

For more information, see the upgrade notes.

# Additional features and improvements

# SQL-based ingestion

# Improved query planning behavior

Druid now fails query planning if a CLUSTERED BY column contains descending order.
Previously, queries would successfully plan if any CLUSTERED BY columns contained descending order.

The MSQ fault, InsertCannotOrderByDescending, is deprecated. An INSERT or REPLACE query containing a CLUSTERED BY expression cannot be in descending order. Druid's segment generation code only supports ascending order. Instead of the fault, Druid now throws a query ValidationException.

#14436 #14370

# Improved segment sizes

The default clusterStatisticsMergeMode is now `S...


Druid 26.0.0

24 May 00:40

Apache Druid 26.0.0 contains over 390 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 65 contributors.

See the complete set of changes for additional details.

Review the upgrade notes and incompatible changes before you upgrade to Druid 26.0.0.

# Highlights

# Auto type column schema (experimental)

A new "auto" type column schema and indexer has been added to native ingestion as the next logical iteration of the nested column functionality. This automatic type column indexer produces the most appropriate column for the given inputs: STRING, ARRAY<STRING>, LONG, ARRAY<LONG>, DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json> columns, all sharing a common 'nested' format.

All columns produced by 'auto' have indexes to aid in fast filtering (unlike classic LONG and DOUBLE columns) and use cardinality based thresholds to attempt to only utilize these indexes when it is likely to actually speed up the query (unlike classic STRING columns).

COMPLEX<json> columns produced by this 'auto' indexer store arrays of simple scalar types differently than their 'json' (v4) counterparts, storing them as ARRAY typed columns. This means that the JSON_VALUE function can now extract entire arrays, for example JSON_VALUE(nested, '$.array' RETURNING BIGINT ARRAY). There is no change with how arrays of complex objects are stored at this time.

This improvement also adds completely new functionality to Druid: ARRAY typed columns, which, unlike classic multi-value STRING columns, behave with ARRAY semantics. These columns can currently only be created via the 'auto' type indexer when all values are arrays with the same type of elements.

An array data type is a data type that allows you to store multiple values in a single column of a database table. Arrays are typically used to store sets of related data that can be easily accessed and manipulated as a group.

This release adds support for storing arrays of primitive values such as ARRAY<STRING>, ARRAY<LONG>, and ARRAY<DOUBLE> as specialized nested columns instead of breaking them into separate element columns.

#14014 #13803

These changes affect two additional new features available in 26.0: schema auto-discovery and unnest.

# Schema auto-discovery (experimental)

We’re adding schema auto-discovery with type inference to Druid. With this feature, the data type of each incoming field is detected when schema is available. For incoming data which may contain added, dropped, or changed fields, you can choose to reject the nonconforming data (“the database is always correct - rejecting bad data!”), or you can let schema auto-discovery alter the datasource to match the incoming data (“the data is always right - change the database!”).

Schema auto-discovery is recommended for new use cases and ingestions. For existing use cases, be careful switching to schema auto-discovery because Druid will ingest array-like values (e.g. ["tag1", "tag2"]) as ARRAY<STRING> type columns instead of multi-value (MV) strings, which could cause issues in downstream apps relying on MV behavior. Hold off switching until an official migration path is available.

To use this feature, set spec.dataSchema.dimensionsSpec.useSchemaDiscovery to true in your task or supervisor spec or, if using the data loader in the console, uncheck the Explicitly define schema toggle on the Configure schema step. Druid can infer the entire schema or some of it if you explicitly list dimensions in your dimensions list.

Schema auto-discovery is available for native batch and streaming ingestion.

#13653 #13672 #14076

# UNNEST arrays (experimental)

Part of what’s cool about UNNEST is how it allows a wider range of operations that weren’t possible on Array data types. You can unnest arrays with either the UNNEST function (SQL) or the unnest datasource (native).

Unnest converts nested arrays or tables into individual rows. The UNNEST function is particularly useful when working with complex data types that contain nested arrays, such as JSON.

For example, suppose you have a table called "orders" with a column called "items" that contains an array of products for each order. You can use unnest to extract the individual products ("each_item") like in the following SQL example:

SELECT order_id, each_item FROM orders, UNNEST(items) as unnested(each_item)

This produces a result set with one row for each item in each order, with columns for the order ID and the individual item.

Note the comma after the left table/datasource (orders in the example). It is required.

#13268 #13943 #13934 #13922 #13892 #13576 #13554 #13085

# Sort-merge join and hash shuffle join for MSQ

Multi-stage queries can now use a sort-merge join algorithm. With this algorithm, each pairwise join is planned into its own stage with two inputs. This approach is generally less performant, but more scalable, than broadcast joins.

Set the context parameter sqlJoinAlgorithm to sortMerge to use this method, or omit it to perform broadcast joins (the default). Broadcast hash joins are similar to how native join queries are executed.

#13506
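Selecting the algorithm is just a query context parameter, for example:

```json
{
  "sqlJoinAlgorithm": "sortMerge"
}
```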

# Storage improvements on dictionary compression

Switching to front-coded dictionary compression (experimental) can save up to 30% in storage size with little to no impact on query performance.

This release further improves the frontCoded type of stringEncodingStrategy on indexSpec with a new segment format version, which typically has faster read speeds and reduced segment size. This improvement is backwards incompatible with Druid 25.0. Added a new formatVersion option, which defaults to the current version 0. Set formatVersion to 1 to start using the new version.

#13988 #13996

Additionally, overall storage size, particularly with using larger buckets, has been improved.

#13854

# Additional features and improvements

# MSQ task engine

# Array-valued parameters for SQL queries

Added support for array-valued parameters for SQL queries. You can now reuse the same SQL for every ingestion, only passing in a different set of input files as query parameters.

#13627

# EXTEND clause for the EXTERN functions

You can now use an EXTEND clause to provide a list of column definitions for your source data in standard SQL format.

The web console now defaults to using the EXTEND clause syntax for all queries auto-generated in the web console. This means that SQL-based ingestion statements generated by the web console in Druid 26 (such as from the SQL based data loader) will not work in earlier versions of Druid.

#13627 #13985

# MSQ fault tolerance

Added the ability for the MSQ controller task to retry worker tasks in case of failures. To enable this, pass "faultTolerance": true in the query context.

[#13353](https...


Druid 25.0.0

04 Jan 07:43

Apache Druid 25.0.0 contains over 300 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 51 contributors.

See the complete set of changes for additional details.

# Highlights

# MSQ task engine now production ready

The multi-stage query (MSQ) task engine used for SQL-based ingestion is now production ready. Use it for any supported workloads. For more information, see the following pages:

# Simplified Druid deployments

The new start-druid script greatly simplifies deploying any combination of Druid services on a single server. It comes pre-packaged with the required configs and can be used to launch a fully functional Druid cluster simply by invoking ./start-druid. For experienced Druids, it also gives complete control over the runtime properties and JVM arguments to have a cluster that exactly fits your needs.

The start-druid script deprecates the existing profiles such as start-micro-quickstart and start-nano-quickstart. These profiles may be removed in future releases. For more information, see Single server deployment.

# String dictionary compression (experimental)

Added support for front coded string dictionaries for smaller string columns, leading to reduced segment sizes with only minor performance penalties for most Druid queries.

This can be enabled by setting IndexSpec.stringDictionaryEncoding to {"type": "frontCoded", "bucketSize": 4}, where bucketSize is any power of 2 less than or equal to 128. Setting this property instructs indexing tasks to write segments using compressed dictionaries of the specified bucket size.

Any segment written using string dictionary compression is not readable by older versions of Druid.

For more information, see Front coding.

#12277

# Kubernetes-native tasks

Druid can now use Kubernetes to launch and manage tasks, eliminating the need for middle managers.

To use this feature, enable the druid-kubernetes-overlord-extensions in the extensions load list for your Overlord process.

#13156

# Hadoop-3 compatible binary

Druid now comes packaged as a dedicated binary for Hadoop-3 users, which contains Hadoop-3 compatible jars. If you do not use Hadoop-3 with your Druid cluster, you may continue using the classic binary.

# Multi-stage query (MSQ) task engine

# MSQ enabled for Docker

The MSQ task engine is now enabled in Docker by default.

#13069

# Query history

Multi-stage queries no longer show up in the Query history dialog. They are still available in the Recent query tasks panel.

# Limit on CLUSTERED BY columns

When using the MSQ task engine to ingest data, the number of columns that can be passed in the CLUSTERED BY clause is now limited to 1500.

#13352

# Support for string dictionary compression

The MSQ task engine supports front-coding of string dictionaries for better compression. This can be enabled for INSERT or REPLACE statements by setting indexSpec to a valid JSON string in the query context.

#13275

# Sketch merging mode

Workers can now gather key statistics, used to generate partition boundaries, either sequentially or in parallel. Set clusterStatisticsMergeMode to PARALLEL, SEQUENTIAL or AUTO in the query context to use the corresponding sketch merging mode. For more information, see Sketch merging mode.

#13205

# Performance and operational improvements

  • Error messages: For disallowed MSQ warnings of certain types, the warning is now surfaced as the error. #13198
  • Secrets: For tasks containing SQL with sensitive keys, Druid now masks the keys in logs with the help of regular expressions. #13231
  • Downsampling accuracy: MSQ task engine now uses the number of bytes instead of number of keys when downsampling data. #12998
  • Memory usage: When determining partition boundaries, the heap footprint of internal sketches used by MSQ is now capped at 10% of available memory or 300 MB, whichever is lower. Previously, the cap was strictly 300 MB. #13274
  • Task reports: Added fields pendingTasks and runningTasks to the worker report. See Query task status information for related web console changes. #13263

# Querying

# Async reads for JDBC

Prevented JDBC timeouts on long queries by returning empty batches when a batch fetch takes too long. Uses an async model to run the result fetch concurrently with JDBC requests.

#13196

# Improved algorithm to check values of an IN filter

To accommodate large value sets arising from large IN filters or from joins pushed down as IN filters, Druid now uses a sorted merge algorithm for merging the set and dictionary for larger values.

#13133
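The idea behind a sorted merge check can be illustrated with a minimal sketch (illustrative only, not Druid's actual implementation): walk the sorted filter values and the sorted column dictionary together, collecting the dictionary IDs of matching values in a single pass instead of performing one lookup per filter value.

```python
def sorted_merge_match(filter_values, dictionary):
    """Return dictionary IDs whose values appear in filter_values.

    Both inputs must be sorted. The merge advances through each list
    once, so the cost is O(len(filter_values) + len(dictionary))
    rather than one per-value dictionary lookup.
    """
    matches = []
    i = j = 0
    while i < len(filter_values) and j < len(dictionary):
        if filter_values[i] == dictionary[j]:
            matches.append(j)   # record the matching dictionary ID
            i += 1
            j += 1
        elif filter_values[i] < dictionary[j]:
            i += 1              # filter value absent from dictionary
        else:
            j += 1              # dictionary value absent from filter
    return matches
```
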

# Enhanced query context security

Added the following configuration properties that refine the query context security model controlled by druid.auth.authorizeQueryContextParams:

  • druid.auth.unsecuredContextKeys: A JSON list of query context keys that do not require a security check.
  • druid.auth.securedContextKeys: A JSON list of query context keys that do require a security check.

If both are set, unsecuredContextKeys acts as a list of exceptions to securedContextKeys.

#13071
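As a sketch, these properties might be combined in common.runtime.properties as follows. The property names come from the note above; the specific context key listed is a hypothetical example:

```properties
druid.auth.authorizeQueryContextParams=true
# Require a security check for all context keys except sqlQueryId
druid.auth.unsecuredContextKeys=["sqlQueryId"]
```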

# HTTP response headers

The HTTP response for a SQL query now correctly sets response headers, the same as for a native query.

#13052

# Metrics

# New metrics

The following metrics have been newly added. For more details, see the complete list of Druid metrics.

# Batched segment allocation

These metrics pertain to batched segment allocation.

| Metric | Description | Dimensions |
|---|---|---|
| task/action/batch/runTime | Milliseconds taken to execute a batch of task actions. Currently only being emitted for batched segmentAllocate actions. | dataSource, taskActionType=segmentAllocate |
| task/action/batch/queueTime | Milliseconds spent by a batch of task actions in queue. Currently only being emitted for batched segmentAllocate actions. | dataSource, taskActionType=segmentAllocate |
| task/action/batch/size | Number of task actions in a batch that was executed during the emission period. Currently only being emitted for batched segmentAllocate actions. | dataSource, taskActionType=segmentAllocate |
| task/action/batch/attempts | Number of execution attempts for a single batch of task actions. Currently only being emitted for batched segmentAllocate actions. | dataSource, taskActionType=segmentAllocate |
| task/action/success/count | Number of task actions that were executed successfully during the emission period. Currently only being emitted for batched segmentAllocate actions. | dataSource, taskId, taskType, taskActionType=segmentAllocate |
| task/action/failed/count | Number of task actions that failed during the emission period. Currently only being emitted for batched segmentAllocate actions. | dataSource, taskId, `tas... |
Read more

Druid 24.0.2

22 Dec 02:12

Apache Druid 24.0.2 is a bug fix release that fixes some issues in the 24.0.1 release.
See the complete set of changes for additional details.

# Bug fixes

#13138 to fix dependency errors while launching a Hadoop task.

# Credits

@kfaraz
@LakshSingla

Druid 24.0.1

22 Nov 18:23

Apache Druid 24.0.1 is a bug fix release that fixes some issues in the 24.0 release.
See the complete set of changes for additional details.

# Notable Bug fixes

#13214 to fix SQL planning when using the JSON_VALUE function.
#13297 to fix values that match a range filter on nested columns.
#13077 to fix detection of nested objects while generating an MSQ SQL in the web-console.
#13172 to correctly handle overlord leader election even when tasks cannot be reacquired.
#13259 to fix memory leaks from SQL statement objects.
#13273 to fix overlord API failures by de-duplicating task entries in memory.
#13049 to fix a race condition while processing query context.
#13151 to fix assertion error in SQL planning.

# Credits

Thanks to everyone who contributed to this release!

@abhishekagarwal87
@AmatyaAvadhanula
@clintropolis
@gianm
@kfaraz
@LakshSingla
@paul-rogers
@vogievetsky

# Known issues

  • Hadoop ingestion does not work with custom extension config due to injection errors
    (fixed in #13138)

Druid 24.0.0

16 Sep 14:48

Apache Druid 24.0.0 contains over 300 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 67 contributors. See the complete set of changes for additional details.

# Major version bump

Starting with this release, we have dropped the leading 0 from the release version and promoted all other digits one place to the left. Druid is now at major version 24, a jump up from the prior 0.23.0 release. In terms of backward-compatibility or breaking changes, this release is not significantly different than other previous major releases such as 0.23.0 or 0.22.0. We are continuing with the same policy as we have used in prior releases: minimizing the number of changes that require special attention when upgrading, and calling out any that do exist in the release notes. For this release, please refer to the Upgrading to 24.0.0 section for a list of backward-incompatible changes in this release.

# New Features

# Multi-stage query task engine

SQL-based ingestion for Apache Druid uses a distributed multi-stage query architecture, which includes a query engine called the multi-stage query task engine (MSQ task engine). The MSQ task engine extends Druid's query capabilities, so you can write queries that reference external data as well as perform ingestion with SQL INSERT and REPLACE. Essentially, you can perform SQL-based ingestion instead of using JSON ingestion specs that Druid's native ingestion uses. In addition to the easy-to-use syntax, the SQL interface lets you perform transformations that involve multiple shuffles of data.

SQL-based ingestion using the multi-stage query task engine is recommended for batch ingestion starting in Druid 24.0.0. Native batch and Hadoop-based ingestion continue to be supported as well. We recommend you review the known issues and test the feature in a staging environment before rolling out in production. Using the multi-stage query task engine with plain SELECT statements (not INSERT ... SELECT or REPLACE ... SELECT) is experimental.

If you're upgrading from an earlier version of Druid or you're using Docker, you'll need to add the druid-multi-stage-query extension to druid.extensions.loadList in your common.runtime.properties file.

For more information, refer to the Overview documentation for SQL-based ingestion.

#12524
#12386
#12523
#12589
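As a sketch, an MSQ ingestion statement that reads external JSON data with the EXTERN function might look like the following. The target table name, URI, and column names are hypothetical; the EXTERN arguments are an input source spec, an input format spec, and a row signature:

```sql
INSERT INTO wikipedia_demo
SELECT
  TIME_PARSE("timestamp") AS __time,
  page
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://example.com/wikipedia.json.gz"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}]'
  )
)
PARTITIONED BY DAY
```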

# Nested columns

Druid now supports directly storing nested data structures in a newly added COMPLEX<json> column type. COMPLEX<json> columns store a copy of the structured data in JSON format as well as specialized internal columns and indexes for nested literal values—STRING, LONG, and DOUBLE types. An optimized virtual column allows Druid to read and filter these values at speeds consistent with standard Druid LONG, DOUBLE, and STRING columns.

Newly added Druid SQL functions, native JSON functions, and a virtual column allow you to extract, transform, and create COMPLEX<json> values at query time. You can also use the JSON functions in INSERT and REPLACE statements in SQL-based ingestion, or in a transformSpec in native ingestion as an alternative to using a flattenSpec object to "flatten" nested data for ingestion.

See SQL JSON functions, native JSON functions, Nested columns, virtual columns, and the feature summary for more detail.

#12753
#12714
#12753
#12920
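As a sketch, the JSON functions can be used at query time like this. The table and column names are hypothetical; JSON_VALUE extracts a literal value and JSON_QUERY extracts a nested object from a COMPLEX<json> column:

```sql
SELECT
  JSON_VALUE(shipTo, '$.address.country') AS country,
  JSON_QUERY(shipTo, '$.address')         AS address_obj
FROM orders
WHERE JSON_VALUE(shipTo, '$.address.country') = 'US'
```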

# Updated Java support

Java 11 is fully supported and is no longer experimental. Java 17 support is improved.

#12839

# Query engine updates

# Updated column indexes and query processing of filters

Reworked column indexes to be far more flexible, which will eventually allow Druid to model a wide range of index types. Added machinery to build the filters that use the updated indexes, while also allowing other column implementations to provide the built-in index types as adapters, so that the current set of Druid filters can make use of their indexes.

#12388

# Time filter operator

You can now use the Druid SQL operator TIME_IN_INTERVAL to filter query results based on time. Prefer TIME_IN_INTERVAL over the SQL BETWEEN operator to filter on time. For more information, see Date and time functions.

#12662
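As a sketch, a time filter using the new operator might look like this. The table and column names are hypothetical; the second argument is an ISO 8601 interval:

```sql
SELECT channel, COUNT(*) AS edits
FROM wikipedia
WHERE TIME_IN_INTERVAL(__time, '2022-06-27/2022-06-28')
GROUP BY channel
```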

# Null values and the "in" filter

If a values array contains null, the "in" filter matches null values. This differs from the SQL IN filter, which does not match null values.

For more information, see Query filters and SQL data types.
#12863
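As a sketch, a native "in" filter whose values array contains null might look like the following (the dimension name and values are hypothetical). This filter matches rows where country is null, "US", or "CA", whereas a SQL IN list would never match the null rows:

```json
{
  "type": "in",
  "dimension": "country",
  "values": [null, "US", "CA"]
}
```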

# Virtual columns in search queries

Previously, a search query could only search on dimensions that existed in the data source. Search queries now support virtual columns as a parameter in the query.

#12720

# Optimize simple MIN / MAX SQL queries on __time

Simple queries like select max(__time) from ds now run as timeBoundary queries to take advantage of the time-dimension sorting in a segment. You can set a feature flag to enable this feature.

#12472
#12491

# String aggregation results

The first/last string aggregator now compares based on values only. Previously, the first/last string aggregator's values were compared based on the __time column first and then on values.

If you have existing queries and want to continue using both the __time column and values, update your queries to use ORDER BY MAX(timeCol).

#12773

# Reduced allocations due to Jackson serialization

Introduced and implemented new helper functions in JacksonUtils to enable reuse of
SerializerProvider objects.

Additionally, disabled backwards compatibility for map-based rows in the GroupByQueryToolChest by default, which eliminates the need to copy the heavyweight ObjectMapper. Introduced a configuration option to allow administrators to explicitly enable backwards compatibility.

#12468

# Updated IPAddress Java library

Added a new IPAddress Java library dependency to handle IP addresses. The library includes IPv6 support. Additionally, migrated IPv4 functions to use the new library.

#11634

# Query performance improvements

Optimized SQL operations and functions as follows:

  • Vectorized numeric latest aggregators (#12439)
  • Optimized isEmpty() and equals() on RangeSets (#12477)
  • Optimized reuse of Yielder objects (#12475)
  • Operations on numeric columns with indexes are now faster (#12830)
  • Optimized GroupBy by reducing allocations. Reduced allocations by reusing entry and key holders (#12474)
  • Added a vectorized version of string last aggregator (#12493)
  • Added Direct UTF-8 access for IN filters (#12517)
  • Enabled virtual columns to cache their outputs in case Druid calls them multiple times on the same underlying row (#12577)
  • Druid now rewrites a join as a filter when possible in IN joins (#12225)
  • Added automatic sizing for GroupBy dictionaries (#12763)
  • Druid now distributes JDBC connections more evenly amongst brokers (#12817)

# Streaming ingestion

# Kafka consumers

Previously, consumers that were registered and used for ingestion persisted until Kafka deleted the...

Read more