Changelog

2024.4.2¶

Highlights¶

Trivial Merge Implementation¶

The Query Optimizer will inspect queries to determine whether a merge(...) or groupby(...).apply(...) requires a shuffle. A shuffle can be avoided if the DataFrame was shuffled on the same columns in a previous step and no operation in between changed the partitioning layout or the relevant values in each partition.

>>> result = df.merge(df2, on="a")
>>> result = result.merge(df3, on="a")

The Query Optimizer will identify that result was already shuffled on "a" and will therefore shuffle only df3 in the second merge operation before doing a blockwise merge.
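The decision can be illustrated with a small pure-Python model. This is only a sketch of the idea, not the optimizer's implementation: the `Frame` class, its `partitioned_on` field, and the `plan_merge` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Frame:
    """Hypothetical stand-in for a DataFrame expression."""

    name: str
    # Columns the frame is currently hash-partitioned on, if known.
    partitioned_on: Optional[Tuple[str, ...]] = None


def plan_merge(left: Frame, right: Frame, on: str):
    """Return the merged frame and the names of the inputs that need a shuffle."""
    key = (on,)
    # Only inputs that are not already partitioned on the join key are shuffled.
    to_shuffle = [f.name for f in (left, right) if f.partitioned_on != key]
    merged = Frame(f"merge({left.name}, {right.name})", partitioned_on=key)
    return merged, to_shuffle


df, df2, df3 = Frame("df"), Frame("df2"), Frame("df3")
result, first_shuffles = plan_merge(df, df2, on="a")
result, second_shuffles = plan_merge(result, df3, on="a")
print(first_shuffles)   # ['df', 'df2']
print(second_shuffles)  # ['df3']
```

In this model the first merge shuffles both inputs, while the second merge reuses the existing partitioning of result and shuffles only df3.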

Auto-partitioning in `read_parquet`¶

The Query Optimizer will automatically repartition datasets read from Parquet files if individual partitions are too small. This reduces the number of partitions and, consequently, the size of the task graph.

The Optimizer aims to produce partitions of at least 75MB and will combine multiple files if necessary to reach this threshold. The value can be configured with:

>>> dask.config.set({"dataframe.parquet.minimum-partition-size": 100_000_000})

The value is given in bytes. The default threshold is relatively conservative to avoid memory issues on worker nodes with a relatively small amount of memory per thread.
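To make the effect of the threshold concrete, here is a minimal sketch of combining files until a minimum partition size is reached. It is an illustration only and does not reproduce the optimizer's actual strategy: the function name `combine_files`, the greedy grouping, and the default of 75_000_000 bytes are assumptions made for this example.

```python
def combine_files(file_sizes, minimum_partition_size=75_000_000):
    """Greedily group consecutive files until each group reaches the minimum size.

    Returns a list of partitions, each a list of file indices.
    """
    partitions, current, current_size = [], [], 0
    for index, size in enumerate(file_sizes):
        current.append(index)
        current_size += size
        if current_size >= minimum_partition_size:
            partitions.append(current)
            current, current_size = [], 0
    if current:
        # Leftover files that never reached the threshold form a final partition.
        partitions.append(current)
    return partitions


sizes = [20_000_000, 30_000_000, 40_000_000, 80_000_000, 10_000_000]
groups = combine_files(sizes)
print(groups)  # [[0, 1, 2], [3], [4]]
```

Raising the threshold produces fewer, larger partitions: with `minimum_partition_size=100_000_000` the same files are grouped as `[[0, 1, 2, 3], [4]]`.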

2024.4.1¶

This is a minor bugfix release that fixes an error when importing dask.dataframe with Python 3.11.9.

See GH#11035 and GH#11039 from Richard (Rick) Zamora for details.

2024.4.0¶

Highlights¶

Query planning fixes¶

This release contains a variety of bugfixes in Dask DataFrame’s new query planner.

GPU metric dashboard fixes¶

GPU memory and utilization dashboard functionality has been restored. Previously these plots were unintentionally left blank.

See GH#8572 from Benjamin Zaitlen for details.

2024.3.1¶

This is a minor release that primarily demotes an exception to a warning if dask-expr is not installed when upgrading.

2024.3.0¶

Released on March 11, 2024

Highlights¶

Query planning¶

This release enables query planning by default for all users of dask.dataframe.

The query planning functionality represents a rewrite of the DataFrame using dask-expr. This is a drop-in replacement and we expect that most users will not have to adjust any of their code. Any feedback can be reported on the Dask issue tracker or on the query planning feedback issue.

If you encounter any issues, you can still opt out by setting:

>>> import dask
>>> dask.config.set({'dataframe.query-planning': False})

Sunset of Pandas 1.X support¶

The new query planning backend requires at least pandas 2.0. This pandas version will be installed automatically if you install from conda or if you install dask[complete] or dask[dataframe] from pip.

The legacy DataFrame implementation still supports pandas 1.X if you install dask without extras.

Additional changes

Update tests for pandas nightlies with dask-expr (GH#10989) Patrick Hoefler
Use dask-expr docs as main reference docs for DataFrames (GH#10990) Patrick Hoefler
Adjust from_array test for dask-expr (GH#10988) Patrick Hoefler
Unskip to_delayed test (GH#10985) Patrick Hoefler
Bump conda-incubator/setup-miniconda from 3.0.1 to 3.0.3 (GH#10978)
Fix bug when enabling dask-expr (GH#10977) Patrick Hoefler
Update docs and requirements for dask-expr and remove warning (GH#10976) Patrick Hoefler
Fix numpy 2 compatibility with ogrid usage (GH#10929) David Hoese
Turn on dask-expr switch (GH#10967) Patrick Hoefler
Force initializing the random seed with the same byte order interpret… (GH#10970) Elliott Sales de Andrade
Use correct encoding for line terminator when reading CSV (GH#10972) Elliott Sales de Andrade
perf: do not unnecessarily recalculate input/output indices in _optimize_blockwise (GH#10966) Lindsey Gray
Adjust tests for string option in dask-expr (GH#10968) Patrick Hoefler
Adjust tests for array conversion in dask-expr (GH#10973) Patrick Hoefler
TST: Fix sizeof tests on 32bit (GH#10971) Elliott Sales de Andrade
TST: Add missing skip for pyarrow (GH#10969) Elliott Sales de Andrade
Implement dask-expr conversion for bag.to_dataframe (GH#10963) Patrick Hoefler
Fix dask-expr import errors (GH#10964) Miles
Clean up Sphinx documentation for dask.config (GH#10959) crusaderky
Use stdlib importlib.metadata on Python 3.12+ (GH#10955) wim glenn
Cast partitioning_index to smaller size (GH#10953) Florian Jetter
Reuse dask/dask groupby Aggregation (GH#10952) Patrick Hoefler
ensure tokens on futures are unique (GH#8569) Florian Jetter
Don’t obfuscate fine performance metrics failures (GH#8568) crusaderky
Mark shuffle fast tasks in dask-expr (GH#8563) crusaderky
Weigh gilknocker Prometheus metric by duration (GH#8558) crusaderky
Fix scheduler transition error on memory->erred (GH#8549) Hendrik Makait
Make CI happy again (GH#8560) Miles
Fix flaky test_Future_release_sync (GH#8562) crusaderky
Fix flaky test_flaky_connect_recover_with_retry (GH#8556) Hendrik Makait
typing tweaks in scheduler.py (GH#8551) crusaderky
Bump conda-incubator/setup-miniconda from 3.0.2 to 3.0.3 (GH#8553)
Install dask-expr on CI (GH#8552) Hendrik Makait
P2P shuffle can drop partition column before writing to disk (GH#8531) Hendrik Makait
Better logging for worker removal (GH#8517) crusaderky
Add indicator support to merge (GH#8539) Patrick Hoefler
Bump conda-incubator/setup-miniconda from 3.0.1 to 3.0.2 (GH#8535)
Avoid iteration error when getting module path (GH#8533) James Bourbeau
Ignore stdlib threading module in code collection (GH#8532) James Bourbeau
Fix excessive logging on P2P retry (GH#8511) Hendrik Makait
Prevent typos in retire_workers parameters (GH#8524) crusaderky
Cosmetic cleanup of test_steal (backport from #8185) (GH#8509) crusaderky
Fix flaky test_compute_per_key (GH#8521) crusaderky
Fix flaky test_no_workers_timeout_queued (GH#8523) crusaderky

2024.2.1¶

Released on February 23, 2024

Highlights¶

Allow silencing dask.DataFrame deprecation warning¶

The last release contained a DeprecationWarning that alerts users to an upcoming switch of dask.dataframe to the new backend with support for query planning (see also GH#10934).

This DeprecationWarning is triggered on import of the dask.dataframe module, and the community raised concerns about it being too verbose.

It is now possible to silence this warning:

# via Python
>>> dask.config.set({'dataframe.query-planning-warning': False})

# via CLI
dask config set dataframe.query-planning-warning False

See GH#10936 and GH#10925 from Miles for details.

More robust distributed scheduler for rare key collisions¶

Blockwise fusion optimization can cause a task key collision that is not handled properly by the distributed scheduler (see GH#9888). Users will typically notice this through one of various internal exceptions that cause a system deadlock or critical failure. While the underlying issue could not be fixed, the scheduler now implements a mechanism that should mitigate most occurrences and issues a warning if the issue is detected.

See GH#8185 from crusaderky and Florian Jetter for details.

Over the course of this, various improvements to tokenization have been implemented. See GH#10913, GH#10884, GH#10919, GH#10896 and primarily GH#10883 from crusaderky for more details.

More robust adaptive scaling on large clusters¶

Adaptive scaling could previously lose data during downscaling if many tasks had to be moved. This typically, but not exclusively, occurred on large clusters. It would manifest as a recomputation of tasks and could cause clusters to oscillate between up- and downscaling without ever finishing.

See GH#8522 from crusaderky for more details.

2024.2.0¶

Released on February 9, 2024

Highlights¶

Deprecate Dask DataFrame implementation¶

The current Dask DataFrame implementation is deprecated. In a future release, Dask DataFrame will use a new implementation that contains several improvements, including logical query planning. The user-facing DataFrame API will remain unchanged.

The new implementation is already available and can be enabled by installing the dask-expr library:

$ pip install dask-expr

and turning the query planning option on:

>>> import dask
>>> dask.config.set({'dataframe.query-planning': True})
>>> import dask.dataframe as dd

API documentation for the new implementation is available at https://docs.dask.org/en/stable/dataframe-api.html

Any feedback can be reported on the Dask issue tracker https://github.com/dask/dask/issues

See GH#10912 from Patrick Hoefler for details.

Improved tokenization¶

This release contains several improvements to Dask’s object tokenization logic. More objects now produce deterministic tokens, which can lead to improved performance through caching of intermediate results.

See GH#10898, GH#10904, GH#10876, GH#10874, and GH#10865 from crusaderky for details.
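The link between deterministic tokens and caching can be shown with a small standard-library sketch. It does not use Dask's tokenization: `deterministic_token`, `random_token`, and `cached_call` are hypothetical helpers, and hashing a pickle is only a simple stand-in for content-based tokenization.

```python
import hashlib
import pickle
import uuid


def deterministic_token(obj):
    """Token derived from the object's content (hypothetical helper)."""
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()


def random_token(obj):
    """Token that ignores the content, so it differs on every call."""
    return uuid.uuid4().hex


def cached_call(cache, token_func, func, arg):
    """Compute func(arg) unless a result is already stored under arg's token."""
    token = token_func(arg)
    if token not in cache:
        cache[token] = func(arg)
    return cache[token]


calls = []


def expensive(values):
    """Record each real invocation so cache hits can be counted."""
    calls.append(values)
    return sum(values)


cache = {}
cached_call(cache, deterministic_token, expensive, (1, 2, 3))
cached_call(cache, deterministic_token, expensive, (1, 2, 3))
deterministic_calls = len(calls)  # the second call reuses the cached result

calls.clear()
cache = {}
cached_call(cache, random_token, expensive, (1, 2, 3))
cached_call(cache, random_token, expensive, (1, 2, 3))
random_calls = len(calls)  # nothing can be reused

print(deterministic_calls, random_calls)  # 1 2
```

With content-based tokens, equal inputs map to the same key, so the intermediate result is computed once. With non-deterministic tokens, every call looks new and the work is repeated.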

Additional changes

Fix inplace modification on read-only arrays for string conversion (GH#10886) Patrick Hoefler
Add changelog entry for dask-expr (GH#10915) Patrick Hoefler
Fix leftsemi merge for cudf (GH#10914) Patrick Hoefler
Slight update to dask-expr warning (GH#10916) James Bourbeau
Improve performance for groupby.nunique (GH#10910) Patrick Hoefler
Add configuration for leftsemi merges in dask-expr (GH#10908) Patrick Hoefler
Adjust assign test for dask-expr (GH#10907) Patrick Hoefler
Avoid pytest.warns in test_to_datetime for GPU CI (GH#10902) Richard (Rick) Zamora
Update deployment options in docs homepage (GH#10901) James Bourbeau
Fix typo in dataframe docs (GH#10900) Matthew Rocklin
Bump peter-evans/create-pull-request from 5 to 6 (GH#10894)
Fix mimesis API >=13.1.0 - use random.randint (GH#10888) Miles
Adjust invalid test (GH#10897) Patrick Hoefler
Pickle da.argwhere and da.count_nonzero (GH#10885) crusaderky
Fix dask-expr tests after singleton pr (GH#10892) Patrick Hoefler
Set lower bound version for s3fs (GH#10889) Miles
Add a couple of dask-expr fixes for new parquet cache (GH#10880) Florian Jetter
Update deployment documentation (GH#10882) Matthew Rocklin
Start with dask-expr doc build (GH#10879) Patrick Hoefler
Test tokenization of static and class methods (GH#10872) crusaderky
Add distributed.print and distributed.warn to API docs (GH#10878) James Bourbeau
Run macos ci on M1 architecture (GH#10877) Patrick Hoefler
Update tests for dask-expr (GH#10838) Patrick Hoefler
Update parquet tests to align with dask-expr fixes (GH#10851) Richard (Rick) Zamora
Fix regression in test_graph_manipulation (GH#10873) crusaderky
Adjust pytest errors for dask-expr ci (GH#10871) Patrick Hoefler
Set upper bound version for numba when pandas<2.1 (GH#10890) Miles
Deprecate method parameter in DataFrame.fillna (GH#10846) Miles
Remove warning filter from pyproject.toml (GH#10867) Patrick Hoefler
Skip test_append_with_partition for fastparquet (GH#10828) Patrick Hoefler
Fix pytest 8 issues (GH#10868) Patrick Hoefler
Adjust test for support of median in Groupby.aggregate in dask-expr (2/2) (GH#10870) Hendrik Makait
Allow length of ascending to be larger than one in sort_values (GH#10864) Florian Jetter
Allow other message raised in Python 3.9 (GH#10862) Hendrik Makait
Don’t crash when getting computation code in pathological cases (GH#8502) James Bourbeau
Bump peter-evans/create-pull-request from 5 to 6 (GH#8494)
fix test of cudf spilling metrics (GH#8478) Mads R. B. Kristensen
Upgrade to pytest 8 (GH#8482) crusaderky
Fix test_two_consecutive_clients_share_results (GH#8484) crusaderky
Client word mix-up (GH#8481) templiert

2024.1.1¶

Released on January 26, 2024

Highlights¶

Pandas 2.2 and Scipy 1.12 support¶

This release contains compatibility updates for the latest pandas and scipy releases.

See GH#10834, GH#10849, GH#10845, and GH#8474 from crusaderky for details.

Deprecations¶

Deprecate convert_dtype in apply (GH#10827) Miles
Deprecate axis in DataFrame.rolling (GH#10803) Miles
Deprecate out= and dtype= parameter in most DataFrame methods (GH#10800) crusaderky
Deprecate axis in groupby cumulative transformers (GH#10796) Miles
Rename shuffle to shuffle_method in remaining methods (GH#10797) Miles

Additional changes

Add recommended deployment options to deployment docs (GH#10866) James Bourbeau
Improve _agg_finalize to conform to output expectation (GH#10835) Hendrik Makait
Implement deterministic tokenization for hlg (GH#10817) Patrick Hoefler
Refactor: move tests for tokenize() to its own module (GH#10863) crusaderky
Update DataFrame examples section (GH#10856) James Bourbeau
Temporarily pin mimesis<13.1.0 (GH#10860) James Bourbeau
Trivial cosmetic tweaks to _testing.py (GH#10857) crusaderky
Unskip and adjust tests for groupby-aggregate with median using dask-expr (GH#10832) Hendrik Makait
Fix test for sizeof(pd.MultiIndex) in upstream CI (GH#10850) crusaderky
numpy 2.0: fix slicing by uint64 array (GH#10854) crusaderky
Rename numpy version constants to match pandas (GH#10843) crusaderky
Bump actions/cache from 3 to 4 (GH#10852)
Update gpuCI RAPIDS_VER to 24.04 (GH#10841)
Fix deprecations in doctest (GH#10844) crusaderky
Changed dtype arithmetics in numpy 2.x (GH#10831) crusaderky
Adjust tests for median support in dask-expr (GH#10839) Patrick Hoefler
Adjust tests for median support in groupby-aggregate in dask-expr (GH#10840) Hendrik Makait
numpy 2.x: fix std() on MaskedArray (GH#10837) crusaderky
Fail dask-expr ci if tests fail (GH#10829) Patrick Hoefler
Activate query_planning when exporting tests (GH#10833) Patrick Hoefler
Expose dataframe tests (GH#10830) Patrick Hoefler
numpy 2: deprecations in n-dimensional fft functions (GH#10821) crusaderky
Generalize CreationDispatch for dask-expr (GH#10794) Richard (Rick) Zamora
Remove circular import when dask-expr enabled (GH#10824) Miles
Minor[CI]: publish-test-results not marked as failed (GH#10825) Miles
Fix more tests to use pytest.warns() (GH#10818) Michał Górny
np.unique(): inverse is shaped in numpy 2 (GH#10819) crusaderky
Pin test_split_adaptive_files to pyarrow engine (GH#10820) Patrick Hoefler
Adjust remaining tests in dask/dask (GH#10813) Patrick Hoefler
Restrict test to Arrow only (GH#10814) Patrick Hoefler
Filter warnings from std test (GH#10815) Patrick Hoefler
Adjust mostly indexing tests (GH#10790) Patrick Hoefler
Updates to deployment docs (GH#10778) Sarah Charlotte Johnson
Unblock documentation build (GH#10807) Miles
Adjust test_to_datetime for dask-expr compatibility Hendrik Makait
Upstream CI tweaks (GH#10806) crusaderky
Improve tests for to_numeric (GH#10804) Hendrik Makait
Fix test-report cache key indent (GH#10798) Miles
Add test-report workflow (GH#10783) Miles
Handle matrix subclass serialization (GH#8480) Florian Jetter
Use smallest data type for partition column in P2P (GH#8479) Florian Jetter
pandas 2.2: fix test_dataframe_groupby_tasks (GH#8475) crusaderky
Bump actions/cache from 3 to 4 (GH#8477)
pandas 2.2 vs. pyarrow 14: deprecated DatetimeTZBlock (GH#8476) crusaderky
pandas 2.2.0: Deprecated frequency alias M in favor of ME (GH#8473) Hendrik Makait
Fix docs build (GH#8472) Hendrik Makait
Fix P2P-based joins with explicit npartitions (GH#8470) Hendrik Makait
Ignore dask-expr in test_report.py script (GH#8464) Miles
Nit: hardcode Python version in test report environment (GH#8462) crusaderky
Change test_report.py - skip bad artifacts in dask/dask (GH#8461) Miles
Replace all occurrences of sys.is_finalizing (GH#8449) Florian Jetter

2024.1.0¶

Released on January 12, 2024

Highlights¶

Partial rechunks within P2P¶

P2P rechunking now utilizes the relationships between input and output chunks. For situations that do not require all-to-all data transfer, this may significantly reduce the runtime and memory/disk footprint. It also enables task culling.

See GH#8330 from Hendrik Makait for details.
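The relationship between input and output chunks can be sketched along a single axis. This is a simplified illustration under stated assumptions, not the P2P implementation: `chunk_bounds` and `input_chunks_per_output` are hypothetical helpers, and real rechunking works on all axes at once.

```python
from itertools import accumulate


def chunk_bounds(chunks):
    """Turn chunk sizes into (start, stop) intervals along one axis."""
    stops = list(accumulate(chunks))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


def input_chunks_per_output(input_chunks, output_chunks):
    """For each output chunk, list the indices of the input chunks it overlaps."""
    inputs = chunk_bounds(input_chunks)
    mapping = []
    for out_start, out_stop in chunk_bounds(output_chunks):
        mapping.append(
            [
                i
                for i, (start, stop) in enumerate(inputs)
                if start < out_stop and stop > out_start
            ]
        )
    return mapping


# Merge pairs of chunks along an axis of length 8.
mapping = input_chunks_per_output((2, 2, 2, 2), (4, 4))
print(mapping)  # [[0, 1], [2, 3]]
```

Here each output chunk depends on only two input chunks, so the rechunk splits into two independent groups instead of one all-to-all transfer. If only the first output chunk is needed, the inputs of the second group do not have to be computed at all, which is what makes culling possible.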

Fastparquet engine deprecated¶

The fastparquet Parquet engine has been deprecated. Users should migrate to the pyarrow engine by installing PyArrow and removing engine="fastparquet" in read_parquet or to_parquet calls.

See GH#10743 from crusaderky for details.

Improved serialization for arbitrary data¶

This release improves serialization robustness for arbitrary data. Previously, serialization could fail in some cases for data that is not msgpack-serializable. In those cases we now fall back to using pickle.

See GH#8447 from Hendrik Makait for details.
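The fallback pattern looks roughly like the sketch below. It is not the code used by distributed: json from the standard library stands in for msgpack, and the `serialize` and `deserialize` helpers are hypothetical names introduced for this example.

```python
import json
import pickle


def serialize(obj):
    """Try a restricted serializer first, then fall back to pickle."""
    try:
        # json is used here only as a stand-in for msgpack.
        return "json", json.dumps(obj).encode()
    except TypeError:
        # The restricted format cannot represent this object.
        return "pickle", pickle.dumps(obj)


def deserialize(kind, payload):
    """Reverse serialize() using the recorded serializer name."""
    if kind == "json":
        return json.loads(payload.decode())
    return pickle.loads(payload)


kind_simple, payload_simple = serialize({"a": [1, 2, 3]})
kind_complex, payload_complex = serialize({"a": {1, 2, 3}, "b": 1 + 2j})
print(kind_simple, kind_complex)  # json pickle
```

Recording which serializer was used alongside the payload lets the receiving side pick the matching deserializer.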

Additional deprecations¶

Deprecate shuffle keyword in favour of shuffle_method for DataFrame methods (GH#10738) Hendrik Makait
Deprecate automatic argument inference in repartition (GH#10691) Patrick Hoefler
Deprecate compute parameter in set_index (GH#10784) Miles
Deprecate inplace in eval (GH#10785) Miles
Deprecate Series.view (GH#10754) Miles
Deprecate npartitions="auto" for set_index & sort_values (GH#10750) Miles

Additional changes

Avoid shortcut in tasks shuffle that led to data loss (GH#10763) Patrick Hoefler
Ignore data tasks when ordering (GH#10706) Florian Jetter
Add get_dummies from dask-expr (GH#10791) Patrick Hoefler
Adjust IO tests for dask-expr migration (GH#10776) Patrick Hoefler
Remove deprecation warning about sort and split_out in groupby (GH#10788) Patrick Hoefler
Address pandas deprecations (GH#10789) Patrick Hoefler
Import distributed only once in get_scheduler (GH#10771) Florian Jetter
Simplify GitHub actions (GH#10781) crusaderky
Add unit test overview (GH#10769) Miles
Clean up redundant bits in CI (GH#10768) crusaderky
Update tests for ufunc (GH#10773) Patrick Hoefler
Use pytest.mark.skipif(DASK_EXPR_ENABLED) (GH#10774) crusaderky
Adjust shuffle tests for dask-expr (GH#10759) Patrick Hoefler
Fix some deprecation warnings from pandas (GH#10749) Patrick Hoefler
Adjust shuffle tests for dask-expr (GH#10762) Patrick Hoefler
Update pre-commit (GH#10767) Hendrik Makait
Clean up config switches in CI (GH#10766) crusaderky
Improve exception for validate_key (GH#10765) Hendrik Makait
Handle datetimeindexes in set_index with unknown divisions (GH#10757) Patrick Hoefler
Add hashing for decimals (GH#10758) Patrick Hoefler
Review tests for is_monotonic (GH#10756) crusaderky
Change argument order in value_counts_aggregate (GH#10751) Patrick Hoefler
Adjust some groupby tests for dask-expr (GH#10752) Patrick Hoefler
Restrict mimesis to < 12 for 3.9 build (GH#10755) Patrick Hoefler
Don’t evaluate config in skip condition (GH#10753) Patrick Hoefler
Adjust some tests to be compatible with dask-expr (GH#10714) Patrick Hoefler
Make dask.array.utils functions more generic to other Dask Arrays (GH#10676) Matthew Rocklin
Remove duplicate “single machine” section (GH#10747) Matthew Rocklin
Tweak ORC engine= parameter (GH#10746) crusaderky
Add pandas 3.0 deprecations and migration prep for dask-expr (GH#10723) Miles
Add task graph animation to docs homepage (GH#10730) Sarah Charlotte Johnson
Use new Xarray logo (GH#10729) James Bourbeau
Update tab styling on “10 Minutes to Dask” page (GH#10728) James Bourbeau
Update environment file upload step in CI (GH#10726) James Bourbeau
Don’t duplicate unobserved categories in GroupBy.nunique if split_out>1 (GH#10716) Patrick Hoefler
Changelog entry for dask.order update (GH#10715) Florian Jetter
Relax redundant-key check in _check_dsk (GH#10701) Richard (Rick) Zamora
Fix test_report.py (GH#8459) Miles
Revert pickle change (GH#8456) Florian Jetter
Adapt test_report.py to support dask/dask repository (GH#8450) Miles
Maintain stable ordering for P2P shuffling (GH#8453) Hendrik Makait
Add no worker timeout for scheduler (GH#8371) FTang21
Allow tests workflow to be dispatched manually by maintainers (GH#8445) Erik Sundell
Make scheduler-related transition functionality private (GH#8448) Hendrik Makait
Update pre-commit hooks (GH#8444) Hendrik Makait
Do not always check if __main__ in result when pickling (GH#8443) Florian Jetter
Delegate wait_for_workers to cluster instances only when implemented (GH#8441) Erik Sundell
Extend sleep in test_pandas (GH#8440) Julian Gilbey
Avoid deprecated shuffle keyword (GH#8439) Hendrik Makait
Shuffle metrics 4/4: Remove bespoke diagnostics (GH#8367) crusaderky
Do not run gilknocker in testsuite (GH#8423) Florian Jetter
Tweak abstractmethods (GH#8427) crusaderky
Shuffle metrics 3/4: Capture background metrics (GH#8366) crusaderky
Shuffle metrics 2/4: Add background metrics (GH#8365) crusaderky
Shuffle metrics 1/4: Add foreground metrics (GH#8364) crusaderky
Bump actions/upload-artifact from 3 to 4 (GH#8420)
Fix test_merge_p2p_shuffle_reused_dataframe_with_different_parameters (GH#8422) Hendrik Makait
Expand Client.upload_file docs example (GH#8313) Miles
Improve logging in P2P’s scheduler plugin (GH#8410) Hendrik Makait
Re-enable test_decide_worker_coschedule_order_neighbors (GH#8402) Florian Jetter
Add cuDF spilling statistics to RMM/GPU memory plot (GH#8148) Charles Blackmon-Luca
Fix inconsistent hashing for Nanny-spawned workers (GH#8400) Charles Stern
Do not allow workers to downscale if they are running long-running tasks (e.g. worker_client) (GH#7481) Florian Jetter
Fix flaky test_subprocess_cluster_does_not_depend_on_logging (GH#8417) crusaderky

2023.12.1¶

Released on December 15, 2023

Highlights¶

Logical Query Planning now available for Dask DataFrames¶

Dask DataFrames are now much more performant by using a logical query planner. This feature is currently off by default, but can be turned on with:

dask.config.set({"dataframe.query-planning": True})

You also need to have dask-expr installed:

pip install dask-expr

We’ve seen promising performance improvements so far, see this blog post and these regularly updated benchmarks for more information. A more detailed explanation of how the query optimizer works can be found in this blog post.

This feature is still under active development and the API isn’t stable yet, so breaking changes can occur. We expect to make the query optimizer the default early next year.

See GH#10634 from Patrick Hoefler for details.

Dtype inference in `read_parquet`¶

read_parquet will now infer the Arrow types pa.date32(), pa.date64() and pa.decimal() as an ArrowDtype in pandas. These dtypes are backed by the original Arrow array and thus avoid the conversion to NumPy object. Additionally, read_parquet will no longer infer nested and binary types as strings; they will be stored in NumPy object arrays.

See GH#10698 and GH#10705 from Patrick Hoefler for details.

Scheduling improvements to reduce memory usage¶

This release includes a major rewrite of a core part of our scheduling logic. It includes a new approach to the topological sorting algorithm in dask.order, which determines the order in which tasks are run. Improper ordering is known to be a major contributor to excessive cluster memory pressure.

Updates in this release fix a couple of performance regressions that were introduced in the release 2023.10.0 (see GH#10535). Generally, computations should now be much more eager to release data if it is no longer required in memory.

See GH#10660, GH#10697 from Florian Jetter for details.

Improved P2P-based merging robustness and performance¶

This release contains several updates that fix a possible deadlock introduced in 2023.9.2 and improve the robustness of P2P-based merging when the cluster is dynamically scaling up.

See GH#8415, GH#8416, and GH#8414 from Hendrik Makait for details.

Removed disabling pickle option¶

The distributed.scheduler.pickle configuration option is no longer supported. As of the 2023.4.0 release, pickle is used to transmit task graphs, so it can no longer be disabled. We now raise an informative error when distributed.scheduler.pickle is set to False.

See GH#8401 from Florian Jetter for details.

Additional changes

Add changelog entry for recent P2P merge fixes (GH#10712) Hendrik Makait
Update DataFrame page (GH#10710) Matthew Rocklin
Add changelog entry for dask-expr switch (GH#10704) Patrick Hoefler
Improve changelog entry for PipInstall changes (GH#10711) Hendrik Makait
Remove PR labeler (GH#10709) James Bourbeau
Add .__wrapped__ to Delayed object (GH#10695) Andrew S. Rosen
Bump actions/labeler from 4.3.0 to 5.0.0 (GH#10689)
Bump actions/stale from 8 to 9 (GH#10690)
[Dask.order] Remove non-runnable leaf nodes from ordering (GH#10697) Florian Jetter
Update installation docs (GH#10699) Matthew Rocklin
Fix software environment link in docs (GH#10700) James Bourbeau
Avoid converting non-strings to arrow strings for read_parquet (GH#10692) Patrick Hoefler
Bump xarray-contrib/issue-from-pytest-log from 1.2.7 to 1.2.8 (GH#10687)
Fix tokenize for pd.DateOffset (GH#10664) jochenott
Bugfix for writing empty array to zarr (GH#10506) Ben
Docs update, fixup styling, mention free (GH#10679) Matthew Rocklin
Update deployment docs (GH#10680) Matthew Rocklin
Dask.order rewrite using a critical path approach (GH#10660) Florian Jetter
Avoid substituting keys that occur multiple times (GH#10646) Florian Jetter
Add missing image to docs (GH#10694) Matthew Rocklin
Bump actions/setup-python from 4 to 5 (GH#10688)
Update landing page (GH#10674) Matthew Rocklin
Make meta check simpler in dispatch (GH#10638) Patrick Hoefler
Pin PR Labeler (GH#10675) Matthew Rocklin
Reorganize docs index a bit (GH#10669) Matthew Rocklin
Bump actions/setup-java from 3 to 4 (GH#10667)
Bump conda-incubator/setup-miniconda from 2.2.0 to 3.0.1 (GH#10668)
Bump xarray-contrib/issue-from-pytest-log from 1.2.6 to 1.2.7 (GH#10666)
Fix test_categorize_info with nightly pyarrow (GH#10662) James Bourbeau
Rewrite test_subprocess_cluster_does_not_depend_on_logging (GH#8409) Hendrik Makait
Avoid RecursionError when failing to pickle key in SpillBuffer and using tblib=3 (GH#8404) Hendrik Makait
Allow tasks to override is_rootish heuristic (GH#8412) Hendrik Makait
Remove GPU executor (GH#8399) Hendrik Makait
Do not rely on logging for subprocess cluster (GH#8398) Hendrik Makait
Update gpuCI RAPIDS_VER to 24.02 (GH#8384)
Bump actions/setup-python from 4 to 5 (GH#8396)
Ensure output chunks in P2P rechunking are distributed homogeneously (GH#8207) Florian Jetter
Trivial: fix typo (GH#8395) crusaderky
Bump JamesIves/github-pages-deploy-action from 4.4.3 to 4.5.0 (GH#8387)
Bump conda-incubator/setup-miniconda from 3.0.0 to 3.0.1 (GH#8388)

2023.12.0¶

Released on December 1, 2023

Highlights¶

PipInstall restart and environment variables¶

The distributed.PipInstall plugin now has more robust restart logic and also supports environment variables.

The example below shows how to use the distributed.PipInstall plugin and a TOKEN environment variable to securely install a package from a private repository:

from dask.distributed import PipInstall

# `client` is an existing dask.distributed.Client connected to the cluster
plugin = PipInstall(packages=["private_package@git+https://${TOKEN}@github.com/dask/private_package.git"])
client.register_plugin(plugin)

See GH#8374, GH#8357, and GH#8343 from Hendrik Makait for details.

Bokeh 3.3.0 compatibility¶

This release contains compatibility updates for using bokeh>=3.3.0 with proxied Dask dashboards. Previously the contents of dashboard plots wouldn’t be displayed.

See GH#8347 and GH#8381 from Jacob Tomlinson for details.

2023.11.0¶

Released on November 10, 2023

Highlights¶

Zero-copy P2P Array Rechunking¶

Users should see significant performance improvements when using in-memory P2P array rechunking. This is due to no longer copying underlying data buffers.

The simple example below was used to compare the performance of different rechunking methods.

import dask
import dask.array as da

shape = (30_000, 6_000, 150) # 201.17 GiB
input_chunks = (60, -1, -1) # 411.99 MiB
output_chunks = (-1, 6, -1) # 205.99 MiB

arr = da.random.random(shape, chunks=input_chunks)
# P2P rechunking requires a running distributed cluster
with dask.config.set({
    "array.rechunk.method": "p2p",
    "distributed.p2p.disk": True,
}):
    (
      arr
      .rechunk(output_chunks)
      .sum()
      .compute()
    )

A comparison of rechunking performance between the different methods tasks, p2p with disk and p2p without disk on different cluster sizes. The graph shows that p2p without disk is up to 60% faster than the default tasks based approach.

See GH#8282, GH#8318, and GH#8321 from crusaderky and GH#8322 from Hendrik Makait for details.

Deprecating PyArrow <14.0.1¶

pyarrow<14.0.1 usage is deprecated starting in this release. It’s recommended for all users to upgrade their version of pyarrow or install pyarrow-hotfix. See this CVE for full details.

See GH#10622 from Florian Jetter for details.

Improved PyArrow filesystem for Parquet¶

Using filesystem="arrow" when reading Parquet datasets now properly infers the correct cloud region when accessing remote, cloud-hosted data.

See GH#10590 from Richard (Rick) Zamora for details.

Improve Type Reconciliation in P2P Shuffling¶

See GH#8332 from Hendrik Makait for details.

2023.10.1¶

Released on October 27, 2023

Highlights¶

Python 3.12¶

This release adds official support for Python 3.12.

See GH#10544 and GH#8223 from Thomas Grainger for details.

2023.10.0¶

Released on October 13, 2023

Highlights¶

Reduced memory pressure for multi array reductions¶

This release contains major updates to Dask’s task graph scheduling logic. The updates here significantly reduce memory pressure on array reductions. We anticipate this will have a strong impact on the array computing community.

See GH#10535 from Florian Jetter for details.

Improved P2P shuffling robustness¶

There are several updates (listed below) that make P2P shuffling much more robust and less likely to fail.

See GH#8262, GH#8264, GH#8242, GH#8244, and GH#8235 from Hendrik Makait and GH#8124 from Charles Blackmon-Luca for details.

Reduced scheduler CPU load for large graphs¶

Users should see reduced CPU load on their scheduler when computing large task graphs.

See GH#8238 and GH#10547 from Florian Jetter and GH#8240 from crusaderky for details.

2023.9.3¶

Released on September 29, 2023

Highlights¶

Restore previous configuration override behavior¶

The 2023.9.2 release introduced an unintentional breaking change in how configuration options are overridden in dask.config.get with the override_with= keyword (see GH#10519). This release restores the previous behavior.

See GH#10521 from crusaderky for details.

Complex dtypes in Dask Array reductions¶

This release includes improved support for using common reductions in Dask Array (e.g. var, std, moment) with complex dtypes.

See GH#10009 from wkrasnicki for details.
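For reference, the variance of complex data is conventionally defined as the mean of the squared absolute deviations, which is how NumPy's var treats complex input and which always yields a real, non-negative result. The pure-Python sketch below only illustrates that definition; `complex_var` is a hypothetical helper and is unrelated to Dask's chunked implementation.

```python
def complex_var(values):
    """Population variance: mean of |x - mean|**2, which is real for complex input."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) ** 2 for x in values) / len(values)


data = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
variance = complex_var(data)
print(round(variance, 6))  # 2.0
```

Each point lies at distance sqrt(2) from the mean of 0, so the variance is 2 up to floating-point rounding.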

2023.9.2¶

Released on September 15, 2023

Highlights¶

P2P shuffling now raises when outdated PyArrow is installed¶

Previously, the default shuffling method would silently fall back from P2P to task-based shuffling if an older version of pyarrow was installed. Now we raise an informative error stating the minimum pyarrow version required for P2P instead of silently falling back.

See GH#10496 from Hendrik Makait for details.

Deprecation cycle for admin.traceback.shorten¶

The 2023.9.0 release modified the admin.traceback.shorten configuration option without introducing a deprecation cycle. This resulted in failures to create Dask clusters in some cases. This release introduces a deprecation cycle for this configuration change.

See GH#10509 from crusaderky for details.

2023.9.1¶

Released on September 6, 2023

Note

This is a hotfix release that fixes a P2P shuffling bug introduced in the 2023.9.0 release (see GH#10493).

Enhancements¶

Stricter data type for dask keys (GH#10485) crusaderky
Special handling for None in DASK_ environment variables (GH#10487) crusaderky

Bug Fixes¶

Fix _partitions dtype in meta for DataFrame.set_index and DataFrame.sort_values (GH#10493) Hendrik Makait
Handle cached_property decorators in derived_from (GH#10490) Lawrence Mitchell

Maintenance¶

Bump actions/checkout from 3.6.0 to 4.0.0 (GH#10492)
Simplify some tests that import distributed (GH#10484) crusaderky

2023.9.0¶

Released on September 1, 2023

Bug Fixes¶

Remove support for np.int64 in keys (GH#10483) crusaderky
Fix _partitions dtype in meta for shuffling (GH#10462) Hendrik Makait
Don’t use exception hooks to shorten tracebacks (GH#10456) crusaderky

Documentation¶

Add p2p shuffle option to DataFrame docs (GH#10477) Patrick Hoefler

Maintenance¶

Skip failing tests for pandas=2.1.0 (GH#10488) Patrick Hoefler
Update tests for pandas=2.1.0 (GH#10439) Patrick Hoefler
Enable pytest-timeout (GH#10482) crusaderky
Bump actions/checkout from 3.5.3 to 3.6.0 (GH#10470)

2023.8.1¶

Released on August 18, 2023

Enhancements¶

Adding support for cgroup v2 to cpu_count (GH#10419) Johan Olsson
Support multi-column groupby with sort=True and split_out>1 (GH#10425) Richard (Rick) Zamora
Add DataFrame.enforce_runtime_divisions method (GH#10404) Richard (Rick) Zamora
Enable file mode="x" with a single_file=True for Dask DataFrame to_csv (GH#10443) Genevieve Buckley

Bug Fixes¶

Fix ValueError when running to_csv in append mode with single_file as True (GH#10441) Ben

Maintenance¶

Add default types_mapper to from_pyarrow_table_dispatch for pandas (GH#10446) Richard (Rick) Zamora

2023.8.0¶

Released on August 4, 2023

Enhancements¶

Fix for make_timeseries performance regression (GH#10428) Irina Truong

Documentation¶

Add distributed.print to debugging docs (GH#10435) James Bourbeau
Documenting compatibility of NumPy functions with Dask functions (GH#9941) Chiara Marmo

Maintenance¶

Use SPDX in license metadata (GH#10437) John A Kirkham
Require dask[array] in dask[dataframe] (GH#10357) John A Kirkham
Update gpuCI RAPIDS_VER to 23.10 (GH#10427)
Simplify compatibility code (GH#10426) Hendrik Makait
Fix compatibility variable naming (GH#10424) Hendrik Makait
Fix a few errors with upstream pandas and pyarrow (GH#10412) Irina Truong

2023.7.1¶

Released on July 20, 2023

Note

This release updates Dask DataFrame to automatically convert text data using object data types to string[pyarrow] if pandas>=2 and pyarrow>=12 are installed.

This should result in significantly reduced memory consumption and increased computation performance in many workflows that deal with text data.

You can disable this change by setting the dataframe.convert-string configuration value to False with

dask.config.set({"dataframe.convert-string": False})
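
If the opt-out should apply only to part of a workflow, dask.config.set can also be used as a context manager. The sketch below only shows the configuration handling; it does not read any data, and the variable name inside is illustrative.

```python
import dask

# Temporarily opt out of the automatic string conversion; the previous
# value is restored when the block exits.
with dask.config.set({"dataframe.convert-string": False}):
    inside = dask.config.get("dataframe.convert-string")

print(inside)
```

DataFrames would need to be created inside the with block for the setting to be in effect when they are defined.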

Enhancements¶

Convert to pyarrow strings if proper dependencies are installed (GH#10400) James Bourbeau
Avoid repartition before shuffle for p2p (GH#10421) Patrick Hoefler
API to generate random Dask DataFrames (GH#10392) Irina Truong
Speed up dask.bag.Bag.random_sample (GH#10356) crusaderky
Raise helpful ValueError for invalid time units (GH#10408) Nat Tabris
Make repartition a no-op when divisions match (divisions provided as a list) (GH#10395) Nicolas Grandemange

Bug Fixes¶

Use dataframe.convert-string in read_parquet token (GH#10411) James Bourbeau
Category dtype is lost when concatenating MultiIndex (GH#10407) Irina Truong
Fix FutureWarning: The provided callable... (GH#10405) Irina Truong
Enable non-categorical hive-partition columns in read_parquet (GH#10353) Richard (Rick) Zamora
concat ignoring DataFrame without columns (GH#10359) Patrick Hoefler

2023.7.0¶

Released on July 7, 2023

Enhancements¶

Catch exceptions when attempting to load CLI entry points (GH#10380) Jacob Tomlinson

Bug Fixes¶

Fix typo in _clean_ipython_traceback (GH#10385) Alexander Clausen
Ensure that df is immutable after from_pandas (GH#10383) Patrick Hoefler
Warn consistently for inplace in Series.rename (GH#10313) Patrick Hoefler

Documentation¶

Add clarification about output shape and reshaping in rechunk documentation (GH#10377) Swayam Patil

Maintenance¶

Simplify astype implementation (GH#10393) Patrick Hoefler
Fix test_first_and_last to accommodate deprecated last (GH#10373) James Bourbeau
Add level to create_merge_tree (GH#10391) Patrick Hoefler
Do not derive from scipy.stats.chisquare docstring (GH#10382) Doug Davis

2023.6.1¶

Released on June 26, 2023

Enhancements¶

Remove no longer supported clip_lower and clip_upper (GH#10371) Patrick Hoefler
Support DataFrame.set_index(..., sort=False) (GH#10342) Miles
Cleanup remote tracebacks (GH#10354) Irina Truong
Add dispatching mechanisms for pyarrow.Table conversion (GH#10312) Richard (Rick) Zamora
Choose P2P even if fusion is enabled (GH#10344) Hendrik Makait
Validate that rechunking is possible earlier in graph generation (GH#10336) Hendrik Makait

Bug Fixes¶

Fix issue with header passed to read_csv (GH#10355) GALI PREM SAGAR
Respect dropna and observed in GroupBy.var and GroupBy.std (GH#10350) Patrick Hoefler
Fix H5FD_lock error when writing to hdf with distributed client (GH#10309) Irina Truong
Fix for total_mem_usage of bag.map() (GH#10341) Irina Truong

Deprecations¶

Deprecate DataFrame.fillna/Series.fillna with method (GH#10349) Irina Truong
Deprecate DataFrame.first and Series.first (GH#10352) Irina Truong

Maintenance¶

Deprecate numpy.compat (GH#10370) Irina Truong
Fix annotations and spans leaking between threads (GH#10367) Irina Truong
Use general kwargs in pyarrow_table_dispatch functions (GH#10364) Richard (Rick) Zamora
Remove unnecessary try/except in isna (GH#10363) Patrick Hoefler
mypy support for numpy 1.25 (GH#10362) crusaderky
Bump actions/checkout from 3.5.2 to 3.5.3 (GH#10348)
Restore numba in upstream build (GH#10330) James Bourbeau
Update nightly wheel index for pandas/numpy/scipy (GH#10346) Matthew Roeschke
Add rechunk config values to yaml (GH#10343) Hendrik Makait

2023.6.0¶

Released on June 9, 2023

Enhancements¶

Add missing not in predicate support to read_parquet (GH#10320) Richard (Rick) Zamora

Bug Fixes¶

Fix for incorrect value_counts (GH#10323) Irina Truong
Update empty describe top and freq values (GH#10319) James Bourbeau

Documentation¶

Fix hetzner typo (GH#10332) Sarah Charlotte Johnson

Maintenance¶

Test with numba and sparse on Python 3.11 (GH#10329) Thomas Grainger
Remove numpy.find_common_type warning ignore (GH#10311) James Bourbeau
Update gpuCI RAPIDS_VER to 23.08 (GH#10310)

2023.5.1¶

Released on May 26, 2023

Note

This release drops support for Python 3.8. As of this release Dask supports Python 3.9, 3.10, and 3.11. See this community issue for more details.

Enhancements¶

Drop Python 3.8 support (GH#10295) Thomas Grainger
Change Dask Bag partitioning scheme to improve cluster saturation (GH#10294) Jacob Tomlinson
Generalize dd.to_datetime for GPU-backed collections, introduce get_meta_library utility (GH#9881) Charles Blackmon-Luca
Add na_action to DataFrame.map (GH#10305) Patrick Hoefler
Raise TypeError in DataFrame.nsmallest and DataFrame.nlargest when columns is not given (GH#10301) Patrick Hoefler
Improve sizeof for pd.MultiIndex (GH#10230) Patrick Hoefler
Support duplicated columns in a bunch of DataFrame methods (GH#10261) Patrick Hoefler
Add numeric_only support to DataFrame.idxmin and DataFrame.idxmax (GH#10253) Patrick Hoefler
Implement numeric_only support for DataFrame.quantile (GH#10259) Patrick Hoefler
Add support for numeric_only=False in DataFrame.std (GH#10251) Patrick Hoefler
Implement numeric_only=False for GroupBy.cumprod and GroupBy.cumsum (GH#10262) Patrick Hoefler
Implement numeric_only for skew and kurtosis (GH#10258) Patrick Hoefler
mask and where should accept a callable (GH#10289) Irina Truong
Fix conversion from Categorical to pa.dictionary in read_parquet (GH#10285) Patrick Hoefler

Bug Fixes¶

Spurious config on nested annotations (GH#10318) crusaderky
Fix rechunking behavior for dimensions with known and unknown chunk sizes (GH#10157) Hendrik Makait
Enable drop to support mismatched partitions (GH#10300) James Bourbeau
Fix divisions construction for to_timestamp (GH#10304) Patrick Hoefler
pandas ExtensionDtype raising in Series reduction operations (GH#10149) Patrick Hoefler
Fix regression in da.random interface (GH#10247) Eray Aslan
da.coarsen doesn’t trim an empty chunk in meta (GH#10281) Irina Truong
Fix dtype inference for engine="pyarrow" in read_csv (GH#10280) Patrick Hoefler

Documentation¶

Add meta_from_array to API docs (GH#10306) Ruth Comer
Update Coiled links (GH#10296) Sarah Charlotte Johnson
Add docs for demo day (GH#10288) Matthew Rocklin

Maintenance¶

Explicitly install anaconda-client from conda-forge when uploading conda nightlies (GH#10316) Charles Blackmon-Luca
Configure isort to add from __future__ import annotations (GH#10314) Thomas Grainger
Avoid pandas Series.__getitem__ deprecation in tests (GH#10308) James Bourbeau
Ignore numpy.find_common_type warning from pandas (GH#10307) James Bourbeau
Add test to check that DataFrame.__setitem__ does not modify df inplace (GH#10223) Patrick Hoefler
Clean up default value of dropna in value_counts (GH#10299) Patrick Hoefler
Add pytest-cov to test extra (GH#10271) James Bourbeau

2023.5.0¶

Released on May 12, 2023

Enhancements¶

Implement numeric_only=False for GroupBy.corr and GroupBy.cov (GH#10264) Patrick Hoefler
Add support for numeric_only=False in DataFrame.var (GH#10250) Patrick Hoefler
Add numeric_only support to DataFrame.mode (GH#10257) Patrick Hoefler
Add DataFrame.map to dask.DataFrame API (GH#10246) Patrick Hoefler
Adjust for DataFrame.applymap deprecation and all NA concat behaviour change (GH#10245) Patrick Hoefler
Enable numeric_only=False for DataFrame.count (GH#10234) Patrick Hoefler
Disallow array input in mask/where (GH#10163) Irina Truong
Support numeric_only=True in GroupBy.corr and GroupBy.cov (GH#10227) Patrick Hoefler
Add numeric_only support to GroupBy.median (GH#10236) Patrick Hoefler
Support mimesis=9 in dask.datasets (GH#10241) James Bourbeau
Add numeric_only support to min, max and prod (GH#10219) Patrick Hoefler
Add numeric_only=True support for GroupBy.cumsum and GroupBy.cumprod (GH#10224) Patrick Hoefler
Add helper to unpack numeric_only keyword (GH#10228) Patrick Hoefler

Bug Fixes¶

Fix clone + from_array failure (GH#10211) crusaderky
Fix dataframe reductions for ea dtypes (GH#10150) Patrick Hoefler
Avoid scalar conversion deprecation warning in numpy=1.25 (GH#10248) James Bourbeau
Make sure transform output has the same index as input (GH#10184) Irina Truong
Fix corr and cov on a single-row partition (GH#9756) Irina Truong
Fix test_groupby_numeric_only_supported and test_groupby_aggregate_categorical_observed upstream errors (GH#10243) Irina Truong

Documentation¶

Clean up futures docs (GH#10266) Matthew Rocklin
Add Index API reference (GH#10263) hotpotato

Maintenance¶

Warn when meta is passed to apply (GH#10256) Patrick Hoefler
Remove imageio version restriction in CI (GH#10260) Patrick Hoefler
Remove unused DataFrame variance methods (GH#10252) Patrick Hoefler
Un-xfail test_categories with pyarrow strings and pyarrow>=12 (GH#10244) Irina Truong
Bump gpuCI PYTHON_VER 3.8->3.9 (GH#10233) Charles Blackmon-Luca

2023.4.1¶

Released on April 28, 2023

Enhancements¶

Implement numeric_only support for DataFrame.sum (GH#10194) Patrick Hoefler
Add support for numeric_only=True in GroupBy operations (GH#10222) Patrick Hoefler
Avoid deep copy in DataFrame.__setitem__ for pandas 1.4 and up (GH#10221) Patrick Hoefler
Avoid calling Series.apply with _meta_nonempty (GH#10212) Patrick Hoefler
Unpin sqlalchemy and fix compatibility issues (GH#10140) Patrick Hoefler

Bug Fixes¶

Partially revert default client discovery (GH#10225) Florian Jetter
Support arrow dtypes in Index meta creation (GH#10170) Patrick Hoefler
Repartitioning raises with extension dtype when truncating floats (GH#10169) Patrick Hoefler
Adjust empty Index from fastparquet to object dtype (GH#10179) Patrick Hoefler

Documentation¶

Update Kubernetes docs (GH#10232) Jacob Tomlinson
Add DataFrame.reduction to API docs (GH#10229) James Bourbeau
Add DataFrame.persist to docs and fix links (GH#10231) Patrick Hoefler
Add documentation for GroupBy.transform (GH#10185) Irina Truong
Fix formatting in random number generation docs (GH#10189) Eray Aslan

Maintenance¶

Pin imageio to <2.28 (GH#10216) Patrick Hoefler
Add note about importlib_metadata backport (GH#10207) James Bourbeau
Add xarray back to Python 3.11 CI builds (GH#10200) James Bourbeau
Add mindeps build with all optional dependencies (GH#10161) Charles Blackmon-Luca
Provide proper like value for array_safe in percentiles_summary (GH#10156) Charles Blackmon-Luca
Avoid re-opening hdf file multiple times in read_hdf (GH#10205) Thomas Grainger
Add merge tests on nullable columns (GH#10071) Charles Blackmon-Luca
Fix coverage configuration (GH#10203) Thomas Grainger
Remove is_period_dtype and is_sparse_dtype (GH#10197) Patrick Hoefler
Bump actions/checkout from 3.5.0 to 3.5.2 (GH#10201)
Avoid deprecated is_categorical_dtype from pandas (GH#10180) Patrick Hoefler
Adjust for deprecated is_interval_dtype and is_datetime64tz_dtype (GH#10188) Patrick Hoefler

2023.4.0¶

Released on April 14, 2023

Enhancements¶

Override old default values in update_defaults (GH#10159) Gabe Joseph
Add a CLI command to list and get a value from dask config (GH#9936) Irina Truong
Handle string-based engine argument to read_json (GH#9947) Richard (Rick) Zamora
Avoid deprecated GroupBy.dtypes (GH#10111) Irina Truong

Bug Fixes¶

Revert grouper-related changes (GH#10182) Irina Truong
GroupBy.cov raising for non-numeric grouping column (GH#10171) Patrick Hoefler
Updates for Index supporting numpy numeric dtypes (GH#10154) Irina Truong
Preserve dtype for partitioning columns when read with pyarrow (GH#10115) Patrick Hoefler
Fix annotations for to_hdf (GH#10123) Hendrik Makait
Handle None column name when checking if columns are all numeric (GH#10128) Lawrence Mitchell
Fix valid_divisions when passed a tuple (GH#10126) Brian Phillips
Maintain annotations in DataFrame.categorize (GH#10120) Hendrik Makait
Fix handling of missing min/max parquet statistics during filtering (GH#10042) Richard (Rick) Zamora

Deprecations¶

Deprecate use_nullable_dtypes= and add dtype_backend= (GH#10076) Irina Truong
Deprecate convert_dtype in Series.apply (GH#10133) Irina Truong

Documentation¶

Document Generator based random number generation (GH#10134) Eray Aslan

Maintenance¶

Update dataframe.convert_string to dataframe.convert-string (GH#10191) Irina Truong
Add python-cityhash to CI environments (GH#10190) Charles Blackmon-Luca
Temporarily pin scikit-image to fix Windows CI (GH#10186) Patrick Hoefler
Handle pandas deprecation warnings for to_pydatetime and apply (GH#10168) Patrick Hoefler
Drop bokeh<3 restriction (GH#10177) James Bourbeau
Fix failing tests under copy-on-write (GH#10173) Patrick Hoefler
Allow pyarrow CI to fail (GH#10176) James Bourbeau
Switch to Generator for random number generation in dask.array (GH#10003) Eray Aslan
Bump peter-evans/create-pull-request from 4 to 5 (GH#10166)
Fix flaky modf operation in test_arithmetic (GH#10162) Irina Truong
Temporarily remove xarray from CI with pandas 2.0 (GH#10153) James Bourbeau
Fix update_graph counting logic in test_default_scheduler_on_worker (GH#10145) James Bourbeau
Fix documentation build with pandas 2.0 (GH#10138) James Bourbeau
Remove dask/gpu from gpuCI update reviewers (GH#10135) Charles Blackmon-Luca
Update gpuCI RAPIDS_VER to 23.06 (GH#10129)
Bump actions/stale from 6 to 8 (GH#10121)
Use declarative setuptools (GH#10102) Thomas Grainger
Relax assert_eq checks on Scalar-like objects (GH#10125) Matthew Rocklin
Upgrade readthedocs config to ubuntu 22.04 and Python 3.11 (GH#10124) Thomas Grainger
Bump actions/checkout from 3.4.0 to 3.5.0 (GH#10122)
Fix test_null_partition_pyarrow in pyarrow CI build (GH#10116) Irina Truong
Drop distributed pack (GH#9988) Florian Jetter
Make dask.compatibility private (GH#10114) Jacob Tomlinson

2023.3.2¶

Released on March 24, 2023

Enhancements¶

Deprecate observed=False for groupby with categoricals (GH#10095) Irina Truong
Deprecate axis= for some groupby operations (GH#10094) James Bourbeau
The axis keyword in DataFrame.rolling/Series.rolling is deprecated (GH#10110) Irina Truong
DataFrame._data deprecation in pandas (GH#10081) Irina Truong
Use importlib_metadata backport to avoid CLI UserWarning (GH#10070) Thomas Grainger
Port option parsing logic from dask.dataframe.read_parquet to to_parquet (GH#9981) Anton Loukianov

Bug Fixes¶

Avoid using dd.shuffle in groupby-apply (GH#10043) Richard (Rick) Zamora
Enable null hive partitions with pyarrow parquet engine (GH#10007) Richard (Rick) Zamora
Support unknown shapes in *_like functions (GH#10064) Doug Davis

Documentation¶

Add to_backend methods to API docs (GH#10093) Lawrence Mitchell
Remove broken gpuCI link in developer docs (GH#10065) Charles Blackmon-Luca

Maintenance¶

Configure readthedocs sphinx warnings as errors (GH#10104) Thomas Grainger
Un-xfail test_division_or_partition with pyarrow strings active (GH#10108) Irina Truong
Un-xfail test_different_columns_are_allowed with pyarrow strings active (GH#10109) Irina Truong
Restore Entrypoints compatibility (GH#10113) Jacob Tomlinson
Un-xfail test_to_dataframe_optimize_graph with pyarrow strings active (GH#10087) Irina Truong
Only run test_development_guidelines_matches_ci on editable install (GH#10106) Charles Blackmon-Luca
Un-xfail test_dataframe_cull_key_dependencies_materialized with pyarrow strings active (GH#10088) Irina Truong
Install mimesis in CI environments (GH#10105) Charles Blackmon-Luca
Fix for no module named ipykernel (GH#10101) Irina Truong
Fix docs builds by installing ipykernel (GH#10103) Thomas Grainger
Allow pyarrow build to continue on failures (GH#10097) James Bourbeau
Bump actions/checkout from 3.3.0 to 3.4.0 (GH#10096)
Fix test_set_index_on_empty with pyarrow strings active (GH#10054) Irina Truong
Un-xfail pyarrow pickling tests (GH#10082) James Bourbeau
CI environment file cleanup (GH#10078) James Bourbeau
Un-xfail more pyarrow tests (GH#10066) Irina Truong
Temporarily skip pyarrow_compat tests with pandas 2.0 (GH#10063) James Bourbeau
Fix test_melt with pyarrow strings active (GH#10052) Irina Truong
Fix test_str_accessor with pyarrow strings active (GH#10048) James Bourbeau
Fix test_better_errors_object_reductions with pyarrow strings active (GH#10051) James Bourbeau
Fix test_loc_with_non_boolean_series with pyarrow strings active (GH#10046) James Bourbeau
Fix test_values with pyarrow strings active (GH#10050) James Bourbeau
Temporarily xfail test_upstream_packages_installed (GH#10047) James Bourbeau

2023.3.1¶

Released on March 10, 2023

Enhancements¶

Support pyarrow strings in MultiIndex (GH#10040) Irina Truong
Improved support for pyarrow strings (GH#10000) Irina Truong
Fix flaky RuntimeWarning during array reductions (GH#10030) James Bourbeau
Extend complete extras (GH#10023) James Bourbeau
Raise an error with dataframe.convert-string=True and pandas<2.0 (GH#10033) Irina Truong
Rename shuffle/rechunk config option/kwarg to method (GH#10013) James Bourbeau
Add initial support for converting pandas extension dtypes to arrays (GH#10018) James Bourbeau
Remove randomgen support (GH#9987) Eray Aslan

Bug Fixes¶

Skip rechunk when rechunking to the same chunks with unknown sizes (GH#10027) Hendrik Makait
Custom utility to convert parquet filters to pyarrow expression (GH#9885) Richard (Rick) Zamora
Consider numpy scalars and 0d arrays as scalars when padding (GH#9653) Justus Magin
Fix parquet overwrite behavior after an adaptive read_parquet operation (GH#10002) Richard (Rick) Zamora

Documentation¶

Add and update docs for Data Transfer section (GH#10022) Miles

Maintenance¶

Remove stale hive-partitioning code from pyarrow parquet engine (GH#10039) Richard (Rick) Zamora
Increase minimum supported pyarrow to 7.0 (GH#10024) James Bourbeau
Revert “Prepare drop packunpack (GH#9994)” (GH#10037) Florian Jetter
Have codecov wait for more builds before reporting (GH#10031) James Bourbeau
Prepare drop packunpack (GH#9994) Florian Jetter
Add CI job with pyarrow strings turned on (GH#10017) James Bourbeau
Fix test_groupby_dropna_with_agg for pandas 2.0 (GH#10001) Irina Truong
Fix test_pickle_roundtrip for pandas 2.0 (GH#10011) James Bourbeau

2023.3.0¶

Released on March 1, 2023

Bug Fixes¶

Bag must not pick p2p as shuffle default (GH#10005) Florian Jetter

Documentation¶

Minor follow-up to P2P by default (GH#10008) James Bourbeau

Maintenance¶

Add minimum version to optional jinja2 dependency (GH#9999) Charles Blackmon-Luca

2023.2.1¶

Released on February 24, 2023

Note

This release changes the default DataFrame shuffle algorithm to p2p to improve stability and performance. Learn more here and please provide any feedback on this discussion.

If you encounter issues with this new algorithm, please see the documentation for more information, and how to switch back to the old mode.

Enhancements¶

Enable P2P shuffling by default (GH#9991) Florian Jetter
P2P rechunking (GH#9939) Hendrik Makait
Efficient dataframe.convert-string support for read_parquet (GH#9979) Irina Truong
Allow p2p shuffle kwarg for DataFrame merges (GH#9900) Florian Jetter
Change split_row_groups default to “infer” (GH#9637) Richard (Rick) Zamora
Add option for converting string data to use pyarrow strings (GH#9926) James Bourbeau
Add support for multi-column sort_values (GH#8263) Charles Blackmon-Luca
Generator based random-number generation in dask.array (GH#9038) Eray Aslan
Support numeric_only for simple groupby aggregations for pandas 2.0 compatibility (GH#9889) Irina Truong

Bug Fixes¶

Fix profilers plot not being aligned to context manager enter time (GH#9739) David Hoese
Relax dask.dataframe assert_eq type checks (GH#9989) Matthew Rocklin
Restore describe compatibility for pandas 2.0 (GH#9982) James Bourbeau

Documentation¶

Improving deploying Dask docs (GH#9912) Sarah Charlotte Johnson
More docs for DataFrame.partitions (GH#9976) Tom Augspurger
Update docs with more information on default Delayed scheduler (GH#9903) Guillaume Eynard-Bontemps
Deployment Considerations documentation (GH#9933) Gabe Joseph

Maintenance¶

Temporarily rerun flaky tests (GH#9983) James Bourbeau
Update parsing of FULL_RAPIDS_VER/FULL_UCX_PY_VER (GH#9990) Charles Blackmon-Luca
Increase minimum supported versions to pandas=1.3 and numpy=1.21 (GH#9950) James Bourbeau
Fix std to work with numeric_only for pandas 2.0 (GH#9960) Irina Truong
Temporarily xfail test_roundtrip_partitioned_pyarrow_dataset (GH#9977) James Bourbeau
Fix copy on write failure in test_idxmaxmin (GH#9944) Patrick Hoefler
Bump pre-commit versions (GH#9955) crusaderky
Fix test_groupby_unaligned_index for pandas 2.0 (GH#9963) Irina Truong
Un-xfail test_set_index_overlap_2 for pandas 2.0 (GH#9959) James Bourbeau
Fix test_merge_by_index_patterns for pandas 2.0 (GH#9930) Irina Truong
Bump jacobtomlinson/gha-find-replace from 2 to 3 (GH#9953) James Bourbeau
Fix test_rolling_agg_aggregate for pandas 2.0 compatibility (GH#9948) Irina Truong
Bump black to 23.1.0 (GH#9956) crusaderky
Run GPU tests on python 3.8 & 3.10 (GH#9940) Charles Blackmon-Luca
Fix test_to_timestamp for pandas 2.0 (GH#9932) Irina Truong
Fix an error with groupby value_counts for pandas 2.0 compatibility (GH#9928) Irina Truong
Config converter: replace all dashes with underscores (GH#9945) Jacob Tomlinson
CI: use nightly wheel to install pyarrow in upstream test build (GH#9873) Joris Van den Bossche

2023.2.0¶

Released on February 10, 2023

Enhancements¶

Update numeric_only default in quantile for pandas 2.0 (GH#9854) Irina Truong
Make repartition a no-op when divisions match (GH#9924) James Bourbeau
Update datetime_is_numeric behavior in describe for pandas 2.0 (GH#9868) Irina Truong
Update value_counts to return correct name in pandas 2.0 (GH#9919) Irina Truong
Support new axis=None behavior in pandas 2.0 for certain reductions (GH#9867) James Bourbeau
Filter out all-nan RuntimeWarning at the chunk level for nanmin and nanmax (GH#9916) Julia Signell
Fix numeric meta_nonempty index creation for pandas 2.0 (GH#9908) James Bourbeau
Fix DataFrame.info() tests for pandas 2.0 (GH#9909) James Bourbeau

Bug Fixes¶

Fix GroupBy.value_counts handling for multiple groupby columns (GH#9905) Charles Blackmon-Luca

Documentation¶

Fix some outdated information/typos in development guide (GH#9893) Patrick Hoefler
Add note about keep=False in drop_duplicates docstring (GH#9887) Jayesh Manani
Add meta details to dask Array (GH#9886) Jayesh Manani
Clarify task stream showing more rows than threads (GH#9906) Gabe Joseph

Maintenance¶

Fix test_numeric_column_names for pandas 2.0 (GH#9937) Irina Truong
Fix dask/dataframe/tests/test_utils_dataframe.py tests for pandas 2.0 (GH#9788) James Bourbeau
Replace index.is_numeric with is_any_real_numeric_dtype for pandas 2.0 compatibility (GH#9918) Irina Truong
Avoid pd.core import in dask utils (GH#9907) Matthew Roeschke
Use label for upstream build on pull requests (GH#9910) James Bourbeau
Broaden exception catching for sqlalchemy.exc.RemovedIn20Warning (GH#9904) James Bourbeau
Temporarily restrict sqlalchemy < 2 in CI (GH#9897) James Bourbeau
Update isort version to 5.12.0 (GH#9895) Lawrence Mitchell
Remove unused skiprows variable in read_csv (GH#9892) Patrick Hoefler

2023.1.1¶

Released on January 27, 2023

Enhancements¶

Add to_backend method to Array and _Frame (GH#9758) Richard (Rick) Zamora
Small fix for timestamp index divisions in pandas 2.0 (GH#9872) Irina Truong
Add numeric_only to DataFrame.cov and DataFrame.corr (GH#9787) James Bourbeau
Fixes related to group_keys default change in pandas 2.0 (GH#9855) Irina Truong
infer_datetime_format compatibility for pandas 2.0 (GH#9783) James Bourbeau

Bug Fixes¶

Fix serialization bug in BroadcastJoinLayer (GH#9871) Richard (Rick) Zamora
Satisfy broadcast argument in DataFrame.merge (GH#9852) Richard (Rick) Zamora
Fix pyarrow parquet columns statistics computation (GH#9772) aywandji

Documentation¶

Fix “duplicate explicit target name” docs warning (GH#9863) Chiara Marmo
Fix code formatting issue in “Defining a new collection backend” docs (GH#9864) Chiara Marmo
Update dashboard documentation for memory plot (GH#9768) Jayesh Manani
Add docs section about no-worker tasks (GH#9839) Florian Jetter

Maintenance¶

Additional updates for detecting a distributed scheduler (GH#9890) James Bourbeau
Update gpuCI RAPIDS_VER to 23.04 (GH#9876)
Reverse precedence between collection and distributed default (GH#9869) Florian Jetter
Update xarray-contrib/issue-from-pytest-log to version 1.2.6 (GH#9865) James Bourbeau
Don’t require dask config shuffle default (GH#9826) Florian Jetter
Un-xfail datetime64 Parquet roundtripping tests for new fastparquet (GH#9811) James Bourbeau
Add option to manually run upstream CI build (GH#9853) James Bourbeau
Use custom timeout in CI builds (GH#9844) James Bourbeau
Remove kwargs from make_blockwise_graph (GH#9838) Florian Jetter
Ignore warnings on persist call in test_setitem_extended_API_2d_mask (GH#9843) Charles Blackmon-Luca
Fix running S3 tests locally (GH#9833) James Bourbeau

2023.1.0¶

Released on January 13, 2023

Enhancements¶

Use distributed default clients even if no config is set (GH#9808) Florian Jetter
Implement ma.where and ma.nonzero (GH#9760) Erik Holmgren
Update zarr store creation functions (GH#9790) Ryan Abernathey
iteritems compatibility for pandas 2.0 (GH#9785) James Bourbeau
Accurate sizeof for pandas string[python] dtype (GH#9781) crusaderky
Deflate sizeof() of duplicate references to pandas object types (GH#9776) crusaderky
GroupBy.__getitem__ compatibility for pandas 2.0 (GH#9779) James Bourbeau
append compatibility for pandas 2.0 (GH#9750) James Bourbeau
get_dummies compatibility for pandas 2.0 (GH#9752) James Bourbeau
is_monotonic compatibility for pandas 2.0 (GH#9751) James Bourbeau
numpy=1.24 compatibility (GH#9777) James Bourbeau

Documentation¶

Remove duplicated encoding kwarg in docstring for to_json (GH#9796) Sultan Orazbayev
Mention SubprocessCluster in LocalCluster documentation (GH#9784) Hendrik Makait
Move Prometheus docs to dask/distributed (GH#9761) crusaderky

Maintenance¶

Temporarily ignore RuntimeWarning in test_setitem_extended_API_2d_mask (GH#9828) James Bourbeau
Fix flaky test_threaded.py::test_interrupt (GH#9827) Hendrik Makait
Update xarray-contrib/issue-from-pytest-log in upstream report (GH#9822) James Bourbeau
pip install dask on gpuCI builds (GH#9816) Charles Blackmon-Luca
Bump actions/checkout from 3.2.0 to 3.3.0 (GH#9815)
Resolve sqlalchemy import failures in mindeps testing (GH#9809) Charles Blackmon-Luca
Ignore sqlalchemy.exc.RemovedIn20Warning (GH#9801) Thomas Grainger
xfail datetime64 Parquet roundtripping tests for pandas 2.0 (GH#9786) James Bourbeau
Remove sqlalchemy 1.3 compatibility (GH#9695) McToel
Reduce size of expected DoK sparse matrix (GH#9775) Elliott Sales de Andrade
Remove executable flag from dask/dataframe/io/orc/utils.py (GH#9774) Elliott Sales de Andrade

2022.12.1¶

Released on December 16, 2022

Enhancements¶

Support dtype_backend="pandas|pyarrow" configuration (GH#9719) James Bourbeau
Support cupy.ndarray to cudf.DataFrame dispatching in dask.dataframe (GH#9579) Richard (Rick) Zamora
Make filesystem-backend configurable in read_parquet (GH#9699) Richard (Rick) Zamora
Serialize all pyarrow extension arrays efficiently (GH#9740) James Bourbeau

Bug Fixes¶

Fix bug when repartitioning with tz-aware datetime index (GH#9741) James Bourbeau
Partial functions in aggs may have arguments (GH#9724) Irina Truong
Add support for simple operation with pyarrow-backed extension dtypes (GH#9717) James Bourbeau
Rename columns correctly in case of SeriesGroupby (GH#9716) Lawrence Mitchell

Documentation¶

Fix url link typo in collection backend doc (GH#9748) Shawn
Update Prometheus docs (GH#9696) Hendrik Makait

Maintenance¶

Add zarr to Python 3.11 CI environment (GH#9771) James Bourbeau
Add support for Python 3.11 (GH#9708) Thomas Grainger
Bump actions/checkout from 3.1.0 to 3.2.0 (GH#9753)
Avoid np.bool8 deprecation warning (GH#9737) James Bourbeau
Make sure dev packages aren’t overwritten in upstream CI build (GH#9731) James Bourbeau
Avoid adding data.h5 and mydask.html files during tests (GH#9726) Thomas Grainger

2022.12.0¶

Released on December 2, 2022

Enhancements¶

Remove statistics-based set_index logic from read_parquet (GH#9661) Richard (Rick) Zamora
Add support for use_nullable_dtypes to dd.read_parquet (GH#9617) Ian Rose
Fix map_overlap in order to accept pandas arguments (GH#9571) Fabien Aulaire
Fix pandas 1.5+ FutureWarning in .str.split(..., expand=True) (GH#9704) Jacob Hayes
Enable column projection for groupby slicing (GH#9667) Richard (Rick) Zamora
Support duplicate column cum-functions (GH#9685) Ben
Improve error message for failed backend dispatch call (GH#9677) Richard (Rick) Zamora

Bug Fixes¶

Revise meta creation in arrow parquet engine (GH#9672) Richard (Rick) Zamora
Fix da.fft.fft for array-like inputs (GH#9688) James Bourbeau
Fix groupby-aggregation when grouping on an index by name (GH#9646) Richard (Rick) Zamora

Maintenance¶

Avoid PytestReturnNotNoneWarning in test_inheriting_class (GH#9707) Thomas Grainger
Fix flaky test_dataframe_aggregations_multilevel (GH#9701) Richard (Rick) Zamora
Bump mypy version (GH#9697) crusaderky
Disable dashboard in test_map_partitions_df_input (GH#9687) James Bourbeau
Use latest xarray-contrib/issue-from-pytest-log in upstream build (GH#9682) James Bourbeau
xfail ttest_1samp for upstream scipy (GH#9670) James Bourbeau
Update gpuCI RAPIDS_VER to 23.02 (GH#9678)

2022.11.1¶

Released on November 18, 2022

Enhancements¶

Restrict bokeh=3 support (GH#9673) Gabe Joseph
Updates for fastparquet evolution (GH#9650) Martin Durant

Maintenance¶

Update ga-yaml-parser step in gpuCI updating workflow (GH#9675) Charles Blackmon-Luca
Revert importlib.metadata workaround (GH#9658) James Bourbeau
Fix mindeps-distributed CI build to handle numpy/pandas not being installed (GH#9668) James Bourbeau

2022.11.0¶

Released on November 15, 2022

Enhancements¶

Generalize from_dict implementation to allow usage from other backends (GH#9628) GALI PREM SAGAR

Bug Fixes¶

Avoid pandas constructors in dask.dataframe.core (GH#9570) Richard (Rick) Zamora
Fix sort_values with Timestamp data (GH#9642) James Bourbeau
Generalize array checking and remove pd.Index call in _get_partitions (GH#9634) Benjamin Zaitlen
Fix read_csv behavior for header=0 and names (GH#9614) Richard (Rick) Zamora

Documentation¶

Update dashboard docs for queuing (GH#9660) Gabe Joseph
Remove import dask as d from docstrings (GH#9644) Matthew Rocklin
Fix link to partitions docs in read_parquet docstring (GH#9636) qheuristics
Add API doc links to array/bag/dataframe sections (GH#9630) Matthew Rocklin

Maintenance¶

Use conda-incubator/setup-miniconda@v2.2.0 (GH#9662) John A Kirkham
Allow bokeh=3 (GH#9659) James Bourbeau
Run upstream build with Python 3.10 (GH#9655) James Bourbeau
Pin pyyaml version in mindeps testing (GH#9640) Charles Blackmon-Luca
Add pre-commit to catch breakpoint() (GH#9638) James Bourbeau
Bump xarray-contrib/issue-from-pytest-log from 1.1 to 1.2 (GH#9635)
Remove blosc references (GH#9625) Naty Clementi
Upgrade mypy and drop unused comments (GH#9616) Hendrik Makait
Harden test_repartition_npartitions (GH#9585) Richard (Rick) Zamora

2022.10.2¶

Released on October 31, 2022

This was a hotfix and has no changes in this repository. The necessary fix was in dask/distributed, but we decided to bump this version number for consistency.

2022.10.1¶

Released on October 28, 2022

Enhancements¶

Enable named aggregation syntax (GH#9563) ChrisJar
Add extension dtype support to set_index (GH#9566) James Bourbeau
Redesigning the array HTML repr for clarity (GH#9519) Shingo OKAWA

Bug Fixes¶

Fix merge with empty left DataFrame (GH#9578) Ian Rose

Documentation¶

Add note about limiting thread oversubscription by default (GH#9592) James Bourbeau
Use sphinx-click for dask CLI (GH#9589) James Bourbeau
Fix Semaphore API docs (GH#9584) James Bourbeau
Render meta description in map_overlap docstring (GH#9568) James Bourbeau

Maintenance¶

Require Click 7.0+ in Dask (GH#9595) John A Kirkham
Temporarily restrict bokeh<3 (GH#9607) James Bourbeau
Resolve importlib-related failures in upstream CI (GH#9604) Charles Blackmon-Luca
Improve upstream CI report (GH#9603) James Bourbeau
Fix upstream CI report (GH#9602) James Bourbeau
Remove setuptools host dep, add CLI entrypoint (GH#9600) Charles Blackmon-Luca
More Backend dispatch class type annotations (GH#9573) Ian Rose

2022.10.0¶

Released on October 14, 2022

New Features¶

Backend library dispatching for IO in Dask-Array and Dask-DataFrame (GH#9475) Richard (Rick) Zamora
Add new CLI that is extensible (GH#9283) Doug Davis

Enhancements¶

Groupby median (GH#9516) Ian Rose
Fix array copy not being a no-op (GH#9555) David Hoese
Add support for string timedelta in map_overlap (GH#9559) Nicolas Grandemange
Shuffle-based groupby for single functions (GH#9504) Ian Rose
Make datetime.datetime tokenize idempotently (GH#9532) Martin Durant
Support tokenizing datetime.time (GH#9528) Tim Paine

Bug Fixes¶

Avoid race condition in lazy dispatch registration (GH#9545) James Bourbeau
Do not allow setitem to np.nan for int dtype (GH#9531) Doug Davis
Stable demo column projection (GH#9538) Ian Rose
Ensure pickle-able binops in delayed (GH#9540) Ian Rose
Fix project CSV columns when selecting (GH#9534) Martin Durant

Documentation¶

Update Parquet best practice (GH#9537) Matthew Rocklin

Maintenance¶

Restrict tiledb-py version to avoid CI failures (GH#9569) James Bourbeau
Bump actions/github-script from 3 to 6 (GH#9564)
Bump actions/stale from 4 to 6 (GH#9551)
Bump peter-evans/create-pull-request from 3 to 4 (GH#9550)
Bump actions/checkout from 2 to 3.1.0 (GH#9552)
Bump codecov/codecov-action from 1 to 3 (GH#9549)
Bump the-coding-turtle/ga-yaml-parser from 0.1.1 to 0.1.2 (GH#9553)
Move dependabot configuration file (GH#9547) James Bourbeau
Add dependabot for GitHub actions (GH#9542) James Bourbeau
Run mypy on Windows and Linux (GH#9530) crusaderky
Update gpuCI RAPIDS_VER to 22.12 (GH#9524)

2022.9.2¶

Released on September 30, 2022

Enhancements¶

Remove factorization logic from array auto chunking (GH#9507) James Bourbeau

Documentation¶

Add docs on running Dask in a standalone Python script (GH#9513) James Bourbeau
Clarify custom-graph multiprocessing example (GH#9511) nouman

Maintenance¶

Groupby sort upstream compatibility (GH#9486) Ian Rose

2022.9.1¶

Released on September 16, 2022

New Features¶

Add DataFrame and Series median methods (GH#9483) James Bourbeau

Enhancements¶

Shuffle groupby default (GH#9453) Ian Rose
Filter by list (GH#9419) Greg Hayes
Added distributed.utils.key_split functionality to dask.utils.key_split (GH#9464) Luke Conibear

Bug Fixes¶

Fix overlap so that set_index doesn’t drop rows (GH#9423) Julia Signell
Fix assigning pandas Series to column when ddf.columns.min() raises (GH#9485) Erik Welch
Fix metadata comparison stack_partitions (GH#9481) James Bourbeau
Provide default for split_out (GH#9493) Lawrence Mitchell

Deprecations¶

Allow split_out to be None, which then defaults to 1 in groupby().aggregate() (GH#9491) Ian Rose

Documentation¶

Fixing enforce_metadata documentation, not checking for dtypes (GH#9474) Nicolas Grandemange
Fix it's –> its typo (GH#9484) Nat Tabris

Maintenance¶

Workaround for parquet writing failure using some datetime series but not others (GH#9500) Ian Rose
Filter out numeric_only warnings from pandas (GH#9496) James Bourbeau
Avoid set_index(..., inplace=True) where not necessary (GH#9472) James Bourbeau
Avoid passing groupby key list of length one (GH#9495) James Bourbeau
Update test_groupby_dropna_cudf based on cudf support for group_keys (GH#9482) James Bourbeau
Remove dd.from_bcolz (GH#9479) James Bourbeau
Added flake8-bugbear to pre-commit hooks (GH#9457) Luke Conibear
Bind loop variables in function definitions (B023) (GH#9461) Luke Conibear
Added assert for comparisons (B015) (GH#9459) Luke Conibear
Set top-level default shell in CI workflows (GH#9469) James Bourbeau
Removed unused loop control variables (B007) (GH#9458) Luke Conibear
Replaced getattr calls for constant attributes (B009) (GH#9460) Luke Conibear
Pin libprotobuf to allow nightly pyarrow in the upstream CI build (GH#9465) Joris Van den Bossche
Replaced mutable data structures for default arguments (B006) (GH#9462) Luke Conibear
Changed flake8 mirror and updated version (GH#9456) Luke Conibear

2022.9.0¶

Released on September 2, 2022

Enhancements¶

Enable automatic column projection for groupby aggregations (GH#9442) Richard (Rick) Zamora
Accept superclasses in NEP-13/17 dispatching (GH#6710) Gabe Joseph

Bug Fixes¶

Rename by columns internally for cumulative operations on the same by columns (GH#9430) Pavithra Eswaramoorthy
Fix get_group with categoricals (GH#9436) Pavithra Eswaramoorthy
Fix caching-related MaterializedLayer.cull performance regression (GH#9413) Richard (Rick) Zamora

Documentation¶

Add maintainer documentation page (GH#9309) James Bourbeau

Maintenance¶

Revert skipped fastparquet test (GH#9439) Pavithra Eswaramoorthy
tmpfile does not end files with period on empty extension (GH#9429) Hendrik Makait
Skip failing fastparquet test with latest release (GH#9432) James Bourbeau

2022.8.1¶

Released on August 19, 2022

New Features¶

Implement ma.*_like functions (GH#9378) Ruth Comer

Enhancements¶

Fuse compatible annotations (GH#9402) Ian Rose
Shuffle-based groupby aggregation for high-cardinality groups (GH#9302) Richard (Rick) Zamora
Unpack namedtuple (GH#9361) Hendrik Makait

Bug Fixes¶

Fix SeriesGroupBy cumulative functions with axis=1 (GH#9377) Pavithra Eswaramoorthy
Sparse array reductions (GH#9342) Ian Rose
Fix make_meta while using categorical column with index (GH#9348) Pavithra Eswaramoorthy
Don’t allow incompatible keywords in DataFrame.dropna (GH#9366) Naty Clementi
Make set_index handle entirely empty dataframes (GH#8896) Julia Signell
Improve dataclass handling in unpack_collections (GH#9345) Hendrik Makait
Fix bag sampling when there are some smaller partitions (GH#9349) Ian Rose
Add support for empty partitions to da.min/da.max functions (GH#9268) geraninam

Documentation¶

Clarify that bind() etc. regenerate the keys (GH#9385) crusaderky
Consolidate dashboard diagnostics documentation (GH#9357) Sarah Charlotte Johnson
Remove outdated meta information Pavithra Eswaramoorthy

Maintenance¶

Use entry_points utility in sizeof (GH#9390) James Bourbeau
Add entry_points compatibility utility (GH#9388) Jacob Tomlinson
Upload environment file artifact for each CI build (GH#9372) James Bourbeau
Remove werkzeug pin in CI (GH#9371) James Bourbeau
Fix type annotations for dd.from_pandas and dd.from_delayed (GH#9362) Jordan Yap

2022.8.0¶

Released on August 5, 2022

Enhancements¶

Ensure make_meta doesn’t hold ref to data (GH#9354) Jim Crist-Harif
Revise divisions logic in from_pandas (GH#9221) Richard (Rick) Zamora
Warn if user sets index with existing index (GH#9341) Julia Signell
Add keepdims keyword for da.average (GH#9332) Ruth Comer
Change repr methods to avoid Layer materialization (GH#9289) Richard (Rick) Zamora

Bug Fixes¶

Make sure order kwarg will not crash the astype method (GH#9317) Genevieve Buckley
Fix bug for cumsum on cupy chunked dask arrays (GH#9320) Genevieve Buckley
Match input and output structure in _sample_reduce (GH#9272) Pavithra Eswaramoorthy
Include meta in array serialization (GH#9240) Frédéric BRIOL
Fix Index.memory_usage (GH#9290) James Bourbeau
Fix division calculation in dask.dataframe.io.from_dask_array (GH#9282) Jordan Yap

Documentation¶

How to use kwargs with custom task graphs (GH#9322) Genevieve Buckley
Add note to da.from_array about how the order is not preserved (GH#9346) Julia Signell
Add I/O info for async functions (GH#9326) Logan Norman
Tidy up docs snippet for futures IO functions (GH#9340) Julia Signell
Use consistent variable names for pandas df and Dask ddf in dataframe-groupby.rst (GH#9304) ivojuroro
Switch js-yaml for yaml.js in config converter (GH#9306) Jacob Tomlinson

Maintenance¶

Update da.linalg.solve for SciPy 1.9.0 compatibility (GH#9350) Pavithra Eswaramoorthy
Update test_getitem_avoids_large_chunks_missing (GH#9347) Pavithra Eswaramoorthy
Fix docs title formatting for “Extend sizeof” Doug Davis
Import loop_in_thread fixture in tests (GH#9337) James Bourbeau
Temporarily xfail test_solve_sym_pos (GH#9336) Pavithra Eswaramoorthy
Fix small typo in 10 minutes to Dask page (GH#9329) Shaghayegh
Temporarily pin werkzeug in CI to avoid test suite hanging (GH#9325) James Bourbeau
Add tests for cupy.angle() (GH#9312) Peter Andreas Entschev
Update gpuCI RAPIDS_VER to 22.10 (GH#9314)
Add pandas[test] to test extra (GH#9110) Ben Beasley
Add bokeh and scipy to upstream CI build (GH#9265) James Bourbeau

2022.7.1¶

Released on July 22, 2022

Enhancements¶

Return Dask array if all axes are squeezed (GH#9250) Pavithra Eswaramoorthy
Make cycle reported by toposort shorter (GH#9068) Erik Welch
Unknown chunk slicing - raise informative error (GH#9285) Naty Clementi

Bug Fixes¶

Fix bug in HighLevelGraph.cull (GH#9267) Richard (Rick) Zamora
Sort categories (GH#9264) Pavithra Eswaramoorthy
Use max (instead of sum) for calculating warnsize (GH#9235) Pavithra Eswaramoorthy
Fix bug when filtering on partitioned column with pyarrow (GH#9252) Richard (Rick) Zamora

Documentation¶

Updated repartition documentation to add note about partition_size (GH#9288) Dylan Stewart
Don’t include docs in Array methods, just refer to module docs (GH#9244) Julia Signell
Remove outdated reference to scheduler and worker dashboards (GH#9278) Pavithra Eswaramoorthy
Fix a few typos (GH#9270) Tim Gates
Adds a custom aggregate example using numpy methods (GH#9260) geraninam

Maintenance¶

Add type annotations to dd.from_pandas and dd.from_delayed (GH#9237) Michael Milton
Update calculate_divisions docstring (GH#9275) Tom Augspurger
Update test_plot_multiple for upcoming bokeh release (GH#9261) James Bourbeau
Add typing to common array properties (GH#9255) Illviljan

2022.7.0¶

Released on July 8, 2022

Enhancements¶

Support pathlib.PurePath in normalize_token (GH#9229) Angus Hollands
Add AttributeNotImplementedError for properties so IPython glob search works (GH#9231) Erik Welch
map_overlap: multiple dataframe handling (GH#9145) Fabien Aulaire
Read entrypoints in dask.sizeof (GH#7688) Angus Hollands

Bug Fixes¶

Fix TypeError: 'Serialize' object is not subscriptable when writing parquet dataset with Client(processes=False) (GH#9015) Lucas Miguel Ponce
Correct dtypes when concat with an empty dataframe (GH#9193) Pavithra Eswaramoorthy

Documentation¶

Highlight note about persist (GH#9234) Pavithra Eswaramoorthy
Update release-procedure to include more detail and helpful commands (GH#9215) Julia Signell
Better SEO for Futures and Dask vs. Spark pages (GH#9217) Sarah Charlotte Johnson

Maintenance¶

Use math.prod instead of np.prod on lists, tuples, and iters (GH#9232) crusaderky
Only import IPython if type checking (GH#9230) Florian Jetter
Tougher mypy checks (GH#9206) crusaderky

2022.6.1¶

Released on June 24, 2022

Enhancements¶

Dask in pyodide (GH#9053) Ian Rose
Create dask.utils.show_versions (GH#9144) Sultan Orazbayev
Better error message for unsupported numpy operations on dask.dataframe objects. (GH#9201) Julia Signell
Add allow_rechunk kwarg to dask.array.overlap function (GH#7776) Genevieve Buckley
Add minutes and hours to dask.utils.format_time (GH#9116) Matthew Rocklin
More retries when writing parquet to remote filesystem (GH#9175) Ian Rose

Bug Fixes¶

Timedelta deterministic hashing (GH#9213) Fabien Aulaire
Enum deterministic hashing (GH#9212) Fabien Aulaire
shuffle_group(): avoid converting to arrays (GH#9157) Mads R. B. Kristensen

Deprecations¶

Deprecate extra format_time utility (GH#9184) James Bourbeau

Documentation¶

Better SEO for 10 Minutes to Dask (GH#9182) Sarah Charlotte Johnson
Better SEO for Delayed and Best Practices (GH#9194) Sarah Charlotte Johnson
Include known inconsistency in DataFrame str.split accessor docstring (GH#9177) Richard Pelgrim
Add inconsistencies keyword to derived_from (GH#9192) Richard Pelgrim
Add missing append in delayed best practices example (GH#9202) Ben
Fix indentation in Best Practices (GH#9196) Sarah Charlotte Johnson
Add link to Genevieve Buckley’s blog on chunk sizes (GH#9199) Pavithra Eswaramoorthy
Update to_csv docstring (GH#9094) Sarah Charlotte Johnson

Maintenance¶

Update versioneer: change from using SafeConfigParser to ConfigParser (GH#9205) Thomas A Caswell
Remove ipython hack in CI (GH#9200) crusaderky

2022.6.0¶

Released on June 10, 2022

Enhancements¶

Add feature to show names of layer dependencies in HLG JupyterLab repr (GH#9081) Angelos Omirolis
Add arrow schema extraction dispatch (GH#9169) GALI PREM SAGAR
Add sort_results argument to assert_eq (GH#9130) Pavithra Eswaramoorthy
Add weeks to parse_timedelta (GH#9168) Matthew Rocklin
Warn that cloudpickle is not always deterministic (GH#9148) Pavithra Eswaramoorthy
Switch parquet default engine (GH#9140) Jim Crist-Harif
Use deterministic hashing with _iLocIndexer / _LocIndexer (GH#9108) Fabien Aulaire
Enforce consistent schema in to_parquet pyarrow (GH#9131) Jim Crist-Harif

Bug Fixes¶

Fix pyarrow.StringArray pickle (GH#9170) Jim Crist-Harif
Fix parallel metadata collection in pyarrow engine (GH#9165) Richard (Rick) Zamora
Improve pyarrow partitioning logic (GH#9147) James Bourbeau
pyarrow 8.0 partitioning fix (GH#9143) James Bourbeau

Documentation¶

Better SEO for Installing Dask and Dask DataFrame Best Practices (GH#9178) Sarah Charlotte Johnson
Update logos page in docs (GH#9167) Sarah Charlotte Johnson
Add example using pandas Series to map_partition docstring (GH#9161) Alex-JG3
Update docs theme for rebranding (GH#9160) Sarah Charlotte Johnson
Better SEO for docs on Dask DataFrames (GH#9128) Sarah Charlotte Johnson

Maintenance¶

Remove ensure_file from recommended practice for downstream libraries (GH#9171) Matthew Rocklin
Test round-tripping DataFrame parquet I/O including pyspark (GH#9156) Ian Rose
Try disabling HDF5 locking (GH#9154) Ian Rose
Link best practices to DataFrame-parquet (GH#9150) Tom Augspurger
Fix typo in map_partitions func parameter description (GH#9149) Christopher Akiki
Un-xfail test_groupby_grouper_dispatch (GH#9139) GALI PREM SAGAR
Temporarily import cleanup fixture from distributed (GH#9138) James Bourbeau
Simplify partitioning logic in pyarrow parquet engine (GH#9041) Richard (Rick) Zamora

2022.05.2¶

Released on May 26, 2022

Enhancements¶

Add a dispatch for non-pandas Grouper objects and use it in GroupBy (GH#9074) brandon-b-miller
Error if read_parquet & to_parquet files intersect (GH#9124) Jim Crist-Harif
Visualize task graphs using ipycytoscape (GH#9091) Ian Rose

Documentation¶

Fix various typos (GH#9126) Ryan Russell

Maintenance¶

Fix flaky test_filter_nonpartition_columns (GH#9127) Pavithra Eswaramoorthy
Update gpuCI RAPIDS_VER to 22.08 (GH#9120)
Include conftest.py in sdists (GH#9115) Ben Beasley

2022.05.1¶

Released on May 24, 2022

New Features¶

Add DataFrame.from_dict classmethod (GH#9017) Matthew Powers
Add from_map function to Dask DataFrame (GH#8911) Richard (Rick) Zamora

Enhancements¶

Improve to_parquet error for appended divisions overlap (GH#9102) Jim Crist-Harif
Enabled user-defined process-initializer functions (GH#9087) ParticularMiner
Mention align_dataframes=False option in map_partitions error (GH#9075) Gabe Joseph
Add kwarg enforce_ndim to dask.array.map_blocks() (GH#8865) ParticularMiner
Implement Series.GroupBy.fillna / DataFrame.GroupBy.fillna methods (GH#8869) Pavithra Eswaramoorthy
Allow fillna with Dask DataFrame (GH#8950) Pavithra Eswaramoorthy
Update error message for assignment with 1-d dask array (GH#9036) Pavithra Eswaramoorthy
Collection Protocol (GH#8674) Doug Davis
Patch around pandas ArrowStringArray pickling (GH#9024) Jim Crist-Harif
Band-aid for compute_as_if_collection (GH#8998) Ian Rose
Add p2p shuffle option (GH#8836) Matthew Rocklin

Bug Fixes¶

Fixup column projection with no columns (GH#9106) Jim Crist-Harif
Blockwise cull NumPy dtype (GH#9100) Ian Rose
Fix column-projection bug in from_map (GH#9078) Richard (Rick) Zamora
Prevent nulls in index for non-numeric dtypes (GH#8963) Jorge López
Fix is_monotonic methods for more than 8 partitions (GH#9019) Julia Signell
Handle enumerate and generator inputs to from_map (GH#9066) Richard (Rick) Zamora
Revert is_dask_collection; back to previous implementation (GH#9062) Doug Davis
Fix Blockwise.clone does not handle iterable literal arguments correctly (GH#8979) JSKenyon
Array setitem hardmask (GH#9027) David Hassell
Fix overlapping divisions error on append (GH#8997) Ian Rose

Deprecations¶

Add pre-deprecation warnings for read_parquet kwargs chunksize and aggregate_files (GH#9052) Richard (Rick) Zamora

Documentation¶

Document map_partitions handling of args vs kwargs, usage of partition_info (GH#9084) Charles Blackmon-Luca
Update custom collection documentation (leverage new collection protocol) (GH#9097) Doug Davis
Better SEO for docs on creating and storing Dask DataFrames (GH#9098) Sarah Charlotte Johnson
Clarify chunking in imread docstring (GH#9082) Genevieve Buckley
Rearrange docs TOC (GH#9001) Matthew Rocklin
Corrected map_blocks() docstring for kwarg enforce_ndim (GH#9071) ParticularMiner
Update DataFrame SQL docs references to other libraries (GH#9077) Charles Blackmon-Luca
Update page on creating and storing Dask DataFrames (GH#9025) Sarah Charlotte Johnson

Maintenance¶

Include NUMPY_LICENSE.txt in license files (GH#9113) Ben Beasley
Increase retries when installing nightly pandas (GH#9103) James Bourbeau
Force nightly pyarrow in the upstream build (GH#9095) Joris Van den Bossche
Improve object handling & testing of ensure_unicode (GH#9059) John A Kirkham
Force nightly pyarrow in the upstream build (GH#8993) Joris Van den Bossche
Additional check on is_dask_collection (GH#9054) Doug Davis
Update ensure_bytes (GH#9050) John A Kirkham
Add end of file pre-commit hook (GH#9045) James Bourbeau
Add codespell pre-commit hook (GH#9040) James Bourbeau
Remove the HDFS tests (GH#9039) Jim Crist-Harif
Fix flaky test_reductions_2D (GH#9037) Jim Crist-Harif
Prevent codecov from notifying of failure too soon (GH#9031) Jim Crist-Harif
Only test on Python 3.9 on macos (GH#9029) Jim Crist-Harif
Update to_timedelta default unit (GH#9010) Pavithra Eswaramoorthy

2022.05.0¶

Released on May 2, 2022

Highlights¶

This is a bugfix release for this issue.

Documentation¶

Add highlights section to 2022.04.2 release notes (GH#9012) James Bourbeau

2022.04.2¶

Released on April 29, 2022

Highlights¶

This release includes several deprecations/breaking API changes to dask.dataframe.read_parquet and dask.dataframe.to_parquet:

to_parquet no longer writes _metadata files by default. If you want to write a _metadata file, you can pass in write_metadata_file=True.
read_parquet now defaults to split_row_groups=False, which results in one Dask dataframe partition per parquet file when reading in a parquet dataset. If you’re working with large parquet files you may need to set split_row_groups=True to reduce your partition size.
read_parquet no longer calculates divisions by default. If you require read_parquet to return dataframes with known divisions, please set calculate_divisions=True.
read_parquet has deprecated the gather_statistics keyword argument. Please use the calculate_divisions keyword argument instead.
read_parquet has deprecated the require_extension keyword argument. Please use the parquet_file_extension keyword argument instead.

New Features¶

Add removeprefix and removesuffix as StringMethods (GH#8912) Jorge López

Enhancements¶

Call fs.invalidate_cache in to_parquet (GH#8994) Jim Crist-Harif
Change to_parquet default to write_metadata_file=None (GH#8988) Jim Crist-Harif
Let arg reductions pass keepdims (GH#8926) Julia Signell
Change split_row_groups default to False in read_parquet (GH#8981) Richard (Rick) Zamora
Improve NotImplementedError message for da.reshape (GH#8987) Jim Crist-Harif
Simplify to_parquet compute path (GH#8982) Jim Crist-Harif
Raise an error if you try to use vindex with a Dask object (GH#8945) Julia Signell
Avoid pre_buffer=True when a precache method is specified (GH#8957) Richard (Rick) Zamora
from_dask_array uses blockwise instead of merging graphs (GH#8889) Bryan Weber
Use pre_buffer=True for “pyarrow” Parquet engine (GH#8952) Richard (Rick) Zamora

Bug Fixes¶

Handle dtype=None correctly in da.full (GH#8954) Tom White
Fix dask-sql bug caused by blockwise fusion (GH#8989) Richard (Rick) Zamora
to_parquet errors for non-string column names (GH#8990) Jim Crist-Harif
Make sure da.roll works even if shape is 0 (GH#8925) Julia Signell
Fix recursion error issue with set_index (GH#8967) Paul Hobson
Stringify BlockwiseDepDict mapping values when produces_keys=True (GH#8972) Richard (Rick) Zamora
Use DataFrameIOLayer in DataFrame.from_delayed (GH#8852) Richard (Rick) Zamora
Check that values for the in predicate in read_parquet are correct (GH#8846) Bryan Weber
Fix bug for reduction of zero dimensional arrays (GH#8930) Tom White
Specify dtype when deciding division using np.linspace in read_sql_query (GH#8940) Cheun Hong

Deprecations¶

Deprecate gather_statistics from read_parquet (GH#8992) Richard (Rick) Zamora
Change require_extension to top-level parquet_file_extension read_parquet kwarg (GH#8935) Richard (Rick) Zamora

Documentation¶

Update write_metadata_file discussion in documentation (GH#8995) Richard (Rick) Zamora
Update DataFrame.merge docstring (GH#8966) Pavithra Eswaramoorthy
Added description for parameter align_arrays in array.blockwise() (GH#8977) ParticularMiner
Recommend not to use map_block(drop_axis=...) on chunked axes of an array (GH#8921) ParticularMiner
Add copy button to code snippets in docs (GH#8956) James Bourbeau

Maintenance¶

Pandas 1.5.0 compatibility (GH#8961) Ian Rose
Add pytest-timeout to distributed envs on CI (GH#8986) Julia Signell
Improve read_parquet docstring formatting (GH#8971) Bryan Weber
Remove pytest.warns(None) (GH#8924) Pavithra Eswaramoorthy
Document Python 3.10 as supported (GH#8976) Eray Aslan
parse_timedelta option to enforce explicit unit (GH#8969) crusaderky
mypy compatibility (GH#8854) Paul Hobson
Add a docs page for Dask & Parquet (GH#8899) Jim Crist-Harif
Adds configuration to ignore revs in blame (GH#8933) Bryan Weber

2022.04.1¶

Released on April 15, 2022

New Features¶

Add missing NumPy ufuncs: abs, left_shift, right_shift, positive. (GH#8920) Tom White

Enhancements¶

Avoid collecting parquet metadata in pyarrow when write_metadata_file=False (GH#8906) Richard (Rick) Zamora
Better error for failed wildcard path in dd.read_csv() (fixes #8878) (GH#8908) Roger Filmyer
Return da.Array rather than dd.Series for non-ufunc elementwise functions on dd.Series (GH#8558) Julia Signell
Let get_dummies use meta computation in map_partitions (GH#8898) Julia Signell
Masked scalars input to da.from_array (GH#8895) David Hassell
Raise ValueError in merge_asof for duplicate kwargs (GH#8861) Bryan Weber

Bug Fixes¶

Make is_monotonic work when some partitions are empty (GH#8897) Julia Signell
Fix custom getter in da.from_array when inline_array=False (GH#8903) Ian Rose
Correctly handle dict-specification for rechunk. (GH#8859) Richard
Fix merge_asof: drop index column if left_on == right_on (GH#8874) Gil Forsyth

Deprecations¶

Warn users that engine='auto' will change in future (GH#8907) Jim Crist-Harif
Remove pyarrow-legacy engine from parquet API (GH#8835) Richard (Rick) Zamora

Documentation¶

Add note on missing parameter out for dask.array.dot (GH#8913) Francesco Andreuzzi
Update DataFrame.query docstring (GH#8890) Pavithra Eswaramoorthy

Maintenance¶

Don’t test da.prod on large integer data (GH#8893) Jim Crist-Harif
Add network marks to tests that fail without an internet connection (GH#8881) Paul Hobson
Fix gpuCI GHA version (GH#8891) Charles Blackmon-Luca
xfail/skip some flaky distributed tests (GH#8887) Jim Crist-Harif
Remove unused (deprecated) code from ArrowDatasetEngine (GH#8885) Richard (Rick) Zamora
Add mild typing to common utils functions, part 2 (GH#8867) crusaderky
Documentation of Limitation of sample() (GH#8858) Nadiem Sissouno

2022.04.0¶

Released on April 1, 2022

Note

This is the first release with support for Python 3.10

New Features¶

Add Python 3.10 support (GH#8566) James Bourbeau

Enhancements¶

Add check on dtype.itemsize in order to produce a useful error (GH#8860) Davide Gavio
Add mild typing to common utils functions (GH#8848) Matthew Rocklin
Add sanity checks to divisions setter (GH#8806) Jim Crist-Harif
Use Blockwise and map_partitions for more tasks (GH#8831) Bryan Weber

Bug Fixes¶

Fix dataframe.merge_asof to preserve right_on column (GH#8857) Sarah Charlotte Johnson
Fix “Buffer dtype mismatch” for pandas >= 1.3 on 32bit (GH#8851) Ben Greiner
Fix slicing fusion by altering SubgraphCallable getter (GH#8827) Ian Rose

Deprecations¶

Remove support for PyPy (GH#8863) James Bourbeau
Drop setuptools at runtime (GH#8855) crusaderky
Remove dataframe.tseries.resample.getnanos (GH#8834) Sarah Charlotte Johnson

Documentation¶

Organize diagnostic and performance docs (GH#8871) Naty Clementi
Add image to explain drop_axis option of map_blocks (GH#8868) ParticularMiner

Maintenance¶

Update gpuCI RAPIDS_VER to 22.06 (GH#8828)
Restore test_parquet in http (GH#8850) Bryan Weber
Simplify gpuCI updating workflow (GH#8849) Charles Blackmon-Luca

2022.03.0¶

Released on March 18, 2022

New Features¶

Bag: add implementation for reservoir sampling (GH#7636) Daniel Mesejo-León
Add ma.count to Dask array (GH#8785) David Hassell
Change to_parquet default to compression="snappy" (GH#8814) Jim Crist-Harif
Add weights parameter to dask.array.reduction (GH#8805) David Hassell
Add ddf.compute_current_divisions to get divisions on a sorted index or column (GH#8517) Julia Signell

Enhancements¶

Pass __name__ and __doc__ through on DelayedLeaf (GH#8820) Leo Gao
Raise exception for not implemented merge how option (GH#8818) Naty Clementi
Move Bag.map_partitions to Blockwise (GH#8646) Richard (Rick) Zamora
Improve error messages for malformed config files (GH#8801) Jim Crist-Harif
Revise column-projection optimization to capture common dask-sql patterns (GH#8692) Richard (Rick) Zamora
Useful error for empty divisions (GH#8789) Pavithra Eswaramoorthy
Scipy 1.8.0 compat: copy private classes into dask/array/stats.py (GH#8694) Julia Signell
Raise warning when using multiple types of schedulers where one is distributed (GH#8700) Pedro Silva

Bug Fixes¶

Fix bug in applying != filter in read_parquet (GH#8824) Richard (Rick) Zamora
Fix set_index when directly passed a dask Index (GH#8680) Paul Hobson
Quick fix for unbounded memory usage in tensordot (GH#7980) Genevieve Buckley
If hdf file is empty, don’t fail on meta creation (GH#8809) Julia Signell
Update clone_key("x") to retain prefix (GH#8792) crusaderky
Fix “physical” column bug in pyarrow-based read_parquet (GH#8775) Richard (Rick) Zamora
Fix groupby.shift bug caused by unsorted partitions after shuffle (GH#8782) kori73
Fix serialization bug (GH#8786) Richard (Rick) Zamora

Deprecations¶

Bump diagnostics bokeh dependency to 2.4.2 (GH#8791) Charles Blackmon-Luca
Deprecate bcolz support (GH#8754) Pavithra Eswaramoorthy
Finish making map_overlap default boundary kwarg 'none' (GH#8743) Genevieve Buckley

Documentation¶

Custom collection example docs fix (GH#8807) Doug Davis
Add Series.str, Series.dt, and Series.cat accessors to docs (GH#8757) Sarah Charlotte Johnson
Fix docstring for ddf.compute_current_divisions (GH#8793) Julia Signell
Dashboard docs on /status page (GH#8648) Naty Clementi
Clarify divisions kwarg in repartition docstring (GH#8781) Sarah Charlotte Johnson
Update Docker images to use ghcr.io (GH#8774) Jacob Tomlinson

Maintenance¶

Reduce gpuci pytest parallelism (GH#8826) GALI PREM SAGAR
absolufy-imports - No relative imports - PEP8 (GH#8796) Julia Signell
Tidy up assert_eq calls in array tests (GH#8812) Julia Signell
Avoid pytest.warns(None) (GH#8718) LSturtew
Fix test_describe_empty to work without global -Werror (GH#8291) Michał Górny
Temporarily xfail graphviz tests on windows (GH#8794) Jim Crist-Harif
Use packaging.parse for md5 compatibility (GH#8763) James Bourbeau
Make tokenize work in a FIPS 140-2 environment (GH#8762) Jim Crist-Harif
Label issues and PRs on open with ‘needs triage’ (GH#8761) Julia Signell
Add some extra test coverage (GH#8302) lrjball
Specify action version and change from pull_request_target to pull_request (GH#8767) Julia Signell
Make scheduler kwarg pass through to sub functions in da.assert_eq (GH#8755) Julia Signell

2022.02.1¶

Released on February 25, 2022

New Features¶

Add aggregate functions first and last to dask.dataframe.pivot_table (GH#8649) Knut Nordanger
Add std() support for datetime64 dtype for pandas-like objects (GH#8523) Ben Glossner
Add materialized task counts to HighLevelGraph and Layer html reprs (GH#8589) kori73

Enhancements¶

Do not allow iterating a DataFrameGroupBy (GH#8696) Bryan Weber
Fix missing newline after info() call on empty DataFrame (GH#8727) Naty Clementi
Add groupby.compute as a not implemented method (GH#8734) Dranaxel
Improve multi dataframe join performance (GH#8740) Holden Karau
Include bool type for Index (GH#8732) Naty Clementi
Allow ArrowDatasetEngine subclass to override pandas->arrow conversion also for partitioned write (GH#8741) Joris Van den Bossche
Increase performance of k-diagonal extraction in da.diag() and da.diagonal() (GH#8689) ParticularMiner
Change linspace creation to match numpy when num equal to 0 (GH#8676) Peter
Tokenize dataclasses (GH#8557) Gabe Joseph
Update tokenize to treat dict and kwargs differently (GH#8655) James Bourbeau

Bug Fixes¶

Fix bug in dask.array.roll() for roll-shifts that match the size of the input array (GH#8723) ParticularMiner
Fix for normalize_function dataclass methods (GH#8527) Sarah Charlotte Johnson
Fix rechunking with zero-size-chunks (GH#8703) ParticularMiner
Move creation of sqlalchemy connection for picklability (GH#8745) Julia Signell

Deprecations¶

Drop Python 3.7 (GH#8572) James Bourbeau
Deprecate iteritems (GH#8660) James Bourbeau
Deprecate dataframe.tseries.resample.getnanos (GH#8752) Sarah Charlotte Johnson
Add deprecation warning for pyarrow-legacy engine (GH#8758) Richard (Rick) Zamora

Documentation¶

Update link typos in changelog (GH#8717) James Bourbeau
Clarify dask.visualize docstring (GH#8710) Dranaxel
Update Docker example to use current best practices (GH#8731) Jacob Tomlinson
Update docs to include distributed.Client.preload (GH#8679) Bryan Weber
Document monthly social meeting (GH#8595) Thomas Grainger
Add docs for Gen2 access with RBAC/ACL i.e. security principal (GH#8748) Martin Thøgersen
Use Dask configuration extension from dask-sphinx-theme (GH#8751) Benjamin Zaitlen

Maintenance¶

Unpin coverage in CI (GH#8690) James Bourbeau
Add manual trigger for running test suite (GH#8716) James Bourbeau
Xfail scheduler_HLG_unpack_import; flaky test (GH#8724) Mike McCarty
Temporarily remove scipy upstream CI build (GH#8725) James Bourbeau
Bump pre-release version to be greater than stable releases (GH#8728) Charles Blackmon-Luca
Move custom sort function logic to internal sort_values (GH#8571) Charles Blackmon-Luca
Pin cloudpickle and scipy in docs requirements (GH#8737) Julia Signell
Make the labeler not delete labels, and look for the docs at the right spot (GH#8746) Julia Signell
Fix docs build warnings (GH#8432) Kristopher Overholt
Update test status badge (GH#8747) James Bourbeau
Fix parquet test_pandas_timestamp_overflow_pyarrow test (GH#8733) Joris Van den Bossche
Only run PR builds on changes to relevant files (GH#8756) Charles Blackmon-Luca

2022.02.0¶

Released on February 11, 2022

Note

This is the last release with support for Python 3.7

New Features¶

Add region to to_zarr when using existing array (GH#8590) Chris Roat
Add engine_kwargs support to dask.dataframe.to_sql (GH#8609) Amir Kadivar
Add include_path_column arg to read_json (GH#8603) Bryan Weber
Add expand_dims to Dask array (GH#8687) Tom White

Enhancements¶

Add scheduler option to assert_eq utilities (GH#8610) Xinrong Meng
Fix eye inconsistency with NumPy for dtype=None (GH#8685) Tom White
Fix concatenate inconsistency with NumPy for axis=None (GH#8686) Tom White
Type annotations, part 1 (GH#8295) crusaderky
Really allow any iterable to be passed as a meta (GH#8629) Julia Signell
Use map_partitions (Blockwise) in to_parquet (GH#8487) Richard (Rick) Zamora

Bug Fixes¶

Result of reducing an array should not depend on its chunk-structure (GH#8637) ParticularMiner
Pass place-holder metadata to map_partitions in ACA code path (GH#8643) Richard (Rick) Zamora

Deprecations¶

Deprecate is_monotonic (GH#8653) James Bourbeau
Remove some deprecations (GH#8605) James Bourbeau

Documentation¶

Add Domino Data Lab to Hosted / managed Dask clusters (GH#8675) Ray Bell
Fix inter-linking and remove deprecated function (GH#8715) Julia Signell
Fix imbalanced backticks. (GH#8693) Matthias Bussonnier
Add documentation for high level graph visualization (GH#8483) Genevieve Buckley
Update documentation of ProgressBar out parameter (GH#8604) Pedro Silva
Improve documentation of dask.config.set (GH#8705) crusaderky
Revert mention to mypy among type checkers (GH#8699) crusaderky

Maintenance¶

Update warning handling in get_dummies tests (GH#8651) James Bourbeau
Add a github changelog template (GH#8714) Julia Signell
Update year in LICENSE.txt (GH#8665) David Hoese
Update pre-commit version (GH#8691) James Bourbeau
Include scipy in upstream CI build (GH#8681) James Bourbeau
Temporarily pin scipy < 1.8.0 in CI (GH#8683) James Bourbeau
Pin scipy to less than 1.8.0 in GPU CI (GH#8698) Julia Signell
Avoid pytest.warns(None) in test_multi.py (GH#8678) James Bourbeau
Update GHA concurrent job cancellation (GH#8652) James Bourbeau
Make test__get_paths robust to site.PREFIXES being set (GH#8644) James Bourbeau
Bump gpuCI PYTHON_VER to 3.9 (GH#8642) Charles Blackmon-Luca

2022.01.1¶

Released on January 28, 2022

New Features¶

Add dask.dataframe.series.view() (GH#8533) Pavithra Eswaramoorthy

Enhancements¶

Update tz for fastparquet + pandas 1.4.0 (GH#8626) Martin Durant
Cleaning up misc tests for pandas compat (GH#8623) Julia Signell
Moving to SQLAlchemy >= 1.4 (GH#8158) McToel
Pandas compat: Filter sparse warnings (GH#8621) Julia Signell
Fail if meta is not a pandas object (GH#8563) Julia Signell
Use fsspec.parquet module for better remote-storage read_parquet performance (GH#8339) Richard (Rick) Zamora
Move DataFrame ACA aggregations to HLG (GH#8468) Richard (Rick) Zamora
Add optional information about originating function call in DataFrameIOLayer (GH#8453) Richard (Rick) Zamora
Blockwise array creation redux (GH#7417) Ian Rose
Refactor config default search path retrieval (GH#8573) James Bourbeau
Add optimize_graph flag to Bag.to_dataframe function (GH#8486) Maxim Lippeveld
Make sure that delayed output operations still return lists of paths (GH#8498) Julia Signell
Pandas compat: Fix to_frame name to not pass None (GH#8554) Julia Signell
Pandas compat: Fix axis=None warning (GH#8555) Julia Signell
Expand Dask YAML config search directories (GH#8531) abergou

Bug Fixes¶

Fix groupby.cumsum with series grouped by index (GH#8588) Julia Signell
Fix derived_from for pandas methods (GH#8612) Thomas J. Fan
Enforce boolean ascending for sort_values (GH#8440) Charles Blackmon-Luca
Fix parsing of __setitem__ indices (GH#8601) David Hassell
Avoid divide by zero in slicing (GH#8597) Doug Davis

Deprecations¶

Downgrade meta error in (GH#8563) to warning (GH#8628) Julia Signell
Pandas compat: Deprecate append when pandas >= 1.4.0 (GH#8617) Julia Signell

Documentation¶

Replace outdated columns argument with meta in DataFrame constructor (GH#8614) kori73
Refactor deploying docs (GH#8602) Jacob Tomlinson

Maintenance¶

Pin coverage in CI (GH#8631) James Bourbeau
Move cached_cumsum imports to be from dask.utils (GH#8606) James Bourbeau
Update gpuCI RAPIDS_VER to 22.04 (GH#8600)
Update docstring for from_delayed function (GH#8576) Kirito1397
Handle plot_width / plot_height deprecations (GH#8544) Bryan Van de Ven
Remove unnecessary pyyaml importorskip (GH#8562) James Bourbeau
Specify scheduler in DataFrame assert_eq (GH#8559) Gabe Joseph

2022.01.0¶

Released on January 14, 2022

New Features¶

Add groupby.shift method (GH#8522) kori73
Add DataFrame.nunique (GH#8479) Sarah Charlotte Johnson
Add da.ndim to match np.ndim (GH#8502) Julia Signell

Enhancements¶

Only show percentile interpolation= keyword warning if NumPy version >= 1.22 (GH#8564) Julia Signell
Raise PerformanceWarning when limit and "array.slicing.split-large-chunks" are None (GH#8511) Julia Signell
Define normalize_seq function at import time (GH#8521) Illviljan
Ensure that divisions are always tuples (GH#8393) Charles Blackmon-Luca
Allow a callable scheduler for bag.groupby (GH#8492) Julia Signell
Save Zarr arrays with dask-on-ray scheduler (GH#8472) TnTo
Make byte blocks more even in read_bytes (GH#8459) Martin Durant
Improved the efficiency of matmul() by completely removing concatenation (GH#8423) ParticularMiner
Limit max chunk size when reshaping dask arrays (GH#8124) Genevieve Buckley
Changes for fastparquet superthrift (GH#8470) Martin Durant

Bug Fixes¶

Fix boolean indices in array assignment (GH#8538) David Hassell
Detect default dtype on array-likes (GH#8501) aeisenbarth
Fix optimize_blockwise bug for duplicate dependency names (GH#8542) Richard (Rick) Zamora
Update warnings for DataFrame.GroupBy.apply and transform (GH#8507) Sarah Charlotte Johnson
Track HLG layer name in Delayed (GH#8452) Gabe Joseph
Fix single item nanmin and nanmax reductions (GH#8484) Julia Signell
Make read_csv with comment kwarg work even if there is a comment in the header (GH#8433) Julia Signell

Deprecations¶

Replace interpolation with method and method with internal_method (GH#8525) Julia Signell
Remove daily stock demo utility (GH#8477) James Bourbeau

Documentation¶

Add a join example in docs that can be run with copy/paste (GH#8520) kori73
Mention dashboard link in config (GH#8510) Ray Bell
Fix changelog section hyperlinks (GH#8534) Aneesh Nema
Hyphenate “single-machine scheduler” for consistency (GH#8519) Deepyaman Datta
Normalize whitespace in doctests in slicing.py (GH#8512) Maren Westermann
Best practices storage line typo (GH#8529) Michael Delgado
Update figures (GH#8401) Sarah Charlotte Johnson
Remove pyarrow-only reference from split_row_groups in read_parquet docstring (GH#8490) Naty Clementi

Maintenance¶

Remove obsolete LocalFileSystem tests that fail for fsspec>=2022.1.0 (GH#8565) Richard (Rick) Zamora
Tweak: “RuntimeWarning: invalid value encountered in reciprocal” (GH#8561) crusaderky
Fix skipna=None for DataFrame.sem (GH#8556) Julia Signell
Fix PANDAS_GT_140 (GH#8552) Julia Signell
Collections with HLG must always implement __dask_layers__ (GH#8548) crusaderky
Work around race condition in import llvmlite (GH#8550) crusaderky
Set a minimum version for pyyaml (GH#8545) Gaurav Sheni
Adding nodefaults to environments to fix tiledb + mac issue (GH#8505) Julia Signell
Set ceiling for setuptools (GH#8509) Julia Signell
Add workflow / recipe to generate Dask nightlies (GH#8469) Charles Blackmon-Luca
Bump gpuCI CUDA_VER to 11.5 (GH#8489) Charles Blackmon-Luca

2021.12.0¶

Released on December 10, 2021

New Features¶

Add Series and Index is_monotonic* methods (GH#8304) Daniel Mesejo-León

Enhancements¶

Blockwise map_partitions with partition_info (GH#8310) Gabe Joseph
Better error message for length of array with unknown chunk sizes (GH#8436) Doug Davis
Use by instead of index internally on the Groupby class (GH#8441) Julia Signell
Allow custom sort functions for sort_values (GH#8345) Charles Blackmon-Luca
Add warning to read_parquet when statistics and partitions are misaligned (GH#8416) Richard (Rick) Zamora
Support where argument in ufuncs (GH#8253) mihir
Make visualize more consistent with compute (GH#8328) JSKenyon

Bug Fixes¶

Fix map_blocks not using own arguments in name generation (GH#8462) David Hoese
Fix for index error with reading empty parquet file (GH#8410) Sarah Charlotte Johnson
Fix nullable-dtype error when writing partitioned parquet data (GH#8400) Richard (Rick) Zamora
Fix CSV header bug (GH#8413) Richard (Rick) Zamora
Fix empty chunk causes exception in nanmin/nanmax (GH#8375) Boaz Mohar

Deprecations¶

Deprecate token keyword argument to map_blocks (GH#8464) James Bourbeau
Deprecation warning for default value of boundary kwarg in map_overlap (GH#8397) Genevieve Buckley

Documentation¶

Clarify block_info documentation (GH#8425) Genevieve Buckley
Output from alt text sprint (GH#8456) Sarah Charlotte Johnson
Update talks and presentations (GH#8370) Naty Clementi
Update Anaconda link in “Paid support” section of docs (GH#8427) Martin Durant
Fixed broken dask-gateway link in ecosystem.rst (GH#8424) ofirr
Fix CuPy doctest error (GH#8412) Genevieve Buckley

Maintenance¶

Bump Bokeh min version to 2.1.1 (GH#8431) Bryan Van de Ven
Fix following fsspec=2021.11.1 release (GH#8428) Martin Durant
Add dask/ml.py to pytest exclude list (GH#8414) Genevieve Buckley
Update gpuCI RAPIDS_VER to 22.02 (GH#8394)
Unpin graphviz and improve package management in environment-3.7 (GH#8411) Julia Signell

2021.11.2¶

Released on November 19, 2021

Only run gpuCI bump script daily (GH#8404) Charles Blackmon-Luca
Actually ignore index when asked in assert_eq (GH#8396) Gabe Joseph
Ensure single-partition join divisions is tuple (GH#8389) Charles Blackmon-Luca
Try to make divisions behavior clearer (GH#8379) Julia Signell
Fix typo in set_index partition_size parameter description (GH#8384) FredericOdermatt
Use blockwise in single_partition_join (GH#8341) Gabe Joseph
Use more explicit keyword arguments (GH#8354) Boaz Mohar
Fix .loc of DataFrame with nullable boolean dtype (GH#8368) Marco Rossi
Parameterize shuffle implementation in tests (GH#8250) Ian Rose
Remove some doc build warnings (GH#8369) Boaz Mohar
Include properties in array API docs (GH#8356) Julia Signell
Fix Zarr for upstream (GH#8367) Julia Signell
Pin graphviz to avoid issue with windows and Python 3.7 (GH#8365) Julia Signell
Import graphviz.Digraph from top of module, not from dot (GH#8363) Julia Signell

2021.11.1¶

Released on November 8, 2021

Patch release to update distributed dependency to version 2021.11.1.

2021.11.0¶

Released on November 5, 2021

Fix required_extension behavior in read_parquet (GH#8351) Richard (Rick) Zamora
Add align_dataframes to map_partitions to broadcast a dataframe passed as an arg (GH#6628) Julia Signell
Better handling for arrays/series of keys in dask.dataframe.loc (GH#8254) Julia Signell
Point users to Discourse (GH#8332) Ian Rose
Add name_function option to to_parquet (GH#7682) Matthew Powers
Get rid of environment-latest.yml and update to Python 3.9 (GH#8275) Julia Signell
Require newer s3fs in CI (GH#8336) James Bourbeau
Groupby Rolling (GH#8176) Julia Signell
Add more ordering diagnostics to dask.visualize (GH#7992) Erik Welch
Use HighLevelGraph optimizations for delayed (GH#8316) Ian Rose
demo_tuples produces malformed HighLevelGraph (GH#8325) crusaderky
Dask calendar should show events in local time (GH#8312) Genevieve Buckley
Fix flaky test_interrupt (GH#8314) crusaderky
Deprecate AxisError (GH#8305) crusaderky
Fix name of cuDF in extension documentation. (GH#8311) Vyas Ramasubramani
Add single eq operator (=) to parquet filters (GH#8300) Ayush Dattagupta
Improve support for Spark output in read_parquet (GH#8274) Richard (Rick) Zamora
Add dask.ml module (GH#6384) Matthew Rocklin
CI fixups (GH#8298) James Bourbeau
Make slice errors match NumPy (GH#8248) Julia Signell
Fix API docs misrendering with new sphinx theme (GH#8296) Julia Signell
Replace block property with blockview for array-like operations on blocks (GH#8242) Davis Bennett
Deprecate file_path and make it possible to save from within a notebook (GH#8283) Julia Signell

2021.10.0¶

Released on October 22, 2021

da.store to create well-formed HighLevelGraph (GH#8261) crusaderky
CI: force nightly pyarrow in the upstream build (GH#8281) Joris Van den Bossche
Remove chest (GH#8279) James Bourbeau
Skip doctests if optional dependencies are not installed (GH#8258) Genevieve Buckley
Update tmpdir and tmpfile context manager docstrings (GH#8270) Daniel Mesejo-León
Unregister callbacks in doctests (GH#8276) James Bourbeau
Fix typo in docs (GH#8277) JoranDox
Stale label GitHub action (GH#8244) Genevieve Buckley
Client-shutdown method appears twice (GH#8273) German Shiklov
Add pre-commit to test requirements (GH#8257) Genevieve Buckley
Refactor read_metadata in fastparquet engine (GH#8092) Richard (Rick) Zamora
Support Path objects in from_zarr (GH#8266) Samuel Gaist
Make nested redirects work (GH#8272) Julia Signell
Set memory_usage to True if verbose is True in info (GH#8222) Kinshuk Dua
Remove individual API doc pages from sphinx toctree (GH#8238) James Bourbeau
Ignore whitespace in gufunc signature (GH#8267) James Bourbeau
Add workflow to update gpuCI (GH#8215) Charles Blackmon-Luca
DataFrame.head shouldn’t warn when there’s one partition (GH#8091) Pankaj Patil
Ignore arrow doctests if pyarrow not installed (GH#8256) Genevieve Buckley
Fix debugging.html redirect (GH#8251) James Bourbeau
Fix null sorting for single partition dataframes (GH#8225) Charles Blackmon-Luca
Fix setup.html redirect (GH#8249) Florian Jetter
Run pyupgrade in CI (GH#8246) crusaderky
Fix label typo in upstream CI build (GH#8237) James Bourbeau
Add support for “dependent” columns in DataFrame.assign (GH#8086) Suriya Senthilkumar
Add NumPy array of Dask keys to Array (GH#7922) Davis Bennett
Remove unnecessary dask.multiprocessing import in docs (GH#8240) Ray Bell
Adjust retrieving _max_workers from Executor (GH#8228) John A Kirkham
Update function signatures in delayed best practices docs (GH#8231) Vũ Trung Đức
Docs reorganization (GH#7984) Julia Signell
Fix df.quantile on all missing data (GH#8129) Julia Signell
Add tokenize.ensure-deterministic config option (GH#7413) Hristo Georgiev
Use inclusive rather than closed with pandas>=1.4.0 and pd.date_range (GH#8213) Julia Signell
Add dask-gateway, Coiled, and Saturn-Cloud to list of Dask setup tools (GH#7814) Kristopher Overholt
Ensure existing futures get passed as deps when serializing HighLevelGraph layers (GH#8199) Jim Crist-Harif
Make sure that the divisions of the single partition merge is left (GH#8162) Julia Signell
Refactor read_metadata in pyarrow parquet engines (GH#8072) Richard (Rick) Zamora
Support negative drop_axis in map_blocks and map_overlap (GH#8192) Gregory R. Lee
Fix upstream tests (GH#8205) Julia Signell
Add support for scalar item assignment by Series (GH#8195) Charles Blackmon-Luca
Add some basic examples to doc strings on dask.bag all, any, count methods (GH#7630) Nathan Danielsen
Don’t have upstream report depend on commit message (GH#8202) James Bourbeau
Ensure upstream CI cron job runs (GH#8200) James Bourbeau
Use pytest.param to properly label param-specific GPU tests (GH#8197) Charles Blackmon-Luca
Add test_set_index to tests run on gpuCI (GH#8198) Charles Blackmon-Luca
Suppress tmpfile OSError (GH#8191) James Bourbeau
Use s.isna instead of pd.isna(s) in set_partitions_pre (fix cudf CI) (GH#8193) Charles Blackmon-Luca
Open an issue for test-upstream failures (GH#8067) Wallace Reis
Fix to_parquet bug in call to pyarrow.parquet.read_metadata (GH#8186) Richard (Rick) Zamora
Add handling for null values in sort_values (GH#8167) Charles Blackmon-Luca
Bump RAPIDS_VER for gpuCI (GH#8184) Charles Blackmon-Luca
Dispatch walks MRO for lazily registered handlers (GH#8185) Jim Crist-Harif
Configure SSHCluster instructions (GH#8181) Ray Bell
Preserve HighLevelGraphs in DataFrame.from_delayed (GH#8174) Gabe Joseph
Deprecate inplace argument for Dask series renaming (GH#8136) Marcel Coetzee
Fix rolling for compatibility with pandas > 1.3.0 (GH#8150) Julia Signell
Raise error when setitem on unknown chunks (GH#8166) Julia Signell
Include divisions when doing Index.to_series (GH#8165) Julia Signell
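
The tokenize.ensure-deterministic option from the list above can be enabled through dask.config.set. The sketch below only shows how to turn the option on and confirms that simple inputs still produce stable tokens. The input list is an arbitrary example, and the behaviour for objects that cannot be tokenized deterministically is not shown because it may differ between releases.

```python
import dask
from dask.base import tokenize

# Inside the context manager, tokenization is asked to raise an
# error rather than fall back to a non-deterministic token.
with dask.config.set({"tokenize.ensure-deterministic": True}):
    flag_inside = dask.config.get("tokenize.ensure-deterministic")
    token_a = tokenize([1, 2, 3])
    token_b = tokenize([1, 2, 3])
```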

2021.09.1¶

Released on September 21, 2021

Fix groupby for future pandas (GH#8151) Julia Signell
Remove warning filters in tests that are no longer needed (GH#8155) Julia Signell
Add link to diagnostic visualize function in local diagnostic docs (GH#8157) David Hoese
Add datetime_is_numeric to dataframe.describe (GH#7719) Julia Signell
Remove references to pd.Int64Index in anticipation of deprecation (GH#8144) Julia Signell
Use loc if needed for series __getitem__ (GH#7953) Julia Signell
Specifically ignore warnings on mean for empty slices (GH#8125) Julia Signell
Skip groupby nunique test for pandas >= 1.3.3 (GH#8142) Julia Signell
Implement ascending arg for sort_values (GH#8130) Charles Blackmon-Luca
Replace operator.getitem (GH#8015) Naty Clementi
Deprecate zero_broadcast_dimensions and homogeneous_deepmap (GH#8134) SnkSynthesis
Add error if drop_index is negative (GH#8064) neel iyer
Allow scheduler to be an Executor (GH#8112) John A Kirkham
Handle asarray/asanyarray cases where like is a dask.Array (GH#8128) Peter Andreas Entschev
Fix index_col duplication if index_col is type str (GH#7661) McToel
Add dtype and order to asarray and asanyarray definitions (GH#8106) Julia Signell
Deprecate dask.dataframe.Series.__contains__ (GH#7914) Julia Signell
Fix edge case with like-arrays in _wrapped_qr (GH#8122) Peter Andreas Entschev
Deprecate boundary_slice kwarg: kind for pandas compat (GH#8037) Julia Signell

2021.09.0¶

Released on September 3, 2021

Fewer open files (GH#7303) Julia Signell
Add FileNotFound to expected http errors (GH#8109) Martin Durant
Add DataFrame.sort_values to API docs (GH#8107) Benjamin Zaitlen
Change to dask.order: be more eager at times (GH#7929) Erik Welch
Add pytest color to CI (GH#8090) James Bourbeau
FIX: make_people works with processes scheduler (GH#8103) Dahn
Add deep param to DataFrame copy method and restrict it to False (GH#8068) João Paulo Lacerda
Fix typo in configuration docs (GH#8104) Robert Hales
Update formatting in DataFrame.query docstring (GH#8100) James Bourbeau
Un-xfail sparse tests for 0.13.0 release (GH#8102) James Bourbeau
Add axes property to DataFrame and Series (GH#8069) Jordan Jensen
Add CuPy support in da.unique (values only) (GH#8021) Peter Andreas Entschev
Unit tests for sparse.zeros_like (xfailed) (GH#8093) crusaderky
Add explicit like kwarg support to array creation functions (GH#8054) Peter Andreas Entschev
Separate Array and DataFrame mindeps builds (GH#8079) James Bourbeau
Fork out percentile_dispatch to dask.array (GH#8083) GALI PREM SAGAR
Ensure filepath exists in to_parquet (GH#8057) James Bourbeau
Update scheduler plugin usage in test_scheduler_highlevel_graph_unpack_import (GH#8080) James Bourbeau
Add DataFrame.shuffle to API docs (GH#8076) Martin Fleischmann
Order requirements alphabetically (GH#8073) John A Kirkham

2021.08.1¶

Released on August 20, 2021

Add ignore_metadata_file option to read_parquet (pyarrow-dataset and fastparquet support only) (GH#8034) Richard (Rick) Zamora
Add reference to pytest-xdist in dev docs (GH#8066) Julia Signell
Include tz in meta from to_datetime (GH#8000) Julia Signell
CI Infra Docs (GH#7985) Benjamin Zaitlen
Include invalid DataFrame key in assert_eq check (GH#8061) James Bourbeau
Use __class__ when creating DataFrames (GH#8053) Mads R. B. Kristensen
Use development version of distributed in gpuCI build (GH#7976) James Bourbeau
Ignore whitespace in gufunc signature (GH#8049) James Bourbeau
Move pandas import and percentile dispatch refactor (GH#8055) GALI PREM SAGAR
Add colors to represent high level layer types (GH#7974) Freyam Mehta
Upstream instance fix (GH#8060) Jacob Tomlinson
Add dask.widgets and migrate HTML reprs to jinja2 (GH#8019) Jacob Tomlinson
Remove wrap_func_like_safe, not required with NumPy >= 1.17 (GH#8052) Peter Andreas Entschev
Fix threaded scheduler memory backpressure regression (GH#8040) David Hoese
Add percentile dispatch (GH#8029) GALI PREM SAGAR
Use a publicly documented attribute obj in groupby rather than private _selected_obj (GH#8038) GALI PREM SAGAR
Specify module to import rechunk from (GH#8039) Illviljan
Use dict to store data for {nan,}arg{min,max} in certain cases (GH#8014) Peter Andreas Entschev
Fix blocksize description formatting in read_pandas (GH#8047) Louis Maddox
Fix “point” -> “pointers” typo in docs (GH#8043) David Chudzicki

2021.08.0¶

Released on August 13, 2021

Fix to_orc delayed compute behavior (GH#8035) Richard (Rick) Zamora
Don’t convert to low-level task graph in compute_as_if_collection (GH#7969) James Bourbeau
Fix multifile read for hdf (GH#8033) Julia Signell
Resolve warning in distributed tests (GH#8025) James Bourbeau
Update to_orc collection name (GH#8024) James Bourbeau
Resolve skipfooter problem (GH#7855) Ross
Raise NotImplementedError for non-indexable arg passed to to_datetime (GH#7989) Doug Davis
Ensure we error on warnings from distributed (GH#8002) James Bourbeau
Add dict format to the to_bag accessor of DataFrame (GH#7932) gurunath
Delayed docs indirect dependencies (GH#8016) aa1371
Add tooltips to graphviz high-level graphs (GH#7973) Freyam Mehta
Close 2021 User Survey (GH#8007) Julia Signell
Reorganize CuPy tests into multiple files (GH#8013) Peter Andreas Entschev
Refactor and Expand Dask-Dataframe ORC API (GH#7756) Richard (Rick) Zamora
Don’t enforce columns if enforce=False (GH#7916) Julia Signell
Fix map_overlap trimming behavior when drop_axis is not None (GH#7894) Gregory R. Lee
Mark gpuCI CuPy test as flaky (GH#7994) Peter Andreas Entschev
Avoid using Delayed in to_csv and to_parquet (GH#7968) Matthew Rocklin
Removed redundant check_dtypes (GH#7952) gurunath
Use pytest.warns instead of raises for checking parquet engine deprecation (GH#7993) Joris Van den Bossche
Bump RAPIDS_VER in gpuCI to 21.10 (GH#7991) Charles Blackmon-Luca
Add back pyarrow-legacy test coverage for pyarrow>=5 (GH#7988) Richard (Rick) Zamora
Allow pyarrow>=5 in to_parquet and read_parquet (GH#7967) Richard (Rick) Zamora
Skip CuPy tests requiring NEP-35 when NumPy < 1.20 is available (GH#7982) Peter Andreas Entschev
Add tail and head to SeriesGroupby (GH#7935) Daniel Mesejo-León
Update Zoom link for monthly meeting (GH#7979) James Bourbeau
Add gpuCI build script (GH#7966) Charles Blackmon-Luca
Deprecate daily_stock utility (GH#7949) James Bourbeau
Add distributed.nanny to configuration reference docs (GH#7955) James Bourbeau
Require NumPy 1.18+ & Pandas 1.0+ (GH#7939) John A Kirkham

2021.07.2¶

Released on July 30, 2021

Note

This is the last release with support for NumPy 1.17 and pandas 0.25. Beginning with the next release, NumPy 1.18 and pandas 1.0 will be the minimum supported versions.

Add dask.array SVG to the HTML Repr (GH#7886) Freyam Mehta
Avoid use of Delayed in to_parquet (GH#7958) Matthew Rocklin
Temporarily pin pyarrow<5 in CI (GH#7960) James Bourbeau
Add deprecation warning for top-level ucx and rmm config values (GH#7956) James Bourbeau
Remove skips from doctests (4 of 6) (GH#7865) Zhengnan Zhao
Remove skips from doctests (5 of 6) (GH#7864) Zhengnan Zhao
Add missing prepend/append functionality to da.diff (GH#7946) Peter Andreas Entschev
Change graphviz font family to sans (GH#7931) Freyam Mehta
Fix read-csv name - when path is different, use different name for task (GH#7942) Julia Signell
Update configuration reference for ucx and rmm changes (GH#7943) James Bourbeau
Add meta support to __setitem__ (GH#7940) Peter Andreas Entschev
NEP-35 support for slice_with_int_dask_array (GH#7927) Peter Andreas Entschev
Unpin fastparquet in CI (GH#7928) James Bourbeau
Remove skips from doctests (3 of 6) (GH#7872) Zhengnan Zhao
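
To illustrate the prepend/append support in da.diff mentioned in the list above, here is a small sketch that compares against np.diff. The input values and chunk size are invented for the example.

```python
import numpy as np
import dask.array as da

x_np = np.array([1, 3, 6, 10])
x_da = da.from_array(x_np, chunks=2)

# prepend=0 puts a 0 in front of the data before differencing,
# so the output keeps the same length as the input.
diff_da = da.diff(x_da, prepend=0).compute()
diff_np = np.diff(x_np, prepend=0)
```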

2021.07.1¶

Released on July 23, 2021

Make array assert_eq check dtype (GH#7903) Julia Signell
Remove skips from doctests (6 of 6) (GH#7863) Zhengnan Zhao
Remove experimental feature warning from actors docs (GH#7925) Matthew Rocklin
Remove skips from doctests (2 of 6) (GH#7873) Zhengnan Zhao
Separate out Array and Bag API (GH#7917) Julia Signell
Implement lazy Array.__iter__ (GH#7905) Julia Signell
Clean up places where we inadvertently iterate over arrays (GH#7913) Julia Signell
Add numeric_only kwarg to DataFrame reductions (GH#7831) Julia Signell
Add pytest marker for GPU tests (GH#7876) Charles Blackmon-Luca
Add support for histogram2d in dask.array (GH#7827) Doug Davis
Remove skips from doctests (1 of 6) (GH#7874) Zhengnan Zhao
Add node size scaling to the Graphviz output for the high level graphs (GH#7869) Freyam Mehta
Update old Bokeh links (GH#7915) Bryan Van de Ven
Temporarily pin fastparquet in CI (GH#7907) James Bourbeau
Add dask.array import to progress bar docs (GH#7910) Fabian Gebhart
Use separate files for each DataFrame API function and method (GH#7890) Julia Signell
Fix pyarrow-dataset ordering bug (GH#7902) Richard (Rick) Zamora
Generalize unique aggregate (GH#7892) GALI PREM SAGAR
Raise NotImplementedError when using pd.Grouper (GH#7857) Ruben van de Geer
Add aggregate_files argument to enable multi-file partitions in read_parquet (GH#7557) Richard (Rick) Zamora
Un-xfail test_daily_stock (GH#7895) James Bourbeau
Update access configuration docs (GH#7837) Naty Clementi
Use packaging for version comparisons (GH#7820) Elliott Sales de Andrade
Handle infinite loops in merge_asof (GH#7842) gerrymanoim

2021.07.0¶

Released on July 9, 2021

Include fastparquet in upstream CI build (GH#7884) James Bourbeau
Blockwise: handle non-string constant dependencies (GH#7849) Mads R. B. Kristensen
fastparquet now supports new time types, including ns precision (GH#7880) Martin Durant
Avoid ParquetDataset API when appending in ArrowDatasetEngine (GH#7544) Richard (Rick) Zamora
Add retry logic to test_shuffle_priority (GH#7879) Richard (Rick) Zamora
Use strict channel priority in CI (GH#7878) James Bourbeau
Support nested dask.distributed imports (GH#7866) Matthew Rocklin
Should check module name only, not the entire directory filepath (GH#7856) Genevieve Buckley
Updates due to https://github.com/dask/fastparquet/pull/623 (GH#7875) Martin Durant
da.eye fix for chunks=-1 (GH#7854) Naty Clementi
Temporarily xfail test_daily_stock (GH#7858) James Bourbeau
Set priority annotations in SimpleShuffleLayer (GH#7846) Richard (Rick) Zamora
Blockwise: stringify constant key inputs (GH#7838) Mads R. B. Kristensen
Allow mixing dask and numpy arrays in @guvectorize (GH#6863) Julia Signell
Don’t sample dict result of a shuffle group when calculating its size (GH#7834) Florian Jetter
Fix scipy tests (GH#7841) Julia Signell
Deterministically tokenize datetime.date (GH#7836) James Bourbeau
Add sample_rows to read_csv-like (GH#7825) Martin Durant
Fix typo in config.deserialize docstring (GH#7830) Geoffrey Lentner
Remove warning filter in test_dataframe_picklable (GH#7822) James Bourbeau
Improvements to histogramdd (for handling inputs that are sequences-of-arrays). (GH#7634) Doug Davis
Make PY_VERSION private (GH#7824) James Bourbeau

2021.06.2¶

Released on June 22, 2021

layers.py compare parts_out with set(self.parts_out) (GH#7787) Genevieve Buckley
Make check_meta understand pandas dtypes better (GH#7813) Julia Signell
Remove “Educational Resources” doc page (GH#7818) James Bourbeau

2021.06.1¶

Released on June 18, 2021

Replace funding page with ‘Supported By’ section on dask.org (GH#7817) James Bourbeau
Add initial deprecation utilities (GH#7810) James Bourbeau
Enforce dtype conservation in ufuncs that explicitly use dtype= (GH#7808) Doug Davis
Add Coiled to list of paid support organizations (GH#7811) Kristopher Overholt
Small tweaks to the HTML repr for Layer & HighLevelGraph (GH#7812) Genevieve Buckley
Add dark mode support to HLG HTML repr (GH#7809) Jacob Tomlinson
Remove compatibility entries for old distributed (GH#7801) Elliott Sales de Andrade
Implementation of HTML repr for HighLevelGraph layers (GH#7763) Genevieve Buckley
Update default blockwise token to avoid DataFrame column name clash (GH#6546) James Bourbeau
Use dispatch concat for merge_asof (GH#7806) Julia Signell
Fix upstream freq tests (GH#7795) Julia Signell
Use more context managers from the standard library (GH#7796) James Bourbeau
Simplify skips in parquet tests (GH#7802) Elliott Sales de Andrade
Remove check for outdated bokeh (GH#7804) Elliott Sales de Andrade
More test coverage uploads (GH#7799) James Bourbeau
Remove ImportError catching from dask/__init__.py (GH#7797) James Bourbeau
Allow DataFrame.join() to take a list of DataFrames to merge with (GH#7578) Krishan Bhasin
Fix maximum recursion depth exception in dask.array.linspace (GH#7667) Daniel Mesejo-León
Fix docs links (GH#7794) Julia Signell
Initial da.select() implementation and test (GH#7760) Gabriel Miretti
Layers must implement get_output_keys method (GH#7790) Genevieve Buckley
Don’t include or expect freq in divisions (GH#7785) Julia Signell
A HighLevelGraph abstract layer for map_overlap (GH#7595) Genevieve Buckley
Always include kwarg name in drop (GH#7784) Julia Signell
Only rechunk for median if needed (GH#7782) Julia Signell
Add add_(prefix|suffix) to DataFrame and Series (GH#7745) tsuga
Move read_hdf to Blockwise (GH#7625) Richard (Rick) Zamora
Make Layer.get_output_keys officially an abstract method (GH#7775) Genevieve Buckley
Non-dask-arrays and broadcasting in ravel_multi_index (GH#7594) Gabe Joseph
Fix for paths ending with “/” in parquet overwrite (GH#7773) Martin Durant
Fixing calling .visualize() with filename=None (GH#7740) Freyam Mehta
Generate unique names for SubgraphCallable (GH#7637) Bruce Merry
Pin fsspec to 2021.5.0 in CI (GH#7771) James Bourbeau
Evaluate graph lazily if meta is provided in from_delayed (GH#7769) Florian Jetter
Add meta support for DatetimeTZDtype (GH#7627) gerrymanoim
Add dispatch label to automatic PR labeler (GH#7701) James Bourbeau
Fix HDFS tests (GH#7752) Julia Signell

2021.06.0¶

Released on June 4, 2021

Remove abstract tokens from graph keys in rewrite_blockwise (GH#7721) Richard (Rick) Zamora
Ensure correct column order in csv project_columns (GH#7761) Richard (Rick) Zamora
Renamed inner loop variables to avoid duplication (GH#7741) Boaz Mohar
Do not return delayed object from to_zarr (GH#7738) Chris Roat
Array: correct number of outputs in apply_gufunc (GH#7669) Gabe Joseph
Rewrite da.fromfunction with da.blockwise (GH#7704) John A Kirkham
Rename make_meta_util to make_meta (GH#7743) GALI PREM SAGAR
Repartition before shuffle if the requested partitions are less than input partitions (GH#7715) Vibhu Jawa
Blockwise: handle constant key inputs (GH#7734) Mads R. B. Kristensen
Added raise to apply_gufunc (GH#7744) Boaz Mohar
Show failing tests summary in CI (GH#7735) Genevieve Buckley
sizeof sets in Python 3.9 (GH#7739) Mads R. B. Kristensen
Warn if using pandas datetimelike string in dataframe.__getitem__ (GH#7749) Julia Signell
Highlight the client.dashboard_link (GH#7747) Genevieve Buckley
Easier link for subscribing to the Google calendar (GH#7733) Genevieve Buckley
Automatically show graph visualization in Jupyter notebooks (GH#7716) Genevieve Buckley
Add autofunction for unify_chunks in API docs (GH#7730) James Bourbeau

2021.05.1¶

Released on May 28, 2021

Pandas compatibility (GH#7712) Julia Signell
Fix optimize_dataframe_getitem bug (GH#7698) Richard (Rick) Zamora
Update make_meta import in docs (GH#7713) Benjamin Zaitlen
Implement da.searchsorted (GH#7696) Tom White
Fix format string in error message (GH#7706) Jiaming Yuan
Fix read_sql_table returning wrong result for single column loads (GH#7572) c-thiel
Add slack join link in support.rst (GH#7679) Naty Clementi
Remove unused alphabet variable (GH#7700) James Bourbeau
Fix meta creation in case of object (GH#7586) GALI PREM SAGAR
Add dispatch for union_categoricals (GH#7699) GALI PREM SAGAR
Consolidate array Dispatch objects (GH#7505) James Bourbeau
Move DataFrame dispatch.registers to their own file (GH#7503) Julia Signell
Fix delayed with dataclasses where init=False (GH#7656) Julia Signell
Allow a column to be named divisions (GH#7605) Julia Signell
Stack nd array with unknown chunks (GH#7562) Chris Roat
Promote the 2021 Dask User Survey (GH#7694) Genevieve Buckley
Fix typo in DataFrame.set_index() (GH#7691) James Lamb
Cleanup array API reference links (GH#7684) David Hoese
Accept axis tuple for flip to be consistent with NumPy (GH#7675) Andrew Champion
Bump pre-commit hook versions (GH#7676) James Bourbeau
Cleanup to_zarr docstring (GH#7683) David Hoese
Fix the docstring of read_orc (GH#7678) Justus Magin
Doc ipyparallel & mpi4py concurrent.futures (GH#7665) John A Kirkham
Update tests to support CuPy 9 (GH#7671) Peter Andreas Entschev
Fix some HighLevelGraph documentation inaccuracies (GH#7662) Mads R. B. Kristensen
Fix spelling in Series getitem error message (GH#7659) Maisie Marshall
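
The da.searchsorted entry above (GH#7696) can be sketched as follows. The example assumes a sorted one-dimensional Dask array and a NumPy-like side argument; the input values are made up.

```python
import numpy as np
import dask.array as da

# `a` must already be sorted, as with numpy.searchsorted.
a_np = np.array([1, 3, 5, 7, 9])
v_np = np.array([0, 4, 10])

a = da.from_array(a_np, chunks=2)
v = da.from_array(v_np, chunks=2)

# Insertion points that would keep `a` sorted.
positions = da.searchsorted(a, v, side="left").compute()
expected = np.searchsorted(a_np, v_np, side="left")
```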

2021.05.0¶

Released on May 14, 2021

Remove deprecated kind kwarg to comply with pandas 1.3.0 (GH#7653) Julia Signell
Fix bug in DataFrame column projection (GH#7645) Richard (Rick) Zamora
Merge global annotations when packing (GH#7565) Mads R. B. Kristensen
Avoid inplace= in pandas set_categories (GH#7633) James Bourbeau
Change the active-fusion default to False for Dask-Dataframe (GH#7620) Richard (Rick) Zamora
Array: remove extraneous code from RandomState (GH#7487) Gabe Joseph
Implement str.concat when others=None (GH#7623) Daniel Mesejo-León
Fix dask.dataframe in sandboxed environments (GH#7601) Noah D. Brenowitz
Support for cupyx.scipy.linalg (GH#7563) Benjamin Zaitlen
Move timeseries and daily-stock to Blockwise (GH#7615) Richard (Rick) Zamora
Fix bugs in broadcast join (GH#7617) Richard (Rick) Zamora
Use Blockwise for DataFrame IO (parquet, csv, and orc) (GH#7415) Richard (Rick) Zamora
Adding chunk & type information to Dask HighLevelGraphs (GH#7309) Genevieve Buckley
Add pyarrow sphinx intersphinx_mapping (GH#7612) Ray Bell
Remove skip on test freq (GH#7608) Julia Signell
Defaults in read_parquet parameters (GH#7567) Ray Bell
Remove ignore_abc_warning (GH#7606) Julia Signell
Harden DataFrame merge between column-selection and index (GH#7575) Richard (Rick) Zamora
Get rid of ignore_abc decorator (GH#7604) Julia Signell
Remove kwarg validation for bokeh (GH#7597) Julia Signell
Add loky example (GH#7590) Naty Clementi
Delayed: nout when arguments become tasks (GH#7593) Gabe Joseph
Update distributed version in mindep CI build (GH#7602) James Bourbeau
Support all or no overlap between partition columns and real columns (GH#7541) Richard (Rick) Zamora

2021.04.1¶

Released on April 23, 2021

Handle Blockwise HLG pack/unpack for concatenate=True (GH#7455) Richard (Rick) Zamora
map_partitions: use tokenized info as name of the SubgraphCallable (GH#7524) Mads R. B. Kristensen
Using tmp_path and tmpdir to avoid temporary files and directories hanging in the repo (GH#7592) Naty Clementi
Contributing to docs (development guide) (GH#7591) Naty Clementi
Add more packages to Python 3.9 CI build (GH#7588) James Bourbeau
Array: Fix NEP-18 dispatching in finalize (GH#7508) Gabe Joseph
Misc fixes for numpydoc (GH#7569) Matthias Bussonnier
Avoid pandas level= keyword deprecation (GH#7577) James Bourbeau
Map e.g. .repartition(freq="M") to .repartition(freq="MS") (GH#7504) Ruben van de Geer
Remove hash seeding in parallel CI runs (GH#7128) Elliott Sales de Andrade
Add defaults in parameters in to_parquet (GH#7564) Ray Bell
Simplify transpose axes cleanup (GH#7561) Julia Signell
Make ValueError in len(index_names) > 1 explicit it’s using fastparquet (GH#7556) Ray Bell
Fix dict-column appending for pyarrow parquet engines (GH#7527) Richard (Rick) Zamora
Add a documentation auto label (GH#7560) Doug Davis
Add dask.delayed.Delayed to docs so it can be referenced by other sphinx docs (GH#7559) Doug Davis
Fix upstream idxmaxmin for uneven split_every (GH#7538) Julia Signell
Make normalize_token for pandas Series/DataFrame future proof (no direct block access) (GH#7318) Joris Van den Bossche
Redesigned __setitem__ implementation (GH#7393) David Hassell
histogram, histogramdd improvements (docs; return consistencies) (GH#7520) Doug Davis
Force nightly pyarrow in the upstream build (GH#7530) Joris Van den Bossche
Fix Configuration Reference (GH#7533) Benjamin Zaitlen
Use .to_parquet on dask.dataframe in doc string (GH#7528) Ray Bell
Avoid double msgpack serialization of HLGs (GH#7525) Mads R. B. Kristensen
Encourage usage of yaml.safe_load() in configuration doc (GH#7529) Hristo Georgiev
Fix reshape bug. Add relevant test. Fixes #7171. (GH#7523) JSKenyon
Support custom_metadata= argument in to_parquet (GH#7359) Richard (Rick) Zamora
Clean some documentation warnings (GH#7518) Daniel Mesejo-León
Getting rid of more docs warnings (GH#7426) Julia Signell
Added product (alias of prod) (GH#7517) Freyam Mehta
Fix upstream __array_ufunc__ tests (GH#7494) Julia Signell
Escape from map_overlap to map_blocks if depth is zero (GH#7481) Genevieve Buckley
Add check_type to array assert_eq (GH#7491) Julia Signell

2021.04.0¶

Released on April 2, 2021

Adding support for multidimensional histograms with dask.array.histogramdd (GH#7387) Doug Davis
Update docs on number of threads and workers in default LocalCluster (GH#7497) cameron16
Add labels automatically when certain files are touched in a PR (GH#7506) Julia Signell
Extract ignore_order from kwargs (GH#7500) GALI PREM SAGAR
Only provide installation instructions when distributed is missing (GH#7498) Matthew Rocklin
Start adding isort (GH#7370) Julia Signell
Add ignore_order parameter in dd.concat (GH#7473) Daniel Mesejo-León
Use powers-of-two when displaying RAM (GH#7484) crusaderky
Added License Classifier (GH#7485) Tom Augspurger
Replace conda with mamba (GH#7227) crusaderky
Fix typo in array docs (GH#7478) James Lamb
Use concurrent.futures in local scheduler (GH#6322) John A Kirkham
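
As a rough illustration of the dask.array.histogramdd entry above (GH#7387), the sketch below bins random 3-D points. It assumes the sample is chunked only along its first axis and that an explicit range is given when bins are integers; the data are synthetic.

```python
import numpy as np
import dask.array as da

# Synthetic sample: 1000 points in the unit cube.
rng = np.random.default_rng(0)
data = rng.random((1000, 3))

# Chunk along the row axis only; each chunk keeps all three coordinates.
sample = da.from_array(data, chunks=(250, 3))

bins = (4, 5, 6)
ranges = ((0, 1), (0, 1), (0, 1))
hist, edges = da.histogramdd(sample, bins=bins, range=ranges)
counts = hist.compute()

# NumPy reference on the same data.
expected, _ = np.histogramdd(data, bins=bins, range=ranges)
```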

2021.03.1¶

Released on March 26, 2021

Add a dispatch for is_categorical_dtype to handle non-pandas objects (GH#7469) brandon-b-miller
Use multiprocessing.Pool in test_read_text (GH#7472) John A Kirkham
Add missing meta kwarg to gufunc class (GH#7423) Peter Andreas Entschev
Example for memory-mapped Dask array (GH#7380) Dieter Weber
Fix NumPy upstream failures xfail pandas and fastparquet failures (GH#7441) Julia Signell
Fix bug in repartition with freq (GH#7357) Ruben van de Geer
Fix __array_function__ dispatching for tril/triu (GH#7457) Peter Andreas Entschev
Use concurrent.futures.Executors in a few tests (GH#7429) John A Kirkham
Require NumPy >=1.16 (GH#7383) crusaderky
Minor sort_values housekeeping (GH#7462) Ryan Williams
Ensure natural sort order in parquet part paths (GH#7249) Ryan Williams
Remove global env mutation upon running test_config.py (GH#7464) Hristo Georgiev
Update NumPy intersphinx URL (GH#7460) Gabe Joseph
Add rot90 (GH#7440) Trevor Manz
Update docs for required package for endpoint (GH#7454) Nick Vazquez
Master -> main in slice_array docstring (GH#7453) Gabe Joseph
Expand dask.utils.is_arraylike docstring (GH#7445) Doug Davis
Simplify BlockwiseIODeps importing (GH#7420) Richard (Rick) Zamora
Update layer annotation packing method (GH#7430) James Bourbeau
Drop duplicate test in test_describe_empty (GH#7431) John A Kirkham
Add Series.dot method to dataframe module (GH#7236) Madhu94
Added df kurtosis-method and testing (GH#7273) Jan Borchmann
Avoid quadratic-time performance for HLG culling (GH#7403) Bruce Merry
Temporarily skip problematic sparse test (GH#7421) James Bourbeau
Update some CI workflow names (GH#7422) James Bourbeau
Fix HDFS test (GH#7418) Julia Signell
Make changelog subtitles match the hierarchy (GH#7419) Julia Signell
Add support for normalize in value_counts (GH#7342) Julia Signell
Avoid unnecessary imports for HLG Layer unpacking and materialization (GH#7381) Richard (Rick) Zamora
Bincount fix slicing (GH#7391) Genevieve Buckley
Add sliding_window_view (GH#7234) Deepak Cherian
Fix typo in docs/source/develop.rst (GH#7414) Hristo Georgiev
Switch documentation builds for PRs to readthedocs (GH#7397) James Bourbeau
Adds sort_values to dask.DataFrame (GH#7286) gerrymanoim
Pin sqlalchemy<1.4.0 in CI (GH#7405) James Bourbeau
Comment fixes (GH#7215) Ryan Williams
Dead code removal / fixes (GH#7388) Ryan Williams
Use single thread for pa.Table.from_pandas calls (GH#7347) Richard (Rick) Zamora
Replace 'container' with 'image' (GH#7389) James Lamb
DOC hyperlink repartition (GH#7394) Ray Bell
Pass delimiter to fsspec in bag.read_text (GH#7349) Martin Durant
Update read_hdf default mode to "r" (GH#7039) rs9w33
Embed literals in SubgraphCallable when packing Blockwise (GH#7353) Mads R. B. Kristensen
Update test_hdf.py to not reuse file handlers (GH#7044) rs9w33
Require additional dependencies: cloudpickle, partd, fsspec, toolz (GH#7345) Julia Signell
Prepare Blockwise + IO infrastructure (GH#7281) Richard (Rick) Zamora
Remove duplicated imports from test_slicing.py (GH#7365) Hristo Georgiev
Add test deps for pip development (GH#7360) Julia Signell
Support int slicing for non-NumPy arrays (GH#7364) Peter Andreas Entschev
Automatically cancel previous CI builds (GH#7348) James Bourbeau
dask.array.asarray should handle case where xarray class is in top-level namespace (GH#7335) Tom White
HighLevelGraph length without materializing layers (GH#7274) Gabe Joseph
Drop support for Python 3.6 (GH#7006) James Bourbeau
Fix fsspec usage in create_metadata_file (GH#7295) Richard (Rick) Zamora
Change default branch from master to main (GH#7198) Julia Signell
Add Xarray to CI software environment (GH#7338) James Bourbeau
Update repartition argument name in error text (GH#7336) Eoin Shanaghy
Run upstream tests based on commit message (GH#7329) James Bourbeau
Use pytest.register_assert_rewrite on util modules (GH#7278) Bruce Merry
Add example on using specific chunk sizes in from_array() (GH#7330) James Lamb
Move NumPy skip into test (GH#7247) Julia Signell
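
The rot90 entry above (GH#7440) can be illustrated with a short sketch. It assumes the function follows the numpy.rot90 signature (k, axes); the input array is arbitrary.

```python
import numpy as np
import dask.array as da

m_np = np.arange(6).reshape(2, 3)
m = da.from_array(m_np, chunks=(1, 3))

# Rotate by 90 degrees once in the plane spanned by axes 0 and 1.
rotated = da.rot90(m, k=1, axes=(0, 1)).compute()
expected = np.rot90(m_np, k=1, axes=(0, 1))
```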

2021.03.0¶

Released on March 5, 2021

Note

This is the first release with support for Python 3.9 and the last release with support for Python 3.6.

Bump minimum version of distributed (GH#7328) James Bourbeau
Fix percentiles_summary with dask_cudf (GH#7325) Peter Andreas Entschev
Temporarily revert recent Array.__setitem__ updates (GH#7326) James Bourbeau
Blockwise.clone (GH#7312) crusaderky
NEP-35 duck array update (GH#7321) James Bourbeau
Don’t allow setting .name for array (GH#7222) Julia Signell
Use nearest interpolation for creating percentiles of integer input (GH#7305) Kyle Barron
Test exp with CuPy arrays (GH#7322) John A Kirkham
Check that computed chunks have right size and dtype (GH#7277) Bruce Merry
pytest.mark.flaky (GH#7319) crusaderky
Contributing docs: add note to pull the latest git tags before pip installing Dask (GH#7308) Genevieve Buckley
Support for Python 3.9 (GH#7289) crusaderky
Add broadcast-based merge implementation (GH#7143) Richard (Rick) Zamora
Add split_every to graph_manipulation (GH#7282) crusaderky
Typo in optimize docs (GH#7306) Julius Busecke
dask.graph_manipulation support for xarray.Dataset (GH#7276) crusaderky
Add plot width and height support for Bokeh 2.3.0 (GH#7297) James Bourbeau
Add NumPy functions tri, triu_indices, triu_indices_from, tril_indices, tril_indices_from (GH#6997) Illviljan
Remove “cleanup” task in DataFrame on-disk shuffle (GH#7260) Sinclair Target
Use development version of distributed in CI (GH#7279) James Bourbeau
Moving high level graph pack/unpack Dask (GH#7179) Mads R. B. Kristensen
Improve performance of merge_percentiles (GH#7172) Ashwin Srinath
DOC: add dask-sql and fugue (GH#7129) Ray Bell
Example for working with categoricals and parquet (GH#7085) McToel
Adds tree reduction to bincount (GH#7183) Thomas J. Fan
Improve documentation of name in from_array (GH#7264) Bruce Merry
Fix cumsum for empty partitions (GH#7230) Julia Signell
Add map_blocks example to dask array creation docs (GH#7221) Julia Signell
Fix performance issue in dask.graph_manipulation.wait_on() (GH#7258) crusaderky
Replace coveralls with codecov.io (GH#7246) crusaderky
Pin to a particular black rev in pre-commit (GH#7256) Julia Signell
Minor typo in documentation: array-chunks.rst (GH#7254) Magnus Nord
Fix bugs in Blockwise and ShuffleLayer (GH#7213) Richard (Rick) Zamora
Fix parquet filtering bug for "pyarrow-dataset" with pyarrow-3.0.0 (GH#7200) Richard (Rick) Zamora
graph_manipulation without NumPy (GH#7243) crusaderky
Support for NEP-35 (GH#6738) Peter Andreas Entschev
Avoid running unit tests during doctest CI build (GH#7240) James Bourbeau
Run doctests on CI (GH#7238) Julia Signell
Cleanup code quality on set arithmetics (GH#7196) crusaderky
Add dask.array.delete (GH#7125) Julia Signell
Unpin graphviz now that new conda-forge recipe is built (GH#7235) Julia Signell
Don’t use NumPy 1.20 from conda-forge on Mac (GH#7211) crusaderky
map_overlap: Don’t rechunk axes without overlap (GH#7233) Deepak Cherian
Pin graphviz to avoid issue with latest conda-forge build (GH#7232) Julia Signell
Use html_css_files in docs for custom CSS (GH#7220) James Bourbeau
Graph manipulation: clone, bind, checkpoint, wait_on (GH#7109) crusaderky
Fix handling of filter expressions in parquet pyarrow-dataset engine (GH#7186) Joris Van den Bossche
Extend __setitem__ to more closely match numpy (GH#7033) David Hassell
Clean up Python 2 syntax (GH#7195) crusaderky
Fix regression in Delayed._length (GH#7194) crusaderky
__dask_layers__() tests and tweaks (GH#7177) crusaderky
Properly convert HighLevelGraph in multiprocessing scheduler (GH#7191) Jim Crist-Harif
Don’t fail fast in CI (GH#7188) James Bourbeau

2021.02.0¶

Released on February 5, 2021

Add percentile support for NEP-35 (GH#7162) Peter Andreas Entschev
Added support for Float64 in column assignment (GH#7173) Nils Braun
Coarsen rechunking error (GH#7127) Davis Bennett
Fix upstream CI tests (GH#6896) Julia Signell
Revise HighLevelGraph Mapping API (GH#7160) crusaderky
Update low-level graph spec to use any hashable for keys (GH#7163) James Bourbeau
Generically rebuild a collection with different keys (GH#7142) crusaderky
Make easier to link issues in PRs (GH#7130) Ray Bell
Add dask.array.append (GH#7146) D-Stacks
Allow dask.array.ravel to accept array_like argument (GH#7138) D-Stacks
Fixes link in array design doc (GH#7152) Thomas J. Fan
Fix example of using blockwise for an outer product (GH#7119) Bruce Merry
Deprecate HighLevelGraph.dicts in favor of .layers (GH#7145) Amit Kumar
Align FastParquetEngine with pyarrow engines (GH#7091) Richard (Rick) Zamora
Merge annotations (GH#7102) Ian Rose
Simplify contents of parts list in read_parquet (GH#7066) Richard (Rick) Zamora
check_meta(): use __class__ when checking DataFrame types (GH#7099) Mads R. B. Kristensen
Cache several properties (GH#7104) Illviljan
Fix parquet getitem optimization (GH#7106) Richard (Rick) Zamora
Add cytoolz back to CI environment (GH#7103) James Bourbeau
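
Two of the array entries above, dask.array.append (GH#7146) and array_like input for dask.array.ravel (GH#7138), can be sketched together. The example assumes both follow their NumPy counterparts; the values are arbitrary.

```python
import numpy as np
import dask.array as da

a_np = np.array([[1, 2], [3, 4]])
extra_np = np.array([[5, 6]])

a = da.from_array(a_np, chunks=1)
extra = da.from_array(extra_np, chunks=1)

# Append one row along axis 0.
appended = da.append(a, extra, axis=0).compute()

# ravel accepts a plain nested list, not only a Dask array.
flattened = da.ravel([[1, 2], [3, 4]]).compute()
```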

2021.01.1¶

Released on January 22, 2021

Partially fix cumprod (GH#7089) Julia Signell
Test pandas 1.1.x / 1.2.0 releases and pandas nightly (GH#6996) Joris Van den Bossche
Use assign to avoid SettingWithCopyWarning (GH#7092) Julia Signell
'mode' argument passed to bokeh.output_file() (GH#7034) (GH#7075) patquem
Skip empty partitions when doing groupby.value_counts (GH#7073) Julia Signell
Add error messages to assert_eq() (GH#7083) James Lamb
Make cached properties read-only (GH#7077) Illviljan

2021.01.0¶

Released on January 15, 2021

map_partitions with review comments (GH#6776) Kumar Bharath Prabhu
Make sure that population is a real list (GH#7027) Julia Signell
Propagate storage_options in read_csv (GH#7074) Richard (Rick) Zamora
Remove all BlockwiseIO code (GH#7067) Richard (Rick) Zamora
Fix CI (GH#7069) James Bourbeau
Add option to control rechunking in reshape (GH#6753) Tom Augspurger
Fix linalg.lstsq for complex inputs (GH#7056) Johnnie Gray
Add compression='infer' default to read_csv (GH#6960) Richard (Rick) Zamora
Revert parameter changes in svd_compressed #7003 (GH#7004) Eric Czech
Skip failing s3 test (GH#7064) Martin Durant
Revert BlockwiseIO (GH#7048) Richard (Rick) Zamora
Add some cross-references to DataFrame.to_bag() and Series.to_bag() (GH#7049) Rob Malouf
Rewrite matmul as blockwise without contraction/concatenate (GH#7000) Rafal Wojdyla
Use functools.cached_property in da.shape (GH#7023) Illviljan
Use meta value in series non_empty (GH#6976) Julia Signell
Revert “Temporarily pin sphinx version to 3.3.1 (GH#7002)” (GH#7014) Rafal Wojdyla
Revert python-graphviz pinning (GH#7037) Julia Signell
Accidentally committed print statement (GH#7038) Julia Signell
Pass dropna and observed in agg (GH#6992) Julia Signell
Add index to meta after .str.split with expand (GH#7026) Ruben van de Geer
CI: test pyarrow 2.0 and nightly (GH#7030) Joris Van den Bossche
Temporarily pin python-graphviz in CI (GH#7031) James Bourbeau
Underline section in numpydoc (GH#7013) Matthias Bussonnier
Keep normal optimizations when adding custom optimizations (GH#7016) Matthew Rocklin
Temporarily pin sphinx version to 3.3.1 (GH#7002) Rafal Wojdyla
DOC: Misc formatting (GH#6998) Matthias Bussonnier
Add inline_array option to from_array (GH#6773) Tom Augspurger
Revert “Initial pass at blockwise array creation routines (GH#6931)” (GH#6995) James Bourbeau
Set npartitions in set_index (GH#6978) Julia Signell
Upstream config serialization and inheritance (GH#6987) Jacob Tomlinson
Bump the minimum time in test_minimum_time (GH#6988) Martin Durant
Fix pandas dtype inference for read_parquet (GH#6985) Richard (Rick) Zamora
Avoid data loss in set_index with sorted=True (GH#6980) Richard (Rick) Zamora
Bugfix in read_parquet for handling un-named indices with index=False (GH#6969) Richard (Rick) Zamora
Use __class__ when comparing meta data (GH#6981) Mads R. B. Kristensen
Comparing string versions won’t always work (GH#6979) Rafal Wojdyla
Fix GH#6925 (GH#6982) sdementen
Initial pass at blockwise array creation routines (GH#6931) Ian Rose
Simplify has_parallel_type() (GH#6927) Mads R. B. Kristensen
Handle annotation unpacking in BlockwiseIO (GH#6934) Simon Perkins
Avoid deprecated yield_fixture in test_sql.py (GH#6968) Richard (Rick) Zamora
Remove bad graph logic in BlockwiseIO (GH#6933) Richard (Rick) Zamora
Get config item if variable is None (GH#6862) Jacob Tomlinson
Update from_pandas docstring (GH#6957) Richard (Rick) Zamora
Prevent fuse_roots from clobbering annotations (GH#6955) Simon Perkins
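
For the inline_array option listed above (GH#6773), a minimal sketch follows. As the Dask documentation describes it, the option controls whether the source array is embedded directly in each task instead of being stored under its own graph key; the computed values are the same either way. The data here are arbitrary.

```python
import numpy as np
import dask.array as da

data = np.arange(12).reshape(3, 4)

# Same data, wrapped with and without inlining the source array.
inlined = da.from_array(data, chunks=(3, 2), inline_array=True)
referenced = da.from_array(data, chunks=(3, 2), inline_array=False)

inlined_total = inlined.sum().compute()
referenced_total = referenced.sum().compute()
```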

2020.12.0¶

Released on December 10, 2020

Highlights¶

Switched to CalVer for versioning scheme.
Introduced new APIs for HighLevelGraph to enable sending high-level representations of task graphs to the distributed scheduler.
Introduced new HighLevelGraph layer objects including BasicLayer, Blockwise, BlockwiseIO, ShuffleLayer, and more.
Added support for applying custom Layer-level annotations like priority, retries, etc. with the dask.annotate context manager.
Updated minimum supported version of pandas to 0.25.0 and NumPy to 1.15.1.
Added support for the pyarrow.dataset API in read_parquet.
Several fixes to Dask Array’s SVD.
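
The Layer-level annotations mentioned in the highlights can be sketched as follows. The example assumes the context manager is exposed as dask.annotate, as in current Dask releases, and that each HighLevelGraph layer carries an annotations attribute; the priority and retries values are arbitrary.

```python
import dask
import dask.array as da

# Layers created inside the context manager pick up the annotations.
with dask.annotate(priority=10, retries=3):
    x = da.ones((4, 4), chunks=2) + 1

# Collect the annotations attached to each layer of the graph.
layers = x.__dask_graph__().layers
annotations = {name: layer.annotations for name, layer in layers.items()}
```

The annotations only take effect when the graph is sent to a scheduler that understands them, such as the distributed scheduler.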

All changes¶

Make observed kwarg optional (GH#6952) Julia Signell
Min supported pandas 0.25.0 numpy 1.15.1 (GH#6895) Julia Signell
Make order of categoricals unambiguous (GH#6949) Julia Signell
Improve “pyarrow-dataset” statistics performance for read_parquet (GH#6918) Richard (Rick) Zamora
Add observed keyword to groupby (GH#6854) Julia Signell
Make sure include_path_column works when there are multiple partitions per file (GH#6911) Julia Signell
Fix: array.overlap and array.map_overlap block sizes are incorrect when depth is an unsigned bit type (GH#6909) GFleishman
Fix syntax error in HLG docs example (GH#6946) Mark
Return a Bag from sample (GH#6941) Shang Wang
Add ravel_multi_index (GH#6939) Illviljan
Enable parquet metadata collection in parallel (GH#6921) Richard (Rick) Zamora
Avoid using _file in progressbar if it is None (GH#6938) Mark Harfouche
Add Zarr to upstream CI build (GH#6932) James Bourbeau
Introduce BlockwiseIO layer (GH#6878) Richard (Rick) Zamora
Transmit Layer Annotations to Scheduler (GH#6889) Simon Perkins
Update opportunistic caching page to remove experimental warning (GH#6926) Timost
Allow pyarrow >2.0.0 (GH#6772) Richard (Rick) Zamora
Support pyarrow.dataset API for read_parquet (GH#6534) Richard (Rick) Zamora
Add more informative error message to da.coarsen when coarsening factors do not divide shape (GH#6908) Davis Bennett
Only run the cron CI on dask/dask not forks (GH#6905) Jacob Tomlinson
Add annotations to ShuffleLayers (GH#6913) Matthew Rocklin
Temporarily xfail test_from_s3 (GH#6915) James Bourbeau
Added dataframe skew method (GH#6881) Jan Borchmann
Fix dtype in array meta (GH#6893) Julia Signell
Missing name arg in helm install ... (GH#6903) Ruben van de Geer
Fix: exception when reading an item with filters (GH#6901) Martin Durant
Add support for cupyx sparse to dask.array.dot (GH#6846) Akira Naruse
Pin array mindeps up a bit to get the tests to pass [test-mindeps] (GH#6894) Julia Signell
Update/remove pandas and numpy in mindeps (GH#6888) Julia Signell
Fix ArrowEngine bug in use of clear_known_categories (GH#6887) Richard (Rick) Zamora
Fix documentation about task scheduler (GH#6879) Zhengnan Zhao
Add human relative time formatting utility (GH#6883) Jacob Tomlinson
Possible fix for 6864 set_index issue (GH#6866) Richard (Rick) Zamora
BasicLayer: remove dependency arguments (GH#6859) Mads R. B. Kristensen
Serialization of Blockwise (GH#6848) Mads R. B. Kristensen
Address columns=[] bug (GH#6871) Richard (Rick) Zamora
Avoid duplicate parquet schema communication (GH#6841) Richard (Rick) Zamora
Add create_metadata_file utility for existing parquet datasets (GH#6851) Richard (Rick) Zamora
Improve ordering for workloads with a common terminus (GH#6779) Tom Augspurger
Stringify utilities (GH#6852) Mads R. B. Kristensen
Add keyword overwrite=True to to_parquet to remove dangling files when overwriting a pyarrow Dataset. (GH#6825) Greg Hayes
Removed map_tasks() and map_basic_layers() (GH#6853) Mads R. B. Kristensen
Introduce QR iteration to svd_compressed (GH#6813) RogerMoens
__dask_distributed_pack__() now takes a client argument (GH#6850) Mads R. B. Kristensen
Use map_partitions instead of delayed in set_index (GH#6837) Mads R. B. Kristensen
Add doc hint for as_completed().update(futures) (GH#6817) manuels
Bump GHA setup-miniconda version (GH#6847) Jacob Tomlinson
Remove nans when setting sorted index (GH#6829) Rockwell Weiner
Fix transpose of u in SVD (GH#6799) RogerMoens
Migrate to GitHub Actions (GH#6794) Jacob Tomlinson
Fix sphinx currentmodule usage (GH#6839) James Bourbeau
Fix minimum dependencies CI builds (GH#6838) James Bourbeau
Avoid graph materialization during Blockwise culling (GH#6815) Richard (Rick) Zamora
Fixed typo (GH#6834) Devanshu Desai
Use HighLevelGraph.merge in collections_to_dsk (GH#6836) Mads R. B. Kristensen
Respect dtype in svd compression_matrix #2849 (GH#6802) RogerMoens
Add blocksize to task name (GH#6818) Julia Signell
Check for all-NaN partitions (GH#6821) Rockwell Weiner
Change “institutional” SQL doc section to point to main SQL doc (GH#6823) Martin Durant
Fix: DataFrame.join doesn’t accept Series as other (GH#6809) David Katz
Remove to_delayed operations from to_parquet (GH#6801) Richard (Rick) Zamora
Layer annotation docstrings improvements (GH#6806) Simon Perkins
Avro reader (GH#6780) Martin Durant
Rechunk array if smallest chunk size is smaller than depth (GH#6708) Julia Signell
Add Layer Annotations (GH#6767) Simon Perkins
Add “view code” links to documentation (GH#6793) manuels
Add optional IO-subgraph to Blockwise Layers (GH#6715) Richard (Rick) Zamora
Add high level graph pack/unpack for distributed (GH#6786) Mads R. B. Kristensen
Add missing methods of the Dataframe API (GH#6789) Stephannie Jimenez Gacha
Add doc on managing environments (GH#6778) Martin Durant
HLG: get_all_external_keys() (GH#6774) Mads R. B. Kristensen
Avoid rechunking in reshape with chunksize=1 (GH#6748) Tom Augspurger
Try to make categoricals work on join (GH#6205) Julia Signell
Fix some minor typos and trailing whitespaces in array-slice.rst (GH#6771) Magnus Nord
Bugfix for parquet metadata writes of empty dataframe partitions (pyarrow) (GH#6741) Callum Noble
Document meta kwarg in map_blocks and map_overlap. (GH#6763) Peter Andreas Entschev
Begin experimenting with parallel prefix scan for cumsum and cumprod (GH#6675) Erik Welch
Clarify differences in boolean indexing between dask and numpy arrays (GH#6764) Illviljan
Efficient serialization of shuffle layers (GH#6760) James Bourbeau
Config array optimize to skip fusion and return a HLG (GH#6751) Mads R. B. Kristensen
Temporarily use pyarrow<2 in CI (GH#6759) James Bourbeau
Fix meta for min/max reductions (GH#6736) Peter Andreas Entschev
Add 2D possibility to da.linalg.lstsq - mirroring numpy (GH#6749) Pascal Bourgault
CI: Fixed bug causing flaky test failure in pivot (GH#6752) Tom Augspurger
Serialization of layers (GH#6693) Mads R. B. Kristensen
Add attrs property to Series/Dataframe (GH#6742) Illviljan
Removed Mutable Default Argument (GH#6747) Mads R. B. Kristensen
Adjust parquet ArrowEngine to allow more easy subclass for writing (GH#6505) Joris Van den Bossche
Add ShuffleStage HLG Layer (GH#6650) Richard (Rick) Zamora
Handle literal in meta_from_array (GH#6731) Peter Andreas Entschev
Do balanced rechunking even if chunks are the same (GH#6735) Chris Roat
Fix docstring DataFrame.set_index (GH#6739) Gil Forsyth
Ensure HighLevelGraph layers always contain Layer instances (GH#6716) James Bourbeau
Map on HighLevelGraph Layers (GH#6689) Mads R. B. Kristensen
Update overlap *_like function calls and CuPy tests (GH#6728) Peter Andreas Entschev
Fixes for svd with __array_function__ (GH#6727) Peter Andreas Entschev
Added doctest extension for documentation (GH#6397) Jim Circadian
Minor fix to #5628 using @pentschev’s suggestion (GH#6724) John A Kirkham
Change type of Dask array when meta type changes (GH#5628) Matthew Rocklin
Add az (GH#6719) Ray Bell
HLG: get_dependencies() of single keys (GH#6699) Mads R. B. Kristensen
Revert “Revert “Use HighLevelGraph layers everywhere in collections (GH#6510)” (GH#6697)” (GH#6707) Tom Augspurger
Allow *_like array creation functions to respect input array type (GH#6680) Genevieve Buckley
Update dask-sphinx-theme version (GH#6700) Gil Forsyth

2.30.0 / 2020-10-06¶

Array¶

Allow rechunk to evenly split into N chunks (GH#6420) Scott Sievert
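
A minimal sketch of the rechunk change above. It assumes the behavior is exposed through the balance=True keyword of rechunk, which is how current Dask releases spell it; the array length and the target of six chunks are arbitrary.

```python
import dask.array as da

x = da.arange(1000, chunks=1000)
n = 6  # desired number of chunks (illustrative)

# Without balancing, integer division can leave a small trailing chunk.
plain = x.rechunk(chunks=len(x) // n)

# With balancing, Dask tries to make the chunks roughly equal in size.
balanced = x.rechunk(chunks=len(x) // n, balance=True)

plain_sizes = plain.chunks[0]
balanced_sizes = balanced.chunks[0]
```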

2.29.0 / 2020-10-02¶

Array¶

_repr_html_: color sides darker instead of drawing all the lines (GH#6683) Julia Signell
Removes warning from nanstd and nanvar (GH#6667) Thomas J. Fan
Get shape of output from original array - map_overlap (GH#6682) Julia Signell
Replace np.searchsorted with bisect in indexing (GH#6669) Joachim B Haga

Bag¶

Make sure subprocesses have a consistent hash for bag groupby (GH#6660) Itamar Turner-Trauring

Core¶

Revert “Use HighLevelGraph layers everywhere in collections (GH#6510)” (GH#6697) Tom Augspurger
Use pandas.testing (GH#6687) John A Kirkham
Improve 128-bit floating-point skip in tests (GH#6676) Elliott Sales de Andrade

DataFrame¶

Allow setting dataframe items using a bool dataframe (GH#6608) Julia Signell

Documentation¶

Fix typo (GH#6692) garanews
Fix a few typos (GH#6678) Pav A

2.28.0 / 2020-09-25¶

Array¶

Partially reverted the changes to Array indexing operations that produce large chunks. This restores the behavior from Dask 2.25.0 and earlier, with a warning when large chunks are produced. A configuration option is provided to avoid creating the large chunks, see Efficiency. (GH#6665) Tom Augspurger
Add meta to to_dask_array (GH#6651) Kyle Nicholson
Fix GH#6631 and GH#6611 (GH#6632) Rafal Wojdyla
Infer object in array reductions (GH#6629) Daniel Saxton
Adding v_based flag for svd_flip (GH#6658) Eric Czech
Fix flakey array mean (GH#6656) Sam Grayson
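
The configuration option mentioned in the indexing entry above can be sketched as follows. The key name array.slicing.split_large_chunks is taken from the Dask configuration reference rather than from this changelog, so treat it as an assumption; the array and index are arbitrary and far too small to trigger the warning.

```python
import numpy as np
import dask
import dask.array as da

x = da.ones((100, 50), chunks=(10, 50))
index = np.arange(100)[::-1].copy()  # unsorted fancy index

# True asks Dask to avoid creating large chunks when indexing;
# False would instead accept them and silence the warning.
with dask.config.set({"array.slicing.split_large_chunks": True}):
    y = x[index]

total = y.sum().compute()
```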

Core¶

Removed dsk equality check from SubgraphCallable.__eq__ (GH#6666) Mads R. B. Kristensen
Use HighLevelGraph layers everywhere in collections (GH#6510) Mads R. B. Kristensen
Adds hash dunder method to SubgraphCallable for caching purposes (GH#6424) Andrew Fulton
Stop writing commented out config files by default (GH#6647) Matthew Rocklin

DataFrame¶

Add support for collect list aggregation via agg API (GH#6655) Madhur Tandon
Slightly better error message (GH#6657) Julia Signell

2.27.0 / 2020-09-18¶

Array¶

Preserve dtype in svd (GH#6643) Eric Czech

Core¶

store(): create a single HLG layer (GH#6601) Mads R. B. Kristensen
Add pre-commit CI build (GH#6645) James Bourbeau
Update .pre-commit-config to latest black. (GH#6641) Julia Signell
Update super usage to remove Python 2 compatibility (GH#6630) Poruri Sai Rahul
Remove u string prefixes (GH#6633) Poruri Sai Rahul

DataFrame¶

Improve error message for to_sql (GH#6638) Julia Signell
Use empty list as categories (GH#6626) Julia Signell

Documentation¶

Add autofunction to array api docs for more ufuncs (GH#6644) James Bourbeau
Add a number of missing ufuncs to dask.array docs (GH#6642) Ralf Gommers
Add HelmCluster docs (GH#6290) Jacob Tomlinson

2.26.0 / 2020-09-11¶

Array¶

Backend-aware dtype inference for single-chunk svd (GH#6623) Eric Czech
Make array.reduction docstring match for dtype (GH#6624) Martin Durant
Set lower bound on compression level for svd_compressed using rows and cols (GH#6622) Eric Czech
Improve SVD consistency and small array handling (GH#6616) Eric Czech
Add svd_flip #6599 (GH#6613) Eric Czech
Handle sequences containing dask Arrays (GH#6595) Gabe Joseph
Avoid large chunks from getitem with lists (GH#6514) Tom Augspurger
Eagerly slice numpy arrays in from_array (GH#6605) Deepak Cherian
Restore ability to pickle dask arrays (GH#6594) Noah D. Brenowitz
Add SVD support for short-and-fat arrays (GH#6591) Eric Czech
Add simple chunk type registry and defer as appropriate to upcast types (GH#6393) Jon Thielen
Align coarsen chunks by default (GH#6580) Deepak Cherian
Fixup reshape on unknown dimensions and other testing fixes (GH#6578) Ryan Williams

Core¶

Add validation and fixes for HighLevelGraph dependencies (GH#6588) Mads R. B. Kristensen
Fix linting issue (GH#6598) Tom Augspurger
Skip bokeh version 2.0.0 (GH#6572) John A Kirkham

DataFrame¶

Added bytes/row calculation when using meta (GH#6585) McToel
Handle min_count in Series.sum / prod (GH#6618) Daniel Saxton
Update DataFrame.set_index docstring (GH#6549) Timost
Always compute 0 and 1 quantiles during quantile calculations (GH#6564) Erik Welch
Fix wrong path when reading empty csv file (GH#6573) Abdulelah Bin Mahfoodh

Documentation¶

Doc: Troubleshooting dashboard 404 (GH#6215) Kilian Lieret
Fixup extraConfig example (GH#6625) Tom Augspurger
Update supported Python versions (GH#6609) Julia Signell
Document dask/daskhub helm chart (GH#6560) Tom Augspurger

2.25.0 / 2020-08-28¶

Core¶

Compare key hashes in subs() (GH#6559) Mads R. B. Kristensen
Rerun with latest black release (GH#6568) James Bourbeau
License update (GH#6554) Tom Augspurger

DataFrame¶

Add gs read_parquet example (GH#6548) Ray Bell

Documentation¶

Remove version from documentation page names (GH#6558) James Bourbeau
Update kubernetes-helm.rst (GH#6523) David Sheldon
Stop 2020 survey (GH#6547) Tom Augspurger

2.24.0 / 2020-08-22¶

Array¶

Fix setting random seed in tests. (GH#6518) Elliott Sales de Andrade
Support meta in apply gufunc (GH#6521) joshreback
Replace cupy.sparse with cupyx.scipy.sparse (GH#6530) John A Kirkham

DataFrame¶

Bump up tolerance for rolling tests (GH#6502) Julia Signell
Implement DataFrame.__len__ (GH#6515) Tom Augspurger
Infer arrow schema in to_parquet (for ArrowEngine) (GH#6490) Richard (Rick) Zamora
Fix parquet test when no pyarrow (GH#6524) Martin Durant
Remove problematic filter arguments in ArrowEngine (GH#6527) Richard (Rick) Zamora
Avoid schema validation by default in ArrowEngine (GH#6536) Richard (Rick) Zamora

Core¶

Use unpack_collections in make_blockwise_graph (GH#6517) Thomas J. Fan
Move key_split() from optimization.py to utils.py (GH#6529) Mads R. B. Kristensen
Make tests run on moto server (GH#6528) Martin Durant

2.23.0 / 2020-08-14¶

Array¶

Reduce np.zeros, ones, and full array size with broadcasting (GH#6491) Matthias Bussonnier
Add missing meta= for trim in map_overlap (GH#6494) Peter Andreas Entschev

Bag¶

Bag repartition partition size (GH#6371) joshreback

Core¶

Scalar.__dask_layers__() to return self._name instead of self.key (GH#6507) Mads R. B. Kristensen
Update dependencies correctly in fuse_root optimization (GH#6508) Mads R. B. Kristensen

DataFrame¶

Adds items to dataframe (GH#6503) Thomas J. Fan
Include compression in write_table call (GH#6499) Julia Signell
Fixed warning in nonempty_series (GH#6485) Tom Augspurger
Intelligently determine partitions based on type of first arg (GH#6479) Matthew Rocklin
Fix pyarrow mkdirs (GH#6475) Julia Signell
Fix duplicate parquet output in to_parquet (GH#6451) michaelnarodovitch

Documentation¶

Fix documentation da.histogram (GH#6439) Roberto Panai
Add agg nunique example (GH#6404) Ray Bell
Fixed a few typos in the SQL docs (GH#6489) Mike McCarty
Docs for SQLing (GH#6453) Martin Durant

2.22.0 / 2020-07-31¶

Array¶

Compatibility for NumPy dtype deprecation (GH#6430) Tom Augspurger

Core¶

Implement sizeof for some bytes-like objects (GH#6457) John A Kirkham
HTTP error for new fsspec (GH#6446) Martin Durant
When RecursionError is raised, return uuid from tokenize function (GH#6437) Julia Signell
Install deps of upstream-dev packages (GH#6431) Tom Augspurger
Use updated link in setup.cfg (GH#6426) Zhengnan Zhao

DataFrame¶

Add single quotes around column names if strings (GH#6471) Gil Forsyth
Refactor ArrowEngine for better read_parquet performance (GH#6346) Richard (Rick) Zamora
Add tolist dispatch (GH#6444) GALI PREM SAGAR
Compatibility with pandas 1.1.0rc0 (GH#6429) Tom Augspurger
Multi value pivot table (GH#6428) joshreback
Duplicate argument definitions in to_csv docstring (GH#6411) Jun Han (Johnson) Ooi

Documentation¶

Add utility to docs to convert YAML config to env vars and back (GH#6472) Jacob Tomlinson
Fix parameter server rendering (GH#6466) Scott Sievert
Fixes broken links (GH#6403) Jim Circadian
Complete parameter server implementation in docs (GH#6449) Scott Sievert
Fix typo (GH#6436) Jack Xiaosong Xu

2.21.0 / 2020-07-17¶

Array¶

Correct error message in array.routines.gradient() (GH#6417) johnomotani
Fix blockwise concatenate for array with some dimension=1 (GH#6342) Matthias Bussonnier

Bag¶

Fix bag.take example (GH#6418) Roberto Panai

Core¶

Groups values in optimization pass should only be graph and keys – not an optimization + keys (GH#6409) Benjamin Zaitlen
Call custom optimizations once, with kwargs provided (GH#6382) Clark Zinzow
Include pickle5 for testing on Python 3.7 (GH#6379) John A Kirkham

DataFrame¶

Correct typo in error message (GH#6422) Tom McTiernan
Use pytest.warns to check for UserWarning (GH#6378) Richard (Rick) Zamora
Parse bytes_per_chunk keyword from string (GH#6370) Matthew Rocklin

Documentation¶

Numpydoc formatting (GH#6421) Matthias Bussonnier
Unpin numpydoc following 1.1 release (GH#6407) Gil Forsyth
Numpydoc formatting (GH#6402) Matthias Bussonnier
Add instructions for using conda when installing code for development (GH#6399) Ray Bell
Update visualize docstrings (GH#6383) Zhengnan Zhao

2.20.0 / 2020-07-02¶

Array¶

Register sizeof for numpy zero-strided arrays (GH#6343) Matthias Bussonnier
Use concatenate_lookup in concatenate (GH#6339) John A Kirkham
Fix rechunking of arrays with some zero-length dimensions (GH#6335) Matthias Bussonnier

DataFrame¶

Dispatch iloc calls to getitem (GH#6355) Gil Forsyth
Handle unnamed pandas RangeIndex in fastparquet engine (GH#6350) Richard (Rick) Zamora
Preserve index when writing partitioned parquet datasets with pyarrow (GH#6282) Richard (Rick) Zamora
Use ignore_index for pandas’ group_split_dispatch (GH#6251) Richard (Rick) Zamora

Documentation¶

Add doc describing argument (GH#6318) asmith26

2.19.0 / 2020-06-19¶

Array¶

Cast chunk sizes to python int dtype (GH#6326) Gil Forsyth
Add shape=None to *_like() array creation functions (GH#6064) Anderson Banihirwe

Core¶

Update expected error msg for protocol difference in fsspec (GH#6331) Gil Forsyth
Fix for floats < 1 in parse_bytes (GH#6311) Gil Forsyth
Fix exception causes all over the codebase (GH#6308) Ram Rachum
Fix duplicated tests (GH#6303) James Lamb
Remove unused testing function (GH#6304) James Lamb
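
The "Fix exception causes" entry above concerns explicit exception chaining. As a minimal sketch using only the standard library (the helper `load_setting` and its error message are hypothetical and not taken from Dask), `raise ... from err` records the original exception on `__cause__`:

```python
def load_setting(settings, key):
    """Hypothetical helper: look up a key, re-raising with an explicit cause."""
    try:
        return settings[key]
    except KeyError as err:
        # "from err" stores the original exception on __cause__, so the
        # traceback shows it as the direct cause of the new error.
        raise ValueError(f"unknown setting: {key!r}") from err


try:
    load_setting({"scheduler": "threads"}, "workers")
except ValueError as exc:
    chained = exc  # keep a reference; "exc" is cleared when the block ends

print(type(chained.__cause__).__name__)
```

Without `from err`, Python would still attach the `KeyError` as implicit context, but the traceback would describe it as occurring "during handling" rather than as the direct cause.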

DataFrame¶

Add high-level CSV Subgraph (GH#6262) Gil Forsyth
Fix ValueError when merging an index-only 1-partition dataframe (GH#6309) Krishan Bhasin
Make index.map clear divisions. (GH#6285) Julia Signell

Documentation¶

Add link to 2020 survey (GH#6328) Tom Augspurger
Update bag.rst (GH#6317) Ben Shaver

2.18.1 / 2020-06-09¶

Array¶

Don’t try to set name on full (GH#6299) Julia Signell
Histogram: support lazy values for range/bins (another way) (GH#6252) Gabe Joseph

Core¶

Fix exception causes in utils.py (GH#6302) Ram Rachum
Improve performance of HighLevelGraph construction (GH#6293) Julia Signell

Documentation¶

Now readthedocs builds unreleased features’ docstrings (GH#6295) Antonio Ercole De Luca
Add asyncssh intersphinx mappings (GH#6298) Jacob Tomlinson

2.18.0 / 2020-06-05¶

Array¶

Cast slicing index to dask array if same shape as original (GH#6273) Julia Signell
Fix stack error message (GH#6268) Stephanie Gott
full & full_like: error on non-scalar fill_value (GH#6129) Huite
Support for multiple arrays in map_overlap (GH#6165) Eric Czech
Pad resample divisions so that edges are counted (GH#6255) Julia Signell

Bag¶

Random sampling of k elements from a dask bag #4799 (GH#6239) Antonio Ercole De Luca

DataFrame¶

Add dropna, sort, and ascending to sort_values (GH#5880) Julia Signell
Generalize from_dask_array (GH#6263) GALI PREM SAGAR
Add derived docstring for SeriesGroupby.nunique (GH#6284) Julia Signell
Remove NotImplementedError in resample with rule (GH#6274) Abdulelah Bin Mahfoodh
Add dd.to_sql (GH#6038) Ryan Williams

Documentation¶

Update remote data section (GH#6258) Ray Bell

2.17.2 / 2020-05-28¶

Core¶

Re-add the complete extra (GH#6257) Jim Crist-Harif

DataFrame¶

Raise error if resample isn’t going to give right answer (GH#6244) Julia Signell

2.17.1 / 2020-05-28¶

Array¶

Empty array rechunk (GH#6233) Andrew Fulton

Core¶

Make pyyaml required (GH#6250) Jim Crist-Harif
Fix install commands from ImportError (GH#6238) Gaurav Sheni
Remove issue template (GH#6249) Jacob Tomlinson

DataFrame¶

Pass ignore_index to dd_shuffle from DataFrame.shuffle (GH#6247) Richard (Rick) Zamora
Cope with missing HDF keys (GH#6204) Martin Durant
Generalize describe & quantile apis (GH#5137) GALI PREM SAGAR

2.17.0 / 2020-05-26¶

Array¶

Small improvements to da.pad (GH#6213) Mark Boer
Return tuple if multiple outputs in dask.array.apply_gufunc, add test to check for tuple (GH#6207) Kai Mühlbauer
Support stack with unknown chunksizes (GH#6195) swapna

Bag¶

Random Choice on Bags (GH#6208) Antonio Ercole De Luca

Core¶

Raise warning delayed.visualise() (GH#6216) Amol Umbarkar
Ensure other pickle arguments work (GH#6229) John A Kirkham
Overhaul fuse() config (GH#6198) crusaderky
Update dask.order.order to consider “next” nodes using both FIFO and LIFO (GH#5872) Erik Welch

DataFrame¶

Use 0 as fill_value for more agg methods (GH#6245) Julia Signell
Generalize rearrange_by_column_tasks and add DataFrame.shuffle (GH#6066) Richard (Rick) Zamora
Xfail test_rolling_numba_engine for newer numba and older pandas (GH#6236) James Bourbeau
Generalize fix_overlap (GH#6240) GALI PREM SAGAR
Fix DataFrame.shape with no columns (GH#6237) noreentry
Avoid shuffle when setting a presorted index with overlapping divisions (GH#6226) Krishan Bhasin
Adjust the Parquet engine classes to allow more easily subclassing (GH#6211) Marius van Niekerk
Fix dd.merge_asof with left_on='col' & right_index=True (GH#6192) noreentry
Disable warning for concat (GH#6210) Tung Dang
Move AUTO_BLOCKSIZE out of read_csv signature (GH#6214) Jim Crist-Harif
.loc indexing with callable (GH#6185) Endre Mark Borza
Avoid apply in _compute_sum_of_squares for groupby std agg (GH#6186) Richard (Rick) Zamora
Minor correction to test_parquet (GH#6190) Brian Larsen
Adhere to the passed pat for delimiter join and fix error message (GH#6194) GALI PREM SAGAR
Skip test_to_parquet_with_get if no parquet libs available (GH#6188) Scott Sanderson

Documentation¶

Added documentation for distributed.Event class (GH#6231) Nils Braun
Doc write to remote (GH#6124) Ray Bell

2.16.0 / 2020-05-08¶

Array¶

Fix array general-reduction name (GH#6176) Nick Evans
Replace dim with shape in unravel_index (GH#6155) Julia Signell
Moment: handle all elements being masked (GH#5339) Gabe Joseph

Core¶

Remove redundant string concatenations in dask code-base (GH#6137) GALI PREM SAGAR
Upstream compat (GH#6159) Tom Augspurger
Ensure sizeof of dict and sequences returns an integer (GH#6179) James Bourbeau
Estimate python collection sizes with random sampling (GH#6154) Florian Jetter
Update test upstream (GH#6146) Tom Augspurger
Skip test for mindeps build (GH#6144) Tom Augspurger
Switch default multiprocessing context to “spawn” (GH#4003) Itamar Turner-Trauring
Update manifest to include dask-schema (GH#6140) Benjamin Zaitlen
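
For the "spawn" entry above, the following sketch uses only Python's standard `multiprocessing` module to show what requesting a spawn context looks like. It is an illustration of the start method itself and does not reproduce Dask's scheduler code.

```python
import multiprocessing

# Ask for an explicit "spawn" context instead of relying on the platform default.
# Spawned workers start a fresh interpreter rather than forking the parent.
ctx = multiprocessing.get_context("spawn")
start_method = ctx.get_start_method()
print(start_method)
```

Because spawned processes re-import the main module, scripts that start workers this way generally need their entry point guarded by `if __name__ == "__main__":`.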

DataFrame¶

Harden inconsistent-schema handling in pyarrow-based read_parquet (GH#6160) Richard (Rick) Zamora
Add compute kwargs to methods that write data to disk (GH#6056) Krishan Bhasin
Fix issue where unique returns an index like result from backends (GH#6153) GALI PREM SAGAR
Fix internal error in map_partitions with collections (GH#6103) Tom Augspurger

Documentation¶

Add phase of computation to index TOC (GH#6157) Benjamin Zaitlen
Remove unused imports in scheduling script (GH#6138) James Lamb
Fix indent (GH#6147) Martin Durant
Add Tom’s log config example (GH#6143) Martin Durant

2.15.0 / 2020-04-24¶

Array¶

Update dask.array.from_array to warn when passed a Dask collection (GH#6122) James Bourbeau
Un-numpy like behaviour in dask.array.pad (GH#6042) Mark Boer
Add support for repeats=0 in da.repeat (GH#6080) James Bourbeau

Core¶

Fix yaml layout for schema (GH#6132) Benjamin Zaitlen
Configuration Reference (GH#6069) Benjamin Zaitlen
Add configuration option to turn off task fusion (GH#6087) Matthew Rocklin
Skip pyarrow on windows (GH#6094) Tom Augspurger
Set limit to maximum length of fused key (GH#6057) Lucas Rademaker
Add test against #6062 (GH#6072) Martin Durant
Bump checkout action to v2 (GH#6065) James Bourbeau

DataFrame¶

Generalize categorical calls to support cudf Categorical (GH#6113) GALI PREM SAGAR
Avoid reading _metadata on every worker (GH#6017) Richard (Rick) Zamora
Use group_split_dispatch and ignore_index in apply_concat_apply (GH#6119) Richard (Rick) Zamora
Handle new (dtype) pandas metadata with pyarrow (GH#6090) Richard (Rick) Zamora
Skip test_partition_on_cats_pyarrow if pyarrow is not installed (GH#6112) James Bourbeau
Update DataFrame len to handle columns with the same name (GH#6111) James Bourbeau
ArrowEngine bug fixes and test coverage (GH#6047) Richard (Rick) Zamora
Added mode (GH#5958) Adam Lewis

Documentation¶

Update “helm install” for helm 3 usage (GH#6130) JulianWgs
Extend preload documentation (GH#6077) Matthew Rocklin
Fixed small typo in DataFrame map_partitions() docstring (GH#6115) Eugene Huang
Fix typo: “double” should be times, not plus (GH#6091) David Chudzicki
Fix first line of array.random.* docs (GH#6063) Martin Durant
Add section about Semaphore in distributed (GH#6053) Florian Jetter

2.14.0 / 2020-04-03¶

Array¶

Added np.iscomplexobj implementation (GH#6045) Tom Augspurger

Core¶

Update test_rearrange_disk_cleanup_with_exception to pass without cloudpickle installed (GH#6052) James Bourbeau
Fixed flaky test-rearrange (GH#5977) Tom Augspurger

DataFrame¶

Use _meta_nonempty for dtype casting in stack_partitions (GH#6061) mlondschien
Fix bugs in _metadata creation and filtering in parquet ArrowEngine (GH#6023) Richard (Rick) Zamora

Documentation¶

DOC: Add name caveats (GH#6040) Tom Augspurger

2.13.0 / 2020-03-25¶

Array¶

Support dtype and other keyword arguments in da.random (GH#6030) Matthew Rocklin
Register support for cupy sparse hstack/vstack (GH#5735) Corey J. Nolet
Force self.name to str in dask.array (GH#6002) Chuanzhu Xu

Bag¶

Set rename_fused_keys to None by default in bag.optimize (GH#6000) Lucas Rademaker

Core¶

Copy dict in to_graphviz to prevent overwriting (GH#5996) JulianWgs
Stricter pandas xfail (GH#6024) Tom Augspurger
Fix CI failures (GH#6013) James Bourbeau
Update toolz to 0.8.2 and use tlz (GH#5997) Ryan Grout
Move Windows CI builds to GitHub Actions (GH#5862) James Bourbeau

DataFrame¶

Improve path-related exceptions in read_hdf (GH#6032) psimaj
Fix dtype handling in dd.concat (GH#6006) mlondschien
Handle cudf’s leftsemi and leftanti joins (GH#6025) Richard J Zamora
Remove unused npartitions variable in dd.from_pandas (GH#6019) Daniel Saxton
Added shuffle to DataFrame.random_split (GH#5980) petiop

Documentation¶

Fix indentation in scheduler-overview docs (GH#6022) Matthew Rocklin
Update task graphs in optimize docs (GH#5928) Julia Signell
Optionally get rid of intermediary boxes in visualize, and add more labels (GH#5976) Julia Signell

2.12.0 / 2020-03-06¶

Array¶

Improve reuse of temporaries with numpy (GH#5933) Bruce Merry
Make map_blocks with block_info produce a Blockwise (GH#5896) Bruce Merry
Optimize make_blockwise_graph (GH#5940) Bruce Merry
Fix axes ordering in da.tensordot (GH#5975) Gil Forsyth
Adds empty mode to array.pad (GH#5931) Thomas J. Fan

Core¶

Remove toolz.memoize dependency in dask.utils (GH#5978) Ryan Grout
Close pool leaking subprocess (GH#5979) Tom Augspurger
Pin numpydoc to 0.8.0 (fix double autoescape) (GH#5961) Gil Forsyth
Register deterministic tokenization for range objects (GH#5947) James Bourbeau
Unpin msgpack in CI (GH#5930) James Bourbeau
Ensure dot results are placed in unique files. (GH#5937) Elliott Sales de Andrade
Add remaining optional dependencies to Travis 3.8 CI build environment (GH#5920) James Bourbeau

DataFrame¶

Skip parquet getitem optimization for some keys (GH#5917) Tom Augspurger
Add ignore_index argument to rearrange_by_column code path (GH#5973) Richard J Zamora
Add DataFrame and Series memory_usage_per_partition methods (GH#5971) James Bourbeau
xfail test_describe when using Pandas 0.24.2 (GH#5948) James Bourbeau
Implement dask.dataframe.to_numeric (GH#5929) Julia Signell
Add new error message content when columns are in a different order (GH#5927) Julia Signell
Use shallow copy for assign operations when possible (GH#5740) Richard J Zamora

Documentation¶

Changed above to below in dask.array.triu docs (GH#5984) Henrik Andersson
Array slicing: fix typo in slice_with_int_dask_array error message (GH#5981) Gabe Joseph
Grammar and formatting updates to docstrings (GH#5963) James Lamb
Update develop doc with conda option (GH#5939) Ray Bell
Update title of DataFrame extension docs (GH#5954) James Bourbeau
Fixed typos in documentation (GH#5962) James Lamb
Add original class or module as a kwarg on _bind_* methods (GH#5946) Julia Signell
Add collect list example (GH#5938) Ray Bell
Update optimization doc for python 3 (GH#5926) Julia Signell

2.11.0 / 2020-02-19¶

Array¶

Cache result of Array.shape (GH#5916) Bruce Merry
Improve accuracy of estimate_graph_size for rechunk (GH#5907) Bruce Merry
Skip rechunk steps that do not alter chunking (GH#5909) Bruce Merry
Support dtype and other kwargs in coarsen (GH#5903) Matthew Rocklin
Push chunk override from map_blocks into blockwise (GH#5895) Bruce Merry
Avoid using rewrite_blockwise for a singleton (GH#5890) Bruce Merry
Optimize slices_from_chunks (GH#5891) Bruce Merry
Avoid unnecessary __getitem__ in block() when chunks have correct dimensionality (GH#5884) Thomas Robitaille

Bag¶

Add include_path option for dask.bag.read_text (GH#5836) Yifan Gu
Fixes ValueError in delayed execution of bagged NumPy array (GH#5828) Surya Avala

Core¶

CI: Pin msgpack (GH#5923) Tom Augspurger
Rename test_inner to test_outer (GH#5922) Shiva Raisinghani
quote should quote dicts too (GH#5905) Bruce Merry
Register a normalizer for literal (GH#5898) Bruce Merry
Improve layer name synthesis for non-HLGs (GH#5888) Bruce Merry
Replace flake8 pre-commit-hook with upstream (GH#5892) Julia Signell
Call pip as a module to avoid warnings (GH#5861) Cyril Shcherbin
Close ThreadPool at exit (GH#5852) Tom Augspurger
Remove dask.dataframe import in tokenization code (GH#5855) James Bourbeau

DataFrame¶

Require pandas>=0.23 (GH#5883) Tom Augspurger
Remove lambda from dataframe aggregation (GH#5901) Matthew Rocklin
Fix exception chaining in dataframe/__init__.py (GH#5882) Ram Rachum
Add support for reductions on empty dataframes (GH#5804) Shiva Raisinghani
Expose sort= argument for groupby (GH#5801) Richard J Zamora
Add df.empty property (GH#5711) rockwellw
Use parquet read speed-ups from fastparquet.api.paths_to_cats. (GH#5821) Igor Gotlibovych

Documentation¶

Deprecate doc_wraps (GH#5912) Tom Augspurger
Update array internal design docs for HighLevelGraph era (GH#5889) Bruce Merry
Move over dashboard connection docs (GH#5877) Matthew Rocklin
Move prometheus docs from distributed.dask.org (GH#5876) Matthew Rocklin
Removing duplicated DO block at the end (GH#5878) K.-Michael Aye
map_blocks see also (GH#5874) Tom Augspurger
More derived from (GH#5871) Julia Signell
Fix typo (GH#5866) Yetunde Dada
Fix typo in cloud.rst (GH#5860) Andrew Thomas
Add note pointing to code of conduct and diversity statement (GH#5844) Matthew Rocklin

2.10.1 / 2020-01-30¶

Fix Pandas 1.0 version comparison (GH#5851) Tom Augspurger
Fix typo in distributed diagnostics documentation (GH#5841) Gerrit Holl

2.10.0 / 2020-01-28¶

Support for pandas 1.0’s new BooleanDtype and StringDtype (GH#5815) Tom Augspurger
Compatibility with pandas 1.0’s API breaking changes and deprecations (GH#5792) Tom Augspurger
Fixed non-deterministic tokenization of some extension-array backed pandas objects (GH#5813) Tom Augspurger
Fixed handling of dataclass class objects in collections (GH#5812) Matteo De Wint
Fixed resampling with tz-aware dates when one of the endpoints fell in a non-existent time (GH#5807) dfonnegra
Delay initial Zarr dataset creation until the computation occurs (GH#5797) Chris Roat
Use parquet dataset statistics in more cases with the pyarrow engine (GH#5799) Richard J Zamora
Fixed exception in groupby.std() when some of the keys were large integers (GH#5737) H. Thomson Comer

2.9.2 / 2020-01-16¶

Array¶

Unify chunks in broadcast_arrays (GH#5765) Matthew Rocklin

Core¶

xfail CSV encoding tests (GH#5791) Tom Augspurger
Update order to handle empty dask graph (GH#5789) James Bourbeau
Redo dask.order.order (GH#5646) Erik Welch

DataFrame¶

Add transparent compression for on-disk shuffle with partd (GH#5786) Christian Wesp
Fix repr for empty dataframes (GH#5781) Shiva Raisinghani
Pandas 1.0.0RC0 compat (GH#5784) Tom Augspurger
Remove buggy assertions (GH#5783) Tom Augspurger
Pandas 1.0 compat (GH#5782) Tom Augspurger
Fix bug in pyarrow-based read_parquet on partitioned datasets (GH#5777) Richard J Zamora
Compat for pandas 1.0 (GH#5779) Tom Augspurger
Fix groupby/mean error with categorical index (GH#5776) Richard J Zamora
Support empty partitions when performing cumulative aggregation (GH#5730) Matthew Rocklin
set_index accepts single-item unnested list (GH#5760) Wes Roach
Fixed partitioning in set index for ordered Categorical (GH#5715) Tom Augspurger

Documentation¶

Note additional use case for normalize_token.register (GH#5766) Thomas A Caswell
Update bag repartition docstring (GH#5772) Timost
Small typos (GH#5771) Maarten Breddels
Fix typo in Task Expectations docs (GH#5767) James Bourbeau
Add docs section on task expectations to graph page (GH#5764) Devin Petersohn

2.9.1 / 2019-12-27¶

Array¶

Support Array.view with dtype=None (GH#5736) Anderson Banihirwe
Add dask.array.nanmedian (GH#5684) Deepak Cherian

Core¶

xfail test_temporary_directory on Python 3.8 (GH#5734) James Bourbeau
Add support for Python 3.8 (GH#5603) James Bourbeau
Use id to dedupe constants in rewrite_blockwise (GH#5696) Jim Crist

DataFrame¶

Raise error when converting a dask dataframe scalar to a boolean (GH#5743) James Bourbeau
Ensure dataframe groupby-variance is greater than zero (GH#5728) Matthew Rocklin
Fix DataFrame.__iter__ (GH#5719) Tom Augspurger
Support Parquet filters in disjunctive normal form, like PyArrow (GH#5656) Matteo De Wint
Auto-detect categorical columns in ArrowEngine-based read_parquet (GH#5690) Richard J Zamora
Skip parquet getitem optimization tests if no engine found (GH#5697) James Bourbeau
Fix independent optimization of parquet-getitem (GH#5613) Tom Augspurger
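
The "disjunctive normal form" entry above refers to filters written as a list of lists of `(column, op, value)` tuples: tuples inside an inner list are combined with AND, and the inner lists are combined with OR. The sketch below evaluates such a filter against plain dictionaries to illustrate those semantics. `row_matches` and `OPS` are hypothetical names used only for this illustration and are not part of Dask or PyArrow; real engines typically apply filters to partitions or row-group statistics rather than row by row.

```python
import operator

# Comparison operators supported by this illustration only.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def row_matches(row, filters):
    """Return True if `row` satisfies a DNF filter: OR over groups, AND within a group."""
    return any(
        all(OPS[op](row[column], value) for column, op, value in group)
        for group in filters
    )


# (year == 2019 AND month >= 11) OR (year == 2020)
filters = [[("year", "==", 2019), ("month", ">=", 11)], [("year", "==", 2020)]]

rows = [
    {"year": 2019, "month": 10},
    {"year": 2019, "month": 12},
    {"year": 2020, "month": 1},
]
kept = [row for row in rows if row_matches(row, filters)]
print(kept)
```

The first row is dropped because it fails the `month >= 11` condition in the first group and the `year == 2020` condition in the second.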

Documentation¶

Update helm config doc (GH#5750) Ray Bell
Link to examples.dask.org in several places (GH#5733) Tom Augspurger
Add missing ” in performance report example (GH#5724) James Bourbeau
Resolve several documentation build warnings (GH#5685) James Bourbeau
Add info on performance_report (GH#5713) Benjamin Zaitlen
Add more docs disclaimers (GH#5710) Julia Signell
Fix simple typo: wihout -> without (GH#5708) Tim Gates
Update numpydoc dependency (GH#5694) James Bourbeau

2.9.0 / 2019-12-06¶

Array¶

Fix da.std to work with NumPy arrays (GH#5681) James Bourbeau

Core¶

Register sizeof functions for Numba and RMM (GH#5668) John A Kirkham
Update meeting time (GH#5682) Tom Augspurger

DataFrame¶

Modify dd.DataFrame.drop to use shallow copy (GH#5675) Richard J Zamora
Fix bug in _get_md_row_groups (GH#5673) Richard J Zamora
Close sqlalchemy engine after querying DB (GH#5629) Krishan Bhasin
Allow dd.map_partitions to not enforce meta (GH#5660) Matthew Rocklin
Generalize concat_unindexed_dataframes to support cudf-backend (GH#5659) Richard J Zamora
Add dataframe resample methods (GH#5636) Benjamin Zaitlen
Compute length of dataframe as length of first column (GH#5635) Matthew Rocklin

Documentation¶

Doc fixup (GH#5665) James Bourbeau
Update doc build instructions (GH#5640) James Bourbeau
Fix ADL link (GH#5639) Ray Bell
Add documentation build (GH#5617) James Bourbeau

2.8.1 / 2019-11-22¶

Array¶

Use auto rechunking in da.rechunk if no value given (GH#5605) Matthew Rocklin

Core¶

Add simple action to activate GH actions (GH#5619) James Bourbeau

DataFrame¶

Fix “file_path_0” bug in aggregate_row_groups (GH#5627) Richard J Zamora
Add chunksize argument to read_parquet (GH#5607) Richard J Zamora
Change test_repartition_npartitions to support aarch64 architecture (GH#5620) ossdev07
Categories lost after groupby + agg (GH#5423) Oliver Hofkens
Fixed relative path issue with parquet metadata file (GH#5608) Nuno Gomes Silva
Enable gpu-backed covariance/correlation in dataframes (GH#5597) Richard J Zamora

Documentation¶

Fix institutional faq and unknown doc warnings (GH#5616) James Bourbeau
Add doc for some utils (GH#5609) Tom Augspurger
Removes html_extra_path (GH#5614) James Bourbeau
Fixed See Also reference (GH#5612) Tom Augspurger

2.8.0 / 2019-11-14¶

Array¶

Implement complete dask.array.tile function (GH#5574) Bouwe Andela
Add median along an axis with automatic rechunking (GH#5575) Matthew Rocklin
Allow da.asarray to chunk inputs (GH#5586) Matthew Rocklin

Bag¶

Use key_split in Bag name (GH#5571) Matthew Rocklin

Core¶

Switch Doctests to Py3.7 (GH#5573) Ryan Nazareth
Relax get_colors test to adapt to new Bokeh release (GH#5576) Matthew Rocklin
Add dask.blockwise.fuse_roots optimization (GH#5451) Matthew Rocklin
Add sizeof implementation for small dicts (GH#5578) Matthew Rocklin
Update fsspec, gcsfs, s3fs (GH#5588) Tom Augspurger

DataFrame¶

Add dropna argument to groupby (GH#5579) Richard J Zamora
Revert “Remove import of dask_cudf, which is now a part of cudf (GH#5568)” (GH#5590) Matthew Rocklin

Documentation¶

Add best practice for dask.compute function (GH#5583) Matthew Rocklin
Create FUNDING.yml (GH#5587) Gina Helfrich
Add screencast for coordination primitives (GH#5593) Matthew Rocklin
Move funding to .github repo (GH#5589) Tom Augspurger
Update calendar link (GH#5569) Tom Augspurger

2.7.0 / 2019-11-08¶

This release drops support for Python 3.5.

Array¶

Reuse code for assert_eq util method (GH#5496) Vijayant
Update da.array to always return a dask array (GH#5510) James Bourbeau
Skip transpose on trivial inputs (GH#5523) Ryan Abernathey
Avoid NumPy scalar string representation in tokenize (GH#5527) James Bourbeau
Remove unnecessary tiledb shape constraint (GH#5545) Norman Barker
Removes bytes from sparse array HTML repr (GH#5556) James Bourbeau

Core¶

Drop Python 3.5 (GH#5528) James Bourbeau
Update the use of fixtures in distributed tests (GH#5497) Matthew Rocklin
Changed deprecated bokeh-port to dashboard-address (GH#5507) darindf
Avoid updating with identical dicts in ensure_dict (GH#5501) James Bourbeau
Test Upstream (GH#5516) Tom Augspurger
Accelerate reverse_dict (GH#5479) Ryan Grout
Update test_imports.sh (GH#5534) James Bourbeau
Support cgroups limits on cpu count in multiprocess and threaded schedulers (GH#5499) Albert DeFusco
Update minimum pyarrow version on CI (GH#5562) James Bourbeau
Make cloudpickle optional (GH#5511) crusaderky

DataFrame¶

Add an example of index_col usage (GH#3072) Bruno Bonfils
Explicitly use iloc for row indexing (GH#5500) Krishan Bhasin
Accept dask arrays on columns assignment (GH#5224) Henrique Ribeiro-
Implement unique and value_counts for SeriesGroupBy (GH#5358) Scott Sievert
Add sizeof definition for pyarrow tables and columns (GH#5522) Richard J Zamora
Enable row-group task partitioning in pyarrow-based read_parquet (GH#5508) Richard J Zamora
Removes npartitions=’auto’ from dd.merge docstring (GH#5531) James Bourbeau
Apply enforce error message shows non-overlapping columns. (GH#5530) Tom Augspurger
Optimize meta_nonempty for repetitive dtypes (GH#5553) Petio Petrov
Remove import of dask_cudf, which is now a part of cudf (GH#5568) Mads R. B. Kristensen

Documentation¶

Make capitalization more consistent in FAQ docs (GH#5512) Matthew Rocklin
Add CONTRIBUTING.md (GH#5513) Jacob Tomlinson
Document optional dependencies (GH#5456) Prithvi MK
Update helm chart docs to reflect new chart repo (GH#5539) Jacob Tomlinson
Add Resampler to API docs (GH#5551) James Bourbeau
Fix typo in read_sql_table (GH#5554) Eric Dill
Add adaptive deployments screencast [skip ci] (GH#5566) Matthew Rocklin

2.6.0 / 2019-10-15¶

Core¶

Call ensure_dict on graphs before entering toolz.merge (GH#5486) Matthew Rocklin
Consolidating hash dispatch functions (GH#5476) Richard J Zamora

DataFrame¶

Support Python 3.5 in Parquet code (GH#5491) Benjamin Zaitlen
Avoid identity check in warn_dtype_mismatch (GH#5489) Tom Augspurger
Enable unused groupby tests (GH#3480) Jörg Dietrich
Remove old parquet and bcolz dataframe optimizations (GH#5484) Matthew Rocklin
Add getitem optimization for read_parquet (GH#5453) Tom Augspurger
Use _constructor_sliced method to determine Series type (GH#5480) Richard J Zamora
Fix map(series) for unsorted base series index (GH#5459) Justin Waugh
Fix KeyError with Groupby label (GH#5467) Ryan Nazareth

Documentation¶

Use Zoom meeting instead of appear.in (GH#5494) Matthew Rocklin
Added curated list of resources (GH#5460) Javad
Update SSH docs to include SSHCluster (GH#5482) Matthew Rocklin
Update “Why Dask?” page (GH#5473) Matthew Rocklin
Fix typos in docstrings (GH#5469) garanews

2.5.2 / 2019-10-04¶

Array¶

Correct chunk size logic for asymmetric overlaps (GH#5449) Ben Jeffery
Make da.unify_chunks public API (GH#5443) Matthew Rocklin

DataFrame¶

Fix dask.dataframe.fillna handling of Scalar object (GH#5463) Zhenqing Li

Documentation¶

Remove boxes in Spark comparison page (GH#5445) Matthew Rocklin
Add latest presentations (GH#5446) Javad
Update cloud documentation (GH#5444) Matthew Rocklin

2.5.0 / 2019-09-27¶

Core¶

Add sentinel no_default to get_dependencies task (GH#5420) James Bourbeau
Update fsspec version (GH#5415) Matthew Rocklin
Remove PY2 checks (GH#5400) Jim Crist

DataFrame¶

Add option to not check meta in dd.from_delayed (GH#5436) Christopher J. Wright
Fix test_timeseries_nulls_in_schema failures with pyarrow master (GH#5421) Richard J Zamora
Reduce read_metadata output size in pyarrow/parquet (GH#5391) Richard J Zamora
Test numeric edge case for repartition with npartitions. (GH#5433) amerkel2
Unxfail pandas-datareader test (GH#5430) Tom Augspurger
Add DataFrame.pop implementation (GH#5422) Matthew Rocklin
Enable merge/set_index for cudf-based dataframes with cupy values (GH#5322) Richard J Zamora
drop_duplicates support for positional subset parameter (GH#5410) Wes Roach

Documentation¶

Add screencasts to array, bag, dataframe, delayed, futures and setup (GH#5429) (GH#5424) Matthew Rocklin
Fix delimiter parsing documentation (GH#5428) Mahmut Bulut
Update overview image (GH#5404) James Bourbeau

2.4.0 / 2019-09-13¶

Array¶

Adds explicit h5py.File mode (GH#5390) James Bourbeau
Provides method to compute unknown array chunks sizes (GH#5312) Scott Sievert
Ignore runtime warning in Array compute_meta (GH#5356) estebanag
Add _meta to Array.__dask_postpersist__ (GH#5353) Benoit Bovy
Fixup da.asarray and da.asanyarray for datetime64 dtype and xarray objects (GH#5334) Stephan Hoyer
Add shape implementation (GH#5293) Tom Augspurger
Add chunktype to array text repr (GH#5289) James Bourbeau
Array.random.choice: handle array-like non-arrays (GH#5283) Gabe Joseph

Core¶

Remove deprecated code (GH#5401) Jim Crist
Fix funcname when vectorized func has no __name__ (GH#5399) James Bourbeau
Truncate funcname to avoid long key names (GH#5383) Matthew Rocklin
Add support for numpy.vectorize in funcname (GH#5396) James Bourbeau
Fixed HDFS upstream test (GH#5395) Tom Augspurger
Support numbers and None in parse_bytes/timedelta (GH#5384) Matthew Rocklin
Fix tokenizing of subindexes on memmapped numpy arrays (GH#5351) Henry Pinkard
Upstream fixups (GH#5300) Tom Augspurger

DataFrame¶

Allow pandas to cast type of statistics (GH#5402) Richard J Zamora
Preserve index dtype after applying dd.pivot_table (GH#5385) therhaag
Implement explode for Series and DataFrame (GH#5381) Arpit Solanki
set_index on categorical fails with less categories than partitions (GH#5354) Oliver Hofkens
Support output to a single CSV file (GH#5304) Hongjiu Zhang
Add groupby().transform() (GH#5327) Oliver Hofkens
Adding filter kwarg to pyarrow dataset call (GH#5348) Richard J Zamora
Implement and check compression defaults for parquet (GH#5335) Sarah Bird
Pass sqlalchemy params to delayed objects (GH#5332) Arpit Solanki
Fixing schema handling in arrow-parquet (GH#5307) Richard J Zamora
Add support for DF and Series groupby().idxmin/max() (GH#5273) Oliver Hofkens
Add correlation calculation and add test (GH#5296) Benjamin Zaitlen

Documentation¶

Numpy docstring standard has moved (GH#5405) Wes Roach
Reference correct NumPy array name (GH#5403) Wes Roach
Minor edits to Array chunk documentation (GH#5372) Scott Sievert
Add methods to API docs (GH#5387) Tom Augspurger
Add namespacing to configuration example (GH#5374) Matthew Rocklin
Add get_task_stream and profile to the diagnostics page (GH#5375) Matthew Rocklin
Add best practice to load data with Dask (GH#5369) Matthew Rocklin
Update institutional-faq.rst (GH#5345) DomHudson
Add threads and processes note to the best practices (GH#5340) Matthew Rocklin
Update cuDF links (GH#5328) James Bourbeau
Fixed small typo with parentheses placement (GH#5311) Eugene Huang
Update link in reshape docstring (GH#5297) James Bourbeau

2.3.0 / 2019-08-16¶

Array¶

Raise exception when from_array is given a dask array (GH#5280) David Hoese
Avoid adjusting gufunc’s meta dtype twice (GH#5274) Peter Andreas Entschev
Add meta= keyword to map_blocks and add test with sparse (GH#5269) Matthew Rocklin
Add rollaxis and moveaxis (GH#4822) Tobias de Jong
Always increment old chunk index (GH#5256) James Bourbeau
Shuffle dask array (GH#3901) Tom Augspurger
Fix ordering when indexing a dask array with a bool dask array (GH#5151) James Bourbeau

Bag¶

Add workaround for memory leaks in bag generators (GH#5208) Marco Neumann

Core¶

Set strict xfail option (GH#5220) James Bourbeau
test-upstream (GH#5267) Tom Augspurger
Fixed HDFS CI failure (GH#5234) Tom Augspurger
Error nicely if no file size inferred (GH#5231) Jim Crist
A few changes to config.set (GH#5226) Jim Crist
Fixup black string normalization (GH#5227) Jim Crist
Pin NumPy in windows tests (GH#5228) Jim Crist
Ensure parquet tests are skipped if fastparquet and pyarrow not installed (GH#5217) James Bourbeau
Add fsspec to readthedocs (GH#5207) Matthew Rocklin
Bump NumPy and Pandas to 1.17 and 0.25 in CI test (GH#5179) John A Kirkham

DataFrame¶

Fix DataFrame.query docstring (incorrect numexpr API) (GH#5271) Doug Davis
Parquet metadata-handling improvements (GH#5218) Richard J Zamora
Improve messaging around sorted parquet columns for index (GH#5265) Martin Durant
Add rearrange_by_divisions and set_index support for cudf (GH#5205) Richard J Zamora
Fix groupby.std() with integer column names (GH#5096) Nicolas Hug
Add Series.__iter__ (GH#5071) Blane
Generalize hash_pandas_object to work for non-pandas backends (GH#5184) GALI PREM SAGAR
Add rolling cov (GH#5154) Ivars Geidans
Add columns argument in drop function (GH#5223) Henrique Ribeiro

Documentation¶

Update institutional FAQ doc (GH#5277) Matthew Rocklin
Add draft of institutional FAQ (GH#5214) Matthew Rocklin
Make boxes for dask-spark page (GH#5249) Martin Durant
Add motivation for shuffle docs (GH#5213) Matthew Rocklin
Fix links and API entries for best-practices (GH#5246) Martin Durant
Remove “bytes” (internal data ingestion) doc page (GH#5242) Martin Durant
Redirect from our local distributed page to distributed.dask.org (GH#5248) Matthew Rocklin
Cleanup API page (GH#5247) Matthew Rocklin
Remove excess endlines from install docs (GH#5243) Matthew Rocklin
Remove item list in phases of computation doc (GH#5245) Martin Durant
Remove custom graphs from the TOC sidebar (GH#5241) Matthew Rocklin
Remove experimental status of custom collections (GH#5236) James Bourbeau
Adds table of contents to Why Dask? (GH#5244) James Bourbeau
Moves bag overview to top-level bag page (GH#5240) James Bourbeau
Remove use-cases in favor of stories.dask.org (GH#5238) Matthew Rocklin
Removes redundant TOC information in index.rst (GH#5235) James Bourbeau
Elevate dashboard in distributed diagnostics documentation (GH#5239) Martin Durant
Updates “add” layer in HLG docs example (GH#5237) James Bourbeau
Update GUFunc documentation (GH#5232) Matthew Rocklin

2.2.0 / 2019-08-01¶

Array¶

Use da.from_array(…, asarray=False) if input follows NEP-18 (GH#5074) Matthew Rocklin
Add missing attributes to from_array documentation (GH#5108) Peter Andreas Entschev
Fix meta computation for some reduction functions (GH#5035) Peter Andreas Entschev
Raise informative error in to_zarr if unknown chunks (GH#5148) James Bourbeau
Remove invalid pad tests (GH#5122) Tom Augspurger
Ignore NumPy warnings in compute_meta (GH#5103) Peter Andreas Entschev
Fix kurtosis calc for single dimension input array (GH#5177) @andrethrill
Support Numpy 1.17 in tests (GH#5192) Matthew Rocklin

Bag¶

Supply pool to bag test to resolve intermittent failure (GH#5172) Tom Augspurger

Core¶

Base dask on fsspec (GH#5064) (GH#5121) Martin Durant
Various upstream compatibility fixes (GH#5056) Tom Augspurger
Make distributed tests optional again. (GH#5128) Elliott Sales de Andrade
Fix HDFS in dask (GH#5130) Martin Durant
Ignore some more invalid value warnings. (GH#5140) Elliott Sales de Andrade

DataFrame¶

Fix pd.MultiIndex size estimate (GH#5066) Brett Naul
Generalizing has_known_categories (GH#5090) GALI PREM SAGAR
Refactor Parquet engine (GH#4995) Richard J Zamora
Add divide method to series and dataframe (GH#5094) msbrown47
fix flaky partd test (GH#5111) Tom Augspurger
Adjust is_dataframe_like to adjust for value_counts change (GH#5143) Tom Augspurger
Generalize rolling windows to support non-Pandas dataframes (GH#5149) Nick Becker
Avoid unnecessary aggregation in pivot_table (GH#5173) Daniel Saxton
Add column names to apply_and_enforce error message (GH#5180) Matthew Rocklin
Add schema keyword argument to to_parquet (GH#5150) Sarah Bird
Remove recursion error in accessors (GH#5182) Jim Crist
Allow fastparquet to handle gather_statistics=False for file lists (GH#5157) Richard J Zamora

Documentation¶

Adds NumFOCUS badge to the README (GH#5086) James Bourbeau
Update developer docs [ci skip] (GH#5093) Jim Crist
Document DataFrame.set_index computation behavior Natalya Rapstine
Use pip install . instead of calling setup.py (GH#5139) Matthias Bussonier
Close user survey (GH#5147) Tom Augspurger
Fix Google Calendar meeting link (GH#5155) Loïc Estève
Add docker image customization example (GH#5171) James Bourbeau
Update remote-data-services after fsspec (GH#5170) Martin Durant
Fix typo in spark.rst (GH#5164) Xavier Holt
Update setup/python docs for async/await API (GH#5163) Matthew Rocklin
Update Local Storage HPC documentation (GH#5165) Matthew Rocklin

2.1.0 / 2019-07-08¶

Array¶

Add recompute= keyword to svd_compressed for lower-memory use (GH#5041) Matthew Rocklin
Change __array_function__ implementation for backwards compatibility (GH#5043) Ralf Gommers
Added dtype and shape kwargs to apply_along_axis (GH#3742) Davis Bennett
Fix reduction with empty tuple axis (GH#5025) Peter Andreas Entschev
Drop size 0 arrays in stack (GH#4978) John A Kirkham

Core¶

Removes index keyword from pandas to_parquet call (GH#5075) James Bourbeau
Fixes upstream dev CI build installation (GH#5072) James Bourbeau
Ensure scalar arrays are not rendered to SVG (GH#5058) Willi Rath
Environment creation overhaul (GH#5038) Tom Augspurger
s3fs, moto compatibility (GH#5033) Tom Augspurger
pytest 5.0 compat (GH#5027) Tom Augspurger

DataFrame¶

Fix compute_meta recursion in blockwise (GH#5048) Peter Andreas Entschev
Remove hard dependency on pandas in get_dummies (GH#5057) GALI PREM SAGAR
Check dtypes unchanged when using DataFrame.assign (GH#5047) asmith26
Fix cumulative functions on tables with more than 1 partition (GH#5034) tshatrov
Handle non-divisible sizes in repartition (GH#5013) George Sakkis
Handles timestamp and preserve_index changes in pyarrow (GH#5018) Richard J Zamora
Fix undefined meta for str.split(expand=False) (GH#5022) Brett Naul
Removed checks used for debugging merge_asof (GH#5011) Cody Johnson
Don’t use type when getting accessor in dataframes (GH#4992) Matthew Rocklin
Add melt as a method of Dask DataFrame (GH#4984) Dustin Tindall
Adds path-like support to to_hdf (GH#5003) James Bourbeau

Documentation¶

Point to latest K8s setup article in JupyterHub docs (GH#5065) Sean McKenna
Changes vizualize to visualize (GH#5061) David Brochart
Fix from_sequence typo in delayed best practices (GH#5045) James Bourbeau
Add user survey link to docs (GH#5026) James Bourbeau
Fixes typo in optimization docs (GH#5015) James Bourbeau
Update community meeting information (GH#5006) Tom Augspurger

2.0.0 / 2019-06-25¶

Array¶

Support automatic chunking in da.indices (GH#4981) James Bourbeau
Err if there are no arrays to stack (GH#4975) John A Kirkham
Asymmetrical Array Overlap (GH#4863) Michael Eaton
Dispatch concatenate where possible within dask array (GH#4669) Hameer Abbasi
Fix tokenization of memmapped numpy arrays on different parts of the same file (GH#4931) Henry Pinkard
Preserve NumPy condition in da.asarray to preserve output shape (GH#4945) Alistair Miles
Expand foo_like_safe usage (GH#4946) Peter Andreas Entschev
Defer order/casting einsum parameters to NumPy implementation (GH#4914) Peter Andreas Entschev
Remove numpy warning in moment calculation (GH#4921) Matthew Rocklin
Fix meta_from_array to support Xarray test suite (GH#4938) Matthew Rocklin
Cache chunk boundaries for integer slicing (GH#4923) Bruce Merry
Drop size 0 arrays in concatenate (GH#4167) John A Kirkham
Raise ValueError if concatenate is given no arrays (GH#4927) John A Kirkham
Promote types in concatenate using _meta (GH#4925) John A Kirkham
Add chunk type to html repr in Dask array (GH#4895) Matthew Rocklin
Add Dask Array._meta attribute (GH#4543) Peter Andreas Entschev
- Fix _meta slicing of flexible types (GH#4912) Peter Andreas Entschev
- Minor meta construction cleanup in concatenate (GH#4937) Peter Andreas Entschev
- Further relax Array meta checks for Xarray (GH#4944) Matthew Rocklin
- Support meta= keyword in da.from_delayed (GH#4972) Matthew Rocklin
- Concatenate meta along axis (GH#4977) John A Kirkham
- Use meta in stack (GH#4976) John A Kirkham
- Move blockwise_meta to more general compute_meta function (GH#4954) Matthew Rocklin
Alias .partitions to .blocks attribute of dask arrays (GH#4853) Genevieve Buckley
Drop outdated numpy_compat functions (GH#4850) John A Kirkham
Allow da.eye to support arbitrary chunking sizes with chunks='auto' (GH#4834) Anderson Banihirwe
Fix CI warnings in dask.array tests (GH#4805) Tom Augspurger
Make map_blocks work with drop_axis + block_info (GH#4831) Bruce Merry
Add SVG image and table in Array._repr_html_ (GH#4794) Matthew Rocklin
ufunc: avoid __array_wrap__ in favor of __array_function__ (GH#4708) Peter Andreas Entschev
Ensure trivial padding returns the original array (GH#4990) John A Kirkham
Test da.block with 0-size arrays (GH#4991) John A Kirkham

Core¶

Drop Python 2.7 (GH#4919) Jim Crist
Quiet dependency installs in CI (GH#4960) Tom Augspurger
Raise on warnings in tests (GH#4916) Tom Augspurger
Add a diagnostics extra to setup.py (includes bokeh) (GH#4924) John A Kirkham
Add newline delimiter keyword to OpenFile (GH#4935) btw08
Overload HighLevelGraphs values method (GH#4918) James Bourbeau
Add __await__ method to Dask collections (GH#4901) Matthew Rocklin
Also ignore AttributeErrors which may occur if snappy (not python-snappy) is installed (GH#4908) Mark Bell
Canonicalize key names in config.rename (GH#4903) Ian Bolliger
Bump minimum partd to 0.3.10 (GH#4890) Tom Augspurger
Catch async def SyntaxError (GH#4836) James Bourbeau
catch IOError in ensure_file (GH#4806) Justin Poehnelt
Cleanup CI warnings (GH#4798) Tom Augspurger
Move distributed’s parse and format functions to dask.utils (GH#4793) Matthew Rocklin
Apply black formatting (GH#4983) James Bourbeau
Package license file in wheels (GH#4988) John A Kirkham

DataFrame¶

Add an optional partition_size parameter to repartition (GH#4416) George Sakkis
merge_asof and prefix_reduction (GH#4877) Cody Johnson
Allow dataframes to be indexed by dask arrays (GH#4882) Endre Mark Borza
Avoid deprecated message parameter in pytest.raises (GH#4962) James Bourbeau
Update test_to_records to test with lengths argument (GH#4515) asmith26
Remove pandas pinning in Dataframe accessors (GH#4955) Matthew Rocklin
Fix correlation of series with same names (GH#4934) Philipp S. Sommer
Map Dask Series to Dask Series (GH#4872) Justin Waugh
Warn in dd.merge on dtype warning (GH#4917) mcsoini
Add groupby Covariance/Correlation (GH#4889) Benjamin Zaitlen
keep index name with to_datetime (GH#4905) Ian Bolliger
Add Parallel variance computation for dataframes (GH#4865) Ksenia Bobrova
Add divmod implementation to arrays and dataframes (GH#4884) Henrique Ribeiro
Add documentation for dataframe reshape methods (GH#4896) tpanza
Avoid use of pandas.compat (GH#4881) Tom Augspurger
Added accessor registration for Series, DataFrame, and Index (GH#4829) Tom Augspurger
Add read_function keyword to read_json (GH#4810) Richard J Zamora
Provide full type name in check_meta (GH#4819) Matthew Rocklin
Correctly estimate bytes per row in read_sql_table (GH#4807) Lijo Jose
Adding support of non-numeric data to describe() (GH#4791) Ksenia Bobrova
Scalars for extension dtypes. (GH#4459) Tom Augspurger
Call head before compute in dd.from_delayed (GH#4802) Matthew Rocklin
Add support for rolling operations with a window larger than the partition size in DataFrames with a time-based index (GH#4796) Jorge Pessoa
Update groupby-apply doc with warning (GH#4800) Tom Augspurger
Change groupby-ness tests in _maybe_slice (GH#4786) Benjamin Zaitlen
Add master best practices document (GH#4745) Matthew Rocklin
Add document for how Dask works with GPUs (GH#4792) Matthew Rocklin
Add cli API docs (GH#4788) James Bourbeau
Ensure concat output has coherent dtypes (GH#4692) Guillaume Lemaitre
Fixes pandas_datareader dependencies installation (GH#4989) James Bourbeau
Accept pathlib.Path as pattern in read_hdf (GH#3335) Jörg Dietrich

Documentation¶

Move CLI API docs to relevant pages (GH#4980) James Bourbeau
Add to_datetime function to dataframe API docs Matthew Rocklin
Add documentation entry for dask.array.ma.average (GH#4970) Bouwe Andela
Add bag.read_avro to bag API docs (GH#4969) James Bourbeau
Fix typo (GH#4968) mbarkhau
Docs: Drop support for Python 2.7 (GH#4932) Hugo
Remove requirement to modify changelog (GH#4915) Matthew Rocklin
Add documentation about meta column order (GH#4887) Tom Augspurger
Add documentation note in DataFrame.shift (GH#4886) Tom Augspurger
Docs: Fix typo (GH#4868) Paweł Kordek
Put do/don’t into boxes for delayed best practice docs (GH#3821) Martin Durant
Doc fixups (GH#2528) Tom Augspurger
Add quansight to paid support doc section (GH#4838) Martin Durant
Add document for custom startup (GH#4833) Matthew Rocklin
Allow utils.derive_from to accept functions, apply across array (GH#4804) Martin Durant
Add “Avoid Large Partitions” section to best practices (GH#4808) Matthew Rocklin
Update URL for joblib to new website hosting their doc (GH#4816) Christian Hudon

1.2.2 / 2019-05-08¶

Array¶

Clarify regions kwarg to array.store (GH#4759) Martin Durant
Add dtype= parameter to da.random.randint (GH#4753) Matthew Rocklin
Use “row major” rather than “C order” in docstring (GH#4452) @asmith26
Normalize Xarray datasets to Dask arrays (GH#4756) Matthew Rocklin
Remove normed keyword in da.histogram (GH#4755) Matthew Rocklin

Bag¶

Add key argument to Bag.distinct (GH#4423) Daniel Severo

Core¶

Add core dask config file (GH#4774) Matthew Rocklin
Add core dask config file to MANIFEST.in (GH#4780) James Bourbeau
Enabling glob with HTTP file-system (GH#3926) Martin Durant
HTTPFile.seek with whence=1 (GH#4751) Martin Durant
Remove config key normalization (GH#4742) Jim Crist

DataFrame¶

Remove explicit references to Pandas in dask.dataframe.groupby (GH#4778) Matthew Rocklin
Add support for group_keys kwarg in DataFrame.groupby() (GH#4771) Brian Chu
Describe doc (GH#4762) Martin Durant
Remove explicit pandas check in cumulative aggregations (GH#4765) Nick Becker
Added meta for read_json and test (GH#4588) Abhinav Ralhan
Add test for dtype casting (GH#4760) Martin Durant
Document alignment in map_partitions (GH#4757) Jim Crist
Implement Series.str.split(expand=True) (GH#4744) Matthew Rocklin

Documentation¶

Tweaks to develop.rst from trying to run tests (GH#4772) Christian Hudon
Add document describing phases of computation (GH#4766) Matthew Rocklin
Point users to Dask-Yarn from spark documentation (GH#4770) Matthew Rocklin
Update images in delayed doc to remove labels (GH#4768) Martin Durant
Explain intermediate storage for dask arrays (GH#4025) John A Kirkham
Specify bash code-block in array best practices (GH#4764) James Bourbeau
Add array best practices doc (GH#4705) Matthew Rocklin
Update optimization docs now that cull is not automatic (GH#4752) Matthew Rocklin

1.2.1 / 2019-04-29¶

Array¶

Fix map_blocks with block_info and broadcasting (GH#4737) Bruce Merry
Make ‘minlength’ keyword argument optional in da.bincount (GH#4684) Genevieve Buckley
Add support for map_blocks with no array arguments (GH#4713) Bruce Merry
Add dask.array.trace (GH#4717) Danilo Horta
Add sizeof support for cupy.ndarray (GH#4715) Peter Andreas Entschev
Add name kwarg to from_zarr (GH#4663) Michael Eaton
Add chunks='auto' to from_array (GH#4704) Matthew Rocklin
Raise TypeError if dask array is given as shape for da.ones, zeros, empty or full (GH#4707) Genevieve Buckley
Add TileDB backend (GH#4679) Isaiah Norton

Core¶

Delay long list arguments (GH#4735) Matthew Rocklin
Bump to numpy >= 1.13, pandas >= 0.21.0 (GH#4720) Jim Crist
Remove file “test” (GH#4710) James Bourbeau
Reenable development build, uses upstream libraries (GH#4696) Peter Andreas Entschev
Remove assertion in HighLevelGraph constructor (GH#4699) Matthew Rocklin

DataFrame¶

Change cum-aggregation last-nonnull-value algorithm (GH#4736) Nick Becker
Fixup series-groupby-apply (GH#4738) Jim Crist
Refactor array.percentile and dataframe.quantile to use t-digest (GH#4677) Janne Vuorela
Allow naive concatenation of sorted dataframes (GH#4725) Matthew Rocklin
Fix perf issue in dd.Series.isin (GH#4727) Jim Crist
Remove hard pandas dependency for melt by using methodcaller (GH#4719) Nick Becker
A few dataframe metadata fixes (GH#4695) Jim Crist
Add Dataframe.replace (GH#4714) Matthew Rocklin
Add ‘threshold’ parameter to pd.DataFrame.dropna (GH#4625) Nathan Matare

Documentation¶

Add warning about derived docstrings early in the docstring (GH#4716) Matthew Rocklin
Create dataframe best practices doc (GH#4703) Matthew Rocklin
Uncomment dask_sphinx_theme (GH#4728) James Bourbeau
Fix minor typo in a Queue/fire_and_forget example (GH#4709) Matthew Rocklin
Update from_pandas docstring to match signature (GH#4698) James Bourbeau

1.2.0 / 2019-04-12¶

Array¶

Fixed mean() and moment() on sparse arrays (GH#4525) Peter Andreas Entschev
Add test for NEP-18. (GH#4675) Hameer Abbasi
Allow None to say “no chunking” in normalize_chunks (GH#4656) Matthew Rocklin
Fix limit value in auto_chunks (GH#4645) Matthew Rocklin

Core¶

Updated diagnostic bokeh test for compatibility with bokeh>=1.1.0 (GH#4680) Philipp Rudiger
Adjusts codecov’s target/threshold, disable patch (GH#4671) Peter Andreas Entschev
Always start with empty http buffer, not None (GH#4673) Martin Durant

DataFrame¶

Propagate index dtype and name when creating a dask dataframe from an array (GH#4686) Henrique Ribeiro
Fix ordering of quantiles in describe (GH#4647) gregrf
Clean up and document rearrange_column_by_tasks (GH#4674) Matthew Rocklin
Mark some parquet tests xfail (GH#4667) Peter Andreas Entschev
Fix parquet breakages with arrow 0.13.0 (GH#4668) Martin Durant
Allow sample to be False when reading CSV from a remote URL (GH#4634) Ian Rose
Fix timezone metadata inference on parquet load (GH#4655) Martin Durant
Use is_dataframe/index_like in dd.utils (GH#4657) Matthew Rocklin
Add min_count parameter to groupby sum method (GH#4648) Henrique Ribeiro
Correct quantile to handle unsorted quantiles (GH#4650) gregrf

Documentation¶

Add delayed extra dependencies to install docs (GH#4660) James Bourbeau

1.1.5 / 2019-03-29¶

Array¶

Ensure that we use the dtype keyword in normalize_chunks (GH#4646) Matthew Rocklin

Core¶

Use recursive glob in LocalFileSystem (GH#4186) Brett Naul
Avoid YAML deprecation (GH#4603)
Fix CI and add set -e (GH#4605) James Bourbeau
Support builtin sequence types in dask.visualize (GH#4602)
unpack/repack orderedDict (GH#4623) Justin Poehnelt
Add da.random.randint to API docs (GH#4628) James Bourbeau
Add zarr to CI environment (GH#4604) James Bourbeau
Enable codecov (GH#4631) Peter Andreas Entschev

DataFrame¶

Support setting the index (GH#4565)
DataFrame.itertuples accepts index, name kwargs (GH#4593) Dan O’Donovan
Support non-Pandas series in dd.Series.unique (GH#4599) Benjamin Zaitlen
Replace use of explicit type check with ._is_partition_type predicate (GH#4533)
Remove additional pandas warnings in tests (GH#4576)
Check object for name/dtype attributes rather than type (GH#4606)
Fix comparison against pd.Series (GH#4613) amerkel2
Fixing warning from setting categorical codes to floats (GH#4624) Julia Signell
Fix renaming on index to_frame method (GH#4498) Henrique Ribeiro
Fix divisions when joining two single-partition dataframes (GH#4636) Justin Waugh
Warn if partitions overlap in compute_divisions (GH#4600) Brian Chu
Give informative meta= warning (GH#4637) Matthew Rocklin
Add informative error message to Series.__getitem__ (GH#4638) Matthew Rocklin
Add clear exception message when using index or index_col in read_csv (GH#4651) Álvaro Abella Bascarán

Documentation¶

Add documentation for custom groupby aggregations (GH#4571)
Docs dataframe joins (GH#4569)
Specify fork-based contributions (GH#4619) James Bourbeau
correct to_parquet example in docs (GH#4641) Aaron Fowles
Update and secure several references (GH#4649) Søren Fuglede Jørgensen

1.1.4 / 2019-03-08¶

Array¶

Use mask selection in compress (GH#4548) John A Kirkham
Use asarray in extract (GH#4549) John A Kirkham
Use correct dtype when testing concatenation. (GH#4539) Elliott Sales de Andrade
Fix CuPy tests or properly mark as xfail (GH#4564) Peter Andreas Entschev

Core¶

Fix local scheduler callback to deal with custom caching (GH#4542) Yu Feng
Use parse_bytes in read_bytes(sample=…) (GH#4554) Matthew Rocklin

DataFrame¶

Fix up groupby-standard deviation again on object dtype keys (GH#4541) Matthew Rocklin
TST/CI: Updates for pandas 0.24.1 (GH#4551) Tom Augspurger
Add ability to control number of unique elements in timeseries (GH#4557) Matthew Rocklin
Add support in read_csv for parameter skiprows for other iterables (GH#4560) @JulianWgs

Documentation¶

DataFrame to Array conversion and unknown chunks (GH#4516) Scott Sievert
Add docs for random array creation (GH#4566) Matthew Rocklin
Fix typo in docstring (GH#4572) Shyam Saladi

1.1.3 / 2019-03-01¶

Array¶

Modify mean chunk functions to return dicts rather than arrays (GH#4513) Matthew Rocklin
Change sparse installation in CI for NumPy/Python2 compatibility (GH#4537) Matthew Rocklin

DataFrame¶

Make merge dispatchable on pandas/other dataframe types (GH#4522) Matthew Rocklin
read_sql_table - datetime index fix and index type checking (GH#4474) Joe Corbett
Use generalized form of index checking (is_index_like) (GH#4531) Benjamin Zaitlen
Add tests for groupby reductions with object dtypes (GH#4535) Matthew Rocklin
Fixes #4467: Updates time_series for pandas deprecation (GH#4530) @HSR05

Documentation¶

Add missing method to documentation index (GH#4528) Bart Broere

1.1.2 / 2019-02-25¶

Array¶

Fix another unicode/mixed-type edge case in normalize_array (GH#4489) Marco Neumann
Add dask.array.diagonal (GH#4431) Danilo Horta
Call asanyarray in unify_chunks (GH#4506) Jim Crist
Modify moment chunk functions to return dicts (GH#4519) Peter Andreas Entschev

Bag¶

Don’t inline output keys in dask.bag (GH#4464) Jim Crist
Ensure that bag.from_sequence always includes at least one partition (GH#4475) Anderson Banihirwe
Implement out_type for bag.fold (GH#4502) Matthew Rocklin
Remove map from bag keynames (GH#4500) Matthew Rocklin
Avoid itertools.repeat in map_partitions (GH#4507) Matthew Rocklin

DataFrame¶

Fix relative path parsing on windows when using fastparquet (GH#4445) Janne Vuorela
Fix bug in pyarrow and hdfs (GH#4453) (GH#4455) Michał Jastrzębski
df getitem with integer slices is not implemented (GH#4466) Jim Crist
Replace cudf-specific code with dask-cudf import (GH#4470) Matthew Rocklin
Avoid groupby.agg(callable) in groupby-var (GH#4482) Matthew Rocklin
Consider uint types as numerical in check_meta (GH#4485) Marco Neumann
Fix some typos in groupby comments (GH#4494) Daniel Saxton
Add error message around set_index(inplace=True) (GH#4501) Matthew Rocklin
meta_nonempty works with categorical index (GH#4505) Jim Crist
Add module name to expected meta error message (GH#4499) Matthew Rocklin
groupby-nunique works on empty chunk (GH#4504) Jim Crist
Propagate index metadata if not specified (GH#4509) Jim Crist

Documentation¶

Update docs to use from_zarr (GH#4472) John A Kirkham
DOC: add section of Using Other S3-Compatible Services for remote-data-services (GH#4405) Aploium
Fix header level of section in changelog (GH#4483) Bruce Merry
Add quotes to pip install [skip-ci] (GH#4508) James Bourbeau

Core¶

Extend started_cbs AFTER state is initialized (GH#4460) Marco Neumann
Fix bug in HTTPFile._fetch_range with headers (GH#4479) (GH#4480) Ross Petchler
Repeat optimize_blockwise for diamond fusion (GH#4492) Matthew Rocklin

1.1.1 / 2019-01-31¶

Array¶

Add support for cupy.einsum (GH#4402) Johnnie Gray
Provide byte size in chunks keyword (GH#4434) Adam Beberg
Raise more informative error for histogram bins and range (GH#4430) James Bourbeau

DataFrame¶

Lazily register more cudf functions and move to backends file (GH#4396) Matthew Rocklin
Fix ORC tests for pyarrow 0.12.0 (GH#4413) Jim Crist
rearrange_by_column: ensure that shuffle arg defaults to ‘disk’ if it’s None in dask.config (GH#4414) George Sakkis
Implement filters for _read_pyarrow (GH#4415) George Sakkis
Avoid checking against types in is_dataframe_like (GH#4418) Matthew Rocklin
Pass username as ‘user’ when using pyarrow (GH#4438) Roma Sokolov

Delayed¶

Fix DelayedAttr return value (GH#4440) Matthew Rocklin

Documentation¶

Use SVG for pipeline graphic (GH#4406) John A Kirkham
Add doctest-modules to py.test documentation (GH#4427) Daniel Severo

Core¶

Work around psutil 5.5.0 not allowing pickling Process objects Janne Vuorela

1.1.0 / 2019-01-18¶

Array¶

Fix the average function when there is a masked array (GH#4236) Damien Garaud
Add allow_unknown_chunksizes to hstack and vstack (GH#4287) Paul Vecchio
Fix tensordot for 27+ dimensions (GH#4304) Johnnie Gray
Fixed block_info with axes. (GH#4301) Tom Augspurger
Use safe_wraps for matmul (GH#4346) Mark Harfouche
Use chunks="auto" in array creation routines (GH#4354) Matthew Rocklin
Fix np.matmul in dask.array.Array.__array_ufunc__ (GH#4363) Stephan Hoyer
COMPAT: Re-enable multifield copy->view change (GH#4357) Diane Trout
Calling np.dtype on a delayed object works (GH#4387) Jim Crist
Rework normalize_array for numpy data (GH#4312) Marco Neumann

DataFrame¶

Add fill_value support for series comparisons (GH#4250) James Bourbeau
Add schema name in read_sql_table for empty tables (GH#4268) Mina Farid
Adjust check for bad chunks in map_blocks (GH#4308) Tom Augspurger
Add dask.dataframe.read_fwf (GH#4316) @slnguyen
Use atop fusion in dask dataframe (GH#4229) Matthew Rocklin
Use parallel_types() in from_pandas (GH#4331) Matthew Rocklin
Change DataFrame._repr_data to method (GH#4330) Matthew Rocklin
Install pyarrow fastparquet for Appveyor (GH#4338) Gábor Lipták
Remove explicit pandas checks and provide cudf lazy registration (GH#4359) Matthew Rocklin
Replace isinstance(…, pandas) with is_dataframe_like (GH#4375) Matthew Rocklin
ENH: Support 3rd-party ExtensionArrays (GH#4379) Tom Augspurger
Pandas 0.24.0 compat (GH#4374) Tom Augspurger

Documentation¶

Fix link to ‘map_blocks’ function in array api docs (GH#4258) David Hoese
Add a paragraph on Dask-Yarn in the cloud docs (GH#4260) Jim Crist
Copy edit documentation (GH#4267), (GH#4263), (GH#4262), (GH#4277), (GH#4271), (GH#4279), (GH#4265), (GH#4295), (GH#4293), (GH#4296), (GH#4302), (GH#4306), (GH#4318), (GH#4314), (GH#4309), (GH#4317), (GH#4326), (GH#4325), (GH#4322), (GH#4332), (GH#4333) Miguel Farrajota
Fix typo in code example (GH#4272) Daniel Li
Doc: Update array-api.rst (GH#4259) (GH#4282) Prabakaran Kumaresshan
Update hpc doc (GH#4266) Guillaume Eynard-Bontemps
Doc: Replace from_avro with read_avro in documents (GH#4313) Prabakaran Kumaresshan
Remove reference to “get” scheduler functions in docs (GH#4350) Matthew Rocklin
Fix typo in docstring (GH#4376) Daniel Saxton
Added documentation for dask.dataframe.merge (GH#4382) Jendrik Jördening

Core¶

Avoid recursion in dask.core.get (GH#4219) Matthew Rocklin
Remove verbose flag from pytest setup.cfg (GH#4281) Matthew Rocklin
Support Pytest 4.0 by specifying marks explicitly (GH#4280) Takahiro Kojima
Add High Level Graphs (GH#4092) Matthew Rocklin
Fix SerializableLock locked and acquire methods (GH#4294) Stephan Hoyer
Pin boto3 to earlier version in tests to avoid moto conflict (GH#4276) Martin Durant
Treat None as missing in config when updating (GH#4324) Matthew Rocklin
Update Appveyor to Python 3.6 (GH#4337) Gábor Lipták
Use parse_bytes more liberally in dask.dataframe/bytes/bag (GH#4339) Matthew Rocklin
Add a better error message when cloudpickle is missing (GH#4342) Mark Harfouche
Support pool= keyword argument in threaded/multiprocessing get functions (GH#4351) Matthew Rocklin
Allow updates from arbitrary Mappings in config.update, not only dicts. (GH#4356) Stuart Berg
Move dask/array/top.py code to dask/blockwise.py (GH#4348) Matthew Rocklin
Add has_parallel_type (GH#4395) Matthew Rocklin
CI: Update Appveyor (GH#4381) Tom Augspurger
Ignore non-readable config files (GH#4388) Jim Crist

1.0.0 / 2018-11-28¶

Array¶

Add nancumsum/nancumprod unit tests (GH#4215) crusaderky

DataFrame¶

Add index to to_dask_dataframe docstring (GH#4232) James Bourbeau
Test and fix when appending categoricals with fastparquet (GH#4245) Martin Durant
Don’t reread metadata when passing ParquetFile to read_parquet (GH#4247) Martin Durant

Documentation¶

Copy edit documentation (GH#4222) (GH#4224) (GH#4228) (GH#4231) (GH#4230) (GH#4234) (GH#4235) (GH#4254) Miguel Farrajota
Updated doc for the new scheduler keyword (GH#4251) @milesial

Core¶

Avoid a few warnings (GH#4223) Matthew Rocklin
Remove dask.store module (GH#4221) Matthew Rocklin
Remove AUTHORS.md Jim Crist

0.20.2 / 2018-11-15¶

Array¶

Avoid fusing dependencies of atop reductions (GH#4207) Matthew Rocklin

DataFrame¶

Improve memory footprint for dataframe correlation (GH#4193) Damien Garaud
Add empty DataFrame check to boundary_slice (GH#4212) James Bourbeau

Documentation¶

Copy edit documentation (GH#4197) (GH#4204) (GH#4198) (GH#4199) (GH#4200) (GH#4202) (GH#4209) Miguel Farrajota
Add stats module namespace (GH#4206) James Bourbeau
Fix link in dataframe documentation (GH#4208) James Bourbeau

0.20.1 / 2018-11-09¶

Array¶

Only allocate the result space in wrapped_pad_func (GH#4153) John A Kirkham
Generalize expand_pad_width to expand_pad_value (GH#4150) John A Kirkham
Test da.pad with 2D linear_ramp case (GH#4162) John A Kirkham
Fix import for broadcast_to. (GH#4168) samc0de
Rewrite Dask Array’s pad to add only new chunks (GH#4152) John A Kirkham
Validate index inputs to atop (GH#4182) Matthew Rocklin

Core¶

Dask.config set and get normalize underscores and hyphens (GH#4143) James Bourbeau
Only subs on core collections, not subclasses (GH#4159) Matthew Rocklin
Add block_size=0 option to HTTPFileSystem. (GH#4171) Martin Durant
Add traverse support for dataclasses (GH#4165) Armin Berres
Avoid optimization on sharedicts without dependencies (GH#4181) Matthew Rocklin
Update the pytest version for TravisCI (GH#4189) Damien Garaud
Use key_split rather than funcname in visualize names (GH#4160) Matthew Rocklin
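
The key normalization entry above (GH#4143) can be pictured with a small plain-Python sketch. The functions normalize_key and get_nested and the configuration keys are hypothetical, and the real dask.config logic may differ in detail. The point is only that underscore and hyphen spellings resolve to the same entry.

```python
def normalize_key(key: str) -> str:
    """Treat underscores and hyphens as the same character."""
    return key.replace("_", "-")


def get_nested(config: dict, dotted_key: str):
    """Look up a dotted key, ignoring underscore/hyphen differences."""
    node = config
    for part in dotted_key.split("."):
        # Normalize both the stored keys and the requested key.
        normalized = {normalize_key(k): v for k, v in node.items()}
        node = normalized[normalize_key(part)]
    return node


# Hypothetical configuration with mixed spellings.
config = {"distributed": {"worker_memory": {"target-fraction": 0.6}}}
a = get_nested(config, "distributed.worker-memory.target_fraction")
b = get_nested(config, "distributed.worker_memory.target-fraction")
```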

Dataframe¶

Add fix for DataFrame.__setitem__ for index (GH#4151) Anderson Banihirwe
Fix column choice when passing list of files to fastparquet (GH#4174) Martin Durant
Pass engine_kwargs from read_sql_table to sqlalchemy (GH#4187) Damien Garaud

Documentation¶

Fix documentation in Delayed best practices example that returned an empty list (GH#4147) Jonathan Fraine
Copy edit documentation (GH#4164) (GH#4175) (GH#4185) (GH#4192) (GH#4191) (GH#4190) (GH#4180) Miguel Farrajota
Fix typo in docstring (GH#4183) Carlos Valiente

0.20.0 / 2018-10-26¶

Array¶

Fuse Atop operations (GH#3998), (GH#4081) Matthew Rocklin
Support da.asanyarray on dask dataframes (GH#4080) Matthew Rocklin
Remove unnecessary endianness check in datetime test (GH#4113) Elliott Sales de Andrade
Set name=False in array foo_like functions (GH#4116) Matthew Rocklin
Remove dask.array.ghost module (GH#4121) Matthew Rocklin
Fix use of getargspec in dask array (GH#4125) Stephan Hoyer
Adds dask.array.invert (GH#4127), (GH#4131) Anderson Banihirwe
Raise informative error on arg-reduction on unknown chunksize (GH#4128), (GH#4135) Matthew Rocklin
Normalize reversed slices in dask array (GH#4126) Matthew Rocklin

Bag¶

Add bag.to_avro (GH#4076) Martin Durant

Core¶

Pull num_workers from config.get (GH#4086), (GH#4093) James Bourbeau
Fix invalid escape sequences with raw strings (GH#4112) Elliott Sales de Andrade
Raise an error on the use of the get= keyword and set_options (GH#4077) Matthew Rocklin
Add import for Azure DataLake storage, and add docs (GH#4132) Martin Durant
Avoid collections.Mapping/Sequence (GH#4138) Matthew Rocklin

Dataframe¶

Include index keyword in to_dask_dataframe (GH#4071) Matthew Rocklin
Add support for duplicate column names (GH#4087) Jan Koch
Implement min_count for the DataFrame methods sum and prod (GH#4090) Bart Broere
Remove pandas warnings in concat (GH#4095) Matthew Rocklin
DataFrame.to_csv header option to only output headers in the first chunk (GH#3909) Rahul Vaidya
Remove Series.to_parquet (GH#4104) Justin Dennison
Avoid warnings and deprecated pandas methods (GH#4115) Matthew Rocklin
Swap ‘old’ and ‘previous’ when reporting append error (GH#4130) Martin Durant

Documentation¶

Copy edit documentation (GH#4073), (GH#4074), (GH#4094), (GH#4097), (GH#4107), (GH#4124), (GH#4133), (GH#4139) Miguel Farrajota
Fix typo in code example (GH#4089) Antonino Ingargiola
Add pycon 2018 presentation (GH#4102) Javad
Quick description for gcsfs (GH#4109) Martin Durant
Fixed typo in docstrings of read_sql_table method (GH#4114) TakaakiFuruse
Make target directories in redirects if they don’t exist (GH#4136) Matthew Rocklin

0.19.4 / 2018-10-09¶

Array¶

Implement apply_gufunc(..., axes=..., keepdims=...) (GH#3985) Markus Gonser

Bag¶

Fix typo in datasets.make_people (GH#4069) Matthew Rocklin

Dataframe¶

Added percentiles options for dask.dataframe.describe method (GH#4067) Zhenqing Li
Add DataFrame.partitions accessor similar to Array.blocks (GH#4066) Matthew Rocklin

Core¶

Pass get functions and Clients through scheduler keyword (GH#4062) Matthew Rocklin

Documentation¶

Fix Typo on hpc example. (missing = in kwarg). (GH#4068) Matthias Bussonier
Extensive copy-editing: (GH#4065), (GH#4064), (GH#4063) Miguel Farrajota

0.19.3 / 2018-10-05¶

Array¶

Make da.RandomState extensible to other modules (GH#4041) Matthew Rocklin
Support unknown dims in ravel no-op case (GH#4055) Jim Crist
Add basic infrastructure for cupy (GH#4019) Matthew Rocklin
Avoid asarray and lock arguments for from_array(getitem) (GH#4044) Matthew Rocklin
Move local imports in corrcoef to global imports (GH#4030) John A Kirkham
Move local indices import to global import (GH#4029) John A Kirkham
Fix-up Dask Array’s fromfunction w.r.t. dtype and kwargs (GH#4028) John A Kirkham
Don’t use dummy expansion for trim_internal in overlapped (GH#3964) Mark Harfouche
Add unravel_index (GH#3958) John A Kirkham

Bag¶

Sort result in Bag.frequencies (GH#4033) Matthew Rocklin
Add support for npartitions=1 edge case in groupby (GH#4050) James Bourbeau
Add new random dataset for people (GH#4018) Matthew Rocklin
Improve performance of bag.read_text on small files (GH#4013) Eric Wolak
Add bag.read_avro (GH#4000) (GH#4007) Martin Durant

Dataframe¶

Added an index parameter to dask.dataframe.from_dask_array() for creating a dask DataFrame from a dask Array with a given index. (GH#3991) Tom Augspurger
Improve sub-classability of dask dataframe (GH#4015) Matthew Rocklin
Fix failing hdfs test [test-hdfs] (GH#4046) Jim Crist
fuse_subgraphs works without normal fuse (GH#4042) Jim Crist
Make path for reading many parquet files without prescan (GH#3978) Martin Durant
Index in dd.from_dask_array (GH#3991) Tom Augspurger
Making skiprows accept lists (GH#3975) Julia Signell
Fail early in fastparquet read for nonexistent column (GH#3989) Martin Durant

Core¶

Add support for npartitions=1 edge case in groupby (GH#4050) James Bourbeau
Automatically wrap large arguments with dask.delayed in map_blocks/partitions (GH#4002) Matthew Rocklin
Fuse linear chains of subgraphs (GH#3979) Jim Crist
Make multiprocessing context configurable (GH#3763) Itamar Turner-Trauring

Documentation¶

Extensive copy-editing (GH#4049), (GH#4034), (GH#4031), (GH#4020), (GH#4021), (GH#4022), (GH#4023), (GH#4016), (GH#4017), (GH#4010), (GH#3997), (GH#3996) Miguel Farrajota
Update shuffle method selection docs (GH#4048) James Bourbeau
Remove docs/source/examples, point to examples.dask.org (GH#4014) Matthew Rocklin
Replace readthedocs links with dask.org (GH#4008) Matthew Rocklin
Updates DataFrame.to_hdf docstring for returned values (GH#3992) James Bourbeau

0.19.2 / 2018-09-17¶

Array¶

apply_gufunc implements automatic infer of functions output dtypes (GH#3936) Markus Gonser
Fix array histogram range error when array has nans (GH#3980) James Bourbeau
Issue 3937 follow up, int type checks. (GH#3956) Yu Feng
from_array: add @martindurant’s explanation of how hashing is done for an array. (GH#3965) Mark Harfouche
Support gradient with coordinate (GH#3949) Keisuke Fujii

Core¶

Fix use of has_keyword with partial in Python 2.7 (GH#3966) Mark Harfouche
Set pyarrow as default for HDFS (GH#3957) Matthew Rocklin

Documentation¶

Use dask_sphinx_theme (GH#3963) Matthew Rocklin
Use JupyterLab in Binder links from main page Matthew Rocklin
DOC: fixed sphinx syntax (GH#3960) Tom Augspurger

0.19.1 / 2018-09-06¶

Array¶

Don’t enforce dtype if result has no dtype (GH#3928) Matthew Rocklin
Fix NumPy issubtype deprecation warning (GH#3939) Bruce Merry
Fix arg reduction tokens to be unique with different arguments (GH#3955) Tobias de Jong
Coerce numpy integers to ints in slicing code (GH#3944) Yu Feng
Linalg.norm ndim along axis partial fix (GH#3933) Tobias de Jong

Dataframe¶

Deterministic DataFrame.set_index (GH#3867) George Sakkis
Fix divisions in read_parquet when dealing with filters #3831 #3930 (GH#3923) (GH#3931) @andrethrill
Fix return type in categorical.as_known (GH#3888) Sriharsha Hatwar
Fix DataFrame.assign for callables (GH#3919) Tom Augspurger
Include partitions with no width in repartition (GH#3941) Matthew Rocklin
Don’t constrict stage/k dtype in dataframe shuffle (GH#3942) Matthew Rocklin

Documentation¶

DOC: Add hint on how to render task graphs horizontally (GH#3922) Uwe Korn
Add try-now button to main landing page (GH#3924) Matthew Rocklin

0.19.0 / 2018-08-29¶

Array¶

Support coordinate in gradient (GH#3949) Keisuke Fujii
Fix argtopk split_every bug (GH#3810) crusaderky
Ensure result computing dask.array.isnull() always gives a numpy array (GH#3825) Stephan Hoyer
Support concatenate for scipy.sparse in dask array (GH#3836) Matthew Rocklin
Fix argtopk on 32-bit systems. (GH#3823) Elliott Sales de Andrade
Normalize keys in rechunk (GH#3820) Matthew Rocklin
Allow shape of dask.array to be a numpy array (GH#3844) Mark Harfouche
Fix numpy deprecation warning on tuple indexing (GH#3851) Tobias de Jong
Rename ghost module to overlap (GH#3830) Robert Sare
Re-add the ghost import to da __init__ (GH#3861) Jim Crist
Ensure copy preserves masked arrays (GH#3852) Tobias de Jong

DataFrame¶

Added dtype and sparse keywords to dask.dataframe.get_dummies() (GH#3792) Tom Augspurger
Added dask.dataframe.to_dask_array() for converting a Dask Series or DataFrame to a Dask Array, possibly with known chunk sizes (GH#3884) Tom Augspurger
Changed the behavior for dask.array.asarray() for dask dataframe and series inputs. Previously, the series was eagerly converted to an in-memory NumPy array before creating a dask array with known chunks sizes. This caused unexpectedly high memory usage. Now, no intermediate NumPy array is created, and a Dask array with unknown chunk sizes is returned (GH#3884) Tom Augspurger
DataFrame.iloc (GH#3805) Tom Augspurger
When reading multiple paths, expand globs. (GH#3828) Irina Truong
Added index column name after resample (GH#3833) Eric Bonfadini
Add (lazy) shape property to dataframe and series (GH#3212) Henrique Ribeiro
Fix failing hdfs test [test-hdfs] (GH#3858) Jim Crist
Fixes for pyarrow 0.10.0 release (GH#3860) Jim Crist
Rename to_csv keys for diagnostics (GH#3890) Matthew Rocklin
Match pandas warnings for concat sort (GH#3897) Tom Augspurger
Include filename in read_csv (GH#3908) Julia Signell

Core¶

Better error message on import when missing common dependencies (GH#3771) Danilo Horta
Drop Python 3.4 support (GH#3840) Jim Crist
Remove expired deprecation warnings (GH#3841) Jim Crist
Add DASK_ROOT_CONFIG environment variable (GH#3849) Joe Hamman
Don’t cull in local scheduler, do cull in delayed (GH#3856) Jim Crist
Increase conda download retries (GH#3857) Jim Crist
Add python_requires and Trove classifiers (GH#3855) @hugovk
Fix collections.abc deprecation warnings in Python 3.7.0 (GH#3876) Jan Margeta
Allow dot jpeg to xfail in visualize tests (GH#3896) Matthew Rocklin
Add Python 3.7 to travis.yml (GH#3894) Matthew Rocklin
Add expand_environment_variables to dask.config (GH#3893) Joe Hamman
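
To show what expanding environment variables in a configuration can look like, here is a minimal sketch built on the standard library's os.path.expandvars. The function name expand_env, the variable SKETCH_HOME and the config keys are assumptions made for illustration. The sketch is not presented as the code added in GH#3893.

```python
import os
from collections.abc import Mapping


def expand_env(value):
    """Recursively expand $VARS in strings found inside a config."""
    if isinstance(value, Mapping):
        return {k: expand_env(v) for k, v in value.items()}
    if isinstance(value, str):
        return os.path.expandvars(value)
    if isinstance(value, (list, tuple)):
        return type(value)(expand_env(v) for v in value)
    return value  # numbers, None, etc. pass through unchanged


# Hypothetical variable and keys, used only for this example.
os.environ["SKETCH_HOME"] = "/tmp/sketch"
raw = {"temporary-directory": "$SKETCH_HOME/scratch", "workers": 4}
expanded = expand_env(raw)
```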

Docs¶

Fix typo in import statement of diagnostics (GH#3826) John Mrziglod
Add link to YARN docs (GH#3838) Jim Crist
Fix minor typos in landing page index.html (GH#3746) Christoph Moehl
Update delayed-custom.rst (GH#3850) Anderson Banihirwe
DOC: clarify delayed docstring (GH#3709) Scott Sievert
Add new presentations (GH#3880) Javad
Add dask array normalize_chunks to documentation (GH#3878) Daniel Rothenberg
Docs: Fix link to snakeviz (GH#3900) Hans Moritz Günther
Add missing ` to docstring (GH#3915) @rtobar

0.18.2 / 2018-07-23¶

Array¶

Reimplemented argtopk to make it release the GIL (GH#3610) crusaderky
Don’t overlap on non-overlapped dimensions in map_overlap (GH#3653) Matthew Rocklin
Fix linalg.tsqr for dimensions of uncertain length (GH#3662) Jeremy Chen
Break apart uneven array-of-int slicing to separate chunks (GH#3648) Matthew Rocklin
Align auto chunks to provided chunks, rather than shape (GH#3679) Matthew Rocklin
Adds endpoint and retstep support for linspace (GH#3675) James Bourbeau
Implement .blocks accessor (GH#3689) Matthew Rocklin
Add block_info keyword to map_blocks functions (GH#3686) Matthew Rocklin
Slice by dask array of ints (GH#3407) crusaderky
Support dtype in arange (GH#3722) crusaderky
Fix argtopk with uneven chunks (GH#3720) crusaderky
Raise error when replace=False in da.choice (GH#3765) James Bourbeau
Update chunks in Array.__setitem__ (GH#3767) Itamar Turner-Trauring
Add a chunksize convenience property (GH#3777) Jacob Tomlinson
Fix and simplify array slicing behavior when step < 0 (GH#3702) Ziyao Wei
Ensure to_zarr with return_stored True returns a Dask Array (GH#3786) John A Kirkham

Bag¶

Add last_endline optional parameter in to_textfiles (GH#3745) George Sakkis

Dataframe¶

Add aggregate function for rolling objects (GH#3772) Gerome Pistre
Properly tokenize cumulative groupby aggregations (GH#3799) Cloves Almeida

Delayed¶

Add the @ operator to the delayed objects (GH#3691) Mark Harfouche
Add delayed best practices to documentation (GH#3737) Matthew Rocklin
Fix @delayed decorator for methods and add tests (GH#3757) Ziyao Wei

Core¶

Fix extra progressbar (GH#3669) Mike Neish
Allow tasks back onto ordering stack if they have one dependency (GH#3652) Matthew Rocklin
Prefer end-tasks with low numbers of dependencies when ordering (GH#3588) Tom Augspurger
Add assert_eq to top-level modules (GH#3726) Matthew Rocklin
Test that dask collections can hold scipy.sparse arrays (GH#3738) Matthew Rocklin
Fix setup of lz4 decompression functions (GH#3782) Elliott Sales de Andrade
Add datasets module (GH#3780) Matthew Rocklin

0.18.1 / 2018-06-22¶

Array¶

from_array now supports scalar types and nested lists/tuples in input, just like all numpy functions do; it also produces a simpler graph when the input is a plain ndarray (GH#3568) crusaderky
Fix slicing of big arrays due to cumsum dtype bug (GH#3620) Marco Rossi
Add Dask Array implementation of pad (GH#3578) John A Kirkham
Fix array random API examples (GH#3625) James Bourbeau
Add average function to dask array (GH#3640) James Bourbeau
Tokenize ghost_internal with axes (GH#3643) Matthew Rocklin
Add outer for Dask Arrays (GH#3658) John A Kirkham

DataFrame¶

Add Index.to_series method (GH#3613) Henrique Ribeiro
Fix missing partition columns in pyarrow-parquet (GH#3636) Martin Durant

Core¶

Minor tweaks to CI (GH#3629) crusaderky
Add back dask.utils.effective_get (GH#3642) Matthew Rocklin
DASK_CONFIG dictates config write location (GH#3621) Jim Crist
Replace ‘collections’ key in unpack_collections with unique key (GH#3632) Yu Feng
Avoid deepcopy in dask.config.set (GH#3649) Matthew Rocklin

0.18.0 / 2018-06-14¶

Array¶

Add to/from_zarr for Zarr-format datasets and arrays (GH#3460) Martin Durant
Experimental addition of generalized ufunc support, apply_gufunc, gufunc, and as_gufunc (GH#3109) (GH#3526) (GH#3539) Markus Gonser
Avoid unnecessary rechunking tasks (GH#3529) Matthew Rocklin
Compute dtypes at runtime for fft (GH#3511) Matthew Rocklin
Generate UUIDs for all da.store operations (GH#3540) Martin Durant
Correct internal dimension of Dask’s SVD (GH#3517) John A Kirkham
BUG: do not raise IndexError for identity slice in array.vindex (GH#3559) Scott Sievert
Adds isneginf and isposinf (GH#3581) John A Kirkham
Drop Dask Array’s learn module (GH#3580) John A Kirkham
added sfqr (short-and-fat) as a counterpart to tsqr… (GH#3575) Jeremy Chen
Allow 0-width chunks in dask.array.rechunk (GH#3591) Marc Pfister
Document Dask Array’s nan_to_num in public API (GH#3599) John A Kirkham
Show block example (GH#3601) John A Kirkham
Replace token= keyword with name= in map_blocks (GH#3597) Matthew Rocklin
Disable locking in to_zarr (needed for using to_zarr in a distributed context) (GH#3607) John A Kirkham
Support Zarr Arrays in to_zarr/from_zarr (GH#3561) John A Kirkham
Added recursion to array/linalg/tsqr to better manage the single core bottleneck (GH#3586) Jeremy Chan (GH#3396) crusaderky

Dataframe¶

Add to/read_json (GH#3494) Martin Durant
Adds index to unsupported arguments for DataFrame.rename method (GH#3522) James Bourbeau
Adds support to subset Dask DataFrame columns using numpy.ndarray, pandas.Series, and pandas.Index objects (GH#3536) James Bourbeau
Raise error if meta columns do not match dataframe (GH#3485) Christopher Ren
Add index to unsupported argument for DataFrame.rename (GH#3522) James Bourbeau
Adds support for subsetting DataFrames with pandas Index/Series and numpy ndarrays (GH#3536) James Bourbeau
Dataframe sample method docstring fix (GH#3566) James Bourbeau
Fix dd.read_json to infer file compression (GH#3594) Matt Lee
Adds n to sample method (GH#3606) James Bourbeau
Add fastparquet ParquetFile object support (GH#3573) @andrethrill

Bag¶

Rename method= keyword to shuffle= in bag.groupby (GH#3470) Matthew Rocklin

Core¶

Replace get= keyword with scheduler= keyword (GH#3448) Matthew Rocklin
Add centralized dask.config module to handle configuration for all Dask subprojects (GH#3432) (GH#3513) (GH#3520) Matthew Rocklin
Add dask-ssh CLI Options and Description. (GH#3476) @beomi
Read whole files fix regardless of header for HTTP (GH#3496) Martin Durant
Adds synchronous scheduler syntax to debugging docs (GH#3509) James Bourbeau
Replace dask.set_options with dask.config.set (GH#3502) Matthew Rocklin
Update sphinx readthedocs-theme (GH#3516) Matthew Rocklin
Introduce “auto” value for normalize_chunks (GH#3507) Matthew Rocklin
Fix check in configuration with env=None (GH#3562) Simon Perkins
Update sizeof definitions (GH#3582) Matthew Rocklin
Remove --verbose flag from travis-ci (GH#3477) Matthew Rocklin
Remove “da.random” from random array keys (GH#3604) Matthew Rocklin

0.17.5 / 2018-05-16¶

Array¶

Fix rechunk with chunksize of -1 in a dict (GH#3469) Stephan Hoyer
einsum now accepts the split_every parameter (GH#3471) crusaderky
Improved slicing performance (GH#3479) Yu Feng

DataFrame¶

Compatibility with pandas 0.23.0 (GH#3499) Tom Augspurger

0.17.4 / 2018-05-03¶

Dataframe¶

Add support for indexing Dask DataFrames with string subclasses (GH#3461) James Bourbeau
Allow using both sorted_index and chunksize in read_hdf (GH#3463) Pierre Bartet
Pass filesystem to arrow piece reader (GH#3466) Martin Durant
Switches to using dask.compat string_types (GH#3462) James Bourbeau

0.17.3 / 2018-05-02¶

Array¶

Add einsum for Dask Arrays (GH#3412) Simon Perkins
Add piecewise for Dask Arrays (GH#3350) John A Kirkham
Fix handling of nan in broadcast_shapes (GH#3356) John A Kirkham
Add isin for dask arrays (GH#3363) Stephan Hoyer
Overhauled topk for Dask Arrays: faster algorithm, particularly for large k’s; added support for multiple axes, recursive aggregation, and an option to pick the bottom k elements instead. (GH#3395) crusaderky
The topk API has changed from topk(k, array) to the more conventional topk(array, k). The legacy API still works but is now deprecated. (GH#2965) crusaderky
New function argtopk for Dask Arrays (GH#3396) crusaderky
Fix handling partial depth and boundary in map_overlap (GH#3445) John A Kirkham
Add gradient for Dask Arrays (GH#3434) John A Kirkham
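
The topk argument-order change noted above keeps the legacy call working while flagging it as deprecated. One common way to build such a shim is sketched below in plain Python on lists. It uses heapq.nlargest as a stand-in and is an assumption about the general pattern, not the Dask Array code.

```python
import heapq
import warnings
from numbers import Number


def topk(a, k):
    """Return the k largest items of ``a``; accepts legacy topk(k, a)."""
    if isinstance(a, Number) and not isinstance(k, Number):
        # Legacy order detected: warn, then swap the arguments.
        warnings.warn(
            "topk(k, a) is deprecated; use topk(a, k)",
            DeprecationWarning,
            stacklevel=2,
        )
        a, k = k, a
    return heapq.nlargest(k, a)


new_style = topk([5, 1, 9, 3], 2)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = topk(2, [5, 1, 9, 3])
```

Both calls return the same values. Only the legacy form emits a DeprecationWarning.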

DataFrame¶

Allow t as shorthand for table in to_hdf for pandas compatibility (GH#3330) Jörg Dietrich
Added top level isna method for Dask DataFrames (GH#3294) Christopher Ren
Fix selection on partition column on read_parquet for engine="pyarrow" (GH#3207) Uwe Korn
Added DataFrame.squeeze method (GH#3366) Christopher Ren
Added infer_divisions option to read_parquet to specify whether read engines should compute divisions (GH#3387) Jon Mease
Added support for inferring division for engine="pyarrow" (GH#3387) Jon Mease
Provide more informative error message for meta= errors (GH#3343) Matthew Rocklin
Add ORC reader (GH#3284) Martin Durant
Default compression for parquet now always Snappy, in line with pandas (GH#3373) Martin Durant
Fixed bug in Dask DataFrame and Series comparisons with NumPy scalars (GH#3436) James Bourbeau
Remove outdated requirement from repartition docstring (GH#3440) Jörg Dietrich
Fixed bug in aggregation when only a Series is selected (GH#3446) Jörg Dietrich
Add default values to make_timeseries (GH#3421) Matthew Rocklin

Core¶

Support traversing collections in persist, visualize, and optimize (GH#3410) Jim Crist
Add scheduler= keyword to compute and persist. This replaces common use of the get= keyword (GH#3448) Matthew Rocklin

0.17.2 / 2018-03-21¶

Array¶

Add broadcast_arrays for Dask Arrays (GH#3217) John A Kirkham
Add bitwise_* ufuncs (GH#3219) John A Kirkham
Add optional axis argument to squeeze (GH#3261) John A Kirkham
Validate inputs to atop (GH#3307) Matthew Rocklin
Avoid calls to astype in concatenate if all parts have the same dtype (GH#3301) Martin Durant

DataFrame¶

Fixed bug in shuffle due to aggressive truncation (GH#3201) Matthew Rocklin
Support specifying categorical columns on read_parquet with categories=[…] for engine="pyarrow" (GH#3177) Uwe Korn
Add dd.tseries.Resampler.agg (GH#3202) Richard Postelnik
Support operations that mix dataframes and arrays (GH#3230) Matthew Rocklin
Support extra Scalar and Delayed args in dd.groupby._Groupby.apply (GH#3256) Gabriele Lanaro

Bag¶

Support joining against single-partitioned bags and delayed objects (GH#3254) Matthew Rocklin

Core¶

Fixed bug when using unexpected but hashable types for keys (GH#3238) Daniel Collins
Fix bug in task ordering so that we break ties consistently with the key name (GH#3271) Matthew Rocklin
Avoid sorting tasks in order when the number of tasks is very large (GH#3298) Matthew Rocklin

0.17.1 / 2018-02-22¶

Array¶

Corrected dimension chunking in indices (GH#3166, GH#3167) Simon Perkins
Inline store_chunk calls for store’s return_stored option (GH#3153) John A Kirkham
Compatibility with struct dtypes for NumPy 1.14.1 release (GH#3187) Matthew Rocklin

DataFrame¶

Bugfix to allow column assignment of pandas datetimes (GH#3164) Max Epstein

Core¶

New file-system for HTTP(S), allowing direct loading from specific URLs (GH#3160) Martin Durant
Fix bug when tokenizing partials with no keywords (GH#3191) Matthew Rocklin
Use more recent LZ4 API (GH#3157) Thrasibule
Introduce output stream parameter for progress bar (GH#3185) Dieter Weber

0.17.0 / 2018-02-09¶

Array¶

Added support for object-type arrays in nansum, nanmin, and nanmax (GH#3133) Keisuke Fujii
Update error handling when len is called with empty chunks (GH#3058) Xander Johnson
Fixes a metadata bug with store’s return_stored option (GH#3064) John A Kirkham
Fix a bug in optimization.fuse_slice to properly handle when first input is None (GH#3076) James Bourbeau
Support arrays with unknown chunk sizes in percentile (GH#3107) Matthew Rocklin
Tokenize scipy.sparse arrays and np.matrix (GH#3060) Roman Yurchak

DataFrame¶

Support month timedeltas in repartition(freq=…) (GH#3110) Matthew Rocklin
Avoid mutation in dataframe groupby tests (GH#3118) Matthew Rocklin
read_csv, read_table, and read_parquet accept iterables of paths (GH#3124) Jim Crist
Deprecates the dd.to_delayed function in favor of the existing method (GH#3126) Jim Crist
Return dask.arrays from df.map_partitions calls when the UDF returns a numpy array (GH#3147) Matthew Rocklin
Change handling of columns and index in dd.read_parquet to be more consistent, especially in handling of multi-indices (GH#3149) Jim Crist
fastparquet append=True allowed to create new dataset (GH#3097) Martin Durant
dtype rationalization for sql queries (GH#3100) Martin Durant

Bag¶

Document that the function passed to bag.map_partitions may receive either a list or a generator. (GH#3150) Nir

Core¶

Change default task ordering to prefer nodes with few dependents and then many downstream dependencies (GH#3056) Matthew Rocklin
Add color= option to visualize to color by task order (GH#3057) (GH#3122) Matthew Rocklin
Deprecate dask.bytes.open_text_files (GH#3077) Jim Crist
Remove short-circuit hdfs reads handling due to maintenance costs. May be re-added in a more robust manner later (GH#3079) Jim Crist
Add dask.base.optimize for optimizing multiple collections without computing. (GH#3071) Jim Crist
Rename dask.optimize module to dask.optimization (GH#3071) Jim Crist
Change task ordering to do a full traversal (GH#3066) Matthew Rocklin
Adds an optimize_graph keyword to all to_delayed methods to allow controlling whether optimizations occur on conversion. (GH#3126) Jim Crist
Support using pyarrow for hdfs integration (GH#3123) Jim Crist
Move HDFS integration and tests into dask repo (GH#3083) Jim Crist
Remove write_bytes (GH#3116) Jim Crist

0.16.1 / 2018-01-09¶

Array¶

Fix handling of scalar percentile values in percentile (GH#3021) James Bourbeau
Prevent bool() coercion from calling compute (GH#2958) Albert DeFusco
Add matmul (GH#2904) John A Kirkham
Support N-D arrays with matmul (GH#2909) John A Kirkham
Add vdot (GH#2910) John A Kirkham
Explicit chunks argument for broadcast_to (GH#2943) Stephan Hoyer
Add meshgrid (GH#2938) John A Kirkham and (GH#3001) Markus Gonser
Preserve singleton chunks in fftshift/ifftshift (GH#2733) John A Kirkham
Fix handling of negative indexes in vindex and raise errors for out of bounds indexes (GH#2967) Stephan Hoyer
Add flip, flipud, fliplr (GH#2954) John A Kirkham
Add float_power ufunc (GH#2962) (GH#2969) John A Kirkham
Compatibility for changes to structured arrays in the upcoming NumPy 1.14 release (GH#2964) Tom Augspurger
Add block (GH#2650) John A Kirkham
Add frompyfunc (GH#3030) Jim Crist
Add the return_stored option to store for chaining stored results (GH#2980) John A Kirkham

DataFrame¶

Fixed naming bug in cumulative aggregations (GH#3037) Martijn Arts
Fixed dd.read_csv when names is given but header is not set to None (GH#2976) Martijn Arts
Fixed dd.read_csv so that passing instances of CategoricalDtype in dtype will result in known categoricals (GH#2997) Tom Augspurger
Prevent bool() coercion from calling compute (GH#2958) Albert DeFusco
DataFrame.read_sql() on an empty database table returns an empty dask dataframe (GH#2928) Apostolos Vlachopoulos
Compatibility for reading Parquet files written by PyArrow 0.8.0 (GH#2973) Tom Augspurger
Correctly handle the column name (df.columns.name) when reading in dd.read_parquet (GH#2973) Tom Augspurger
Fixed dd.concat losing the index dtype when the data contained a categorical (GH#2932) Tom Augspurger
Add dd.Series.rename (GH#3027) Jim Crist
DataFrame.merge() now supports merging on a combination of columns and the index (GH#2960) Jon Mease
Removed the deprecated dd.rolling* methods, in preparation for their removal in the next pandas release (GH#2995) Tom Augspurger
Fix metadata inference bug in which single-partition series were mistakenly special cased (GH#3035) Jim Crist
Add support for Series.str.cat (GH#3028) Jim Crist

Core¶

Improve 32-bit compatibility (GH#2937) Matthew Rocklin
Change task prioritization to avoid upwards branching (GH#3017) Matthew Rocklin

0.16.0 / 2017-11-17¶

This is a major release. It includes breaking changes, new protocols, and a large number of bug fixes.

Array¶

Add atleast_1d, atleast_2d, and atleast_3d (GH#2760) (GH#2765) John A Kirkham
Add allclose (GH#2771) John A Kirkham
Remove random.different_seeds from Dask Array API docs (GH#2772) John A Kirkham
Deprecate vnorm in favor of dask.array.linalg.norm (GH#2773) John A Kirkham
Reimplement unique to be lazy (GH#2775) John A Kirkham
Support broadcasting of Dask Arrays with 0-length dimensions (GH#2784) John A Kirkham
Add asarray and asanyarray to Dask Array API docs (GH#2787) James Bourbeau
Support unique’s return_* arguments (GH#2779) John A Kirkham
Simplify _unique_internal (GH#2850) (GH#2855) John A Kirkham
Avoid removing some getter calls in array optimizations (GH#2826) Jim Crist

DataFrame¶

Support pyarrow in dd.to_parquet (GH#2868) Jim Crist
Fixed DataFrame.quantile and Series.quantile returning nan when missing values are present (GH#2791) Tom Augspurger
Fixed DataFrame.quantile losing the result .name when q is a scalar (GH#2791) Tom Augspurger
Fixed dd.concat to return a dask.DataFrame when concatenating a single series along the columns, matching pandas’ behavior (GH#2800) James Munroe
Fixed default inplace parameter for DataFrame.eval to match the pandas default for pandas >= 0.21.0 (GH#2838) Tom Augspurger
Fix exception when calling DataFrame.set_index on text column where one of the partitions was empty (GH#2831) Jesse Vogt
Do not raise exception when calling DataFrame.set_index on empty dataframe (GH#2827) Jesse Vogt
Fixed bug in Dataframe.fillna when filling with a Series value (GH#2810) Tom Augspurger
Deprecate old argument ordering in dd.to_parquet to better match convention of putting the dataframe first (GH#2867) Jim Crist
df.astype(categorical_dtype) -> known categoricals (GH#2835) Jim Crist
Test against Pandas release candidate (GH#2814) Tom Augspurger
Add more tests for read_parquet(engine="pyarrow") (GH#2822) Uwe Korn
Remove unnecessary map_partitions in aggregate (GH#2712) Christopher Prohm
Fix bug calling sample on empty partitions (GH#2818) @xwang777
Error nicely when parsing dates in read_csv (GH#2863) Jim Crist
Cleanup handling of passing filesystem objects to PyArrow readers (GH#2527) @fjetter
Support repartitioning even if there are no divisions (GH#2873) @Ced4
Support reading/writing to hdfs using pyarrow in dd.to_parquet (GH#2894, GH#2881) Jim Crist

Core¶

Allow tuples as sharedict keys (GH#2763) Matthew Rocklin
Calling compute within a dask.distributed task defaults to distributed scheduler (GH#2762) Matthew Rocklin
Auto-import gcsfs when gcs:// protocol is used (GH#2776) Matthew Rocklin
Fully remove dask.async module, use dask.local instead (GH#2828) Thomas Caswell
Compatibility with bokeh 0.12.10 (GH#2844) Tom Augspurger
Reduce test memory usage (GH#2782) Jim Crist
Add Dask collection interface (GH#2748) Jim Crist
Update Dask collection interface during XArray integration (GH#2847) Matthew Rocklin
Close resource profiler process on __exit__ (GH#2871) Jim Crist
Fix S3 tests (GH#2875) Jim Crist
Fix port for bokeh dashboard in docs (GH#2889) Ian Hopkinson
Wrap Dask filesystems for PyArrow compatibility (GH#2881) Jim Crist

0.15.4 / 2017-10-06¶

Array¶

da.random.choice now works with array arguments (GH#2781)
Support indexing in arrays with np.int (fixes regression) (GH#2719)
Handle zero dimension with rechunking (GH#2747)
Support -1 as an alias for “size of the dimension” in chunks (GH#2749)
Call mkdir in array.to_npy_stack (GH#2709)
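
The -1 alias for chunks mentioned above stands for "the full length of this dimension". A minimal plain-Python sketch of that substitution follows. The helper expand_chunks is hypothetical and handles only one integer per dimension, which is far less than Dask's real chunk normalization accepts.

```python
def expand_chunks(chunks, shape):
    """Replace -1 entries in ``chunks`` with the matching dimension size."""
    if len(chunks) != len(shape):
        raise ValueError("chunks and shape must have the same length")
    return tuple(dim if c == -1 else c for c, dim in zip(chunks, shape))


# A (1000, 250) array chunked by 100 rows, with each chunk spanning all columns.
expanded = expand_chunks((100, -1), (1000, 250))
```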

DataFrame¶

Added the .str accessor to Categoricals with string categories (GH#2743)
Support int96 (spark) datetimes in parquet writer (GH#2711)
Pass on file scheme to fastparquet (GH#2714)
Support Pandas 0.21 (GH#2737)

Bag¶

Add tree reduction support for foldby (GH#2710)

Core¶

Drop s3fs from pip install dask[complete] (GH#2750)

0.15.3 / 2017-09-24¶

Array¶

Add masked arrays (GH#2301)
Add *_like array creation functions (GH#2640)
Indexing with unsigned integer array (GH#2647)
Improved slicing with boolean arrays of different dimensions (GH#2658)
Support literals in top and atop (GH#2661)
Optional axis argument in cumulative functions (GH#2664)
Improve tests on scalars with assert_eq (GH#2681)
Fix norm keepdims (GH#2683)
Add ptp (GH#2691)
Add apply_along_axis (GH#2690) and apply_over_axes (GH#2702)

DataFrame¶

Added Series.str[index] (GH#2634)
Allow the groupby by param to handle columns and index levels (GH#2636)
DataFrame.to_csv and Bag.to_textfiles now return the filenames to which they have written (GH#2655)
Fix combination of partition_on and append in to_parquet (GH#2645)
Fix for parquet file schemes (GH#2667)
Repartition works with mixed categoricals (GH#2676)

Core¶

python setup.py test now runs tests (GH#2641)
Added new cheatsheet (GH#2649)
Remove resize tool in Bokeh plots (GH#2688)

0.15.2 / 2017-08-25¶

Array¶

Remove spurious keys from map_overlap graph (GH#2520)
where works with non-bool condition and scalar values (GH#2543) (GH#2549)
Improve compress (GH#2541) (GH#2545) (GH#2555)
Add argwhere, _nonzero, and where(cond) (GH#2539)
Generalize vindex in dask.array to handle multi-dimensional indices (GH#2573)
Add choose method (GH#2584)
Split code into reorganized files (GH#2595)
Add linalg.norm (GH#2597)
Add diff, ediff1d (GH#2607), (GH#2609)
Improve dtype inference and reflection (GH#2571)

Bag¶

Remove deprecated Bag behaviors (GH#2525)

DataFrame¶

Support callables in assign (GH#2513)
Better error messages for read_csv (GH#2522)
Add dd.to_timedelta (GH#2523)
Verify metadata in from_delayed (GH#2534) (GH#2591)
Add DataFrame.isin (GH#2558)
read_hdf supports iterables of files (GH#2547)

Core¶

Remove bare except: blocks everywhere (GH#2590)

0.15.1 / 2017-07-08¶

Add storage_options to to_textfiles and to_csv (GH#2466)
Rechunk and simplify rfftfreq (GH#2473), (GH#2475)
Better support ndarray subclasses (GH#2486)
Import star in dask.distributed (GH#2503)
Threadsafe cache handling with tokenization (GH#2511)

0.15.0 / 2017-06-09¶

Array¶

Add dask.array.stats submodule (GH#2269)
Support ufunc.outer (GH#2345)
Optimize fancy indexing by reducing graph overhead (GH#2333) (GH#2394)
Faster array tokenization using alternative hashes (GH#2377)
Added the matmul @ operator (GH#2349)
Improved coverage of the numpy.fft module (GH#2320) (GH#2322) (GH#2327) (GH#2323)
Support NumPy’s __array_ufunc__ protocol (GH#2438)

Bag¶

Fix bug where reductions on bags with no partitions would fail (GH#2324)
Add broadcasting and variadic db.map top-level function. Also remove auto-expansion of tuples as map arguments (GH#2339)
Rename Bag.concat to Bag.flatten (GH#2402)

DataFrame¶

Parquet improvements (GH#2277) (GH#2422)

Core¶

Move dask.async module to dask.local (GH#2318)
Support callbacks with nested scheduler calls (GH#2397)
Support pathlib.Path objects as uris (GH#2310)

0.14.3 / 2017-05-05¶

DataFrame¶

Pandas 0.20.0 support

0.14.2 / 2017-05-03¶

Array¶

Add da.indices (GH#2268), da.tile (GH#2153), da.roll (GH#2135)
Simultaneously support drop_axis and new_axis in da.map_blocks (GH#2264)
Rechunk and concatenate work with unknown chunksizes (GH#2235) and (GH#2251)
Support non-numpy container arrays, notably sparse arrays (GH#2234)
Tensordot contracts over multiple axes (GH#2186)
Allow delayed targets in da.store (GH#2181)
Support interactions against lists and tuples (GH#2148)
Constructor plugins for debugging (GH#2142)
Multi-dimensional FFTs (single chunk) (GH#2116)

Bag¶

to_dataframe enforces consistent types (GH#2199)

DataFrame¶

Set_index always fully sorts the index (GH#2290)
Support compatibility with pandas 0.20.0 (GH#2249), (GH#2248), and (GH#2246)
Support Arrow Parquet reader (GH#2223)
Time-based rolling windows (GH#2198)
Repartition can now create more partitions, not just less (GH#2168)

Core¶

Always use absolute paths when on POSIX file system (GH#2263)
Support user provided graph optimizations (GH#2219)
Refactor path handling (GH#2207)
Improve fusion performance (GH#2129), (GH#2131), and (GH#2112)

0.14.1 / 2017-03-22¶

Array¶

Micro-optimize optimizations (GH#2058)
Change slicing optimizations to avoid fusing raw numpy arrays (GH#2075) (GH#2080)
Dask.array operations now work on numpy arrays (GH#2079)
Reshape now works in a much broader set of cases (GH#2089)
Support deepcopy python protocol (GH#2090)
Allow user-provided FFT implementations in da.fft (GH#2093)

DataFrame¶

Fix to_parquet with empty partitions (GH#2020)
Optional npartitions='auto' mode in set_index (GH#2025)
Optimize shuffle performance (GH#2032)
Support efficient repartitioning along time windows like repartition(freq='12h') (GH#2059)
Improve speed of categorize (GH#2010)
Support single-row dataframe arithmetic (GH#2085)
Automatically avoid shuffle when setting index with a sorted column (GH#2091)
Improve integer-NA handling in read_csv (GH#2098)

Delayed¶

Repeated attribute access on delayed objects uses the same key (GH#2084)

Core¶

Improve naming of nodes in dot visuals to avoid generic apply (GH#2070)
Ensure that worker processes have different random seeds (GH#2094)

0.14.0 / 2017-02-24¶

Array¶

Fix corner cases with zero shape and misaligned values in arange (GH#1902), (GH#1904), (GH#1935), (GH#1955), (GH#1956)
Improve concatenation efficiency (GH#1923)
Avoid hashing in from_array if name is provided (GH#1972)

Bag¶

Repartition can now increase number of partitions (GH#1934)
Fix bugs in some reductions with empty partitions (GH#1939), (GH#1950), (GH#1953)

DataFrame¶

Support non-uniform categoricals (GH#1877), (GH#1930)
Groupby cumulative reductions (GH#1909)
DataFrame.loc indexing now supports lists (GH#1913)
Improve multi-level groupbys (GH#1914)
Improved HTML and string repr for DataFrames (GH#1637)
Parquet append (GH#1940)
Add dd.demo.daily_stock function for teaching (GH#1992)

Delayed¶

Add traverse= keyword to delayed to optionally avoid traversing nested data structures (GH#1899)
Support Futures in from_delayed functions (GH#1961)
Improve serialization of decorated delayed functions (GH#1969)

Core¶

Improve windows path parsing in corner cases (GH#1910)
Rename tasks when fusing (GH#1919)
Add top level persist function (GH#1927)
Propagate errors= keyword in byte handling (GH#1954)
Dask.compute traverses Python collections (GH#1975)
Structural sharing between graphs in dask.array and dask.delayed (GH#1985)

0.13.0 / 2017-01-02¶

Array¶

Mandatory dtypes on dask.array. All operations maintain dtype information and UDF functions like map_blocks now require a dtype= keyword if it can not be inferred. (GH#1755)
Support arrays without known shapes, such as those that arise when slicing arrays with arrays or converting dataframes to arrays (GH#1838)
Support mutation by setting one array with another (GH#1840)
Tree reductions for covariance and correlations. (GH#1758)
Add SerializableLock for better use with distributed scheduling (GH#1766)
Improved atop support (GH#1800)
Rechunk optimization (GH#1737), (GH#1827)

Bag¶

Avoid wrong results when recomputing the same groupby twice (GH#1867)

DataFrame¶

Add map_overlap for custom rolling operations (GH#1769)
Add shift (GH#1773)
Add Parquet support (GH#1782) (GH#1792) (GH#1810), (GH#1843), (GH#1859), (GH#1863)
Add missing methods combine, abs, autocorr, sem, nsmallest, first, last, prod, (GH#1787)
Approximate nunique (GH#1807), (GH#1824)
Reductions with multiple output partitions (for operations like drop_duplicates) (GH#1808), (GH#1823) (GH#1828)
Add delitem and copy to DataFrames, increasing mutation support (GH#1858)

Delayed¶

Changed behaviour for delayed(nout=0) and delayed(nout=1): delayed(nout=1) no longer defaults to nout=None, and delayed(nout=0) is now enabled. That is, functions that return tuples of length 1 or 0 are handled correctly. This is especially handy when functions with a variable number of outputs are wrapped by delayed. A trivial example: delayed(lambda *args: args, nout=len(vals))(*vals)
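The nout behaviour described above can be sketched as follows. This is a minimal illustration rather than an excerpt from the Dask test suite: the values in vals and the lambda helpers are made up, and the synchronous scheduler is chosen only to keep the example deterministic.

```python
import dask
from dask import delayed

vals = [10, 20, 30]  # illustrative inputs

# nout tells dask how many outputs to expect, so the lazy result can be unpacked.
parts = delayed(lambda *args: args, nout=len(vals))(*vals)
first, second, third = parts
results = dask.compute(first, second, third, scheduler="synchronous")

# A function returning a length-1 tuple can be unpacked the same way.
(only,) = delayed(lambda x: (x,), nout=1)(42)
single = only.compute(scheduler="synchronous")

# nout=0 gives a Delayed object of length zero.
empty = delayed(lambda: (), nout=0)()
```

Because the length comes from len(vals), the same wrapper works whether vals has zero, one, or many elements.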

Core¶

Refactor core byte ingest (GH#1768), (GH#1774)
Improve import time (GH#1833)

0.12.0 / 2016-11-03¶

DataFrame¶

Return a series when functions given to dataframe.map_partitions return scalars (GH#1515)
Fix type size inference for series (GH#1513)
dataframe.DataFrame.categorize no longer includes missing values in the categories. This is for compatibility with a pandas change (GH#1565)
Fix head parser error in dataframe.read_csv when some lines have quotes (GH#1495)
Add dataframe.reduction and series.reduction methods to apply generic row-wise reduction to dataframes and series (GH#1483)
Add dataframe.select_dtypes, which mirrors the pandas method (GH#1556)
dataframe.read_hdf now supports reading Series (GH#1564)
Support Pandas 0.19.0 (GH#1540)
Implement select_dtypes (GH#1556)
String accessor works with indexes (GH#1561)
Add pipe method to dask.dataframe (GH#1567)
Add indicator keyword to merge (GH#1575)
Support Series in read_hdf (GH#1575)
Support Categories with missing values (GH#1578)
Support inplace operators like df.x += 1 (GH#1585)
Str accessor passes through args and kwargs (GH#1621)
Improved groupby support for single-machine multiprocessing scheduler (GH#1625)
Tree reductions (GH#1663)
Pivot tables (GH#1665)
Add clip (GH#1667), align (GH#1668), combine_first (GH#1725), and any/all (GH#1724)
Improved handling of divisions on dask-pandas merges (GH#1666)
Add groupby.aggregate method (GH#1678)
Add dd.read_table function (GH#1682)
Improve support for multi-level columns (GH#1697) (GH#1712)
Support 2d indexing in loc (GH#1726)
Extend resample to include DataFrames (GH#1741)
Support dask.array ufuncs on dask.dataframe objects (GH#1669)

Array¶

Add information about how the dask.array chunks argument works (GH#1504)
Fix field access with non-scalar fields in dask.array (GH#1484)
Add concatenate= keyword to atop to concatenate chunks of contracted dimensions
Optimized slicing performance (GH#1539) (GH#1731)
Extend atop with a concatenate= (GH#1609) new_axes= (GH#1612) and adjust_chunks= (GH#1716) keywords
Add clip (GH#1610), swapaxes (GH#1611), round (GH#1708), and repeat
Automatically align chunks in atop-backed operations (GH#1644)
Cull dask.arrays on slicing (GH#1709)

Bag¶

Fix issue with callables in bag.from_sequence being interpreted as tasks (GH#1491)
Avoid non-lazy memory use in reductions (GH#1747)

Administration¶

Added changelog (GH#1526)
Create new threadpool when operating from thread (GH#1487)
Unify example documentation pages into one (GH#1520)
Add versioneer for git-commit based versions (GH#1569)
Pass through node_attr and edge_attr keywords in dot visualization (GH#1614)
Add continuous testing for Windows with Appveyor (GH#1648)
Remove use of multiprocessing.Manager (GH#1653)
Add global optimizations keyword to compute (GH#1675)
Micro-optimize get_dependencies (GH#1722)

0.11.0 / 2016-08-24¶

Major Points¶

DataFrames now enforce knowing full metadata (columns, dtypes) everywhere. Previously we would operate in an ambiguous state when functions lost dtype information (such as apply). Now all dataframes always know their dtypes and raise errors asking for information if they are unable to infer (which they usually can). Some internal attributes like _pd and _pd_nonempty have been moved.

The internals of the distributed scheduler have been refactored to transition tasks between explicit states. This improves resilience, reasoning about scheduling, plugin operation, and logging. It also makes the scheduler code easier to understand for newcomers.

Breaking Changes¶

The distributed.s3 and distributed.hdfs namespaces are gone. Use protocols in normal methods like read_text('s3://...') instead.
Dask.array.reshape now errs in some cases where previously it would have created a very large number of tasks

0.10.2 / 2016-07-27¶

More Dataframe shuffles now work in distributed settings, ranging from setting-index to hash joins, to sorted joins and groupbys.
Dask passes the full test suite when run under Python’s optimized -OO mode.
On-disk shuffles were found to produce wrong results in some highly-concurrent situations, especially on Windows. This has been resolved by a fix to the partd library.
Fixed a growth of open file descriptors that occurred under large data communications
Support ports in the --bokeh-whitelist option of dask-scheduler to better route web interface messages behind non-trivial network settings
Some improvements to resilience to worker failure (though other known failures persist)
You can now start an IPython kernel on any worker for improved debugging and analysis
Improvements to dask.dataframe.read_hdf, especially when reading from multiple files and docs

0.10.0 / 2016-06-13¶

Major Changes¶

This version drops support for Python 2.6
Conda packages are built and served from conda-forge
The dask.distributed executables have been renamed from dfoo to dask-foo. For example dscheduler is renamed to dask-scheduler
Both Bag and DataFrame include a preliminary distributed shuffle.

Bag¶

Add task-based shuffle for distributed groupbys
Add accumulate for cumulative reductions

DataFrame¶

Add a task-based shuffle suitable for distributed joins, groupby-applys, and set_index operations. The single-machine shuffle remains untouched (and much more efficient.)
Add support for new Pandas rolling API with improved communication performance on distributed systems.
Add groupby.std/var
Pass through S3/HDFS storage options in read_csv
Improve categorical partitioning
Add eval, info, isnull, notnull for dataframes

Distributed¶

Rename executables like dscheduler to dask-scheduler
Improve scheduler performance in the many-fast-tasks case (important for shuffling)
Improve work stealing to be aware of expected function run-times and data sizes. This drastically increases the breadth of algorithms that can be efficiently run on the distributed scheduler without significant user expertise.
Support maximum buffer sizes in streaming queues
Improve Windows support when using the Bokeh diagnostic web interface
Support compression of very-large-bytestrings in protocol
Support clean cancellation of submitted futures in Joblib interface

Other¶

All dask-related projects (dask, distributed, s3fs, hdfs, partd) are now building conda packages on conda-forge.
Change credential handling in s3fs to only pass around delegated credentials if explicitly given secret/key. The default now is to rely on managed environments. This can be changed back by explicitly providing a keyword argument. Anonymous mode must be explicitly declared if desired.

0.9.0 / 2016-05-11¶

API Changes¶

dask.do and dask.value have been renamed to dask.delayed
dask.bag.from_filenames has been renamed to dask.bag.read_text
All S3/HDFS data ingest functions like db.from_s3 or distributed.s3.read_csv have been moved into the plain read_text, read_csv functions, which now support protocols, like dd.read_csv('s3://bucket/keys*.csv')

Array¶

Add support for scipy.LinearOperator
Improve optional locking to on-disk data structures
Change rechunk to expose the intermediate chunks

Bag¶

Rename from_filenames to read_text
Remove from_s3 in favor of read_text('s3://...')

DataFrame¶

Fixed numerical stability issue for correlation and covariance
Allow no-hash from_pandas for speedy round-trips to and from-pandas objects
Generally reengineered read_csv to be more in line with Pandas behavior
Support fast set_index operations for sorted columns

Delayed¶

Rename do/value to delayed
Rename to/from_imperative to to/from_delayed

Distributed¶

Move s3 and hdfs functionality into the dask repository
Adaptively oversubscribe workers for very fast tasks
Improve PyPy support
Improve work stealing for unbalanced workers
Scatter data efficiently with tree-scatters

Other¶

Add lzma/xz compression support
Raise a warning when trying to split unsplittable compression types, like gzip or bz2
Improve hashing for single-machine shuffle operations
Add new callback method for start state
General performance tuning

0.8.1 / 2016-03-11¶

Array¶

Bugfix for range slicing that could periodically lead to incorrect results.
Improved support and resiliency of arg reductions (argmin, argmax, etc.)

Bag¶

Add zip function

DataFrame¶

Add corr and cov functions
Add melt function
Bugfixes for io to bcolz and hdf5

0.8.0 / 2016-02-20¶

Array¶

Changed default array reduction split from 32 to 4
Linear algebra, tril, triu, LU, inv, cholesky, solve, solve_triangular, eye, lstsq, diag, corrcoef.
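To give a feel for what a smaller reduction split means, here is a plain-Python sketch of a tree reduction. It assumes that "split" refers to how many partial results are combined at each node of the tree. The tree_reduce helper is hypothetical and is not Dask's implementation.

```python
from functools import reduce
import operator


def tree_reduce(values, combine, split_every):
    """Reduce values in rounds, combining at most split_every items per group.

    Returns the final result and the number of rounds (the tree depth).
    """
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    level = list(values)
    if not level:
        raise ValueError("need at least one value")
    depth = 0
    while len(level) > 1:
        # Combine neighbouring groups of up to split_every partial results.
        level = [
            reduce(combine, level[i:i + split_every])
            for i in range(0, len(level), split_every)
        ]
        depth += 1
    return level[0], depth


partials = list(range(1000))  # stand-ins for per-chunk partial sums
total_wide, depth_wide = tree_reduce(partials, operator.add, split_every=32)
total_narrow, depth_narrow = tree_reduce(partials, operator.add, split_every=4)
```

Both calls produce the same total. The narrower split uses more rounds, but each combining step handles fewer partial results at a time.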

Bag¶

Add tree reductions
Add range function
Drop from_hdfs function (better functionality now exists in hdfs3 and distributed projects)

DataFrame¶

Refactor dask.dataframe to include a full empty pandas dataframe as metadata. Drop the .columns attribute on Series
Add Series categorical accessor, series.nunique, drop the .columns attribute for series.
read_csv fixes (multi-column parse_dates, integer column names, etc. )
Internal changes to improve graph serialization

Other¶

Documentation updates
Add from_imperative and to_imperative functions for all collections
Aesthetic changes to profiler plots
Moved the dask project to a new dask organization

0.7.6 / 2016-01-05¶

Array¶

Improve thread safety
Tree reductions
Add view, compress, hstack, dstack, vstack methods
map_blocks can now remove and add dimensions

DataFrame¶

Improve thread safety
Extend sampling to include replacement options

Imperative¶

Removed optimization passes that fused results.

Core¶

Removed dask.distributed
Improved performance of blocked file reading
Serialization improvements
Test Python 3.5

0.7.4 / 2015-10-23¶

This was mostly a bugfix release. Some notable changes:

Fix minor bugs associated with the release of numpy 1.10 and pandas 0.17
Fixed a bug with random number generation that would cause repeated blocks due to the birthday paradox
Use locks in dask.dataframe.read_hdf by default to avoid concurrency issues
Change dask.get to point to dask.async.get_sync by default
Allow visualization functions to accept general graphviz graph options like rankdir='LR'
Add reshape and ravel to dask.array
Support the creation of dask.arrays from dask.imperative objects
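The birthday-paradox effect mentioned in the random number fix above can be made concrete with a short calculation. This sketch only illustrates the arithmetic. The seed-space size (32 bits) and the block count are assumptions chosen for the example, not values taken from Dask.

```python
import math


def collision_probability_exact(n_draws, n_values):
    """Probability that at least two of n_draws uniform draws coincide."""
    if n_draws > n_values:
        return 1.0
    # Sum logs of (1 - i/N) to avoid underflow in the running product.
    log_no_collision = sum(math.log1p(-i / n_values) for i in range(n_draws))
    return 1.0 - math.exp(log_no_collision)


def collision_probability_approx(n_draws, n_values):
    """Standard approximation 1 - exp(-n(n-1) / (2N))."""
    return 1.0 - math.exp(-n_draws * (n_draws - 1) / (2 * n_values))


# The classic case: 23 people, 365 possible birthdays.
classic = collision_probability_exact(23, 365)

# Hypothetical setting: one 32-bit seed per block, 100,000 blocks.
seeded_blocks = collision_probability_approx(100_000, 2**32)
```

Under these assumed numbers the chance that at least two blocks share a seed is above one half, even though the number of blocks is tiny compared with the seed space.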

Deprecation¶

This release also includes a deprecation warning for dask.distributed, which will be removed in the next version.

Future development in distributed computing for dask is happening here: https://distributed.dask.org . General feedback on that project is most welcome from this community.

0.7.3 / 2015-09-25¶

Diagnostics¶

A utility for profiling memory and cpu usage has been added to the dask.diagnostics module.

DataFrame¶

This release improves coverage of the pandas API. Among other things, it adds nunique, nlargest, and quantile; fixes encoding issues when reading non-ASCII CSV files; improves performance and fixes bugs in resample; and makes read_hdf more flexible with globbing. It also includes various bug fixes in dask.imperative and dask.bag.

0.7.0 / 2015-08-15¶

DataFrame¶

This release includes significant bugfixes and alignment with the Pandas API. This has resulted both from use and from recent involvement by Pandas core developers.

New operations: query, rolling operations, drop
Improved operations: quantiles, arithmetic on full dataframes, dropna, constructor logic, merge/join, elemwise operations, groupby aggregations

Bag¶

Fixed a bug in fold with a null default argument

Array¶

New operations: da.fft module, da.image.imread

Infrastructure¶

The array and dataframe collections create graphs with deterministic keys. These tend to be longer (hash strings) but should be consistent between computations. This will be useful for caching in the future.
All collections (Array, Bag, DataFrame) inherit from a common base class
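The idea behind deterministic keys can be sketched in plain Python: derive the key from a hash of the inputs so that the same operation on the same arguments always gets the same name. The deterministic_key helper below is hypothetical and far simpler than Dask's actual tokenization. It only illustrates why such keys are longer but stable.

```python
import hashlib


def deterministic_key(prefix, *args):
    """Build a task key from a hash of the arguments' repr.

    Illustrative only: it assumes repr() is stable for the given arguments.
    """
    digest = hashlib.sha256(repr(args).encode("utf-8")).hexdigest()
    return f"{prefix}-{digest}"


key_a = deterministic_key("add", (2, 3), "float64")
key_b = deterministic_key("add", (2, 3), "float64")  # same inputs, same key
key_c = deterministic_key("add", (2, 4), "float64")  # different inputs
```

Because equal inputs map to equal keys, a cache keyed on these names could recognise a result that has already been computed.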

0.6.1 / 2015-07-23¶

Distributed¶

Improved (though not yet sufficient) resiliency for dask.distributed when workers die

DataFrame¶

Improved writing to various formats, including to_hdf, to_castra, and to_csv
Improved creation of dask DataFrames from dask Arrays and Bags
Improved support for categoricals and various other methods

Array¶

Various bug fixes
Histogram function

Scheduling¶

Added tie-breaking ordering of tasks within parallel workloads to better handle and clear intermediate results

Other¶

Added the dask.do function for explicit construction of graphs with normal python code
Traded pydot for graphviz library for graph printing to support Python3
There is also a gitter chat room and a stackoverflow tag
