API¶
Dataframe¶
DataFrame (dsk, name, meta, divisions) |
Parallel Pandas DataFrame |
DataFrame.add (other[, axis, level, fill_value]) |
Addition of dataframe and other, element-wise (binary operator add). |
DataFrame.append (other) |
Append rows of other to the end of this frame, returning a new object. |
DataFrame.apply (func[, axis, args, meta]) |
Parallel version of pandas.DataFrame.apply |
DataFrame.assign (**kwargs) |
Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones. |
DataFrame.astype (dtype) |
Cast object to input numpy.dtype |
DataFrame.categorize ([columns, index, ...]) |
Convert columns of the DataFrame to category dtype. |
DataFrame.columns |
|
DataFrame.compute (**kwargs) |
Compute this dask collection |
DataFrame.corr ([method, min_periods, ...]) |
Compute pairwise correlation of columns, excluding NA/null values |
DataFrame.count ([axis, split_every]) |
Return Series with number of non-NA/null observations over requested axis. |
DataFrame.cov ([min_periods, split_every]) |
Compute pairwise covariance of columns, excluding NA/null values |
DataFrame.cummax ([axis, skipna]) |
Return cumulative max over requested axis. |
DataFrame.cummin ([axis, skipna]) |
Return cumulative minimum over requested axis. |
DataFrame.cumprod ([axis, skipna]) |
Return cumulative product over requested axis. |
DataFrame.cumsum ([axis, skipna]) |
Return cumulative sum over requested axis. |
DataFrame.describe ([split_every]) |
Generate various summary statistics, excluding NaN values. |
DataFrame.div (other[, axis, level, fill_value]) |
Floating division of dataframe and other, element-wise (binary operator truediv). |
DataFrame.drop (labels[, axis, errors]) |
Return new object with labels in requested axis removed. |
DataFrame.drop_duplicates ([split_every, ...]) |
Return DataFrame with duplicate rows removed, optionally only considering certain columns. |
DataFrame.dropna ([how, subset]) |
Return object with labels on given axis omitted where any or all of the data are missing. |
DataFrame.dtypes |
Return data types |
DataFrame.fillna ([value, method, limit, axis]) |
Fill NA/NaN values using the specified method |
DataFrame.floordiv (other[, axis, level, ...]) |
Integer division of dataframe and other, element-wise (binary operator floordiv). |
DataFrame.get_partition (n) |
Get a dask DataFrame/Series representing the nth partition. |
DataFrame.groupby ([by]) |
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns. |
DataFrame.head ([n, npartitions, compute]) |
First n rows of the dataset |
DataFrame.index |
Return dask Index instance |
DataFrame.iterrows () |
Iterate over DataFrame rows as (index, Series) pairs. |
DataFrame.itertuples () |
Iterate over DataFrame rows as namedtuples, with index value as first element of the tuple. |
DataFrame.join (other[, on, how, lsuffix, ...]) |
Join columns with other DataFrame either on index or on a key column. |
DataFrame.known_divisions |
Whether divisions are already known |
DataFrame.loc |
Purely label-location based indexer for selection by label. |
DataFrame.map_partitions (func, *args, **kwargs) |
Apply Python function on each DataFrame partition. |
DataFrame.mask (cond[, other]) |
Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other. |
DataFrame.max ([axis, skipna, split_every]) |
This method returns the maximum of the values in the object. |
DataFrame.mean ([axis, skipna, split_every]) |
Return the mean of the values for the requested axis |
DataFrame.merge (right[, how, on, left_on, ...]) |
Merge DataFrame objects by performing a database-style join operation by columns or indexes. |
DataFrame.min ([axis, skipna, split_every]) |
This method returns the minimum of the values in the object. |
DataFrame.mod (other[, axis, level, fill_value]) |
Modulo of dataframe and other, element-wise (binary operator mod). |
DataFrame.mul (other[, axis, level, fill_value]) |
Multiplication of dataframe and other, element-wise (binary operator mul). |
DataFrame.ndim |
Return dimensionality |
DataFrame.nlargest ([n, columns, split_every]) |
Get the rows of a DataFrame sorted by the n largest values of columns. |
DataFrame.npartitions |
Return number of partitions |
DataFrame.pow (other[, axis, level, fill_value]) |
Exponential power of dataframe and other, element-wise (binary operator pow). |
DataFrame.quantile ([q, axis]) |
Approximate row-wise and precise column-wise quantiles of DataFrame |
DataFrame.query (expr, **kwargs) |
Filter dataframe with complex expression |
DataFrame.radd (other[, axis, level, fill_value]) |
Addition of dataframe and other, element-wise (binary operator radd). |
DataFrame.random_split (frac[, random_state]) |
Pseudorandomly split dataframe into different pieces row-wise |
DataFrame.rdiv (other[, axis, level, fill_value]) |
Floating division of dataframe and other, element-wise (binary operator rtruediv). |
DataFrame.rename ([index, columns]) |
Alter axes labels using an input function or functions. |
DataFrame.repartition ([divisions, ...]) |
Repartition dataframe along new divisions |
DataFrame.reset_index ([drop]) |
Reset the index to the default index. |
DataFrame.rfloordiv (other[, axis, level, ...]) |
Integer division of dataframe and other, element-wise (binary operator rfloordiv). |
DataFrame.rmod (other[, axis, level, fill_value]) |
Modulo of dataframe and other, element-wise (binary operator rmod). |
DataFrame.rmul (other[, axis, level, fill_value]) |
Multiplication of dataframe and other, element-wise (binary operator rmul). |
DataFrame.rpow (other[, axis, level, fill_value]) |
Exponential power of dataframe and other, element-wise (binary operator rpow). |
DataFrame.rsub (other[, axis, level, fill_value]) |
Subtraction of dataframe and other, element-wise (binary operator rsub). |
DataFrame.rtruediv (other[, axis, level, ...]) |
Floating division of dataframe and other, element-wise (binary operator rtruediv). |
DataFrame.sample (frac[, replace, random_state]) |
Random sample of items |
DataFrame.set_index (other[, drop, sorted, ...]) |
Set the DataFrame index (row labels) using an existing column |
DataFrame.std ([axis, skipna, ddof, split_every]) |
Return sample standard deviation over requested axis. |
DataFrame.sub (other[, axis, level, fill_value]) |
Subtraction of dataframe and other, element-wise (binary operator sub). |
DataFrame.sum ([axis, skipna, split_every]) |
Return the sum of the values for the requested axis |
DataFrame.tail ([n, compute]) |
Last n rows of the dataset |
DataFrame.to_bag ([index]) |
Convert to a dask Bag of tuples of each row. |
DataFrame.to_csv (filename, **kwargs) |
See dd.to_csv docstring for more information |
DataFrame.to_delayed () |
See dd.to_delayed docstring for more information |
DataFrame.to_hdf (path_or_buf, key[, mode, ...]) |
See dd.to_hdf docstring for more information |
DataFrame.to_records ([index]) |
|
DataFrame.truediv (other[, axis, level, ...]) |
Floating division of dataframe and other, element-wise (binary operator truediv). |
DataFrame.values |
Return a dask.array of the values of this dataframe |
DataFrame.var ([axis, skipna, ddof, split_every]) |
Return unbiased variance over requested axis. |
DataFrame.visualize ([filename, format, ...]) |
Render the computation of this object’s task graph using graphviz. |
DataFrame.where (cond[, other]) |
Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other. |
Series¶
Series (dsk, name, meta, divisions) |
Parallel Pandas Series |
Series.add (other[, level, fill_value, axis]) |
Addition of series and other, element-wise (binary operator add). |
Series.align (other[, join, axis, fill_value]) |
Align two objects on their axes with the specified join method. |
Series.all ([axis, skipna, split_every]) |
Return whether all elements are True over requested axis |
Series.any ([axis, skipna, split_every]) |
Return whether any element is True over requested axis |
Series.append (other) |
Concatenate two or more Series. |
Series.apply (func[, convert_dtype, meta, args]) |
Parallel version of pandas.Series.apply |
Series.astype (dtype) |
Cast object to input numpy.dtype |
Series.autocorr ([lag, split_every]) |
Lag-N autocorrelation |
Series.between (left, right[, inclusive]) |
Return boolean Series equivalent to left <= series <= right. |
Series.bfill ([axis, limit]) |
Synonym for NDFrame.fillna(method='bfill') |
Series.cat |
|
Series.clear_divisions () |
Forget division information |
Series.clip ([lower, upper, out]) |
Trim values at input threshold(s). |
Series.clip_lower (threshold) |
Return copy of the input with values below given value(s) truncated. |
Series.clip_upper (threshold) |
Return copy of input with values above given value(s) truncated. |
Series.compute (**kwargs) |
Compute this dask collection |
Series.copy () |
Make a copy of the dataframe |
Series.corr (other[, method, min_periods, ...]) |
Compute correlation with other Series, excluding missing values |
Series.count ([split_every]) |
Return number of non-NA/null observations in the Series |
Series.cov (other[, min_periods, split_every]) |
Compute covariance with Series, excluding missing values |
Series.cummax ([axis, skipna]) |
Return cumulative max over requested axis. |
Series.cummin ([axis, skipna]) |
Return cumulative minimum over requested axis. |
Series.cumprod ([axis, skipna]) |
Return cumulative product over requested axis. |
Series.cumsum ([axis, skipna]) |
Return cumulative sum over requested axis. |
Series.describe ([split_every]) |
Generate various summary statistics, excluding NaN values. |
Series.diff ([periods, axis]) |
1st discrete difference of object |
Series.div (other[, level, fill_value, axis]) |
Floating division of series and other, element-wise (binary operator truediv). |
Series.drop_duplicates ([split_every, split_out]) |
Return DataFrame with duplicate rows removed, optionally only considering certain columns. |
Series.dropna () |
Return Series without null values |
Series.dt |
|
Series.dtype |
Return data type |
Series.eq (other[, level, axis]) |
Equal to of series and other, element-wise (binary operator eq). |
Series.ffill ([axis, limit]) |
Synonym for NDFrame.fillna(method='ffill') |
Series.fillna ([value, method, limit, axis]) |
Fill NA/NaN values using the specified method |
Series.first (offset) |
Convenience method for subsetting initial periods of time series data based on a date offset. |
Series.floordiv (other[, level, fill_value, axis]) |
Integer division of series and other, element-wise (binary operator floordiv). |
Series.ge (other[, level, axis]) |
Greater than or equal to of series and other, element-wise (binary operator ge). |
Series.get_partition (n) |
Get a dask DataFrame/Series representing the nth partition. |
Series.groupby ([by]) |
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns. |
Series.gt (other[, level, axis]) |
Greater than of series and other, element-wise (binary operator gt). |
Series.head ([n, npartitions, compute]) |
First n rows of the dataset |
Series.idxmax ([axis, skipna, split_every]) |
Return index of first occurrence of maximum over requested axis. |
Series.idxmin ([axis, skipna, split_every]) |
Return index of first occurrence of minimum over requested axis. |
Series.isin (values) |
Return a boolean Series showing whether each element in the Series is exactly contained in the passed sequence of values . |
Series.isnull () |
Return a boolean same-sized object indicating if the values are null. |
Series.iteritems () |
Lazily iterate over (index, value) tuples |
Series.known_divisions |
Whether divisions are already known |
Series.last (offset) |
Convenience method for subsetting final periods of time series data based on a date offset. |
Series.le (other[, level, axis]) |
Less than or equal to of series and other, element-wise (binary operator le). |
Series.loc |
Purely label-location based indexer for selection by label. |
Series.lt (other[, level, axis]) |
Less than of series and other, element-wise (binary operator lt). |
Series.map (arg[, na_action, meta]) |
Map values of Series using an input correspondence (which can be a dict, Series, or function). |
Series.map_overlap (func, before, after, ...) |
Apply a function to each partition, sharing rows with adjacent partitions. |
Series.map_partitions (func, *args, **kwargs) |
Apply Python function on each DataFrame partition. |
Series.mask (cond[, other]) |
Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other. |
Series.max ([axis, skipna, split_every]) |
This method returns the maximum of the values in the object. |
Series.mean ([axis, skipna, split_every]) |
Return the mean of the values for the requested axis |
Series.memory_usage ([index, deep]) |
Memory usage of the Series |
Series.min ([axis, skipna, split_every]) |
This method returns the minimum of the values in the object. |
Series.mod (other[, level, fill_value, axis]) |
Modulo of series and other, element-wise (binary operator mod). |
Series.mul (other[, level, fill_value, axis]) |
Multiplication of series and other, element-wise (binary operator mul). |
Series.nbytes |
Number of bytes |
Series.ndim |
Return dimensionality |
Series.ne (other[, level, axis]) |
Not equal to of series and other, element-wise (binary operator ne). |
Series.nlargest ([n, split_every]) |
Return the largest n elements. |
Series.notnull () |
Return a boolean same-sized object indicating if the values are not null. |
Series.nsmallest ([n, split_every]) |
Return the smallest n elements. |
Series.nunique ([split_every]) |
Return number of unique elements in the object. |
Series.nunique_approx ([split_every]) |
Approximate number of unique rows. |
Series.persist (**kwargs) |
Persist this dask collection into memory |
Series.pipe (func, *args, **kwargs) |
Apply func(self, *args, **kwargs) |
Series.pow (other[, level, fill_value, axis]) |
Exponential power of series and other, element-wise (binary operator pow). |
Series.prod ([axis, skipna, split_every]) |
Return the product of the values for the requested axis |
Series.quantile ([q]) |
Approximate quantiles of Series |
Series.radd (other[, level, fill_value, axis]) |
Addition of series and other, element-wise (binary operator radd). |
Series.random_split (frac[, random_state]) |
Pseudorandomly split dataframe into different pieces row-wise |
Series.rdiv (other[, level, fill_value, axis]) |
Floating division of series and other, element-wise (binary operator rtruediv). |
Series.reduction (chunk[, aggregate, ...]) |
Generic row-wise reductions. |
Series.repartition ([divisions, npartitions, ...]) |
Repartition dataframe along new divisions |
Series.resample (rule[, how, closed, label]) |
Convenience method for frequency conversion and resampling of time series. |
Series.reset_index ([drop]) |
Reset the index to the default index. |
Series.rolling (window[, min_periods, freq, ...]) |
Provides rolling transformations. |
Series.round ([decimals]) |
Round each value in a Series to the given number of decimals. |
Series.sample (frac[, replace, random_state]) |
Random sample of items |
Series.sem ([axis, skipna, ddof, split_every]) |
Return unbiased standard error of the mean over requested axis. |
Series.shift ([periods, freq, axis]) |
Shift index by desired number of periods with an optional time freq |
Series.size |
Size of the series |
Series.std ([axis, skipna, ddof, split_every]) |
Return sample standard deviation over requested axis. |
Series.str |
|
Series.sub (other[, level, fill_value, axis]) |
Subtraction of series and other, element-wise (binary operator sub). |
Series.sum ([axis, skipna, split_every]) |
Return the sum of the values for the requested axis |
Series.to_bag ([index]) |
Create a Dask Bag from a Series |
Series.to_csv (filename, **kwargs) |
See dd.to_csv docstring for more information |
Series.to_delayed () |
See dd.to_delayed docstring for more information |
Series.to_frame ([name]) |
Convert Series to DataFrame |
Series.to_hdf (path_or_buf, key[, mode, ...]) |
See dd.to_hdf docstring for more information |
Series.to_parquet (path, *args, **kwargs) |
See dd.to_parquet docstring for more information |
Series.to_string ([max_rows]) |
Render a string representation of the Series |
Series.to_timestamp ([freq, how, axis]) |
Cast to DatetimeIndex of timestamps, at beginning of period |
Series.truediv (other[, level, fill_value, axis]) |
Floating division of series and other, element-wise (binary operator truediv). |
Series.unique ([split_every, split_out]) |
Return Series of unique values in the object. |
Series.value_counts ([split_every, split_out]) |
Returns object containing counts of unique values. |
Series.values |
Return a dask.array of the values of this dataframe |
Series.var ([axis, skipna, ddof, split_every]) |
Return unbiased variance over requested axis. |
Series.visualize ([filename, format, ...]) |
Render the computation of this object’s task graph using graphviz. |
Series.where (cond[, other]) |
Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other. |
Groupby Operations¶
DataFrameGroupBy.aggregate (arg[, ...]) |
Aggregate using input function or dict of {column -> function}. |
DataFrameGroupBy.apply (func[, meta]) |
Parallel version of pandas GroupBy.apply |
DataFrameGroupBy.count ([split_every, split_out]) |
Compute count of group, excluding missing values |
DataFrameGroupBy.cumcount ([axis]) |
Number each item in each group from 0 to the length of that group - 1. |
DataFrameGroupBy.cumprod ([axis]) |
Cumulative product for each group |
DataFrameGroupBy.cumsum ([axis]) |
Cumulative sum for each group |
DataFrameGroupBy.get_group (key) |
Constructs NDFrame from group with provided name |
DataFrameGroupBy.max ([split_every, split_out]) |
Compute max of group values |
DataFrameGroupBy.mean ([split_every, split_out]) |
Compute mean of groups, excluding missing values |
DataFrameGroupBy.min ([split_every, split_out]) |
Compute min of group values |
DataFrameGroupBy.size ([split_every, split_out]) |
Compute group sizes |
DataFrameGroupBy.std ([ddof, split_every, ...]) |
Compute standard deviation of groups, excluding missing values |
DataFrameGroupBy.sum ([split_every, split_out]) |
Compute sum of group values |
DataFrameGroupBy.var ([ddof, split_every, ...]) |
Compute variance of groups, excluding missing values |
SeriesGroupBy.aggregate (arg[, split_every, ...]) |
Apply aggregation function or functions to groups, most likely yielding a Series (or a DataFrame, depending on the aggregation function). |
SeriesGroupBy.apply (func[, meta]) |
Parallel version of pandas GroupBy.apply |
SeriesGroupBy.count ([split_every, split_out]) |
Compute count of group, excluding missing values |
SeriesGroupBy.cumcount ([axis]) |
Number each item in each group from 0 to the length of that group - 1. |
SeriesGroupBy.cumprod ([axis]) |
Cumulative product for each group |
SeriesGroupBy.cumsum ([axis]) |
Cumulative sum for each group |
SeriesGroupBy.get_group (key) |
Constructs NDFrame from group with provided name |
SeriesGroupBy.max ([split_every, split_out]) |
Compute max of group values |
SeriesGroupBy.mean ([split_every, split_out]) |
Compute mean of groups, excluding missing values |
SeriesGroupBy.min ([split_every, split_out]) |
Compute min of group values |
SeriesGroupBy.nunique ([split_every, split_out]) |
|
SeriesGroupBy.size ([split_every, split_out]) |
Compute group sizes |
SeriesGroupBy.std ([ddof, split_every, split_out]) |
Compute standard deviation of groups, excluding missing values |
SeriesGroupBy.sum ([split_every, split_out]) |
Compute sum of group values |
SeriesGroupBy.var ([ddof, split_every, split_out]) |
Compute variance of groups, excluding missing values |
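As a quick orientation, the groupby aggregations listed above are usually reached by grouping a Dask DataFrame and finishing with compute(). A minimal sketch, with column names that are illustrative rather than part of the API:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'value': [1, 2, 3, 4]})
>>> ddf = dd.from_pandas(pdf, npartitions=2)
>>> ddf.groupby('key')['value'].mean().compute()        # SeriesGroupBy.mean, evaluated eagerly
>>> ddf.groupby('key').aggregate({'value': 'sum'}).compute()  # DataFrameGroupBy.aggregate with a dict spec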
Rolling Operations¶
rolling.map_overlap (func, df, before, after, ...) |
Apply a function to each partition, sharing rows with adjacent partitions. |
rolling.rolling_apply (arg, window, *args, ...) |
Generic moving function application. |
rolling.rolling_count (arg, window, *args, ...) |
Rolling count of number of non-NaN observations inside provided window. |
rolling.rolling_kurt (arg, window, *args, ...) |
Unbiased moving kurtosis. |
rolling.rolling_max (arg, window, *args, **kwargs) |
Moving maximum. |
rolling.rolling_mean (arg, window, *args, ...) |
Moving mean. |
rolling.rolling_median (arg, window, *args, ...) |
Moving median. |
rolling.rolling_min (arg, window, *args, **kwargs) |
Moving minimum. |
rolling.rolling_quantile (arg, window, *args, ...) |
Moving quantile. |
rolling.rolling_skew (arg, window, *args, ...) |
Unbiased moving skewness. |
rolling.rolling_std (arg, window, *args, **kwargs) |
Moving standard deviation. |
rolling.rolling_sum (arg, window, *args, **kwargs) |
Moving sum. |
rolling.rolling_var (arg, window, *args, **kwargs) |
Moving variance. |
rolling.rolling_window (arg, window, **kwargs) |
Applies a moving window of type window_type and size window on the data. |
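These functions mirror the older pandas rolling_* helpers; the same results are more commonly obtained through the .rolling() accessor on a Dask Series or DataFrame. A minimal sketch (the column name x is illustrative):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)
>>> ddf.x.rolling(window=3).mean().compute()  # trailing moving average over 3 rows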
Create DataFrames¶
read_csv (urlpath[, blocksize, collection, ...]) |
Read CSV files into a Dask.DataFrame |
read_table (urlpath[, blocksize, collection, ...]) |
Read delimited files into a Dask.DataFrame |
read_parquet (path[, columns, filters, ...]) |
Read a Parquet file into a Dask DataFrame |
read_hdf (pattern, key[, start, stop, ...]) |
Read HDF files into a Dask DataFrame |
read_sql_table (table, uri, index_col[, ...]) |
Create dataframe from an SQL table. |
from_array (x[, chunksize, columns]) |
Read any sliceable array into a Dask Dataframe |
from_bcolz (x[, chunksize, categorize, ...]) |
Read BColz CTable into a Dask Dataframe |
from_dask_array (x[, columns]) |
Create a Dask DataFrame from a Dask Array. |
from_delayed (dfs[, meta, divisions, prefix]) |
Create Dask DataFrame from many Dask Delayed objects |
from_pandas (data[, npartitions, chunksize, ...]) |
Construct a Dask DataFrame from a Pandas DataFrame |
dask.bag.core.Bag.to_dataframe ([meta, columns]) |
Create Dask Dataframe from a Dask Bag |
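A minimal sketch of two common entry points; the file path and partition count below are illustrative:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(100)}), npartitions=4)  # from an in-memory frame
>>> ddf2 = dd.read_csv('data/2016-*.csv')  # glob pattern; each file becomes one or more partitions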
Store DataFrames¶
to_csv (df, filename[, name_function, ...]) |
Store Dask DataFrame to CSV files |
to_parquet (path, df[, compression, ...]) |
Store Dask.dataframe to Parquet files |
to_hdf (df, path, key[, mode, append, get, ...]) |
Store Dask Dataframe to Hierarchical Data Format (HDF) files |
to_records (df) |
Create Dask Array from a Dask Dataframe |
to_bag (df[, index]) |
Create Dask Bag from a Dask DataFrame |
to_delayed (df) |
Create Dask Delayed objects from a Dask Dataframe |
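These writers are normally invoked as methods on the collection. A minimal sketch, with illustrative output paths; to_parquet assumes a parquet engine such as fastparquet or pyarrow is installed:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(100)}), npartitions=4)
>>> ddf.to_csv('out-*.csv')     # the '*' is replaced by each partition number
>>> ddf.to_parquet('out.parquet')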
DataFrame Methods¶
-
class dask.dataframe.DataFrame(dsk, name, meta, divisions)¶ Parallel Pandas DataFrame
Do not use this class directly. Instead use functions like dd.read_csv, dd.read_parquet, or dd.from_pandas.
Parameters: - dask (dict) – The dask graph to compute this DataFrame
- name (str) – The key prefix that specifies which keys in the dask comprise this particular DataFrame
- meta (pandas.DataFrame) – An empty pandas.DataFrame with names, dtypes, and index matching the expected output.
- divisions (tuple of index values) – Values along which we partition our blocks on the index
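For instance, a minimal sketch of the intended construction path, going through dd.from_pandas rather than the class constructor (values here are illustrative):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'x': [1, 2, 3, 4]}, index=[10, 20, 30, 40])
>>> ddf = dd.from_pandas(pdf, npartitions=2)
>>> ddf.divisions  # index values marking the partition boundaries
(10, 30, 40)
>>> ddf.npartitions
2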
-
abs
()¶ Return an object with absolute value taken. Only applicable to objects that are all numeric.
Returns: abs Return type: type of caller
-
add
(other, axis='columns', level=None, fill_value=None)¶ Addition of dataframe and other, element-wise (binary operator add).
Equivalent to
dataframe + other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: - other (Series, DataFrame, or constant) –
- axis ({0, 1, 'index', 'columns'}) – For Series input, axis to match Series index on
- fill_value (None or float value, default None) – Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Notes
Mismatched indices will be unioned together
Returns: result Return type: DataFrame See also
-
align
(other, join='outer', axis=None, fill_value=None)¶ Align two objects on their axes with the specified join method for each axis Index
Parameters: - other (DataFrame or Series) –
- join ({'outer', 'inner', 'left', 'right'}, default 'outer') –
- axis (allowed axis of the other object, default None) – Align on index (0), columns (1), or both (None)
- level (int or level name, default None) – Broadcast across a level, matching Index values on the passed MultiIndex level
- copy (boolean, default True) – Always returns new objects. If copy=False and no reindexing is required then original objects are returned.
- fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value
- method (str, default None) –
- limit (int, default None) –
- fill_axis ({0 or 'index', 1 or 'columns'}, default 0) – Filling axis, method and limit
- broadcast_axis ({0 or 'index', 1 or 'columns'}, default None) –
Broadcast values along this axis, if aligning two objects of different dimensions
New in version 0.17.0.
Returns: (left, right) –
Aligned objects
Dask doesn’t support the following argument(s).
- level
- copy
- method
- limit
- fill_axis
- broadcast_axis
Return type: (DataFrame, type of other)
-
all
(axis=None, skipna=True, split_every=False)¶ Return whether all elements are True over requested axis
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- bool_only (boolean, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
Returns: all
Notes
Dask doesn’t support the following argument(s).
- bool_only
- level
Return type: Series or DataFrame (if level specified)
-
any
(axis=None, skipna=True, split_every=False)¶ Return whether any element is True over requested axis
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- bool_only (boolean, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
Returns: any
Notes
Dask doesn’t support the following argument(s).
- bool_only
- level
Return type: Series or DataFrame (if level specified)
-
append
(other)¶ Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.
Parameters: - other (DataFrame or Series/dict-like object, or list of these) – The data to append.
- ignore_index (boolean, default False) – If True, do not use the index labels.
- verify_integrity (boolean, default False) – If True, raise ValueError on creating index with duplicates.
Returns: appended
Return type: Notes
If a list of dict/series is passed and the keys are all contained in the DataFrame’s index, the order of the columns in the resulting DataFrame will be unchanged.
See also
pandas.concat()
- General function to concatenate DataFrame, Series or Panel objects
Examples
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB')) >>> df A B 0 1 2 1 3 4 >>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB')) >>> df.append(df2) A B 0 1 2 1 3 4 0 5 6 1 7 8
With ignore_index set to True:
>>> df.append(df2, ignore_index=True) A B 0 1 2 1 3 4 2 5 6 3 7 8
Dask doesn’t support the following argument(s).
- ignore_index
- verify_integrity
-
apply
(func, axis=0, args=(), meta='__no_default__', **kwds)¶ Parallel version of pandas.DataFrame.apply
This mimics the pandas version except for the following:
- Only
axis=1
is supported (and must be specified explicitly). - The user should provide output metadata via the meta keyword.
Parameters: - func (function) – Function to apply to each column/row
- axis ({0 or 'index', 1 or 'columns'}, default 0) –
- 0 or ‘index’: apply function to each column (NOT SUPPORTED)
- 1 or ‘columns’: apply function to each row
- meta (pd.DataFrame, pd.Series, dict, iterable, tuple, optional) – An empty
pd.DataFrame
orpd.Series
that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of aDataFrame
, adict
of{name: dtype}
or iterable of(name, dtype)
can be provided. Instead of a series, a tuple of(name, dtype)
can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providingmeta
is recommended. For more information, seedask.dataframe.utils.make_meta
. - args (tuple) – Positional arguments to pass to function in addition to the array/series
- keyword arguments will be passed as keywords to the function (Additional) –
Returns: applied
Return type: Series or DataFrame
Examples
>>> import dask.dataframe as dd >>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5], ... 'y': [1., 2., 3., 4., 5.]}) >>> ddf = dd.from_pandas(df, npartitions=2)
Apply a function to row-wise passing in extra arguments in
args
andkwargs
:>>> def myadd(row, a, b=1): ... return row.sum() + a + b >>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5)
By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the
meta
keyword. This can be specified in many forms, for more information seedask.dataframe.utils.make_meta
.Here we specify the output is a Series with name
'x'
, and dtypefloat64
:>>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5, meta=('x', 'f8'))
In the case where the metadata doesn’t change, you can also pass in the object itself directly:
>>> res = ddf.apply(lambda row: row + 1, axis=1, meta=ddf)
See also
dask.DataFrame.map_partitions()
- Only
-
applymap
(func, meta='__no_default__')¶ Apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing map(func, series) for each series in the DataFrame
Parameters: func (function) – Python function, returns a single value from a single value Examples
>>> df = pd.DataFrame(np.random.randn(3, 3)) >>> df 0 1 2 0 -0.029638 1.081563 1.280300 1 0.647747 0.831136 -1.549481 2 0.513416 -0.884417 0.195343 >>> df = df.applymap(lambda x: '%.2f' % x) >>> df 0 1 2 0 -0.03 1.08 1.28 1 0.65 0.83 -1.55 2 0.51 -0.88 0.20
Returns: applied Return type: DataFrame See also
DataFrame.apply()
- For operations on rows/columns
-
assign
(**kwargs)¶ Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones.
New in version 0.16.0.
Parameters: kwargs (keyword, value pairs) – keywords are the column names. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned. Returns: df – A new DataFrame with the new columns in addition to all the existing columns. Return type: DataFrame Notes
Since
kwargs
is a dictionary, the order of your arguments may not be preserved. To make things predictable, the columns are inserted in alphabetical order, at the end of your DataFrame. Assigning multiple columns within the same
is possible, but you cannot reference other columns created within the sameassign
call.Examples
>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
Where the value is a callable, evaluated on df:
>>> df.assign(ln_A = lambda x: np.log(x.A)) A B ln_A 0 1 0.426905 0.000000 1 2 -0.780949 0.693147 2 3 -0.418711 1.098612 3 4 -0.269708 1.386294 4 5 -0.274002 1.609438 5 6 -0.500792 1.791759 6 7 1.649697 1.945910 7 8 -1.495604 2.079442 8 9 0.549296 2.197225 9 10 -0.758542 2.302585
Where the value already exists and is inserted:
>>> newcol = np.log(df['A']) >>> df.assign(ln_A=newcol) A B ln_A 0 1 0.426905 0.000000 1 2 -0.780949 0.693147 2 3 -0.418711 1.098612 3 4 -0.269708 1.386294 4 5 -0.274002 1.609438 5 6 -0.500792 1.791759 6 7 1.649697 1.945910 7 8 -1.495604 2.079442 8 9 0.549296 2.197225 9 10 -0.758542 2.302585
-
astype
(dtype)¶ Cast object to input numpy.dtype Return a copy when copy = True (be really careful with this!)
Parameters: - dtype (data type, or dict of column name -> data type) – Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, ...}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
- raise_on_error (raise on invalid input) –
- kwargs (keyword arguments to pass on to the constructor) –
Returns: casted
Return type: type of caller
Notes
Dask doesn’t support the following argument(s).
- copy
- raise_on_error
-
bfill
(axis=None, limit=None)¶ Synonym for NDFrame.fillna(method='bfill')
Notes
Dask doesn’t support the following argument(s).
- inplace
- downcast
-
categorize
(columns=None, index=None, split_every=None, **kwargs)¶ Convert columns of the DataFrame to category dtype.
Parameters: - columns (list, optional) – A list of column names to convert to categoricals. By default any column with an object dtype is converted to a categorical, and any unknown categoricals are made known.
- index (bool, optional) – Whether to categorize the index. By default, object indices are converted to categorical, and unknown categorical indices are made known. Set True to always categorize the index, False to never.
- split_every (int, optional) – Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 16.
- kwargs – Keyword arguments are passed on to compute.
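A minimal sketch (column names illustrative): categorize walks the data once, so unlike most Dask DataFrame methods it triggers computation immediately.
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'city': ['NYC', 'LA', 'NYC'], 'amount': [1, 2, 3]})
>>> ddf = dd.from_pandas(pdf, npartitions=2)
>>> ddf2 = ddf.categorize(columns=['city'])  # dtype of 'city' is now category, with known categories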
-
clear_divisions
()¶ Forget division information
-
clip
(lower=None, upper=None, out=None)¶ Trim values at input threshold(s).
Parameters: - lower (float or array_like, default None) –
- upper (float or array_like, default None) –
- axis (int or string axis name, optional) – Align object with lower and upper along the given axis.
Returns: clipped
Return type: Examples
>>> df 0 1 0 0.335232 -1.256177 1 -1.367855 0.746646 2 0.027753 -1.176076 3 0.230930 -0.679613 4 1.261967 0.570967 >>> df.clip(-1.0, 0.5) 0 1 0 0.335232 -1.000000 1 -1.000000 0.500000 2 0.027753 -1.000000 3 0.230930 -0.679613 4 0.500000 0.500000 >>> t 0 -0.3 1 -0.2 2 -0.1 3 0.0 4 0.1 dtype: float64 >>> df.clip(t, t + 1, axis=0) 0 1 0 0.335232 -0.300000 1 -0.200000 0.746646 2 0.027753 -0.100000 3 0.230930 0.000000 4 1.100000 0.570967
Notes
Dask doesn’t support the following argument(s).
- axis
-
clip_lower
(threshold)¶ Return copy of the input with values below given value(s) truncated.
Parameters: - threshold (float or array_like) –
- axis (int or string axis name, optional) – Align object with threshold along the given axis.
See also
Returns: clipped Return type: same type as input Notes
Dask doesn’t support the following argument(s).
- axis
-
clip_upper
(threshold)¶ Return copy of input with values above given value(s) truncated.
Parameters: - threshold (float or array_like) –
- axis (int or string axis name, optional) – Align object with threshold along the given axis.
See also
Returns: clipped Return type: same type as input Notes
Dask doesn’t support the following argument(s).
- axis
-
combine
(other, func, fill_value=None, overwrite=True)¶ Add two DataFrame objects and do not propagate NaN values, so if for a (column, time) one frame is missing a value, it will default to the other frame’s value (which might be NaN as well)
Parameters: Returns: result
Return type:
-
combine_first
(other)¶ Combine two DataFrame objects and default to non-null values in frame calling the method. Result index columns will be the union of the respective indexes and columns
Parameters: other (DataFrame) – Examples
a’s values prioritized, use values from b to fill holes:
>>> a.combine_first(b)
Returns: combined Return type: DataFrame
-
compute
(**kwargs)¶ Compute this dask collection
This turns a lazy Dask collection into its in-memory equivalent. For example a Dask.array turns into a NumPy array and a Dask.dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation.
Parameters: - get (callable, optional) – A scheduler
get
function to use. If not provided, the default is to check the global settings first, and then fall back to the collection defaults. - optimize_graph (bool, optional) – If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
- kwargs – Extra keywords to forward to the scheduler
get
function.
- get (callable, optional) – A scheduler
-
copy
()¶ Make a copy of the dataframe
This is strictly a shallow copy of the underlying computational graph. It does not affect the underlying data
-
corr
(method='pearson', min_periods=None, split_every=False)¶ Compute pairwise correlation of columns, excluding NA/null values
Parameters: - method ({'pearson', 'kendall', 'spearman'}) –
- pearson : standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
- min_periods (int, optional) – Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation
Returns: y
Return type: - method ({'pearson', 'kendall', 'spearman'}) –
-
count
(axis=None, split_every=False)¶ Return Series with number of non-NA/null observations over requested axis. Works with non-floating point data as well (detects NaN and None)
Parameters: - axis ({0 or 'index', 1 or 'columns'}, default 0) – 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame
- numeric_only (boolean, default False) – Include only float, int, boolean data
Returns: count
Return type: Series (or DataFrame if level specified)
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
-
cov
(min_periods=None, split_every=False)¶ Compute pairwise covariance of columns, excluding NA/null values
Parameters: min_periods (int, optional) – Minimum number of observations required per pair of columns to have a valid result. Returns: y Return type: DataFrame Notes
y contains the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-1 (unbiased estimator).
-
cummax
(axis=None, skipna=True)¶ Return cumulative max over requested axis.
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: cummax
Return type:
-
cummin
(axis=None, skipna=True)¶ Return cumulative minimum over requested axis.
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: cummin
Return type:
-
cumprod
(axis=None, skipna=True)¶ Return cumulative product over requested axis.
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: cumprod
Return type:
-
cumsum
(axis=None, skipna=True)¶ Return cumulative sum over requested axis.
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: cumsum
Return type:
-
describe
(split_every=False)¶ Generate various summary statistics, excluding NaN values.
Parameters: - percentiles (array-like, optional) – The percentiles to include in the output. Should all be in the interval [0, 1]. By default percentiles is [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
- exclude (include,) –
Specify the form of the returned result. Either:
- None to both (default). The result will include only numeric-typed columns or, if none are, only categorical columns.
- A list of dtypes or strings to be included/excluded. To select all numeric types use numpy numpy.number. To select categorical objects use type object. See also the select_dtypes documentation. eg. df.describe(include=[‘O’])
- If include is the string ‘all’, the output column-set will match the input one.
Returns: summary
Return type: NDFrame of summary statistics
Notes
The output DataFrame index depends on the requested dtypes:
For numeric dtypes, it will include: count, mean, std, min, max, and lower, 50, and upper percentiles.
For object dtypes (e.g. timestamps or strings), the index will include the count, unique, most common, and frequency of the most common. Timestamps also include the first and last items.
For mixed dtypes, the index will be the union of the corresponding output types. Non-applicable entries will be filled with NaN. Note that mixed-dtype outputs can only be returned from mixed-dtype inputs and appropriate use of the include/exclude arguments.
If multiple values have the highest count, then the count and most common pair will be arbitrarily chosen from among those with the highest count.
The include, exclude arguments are ignored for Series.
See also
DataFrame.select_dtypes()
Notes
Dask doesn't support the following argument(s).
- percentiles
- include
- exclude
-
diff
(periods=1, axis=0)¶ 1st discrete difference of object
Parameters: - periods (int, default 1) – Periods to shift for forming difference
- axis ({0 or 'index', 1 or 'columns'}, default 0) –
Take difference over rows (0) or columns (1).
Returns: diffed
Return type:
-
div
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to
dataframe / other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: - other (Series, DataFrame, or constant) –
- axis ({0, 1, 'index', 'columns'}) – For Series input, axis to match Series index on
- fill_value (None or float value, default None) – Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Notes
Mismatched indices will be unioned together
Returns: result Return type: DataFrame See also
-
drop
(labels, axis=0, errors='raise')¶ Return new object with labels in requested axis removed.
Parameters: - labels (single label or list-like) –
- axis (int or axis name) –
- level (int or level name, default None) – For MultiIndex
- inplace (bool, default False) – If True, do operation inplace and return None.
- errors ({'ignore', 'raise'}, default 'raise') –
If ‘ignore’, suppress error and existing labels are dropped.
New in version 0.16.1.
Returns: dropped
Return type: type of caller
Notes
Dask doesn’t support the following argument(s).
- level
- inplace
-
drop_duplicates
(split_every=None, split_out=1, **kwargs)¶ Return DataFrame with duplicate rows removed, optionally only considering certain columns
Parameters: - subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns
- keep ({'first', 'last', False}, default 'first') –
first
: Drop duplicates except for the first occurrence.last
: Drop duplicates except for the last occurrence.- False : Drop all duplicates.
- take_last (deprecated) –
- inplace (boolean, default False) – Whether to drop duplicates in place or to return a copy
Returns: deduplicated
Return type:
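A minimal sketch (column names illustrative); in Dask this runs as a tree reduction, with split_every controlling how many partitions are combined at each step:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'k': [1, 1, 2, 2], 'v': [10, 10, 20, 30]})
>>> ddf = dd.from_pandas(pdf, npartitions=2)
>>> ddf.drop_duplicates().compute()              # unique rows across all partitions
>>> ddf.drop_duplicates(subset=['k']).compute()  # uniqueness judged on column 'k' only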
-
dropna
(how='any', subset=None)¶ Return object with labels on given axis omitted where alternately any or all of the data are missing
Parameters: - axis ({0 or 'index', 1 or 'columns'}, or tuple/list thereof) – Pass tuple or list to drop on multiple axes
- how ({'any', 'all'}) –
- any : if any NA values are present, drop that label
- all : if all values are NA, drop that label
- thresh (int, default None) – int value : require that many non-NA values
- subset (array-like) – Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include
- inplace (boolean, default False) – If True, do operation inplace and return None.
Returns: dropped
Return type: Notes
Dask doesn’t support the following argument(s).
- axis
- thresh
- inplace
-
eq
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods eq
-
eval
(expr, inplace=None, **kwargs)¶ Evaluate an expression in the context of the calling DataFrame instance.
Parameters: - expr (string) – The expression string to evaluate.
- inplace (bool) –
If the expression contains an assignment, whether to return a new DataFrame or mutate the existing.
WARNING: inplace=None currently falls back to True, but in a future version, will default to False. Use inplace=True explicitly rather than relying on the default.
New in version 0.18.0.
- kwargs (dict) – See the documentation for
eval()
for complete details on the keyword arguments accepted byquery()
.
Returns: ret
Return type: ndarray, scalar, or pandas object
See also
pandas.DataFrame.query()
,pandas.DataFrame.assign()
,pandas.eval()
Notes
For more details see the API documentation for
eval()
. For detailed examples see enhancing performance with eval.Examples
>>> from numpy.random import randn >>> from pandas import DataFrame >>> df = DataFrame(randn(10, 2), columns=list('ab')) >>> df.eval('a + b') >>> df.eval('c = a + b')
-
ffill
(axis=None, limit=None)¶ Synonym for NDFrame.fillna(method='ffill')
Notes
Dask doesn’t support the following argument(s).
- inplace
- downcast
-
fillna
(value=None, method=None, limit=None, axis=None)¶ Fill NA/NaN values using the specified method
Parameters: - value (scalar, dict, Series, or DataFrame) – Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.
- method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) – Method to use for filling holes in reindexed Series pad / ffill: propagate last valid observation forward to next valid backfill / bfill: use NEXT valid observation to fill gap
- axis ({0 or 'index', 1 or 'columns'}) –
- inplace (boolean, default False) – If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).
- limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.
- downcast (dict, default is None) – a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)
See also
reindex()
,asfreq()
Returns: filled
Notes
Dask doesn't support the following argument(s).
- inplace
- downcast
Return type: DataFrame
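A minimal sketch (values illustrative):
>>> import pandas as pd
>>> import numpy as np
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'x': [1.0, np.nan, 3.0, np.nan]})
>>> ddf = dd.from_pandas(pdf, npartitions=2)
>>> ddf.fillna(0).compute()               # replace NaN with a scalar
>>> ddf.fillna(method='ffill').compute()  # carry the last valid value forward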
-
first
(offset)¶ Convenience method for subsetting initial periods of time series data based on a date offset.
Parameters: offset (string, DateOffset, dateutil.relativedelta) – Examples
ts.first(‘10D’) -> First 10 days
Returns: subset Return type: type of caller
-
floordiv
(other, axis='columns', level=None, fill_value=None)¶ Integer division of dataframe and other, element-wise (binary operator floordiv).
Equivalent to
dataframe // other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: - other (Series, DataFrame, or constant) –
- axis ({0, 1, 'index', 'columns'}) – For Series input, axis to match Series index on
- fill_value (None or float value, default None) – Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Notes
Mismatched indices will be unioned together
Returns: result Return type: DataFrame See also
-
ge
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods ge
-
get_dtype_counts
()¶ Return the counts of dtypes in this object.
-
get_ftype_counts
()¶ Return the counts of ftypes in this object.
-
get_partition
(n)¶ Get a dask DataFrame/Series representing the nth partition.
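A minimal sketch: each partition is itself a lazy, single-partition Dask object covering one contiguous block of the index.
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(8)}), npartitions=4)
>>> part = ddf.get_partition(0)  # still lazy
>>> part.compute()               # materializes just that block as a pandas DataFrame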
-
groupby
(by=None, **kwargs)¶ Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
Parameters: - by (mapping function / list of functions, dict, Series, or tuple /) – list of column names. Called on each element of the object index to determine the groups. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups
- axis (int, default 0) –
- level (int, level name, or sequence of such, default None) – If the axis is a MultiIndex (hierarchical), group by a particular level or levels
- as_index (boolean, default True) – For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
- sort (boolean, default True) – Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.
- group_keys (boolean, default True) – When calling apply, add group keys to index to identify pieces
- squeeze (boolean, default False) – reduce the dimensionality of the return type if possible, otherwise return a consistent type
Examples
DataFrame results
>>> data.groupby(func, axis=0).mean() >>> data.groupby(['col1', 'col2'])['col3'].mean()
DataFrame with hierarchical index
>>> data.groupby(['col1', 'col2']).mean()
Returns: Return type: GroupBy object Notes
Dask doesn’t support the following argument(s).
- axis
- level
- as_index
- sort
- group_keys
- squeeze
-
gt
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods gt
-
head
(n=5, npartitions=1, compute=True)¶ First n rows of the dataset
Parameters: - n (int, optional) – The number of rows to return. Default is 5.
- npartitions (int, optional) – Elements are only taken from the first
npartitions
, with a default of 1. If there are fewer thann
rows in the firstnpartitions
a warning will be raised and any found rows returned. Pass -1 to use all partitions. - compute (bool, optional) – Whether to compute the result, default is True.
-
idxmax
(axis=None, skipna=True, split_every=False)¶ Return index of first occurrence of maximum over requested axis. NA/null values are excluded.
Parameters: - axis ({0 or 'index', 1 or 'columns'}, default 0) – 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be first index.
Returns: idxmax
Return type: Notes
This method is the DataFrame version of
ndarray.argmax
.See also
-
idxmin
(axis=None, skipna=True, split_every=False)¶ Return index of first occurrence of minimum over requested axis. NA/null values are excluded.
Parameters: - axis ({0 or 'index', 1 or 'columns'}, default 0) – 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: idxmin
Return type: Notes
This method is the DataFrame version of
ndarray.argmin
.See also
-
info
(buf=None, verbose=False, memory_usage=False)¶ Concise summary of a Dask DataFrame.
-
isin
(values)¶ Return boolean DataFrame showing whether each element in the DataFrame is contained in values.
Parameters: values (iterable, Series, DataFrame or dictionary) – The result will only be true at a location if all the labels match. If values is a Series, that’s the index. If values is a dictionary, the keys must be the column names, which must match. If values is a DataFrame, then both the index and column labels must match. Returns: Return type: DataFrame of booleans Examples
When
values
is a list:>>> df = DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']}) >>> df.isin([1, 3, 12, 'a']) A B 0 True True 1 False False 2 True False
When
values
is a dict:>>> df = DataFrame({'A': [1, 2, 3], 'B': [1, 4, 7]}) >>> df.isin({'A': [1, 3], 'B': [4, 7, 12]}) A B 0 True False # Note that B didn't match the 1 here. 1 False True 2 True True
When
values
is a Series or DataFrame:>>> df = DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']}) >>> other = DataFrame({'A': [1, 3, 3, 2], 'B': ['e', 'f', 'f', 'e']}) >>> df.isin(other) A B 0 True False 1 False False # Column A in `other` has a 3, but not at index 1. 2 True True
-
isnull
()¶ Return a boolean same-sized object indicating if the values are null.
See also
notnull()
- boolean inverse of isnull
-
iterrows
()¶ Iterate over DataFrame rows as (index, Series) pairs.
Notes
Because
iterrows
returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float']) >>> row = next(df.iterrows())[1] >>> row int 1.0 float 1.5 Name: 0, dtype: float64 >>> print(row['int'].dtype) float64 >>> print(df['int'].dtype) int64
To preserve dtypes while iterating over the rows, it is better to use
itertuples()
which returns namedtuples of the values and which is generally faster thaniterrows
.You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Returns: it – A generator that iterates over the rows of the frame. Return type: generator See also
itertuples()
- Iterate over DataFrame rows as namedtuples of the values.
iteritems()
- Iterate over (column name, Series) pairs.
-
itertuples
()¶ Iterate over DataFrame rows as namedtuples, with index value as first element of the tuple.
Parameters: - index (boolean, default True) – If True, return the index as the first element of the tuple.
- name (string, default "Pandas") – The name of the returned namedtuples or None to return regular tuples.
Notes
The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
See also
iterrows()
- Iterate over DataFrame rows as (index, Series) pairs.
iteritems()
- Iterate over (column name, Series) pairs.
Examples
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b']) >>> df col1 col2 a 1 0.1 b 2 0.2 >>> for row in df.itertuples(): ... print(row) ... Pandas(Index='a', col1=1, col2=0.10000000000000001) Pandas(Index='b', col1=2, col2=0.20000000000000001)
Dask doesn’t support the following argument(s).
- index
- name
-
join
(other, on=None, how='left', lsuffix='', rsuffix='', npartitions=None, shuffle=None)¶ Join columns with other DataFrame either on index or on a key column. Efficiently Join multiple DataFrame objects by index at once by passing a list.
Parameters: - other (DataFrame, Series with name field set, or list of DataFrame) – Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame
- on (column name, tuple/list of column names, or array-like) – Column(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiples columns given, the passed DataFrame must have a MultiIndex. Can pass an array as the join key if not already contained in the calling DataFrame. Like an Excel VLOOKUP operation
- how ({'left', 'right', 'outer', 'inner'}, default: 'left') –
How to handle the operation of the two objects.
- left: use calling frame’s index (or column if on is specified)
- right: use other frame’s index
- outer: form union of calling frame’s index (or column if on is specified) with other frame’s index
- inner: form intersection of calling frame’s index (or column if on is specified) with other frame’s index
- lsuffix (string) – Suffix to use from left frame’s overlapping columns
- rsuffix (string) – Suffix to use from right frame’s overlapping columns
- sort (boolean, default False) – Order result DataFrame lexicographically by the join key. If False, preserves the index order of the calling (left) DataFrame
Notes
on, lsuffix, and rsuffix options are not supported when passing a list of DataFrame objects
Examples
>>> caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'], ... 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> caller A key 0 A0 K0 1 A1 K1 2 A2 K2 3 A3 K3 4 A4 K4 5 A5 K5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'], ... 'B': ['B0', 'B1', 'B2']})
>>> other B key 0 B0 K0 1 B1 K1 2 B2 K2
Join DataFrames using their indexes.
>>> caller.join(other, lsuffix='_caller', rsuffix='_other')
>>> A key_caller B key_other 0 A0 K0 B0 K0 1 A1 K1 B1 K1 2 A2 K2 B2 K2 3 A3 K3 NaN NaN 4 A4 K4 NaN NaN 5 A5 K5 NaN NaN
If we want to join using the key columns, we need to set key to be the index in both caller and other. The joined DataFrame will have key as its index.
>>> caller.set_index('key').join(other.set_index('key'))
>>> A B key K0 A0 B0 K1 A1 B1 K2 A2 B2 K3 A3 NaN K4 A4 NaN K5 A5 NaN
Another option to join using the key columns is to use the on parameter. DataFrame.join always uses other’s index but we can use any column in the caller. This method preserves the original caller’s index in the result.
>>> caller.join(other.set_index('key'), on='key')
>>> A key B 0 A0 K0 B0 1 A1 K1 B1 2 A2 K2 B2 3 A3 K3 NaN 4 A4 K4 NaN 5 A5 K5 NaN
See also
DataFrame.merge()
- For column(s)-on-columns(s) operations
Returns: - joined (DataFrame)
Notes
Dask doesn’t support the following argument(s).
- sort
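As a brief illustration (a minimal sketch, not from the original docstring, using small frames built with dd.from_pandas), an index-on-index join of two dask DataFrames works the same way as the pandas examples above:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> left = dd.from_pandas(pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c']), npartitions=1)
>>> right = dd.from_pandas(pd.DataFrame({'B': [4, 5]}, index=['a', 'b']), npartitions=1)
>>> joined = left.join(right, how='left')
>>> joined.compute()  # pandas result; column B is float with NaN for the unmatched row 'c'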
-
last
(offset)¶ Convenience method for subsetting final periods of time series data based on a date offset.
Parameters: offset (string, DateOffset, dateutil.relativedelta) –
Examples
ts.last('5M') -> Last 5 months
Returns: subset Return type: type of caller
-
le
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods le
-
lt
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods lt
-
map_overlap
(func, before, after, *args, **kwargs)¶ Apply a function to each partition, sharing rows with adjacent partitions.
This can be useful for implementing windowing functions such as
df.rolling(...).mean()
or df.diff().
Parameters: - func (function) – Function applied to each partition.
- before (int) – The number of rows to prepend to partition i from the end of partition i - 1.
- after (int) – The number of rows to append to partition i from the beginning of partition i + 1.
- args, kwargs – Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.
- meta (pd.DataFrame, pd.Series, dict, iterable, tuple, optional) – An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
Notes
Given positive integers before and after, and a function func, map_overlap does the following:
- Prepend before rows to each partition i from the end of partition i - 1. The first partition has no rows prepended.
- Append after rows to each partition i from the beginning of partition i + 1. The last partition has no rows appended.
- Apply func to each partition, passing in any extra args and kwargs if provided.
- Trim before rows from the beginning of all but the first partition.
- Trim after rows from the end of all but the last partition.
Note that the index and divisions are assumed to remain unchanged.
Examples
Given a DataFrame, Series, or Index, such as:
>>> import dask.dataframe as dd >>> df = pd.DataFrame({'x': [1, 2, 4, 7, 11], ... 'y': [1., 2., 3., 4., 5.]}) >>> ddf = dd.from_pandas(df, npartitions=2)
A rolling sum with a trailing moving window of size 2 can be computed by overlapping 2 rows before each partition, and then mapping calls to df.rolling(2).sum():
>>> ddf.compute() x y 0 1 1.0 1 2 2.0 2 4 3.0 3 7 4.0 4 11 5.0 >>> ddf.map_overlap(lambda df: df.rolling(2).sum(), 2, 0).compute() x y 0 NaN NaN 1 3.0 3.0 2 6.0 5.0 3 11.0 7.0 4 18.0 9.0
The pandas diff method computes a discrete difference shifted by a number of periods (can be positive or negative). This can be implemented by mapping calls to df.diff to each partition after prepending/appending that many rows, depending on sign:
>>> def diff(df, periods=1): ... before, after = (periods, 0) if periods > 0 else (0, -periods) ... return df.map_overlap(lambda df, periods=1: df.diff(periods), ... periods, 0, periods=periods) >>> diff(ddf, 1).compute() x y 0 NaN NaN 1 1.0 1.0 2 2.0 1.0 3 3.0 1.0 4 4.0 1.0
If you have a DatetimeIndex, you can use a timedelta for time-based windows.
>>> ts = pd.Series(range(10), index=pd.date_range('2017', periods=10)) >>> dts = dd.from_pandas(ts, npartitions=2) >>> dts.map_overlap(lambda df: df.rolling('2D').sum(), ... pd.Timedelta('2D'), 0).compute() 2017-01-01 0.0 2017-01-02 1.0 2017-01-03 3.0 2017-01-04 5.0 2017-01-05 7.0 2017-01-06 9.0 2017-01-07 11.0 2017-01-08 13.0 2017-01-09 15.0 2017-01-10 17.0 dtype: float64
-
map_partitions
(func, *args, **kwargs)¶ Apply Python function on each DataFrame partition.
Note that the index and divisions are assumed to remain unchanged.
Parameters: - func (function) – Function applied to each partition.
- args, kwargs – Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.
- meta (pd.DataFrame, pd.Series, dict, iterable, tuple, optional) – An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
Examples
Given a DataFrame, Series, or Index, such as:
>>> import dask.dataframe as dd >>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5], ... 'y': [1., 2., 3., 4., 5.]}) >>> ddf = dd.from_pandas(df, npartitions=2)
One can use map_partitions to apply a function on each partition. Extra arguments and keywords can optionally be provided, and will be passed to the function after the partition.
Here we apply a function with arguments and keywords to a DataFrame, resulting in a Series:
>>> def myadd(df, a, b=1): ... return df.x + df.y + a + b >>> res = ddf.map_partitions(myadd, 1, b=2) >>> res.dtype dtype('float64')
By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms; for more information see dask.dataframe.utils.make_meta.
Here we specify the output is a Series with no name, and dtype float64:
>>> res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))
Here we map a function that takes in a DataFrame, and returns a DataFrame with a new column:
>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y)) >>> res.dtypes x int64 y float64 z float64 dtype: object
As before, the output metadata can also be specified manually. This time we pass in a dict, as the output is a DataFrame:
>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y), ... meta={'x': 'i8', 'y': 'f8', 'z': 'f8'})
In the case where the metadata doesn’t change, you can also pass in the object itself directly:
>>> res = ddf.map_partitions(lambda df: df.head(), meta=df)
Also note that the index and divisions are assumed to remain unchanged. If the function you’re mapping changes the index/divisions, you’ll need to clear them afterwards:
>>> ddf.map_partitions(func).clear_divisions()
-
mask
(cond, other=nan)¶ Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.
Parameters: - cond (boolean NDFrame, array or callable) –
If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1.
A callable can be used as cond.
- other (scalar, NDFrame, or callable) –
If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1.
A callable can be used as other.
- inplace (boolean, default False) – Whether to perform the operation in place on the data
- axis (alignment axis if needed, default None) –
- level (alignment level if needed, default None) –
- try_cast (boolean, default False) – try to cast the result back to the input type (if possible),
- raise_on_error (boolean, default True) – Whether to raise on invalid data types (e.g. trying to where on strings)
Returns: wh
Return type: same type as caller
Notes
The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from the DataFrame other is used.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the mask documentation in indexing.
Examples
>>> s = pd.Series(range(5)) >>> s.where(s > 0) 0 NaN 1 1.0 2 2.0 3 3.0 4 4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B']) >>> m = df % 3 == 0 >>> df.where(m, -df) A B 0 0 -1 1 -2 3 2 -4 -5 3 6 -7 4 -8 9 >>> df.where(m, -df) == np.where(m, df, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True >>> df.where(m, -df) == df.mask(~m, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True
See also
DataFrame.where()
Notes
Dask doesn’t support the following argument(s).
- inplace
- axis
- level
- try_cast
- raise_on_error
-
max
(axis=None, skipna=True, split_every=False)¶ This method returns the maximum of the values in the object. If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only (boolean, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: max
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
Return type: Series or DataFrame (if level specified)
-
mean
(axis=None, skipna=True, split_every=False)¶ Return the mean of the values for the requested axis
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only (boolean, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: mean
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
Return type: Series or DataFrame (if level specified)
-
memory_usage
(index=True, deep=False)¶ Memory usage of DataFrame columns.
Parameters: - index (bool, default True) – Specifies whether to include the memory usage of the DataFrame’s index.
- deep (bool, default False) – If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption.
Returns: sizes – A Series with column names as index and memory usage of columns with units of bytes.
Return type: Series
Notes
Memory usage does not include memory consumed by elements that are not components of the array if deep=False
See also
numpy.ndarray.nbytes()
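For instance (a small sketch, not from the original docstring, using a frame built with dd.from_pandas), the per-column byte counts are returned lazily and can be computed:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': list(range(1000)), 'y': list(range(1000))}), npartitions=4)
>>> sizes = ddf.memory_usage(index=True)   # lazy Series of byte counts
>>> sizes.compute()                        # pandas Series indexed by column name (plus the index entry)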
-
merge
(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'), indicator=False, npartitions=None, shuffle=None)¶ Merge DataFrame objects by performing a database-style join operation by columns or indexes.
If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
Parameters: - right (DataFrame) –
- how ({'left', 'right', 'outer', 'inner'}, default 'inner') –
- left: use only keys from left frame (SQL: left outer join)
- right: use only keys from right frame (SQL: right outer join)
- outer: use union of keys from both frames (SQL: full outer join)
- inner: use intersection of keys from both frames (SQL: inner join)
- on (label or list) – Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.
- left_on (label or list, or array-like) – Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns
- right_on (label or list, or array-like) – Field names to join on in right DataFrame or vector/list of vectors per left_on docs
- left_index (boolean, default False) – Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels
- right_index (boolean, default False) – Use the index from the right DataFrame as the join key. Same caveats as left_index
- sort (boolean, default False) – Sort the join keys lexicographically in the result DataFrame
- suffixes (2-length sequence (tuple, list, ...)) – Suffix to apply to overlapping column names in the left and right side, respectively
- copy (boolean, default True) – If False, do not copy data unnecessarily
- indicator (boolean or string, default False) –
If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.
New in version 0.17.0.
Examples
>>> A >>> B lkey value rkey value 0 foo 1 0 foo 5 1 bar 2 1 bar 6 2 baz 3 2 qux 7 3 foo 4 3 bar 8
>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer') lkey value_x rkey value_y 0 foo 1 foo 5 1 foo 4 foo 5 2 bar 2 bar 6 3 bar 2 bar 8 4 baz 3 NaN NaN 5 NaN NaN qux 7
Returns: merged – The output type will be the same as ‘left’, if it is a subclass of DataFrame. Return type: DataFrame See also
merge_ordered()
,merge_asof()
Dask doesn’t support the following argument(s).
- sort
- copy
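A minimal sketch (not from the original docstring; data and column names are illustrative) of a column-on-column merge between two dask DataFrames:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> A = dd.from_pandas(pd.DataFrame({'lkey': ['foo', 'bar', 'baz'], 'value': [1, 2, 3]}), npartitions=1)
>>> B = dd.from_pandas(pd.DataFrame({'rkey': ['foo', 'bar', 'qux'], 'value': [5, 6, 7]}), npartitions=1)
>>> merged = A.merge(B, left_on='lkey', right_on='rkey', how='inner', suffixes=('_x', '_y'))
>>> merged.compute()  # rows for 'foo' and 'bar'; columns lkey, value_x, rkey, value_y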
-
min
(axis=None, skipna=True, split_every=False)¶ This method returns the minimum of the values in the object. If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only (boolean, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: min
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
Return type: Series or DataFrame (if level specified)
-
mod
(other, axis='columns', level=None, fill_value=None)¶ Modulo of dataframe and other, element-wise (binary operator mod).
Equivalent to
dataframe % other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: - other (Series, DataFrame, or constant) –
- axis ({0, 1, 'index', 'columns'}) – For Series input, axis to match Series index on
- fill_value (None or float value, default None) – Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Notes
Mismatched indices will be unioned together
Returns: result Return type: DataFrame See also
-
mul
(other, axis='columns', level=None, fill_value=None)¶ Multiplication of dataframe and other, element-wise (binary operator mul).
Equivalent to
dataframe * other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: - other (Series, DataFrame, or constant) –
- axis ({0, 1, 'index', 'columns'}) – For Series input, axis to match Series index on
- fill_value (None or float value, default None) – Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Notes
Mismatched indices will be unioned together
Returns: result Return type: DataFrame See also
-
ne
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods ne
-
nlargest
(n=5, columns=None, split_every=None)¶ Get the rows of a DataFrame sorted by the n largest values of columns.
New in version 0.17.0.
Parameters: - n (int) – Number of items to retrieve
- columns (list or str) – Column name or names to order by
- keep ({'first', 'last', False}, default 'first') – Where there are duplicate values:
- first : take the first occurrence.
- last : take the last occurrence.
Returns: DataFrame
Examples
>>> df = DataFrame({'a': [1, 10, 8, 11, -1], ... 'b': list('abdce'), ... 'c': [1.0, 2.0, np.nan, 3.0, 4.0]}) >>> df.nlargest(3, 'a') a b c 3 11 c 3 1 10 b 2 2 8 d NaN
Notes
Dask doesn’t support the following argument(s).
- keep
-
notnull
()¶ Return a boolean same-sized object indicating if the values are not null.
See also
isnull()
- boolean inverse of notnull
-
nsmallest
(n=5, columns=None, split_every=None)¶ Get the rows of a DataFrame sorted by the n smallest values of columns.
New in version 0.17.0.
Parameters: - n (int) – Number of items to retrieve
- columns (list or str) – Column name or names to order by
- keep ({'first', 'last', False}, default 'first') – Where there are duplicate values:
- first : take the first occurrence.
- last : take the last occurrence.
Returns: DataFrame
Examples
>>> df = DataFrame({'a': [1, 10, 8, 11, -1], ... 'b': list('abdce'), ... 'c': [1.0, 2.0, np.nan, 3.0, 4.0]}) >>> df.nsmallest(3, 'a') a b c 4 -1 e 4 0 1 a 1 2 8 d NaN
Notes
Dask doesn’t support the following argument(s).
- keep
-
nunique_approx
(split_every=None)¶ Approximate number of unique rows.
This method uses the HyperLogLog algorithm for cardinality estimation to compute the approximate number of unique rows. The approximate error is 0.406%.
Parameters: split_every (int, optional) – Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 8.
Returns: a float representing the approximate number of elements
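For example (a sketch, not from the original docstring), the result is a lazy scalar that must be computed:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 1, 2, 3, 3, 3]}), npartitions=2)
>>> ddf.nunique_approx().compute()  # close to 3.0 (HyperLogLog estimate of unique rows)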
-
persist
(**kwargs)¶ Persist this dask collection into memory
See
dask.base.persist
for full docstring
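For example (a sketch; ddf is assumed to be an existing dask DataFrame with a column x, and the scheduler is assumed to have memory available):
>>> ddf = ddf.persist()        # start computing and keep the partition results in memory
>>> ddf.x.sum().compute()      # later operations reuse the persisted partitions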
-
pipe
(func, *args, **kwargs)¶ Apply func(self, *args, **kwargs)
New in version 0.16.2.
Parameters: - func (function) – Function to apply to the NDFrame. args and kwargs are passed into func. Alternatively a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the NDFrame.
- args – positional arguments passed into func.
- kwargs – a dictionary of keyword arguments passed into func.
Returns: object
Return type: the return type of func.
Notes
Use .pipe when chaining together functions that expect Series or DataFrames. Instead of writing
>>> f(g(h(df), arg1=a), arg2=b, arg3=c)
You can write
>>> (df.pipe(h) ... .pipe(g, arg1=a) ... .pipe(f, arg2=b, arg3=c) ... )
If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose f takes its data as arg2:
>>> (df.pipe(h) ... .pipe(g, arg1=a) ... .pipe((f, 'arg2'), arg1=a, arg3=c) ... )
-
pivot_table
(index=None, columns=None, values=None, aggfunc='mean')¶ Create a spreadsheet-style pivot table as a DataFrame. Target columns must have category dtype to infer result’s columns. index, columns, values and aggfunc must be all scalar.
Parameters: - values (scalar) – column to aggregate
- index (scalar) – column to be index
- columns (scalar) – column to be columns
- aggfunc ({'mean', 'sum', 'count'}, default 'mean') –
Returns: table
Return type: DataFrame
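Because the columns argument must have a known category dtype, a typical pattern (a sketch, not from the original docstring; column names are illustrative) is to categorize first:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'group': ['a', 'b', 'a', 'b'],
...                    'day': [1, 1, 2, 2],
...                    'value': [1., 2., 3., 4.]})
>>> ddf = dd.from_pandas(df, npartitions=2).categorize(columns=['group'])
>>> table = ddf.pivot_table(index='day', columns='group', values='value', aggfunc='mean')
>>> table.compute()  # one row per day, one column per category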
-
pow
(other, axis='columns', level=None, fill_value=None)¶ Exponential power of dataframe and other, element-wise (binary operator pow).
Equivalent to
dataframe ** other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: - other (Series, DataFrame, or constant) –
- axis ({0, 1, 'index', 'columns'}) – For Series input, axis to match Series index on
- fill_value (None or float value, default None) – Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Notes
Mismatched indices will be unioned together
Returns: result Return type: DataFrame See also
-
prod
(axis=None, skipna=True, split_every=False)¶ Return the product of the values for the requested axis
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only (boolean, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: prod
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
Return type: Series or DataFrame (if level specified)
-
quantile
(q=0.5, axis=0)¶ Approximate row-wise and precise column-wise quantiles of DataFrame
Parameters: - q (list/array of floats, default 0.5 (50%)) – Iterable of numbers ranging from 0 to 1 for the desired quantiles
- axis ({0, 1, 'index', 'columns'} (default 0)) – 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
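For example (a sketch; ddf is assumed to be an existing numeric dask DataFrame):
>>> ddf.quantile(0.5).compute()            # approximate median of each column
>>> ddf.quantile([0.25, 0.75]).compute()   # several quantiles at once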
-
query
(expr, **kwargs)¶ Filter dataframe with complex expression
Blocked version of pd.DataFrame.query
This is like the sequential version except that this will also happen in many threads. This may conflict with numexpr, which will use multiple threads itself. We recommend that you set numexpr to use a single thread:
import numexpr
numexpr.set_nthreads(1)
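A brief sketch (not from the original docstring; column names are illustrative):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': list(range(10)), 'y': list(range(10))}), npartitions=2)
>>> ddf.query('x > 3 and y < 8').compute()  # filtered pandas DataFrame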
-
radd
(other, axis='columns', level=None, fill_value=None)¶ Addition of dataframe and other, element-wise (binary operator radd).
Equivalent to
other + dataframe
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: - other (Series, DataFrame, or constant) –
- axis ({0, 1, 'index', 'columns'}) – For Series input, axis to match Series index on
- fill_value (None or float value, default None) – Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Notes
Mismatched indices will be unioned together
Returns: result Return type: DataFrame See also
-
random_split
(frac, random_state=None)¶ Pseudorandomly split dataframe into different pieces row-wise
Parameters: - frac (list) – List of floats that should sum to one.
- random_state (int or np.random.RandomState) – If int, create a new RandomState with this as the seed. Otherwise draw from the passed RandomState.
Examples
50/50 split
>>> a, b = df.random_split([0.5, 0.5])
80/10/10 split, consistent random_state
>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)
See also
dask.DataFrame.sample()
-
rdiv
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to
other / dataframe
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: - other (Series, DataFrame, or constant) –
- axis ({0, 1, 'index', 'columns'}) – For Series input, axis to match Series index on
- fill_value (None or float value, default None) – Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Notes
Mismatched indices will be unioned together
Returns: result Return type: DataFrame See also
-
reduction
(chunk, aggregate=None, combine=None, meta='__no_default__', token=None, split_every=None, chunk_kwargs=None, aggregate_kwargs=None, combine_kwargs=None, **kwargs)¶ Generic row-wise reductions.
Parameters: - chunk (callable) – Function to operate on each partition. Should return a pandas.DataFrame, pandas.Series, or a scalar.
- aggregate (callable, optional) – Function to operate on the concatenated result of chunk. If not specified, defaults to chunk. Used to do the final aggregation in a tree reduction.
The input to aggregate depends on the output of chunk. If the output of chunk is a:
- scalar: Input is a Series, with one row per partition.
- Series: Input is a DataFrame, with one row per partition. Columns are the rows in the output series.
- DataFrame: Input is a DataFrame, with one row per partition. Columns are the columns in the output dataframes.
Should return a pandas.DataFrame, pandas.Series, or a scalar.
- combine (callable, optional) – Function to operate on intermediate concatenated results of chunk in a tree-reduction. If not provided, defaults to aggregate. The input/output requirements should match that of aggregate described above.
- meta (pd.DataFrame, pd.Series, dict, iterable, tuple, optional) – An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
- token (str, optional) – The name to use for the output keys.
- split_every (int, optional) – Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used, and all intermediates will be concatenated and passed to aggregate. Default is 8.
- chunk_kwargs (dict, optional) – Keyword arguments to pass on to chunk only.
- aggregate_kwargs (dict, optional) – Keyword arguments to pass on to aggregate only.
- combine_kwargs (dict, optional) – Keyword arguments to pass on to combine only.
- kwargs – All remaining keywords will be passed to chunk, combine, and aggregate.
Examples
>>> import pandas as pd >>> import dask.dataframe as dd >>> df = pd.DataFrame({'x': range(50), 'y': range(50, 100)}) >>> ddf = dd.from_pandas(df, npartitions=4)
Count the number of rows in a DataFrame. To do this, count the number of rows in each partition, then sum the results:
>>> res = ddf.reduction(lambda x: x.count(), ... aggregate=lambda x: x.sum()) >>> res.compute() x 50 y 50 dtype: int64
Count the number of rows in a Series with elements greater than or equal to a value (provided via a keyword).
>>> def count_greater(x, value=0): ... return (x >= value).sum() >>> res = ddf.x.reduction(count_greater, aggregate=lambda x: x.sum(), ... chunk_kwargs={'value': 25}) >>> res.compute() 25
Aggregate both the sum and count of a Series at the same time:
>>> def sum_and_count(x): ... return pd.Series({'sum': x.sum(), 'count': x.count()}) >>> res = ddf.x.reduction(sum_and_count, aggregate=lambda x: x.sum()) >>> res.compute() count 50 sum 1225 dtype: int64
Doing the same, but for a DataFrame. Here chunk returns a DataFrame, meaning the input to aggregate is a DataFrame with an index with non-unique entries for both ‘x’ and ‘y’. We groupby the index, and sum each group to get the final result.
>>> def sum_and_count(x): ... return pd.DataFrame({'sum': x.sum(), 'count': x.count()}) >>> res = ddf.reduction(sum_and_count, ... aggregate=lambda x: x.groupby(level=0).sum()) >>> res.compute() count sum x 50 1225 y 50 3725
-
rename
(index=None, columns=None)¶ Alter axes input function or functions. Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error. Alternatively, change
Series.name
with a scalar value (Series only).Parameters: - columns (index,) – Scalar or list-like will alter the
Series.name
attribute, and raise on DataFrame or Panel. dict-like or functions are transformations to apply to that axis’ values - copy (boolean, default True) – Also copy underlying data
- inplace (boolean, default False) – Whether to return a new DataFrame. If True then value of copy is ignored.
Returns: renamed
Return type: DataFrame (new object)
See also
pandas.NDFrame.rename_axis()
Examples
>>> s = pd.Series([1, 2, 3]) >>> s 0 1 1 2 2 3 dtype: int64 >>> s.rename("my_name") # scalar, changes Series.name 0 1 1 2 2 3 Name: my_name, dtype: int64 >>> s.rename(lambda x: x ** 2) # function, changes labels 0 1 1 2 4 3 dtype: int64 >>> s.rename({1: 3, 2: 5}) # mapping, changes labels 0 1 3 2 5 3 dtype: int64 >>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) >>> df.rename(2) ... TypeError: 'int' object is not callable >>> df.rename(index=str, columns={"A": "a", "B": "c"}) a c 0 1 4 1 2 5 2 3 6 >>> df.rename(index=str, columns={"A": "a", "C": "c"}) a B 0 1 4 1 2 5 2 3 6
-
repartition
(divisions=None, npartitions=None, freq=None, force=False)¶ Repartition dataframe along new divisions
Parameters: - divisions (list, optional) – List of partitions to be used. If specified npartitions will be ignored.
- npartitions (int, optional) – Number of partitions of output, must be less than npartitions of input. Only used if divisions isn’t specified.
- freq (str, pd.Timedelta) – A period on which to partition timeseries data like '7D' or '12h' or pd.Timedelta(hours=12). Assumes a datetime index.
- force (bool, default False) – Allows the expansion of the existing divisions. If False then the new divisions lower and upper bounds must be the same as the old divisions.
Examples
>>> df = df.repartition(npartitions=10) >>> df = df.repartition(divisions=[0, 5, 10, 20]) >>> df = df.repartition(freq='7d')
-
resample
(rule, how=None, closed=None, label=None)¶ Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.
Parameters: - rule (string) – the offset string or object representing target conversion
- axis (int, optional, default 0) –
- closed ({'right', 'left'}) – Which side of bin interval is closed
- label ({'right', 'left'}) – Which bin edge label to label bucket with
- convention ({'start', 'end', 's', 'e'}) –
- loffset (timedelta) – Adjust the resampled time labels
- base (int, default 0) – For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0
- on (string, optional) –
For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.
New in version 0.19.0.
- level (string or int, optional) –
For a MultiIndex, level (name or number) to use for resampling. Level must be datetime-like.
New in version 0.19.0.
To learn more about the offset strings, please see this link: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
Examples
Start by creating a series with 9 one minute timestamps.
>>> index = pd.date_range('1/1/2000', periods=9, freq='T') >>> series = pd.Series(range(9), index=index) >>> series 2000-01-01 00:00:00 0 2000-01-01 00:01:00 1 2000-01-01 00:02:00 2 2000-01-01 00:03:00 3 2000-01-01 00:04:00 4 2000-01-01 00:05:00 5 2000-01-01 00:06:00 6 2000-01-01 00:07:00 7 2000-01-01 00:08:00 8 Freq: T, dtype: int64
Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.
>>> series.resample('3T').sum() 2000-01-01 00:00:00 3 2000-01-01 00:03:00 12 2000-01-01 00:06:00 21 Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket, which it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value close the right side of the bin interval as illustrated in the example below this one.
>>> series.resample('3T', label='right').sum() 2000-01-01 00:03:00 3 2000-01-01 00:06:00 12 2000-01-01 00:09:00 21 Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.
>>> series.resample('3T', label='right', closed='right').sum() 2000-01-01 00:00:00 0 2000-01-01 00:03:00 6 2000-01-01 00:06:00 15 2000-01-01 00:09:00 15 Freq: 3T, dtype: int64
Upsample the series into 30 second bins.
>>> series.resample('30S').asfreq()[0:5] #select first 5 rows 2000-01-01 00:00:00 0 2000-01-01 00:00:30 NaN 2000-01-01 00:01:00 1 2000-01-01 00:01:30 NaN 2000-01-01 00:02:00 2 Freq: 30S, dtype: float64
Upsample the series into 30 second bins and fill the NaN values using the pad method.
>>> series.resample('30S').pad()[0:5] 2000-01-01 00:00:00 0 2000-01-01 00:00:30 0 2000-01-01 00:01:00 1 2000-01-01 00:01:30 1 2000-01-01 00:02:00 2 Freq: 30S, dtype: int64
Upsample the series into 30 second bins and fill the NaN values using the bfill method.
>>> series.resample('30S').bfill()[0:5] 2000-01-01 00:00:00 0 2000-01-01 00:00:30 1 2000-01-01 00:01:00 1 2000-01-01 00:01:30 2 2000-01-01 00:02:00 2 Freq: 30S, dtype: int64
Pass a custom function via
apply
>>> def custom_resampler(array_like): ... return np.sum(array_like)+5
>>> series.resample('3T').apply(custom_resampler) 2000-01-01 00:00:00 8 2000-01-01 00:03:00 17 2000-01-01 00:06:00 26 Freq: 3T, dtype: int64
Notes
Dask doesn’t support the following argument(s).
- axis
- fill_method
- convention
- kind
- loffset
- limit
- base
- on
- level
-
reset_index
(drop=False)¶ Reset the index to the default index.
Note that unlike in pandas, the reset dask.dataframe index will not be monotonically increasing from 0. Instead, it will restart at 0 for each partition (e.g. index1 = [0, ..., 10], index2 = [0, ...]). This is due to the inability to statically know the full length of the index.
For DataFrame with multi-level index, returns a new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. if any are None. For a standard index, the index name will be used (if set), otherwise a default ‘index’ or ‘level_0’ (if ‘index’ is already taken) will be used.
Parameters: drop (boolean, default False) – Do not try to insert index into dataframe columns.
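For example (a sketch, not from the original docstring, illustrating the per-partition restart described above):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [10, 20, 30, 40]}, index=list('abcd')), npartitions=2)
>>> ddf.reset_index(drop=True).compute()  # index is 0, 1, 0, 1 rather than 0, 1, 2, 3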
-
rfloordiv
(other, axis='columns', level=None, fill_value=None)¶ Integer division of dataframe and other, element-wise (binary operator rfloordiv).
Equivalent to
other // dataframe
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: - other (Series, DataFrame, or constant) –
- axis ({0, 1, 'index', 'columns'}) – For Series input, axis to match Series index on
- fill_value (None or float value, default None) – Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Notes
Mismatched indices will be unioned together
Returns: result Return type: DataFrame See also
-
rmod
(other, axis='columns', level=None, fill_value=None)¶ Modulo of dataframe and other, element-wise (binary operator rmod).
Equivalent to
other % dataframe
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: - other (Series, DataFrame, or constant) –
- axis ({0, 1, 'index', 'columns'}) – For Series input, axis to match Series index on
- fill_value (None or float value, default None) – Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Notes
Mismatched indices will be unioned together
Returns: result Return type: DataFrame See also
-
rmul
(other, axis='columns', level=None, fill_value=None)¶ Multiplication of dataframe and other, element-wise (binary operator rmul).
Equivalent to
other * dataframe
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: - other (Series, DataFrame, or constant) –
- axis ({0, 1, 'index', 'columns'}) – For Series input, axis to match Series index on
- fill_value (None or float value, default None) – Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Notes
Mismatched indices will be unioned together
Returns: result Return type: DataFrame See also
-
rolling
(window, min_periods=None, freq=None, center=False, win_type=None, axis=0)¶ Provides rolling transformations.
Parameters: - window (int, str, offset) –
Size of the moving window. This is the number of observations used for calculating the statistic. The window size must not be so large as to span more than one adjacent partition. If using an offset or offset alias like ‘5D’, the data must have a
DatetimeIndex
Changed in version 0.15.0: Now accepts offsets and string offset aliases
- min_periods (int, default None) – Minimum number of observations in window required to have a value (otherwise result is NA).
- center (boolean, default False) – Set the labels at the center of the window.
- win_type (string, default None) – Provide a window type. The recognized window types are identical to pandas.
- axis (int, default 0) –
Returns: Return type: a Rolling object on which to call a method to compute a statistic
Notes
The freq argument is not supported.
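For example (a minimal sketch, not from the original docstring), a rolling mean over a 3-row window:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1., 2., 3., 4., 5.]}), npartitions=2)
>>> ddf.x.rolling(3).mean().compute()  # first two values are NaN, then 2.0, 3.0, 4.0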
-
round
(decimals=0)¶ Round a DataFrame to a variable number of decimal places.
New in version 0.17.0.
Parameters: decimals (int, dict, Series) – Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored. Examples
>>> df = pd.DataFrame(np.random.random([3, 3]), ... columns=['A', 'B', 'C'], index=['first', 'second', 'third']) >>> df A B C first 0.028208 0.992815 0.173891 second 0.038683 0.645646 0.577595 third 0.877076 0.149370 0.491027 >>> df.round(2) A B C first 0.03 0.99 0.17 second 0.04 0.65 0.58 third 0.88 0.15 0.49 >>> df.round({'A': 1, 'C': 2}) A B C first 0.0 0.992815 0.17 second 0.0 0.645646 0.58 third 0.9 0.149370 0.49 >>> decimals = pd.Series([1, 0, 2], index=['A', 'B', 'C']) >>> df.round(decimals) A B C first 0.0 1 0.17 second 0.0 1 0.58 third 0.9 0 0.49
Returns: Return type: DataFrame object See also
numpy.around()
,Series.round()
-
rpow
(other, axis='columns', level=None, fill_value=None)¶ Exponential power of dataframe and other, element-wise (binary operator rpow).
Equivalent to
other ** dataframe
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: - other (Series, DataFrame, or constant) –
- axis ({0, 1, 'index', 'columns'}) – For Series input, axis to match Series index on
- fill_value (None or float value, default None) – Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Notes
Mismatched indices will be unioned together
Returns: result Return type: DataFrame See also
-
rsub
(other, axis='columns', level=None, fill_value=None)¶ Subtraction of dataframe and other, element-wise (binary operator rsub).
Equivalent to
other - dataframe
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: - other (Series, DataFrame, or constant) –
- axis ({0, 1, 'index', 'columns'}) – For Series input, axis to match Series index on
- fill_value (None or float value, default None) – Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Notes
Mismatched indices will be unioned together
Returns: result Return type: DataFrame See also
-
rtruediv
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to
other / dataframe
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: - other (Series, DataFrame, or constant) –
- axis ({0, 1, 'index', 'columns'}) – For Series input, axis to match Series index on
- fill_value (None or float value, default None) – Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Notes
Mismatched indices will be unioned together
Returns: result Return type: DataFrame See also
-
sample
(frac, replace=False, random_state=None)¶ Random sample of items
Parameters: - frac (float, optional) – Fraction of axis items to return.
- replace (boolean, optional) – Sample with or without replacement. Default = False.
- random_state (int or
np.random.RandomState
) – If int, we create a new RandomState with this as the seed. Otherwise we draw from the passed RandomState.
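For example (a sketch; ddf is assumed to be an existing dask DataFrame):
>>> small = ddf.sample(frac=0.1, random_state=42)   # roughly 10% of rows, reproducible
>>> boot = ddf.sample(frac=1.0, replace=True)       # bootstrap-style sample with replacement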
-
select_dtypes
(include=None, exclude=None)¶ Return a subset of a DataFrame including/excluding columns based on their dtype.
Parameters: include, exclude (list-like) – A list of dtypes or strings to be included/excluded. You must pass in a non-empty sequence for at least one of these.
Raises:
ValueError – If both of include and exclude are empty; if include and exclude have overlapping elements; if any kind of string dtype is passed in.
TypeError – If either of include or exclude is not a sequence
Returns: subset – The subset of the frame including the dtypes in include and excluding the dtypes in exclude.
Return type: DataFrame
Notes
- To select all numeric types use the numpy dtype numpy.number
- To select strings you must use the object dtype, but note that this will return all object dtype columns
- See the numpy dtype hierarchy
- To select Pandas categorical dtypes, use ‘category’
Examples
>>> df = pd.DataFrame({'a': np.random.randn(6).astype('f4'), ... 'b': [True, False] * 3, ... 'c': [1.0, 2.0] * 3}) >>> df a b c 0 0.3962 True 1 1 0.1459 False 2 2 0.2623 True 1 3 0.0764 False 2 4 -0.9703 True 1 5 -1.2094 False 2 >>> df.select_dtypes(include=['float64']) c 0 1 1 2 2 1 3 2 4 1 5 2 >>> df.select_dtypes(exclude=['floating']) b 0 True 1 False 2 True 3 False 4 True 5 False
-
sem
(axis=None, skipna=None, ddof=1, split_every=False)¶ Return unbiased standard error of the mean over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- ddof (int, default 1) – degrees of freedom
- numeric_only (boolean, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: sem
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
Return type: Series or DataFrame (if level specified)
-
set_index
(other, drop=True, sorted=False, npartitions=None, divisions=None, **kwargs)¶ Set the DataFrame index (row labels) using an existing column
This realigns the dataset to be sorted by a new column. This can have a significant impact on performance, because joins, groupbys, lookups, etc. are all much faster on that column. However, this performance increase comes with a cost, sorting a parallel dataset requires expensive shuffles. Often we set_index once directly after data ingest and filtering and then perform many cheap computations off of the sorted dataset.
This function operates exactly like pandas.set_index except with different performance costs (it is much more expensive). Under normal operation this function does an initial pass over the index column to compute approximate quantiles to serve as future divisions. It then passes over the data a second time, splitting up each input partition into several pieces and sharing those pieces to all of the output partitions now in sorted order.
In some cases we can alleviate those costs, for example if your dataset is sorted already then we can avoid making many small pieces or if you know good values to split the new index column then we can avoid the initial pass over the data. For example if your new index is a datetime index and your data is already sorted by day then this entire operation can be done for free. You can control these options with the following parameters.
Parameters: - df (Dask DataFrame) –
- index (string or Dask Series) –
- npartitions (int, None, or 'auto') – The ideal number of output partitions. If None use the same as the input. If ‘auto’ then decide by memory use.
- shuffle (string, optional) – Either
'disk'
for single-node operation or'tasks'
for distributed operation. Will be inferred by your current scheduler. - sorted (bool, optional) – If the index column is already sorted in increasing order. Defaults to False
- divisions (list, optional) – Known values on which to separate index values of the partitions.
See http://dask.pydata.org/en/latest/dataframe-design.html#partitions
Defaults to computing this with a single pass over the data. Note
that if
sorted=True
, specified divisions are assumed to match the existing partitions in the data. If this is untrue, you should leave divisions empty and callrepartition
afterset_index
. - compute (bool) – Whether or not to trigger an immediate computation. Defaults to False.
Examples
>>> df2 = df.set_index('x') >>> df2 = df.set_index(d.x) >>> df2 = df.set_index(d.timestamp, sorted=True)
A common case is when we have a datetime column that we know to be sorted and is cleanly divided by day. We can set this index for free by specifying both that the column is pre-sorted and the particular divisions along which it is separated
>>> import pandas as pd >>> divisions = pd.date_range('2000', '2010', freq='1D') >>> df2 = df.set_index('timestamp', sorted=True, divisions=divisions)
-
shift
(periods=1, freq=None, axis=0)¶ Shift index by desired number of periods with an optional time freq
Parameters: - periods (int) – Number of periods to move, can be positive or negative
- freq (DateOffset, timedelta, or time rule string, optional) – Increment to use from the tseries module or time rule (e.g. ‘EOM’). See Notes.
- axis ({0 or 'index', 1 or 'columns'}) –
Notes
If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
Returns: shifted Return type: DataFrame
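For example (a sketch; ddf is assumed to be an existing dask DataFrame, and the freq form assumes a datetime index):
>>> shifted = ddf.shift(1)            # data moves down one row; the first row becomes NaN
>>> shifted = ddf.shift(2, freq='D')  # with freq, the index is shifted and the data left in place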
-
std
(axis=None, skipna=True, ddof=1, split_every=False)¶ Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- ddof (int, default 1) – degrees of freedom
- numeric_only (boolean, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: std
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
Return type: Series or DataFrame (if level specified)
-
sub
(other, axis='columns', level=None, fill_value=None)¶ Subtraction of dataframe and other, element-wise (binary operator sub).
Equivalent to
dataframe - other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: - other (Series, DataFrame, or constant) –
- axis ({0, 1, 'index', 'columns'}) – For Series input, axis to match Series index on
- fill_value (None or float value, default None) – Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Notes
Mismatched indices will be unioned together
Returns: result Return type: DataFrame See also
-
sum
(axis=None, skipna=True, split_every=False)¶ Return the sum of the values for the requested axis
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only (boolean, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: sum
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
Return type: Series or DataFrame (if level specified)
-
tail
(n=5, compute=True)¶ Last n rows of the dataset
Caveat: this only checks the last n rows of the last partition.
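For example (a sketch; ddf is assumed to be an existing dask DataFrame):
>>> ddf.tail(3)                  # returns a pandas DataFrame, since compute=True by default
>>> ddf.tail(3, compute=False)   # returns a lazy dask DataFrame instead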
-
to_bag
(index=False)¶ Convert to a dask Bag of tuples of each row.
Parameters: index (bool, optional) – If True, the index is included as the first element of each tuple. Default is False.
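For example (a sketch, not from the original docstring):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']}), npartitions=1)
>>> bag = ddf.to_bag(index=True)
>>> bag.compute()  # a list of row tuples, each starting with the index value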
-
to_csv
(filename, **kwargs)¶ See dd.to_csv docstring for more information
-
to_delayed
()¶ See dd.to_delayed docstring for more information
-
to_hdf
(path_or_buf, key, mode='a', append=False, get=None, **kwargs)¶ See dd.to_hdf docstring for more information
-
to_html
(max_rows=5)¶ Render a DataFrame as an HTML table.
to_html-specific options:
- bold_rows : boolean, default True
- Make the row labels bold in the output
- classes : str or list or tuple, default None
- CSS class(es) to apply to the resulting html table
- escape : boolean, default True
- Convert the characters <, >, and & to HTML-safe sequences.
- max_rows : int, optional
- Maximum number of rows to show before truncating. If None, show all.
- max_cols : int, optional
- Maximum number of columns to show before truncating. If None, show all.
- decimal : string, default ‘.’
Character recognized as decimal separator, e.g. ‘,’ in Europe
New in version 0.18.0.
- border : int
A border=border attribute is included in the opening <table> tag. Default pd.options.html.border.
New in version 0.19.0.
Parameters: - buf (StringIO-like, optional) – buffer to write to
- columns (sequence, optional) – the subset of columns to write; default None writes all columns
- col_space (int, optional) – the minimum width of each column
- header (bool, optional) – whether to print column labels, default True
- index (bool, optional) – whether to print index (row) labels, default True
- na_rep (string, optional) – string representation of NAN to use, default ‘NaN’
- formatters (list or dict of one-parameter functions, optional) – formatter functions to apply to columns’ elements by position or name, default None. The result of each function must be a unicode string. List must be of length equal to the number of columns.
- float_format (one-parameter function, optional) – formatter function to apply to columns’ elements if they are floats, default None. The result of this function must be a unicode string.
- sparsify (bool, optional) – Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row, default True
- index_names (bool, optional) – Prints the names of the indexes, default True
- line_width (int, optional) – Width to wrap a line in characters, default no wrap
- justify ({'left', 'right'}, default None) – Left or right-justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box.
Returns: formatted
Notes
Dask doesn’t support the following argument(s).
- buf
- columns
- col_space
- header
- index
- na_rep
- formatters
- float_format
- sparsify
- index_names
- justify
- bold_rows
- classes
- escape
- max_cols
- show_dimensions
- notebook
- decimal
- border
Return type: string (or unicode, depending on data and options)
-
to_parquet
(path, *args, **kwargs)¶ See dd.to_parquet docstring for more information
-
to_string
(max_rows=5)¶ Render a DataFrame to a console-friendly tabular output.
Parameters: - buf (StringIO-like, optional) – buffer to write to
- columns (sequence, optional) – the subset of columns to write; default None writes all columns
- col_space (int, optional) – the minimum width of each column
- header (bool, optional) – whether to print column labels, default True
- index (bool, optional) – whether to print index (row) labels, default True
- na_rep (string, optional) – string representation of NAN to use, default ‘NaN’
- formatters (list or dict of one-parameter functions, optional) – formatter functions to apply to columns’ elements by position or name, default None. The result of each function must be a unicode string. List must be of length equal to the number of columns.
- float_format (one-parameter function, optional) – formatter function to apply to columns’ elements if they are floats, default None. The result of this function must be a unicode string.
- sparsify (bool, optional) – Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row, default True
- index_names (bool, optional) – Prints the names of the indexes, default True
- line_width (int, optional) – Width to wrap a line in characters, default no wrap
- justify ({'left', 'right'}, default None) – Left or right-justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box.
Returns: formatted
Notes
Dask doesn’t support the following argument(s).
- buf
- columns
- col_space
- header
- index
- na_rep
- formatters
- float_format
- sparsify
- index_names
- justify
- line_width
- max_cols
- show_dimensions
Return type: string (or unicode, depending on data and options)
-
to_timestamp
(freq=None, how='start', axis=0)¶ Cast to DatetimeIndex of timestamps, at beginning of period
Parameters: - freq (string, default frequency of PeriodIndex) – Desired frequency
- how ({'s', 'e', 'start', 'end'}) – Convention for converting period to timestamp; start of period vs. end
- axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to convert (the index by default)
- copy (boolean, default True) – If false then underlying input data is not copied
Returns: df
Return type: DataFrame with DatetimeIndex
Notes
Dask doesn’t support the following argument(s).
- copy
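As a rough sketch, one might convert a period-indexed frame like this (the column name and monthly frequency are illustrative):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'sales': [10, 20, 30]},
...                    index=pd.period_range('2017-01', periods=3, freq='M'))
>>> ddf = dd.from_pandas(pdf, npartitions=1)
>>> res = ddf.to_timestamp(how='start')  # PeriodIndex becomes a DatetimeIndex at period start
>>> res.compute()                        # a pandas DataFrame with a DatetimeIndex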
-
truediv
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series, DataFrame, or constant) –
- axis ({0, 1, 'index', 'columns'}) – For Series input, axis to match Series index on
- fill_value (None or float value, default None) – Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Notes
Mismatched indices will be unioned together
Returns: result Return type: DataFrame See also
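As a rough sketch, fill_value replaces missing entries before the division (the data here is illustrative):
>>> import pandas as pd
>>> import numpy as np
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [2.0, np.nan, 8.0]}), npartitions=1)
>>> ddf.truediv(2, fill_value=0).compute()
     x
0  1.0
1  0.0
2  4.0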
-
var
(axis=None, skipna=True, ddof=1, split_every=False)¶ Return unbiased variance over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- ddof (int, default 1) – degrees of freedom
- numeric_only (boolean, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: var
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
Return type: Series or DataFrame (if level specified)
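As a rough sketch with illustrative data, the default is the sample variance (ddof=1) of each column, computed in parallel across partitions:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [4.0, 3.0, 2.0, 1.0]})
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> ddf.var().compute()
a    1.666667
b    1.666667
dtype: float64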
-
visualize
(filename='mydask', format=None, optimize_graph=False, **kwargs)¶ Render the computation of this object’s task graph using graphviz.
Requires graphviz to be installed.
Parameters: - filename (str or None, optional) – The name (without an extension) of the file to write to disk. If filename is None, no file will be written, and we communicate with dot using only pipes.
- format ({'png', 'pdf', 'dot', 'svg', 'jpeg', 'jpg'}, optional) – Format in which to write output file. Default is ‘png’.
- optimize_graph (bool, optional) – If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.
- **kwargs – Additional keyword arguments to forward to to_graphviz.
Returns: result – See dask.dot.dot_graph for more information.
Return type: IPython.display.Image, IPython.display.SVG, or None
See also
dask.base.visualize()
,dask.dot.dot_graph()
Notes
For more information on optimization see here:
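As a rough sketch (assuming the graphviz system library and its Python bindings are installed; the filename is illustrative):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)
>>> total = (ddf.x + 1).sum()
>>> total.visualize(filename='sum-graph', format='png')  # writes sum-graph.png and returns an image object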
-
where
(cond, other=nan)¶ Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
Parameters: - cond (boolean NDFrame, array or callable) –
If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1.
A callable can be used as cond.
- other (scalar, NDFrame, or callable) –
If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1.
A callable can be used as other.
- inplace (boolean, default False) – Whether to perform the operation in place on the data
- axis (alignment axis if needed, default None) –
- level (alignment level if needed, default None) –
- try_cast (boolean, default False) – try to cast the result back to the input type (if possible),
- raise_on_error (boolean, default True) – Whether to raise on invalid data types (e.g. trying to where on strings)
Returns: wh
Return type: same type as caller
Notes
The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used.
The signature for DataFrame.where() differs from numpy.where(). Roughly, df1.where(m, df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the where documentation in indexing.
Examples
>>> s = pd.Series(range(5)) >>> s.where(s > 0) 0 NaN 1 1.0 2 2.0 3 3.0 4 4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B']) >>> m = df % 3 == 0 >>> df.where(m, -df) A B 0 0 -1 1 -2 3 2 -4 -5 3 6 -7 4 -8 9 >>> df.where(m, -df) == np.where(m, df, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True >>> df.where(m, -df) == df.mask(~m, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True
See also
DataFrame.mask()
Extra Notes
Dask doesn’t support the following argument(s).
- inplace
- axis
- level
- try_cast
- raise_on_error
-
dtypes
¶ Return data types
-
index
¶ Return dask Index instance
-
known_divisions
¶ Whether divisions are already known
-
loc
¶ Purely label-location based indexer for selection by label.
>>> df.loc["b"] >>> df.loc["b":"d"]
-
ndim
¶ Return dimensionality
-
npartitions
¶ Return number of partitions
-
size
¶ Size of the series
-
values
¶ Return a dask.array of the values of this dataframe
Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.
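As a rough sketch with illustrative data, the attribute yields a dask.array that must be computed to obtain a concrete NumPy array:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}), npartitions=2)
>>> arr = ddf.values      # dask.array with unknown chunk sizes along the rows
>>> arr.compute()
array([[1, 4],
       [2, 5],
       [3, 6]])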
Series Methods¶
-
class
dask.dataframe.
Series
(dsk, name, meta, divisions)¶ Parallel Pandas Series
Do not use this class directly. Instead use functions like dd.read_csv, dd.read_parquet, or dd.from_pandas.
Parameters: - dsk (dict) – The dask graph to compute this Series
- _name (str) – The key prefix that specifies which keys in the dask comprise this particular Series
- meta (pandas.Series) – An empty pandas.Series with names, dtypes, and index matching the expected output.
- divisions (tuple of index values) – Values along which we partition our blocks on the index
See also
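As a rough sketch, the recommended route is a constructor function such as dd.from_pandas rather than instantiating the class directly:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = pd.Series([1, 2, 3, 4], name='x')
>>> ds = dd.from_pandas(s, npartitions=2)  # a dask Series backed by 2 partitions
>>> ds.npartitions
2
>>> ds.compute()
0    1
1    2
2    3
3    4
Name: x, dtype: int64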
-
abs
()¶ Return an object with absolute value taken–only applicable to objects that are all numeric.
Returns: abs Return type: type of caller
-
add
(other, level=None, fill_value=None, axis=0)¶ Addition of series and other, element-wise (binary operator add).
Equivalent to series + other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: See also
-
align
(other, join='outer', axis=None, fill_value=None)¶ Align two object on their axes with the specified join method for each axis Index
Parameters: - other (DataFrame or Series) –
- join ({'outer', 'inner', 'left', 'right'}, default 'outer') –
- axis (allowed axis of the other object, default None) – Align on index (0), columns (1), or both (None)
- level (int or level name, default None) – Broadcast across a level, matching Index values on the passed MultiIndex level
- copy (boolean, default True) – Always returns new objects. If copy=False and no reindexing is required then original objects are returned.
- fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value
- method (str, default None) –
- limit (int, default None) –
- fill_axis ({0, 'index'}, default 0) – Filling axis, method and limit
- broadcast_axis ({0, 'index'}, default None) –
Broadcast values along this axis, if aligning two objects of different dimensions
New in version 0.17.0.
Returns: (left, right) –
Aligned objects
Dask doesn’t support the following argument(s).
- level
- copy
- method
- limit
- fill_axis
- broadcast_axis
Return type: (Series, type of other)
-
all
(axis=None, skipna=True, split_every=False)¶ Return whether all elements are True over requested axis
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- bool_only (boolean, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
Returns: all
Notes
Dask doesn’t support the following argument(s).
- bool_only
- level
Return type: Series or DataFrame (if level specified)
-
any
(axis=None, skipna=True, split_every=False)¶ Return whether any element is True over requested axis
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- bool_only (boolean, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
Returns: any
Notes
Dask doesn’t support the following argument(s).
- bool_only
- level
Return type: Series or DataFrame (if level specified)
-
append
(other)¶ Concatenate two or more Series.
Parameters: - to_append (Series or list/tuple of Series) –
- ignore_index (boolean, default False) –
If True, do not use the index labels.
- verify_integrity (boolean, default False) – If True, raise Exception on creating index with duplicates
Returns: appended
Return type: Examples
>>> s1 = pd.Series([1, 2, 3]) >>> s2 = pd.Series([4, 5, 6]) >>> s3 = pd.Series([4, 5, 6], index=[3,4,5]) >>> s1.append(s2) 0 1 1 2 2 3 0 4 1 5 2 6 dtype: int64
>>> s1.append(s3) 0 1 1 2 2 3 3 4 4 5 5 6 dtype: int64
With ignore_index set to True:
>>> s1.append(s2, ignore_index=True) 0 1 1 2 2 3 3 4 4 5 5 6 dtype: int64
With verify_integrity set to True:
>>> s1.append(s2, verify_integrity=True) ValueError: Indexes have overlapping values: [0, 1, 2]
Notes
Dask doesn’t support the following argument(s).
- to_append
- ignore_index
- verify_integrity
-
apply
(func, convert_dtype=True, meta='__no_default__', args=(), **kwds)¶ Parallel version of pandas.Series.apply
Parameters: - func (function) – Function to apply
- convert_dtype (boolean, default True) – Try to find better dtype for elementwise function results. If False, leave as dtype=object.
- meta (pd.DataFrame, pd.Series, dict, iterable, tuple, optional) – An empty
pd.DataFrame
orpd.Series
that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of aDataFrame
, adict
of{name: dtype}
or iterable of(name, dtype)
can be provided. Instead of a series, a tuple of(name, dtype)
can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providingmeta
is recommended. For more information, seedask.dataframe.utils.make_meta
. - args (tuple) – Positional arguments to pass to function in addition to the value.
- **kwds – Additional keyword arguments will be passed as keywords to the function.
Returns: applied
Return type: Series or DataFrame if func returns a Series.
Examples
>>> import dask.dataframe as dd >>> s = pd.Series(range(5), name='x') >>> ds = dd.from_pandas(s, npartitions=2)
Apply a function elementwise across the Series, passing in extra arguments in
args
andkwargs
:>>> def myadd(x, a, b=1): ... return x + a + b >>> res = ds.apply(myadd, args=(2,), b=1.5)
By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the
meta
keyword. This can be specified in many forms, for more information seedask.dataframe.utils.make_meta
.Here we specify the output is a Series with name
'x'
, and dtypefloat64
:>>> res = ds.apply(myadd, args=(2,), b=1.5, meta=('x', 'f8'))
In the case where the metadata doesn’t change, you can also pass in the object itself directly:
>>> res = ds.apply(lambda x: x + 1, meta=ds)
See also
dask.Series.map_partitions()
-
astype
(dtype)¶ Cast object to input numpy.dtype Return a copy when copy = True (be really careful with this!)
Parameters: - dtype (data type, or dict of column name -> data type) – Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, ...}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
- raise_on_error (raise on invalid input) –
- kwargs (keyword arguments to pass on to the constructor) –
Returns: casted
Return type: type of caller
Notes
Dask doesn’t support the following argument(s).
- copy
- raise_on_error
-
autocorr
(lag=1, split_every=False)¶ Lag-N autocorrelation
Parameters: lag (int, default 1) – Number of lags to apply before performing autocorrelation. Returns: autocorr Return type: float
-
between
(left, right, inclusive=True)¶ Return boolean Series equivalent to left <= series <= right. NA values will be treated as False
Parameters: - left (scalar) – Left boundary
- right (scalar) – Right boundary
Returns: is_between
Return type:
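As a rough sketch with illustrative data, the bounds are inclusive by default:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ds = dd.from_pandas(pd.Series([1, 3, 5, 7]), npartitions=2)
>>> ds.between(2, 6).compute()
0    False
1     True
2     True
3    False
dtype: bool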
-
bfill
(axis=None, limit=None)¶ Synonym for NDFrame.fillna(method='bfill')
Notes
Dask doesn’t support the following argument(s).
- inplace
- downcast
-
clear_divisions
()¶ Forget division information
-
clip
(lower=None, upper=None, out=None)¶ Trim values at input threshold(s).
Parameters: - lower (float or array_like, default None) –
- upper (float or array_like, default None) –
- axis (int or string axis name, optional) – Align object with lower and upper along the given axis.
Returns: clipped
Return type: Examples
>>> df 0 1 0 0.335232 -1.256177 1 -1.367855 0.746646 2 0.027753 -1.176076 3 0.230930 -0.679613 4 1.261967 0.570967 >>> df.clip(-1.0, 0.5) 0 1 0 0.335232 -1.000000 1 -1.000000 0.500000 2 0.027753 -1.000000 3 0.230930 -0.679613 4 0.500000 0.500000 >>> t 0 -0.3 1 -0.2 2 -0.1 3 0.0 4 0.1 dtype: float64 >>> df.clip(t, t + 1, axis=0) 0 1 0 0.335232 -0.300000 1 -0.200000 0.746646 2 0.027753 -0.100000 3 0.230930 0.000000 4 1.100000 0.570967
Notes
Dask doesn’t support the following argument(s).
- axis
-
clip_lower
(threshold)¶ Return copy of the input with values below given value(s) truncated.
Parameters: - threshold (float or array_like) –
- axis (int or string axis name, optional) – Align object with threshold along the given axis.
See also
Returns: clipped Return type: same type as input Notes
Dask doesn’t support the following argument(s).
- axis
-
clip_upper
(threshold)¶ Return copy of input with values above given value(s) truncated.
Parameters: - threshold (float or array_like) –
- axis (int or string axis name, optional) – Align object with threshold along the given axis.
See also
Returns: clipped Return type: same type as input Notes
Dask doesn’t support the following argument(s).
- axis
-
combine
(other, func, fill_value=None)¶ Perform elementwise binary operation on two Series using given function with optional fill value when an index is missing from one Series or the other
Parameters: - other (Series or scalar value) –
- func (function) –
- fill_value (scalar value) –
Returns: result
Return type:
-
combine_first
(other)¶ Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes
Parameters: other (Series) – Returns: y Return type: Series
-
compute
(**kwargs)¶ Compute this dask collection
This turns a lazy Dask collection into its in-memory equivalent. For example a Dask.array turns into a NumPy array and a Dask.dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation.
Parameters: - get (callable, optional) – A scheduler
get
function to use. If not provided, the default is to check the global settings first, and then fall back to the collection defaults. - optimize_graph (bool, optional) – If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
- kwargs – Extra keywords to forward to the scheduler
get
function.
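As a rough sketch, nothing runs until compute is called, at which point a concrete in-memory result is returned:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ds = dd.from_pandas(pd.Series([1, 2, 3, 4]), npartitions=2)
>>> total = ds.sum()   # a lazy dask Scalar, nothing computed yet
>>> total.compute()    # triggers the computation and returns a plain number
10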
-
copy
()¶ Make a copy of the dataframe
This is strictly a shallow copy of the underlying computational graph. It does not affect the underlying data
-
corr
(other, method='pearson', min_periods=None, split_every=False)¶ Compute correlation with other Series, excluding missing values
Parameters: - other (Series) –
- method ({'pearson', 'kendall', 'spearman'}) –
- pearson : standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
- min_periods (int, optional) – Minimum number of observations needed to have a valid result
Returns: correlation
Return type:
-
count
(split_every=False)¶ Return number of non-NA/null observations in the Series
Parameters: level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series Returns: nobs Return type: int or Series (if level specified) Notes
Dask doesn’t support the following argument(s).
- level
-
cov
(other, min_periods=None, split_every=False)¶ Compute covariance with Series, excluding missing values
Parameters: - other (Series) –
- min_periods (int, optional) – Minimum number of observations needed to have a valid result
Returns: - covariance (float)
- Normalized by N-1 (unbiased estimator).
-
cummax
(axis=None, skipna=True)¶ Return cumulative max over requested axis.
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: cummax
Return type:
-
cummin
(axis=None, skipna=True)¶ Return cumulative minimum over requested axis.
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: cummin
Return type:
-
cumprod
(axis=None, skipna=True)¶ Return cumulative product over requested axis.
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: cumprod
Return type:
-
cumsum
(axis=None, skipna=True)¶ Return cumulative sum over requested axis.
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: cumsum
Return type:
-
describe
(split_every=False)¶ Generate various summary statistics, excluding NaN values.
Parameters: - percentiles (array-like, optional) – The percentiles to include in the output. Should all be in the interval [0, 1]. By default percentiles is [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
- include, exclude (list-like, 'all', or None (default)) –
Specify the form of the returned result. Either:
- None to both (default). The result will include only numeric-typed columns or, if none are, only categorical columns.
- A list of dtypes or strings to be included/excluded. To select all numeric types use numpy.number. To select categorical objects use type object. See also the select_dtypes documentation, e.g. df.describe(include=['O']).
- If include is the string 'all', the output column-set will match the input one.
Returns: summary
Return type: NDFrame of summary statistics
Notes
The output DataFrame index depends on the requested dtypes:
For numeric dtypes, it will include: count, mean, std, min, max, and lower, 50, and upper percentiles.
For object dtypes (e.g. timestamps or strings), the index will include the count, unique, most common, and frequency of the most common. Timestamps also include the first and last items.
For mixed dtypes, the index will be the union of the corresponding output types. Non-applicable entries will be filled with NaN. Note that mixed-dtype outputs can only be returned from mixed-dtype inputs and appropriate use of the include/exclude arguments.
If multiple values have the highest count, then the count and most common pair will be arbitrarily chosen from among those with the highest count.
The include, exclude arguments are ignored for Series.
See also
DataFrame.select_dtypes()
Extra Notes
Dask doesn’t support the following argument(s).
- percentiles
- include
- exclude
-
diff
(periods=1, axis=0)¶ 1st discrete difference of object
Parameters: - periods (int, default 1) – Periods to shift for forming difference
- axis ({0 or 'index', 1 or 'columns'}, default 0) –
Take difference over rows (0) or columns (1).
Returns: diffed
Return type:
-
div
(other, level=None, fill_value=None, axis=0)¶ Floating division of series and other, element-wise (binary operator truediv).
Equivalent to series / other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: See also
-
drop_duplicates
(split_every=None, split_out=1, **kwargs)¶ Return DataFrame with duplicate rows removed, optionally only considering certain columns
Parameters: - subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns
- keep ({'first', 'last', False}, default 'first') –
  - first : Drop duplicates except for the first occurrence.
  - last : Drop duplicates except for the last occurrence.
  - False : Drop all duplicates.
- take_last (deprecated) –
- inplace (boolean, default False) – Whether to drop duplicates in place or to return a copy
Returns: deduplicated
Return type:
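As a rough sketch with illustrative data; the row order of the result may differ from pandas, so the output is sorted here for display:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ds = dd.from_pandas(pd.Series([1, 1, 2, 2, 3]), npartitions=2)
>>> sorted(ds.drop_duplicates().compute())
[1, 2, 3]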
-
dropna
()¶ Return Series without null values
Parameters: inplace (boolean, default False) – Do operation in place.
Returns: valid
Return type: Series
Notes
Dask doesn’t support the following argument(s).
- axis
- inplace
-
eq
(other, level=None, axis=0)¶ Equal to of series and other, element-wise (binary operator eq).
Equivalent to series == other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: See also
Series.None()
-
ffill
(axis=None, limit=None)¶ Synonym for NDFrame.fillna(method='ffill')
Notes
Dask doesn’t support the following argument(s).
- inplace
- downcast
-
fillna
(value=None, method=None, limit=None, axis=None)¶ Fill NA/NaN values using the specified method
Parameters: - value (scalar, dict, Series, or DataFrame) – Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.
- method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) – Method to use for filling holes in reindexed Series pad / ffill: propagate last valid observation forward to next valid backfill / bfill: use NEXT valid observation to fill gap
- axis ({0 or 'index', 1 or 'columns'}) –
- inplace (boolean, default False) – If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).
- limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.
- downcast (dict, default is None) – a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)
See also
reindex()
,asfreq()
Returns: filled
Notes
Dask doesn’t support the following argument(s).
- inplace
- downcast
Return type: DataFrame
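As a rough sketch with illustrative data, filling missing values with a scalar:
>>> import pandas as pd
>>> import numpy as np
>>> import dask.dataframe as dd
>>> ds = dd.from_pandas(pd.Series([1.0, np.nan, 3.0, np.nan]), npartitions=2)
>>> ds.fillna(0).compute()
0    1.0
1    0.0
2    3.0
3    0.0
dtype: float64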
-
first
(offset)¶ Convenience method for subsetting initial periods of time series data based on a date offset.
Parameters: offset (string, DateOffset, dateutil.relativedelta) – Examples
ts.first('10D') -> First 10 days
Returns: subset Return type: type of caller
-
floordiv
(other, level=None, fill_value=None, axis=0)¶ Integer division of series and other, element-wise (binary operator floordiv).
Equivalent to series // other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: See also
-
ge
(other, level=None, axis=0)¶ Greater than or equal to of series and other, element-wise (binary operator ge).
Equivalent to series >= other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: See also
Series.None()
-
get_partition
(n)¶ Get a dask DataFrame/Series representing the nth partition.
-
groupby
(by=None, **kwargs)¶ Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
Parameters: - by (mapping function / list of functions, dict, Series, or tuple /) – list of column names. Called on each element of the object index to determine the groups. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups
- axis (int, default 0) –
- level (int, level name, or sequence of such, default None) – If the axis is a MultiIndex (hierarchical), group by a particular level or levels
- as_index (boolean, default True) – For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
- sort (boolean, default True) – Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.
- group_keys (boolean, default True) – When calling apply, add group keys to index to identify pieces
- squeeze (boolean, default False) – reduce the dimensionality of the return type if possible, otherwise return a consistent type
Examples
DataFrame results
>>> data.groupby(func, axis=0).mean() >>> data.groupby(['col1', 'col2'])['col3'].mean()
DataFrame with hierarchical index
>>> data.groupby(['col1', 'col2']).mean()
Returns: Return type: GroupBy object Notes
Dask doesn’t support the following argument(s).
- axis
- level
- as_index
- sort
- group_keys
- squeeze
-
gt
(other, level=None, axis=0)¶ Greater than of series and other, element-wise (binary operator gt).
Equivalent to series > other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: See also
Series.None()
-
head
(n=5, npartitions=1, compute=True)¶ First n rows of the dataset
Parameters: - n (int, optional) – The number of rows to return. Default is 5.
- npartitions (int, optional) – Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions, a warning will be raised and any found rows returned. Pass -1 to use all partitions.
- compute (bool, optional) – Whether to compute the result, default is True.
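As a rough sketch with illustrative data; by default rows are taken only from the first partition:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ds = dd.from_pandas(pd.Series(range(10)), npartitions=2)
>>> ds.head(3)                        # computed eagerly, returns a pandas Series
0    0
1    1
2    2
dtype: int64
>>> lazy = ds.head(3, compute=False)  # stays a lazy dask Series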
-
idxmax
(axis=None, skipna=True, split_every=False)¶ Return index of first occurrence of maximum over requested axis. NA/null values are excluded.
Parameters: - axis ({0 or 'index', 1 or 'columns'}, default 0) – 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be first index.
Returns: idxmax
Return type: Notes
This method is the DataFrame version of ndarray.argmax.
See also
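As a rough sketch with illustrative data, the result is the index label of the maximum value:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ds = dd.from_pandas(pd.Series([3, 1, 4, 1, 5], index=list('abcde')), npartitions=2)
>>> ds.idxmax().compute()
'e'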
-
idxmin
(axis=None, skipna=True, split_every=False)¶ Return index of first occurrence of minimum over requested axis. NA/null values are excluded.
Parameters: - axis ({0 or 'index', 1 or 'columns'}, default 0) – 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
Returns: idxmin
Return type: Notes
This method is the DataFrame version of ndarray.argmin.
See also
-
isin
(values)¶ Return a boolean
Series
showing whether each element in theSeries
is exactly contained in the passed sequence ofvalues
.Parameters: values (set or list-like) – The sequence of values to test. Passing in a single string will raise a
TypeError
. Instead, turn a single string into alist
of one element.New in version 0.18.1.
Support for values as a set
Returns: isin Return type: Series (bool dtype) Raises: TypeError
– * Ifvalues
is a stringSee also
Examples
>>> s = pd.Series(list('abc')) >>> s.isin(['a', 'c', 'e']) 0 True 1 False 2 True dtype: bool
Passing a single string as
s.isin('a')
will raise an error. Use a list of one element instead:>>> s.isin(['a']) 0 True 1 False 2 False dtype: bool
-
isnull
()¶ Return a boolean same-sized object indicating if the values are null.
See also
notnull()
- boolean inverse of isnull
-
iteritems
()¶ Lazily iterate over (index, value) tuples
-
last
(offset)¶ Convenience method for subsetting final periods of time series data based on a date offset.
Parameters: offset (string, DateOffset, dateutil.relativedelta) – Examples
ts.last('5M') -> Last 5 months
Returns: subset Return type: type of caller
-
le
(other, level=None, axis=0)¶ Less than or equal to of series and other, element-wise (binary operator le).
Equivalent to series <= other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: See also
Series.None()
-
lt
(other, level=None, axis=0)¶ Less than of series and other, element-wise (binary operator lt).
Equivalent to series < other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: See also
Series.None()
-
map
(arg, na_action=None, meta='__no_default__')¶ Map values of Series using input correspondence (which can be a dict, Series, or function)
Parameters: - arg (function, dict, or Series) –
- na_action ({None, 'ignore'}) – If ‘ignore’, propagate NA values, without passing them to the mapping function
- meta (pd.DataFrame, pd.Series, dict, iterable, tuple, optional) – An empty
pd.DataFrame
orpd.Series
that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of aDataFrame
, adict
of{name: dtype}
or iterable of(name, dtype)
can be provided. Instead of a series, a tuple of(name, dtype)
can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providingmeta
is recommended. For more information, seedask.dataframe.utils.make_meta
.
Returns: y – same index as caller
Return type: Examples
Map inputs to outputs
>>> x one 1 two 2 three 3
>>> y 1 foo 2 bar 3 baz
>>> x.map(y) one foo two bar three baz
Use na_action to control whether NA values are affected by the mapping function.
>>> s = pd.Series([1, 2, 3, np.nan])
>>> s2 = s.map(lambda x: 'this is a string {}'.format(x), na_action=None) 0 this is a string 1.0 1 this is a string 2.0 2 this is a string 3.0 3 this is a string nan dtype: object
>>> s3 = s.map(lambda x: 'this is a string {}'.format(x), na_action='ignore') 0 this is a string 1.0 1 this is a string 2.0 2 this is a string 3.0 3 NaN dtype: object
-
map_overlap
(func, before, after, *args, **kwargs)¶ Apply a function to each partition, sharing rows with adjacent partitions.
This can be useful for implementing windowing functions such as
df.rolling(...).mean()
ordf.diff()
.Parameters: - func (function) – Function applied to each partition.
- before (int) – The number of rows to prepend to partition
i
from the end of partitioni - 1
. - after (int) – The number of rows to append to partition
i
from the beginning of partitioni + 1
. - kwargs (args,) – Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.
- meta (pd.DataFrame, pd.Series, dict, iterable, tuple, optional) – An empty
pd.DataFrame
orpd.Series
that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of aDataFrame
, adict
of{name: dtype}
or iterable of(name, dtype)
can be provided. Instead of a series, a tuple of(name, dtype)
can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providingmeta
is recommended. For more information, seedask.dataframe.utils.make_meta
.
Notes
Given positive integers
before
andafter
, and a functionfunc
,map_overlap
does the following:- Prepend
before
rows to each partitioni
from the end of partitioni - 1
. The first partition has no rows prepended. - Append
after
rows to each partitioni
from the beginning of partitioni + 1
. The last partition has no rows appended. - Apply
func
to each partition, passing in any extraargs
andkwargs
if provided. - Trim
before
rows from the beginning of all but the first partition. - Trim
after
rows from the end of all but the last partition.
Note that the index and divisions are assumed to remain unchanged.
Examples
Given a DataFrame, Series, or Index, such as:
>>> import dask.dataframe as dd >>> df = pd.DataFrame({'x': [1, 2, 4, 7, 11], ... 'y': [1., 2., 3., 4., 5.]}) >>> ddf = dd.from_pandas(df, npartitions=2)
A rolling sum with a trailing moving window of size 2 can be computed by overlapping 2 rows before each partition, and then mapping calls to
df.rolling(2).sum()
:>>> ddf.compute() x y 0 1 1.0 1 2 2.0 2 4 3.0 3 7 4.0 4 11 5.0 >>> ddf.map_overlap(lambda df: df.rolling(2).sum(), 2, 0).compute() x y 0 NaN NaN 1 3.0 3.0 2 6.0 5.0 3 11.0 7.0 4 18.0 9.0
The pandas
diff
method computes a discrete difference shifted by a number of periods (can be positive or negative). This can be implemented by mapping calls todf.diff
to each partition after prepending/appending that many rows, depending on sign:>>> def diff(df, periods=1): ... before, after = (periods, 0) if periods > 0 else (0, -periods) ... return df.map_overlap(lambda df, periods=1: df.diff(periods), ... periods, 0, periods=periods) >>> diff(ddf, 1).compute() x y 0 NaN NaN 1 1.0 1.0 2 2.0 1.0 3 3.0 1.0 4 4.0 1.0
If you have a DatetimeIndex, you can use a timedelta for time-based windows.
>>> ts = pd.Series(range(10), index=pd.date_range('2017', periods=10)) >>> dts = dd.from_pandas(ts, npartitions=2) >>> dts.map_overlap(lambda df: df.rolling('2D').sum(), ... pd.Timedelta('2D'), 0).compute() 2017-01-01 0.0 2017-01-02 1.0 2017-01-03 3.0 2017-01-04 5.0 2017-01-05 7.0 2017-01-06 9.0 2017-01-07 11.0 2017-01-08 13.0 2017-01-09 15.0 2017-01-10 17.0 dtype: float64
-
map_partitions
(func, *args, **kwargs)¶ Apply Python function on each DataFrame partition.
Note that the index and divisions are assumed to remain unchanged.
Parameters: - func (function) – Function applied to each partition.
- kwargs (args,) – Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.
- meta (pd.DataFrame, pd.Series, dict, iterable, tuple, optional) – An empty
pd.DataFrame
orpd.Series
that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of aDataFrame
, adict
of{name: dtype}
or iterable of(name, dtype)
can be provided. Instead of a series, a tuple of(name, dtype)
can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providingmeta
is recommended. For more information, seedask.dataframe.utils.make_meta
.
Examples
Given a DataFrame, Series, or Index, such as:
>>> import dask.dataframe as dd >>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5], ... 'y': [1., 2., 3., 4., 5.]}) >>> ddf = dd.from_pandas(df, npartitions=2)
One can use
map_partitions
to apply a function on each partition. Extra arguments and keywords can optionally be provided, and will be passed to the function after the partition.Here we apply a function with arguments and keywords to a DataFrame, resulting in a Series:
>>> def myadd(df, a, b=1): ... return df.x + df.y + a + b >>> res = ddf.map_partitions(myadd, 1, b=2) >>> res.dtype dtype('float64')
By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the
meta
keyword. This can be specified in many forms, for more information seedask.dataframe.utils.make_meta
.Here we specify the output is a Series with no name, and dtype
float64
:>>> res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))
Here we map a function that takes in a DataFrame, and returns a DataFrame with a new column:
>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y)) >>> res.dtypes x int64 y float64 z float64 dtype: object
As before, the output metadata can also be specified manually. This time we pass in a
dict
, as the output is a DataFrame:>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y), ... meta={'x': 'i8', 'y': 'f8', 'z': 'f8'})
In the case where the metadata doesn’t change, you can also pass in the object itself directly:
>>> res = ddf.map_partitions(lambda df: df.head(), meta=df)
Also note that the index and divisions are assumed to remain unchanged. If the function you’re mapping changes the index/divisions, you’ll need to clear them afterwards:
>>> ddf.map_partitions(func).clear_divisions()
-
mask
(cond, other=nan)¶ Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.
Parameters: - cond (boolean NDFrame, array or callable) –
If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1.
A callable can be used as cond.
- other (scalar, NDFrame, or callable) –
If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1.
A callable can be used as other.
- inplace (boolean, default False) – Whether to perform the operation in place on the data
- axis (alignment axis if needed, default None) –
- level (alignment level if needed, default None) –
- try_cast (boolean, default False) – try to cast the result back to the input type (if possible),
- raise_on_error (boolean, default True) – Whether to raise on invalid data types (e.g. trying to where on strings)
Returns: wh
Return type: same type as caller
Notes
The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from the DataFrame other is used.
The signature for DataFrame.where() differs from numpy.where(). Roughly, df1.where(m, df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the mask documentation in indexing.
Examples
>>> s = pd.Series(range(5)) >>> s.where(s > 0) 0 NaN 1 1.0 2 2.0 3 3.0 4 4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B']) >>> m = df % 3 == 0 >>> df.where(m, -df) A B 0 0 -1 1 -2 3 2 -4 -5 3 6 -7 4 -8 9 >>> df.where(m, -df) == np.where(m, df, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True >>> df.where(m, -df) == df.mask(~m, -df) A B 0 True True 1 True True 2 True True 3 True True 4 True True
See also
DataFrame.where()
Extra Notes
Dask doesn’t support the following argument(s).
- inplace
- axis
- level
- try_cast
- raise_on_error
-
max
(axis=None, skipna=True, split_every=False)¶ This method returns the maximum of the values in the object. If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only (boolean, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: max
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
Return type: Series or DataFrame (if level specified)
-
mean
(axis=None, skipna=True, split_every=False)¶ Return the mean of the values for the requested axis
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only (boolean, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: mean
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
Return type: Series or DataFrame (if level specified)
-
memory_usage
(index=True, deep=False)¶ Memory usage of the Series
Return type: scalar bytes of memory consumed
Notes
Memory usage does not include memory consumed by elements that are not components of the array if deep=False
See also
numpy.ndarray.nbytes()
-
min
(axis=None, skipna=True, split_every=False)¶ This method returns the minimum of the values in the object. If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only (boolean, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: min
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
Return type: Series or DataFrame (if level specified)
-
mod
(other, level=None, fill_value=None, axis=0)¶ Modulo of series and other, element-wise (binary operator mod).
Equivalent to series % other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: See also
-
mul
(other, level=None, fill_value=None, axis=0)¶ Multiplication of series and other, element-wise (binary operator mul).
Equivalent to series * other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: See also
-
ne
(other, level=None, axis=0)¶ Not equal to of series and other, element-wise (binary operator ne).
Equivalent to series != other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: See also
Series.None()
-
nlargest
(n=5, split_every=None)¶ Return the largest n elements.
Parameters: - n (int) – Return this many descending sorted values
- keep ({'first', 'last', False}, default 'first') – Where there are duplicate values:
- first : take the first occurrence.
- last : take the last occurrence.
- take_last (deprecated) –
Returns: top_n – The n largest values in the Series, in sorted order
Return type: Notes
Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.
See also
Examples
>>> import pandas as pd >>> import numpy as np >>> s = pd.Series(np.random.randn(int(1e6))) >>> s.nlargest(10) # only sorts up to the N requested
-
notnull
()¶ Return a boolean same-sized object indicating if the values are not null.
See also
isnull()
- boolean inverse of notnull
-
nsmallest
(n=5, split_every=None)¶ Return the smallest n elements.
Parameters: - n (int) – Return this many ascending sorted values
- keep ({'first', 'last', False}, default 'first') – Where there are duplicate values:
- first : take the first occurrence.
- last : take the last occurrence.
- take_last (deprecated) –
Returns: bottom_n – The n smallest values in the Series, in sorted order
Return type: Notes
Faster than .sort_values().head(n) for small n relative to the size of the Series object.
See also
Examples
>>> import pandas as pd >>> import numpy as np >>> s = pd.Series(np.random.randn(int(1e6))) >>> s.nsmallest(10) # only sorts up to the N requested
-
nunique
(split_every=None)¶ Return number of unique elements in the object.
Excludes NA values by default.
Parameters: dropna (boolean, default True) – Don’t include NaN in the count. Returns: nunique Return type: int Notes
Dask doesn’t support the following argument(s).
- dropna
-
nunique_approx
(split_every=None)¶ Approximate number of unique rows.
This method uses the HyperLogLog algorithm for cardinality estimation to compute the approximate number of unique rows. The approximate error is 0.406%.
Parameters: split_every (int, optional) – Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 8. Returns: Return type: a float representing the approximate number of elements
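As a rough sketch with illustrative data, comparing the exact count with the HyperLogLog estimate (the estimate is a float close to the true count):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ds = dd.from_pandas(pd.Series([1, 2, 2, 3, 3, 3]), npartitions=2)
>>> ds.nunique().compute()    # exact
3
>>> approx = ds.nunique_approx().compute()   # a float close to 3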
-
persist
(**kwargs)¶ Persist this dask collection into memory
See
dask.base.persist
for full docstring
-
pipe
(func, *args, **kwargs)¶ Apply func(self, *args, **kwargs)
New in version 0.16.2.
Parameters: - func (function) – function to apply to the NDFrame.
args
, andkwargs
are passed intofunc
. Alternatively a(callable, data_keyword)
tuple wheredata_keyword
is a string indicating the keyword ofcallable
that expects the NDFrame. - args (positional arguments passed into
func
.) – - kwargs (a dictionary of keyword arguments passed into
func
.) –
Returns: object
Return type: the return type of
func
.Notes
Use .pipe when chaining together functions that expect Series or DataFrames. Instead of writing
>>> f(g(h(df), arg1=a), arg2=b, arg3=c)
You can write
>>> (df.pipe(h) ... .pipe(g, arg1=a) ... .pipe(f, arg2=b, arg3=c) ... )
If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose f takes its data as arg2:
>>> (df.pipe(h) ... .pipe(g, arg1=a) ... .pipe((f, 'arg2'), arg1=a, arg3=c) ... )
-
pow
(other, level=None, fill_value=None, axis=0)¶ Exponential power of series and other, element-wise (binary operator pow).
Equivalent to series ** other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: See also
-
prod
(axis=None, skipna=True, split_every=False)¶ Return the product of the values for the requested axis
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only (boolean, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: prod
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
Return type: Series or DataFrame (if level specified)
-
quantile
(q=0.5)¶ Approximate quantiles of Series
- q : list/array of floats, default 0.5 (50%)
- Iterable of numbers ranging from 0 to 1 for the desired quantiles
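As a rough sketch with illustrative data; the quantiles are approximate, so exact numbers may differ slightly from pandas:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ds = dd.from_pandas(pd.Series(range(1, 101)), npartitions=4)
>>> median = ds.quantile(0.5).compute()           # a single approximate value
>>> quarts = ds.quantile([0.25, 0.75]).compute()  # a pandas Series indexed by quantile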
-
radd
(other, level=None, fill_value=None, axis=0)¶ Addition of series and other, element-wise (binary operator radd).
Equivalent to other + series, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: See also
-
random_split
(frac, random_state=None)¶ Pseudorandomly split dataframe into different pieces row-wise
Parameters: - frac (list) – List of floats that should sum to one.
- random_state (int or np.random.RandomState) – If int, create a new RandomState with this as the seed. Otherwise draw from the passed RandomState.
Examples
50/50 split
>>> a, b = df.random_split([0.5, 0.5])
80/10/10 split, consistent random_state
>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)
See also
dask.DataFrame.sample()
-
rdiv
(other, level=None, fill_value=None, axis=0)¶ Floating division of series and other, element-wise (binary operator rtruediv).
Equivalent to other / series, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: See also
-
reduction
(chunk, aggregate=None, combine=None, meta='__no_default__', token=None, split_every=None, chunk_kwargs=None, aggregate_kwargs=None, combine_kwargs=None, **kwargs)¶ Generic row-wise reductions.
Parameters: - chunk (callable) – Function to operate on each partition. Should return a
pandas.DataFrame
,pandas.Series
, or a scalar. - aggregate (callable, optional) –
Function to operate on the concatenated result of
chunk
. If not specified, defaults tochunk
. Used to do the final aggregation in a tree reduction.The input to
aggregate
depends on the output ofchunk
. If the output ofchunk
is a:- scalar: Input is a Series, with one row per partition.
- Series: Input is a DataFrame, with one row per partition. Columns are the rows in the output series.
- DataFrame: Input is a DataFrame, with one row per partition. Columns are the columns in the output dataframes.
Should return a
pandas.DataFrame
,pandas.Series
, or a scalar. - combine (callable, optional) – Function to operate on intermediate concatenated results of
chunk
in a tree-reduction. If not provided, defaults toaggregate
. The input/output requirements should match that ofaggregate
described above. - meta (pd.DataFrame, pd.Series, dict, iterable, tuple, optional) – An empty
pd.DataFrame
orpd.Series
that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of aDataFrame
, adict
of{name: dtype}
or iterable of(name, dtype)
can be provided. Instead of a series, a tuple of(name, dtype)
can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providingmeta
is recommended. For more information, seedask.dataframe.utils.make_meta
. - token (str, optional) – The name to use for the output keys.
- split_every (int, optional) – Group partitions into groups of this size while performing a
tree-reduction. If set to False, no tree-reduction will be used,
and all intermediates will be concatenated and passed to
aggregate
. Default is 8. - chunk_kwargs (dict, optional) – Keyword arguments to pass on to
chunk
only. - aggregate_kwargs (dict, optional) – Keyword arguments to pass on to
aggregate
only. - combine_kwargs (dict, optional) – Keyword arguments to pass on to
combine
only. - kwargs – All remaining keywords will be passed to
chunk
,combine
, andaggregate
.
Examples
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': range(50), 'y': range(50, 100)})
>>> ddf = dd.from_pandas(df, npartitions=4)
Count the number of rows in a DataFrame. To do this, count the number of rows in each partition, then sum the results:
>>> res = ddf.reduction(lambda x: x.count(),
...                     aggregate=lambda x: x.sum())
>>> res.compute()
x    50
y    50
dtype: int64
Count the number of rows in a Series with elements greater than or equal to a value (provided via a keyword).
>>> def count_greater(x, value=0):
...     return (x >= value).sum()
>>> res = ddf.x.reduction(count_greater, aggregate=lambda x: x.sum(),
...                       chunk_kwargs={'value': 25})
>>> res.compute()
25
Aggregate both the sum and count of a Series at the same time:
>>> def sum_and_count(x):
...     return pd.Series({'sum': x.sum(), 'count': x.count()})
>>> res = ddf.x.reduction(sum_and_count, aggregate=lambda x: x.sum())
>>> res.compute()
count      50
sum      1225
dtype: int64
Doing the same, but for a DataFrame. Here chunk returns a DataFrame, meaning the input to aggregate is a DataFrame with an index with non-unique entries for both ‘x’ and ‘y’. We groupby the index, and sum each group to get the final result.
>>> def sum_and_count(x):
...     return pd.DataFrame({'sum': x.sum(), 'count': x.count()})
>>> res = ddf.reduction(sum_and_count,
...                     aggregate=lambda x: x.groupby(level=0).sum())
>>> res.compute()
   count   sum
x     50  1225
y     50  3725
-
repartition
(divisions=None, npartitions=None, freq=None, force=False)¶ Repartition dataframe along new divisions
Parameters: - divisions (list, optional) – List of partitions to be used. If specified npartitions will be ignored.
- npartitions (int, optional) – Number of partitions of output, must be less than npartitions of input. Only used if divisions isn’t specified.
- freq (str, pd.Timedelta) – A period on which to partition timeseries data like
'7D'
or'12h'
orpd.Timedelta(hours=12)
. Assumes a datetime index. - force (bool, default False) – Allows the expansion of the existing divisions. If False then the new divisions lower and upper bounds must be the same as the old divisions.
Examples
>>> df = df.repartition(npartitions=10)
>>> df = df.repartition(divisions=[0, 5, 10, 20])
>>> df = df.repartition(freq='7d')
-
resample
(rule, how=None, closed=None, label=None)¶ Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.
Parameters: - rule (string) – the offset string or object representing target conversion
- axis (int, optional, default 0) –
- closed ({'right', 'left'}) – Which side of bin interval is closed
- label ({'right', 'left'}) – Which bin edge label to label bucket with
- convention ({'start', 'end', 's', 'e'}) –
- loffset (timedelta) – Adjust the resampled time labels
- base (int, default 0) – For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0
- on (string, optional) –
For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.
New in version 0.19.0.
- level (string or int, optional) –
For a MultiIndex, level (name or number) to use for resampling. Level must be datetime-like.
New in version 0.19.0.
To learn more about the offset strings, please see this link: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
Examples
Start by creating a series with 9 one minute timestamps.
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64
Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.
>>> series.resample('3T').sum()
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket, which it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value close the right side of the bin interval as illustrated in the example below this one.
>>> series.resample('3T', label='right').sum()
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.
>>> series.resample('3T', label='right', closed='right').sum()
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64
Upsample the series into 30 second bins.
>>> series.resample('30S').asfreq()[0:5]  # select first 5 rows
2000-01-01 00:00:00     0
2000-01-01 00:00:30   NaN
2000-01-01 00:01:00     1
2000-01-01 00:01:30   NaN
2000-01-01 00:02:00     2
Freq: 30S, dtype: float64
Upsample the series into 30 second bins and fill the NaN values using the pad method.
>>> series.resample('30S').pad()[0:5]
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64
Upsample the series into 30 second bins and fill the NaN values using the bfill method.
>>> series.resample('30S').bfill()[0:5]
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64
Pass a custom function via apply
>>> def custom_resampler(array_like):
...     return np.sum(array_like) + 5
>>> series.resample('3T').apply(custom_resampler)
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64
Notes
Dask doesn’t support the following argument(s).
- axis
- fill_method
- convention
- kind
- loffset
- limit
- base
- on
- level
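The pandas examples above carry over to dask dataframes whose index is a DatetimeIndex. A minimal sketch (the column name x is hypothetical):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> index = pd.date_range('2000-01-01', periods=9, freq='T')
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(9)}, index=index),
...                      npartitions=3)
>>> # Downsample into 3 minute bins and sum each bin; compute() returns pandas
>>> ddf.x.resample('3T').sum().compute()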
-
reset_index
(drop=False)¶ Reset the index to the default index.
Note that unlike in pandas, the reset dask.dataframe index will not be monotonically increasing from 0. Instead, it will restart at 0 for each partition (e.g. index1 = [0, ..., 10], index2 = [0, ...]). This is due to the inability to statically know the full length of the index.
For a DataFrame with a multi-level index, returns a new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. if any are None. For a standard index, the index name will be used (if set), otherwise a default ‘index’ or ‘level_0’ (if ‘index’ is already taken) will be used.
Parameters: drop (boolean, default False) – Do not try to insert index into dataframe columns.
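As a rough illustrative sketch of the per-partition restart described above (hypothetical data):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': range(6)}, index=list('abcdef'))
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> # each partition starts its new default index at 0, giving [0, 1, 2, 0, 1, 2]
>>> ddf.reset_index(drop=True).compute().index.tolist()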
-
rfloordiv
(other, level=None, fill_value=None, axis=0)¶ Integer division of series and other, element-wise (binary operator rfloordiv).
Equivalent to
other // series
, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: Series
-
rmod
(other, level=None, fill_value=None, axis=0)¶ Modulo of series and other, element-wise (binary operator rmod).
Equivalent to
other % series
, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: Series
-
rmul
(other, level=None, fill_value=None, axis=0)¶ Multiplication of series and other, element-wise (binary operator rmul).
Equivalent to
other * series
, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: Series
-
rolling
(window, min_periods=None, freq=None, center=False, win_type=None, axis=0)¶ Provides rolling transformations.
Parameters: - window (int, str, offset) –
Size of the moving window. This is the number of observations used for calculating the statistic. The window size must not be so large as to span more than one adjacent partition. If using an offset or offset alias like ‘5D’, the data must have a
DatetimeIndex
Changed in version 0.15.0: Now accepts offsets and string offset aliases
- min_periods (int, default None) – Minimum number of observations in window required to have a value (otherwise result is NA).
- center (boolean, default False) – Set the labels at the center of the window.
- win_type (string, default None) – Provide a window type. The recognized window types are identical to pandas.
- axis (int, default 0) –
Returns: Return type: a Rolling object on which to call a method to compute a statistic
Notes
The freq argument is not supported.
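A short sketch of calling an aggregation on the returned Rolling object (the column name x is hypothetical):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)
>>> # 3-observation moving mean; as in pandas, the first two values are NaN
>>> ddf.x.rolling(3).mean().compute()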
-
round
(decimals=0)¶ Round each value in a Series to the given number of decimals.
Parameters: decimals (int) – Number of decimal places to round to (default: 0). If decimals is negative, it specifies the number of positions to the left of the decimal point.
Return type: Series object
See also
numpy.around()
,DataFrame.round()
-
rpow
(other, level=None, fill_value=None, axis=0)¶ Exponential power of series and other, element-wise (binary operator rpow).
Equivalent to
other ** series
, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: Series
-
rsub
(other, level=None, fill_value=None, axis=0)¶ Subtraction of series and other, element-wise (binary operator rsub).
Equivalent to
other - series
, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: Series
-
rtruediv
(other, level=None, fill_value=None, axis=0)¶ Floating division of series and other, element-wise (binary operator rtruediv).
Equivalent to
other / series
, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: Series
-
sample
(frac, replace=False, random_state=None)¶ Random sample of items
Parameters: - frac (float, optional) – Fraction of axis items to return.
- replace (boolean, optional) – Sample with or without replacement. Default = False.
- random_state (int or
np.random.RandomState
) – If int, we create a new RandomState with this as the seed. Otherwise we draw from the passed RandomState.
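For example, a sketch of drawing a reproducible ~10% sample:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(100)}), npartitions=4)
>>> # roughly 10% of the rows, reproducible via random_state
>>> sampled = ddf.sample(0.1, random_state=123)
>>> sampled.compute()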
-
sem
(axis=None, skipna=None, ddof=1, split_every=False)¶ Return unbiased standard error of the mean over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- ddof (int, default 1) – degrees of freedom
- numeric_only (boolean, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: sem
Return type: Series or DataFrame (if level specified)
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
-
shift
(periods=1, freq=None, axis=0)¶ Shift index by desired number of periods with an optional time freq
Parameters: - periods (int) – Number of periods to move, can be positive or negative
- freq (DateOffset, timedelta, or time rule string, optional) – Increment to use from the tseries module or time rule (e.g. ‘EOM’). See Notes.
- axis ({0 or 'index', 1 or 'columns'}) –
Notes
If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
Returns: shifted Return type: DataFrame
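A brief sketch contrasting a plain shift with a freq-based shift (assumes a datetime index; hypothetical data):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> index = pd.date_range('2000-01-01', periods=6, freq='D')
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(6)}, index=index),
...                      npartitions=2)
>>> ddf.x.shift(1).compute()           # data moves down one row, index unchanged
>>> ddf.x.shift(1, freq='D').compute()  # index moves one day, data not realigned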
-
std
(axis=None, skipna=True, ddof=1, split_every=False)¶ Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- ddof (int, default 1) – degrees of freedom
- numeric_only (boolean, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: std
Return type: Series or DataFrame (if level specified)
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
-
sub
(other, level=None, fill_value=None, axis=0)¶ Subtraction of series and other, element-wise (binary operator sub).
Equivalent to
series - other
, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: Series
-
sum
(axis=None, skipna=True, split_every=False)¶ Return the sum of the values for the requested axis
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only (boolean, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: sum
Return type: Series or DataFrame (if level specified)
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
-
tail
(n=5, compute=True)¶ Last n rows of the dataset
Caveat: this only checks the last n rows of the last partition.
-
to_bag
(index=False)¶ Create a Dask Bag from a Series
-
to_csv
(filename, **kwargs)¶ See dd.to_csv docstring for more information
-
to_delayed
()¶ See dd.to_delayed docstring for more information
-
to_frame
(name=None)¶ Convert Series to DataFrame
Parameters: name (object, default None) – The passed name should substitute for the series name (if it has one). Returns: data_frame Return type: DataFrame
-
to_hdf
(path_or_buf, key, mode='a', append=False, get=None, **kwargs)¶ See dd.to_hdf docstring for more information
-
to_parquet
(path, *args, **kwargs)¶ See dd.to_parquet docstring for more information
-
to_string
(max_rows=5)¶ Render a string representation of the Series
Parameters: - buf (StringIO-like, optional) – buffer to write to
- na_rep (string, optional) – string representation of NAN to use, default ‘NaN’
- float_format (one-parameter function, optional) – formatter function to apply to columns’ elements if they are floats default None
- header (boolean, default True) – Add the Series header (index name)
- index (bool, optional) – Add index (row) labels, default True
- length (boolean, default False) – Add the Series length
- dtype (boolean, default False) – Add the Series dtype
- name (boolean, default False) – Add the Series name if not None
- max_rows (int, optional) – Maximum number of rows to show before truncating. If None, show all.
Returns: formatted
Return type: string (if not buffer passed)
Notes
Dask doesn’t support the following argument(s).
- buf
- na_rep
- float_format
- header
- index
- length
- dtype
- name
-
to_timestamp
(freq=None, how='start', axis=0)¶ Cast to DatetimeIndex of timestamps, at beginning of period
Parameters: - freq (string, default frequency of PeriodIndex) – Desired frequency
- how ({'s', 'e', 'start', 'end'}) – Convention for converting period to timestamp; start of period vs. end
- axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to convert (the index by default)
- copy (boolean, default True) – If false then underlying input data is not copied
Returns: df
Return type: DataFrame with DatetimeIndex
Notes
Dask doesn’t support the following argument(s).
- copy
-
truediv
(other, level=None, fill_value=None, axis=0)¶ Floating division of series and other, element-wise (binary operator truediv).
Equivalent to
series / other
, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: - other (Series or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill missing (NaN) values with this value. If both Series are missing, the result will be missing
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result
Return type: Series
-
unique
(split_every=None, split_out=1)¶ Return Series of unique values in the object. Includes NA values.
Returns: uniques Return type: Series
-
value_counts
(split_every=None, split_out=1)¶ Returns object containing counts of unique values.
The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
Parameters: - normalize (boolean, default False) – If True then the object returned will contain the relative frequencies of the unique values.
- sort (boolean, default True) – Sort by values
- ascending (boolean, default False) – Sort in ascending order
- bins (integer, optional) – Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data
- dropna (boolean, default True) – Don’t include counts of NaN.
Returns: counts
Return type: Series
Notes
Dask doesn’t support the following argument(s).
- normalize
- sort
- ascending
- bins
- dropna
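A tiny sketch:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series(['a', 'b', 'a', 'c', 'a']), npartitions=2)
>>> # counts per unique value, most frequent first; NA values are excluded
>>> s.value_counts().compute()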
-
var
(axis=None, skipna=True, ddof=1, split_every=False)¶ Return unbiased variance over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: - axis ({index (0), columns (1)}) –
- skipna (boolean, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- ddof (int, default 1) – degrees of freedom
- numeric_only (boolean, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: var
Return type: Series or DataFrame (if level specified)
Notes
Dask doesn’t support the following argument(s).
- level
- numeric_only
-
visualize
(filename='mydask', format=None, optimize_graph=False, **kwargs)¶ Render the computation of this object’s task graph using graphviz.
Requires
graphviz
to be installed.
Parameters: - filename (str or None, optional) – The name (without an extension) of the file to write to disk. If filename is None, no file will be written, and we communicate with dot using only pipes.
- format ({'png', 'pdf', 'dot', 'svg', 'jpeg', 'jpg'}, optional) – Format in which to write output file. Default is ‘png’.
- optimize_graph (bool, optional) – If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.
- **kwargs – Additional keyword arguments to forward to
to_graphviz
.
Returns: result – See dask.dot.dot_graph for more information.
Return type: IPython.display.Image, IPython.display.SVG, or None
See also
dask.base.visualize()
,dask.dot.dot_graph()
Notes
For more information on optimization see here:
-
where
(cond, other=nan)¶ Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
Parameters: - cond (boolean NDFrame, array or callable) –
If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1.
A callable can be used as cond.
- other (scalar, NDFrame, or callable) –
If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1.
A callable can be used as other.
- inplace (boolean, default False) – Whether to perform the operation in place on the data
- axis (alignment axis if needed, default None) –
- level (alignment level if needed, default None) –
- try_cast (boolean, default False) – try to cast the result back to the input type (if possible),
- raise_on_error (boolean, default True) – Whether to raise on invalid data types (e.g. trying to where on strings)
Returns: wh
Return type: same type as caller
Notes
The where method is an application of the if-then idiom. For each element in the calling DataFrame, if
cond
isTrue
the element is used; otherwise the corresponding element from the DataFrameother
is used.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the where documentation in indexing.
Examples
>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
>>> m = df % 3 == 0
>>> df.where(m, -df)
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
See also
DataFrame.mask()
Extra Notes
Dask doesn’t support the following argument(s).
- inplace
- axis
- level
- try_cast
- raise_on_error
-
dtype
¶ Return data type
-
index
¶ Return dask Index instance
-
known_divisions
¶ Whether divisions are already known
-
loc
¶ Purely label-location based indexer for selection by label.
>>> df.loc["b"]
>>> df.loc["b":"d"]
-
nbytes
¶ Number of bytes
-
ndim
¶ Return dimensionality
-
npartitions
¶ Return number of partitions
-
size
¶ Size of the series
-
values
¶ Return a dask.array of the values of this dataframe
Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.
DataFrameGroupBy¶
-
class
dask.dataframe.groupby.
DataFrameGroupBy
(df, by=None, slice=None)¶ -
agg
(arg, split_every=None, split_out=1)¶ Aggregate using input function or dict of {column -> function}
Parameters: arg (function or dict) – Function to use for aggregating groups. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. If passed a dict, the keys must be DataFrame column names.
- Accepted Combinations are:
- string cythonized function name
- function
- list of functions
- dict of columns -> functions
- nested dict of names -> dicts of functions
Notes
Numpy functions mean/median/prod/sum/std/var are special cased so the default behavior is applying the function along axis=0 (e.g., np.mean(arr_2d, axis=0)) as opposed to mimicking the default Numpy behavior (e.g., np.mean(arr_2d)).
Returns: aggregated Return type: DataFrame
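For instance, a hedged sketch of the dict form (the column names here are hypothetical):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
...                    'x': [1, 2, 3, 4],
...                    'y': [10, 20, 30, 40]})
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> # a different aggregation per column, expressed as {column: function name}
>>> ddf.groupby('key').agg({'x': 'sum', 'y': 'mean'}).compute()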
-
aggregate
(arg, split_every=None, split_out=1)¶ Aggregate using input function or dict of {column -> function}
Parameters: arg (function or dict) – Function to use for aggregating groups. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. If passed a dict, the keys must be DataFrame column names.
- Accepted Combinations are:
- string cythonized function name
- function
- list of functions
- dict of columns -> functions
- nested dict of names -> dicts of functions
Notes
Numpy functions mean/median/prod/sum/std/var are special cased so the default behavior is applying the function along axis=0 (e.g., np.mean(arr_2d, axis=0)) as opposed to mimicking the default Numpy behavior (e.g., np.mean(arr_2d)).
Returns: aggregated Return type: DataFrame
-
apply
(func, meta='__no_default__')¶ Parallel version of pandas GroupBy.apply
This mimics the pandas version except for the following:
- The user should provide output metadata.
- If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
Parameters: - func (function) – Function to apply
- meta (pd.DataFrame, pd.Series, dict, iterable, tuple, optional) – An empty
pd.DataFrame
orpd.Series
that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of aDataFrame
, adict
of{name: dtype}
or iterable of(name, dtype)
can be provided. Instead of a series, a tuple of(name, dtype)
can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providingmeta
is recommended. For more information, seedask.dataframe.utils.make_meta
.
Returns: applied
Return type: Series or DataFrame depending on columns keyword
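A sketch of providing meta so dask does not have to guess the output schema (names are hypothetical):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'], 'x': [1.0, 2.0, 3.0, 4.0]})
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> def demean(group):
...     return group.assign(x=group.x - group.x.mean())
>>> # meta describes the column names and dtypes of each applied chunk
>>> ddf.groupby('key').apply(demean, meta={'key': 'object', 'x': 'f8'}).compute()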
-
count
(split_every=None, split_out=1)¶ Compute count of group, excluding missing values
-
cumcount
(axis=None)¶ Number each item in each group from 0 to the length of that group - 1.
Essentially this is equivalent to
>>> self.apply(lambda x: Series(np.arange(len(x)), x.index))
Parameters: ascending (bool, default True) – If False, number in reverse, from length of group - 1 to 0.
Examples
>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],
...                   columns=['A'])
>>> df
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount()
0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64
>>> df.groupby('A').cumcount(ascending=False)
0    3
1    2
2    1
3    1
4    0
5    0
dtype: int64
See also
pandas.Series.groupby()
,pandas.DataFrame.groupby()
pandas.Panel.groupby()
Notes
Dask doesn’t support the following argument(s).
- ascending
-
cumprod
(axis=0)¶ Cumulative product for each group
-
cumsum
(axis=0)¶ Cumulative sum for each group
-
get_group
(key)¶ Constructs NDFrame from group with provided name
Parameters: - name (object) – the name of the group to get as a DataFrame
- obj (NDFrame, default None) – the NDFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used
Returns: group
Return type: type of obj
Notes
Dask doesn’t support the following argument(s).
- name
- obj
-
max
(split_every=None, split_out=1)¶ Compute max of group values
-
mean
(split_every=None, split_out=1)¶ Compute mean of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
-
min
(split_every=None, split_out=1)¶ Compute min of group values
-
size
(split_every=None, split_out=1)¶ Compute group sizes
-
std
(ddof=1, split_every=None, split_out=1)¶ Compute standard deviation of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
Parameters: ddof (integer, default 1) – degrees of freedom
-
sum
(split_every=None, split_out=1)¶ Compute sum of group values
-
var
(ddof=1, split_every=None, split_out=1)¶ Compute variance of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
Parameters: ddof (integer, default 1) – degrees of freedom
-
SeriesGroupBy¶
-
class
dask.dataframe.groupby.
SeriesGroupBy
(df, by=None, slice=None)¶ -
agg
(arg, split_every=None, split_out=1)¶ Apply aggregation function or functions to groups, yielding most likely Series but in some cases DataFrame depending on the output of the aggregation function
Parameters: func_or_funcs (function or list / dict of functions) – List/dict of functions will produce DataFrame with column names determined by the function names themselves (list) or the keys in the dict Notes
agg is an alias for aggregate. Use it.
Examples
>>> series
bar    1.0
baz    2.0
qot    3.0
qux    4.0
>>> mapper = lambda x: x[0]  # first letter
>>> grouped = series.groupby(mapper)
>>> grouped.aggregate(np.sum)
b    3.0
q    7.0
>>> grouped.aggregate([np.sum, np.mean, np.std])
   mean  std  sum
b   1.5  0.5    3
q   3.5  0.5    7
>>> grouped.agg({'result' : lambda x: x.mean() / x.std(),
...              'total' : np.sum})
   result  total
b   2.121      3
q    4.95      7
See also
apply()
,transform()
Returns: Series or DataFrame
Extra Notes
Dask doesn’t support the following argument(s).
- func_or_funcs
-
aggregate
(arg, split_every=None, split_out=1)¶ Apply aggregation function or functions to groups, yielding most likely Series but in some cases DataFrame depending on the output of the aggregation function
Parameters: func_or_funcs (function or list / dict of functions) – List/dict of functions will produce DataFrame with column names determined by the function names themselves (list) or the keys in the dict Notes
agg is an alias for aggregate. Use it.
Examples
>>> series
bar    1.0
baz    2.0
qot    3.0
qux    4.0
>>> mapper = lambda x: x[0]  # first letter
>>> grouped = series.groupby(mapper)
>>> grouped.aggregate(np.sum)
b    3.0
q    7.0
>>> grouped.aggregate([np.sum, np.mean, np.std])
   mean  std  sum
b   1.5  0.5    3
q   3.5  0.5    7
>>> grouped.agg({'result' : lambda x: x.mean() / x.std(),
...              'total' : np.sum})
   result  total
b   2.121      3
q    4.95      7
See also
apply()
,transform()
Returns: Series or DataFrame
Extra Notes
Dask doesn’t support the following argument(s).
- func_or_funcs
-
apply
(func, meta='__no_default__')¶ Parallel version of pandas GroupBy.apply
This mimics the pandas version except for the following:
- The user should provide output metadata.
- If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
Parameters: - func (function) – Function to apply
- meta (pd.DataFrame, pd.Series, dict, iterable, tuple, optional) – An empty
pd.DataFrame
orpd.Series
that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of aDataFrame
, adict
of{name: dtype}
or iterable of(name, dtype)
can be provided. Instead of a series, a tuple of(name, dtype)
can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providingmeta
is recommended. For more information, seedask.dataframe.utils.make_meta
.
Returns: applied
Return type: Series or DataFrame depending on columns keyword
-
count
(split_every=None, split_out=1)¶ Compute count of group, excluding missing values
-
cumcount
(axis=None)¶ Number each item in each group from 0 to the length of that group - 1.
Essentially this is equivalent to
>>> self.apply(lambda x: Series(np.arange(len(x)), x.index))
Parameters: ascending (bool, default True) – If False, number in reverse, from length of group - 1 to 0.
Examples
>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],
...                   columns=['A'])
>>> df
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount()
0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64
>>> df.groupby('A').cumcount(ascending=False)
0    3
1    2
2    1
3    1
4    0
5    0
dtype: int64
See also
pandas.Series.groupby()
,pandas.DataFrame.groupby()
pandas.Panel.groupby()
Notes
Dask doesn’t support the following argument(s).
- ascending
-
cumprod
(axis=0)¶ Cumulative product for each group
-
cumsum
(axis=0)¶ Cumulative sum for each group
-
get_group
(key)¶ Constructs NDFrame from group with provided name
Parameters: - name (object) – the name of the group to get as a DataFrame
- obj (NDFrame, default None) – the NDFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used
Returns: group
Return type: type of obj
Notes
Dask doesn’t support the following argument(s).
- name
- obj
-
max
(split_every=None, split_out=1)¶ Compute max of group values
-
mean
(split_every=None, split_out=1)¶ Compute mean of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
-
min
(split_every=None, split_out=1)¶ Compute min of group values
-
size
(split_every=None, split_out=1)¶ Compute group sizes
-
std
(ddof=1, split_every=None, split_out=1)¶ Compute standard deviation of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
Parameters: ddof (integer, default 1) – degrees of freedom
-
sum
(split_every=None, split_out=1)¶ Compute sum of group values
-
var
(ddof=1, split_every=None, split_out=1)¶ Compute variance of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
Parameters: ddof (integer, default 1) – degrees of freedom
-
Storage and Conversion¶
-
dask.dataframe.
read_csv
(urlpath, blocksize=64000000, collection=True, lineterminator=None, compression=None, sample=256000, enforce=False, assume_missing=False, storage_options=None, **kwargs)¶ Read CSV files into a Dask.DataFrame
This parallelizes the
pandas.read_csv
function in the following ways:It supports loading many files at once using globstrings:
>>> df = dd.read_csv('myfiles.*.csv')
In some cases it can break up large files:
>>> df = dd.read_csv('largefile.csv', blocksize=25e6) # 25MB chunks
It can read CSV files from external resources (e.g. S3, HDFS) by providing a URL:
>>> df = dd.read_csv('s3://bucket/myfiles.*.csv')
>>> df = dd.read_csv('hdfs:///myfiles.*.csv')
>>> df = dd.read_csv('hdfs://namenode.example.com/myfiles.*.csv')
Internally
dd.read_csv
usespandas.read_csv
and supports many of the same keyword arguments with the same performance guarantees. See the docstring forpandas.read_csv
for more information on available keyword arguments.Parameters: - urlpath (string) – Absolute or relative filepath, URL (may include protocols like
s3://
), or globstring for CSV files. - blocksize (int or None, optional) – Number of bytes by which to cut up larger files. Default value is
computed based on available physical memory and the number of cores.
If
None
, use a single block for each file. - collection (boolean, optional) – Return a dask.dataframe if True or list of dask.delayed objects if False
- sample (int, optional) – Number of bytes to use when determining dtypes
- assume_missing (bool, optional) – If True, all integer columns that aren’t specified in
dtype
are assumed to contain missing values, and are converted to floats. Default is False. - storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc.
- **kwargs – Extra keyword arguments to forward to
pandas.read_csv
.
Notes
Dask dataframe tries to infer the
dtype
of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if thedtype
is different later in the file (or in other files) this can cause issues. For example, if all the rows in the sample had integer dtypes, but later on there was aNaN
, then this would error at compute time. To fix this, you have a few options:- Provide explicit dtypes for the offending columns using the
dtype
keyword. This is the recommended solution. - Use the
assume_missing
keyword to assume that all columns inferred as integers contain missing values, and convert them to floats. - Increase the size of the sample using the
sample
keyword.
It should also be noted that this function may fail if a CSV file includes quoted strings that contain the line terminator. To get around this you can specify
blocksize=None
to not split files into multiple partitions, at the cost of reduced parallelism.
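For example, the first two workarounds might look like this (the file pattern and column name are hypothetical):
>>> import dask.dataframe as dd
>>> # option 1: give the offending column an explicit dtype
>>> df = dd.read_csv('myfiles.*.csv', dtype={'user_id': 'float64'})
>>> # option 2: treat every inferred-integer column as possibly missing (float)
>>> df = dd.read_csv('myfiles.*.csv', assume_missing=True)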
-
dask.dataframe.
read_table
(urlpath, blocksize=64000000, collection=True, lineterminator=None, compression=None, sample=256000, enforce=False, assume_missing=False, storage_options=None, **kwargs)¶ Read delimited files into a Dask.DataFrame
This parallelizes the
pandas.read_table
function in the following ways:It supports loading many files at once using globstrings:
>>> df = dd.read_table('myfiles.*.csv')
In some cases it can break up large files:
>>> df = dd.read_table('largefile.csv', blocksize=25e6) # 25MB chunks
It can read CSV files from external resources (e.g. S3, HDFS) by providing a URL:
>>> df = dd.read_table('s3://bucket/myfiles.*.csv')
>>> df = dd.read_table('hdfs:///myfiles.*.csv')
>>> df = dd.read_table('hdfs://namenode.example.com/myfiles.*.csv')
Internally
dd.read_table
usespandas.read_table
and supports many of the same keyword arguments with the same performance guarantees. See the docstring forpandas.read_table
for more information on available keyword arguments.Parameters: - urlpath (string) – Absolute or relative filepath, URL (may include protocols like
s3://
), or globstring for delimited files. - blocksize (int or None, optional) – Number of bytes by which to cut up larger files. Default value is
computed based on available physical memory and the number of cores.
If
None
, use a single block for each file. - collection (boolean, optional) – Return a dask.dataframe if True or list of dask.delayed objects if False
- sample (int, optional) – Number of bytes to use when determining dtypes
- assume_missing (bool, optional) – If True, all integer columns that aren’t specified in
dtype
are assumed to contain missing values, and are converted to floats. Default is False. - storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc.
- **kwargs – Extra keyword arguments to forward to
pandas.read_table
.
Notes
Dask dataframe tries to infer the
dtype
of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if thedtype
is different later in the file (or in other files) this can cause issues. For example, if all the rows in the sample had integer dtypes, but later on there was aNaN
, then this would error at compute time. To fix this, you have a few options:- Provide explicit dtypes for the offending columns using the
dtype
keyword. This is the recommended solution. - Use the
assume_missing
keyword to assume that all columns inferred as integers contain missing values, and convert them to floats. - Increase the size of the sample using the
sample
keyword.
It should also be noted that this function may fail if a delimited file includes quoted strings that contain the line terminator. To get around this you can specify
blocksize=None
to not split files into multiple partitions, at the cost of reduced parallelism.
-
dask.dataframe.
read_parquet
(path, columns=None, filters=None, categories=None, index=None, storage_options=None, engine='auto')¶ Read ParquetFile into a Dask DataFrame
This reads a directory of Parquet data into a Dask.dataframe, one file per partition. It selects the index among the sorted columns if any exist.
Parameters: - path (string) – Source directory for data. May be a glob string.
Prepend with protocol like
s3://
orhdfs://
for remote data. - columns (list or None) – List of column names to load
- filters (list) – List of filters to apply, like
[('x', '>' 0), ...]
. This implements row-group (partition) -level filtering only, i.e., to prevent the loading of some chunks of the data, and only if relevant statistics have been included in the metadata. - index (string or None (default) or False) – Name of index column to use if that column is sorted; False to force dask to not use any column as the index
- categories (list, dict or None) – For any fields listed here, if the parquet encoding is Dictionary, the column will be created with dtype category. Use only if it is guaranteed that the column is encoded as dictionary in all row-groups. If a list, assumes up to 2**16-1 labels; if a dict, specify the number of labels expected; if None, will load categories automatically for data written by dask/fastparquet, not otherwise.
- storage_options (dict) – Key/value pairs to be passed on to the file-system backend, if any.
- engine ({'auto', 'fastparquet', 'arrow'}, default 'auto') – Parquet reader library to use. If only one library is installed, it will use that one; if both, it will use ‘fastparquet’
Examples
>>> df = read_parquet('s3://bucket/my-parquet-data')
See also
-
dask.dataframe.
read_hdf
(pattern, key, start=0, stop=None, columns=None, chunksize=1000000, sorted_index=False, lock=True, mode='a')¶ Read HDF files into a Dask DataFrame
Read hdf files into a dask dataframe. This function is like
pandas.read_hdf
, except it can read from a single large file, or from multiple files, or from multiple keys from the same file.Parameters: - pattern (string, list) – File pattern (string), buffer to read from, or list of file paths. Can contain wildcards.
- key (group identifier in the store. Can contain wildcards) –
- start (optional, integer (defaults to 0), row number to start at) –
- stop (optional, integer (defaults to None, the last row), row number to) – stop at
- columns (list of columns, optional) – A list of columns that if not None, will limit the return columns (default is None)
- chunksize (positive integer, optional) – Maximal number of rows per partition (default is 1000000).
- sorted_index (boolean, optional) – Option to specify whether or not the input hdf files have a sorted index (default is False).
- lock (boolean, optional) – Option to use a lock to prevent concurrency issues (default is True).
- mode ({'a', 'r', 'r+'}, default 'a') – Mode to use when opening file(s).
  - 'r': Read-only; no data can be modified.
  - 'a': Append; an existing file is opened for reading and writing, and if the file does not exist it is created.
  - 'r+': Similar to 'a', but the file must already exist.
Returns: Return type: dask.DataFrame
Examples
Load single file
>>> dd.read_hdf('myfile.1.hdf5', '/x')
Load multiple files
>>> dd.read_hdf('myfile.*.hdf5', '/x')
>>> dd.read_hdf(['myfile.1.hdf5', 'myfile.2.hdf5'], '/x')
Load multiple datasets
>>> dd.read_hdf('myfile.1.hdf5', '/*')
-
dask.dataframe.
read_sql_table
(table, uri, index_col, divisions=None, npartitions=None, limits=None, columns=None, bytes_per_chunk=268435456, **kwargs)¶ Create dataframe from an SQL table.
If neither divisions nor npartitions is given, the memory footprint of the first five rows will be determined, and partitions of size ~256MB will be used.
Parameters: - table (string or sqlalchemy expression) – Select columns from here.
- uri (string) – Full sqlalchemy URI for the database connection
- index_col (string) – Column which becomes the index, and defines the partitioning. Should
be an indexed column in the SQL server, and numerical.
Could be a function to return a value, e.g.,
sql.func.abs(sql.column('value')).label('abs(value)')
. Labeling columns created by functions or arithmetic operations is required. - divisions (sequence) – Values of the index column to split the table by.
- npartitions (int) – Number of partitions, if divisions is not given. Will split the values of the index column linearly between limits, if given, or the column max/min.
- limits (2-tuple or None) – Manually give upper and lower range of values for use with npartitions; if None, first fetches max/min from the DB. Upper limit, if given, is inclusive.
- columns (list of strings or None) – Which columns to select; if None, gets all; can include sqlalchemy
functions, e.g.,
sql.func.abs(sql.column('value')).label('abs(value)')
. Labeling columns created by functions or arithmetic operations is recommended. - bytes_per_chunk (int) – If both divisions and npartitions is None, this is the target size of each partition, in bytes
- kwargs (dict) – Additional parameters to pass to pd.read_sql()
Returns: Return type: dask.dataframe
Examples
>>> df = dd.read_sql_table('accounts', 'sqlite:///path/to/bank.db',
...                        npartitions=10, index_col='id')
-
dask.dataframe.
from_array
(x, chunksize=50000, columns=None)¶ Read any slicable array into a Dask Dataframe
Uses getitem syntax to pull slices out of the array. The array need not be a NumPy array but must support slicing syntax
    x[50000:100000]
and have 2 dimensions:
    x.ndim == 2
or have a record dtype:
    x.dtype == [('name', 'O'), ('balance', 'i8')]
-
dask.dataframe.
from_pandas
(data, npartitions=None, chunksize=None, sort=True, name=None)¶ Construct a Dask DataFrame from a Pandas DataFrame
This splits an in-memory Pandas dataframe into several parts and constructs a dask.dataframe from those parts on which Dask.dataframe can operate in parallel.
Note that, despite parallelism, Dask.dataframe may not always be faster than Pandas. We recommend that you stay with Pandas for as long as possible before switching to Dask.dataframe.
Parameters: - df (pandas.DataFrame or pandas.Series) – The DataFrame/Series with which to construct a Dask DataFrame/Series
- npartitions (int, optional) – The number of partitions of the index to create. Note that depending on the size and index of the dataframe, the output may have fewer partitions than requested.
- chunksize (int, optional) – The number of rows per index partition to use.
- sort (bool) – Sort input first to obtain cleanly divided partitions or don’t sort and don’t get cleanly divided partitions
- name (string, optional) – An optional keyname for the dataframe. Defaults to hashing the input
Returns: A dask DataFrame/Series partitioned along the index
Return type: dask.DataFrame or dask.Series
Examples
>>> df = pd.DataFrame(dict(a=list('aabbcc'), b=list(range(6))),
...                    index=pd.date_range(start='20100101', periods=6))
>>> ddf = from_pandas(df, npartitions=3)
>>> ddf.divisions
(Timestamp('2010-01-01 00:00:00', freq='D'),
 Timestamp('2010-01-03 00:00:00', freq='D'),
 Timestamp('2010-01-05 00:00:00', freq='D'),
 Timestamp('2010-01-06 00:00:00', freq='D'))
>>> ddf = from_pandas(df.a, npartitions=3)  # Works with Series too!
>>> ddf.divisions
(Timestamp('2010-01-01 00:00:00', freq='D'),
 Timestamp('2010-01-03 00:00:00', freq='D'),
 Timestamp('2010-01-05 00:00:00', freq='D'),
 Timestamp('2010-01-06 00:00:00', freq='D'))
Raises: TypeError
– If something other than apandas.DataFrame
orpandas.Series
is passed in.
See also
from_array()
- Construct a dask.DataFrame from an array that has record dtype
read_csv()
- Construct a dask.DataFrame from a CSV file
-
dask.dataframe.
from_bcolz
(x, chunksize=None, categorize=True, index=None, lock=<thread.lock object>, **kwargs)¶ Read BColz CTable into a Dask Dataframe
BColz is a fast on-disk compressed column store with careful attention given to compression. https://bcolz.readthedocs.io/en/latest/
Parameters: - x (bcolz.ctable) –
- chunksize (int, optional) – The size(rows) of blocks to pull out from ctable.
- categorize (bool, defaults to True) – Automatically categorize all string dtypes
- index (string, optional) – Column to make the index
- lock (bool or Lock) – Lock to use when reading or False for no lock (not-thread-safe)
See also
from_array()
- more generic function not optimized for bcolz
-
dask.dataframe.
from_dask_array
(x, columns=None)¶ Create a Dask DataFrame from a Dask Array.
Converts a 2d array into a DataFrame and a 1d array into a Series.
Parameters: - x (da.Array) –
- columns (list or string) – list of column names if DataFrame, single string if Series
Examples
>>> import dask.array as da
>>> import dask.dataframe as dd
>>> x = da.ones((4, 2), chunks=(2, 2))
>>> df = dd.io.from_dask_array(x, columns=['a', 'b'])
>>> df.compute()
     a    b
0  1.0  1.0
1  1.0  1.0
2  1.0  1.0
3  1.0  1.0
See also
dask.bag.to_dataframe()
- from dask.bag
dask.dataframe._Frame.values()
- Reverse conversion
dask.dataframe._Frame.to_records()
- Reverse conversion
-
dask.dataframe.
from_delayed
(dfs, meta=None, divisions=None, prefix='from-delayed')¶ Create Dask DataFrame from many Dask Delayed objects
Parameters: - dfs (list of Delayed) – An iterable of
dask.delayed.Delayed
objects, such as come fromdask.delayed
These comprise the individual partitions of the resulting dataframe. - meta (pd.DataFrame, pd.Series, dict, iterable, tuple, optional) – An empty
pd.DataFrame
orpd.Series
that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of aDataFrame
, adict
of{name: dtype}
or iterable of(name, dtype)
can be provided. Instead of a series, a tuple of(name, dtype)
can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providingmeta
is recommended. For more information, seedask.dataframe.utils.make_meta
. - divisions (tuple, str, optional) – Partition boundaries along the index. For tuple, see http://dask.pydata.org/en/latest/dataframe-design.html#partitions For string ‘sorted’ will compute the delayed values to find index values. Assumes that the indexes are mutually sorted. If None, then won’t use index information
- prefix (str, optional) – Prefix to prepend to the keys.
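A small sketch of building a dataframe from delayed partitions (the load function is hypothetical):
>>> import pandas as pd
>>> import dask
>>> import dask.dataframe as dd
>>> @dask.delayed
... def load(start):
...     # stand-in for reading one partition from an external source
...     return pd.DataFrame({'x': range(start, start + 3)})
>>> parts = [load(i) for i in (0, 3, 6)]
>>> # meta tells dask the column names and dtypes without computing anything
>>> ddf = dd.from_delayed(parts, meta={'x': 'int64'})
>>> ddf.compute()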
-
dask.dataframe.
to_delayed
(df)¶ Create Dask Delayed objects from a Dask Dataframe
Returns a list of delayed values, one value per partition.
Examples
>>> partitions = df.to_delayed()
-
dask.dataframe.
to_records
(df)¶ Create Dask Array from a Dask Dataframe
Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.
Examples
>>> df.to_records() dask.array<shape=(nan,), dtype=(numpy.record, [('ind', '<f8'), ('x', 'O'), ('y', '<i8')]), chunksize=(nan,)>
See also
dask.dataframe._Frame.values()
,dask.dataframe.from_dask_array()
-
dask.dataframe.
to_csv
(df, filename, name_function=None, compression=None, compute=True, get=None, storage_options=None, **kwargs)¶ Store Dask DataFrame to CSV files
One filename per partition will be created. You can specify the filenames in a variety of ways.
Use a globstring:
>>> df.to_csv('/path/to/data/export-*.csv')
The * will be replaced by the increasing sequence 0, 1, 2, ...
/path/to/data/export-0.csv /path/to/data/export-1.csv
Use a globstring and a name_function= keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.
>>> from datetime import date, timedelta
>>> def name(i):
...     return str(date(2015, 1, 1) + i * timedelta(days=1))
>>> name(0)
'2015-01-01'
>>> name(15)
'2015-01-16'
>>> df.to_csv('/path/to/data/export-*.csv', name_function=name)
/path/to/data/export-2015-01-01.csv /path/to/data/export-2015-01-02.csv ...
You can also provide an explicit list of paths:
>>> paths = ['/path/to/data/alice.csv', '/path/to/data/bob.csv', ...]
>>> df.to_csv(paths)
Parameters: - filename (string) – Path glob indicating the naming scheme for the output files
- name_function (callable, default None) – Function accepting an integer (partition index) and producing a string to replace the asterisk in the given filename globstring. Should preserve the lexicographic order of partitions
- compression (string, optional) – String like ‘gzip’ or ‘xz’. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically
- sep (character, default ',') – Field delimiter for the output file
- na_rep (string, default '') – Missing data representation
- float_format (string, default None) – Format string for floating point numbers
- columns (sequence, optional) – Columns to write
- header (boolean or list of string, default True) – Write out column names. If a list of string is given it is assumed to be aliases for the column names
- index (boolean, default True) – Write row names (index)
- index_label (string or sequence, or False, default None) – Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R
- nanRep (None) – deprecated, use na_rep
- mode (str) – Python write mode, default ‘w’
- encoding (string, optional) – A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
- compression – a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename
- line_terminator (string, default '\n') – The newline character or character sequence to use in the output file
- quoting (optional constant from csv module) – defaults to csv.QUOTE_MINIMAL
- quotechar (string (length 1), default '"') – character used to quote fields
- doublequote (boolean, default True) – Control quoting of quotechar inside a field
- escapechar (string (length 1), default None) – character used to escape sep and quotechar when appropriate
- chunksize (int or None) – rows to write at a time
- tupleize_cols (boolean, default False) – write multi_index columns as a list of tuples (if True) or new (expanded format) if False)
- date_format (string, default None) – Format string for datetime objects
- decimal (string, default '.') – Character recognized as decimal separator. E.g. use ‘,’ for European data
- storage_options (dict) – Parameters passed on to the backend filesystem class.
Returns: - The names of the file written if they were computed right away
- If not, the delayed tasks associated to the writing of the files
-
dask.dataframe.
to_bag
(df, index=False)¶ Create Dask Bag from a Dask DataFrame
Parameters: index (bool, optional) – If True, the elements are tuples of (index, value)
, otherwise they’re just thevalue
. Default is False.Examples
>>> bag = df.to_bag()
-
dask.dataframe.
to_hdf
(df, path, key, mode='a', append=False, get=None, name_function=None, compute=True, lock=None, dask_kwargs={}, **kwargs)¶ Store Dask Dataframe to Hierarchical Data Format (HDF) files
This is a parallel version of the Pandas function of the same name. Please see the Pandas docstring for more detailed information about shared keyword arguments.
This function differs from the Pandas version by saving the many partitions of a Dask DataFrame in parallel, either to many files, or to many datasets within the same file. You may specify this parallelism with an asterisk * within the filename or datapath, and an optional name_function. The asterisk will be replaced with an increasing sequence of integers starting from 0 or with the result of calling name_function on each of those integers.
This function only supports the Pandas 'table' format, not the more specialized 'fixed' format.
Parameters: - path (string) – Path to a target filename. May contain a
*
to denote many filenames - key (string) – Datapath within the files. May contain a
*
to denote many locations - name_function (function) – A function to convert the
*
in the above options to a string. Should take in a number from 0 to the number of partitions and return a string. (see examples below) - compute (bool) – Whether or not to execute immediately. If False then this returns a
dask.Delayed
value. - lock (Lock, optional) – Lock to use to prevent concurrency issues. By default a
threading.Lock
,multiprocessing.Lock
orSerializableLock
will be used depending on your scheduler if a lock is required. See dask.utils.get_scheduler_lock for more information about lock selection. - **other – See pandas.to_hdf for more information
Examples
Save Data to a single file
>>> df.to_hdf('output.hdf', '/data')
Save data to multiple datapaths within the same file:
>>> df.to_hdf('output.hdf', '/data-*')
Save data to multiple files:
>>> df.to_hdf('output-*.hdf', '/data')
Save data to multiple files, using the multiprocessing scheduler:
>>> df.to_hdf('output-*.hdf', '/data', get=dask.multiprocessing.get)
Specify a custom naming scheme. This writes files as '2000-01-01.hdf', '2000-01-02.hdf', '2000-01-03.hdf', etc.
>>> from datetime import date, timedelta
>>> base = date(year=2000, month=1, day=1)
>>> def name_function(i):
...     ''' Convert integer 0 to n to a string '''
...     return str(base + timedelta(days=i))
>>> df.to_hdf('*.hdf', '/data', name_function=name_function)
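An additional, hedged sketch of deferred writing with compute=False; nothing is written until the delayed value is computed with dask.compute:
>>> import dask
>>> out = df.to_hdf('output-*.hdf', '/data', compute=False)  # returns a delayed value, no files yet
>>> dask.compute(out)  # perform the writes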
Returns: - None (if compute == True)
- delayed value (if compute == False)
See also
- dask.dataframe.to_parquet(path, df, compression=None, write_index=None, has_nulls=True, fixed_text=None, object_encoding=None, storage_options=None, append=False, ignore_divisions=False, partition_on=None, compute=True)¶ Store Dask.dataframe to Parquet files
Notes
Each partition will be written to a separate file.
Parameters: - path (string) – Destination directory for data. Prepend with protocol like s3:// or hdfs:// for remote data.
- df (Dask.dataframe) –
- compression (string or dict) – Either a string like "SNAPPY" or a dictionary mapping column names to compressors like {"name": "GZIP", "values": "SNAPPY"}
- write_index (boolean) – Whether or not to write the index. Defaults to True if divisions are known.
- has_nulls (bool, list or 'infer') – Specifies whether to write NULLs information for columns. If bools, apply to all columns, if list, use for only the named columns, if ‘infer’, use only for columns which don’t have a sentinel NULL marker (currently object columns only).
- fixed_text (dict {col: int}) – For column types that are written as bytes (bytes, utf8 strings, or json and bson-encoded objects), if a column is included here, the data will be written in fixed-length format, which should be faster but can potentially result in truncation.
- object_encoding (dict {col: bytes|utf8|json|bson} or str) – For object columns, specify how to encode to bytes. If a str, same encoding is applied to all object columns.
- storage_options (dict) – Key/value pairs to be passed on to the file-system backend, if any.
- append (bool (False)) – If False, construct data-set from scratch; if True, add new row-group(s) to existing data-set. In the latter case, the data-set must exist, and the schema must match the input data.
- ignore_divisions (bool (False)) – If False raises error when previous divisions overlap with the new appended divisions. Ignored if append=False.
- partition_on (list) – Construct directory-based partitioning by splitting on these fields’ values. Each dask partition will result in one or more datafiles, there will be no global groupby.
- compute (bool (True)) – If true (default) then we compute immediately. If False then we return a dask.delayed object for future computation.
This uses the fastparquet project: http://fastparquet.readthedocs.io/en/latest
Examples
>>> df = dd.read_csv(...)
>>> to_parquet('/path/to/output/', df, compression='SNAPPY')
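A further hedged sketch of directory-based partitioning; the 'year' column is illustrative and fastparquet is assumed to be installed:
>>> to_parquet('/path/to/output/', df, compression='SNAPPY',
...            partition_on=['year'])  # one directory of files per distinct 'year' value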
See also
read_parquet()
- Read parquet data to dask.dataframe
Rolling¶
- dask.dataframe.rolling.rolling_apply(arg, window, *args, **kwargs)¶ Generic moving function application.
Parameters: - arg (Series, DataFrame) –
- window (int) – Size of the moving window. This is the number of observations used for calculating the statistic.
- func (function) – Must produce a single value from an ndarray input
- min_periods (int, default None) – Minimum number of observations in window required to have a value (otherwise result is NA).
- freq (string or DateOffset object, optional (default None)) – Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
- center (boolean, default False) – Whether the label should correspond with center of window
- args (tuple) – Passed on to func
- kwargs (dict) – Passed on to func
Returns: y
Return type: type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
To learn more about the frequency strings, please see this link.
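A minimal hedged sketch, assuming the wrapper forwards its extra arguments to the pandas function of the same name, so func can be supplied positionally:
>>> import numpy as np
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask.dataframe.rolling import rolling_apply
>>> s = dd.from_pandas(pd.Series(range(10)), npartitions=2)
>>> rolling_apply(s, 3, np.sum).compute()  # 3-element moving sum; the first two values are NaN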
- dask.dataframe.rolling.map_overlap(func, df, before, after, *args, **kwargs)¶ Apply a function to each partition, sharing rows with adjacent partitions.
Parameters: - func (function) – Function applied to each partition.
- df (dd.DataFrame, dd.Series) –
- before (int or timedelta) – The rows to prepend to partition i from the end of partition i - 1.
- after (int or timedelta) – The rows to append to partition i from the beginning of partition i + 1.
- args, kwargs – Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.
See also
dd.DataFrame.map_overlap()
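A hedged usage sketch with illustrative data: sharing one row backwards lets a per-partition diff() line up correctly at partition boundaries.
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask.dataframe.rolling import map_overlap
>>> s = dd.from_pandas(pd.Series(range(10)), npartitions=3)
>>> map_overlap(lambda p: p.diff(), s, 1, 0).compute()  # matches pd.Series(range(10)).diff()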
- dask.dataframe.rolling.rolling_count(arg, window, *args, **kwargs)¶ Rolling count of number of non-NaN observations inside provided window.
Parameters: - arg (DataFrame or numpy ndarray-like) –
- window (int) – Size of the moving window. This is the number of observations used for calculating the statistic.
- freq (string or DateOffset object, optional (default None)) – Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
- center (boolean, default False) – Whether the label should correspond with center of window
- how (string, default 'mean') – Method for down- or re-sampling
Returns: rolling_count
Return type: type of caller
Notes
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
To learn more about the frequency strings, please see this link.
- dask.dataframe.rolling.rolling_kurt(arg, window, *args, **kwargs)¶ Unbiased moving kurtosis.
Parameters: - arg (Series, DataFrame) –
- window (int) – Size of the moving window. This is the number of observations used for calculating the statistic.
- min_periods (int, default None) – Minimum number of observations in window required to have a value (otherwise result is NA).
- freq (string or DateOffset object, optional (default None)) – Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
- center (boolean, default False) – Set the labels at the center of the window.
- how (string, default 'None') – Method for down- or re-sampling
Returns: y
Return type: type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
- dask.dataframe.rolling.rolling_max(arg, window, *args, **kwargs)¶ Moving maximum.
Parameters: - arg (Series, DataFrame) –
- window (int) – Size of the moving window. This is the number of observations used for calculating the statistic.
- min_periods (int, default None) – Minimum number of observations in window required to have a value (otherwise result is NA).
- freq (string or DateOffset object, optional (default None)) – Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
- center (boolean, default False) – Set the labels at the center of the window.
- how (string, default 'max') – Method for down- or re-sampling
Returns: y
Return type: type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
- dask.dataframe.rolling.rolling_mean(arg, window, *args, **kwargs)¶ Moving mean.
Parameters: - arg (Series, DataFrame) –
- window (int) – Size of the moving window. This is the number of observations used for calculating the statistic.
- min_periods (int, default None) – Minimum number of observations in window required to have a value (otherwise result is NA).
- freq (string or DateOffset object, optional (default None)) – Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
- center (boolean, default False) – Set the labels at the center of the window.
- how (string, default 'None') – Method for down- or re-sampling
Returns: y
Return type: type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
- dask.dataframe.rolling.rolling_median(arg, window, *args, **kwargs)¶ Moving median.
Parameters: - arg (Series, DataFrame) –
- window (int) – Size of the moving window. This is the number of observations used for calculating the statistic.
- min_periods (int, default None) – Minimum number of observations in window required to have a value (otherwise result is NA).
- freq (string or DateOffset object, optional (default None)) – Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
- center (boolean, default False) – Set the labels at the center of the window.
- how (string, default 'median') – Method for down- or re-sampling
Returns: y
Return type: type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
- dask.dataframe.rolling.rolling_min(arg, window, *args, **kwargs)¶ Moving minimum.
Parameters: - arg (Series, DataFrame) –
- window (int) – Size of the moving window. This is the number of observations used for calculating the statistic.
- min_periods (int, default None) – Minimum number of observations in window required to have a value (otherwise result is NA).
- freq (string or DateOffset object, optional (default None)) – Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
- center (boolean, default False) – Set the labels at the center of the window.
- how (string, default 'min') – Method for down- or re-sampling
Returns: y
Return type: type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
- dask.dataframe.rolling.rolling_quantile(arg, window, *args, **kwargs)¶ Moving quantile.
Parameters: - arg (Series, DataFrame) –
- window (int) – Size of the moving window. This is the number of observations used for calculating the statistic.
- quantile (float) – 0 <= quantile <= 1
- min_periods (int, default None) – Minimum number of observations in window required to have a value (otherwise result is NA).
- freq (string or DateOffset object, optional (default None)) – Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
- center (boolean, default False) – Whether the label should correspond with center of window
Returns: y
Return type: type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
To learn more about the frequency strings, please see this link.
- dask.dataframe.rolling.rolling_skew(arg, window, *args, **kwargs)¶ Unbiased moving skewness.
Parameters: - arg (Series, DataFrame) –
- window (int) – Size of the moving window. This is the number of observations used for calculating the statistic.
- min_periods (int, default None) – Minimum number of observations in window required to have a value (otherwise result is NA).
- freq (string or DateOffset object, optional (default None)) – Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
- center (boolean, default False) – Set the labels at the center of the window.
- how (string, default 'None') – Method for down- or re-sampling
Returns: y
Return type: type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
- dask.dataframe.rolling.rolling_std(arg, window, *args, **kwargs)¶ Moving standard deviation.
Parameters: - arg (Series, DataFrame) –
- window (int) – Size of the moving window. This is the number of observations used for calculating the statistic.
- min_periods (int, default None) – Minimum number of observations in window required to have a value (otherwise result is NA).
- freq (string or DateOffset object, optional (default None)) – Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
- center (boolean, default False) – Set the labels at the center of the window.
- how (string, default 'None') – Method for down- or re-sampling
- ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
Returns: y
Return type: type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
- dask.dataframe.rolling.rolling_sum(arg, window, *args, **kwargs)¶ Moving sum.
Parameters: - arg (Series, DataFrame) –
- window (int) – Size of the moving window. This is the number of observations used for calculating the statistic.
- min_periods (int, default None) – Minimum number of observations in window required to have a value (otherwise result is NA).
- freq (string or DateOffset object, optional (default None)) – Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
- center (boolean, default False) – Set the labels at the center of the window.
- how (string, default 'None') – Method for down- or re-sampling
Returns: y
Return type: type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
- dask.dataframe.rolling.rolling_var(arg, window, *args, **kwargs)¶ Moving variance.
Parameters: - arg (Series, DataFrame) –
- window (int) – Size of the moving window. This is the number of observations used for calculating the statistic.
- min_periods (int, default None) – Minimum number of observations in window required to have a value (otherwise result is NA).
- freq (string or DateOffset object, optional (default None)) – Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
- center (boolean, default False) – Set the labels at the center of the window.
- how (string, default 'None') – Method for down- or re-sampling
- ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
Returns: y
Return type: type of input argument
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
- dask.dataframe.rolling.rolling_window(arg, window, **kwargs)¶ Applies a moving window of type window_type and size window on the data.
Parameters: - arg (Series, DataFrame) –
- window (int or ndarray) – Weighting window specification. If the window is an integer, then it is treated as the window length and win_type is required
- win_type (str, default None) – Window type (see Notes)
- min_periods (int, default None) – Minimum number of observations in window required to have a value (otherwise result is NA).
- freq (string or DateOffset object, optional (default None)) – Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
- center (boolean, default False) – Whether the label should correspond with center of window
- mean (boolean, default True) – If True computes weighted mean, else weighted sum
- axis ({0, 1}, default 0) –
- how (string, default 'mean') – Method for down- or re-sampling
Returns: y
Return type: type of input argument
Notes
The recognized window types are:
- boxcar
- triang
- blackman
- hamming
- bartlett
- parzen
- bohman
- blackmanharris
- nuttall
- barthann
- kaiser (needs beta)
- gaussian (needs std)
- general_gaussian (needs power, width)
- slepian (needs width)
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
To learn more about the frequency strings, please see this link.
Other functions¶
- dask.dataframe.compute(*args, **kwargs)¶ Compute several dask collections at once.
Parameters: - args (object) – Any number of objects. If it is a dask object, it's computed and the result is returned. By default, python builtin collections are also traversed to look for dask objects (for more information see the traverse keyword). Non-dask arguments are passed through unchanged.
- traverse (bool, optional) – By default dask traverses builtin python collections looking for dask objects passed to compute. For large collections this can be expensive. If none of the arguments contain any dask objects, set traverse=False to avoid doing this traversal.
- get (callable, optional) – A scheduler get function to use. If not provided, the default is to check the global settings first, and then fall back to defaults for the collections.
- optimize_graph (bool, optional) – If True [default], the optimizations for each collection are applied before computation. Otherwise the graph is run as is. This can be useful for debugging.
- kwargs – Extra keywords to forward to the scheduler get function.
Examples
>>> import dask.array as da
>>> a = da.arange(10, chunks=2).sum()
>>> b = da.arange(10, chunks=2).mean()
>>> compute(a, b)
(45, 4.5)
By default, dask objects inside python collections will also be computed:
>>> compute({'a': a, 'b': b, 'c': 1})
({'a': 45, 'b': 4.5, 'c': 1},)
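A further hedged sketch in dataframe terms (the frame is illustrative); computing both results in one call lets them share the same underlying partitions:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = dd.from_pandas(pd.DataFrame({'x': range(6)}), npartitions=2)
>>> total, mean = dd.compute(df.x.sum(), df.x.mean())
>>> total, mean
(15, 2.5)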
- dask.dataframe.map_partitions(func, *args, **kwargs)¶ Apply Python function on each DataFrame partition.
Parameters: - func (function) – Function applied to each partition.
- args, kwargs – Arguments and keywords to pass to the function. At least one of the args should be a Dask.dataframe.
- meta (pd.DataFrame, pd.Series, dict, iterable, tuple, optional) – An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
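A minimal hedged sketch (the columns and function are illustrative), passing meta as a (name, dtype) tuple for the Series that comes back:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(6), 'y': range(6)}), npartitions=2)
>>> dd.map_partitions(lambda part: part.x + part.y, ddf, meta=('x', 'i8')).compute()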
- dask.dataframe.multi.concat(dfs, axis=0, join='outer', interleave_partitions=False)¶ Concatenate DataFrames along rows.
- When axis=0 (default), concatenate DataFrames row-wise:
- If all divisions are known and ordered, concatenate DataFrames keeping divisions. When divisions are not ordered, specifying interleave_partitions=True allows the DataFrames to be concatenated by interleaving their partitions.
- If any divisions are unknown, concatenate the DataFrames, resetting the resulting divisions to unknown (None)
- When axis=1, concatenate DataFrames column-wise:
- Allowed if all divisions are known.
- If any divisions are unknown, a ValueError is raised.
Parameters: - dfs (list) – List of dask.DataFrames to be concatenated
- axis ({0, 1, 'index', 'columns'}, default 0) – The axis to concatenate along
- join ({'inner', 'outer'}, default 'outer') – How to handle indexes on other axis
- interleave_partitions (bool, default False) – Whether to concatenate DataFrames ignoring their order. If True, divisions are concatenated by interleaving partitions.
Examples
If all divisions are known and ordered, divisions are kept.
>>> a
dd.DataFrame<x, divisions=(1, 3, 5)>
>>> b
dd.DataFrame<y, divisions=(6, 8, 10)>
>>> dd.concat([a, b])
dd.DataFrame<concat-..., divisions=(1, 3, 6, 8, 10)>
Unable to concatenate if divisions are not ordered.
>>> a
dd.DataFrame<x, divisions=(1, 3, 5)>
>>> b
dd.DataFrame<y, divisions=(2, 3, 6)>
>>> dd.concat([a, b])
ValueError: All inputs have known divisions which cannot be concatenated
in order. Specify interleave_partitions=True to ignore order
Specify interleave_partitions=True to ignore the division order.
>>> dd.concat([a, b], interleave_partitions=True)
dd.DataFrame<concat-..., divisions=(1, 2, 3, 5, 6)>
If any divisions are unknown, the resulting divisions will be unknown
>>> a
dd.DataFrame<x, divisions=(None, None)>
>>> b
dd.DataFrame<y, divisions=(1, 4, 10)>
>>> dd.concat([a, b])
dd.DataFrame<concat-..., divisions=(None, None, None, None)>
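A concrete, hedged sketch with small illustrative frames (non-overlapping indexes so the known divisions stay ordered):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> a = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3]}, index=[0, 1, 2]), npartitions=1)
>>> b = dd.from_pandas(pd.DataFrame({'x': [4, 5, 6]}, index=[3, 4, 5]), npartitions=1)
>>> dd.concat([a, b]).compute()
   x
0  1
1  2
2  3
3  4
4  5
5  6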
- dask.dataframe.multi.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'), indicator=False, npartitions=None, shuffle=None, max_branch=None)¶ Merge DataFrame objects by performing a database-style join operation by columns or indexes.
If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
Parameters: - left (DataFrame) –
- right (DataFrame) –
- how ({'left', 'right', 'outer', 'inner'}, default 'inner') –
- left: use only keys from left frame (SQL: left outer join)
- right: use only keys from right frame (SQL: right outer join)
- outer: use union of keys from both frames (SQL: full outer join)
- inner: use intersection of keys from both frames (SQL: inner join)
- on (label or list) – Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.
- left_on (label or list, or array-like) – Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns
- right_on (label or list, or array-like) – Field names to join on in right DataFrame or vector/list of vectors per left_on docs
- left_index (boolean, default False) – Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels
- right_index (boolean, default False) – Use the index from the right DataFrame as the join key. Same caveats as left_index
- sort (boolean, default False) – Sort the join keys lexicographically in the result DataFrame
- suffixes (2-length sequence (tuple, list, ...)) – Suffix to apply to overlapping column names in the left and right side, respectively
- copy (boolean, default True) – If False, do not copy data unnecessarily
- indicator (boolean or string, default False) –
If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.
New in version 0.17.0.
Examples
>>> A              >>> B
    lkey value         rkey value
0   foo  1         0   foo  5
1   bar  2         1   bar  6
2   baz  3         2   qux  7
3   foo  4         3   bar  8
>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')
   lkey  value_x rkey  value_y
0  foo   1       foo   5
1  foo   4       foo   5
2  bar   2       bar   6
3  bar   2       bar   8
4  baz   3       NaN   NaN
5  NaN   NaN     qux   7
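A hedged dask-level sketch (lhs and rhs are hypothetical dask DataFrames that share an 'id' column); joining on a column generally triggers a shuffle unless both sides are already indexed by the join key:
>>> joined = dd.merge(lhs, rhs, on='id', how='left')  # lhs, rhs: hypothetical dask DataFrames
>>> joined.compute()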
Returns: merged – The output type will be the same as 'left', if it is a subclass of DataFrame.
Return type: DataFrame
See also
merge_ordered(), merge_asof()