Data transformations
The top bar shows a list of icons to quickly perform common data manipulations. Mosaic transforms the data using Blaze expressions, and each of the buttons inserts a command that can be further customized to fit your needs.
After inserting a command, click Preview to preview the changes you are about to make, then click Apply to perform the transformation. A sequence of tags will appear below the command field. These tags keep track of the transformations, so you can undo a previous command by deleting its tag.
The data in the following table was obtained by performing a by transformation followed by a projection. Both are listed right above the table:

It is also possible to copy the current data into a new dataset: click Commit to open the Commit Expression box, enter a Name and a Description for the new dataset, and click OK.

Now you should see the new dataset listed in the left-hand column:

The dataset in the currently active tab is always denoted by x, but it is also possible to use its label instead. For instance, if you open the Iris dataset, whose default label in Mosaic is iris_csv, the following commands are equivalent:
iris_csv[iris_csv.petal_length > 2]
x[x.petal_length > 2]
Below you will find a full description of the transformations available through the icons on the top bar:
by

By is the basic tool for Split-Apply-Combine operations. It lets you break the table into pieces and perform reductions on each piece.
To show how by works, the following code uses the species column as the categorical variable, computes the average petal width for each species, and counts how many observations each species has.
Clicking the by button displays by(x.name, total=x.amount.sum()) as the base command. We modify this default command to:
by(x.species, mean=x.petal_width.mean(), total=x.species.count())
This produces the following table:
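As a rough illustration of what this by expression computes, here is a minimal plain-Python sketch of the same split-apply-combine steps. It is not Mosaic or Blaze code: the sample rows are made up, and the helper name by_species is hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the Iris table (values are made up).
rows = [
    {"species": "setosa", "petal_width": 0.2},
    {"species": "setosa", "petal_width": 0.4},
    {"species": "virginica", "petal_width": 2.0},
    {"species": "virginica", "petal_width": 2.5},
    {"species": "virginica", "petal_width": 1.5},
]


def by_species(rows):
    """Group rows by species, then reduce each group to a mean and a count."""
    # Split: collect the petal widths of each species.
    groups = defaultdict(list)
    for row in rows:
        groups[row["species"]].append(row["petal_width"])
    # Apply and combine: one summary row per species.
    return [
        {"species": species, "mean": sum(widths) / len(widths), "total": len(widths)}
        for species, widths in sorted(groups.items())
    ]


summary = by_species(rows)
for line in summary:
    print(line)
```

Each output row corresponds to one distinct value of the grouping column, which is the shape of the table the by command returns.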

projection

This option lets you select a subset of the fields from the data. To do this, use double brackets around a list of comma-separated field names. For instance, to select the ‘species’ and ‘sepal_length’ columns, run:
x[['species', 'sepal_length']]
And the corresponding output is a two-column table:

merge

Use this to merge many fields together.
merge(x.species, petal_area=x.petal_length * x.petal_width)
transform

Adds new named columns to an existing table. It takes as arguments the table and the names and data of the new columns:
transform(x, petal_area=x.petal_width * x.petal_length)
The resulting table has a new column called “petal_area”.
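To make the effect concrete, the following plain-Python sketch mimics adding a computed column to every row. It only illustrates the idea under stated assumptions: the sample rows are invented, and this transform helper is a hypothetical stand-in, not the Blaze function used by Mosaic.

```python
# Hypothetical sample rows (values are made up).
rows = [
    {"petal_length": 1.5, "petal_width": 0.5},
    {"petal_length": 4.0, "petal_width": 1.5},
]


def transform(rows, **new_columns):
    """Return copies of the rows with extra columns computed from each row."""
    return [
        {**row, **{name: compute(row) for name, compute in new_columns.items()}}
        for row in rows
    ]


with_area = transform(
    rows, petal_area=lambda row: row["petal_width"] * row["petal_length"]
)
print(with_area)
```

The original rows are left untouched; the new table carries the existing columns plus petal_area.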

relabel

Lets you change the label of each column. The data remains unaltered; only the labels are updated.
x.relabel({'sepal_width': 'width', 'sepal_length': 'length'})
An alternative syntax passes the new labels as keyword arguments:
x.relabel(sepal_width='width', sepal_length='length')
In the resulting table, the new labels “length” and “width” are shown instead of the previous “sepal_length” and “sepal_width” labels.

concat

Stacks tables on common columns. It takes as arguments the collections to concatenate.
For example, if there are two tables with the same fields:


To combine them into a single table, just run the following command:
concat(table1, table2)
An optional third parameter determines the axis to concatenate on. The previous command produces this table:

join

Joins two tables on common columns. The simplest case is when both tables have a common column with unique values. For instance, suppose you have the following tables whose names are table1 and table2 respectively:


The “id” column can be used to join them with the following command:
join(table1, table2, 'id')
In this case there are three arguments: the left table, the right table, and the common column to use. It is possible to use more than one column in this command, just pass the additional columns as extra arguments.
In this example, the command creates a new table combining the previous two:

Additional parameters can be used to select fields from the left and right tables, and a suffix can be set in case of duplicate columns. For more information see the Blaze documentation.
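The sketch below illustrates, in plain Python, what joining two tables on a shared id column means. It assumes an inner join (only ids present in both tables are kept) and unique ids in the right table. The table contents and the inner_join helper are hypothetical and are not part of Mosaic or Blaze.

```python
# Hypothetical tables sharing an "id" column (contents are made up).
table1 = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Carol"},
]
table2 = [
    {"id": 1, "amount": 100},
    {"id": 3, "amount": 300},
    {"id": 4, "amount": 400},
]


def inner_join(left, right, key):
    """Combine rows of left and right that share the same value of key."""
    # Index the right table by key; this assumes the key values are unique.
    right_by_key = {row[key]: row for row in right}
    return [
        {**left_row, **right_by_key[left_row[key]]}
        for left_row in left
        if left_row[key] in right_by_key
    ]


joined = inner_join(table1, table2, "id")
print(joined)
```

Rows whose id appears in only one of the two tables are dropped in this sketch.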
sort

Lets you sort the table by one or several columns. For example, you can use the petal_width column in the “Iris” table to sort the data:
x.sort('petal_width')
You can further sort the data using the petal_length and sepal_width columns. In this case, a list containing the column names is passed as the parameter:
x.sort(['petal_width', 'petal_length', 'sepal_width'])
This produces the following table:

Additionally, there is an optional boolean parameter ascending to determine the order of the sort.
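As an illustration of multi-column sorting, this plain-Python sketch orders rows by the first column and breaks ties with the next one. The sample rows and the sort_rows helper are hypothetical; the ascending flag here only mirrors the parameter described above and is not the Blaze implementation.

```python
# Hypothetical sample rows (values are made up).
rows = [
    {"petal_width": 0.2, "petal_length": 1.5},
    {"petal_width": 0.1, "petal_length": 1.4},
    {"petal_width": 0.2, "petal_length": 1.3},
]


def sort_rows(rows, columns, ascending=True):
    """Sort by the listed columns; later columns break ties in earlier ones."""
    return sorted(
        rows,
        key=lambda row: tuple(row[column] for column in columns),
        reverse=not ascending,
    )


ordered = sort_rows(rows, ["petal_width", "petal_length"])
reversed_order = sort_rows(rows, ["petal_width", "petal_length"], ascending=False)
print(ordered)
print(reversed_order)
```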
select

Filters the elements of an expression based on a predicate or condition. The syntax is simple: insert the condition inside brackets. For example, to keep only the elements whose petal length is greater than 4, run:
x[x.petal_length > 4]
distinct

Scans a particular column and removes from the table the elements that duplicate a value in that column. For example, the following command removes the duplicate values in the petal_length column:
x.distinct('petal_length')
All the entries in that column are unique in the resulting table:

The last two icons in the top bar are drop-down menus with extra commands to perform row-wise and column-wise operations, respectively.
rows

head: Prints the first n elements of the table. The default value for n is 10. To see the first 5 elements of the table, run the following command:
x.head(5)
This creates the following output:
tail: Prints the last n elements of the table. Again, the default value for n is 10. To see the last 7 elements of the table, run the command below:
x.tail(7)
slice: Selects a subset of the data given the start and stop indexes. Optionally, a step parameter can be used. The following command selects 10 table entries, from row 10 up to (but not including) row 30, with a step of 2:
x[10:30:2]
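The count of 10 entries can be checked with an ordinary Python list, assuming the slice follows Python's usual start:stop:step rules (the stop index is excluded). The list of 150 row indexes below is just a hypothetical stand-in for a table.

```python
# Stand-in for a table: one integer per row index (150 rows assumed).
row_indexes = list(range(150))

# Start at row 10, stop before row 30, take every second row.
selected = row_indexes[10:30:2]
print(selected)
print(len(selected))
```

The selected rows are 10, 12, and so on up to 28; row 30 itself is not included.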
cols

coerce: Coerces an expression to a different type. The pane at the bottom of the right column shows the data types of the columns in the current dataset. In the case of the “Iris” table, this is what it shows:
It is possible to coerce the numeric columns to a different type. For instance, to switch the sepal_length column to float16, run:
x.sepal_length.coerce('float16')
This generates the following output:
isin: Checks if a set of values is in the table. The following command tests whether the values [1.3, 1.4, 1.5] are in the petal_length column and selects the corresponding rows:
x[x.petal_length.isin([1.3, 1.4, 1.5])]
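In plain Python, the same kind of membership filter can be sketched as follows. The sample rows are made up, and this only illustrates the row selection, not how Mosaic evaluates the expression.

```python
# Hypothetical sample rows (values are made up).
rows = [
    {"petal_length": 1.4},
    {"petal_length": 4.7},
    {"petal_length": 1.5},
    {"petal_length": 1.3},
    {"petal_length": 5.1},
]

wanted = {1.3, 1.4, 1.5}

# Keep only the rows whose petal_length is one of the wanted values.
matches = [row for row in rows if row["petal_length"] in wanted]
print(matches)
```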
shift: Shifts a column backward or forward by N elements. Negative arguments shift it backward. For example, the following image shows the first 10 entries of the petal_length column:
The next command shifts the column 3 positions forward:
x.petal_length.shift(3)
This is what the new petal_length column looks like:
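For readers without the image, here is a minimal plain-Python sketch of a shift. It assumes that shifting forward moves each value to a later row and that the vacated positions are filled with a missing value (None here); both points are assumptions about the behavior, and the sample values and the shift helper are hypothetical.

```python
def shift(values, n):
    """Shift values by n positions, padding the vacated slots with None.

    Positive n moves values forward (toward later rows); negative n moves
    them backward.
    """
    values = list(values)
    if n == 0:
        return values
    if abs(n) >= len(values):
        return [None] * len(values)
    if n > 0:
        return [None] * n + values[:-n]
    return values[-n:] + [None] * (-n)


petal_length = [1.4, 1.4, 1.3, 1.5, 1.4]  # made-up sample values
forward = shift(petal_length, 3)
backward = shift(petal_length, -2)
print(forward)
print(backward)
```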