User Guide¶
Mosaic UI¶
After installing mosaic, start by activating the mosaic environment.
On OS X or Linux:
source activate mosaic
On Windows:
activate mosaic
Now that the mosaic environment is active, type the following command in your terminal window:
anaconda-mosaic
The default browser will open and show the following window:

Add a new dataset by clicking the “+” icon at the top left, then following the prompts in the Add Dataset dialog box:

As of Mosaic 1.3, you can add Excel spreadsheets in .xls, .xlsx or .csv format.
The Iris dataset is provided as a built-in sample for initial exploration. Most of the examples in this documentation use this dataset to show how each component of Mosaic works.
After a dataset is selected, there are three main components in the Mosaic window:
- The left column displays the list of datasets currently open. It also provides a quick search bar and the + button to import new data.
- The main panel displays the data. You can work on several datasets using embedded tabs. Each tab contains a top bar with several buttons to perform data transformations. Below that bar is a text field for entering Blaze commands: click Preview to see if a command works, Apply to apply it to the dataset, and Commit to create a new dataset from it. Below the command field there are four buttons to explore your data: Table, Plot, List and Stats.
- The right column shows information about the dataset you are currently working on, along with a button to open the dataset in a Jupyter Notebook. Refresh your row count at any time by clicking the “Refresh” icon.

Data transformations¶
The top bar shows a list of icons to quickly perform common data manipulations. Mosaic transforms the data using Blaze expressions, and each of the buttons inserts a command that can be further customized to fit your needs.
After inserting a command, click Preview to preview the changes you are about to make, then click Apply to perform the transformation. A sequence of tags will appear below the command field. These tags keep track of the transformations, so you can undo previous commands by deleting the corresponding tags.
The data in the following table was obtained by performing a by transformation followed by a projection. Both are listed right above the table:

It is also possible to copy the current data into a new dataset. Click Commit to open the Commit Expression box, enter a Name and a Description for the new dataset, and click OK.

The new dataset should now be listed in the left-hand column:

The dataset on the current active tab is always denoted by x, but it is also possible to use the label instead. For instance, if you open the Iris dataset, whose default label on Mosaic is iris_csv, the following commands are equivalent:
iris_csv[iris_csv.petal_length > 2]
x[x.petal_length > 2]
Below you will find a full description of the transformations available through the icons on the top bar:
by¶

By is the basic tool for Split-Apply-Combine operations. It lets you break the table into pieces and perform reductions on each piece.
To show how by works, the following code uses the species column as a categorical variable, computes the average petal width for each species, and counts how many observations there are.
Clicking the by button displays by(x.name, total=x.amount.sum()) as the base command. Modify this default command to:
by(x.species, mean=x.petal_width.mean(), total=x.species.count())
This produces the following table:

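For readers new to split-apply-combine, the following plain-Python sketch mirrors what the by command above computes. It does not use Mosaic or Blaze: the rows are made-up values rather than real Iris measurements, and mean_and_count is a hypothetical helper written only for this illustration.

```python
from collections import defaultdict

# Made-up rows for illustration only (not real Iris measurements).
rows = [
    {"species": "setosa", "petal_width": 0.2},
    {"species": "setosa", "petal_width": 0.4},
    {"species": "virginica", "petal_width": 2.0},
    {"species": "virginica", "petal_width": 2.5},
    {"species": "virginica", "petal_width": 1.5},
]


def mean_and_count(rows, key, value):
    """Split rows by `key`, then reduce each group to a mean and a count."""
    groups = defaultdict(list)
    for row in rows:  # split: collect the values of each group
        groups[row[key]].append(row[value])
    # apply and combine: one summary record per group
    return {
        name: {"mean": sum(vals) / len(vals), "total": len(vals)}
        for name, vals in groups.items()
    }


summary = mean_and_count(rows, "species", "petal_width")
for species, stats in summary.items():
    print(species, stats)
```

Each key of summary plays the role of one row in the grouped table, with mean and total as its columns.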
projection¶

This option lets you select a subset of the fields from the data. To do this, use double brackets around a list of comma-separated field names. For instance, to select the ‘species’ and ‘sepal_length’ columns, run:
x[['species', 'sepal_length']]
And the corresponding output is a two-column table:

merge¶

Use this to merge many fields together.
merge(x.species, petal_area=x.petal_length * x.petal_width)
transform¶

Adds new named columns to an existing table. It takes as arguments the table, followed by the names and data of the new columns:
transform(x, petal_area=x.petal_width * x.petal_length)
The resulting table has a new column called “petal_area”
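The difference between merge and transform is easiest to see side by side. The plain-Python sketch below is an illustration only and rests on an assumption about the underlying Blaze semantics: merge builds a table from just the fields you list, whereas transform keeps every existing column and appends the new ones. The helper names and the sample rows are hypothetical.

```python
# Made-up rows for illustration only (not real Iris measurements).
rows = [
    {"species": "setosa", "petal_length": 1.5, "petal_width": 0.5},
    {"species": "virginica", "petal_length": 4.0, "petal_width": 1.5},
]


def petal_area(row):
    """Computed column shared by both helpers below."""
    return row["petal_width"] * row["petal_length"]


def merge_like(rows):
    """Keep only the listed field (species) plus the computed column."""
    return [{"species": r["species"], "petal_area": petal_area(r)} for r in rows]


def transform_like(rows):
    """Keep every existing column and append the computed column."""
    return [{**r, "petal_area": petal_area(r)} for r in rows]


merged = merge_like(rows)
transformed = transform_like(rows)
print(sorted(merged[0]))       # column names after the merge-style call
print(sorted(transformed[0]))  # column names after the transform-style call
```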

relabel¶

Changes the labels of the selected columns. The data remains unaltered; only the labels are updated.
x.relabel({'sepal_width': 'width', 'sepal_length': 'length'})
There is an alternative syntax for relabeling columns, using keyword arguments:
x.relabel(sepal_width='width', sepal_length='length')
In the resulting table, the new labels “length” and “width” are shown instead of the previous “sepal_length” and “sepal_width” labels.

concat¶

Stacks tables on common columns. It takes as arguments the collections to concatenate.
For example, if there are two tables with the same fields:


To combine them into a single table, just run the following command:
concat(table1, table2)
A third parameter can be passed to determine the axis to concatenate on. The resulting table of the previous command is this:

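As a rough mental model, stacking two tables on common columns appends the rows of the second table to those of the first. The sketch below shows this in plain Python with made-up rows; concat_rows is a hypothetical helper, not part of Mosaic or Blaze, and the column check it performs is an assumption about what "common columns" requires.

```python
def concat_rows(first, second):
    """Stack two tables (lists of dicts) that have the same columns."""
    if first and second and set(first[0]) != set(second[0]):
        raise ValueError("both tables must have the same columns")
    return first + second


# Made-up tables for illustration only.
table1 = [{"name": "Alice", "amount": 100}, {"name": "Bob", "amount": 200}]
table2 = [{"name": "Carol", "amount": 300}]

stacked = concat_rows(table1, table2)
print(len(stacked))  # rows of table1 followed by rows of table2
```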
join¶

Joins two tables on common columns. The simplest case is when both tables have a common column with unique values. For instance, suppose you have the following tables whose names are table1 and table2 respectively:


The “id” column can be used to join them with the following command:
join(table1, table2, 'id')
In this case there are three arguments: the left table, the right table, and the common column to use. It is possible to use more than one column in this command; just pass the additional columns as extra arguments.
In the case of the example, it creates a new table combining the previous two:

Additional parameters can be used to select fields from the left and right tables, and a suffix can be set in case of duplicate columns. For more information see the Blaze documentation.
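To make the matching rule concrete, here is a minimal plain-Python sketch of joining two tables on a shared id column. It assumes an inner join (rows without a match on both sides are dropped) and unique id values in the right table. The tables and the inner_join helper are hypothetical and do not reproduce the tables shown above.

```python
def inner_join(left, right, key):
    """Inner-join two lists of dicts on a shared `key` column."""
    # Index the right table by key; assumes key values are unique there.
    index = {row[key]: row for row in right}
    return [
        {**row, **index[row[key]]}  # combine the fields of both rows
        for row in left
        if row[key] in index        # keep only rows that have a match
    ]


# Made-up tables for illustration only.
table1 = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
table2 = [{"id": 1, "amount": 100}, {"id": 3, "amount": 300}]

joined = inner_join(table1, table2, "id")
print(joined)
```

Only id 1 appears in both tables, so the result holds a single row that carries the name from table1 and the amount from table2.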
sort¶

Sorts the table using one or several columns. For example, you can use the petal_width column in the “Iris” table to sort the data:
x.sort('petal_width')
You can further sort the data using the petal_length and sepal_width columns. In this case, a list containing the column names is passed as the parameter:
x.sort(['petal_width', 'petal_length', 'sepal_width'])
This produces the following table:

Additionally, there is an optional boolean parameter ascending to determine the order of the sort.
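When several columns are given, later columns only break ties left by earlier ones. The plain-Python sketch below illustrates that ordering rule and the effect of an ascending flag. The rows are made up, sort_rows is a hypothetical helper, and the assumption is that ascending=False reverses the whole ordering.

```python
from operator import itemgetter

# Made-up rows for illustration only.
rows = [
    {"petal_width": 0.2, "petal_length": 1.5},
    {"petal_width": 0.1, "petal_length": 1.4},
    {"petal_width": 0.2, "petal_length": 1.3},
]


def sort_rows(rows, columns, ascending=True):
    """Sort by the first column, breaking ties with the following ones."""
    return sorted(rows, key=itemgetter(*columns), reverse=not ascending)


by_width_then_length = sort_rows(rows, ["petal_width", "petal_length"])
descending = sort_rows(rows, ["petal_width", "petal_length"], ascending=False)
print([r["petal_length"] for r in by_width_then_length])
print([r["petal_length"] for r in descending])
```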
select¶

Filters the elements of an expression based on a predicate or condition. The syntax is simple: insert the condition inside brackets. For example, to keep only the elements whose petal length is greater than 4, run:
x[x.petal_length>4]
distinct¶

Removes from the table the rows that hold duplicate values in a particular column. For example, the following command removes the duplicate values in the petal_length column:
x.distinct('petal_length')
All the entries in that column are unique in the resulting table:

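The following plain-Python sketch shows one way to think about column-based deduplication. It keeps the first row seen for each value, which is an assumption: this guide does not say which of the duplicate rows Mosaic retains. The rows and the distinct_on helper are hypothetical.

```python
def distinct_on(rows, column):
    """Keep the first row seen for each distinct value in `column`."""
    seen = set()
    unique = []
    for row in rows:
        if row[column] not in seen:
            seen.add(row[column])
            unique.append(row)
    return unique


# Made-up rows for illustration only.
rows = [
    {"row": 1, "petal_length": 1.4},
    {"row": 2, "petal_length": 1.3},
    {"row": 3, "petal_length": 1.4},  # duplicate petal_length, dropped
]

unique_rows = distinct_on(rows, "petal_length")
print([r["row"] for r in unique_rows])
```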
The last two icons in the top bar are drop-down menus with extra commands to perform row-wise and column-wise operations, respectively.
rows¶

head Prints the first n elements of the table. The default value for n is 10. To see the first 5 elements of the table, the following command will do the trick:
x.head(5)
This creates the following output:
tail Prints the last n elements of the table. Again, the default value for n is 10. To see the last 7 elements of the table run the command below:
x.tail(7)
slice Selects a subset of the data given the start and stop indexes. Optionally, a step parameter can be used. The following command selects 10 table entries, starting at row 10 and stopping before row 30, with a step of 2:
x[10:30:2]
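Python users will recognize these three commands as ordinary sequence slicing. The sketch below uses a plain list of row numbers as a stand-in for a 150-row table such as Iris, to show which rows each command selects. It assumes Mosaic follows Python's convention that the stop index is excluded.

```python
# Stand-in for a 150-row table: row i is represented by the number i.
rows = list(range(150))

first_five = rows[:5]        # comparable to x.head(5)
last_seven = rows[-7:]       # comparable to x.tail(7)
every_other = rows[10:30:2]  # comparable to x[10:30:2]

print(first_five)
print(last_seven)
print(every_other)
```

Because the stop index is excluded, the last row selected by the slice is row 28, which gives 10 rows in total.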
cols¶

coerce Coerce an expression to a different type. The pane at the bottom of the right column shows the data types of the columns in the current dataset. In the case of the “Iris” table this is what it shows:
It is possible to coerce the numeric columns to a different type. For instance, to switch the sepal_length column to float16, run:
x.sepal_length.coerce('float16')
This generates the following output:
isin Checks if a set of values is in the table. The following command tests whether the values [1.3, 1.4, 1.5] are in the petal_length column and selects the corresponding rows:
x[x.petal_length.isin([1.3, 1.4, 1.5])]
shift Shift a column backward or forward by N elements. Negative arguments shift it backward. For example, the following image shows the first 10 entries of the petal_length column:
The next command shifts the column 3 positions forward:
x.petal_length.shift(3)
This is what the new petal_length column looks like:
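In text form, the effect of a shift can be sketched in plain Python as follows. The values are made up, the shift helper is hypothetical, and filling the vacated positions with None is an assumption about how missing entries are represented.

```python
def shift(values, n):
    """Shift a column by n positions; vacated slots are filled with None."""
    values = list(values)
    if n == 0:
        return values
    pad = [None] * min(abs(n), len(values))
    if n > 0:
        return pad + values[:-n]  # forward: the last n entries fall off
    return values[-n:] + pad      # backward: the first |n| entries fall off


# Made-up column values for illustration only.
petal_length = [1.4, 1.4, 1.3, 1.5, 1.4]

forward = shift(petal_length, 3)
backward = shift(petal_length, -2)
print(forward)
print(backward)
```

The column keeps its length in both directions: entries pushed past either end are discarded, and the gaps are left empty.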
Explore your data¶
Mosaic provides tools for quick data visualization, a list of the transformations performed on the current dataset, and per-column statistical measures. Four tabs below the main command field let you switch between these tools.
Table¶
This is the default view and shows a table representing the current dataset. You can select the number of rows to be displayed by clicking the gear icon in the top right corner of the table:

Plot¶
The Plot tab provides 5 different plot types to represent the data: Scatter, Histogram, BoxPlot, TimeSeries and Shader.
- Scatter. This plot uses points to represent data on the plane. It can represent up to four variables by selecting them from the corresponding drop-down menu: X Axis, Y Axis, Color and Size.

- Histogram. This plot shows the frequencies for a set of values. Only one column can be selected for this plot type.

- BoxPlot. Also known as a box and whisker plot, this type displays the distribution of the data, grouping it by a second variable. Two columns can be selected: one for the Label and another for the Values.

- TimeSeries. This plot is suited to displaying data over time intervals. On the X axis, select the column containing the dates; on the Y axis, select the column with the data to plot. The following example shows the Euro-USD exchange rate.

- Shader. This is a special type of scatter plot that makes it easier to visualize large datasets that would otherwise look too cluttered. In this type of plot the points are smaller and an additional shading algorithm is used to improve the appearance. Two columns can be selected for the X and Y axes. For further customization you can select an aggregation type, an aggregation field and a transparency function that determines the shading algorithm to use for the plot.

List¶
The List tab shows a list of transformations performed on the current dataset. Every time you type something in the command field and click Apply, it is recorded in this tab.

Stats¶
In this tab you will see a summary of the data with several statistical measures for each numeric column: min, max, mean and sum:

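The four measures can be reproduced by hand for a quick sanity check. The plain-Python sketch below computes them for every numeric column of a small made-up table. The column_stats helper is hypothetical, and nothing here reflects how Mosaic computes the summary internally.

```python
def column_stats(rows):
    """Return min, max, mean and sum for every numeric column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            # Skip text columns such as species; bool is excluded on purpose.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                columns.setdefault(name, []).append(value)
    return {
        name: {
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
            "sum": sum(values),
        }
        for name, values in columns.items()
    }


# Made-up rows for illustration only.
rows = [
    {"species": "setosa", "petal_length": 1.0},
    {"species": "setosa", "petal_length": 2.0},
    {"species": "virginica", "petal_length": 6.0},
]

stats = column_stats(rows)
print(stats["petal_length"])
```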
Explore your data in a Jupyter notebook¶
Mosaic provides several tools for quick data exploration. If you want to dive deeper and use more advanced tools, you can easily open your dataset in a Jupyter Notebook. Just click the corresponding button on the right-hand pane:

Jupyter Notebook opens in a new tab with some commands that automatically import the current dataset. Click Cell -> Run Cells or use the keyboard shortcut Ctrl + Enter, and you will be prompted to enter your Mosaic password:

After that, you can import any Python library you need to analyze the data.
Importing Datasets¶
Mosaic supports several data file types. As mentioned before, Mosaic uses Blaze as its backend, which provides a single interface for many different data computing systems.
To import a new dataset, click the plus icon next to the filter field on the left-hand pane:

This opens the following:

The Data URI drop-down list lets you select the type of Uniform Resource Identifier (URI). The options are: File path, HDFStore, SQLite, PostgreSQL, MongoDB, Other SQL DB, HTTP, Lux and Blaze Server. Once you select the URI, choose a name to label it. Optionally, a short description can be provided.
Excel files may be imported in .xls, .xlsx or .csv format (1.3+). The following screenshot shows a csv file imported from the local file system:

Optionally, enable custom arguments for every computational backend. In the Add Custom Fields section, click the “+” icon and add one or more extra options, which correspond to key-value pairs. Values must be string, integer, or list. Keys will be parsed as strings. Input examples: 1, “two”, [3, 4, 5].
NOTE: You can edit and add custom arguments by clicking the “Settings” icon at the top right and selecting Edit Dataset Metadata.
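To get a feel for which inputs count as a string, an integer, or a list, the sketch below parses the three example values with Python's ast.literal_eval. This is only an illustration: how Mosaic itself parses custom field values is not described in this guide, and parse_custom_value is a hypothetical helper. Strings must be typed with straight quotes for Python to accept them.

```python
import ast


def parse_custom_value(text):
    """Parse one custom-field value typed as text into a str, int, or list."""
    try:
        value = ast.literal_eval(text.strip())
    except (ValueError, SyntaxError):
        raise ValueError(f"cannot parse {text!r}; strings need quotes") from None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (str, int, list)):
        raise ValueError(f"{text!r} is not a string, integer, or list")
    return value


parsed = [parse_custom_value(s) for s in ("1", '"two"', "[3, 4, 5]")]
print(parsed)
```

Anything else, such as a float or an unquoted word, raises a ValueError in this sketch.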
The new dataset will be displayed on the left-hand column. Click it to open it in a new tab:
