User Guide

Mosaic UI

After installing Mosaic, make sure that the mosaic environment is active and type the following command in your terminal:

anaconda-mosaic

The default browser will open and show the following window:

../../_images/mosaic_intro1.png

The Iris dataset is provided as a built-in sample for initial exploration. Most of the examples in this documentation use this dataset to show how each component of Mosaic works.

Once a dataset is selected there are three main components in the Mosaic window:

  • The left column displays a list of the datasets currently open. It also provides a quick search bar and the + button to import new data.
  • The main panel displays the data. You can work on several datasets using embedded tabs. Each tab contains a top bar with several buttons to perform data transformations. Below that bar is a text field for entering Blaze commands, followed by three buttons: Preview shows whether the command works, Apply applies the command to the dataset, and Commit creates a new dataset from the command. Below the command field there are four buttons to explore your data: Table, Plot, List and Stats.
  • The right column shows some information about the dataset you are currently working on. It also shows a button to open the dataset in a Jupyter Notebook.
../../_images/mosaic_intro2.png

Data transformations

The top bar shows a list of icons to quickly perform common data manipulations. Mosaic transforms the data using Blaze expressions, and each of the buttons inserts a command that can be further customized to fit your needs.

After inserting a command, click Preview to preview the changes you are about to make, then click Apply to perform the transformation. A sequence of tags will appear below the command field. These tags keep track of the transformations, so you can undo your previous commands by just deleting the tags.

The data in the following table was obtained by performing a by transformation followed by a projection. You can see them listed right above the table:

../../_images/mosaic_intro3.png

It is also possible to copy the current data into a new dataset. Click Commit to open the Commit Expression box, enter a Name and a Description for the new dataset, and click OK.

../../_images/mosaic_intro4.png

Now you should see the new dataset listed in the left-hand column:

../../_images/mosaic_intro5.png

The dataset in the currently active tab is always denoted by x, but it is also possible to use its label instead. For instance, if you open the Iris dataset, whose default label in Mosaic is iris_csv, the following commands are equivalent:

iris_csv[iris_csv.petal_length > 2]

x[x.petal_length > 2]

Below you will find a full description of the transformations available through the icons on the top bar:

by

../../_images/mosaic_by.png

By is the basic tool for Split-Apply-Combine operations. It lets you break the table into pieces and perform reductions on each piece.

To show how by works, the following command uses the species column as the categorical variable, computes the average petal width for each species, and counts how many observations each species has.

Clicking the by button displays by(x.name, total=x.amount.sum()) as the base command. We modify this default command to:

by(x.species, mean=x.petal_width.mean(), total=x.species.count())

Which produces the following table:

../../_images/mosaic-userguide1.png
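To make the split, apply and combine steps concrete, here is a minimal plain-Python sketch of what the by command above computes. It does not use Blaze and is not how Mosaic implements the operation. The rows are a few illustrative records shaped like the Iris data, and summarize_by_species is a hypothetical helper name.

```python
from collections import defaultdict

# Illustrative rows shaped like the Iris data (not the full dataset).
rows = [
    {"species": "setosa", "petal_width": 0.2},
    {"species": "setosa", "petal_width": 0.4},
    {"species": "versicolor", "petal_width": 1.4},
    {"species": "versicolor", "petal_width": 1.5},
    {"species": "virginica", "petal_width": 2.5},
]


def summarize_by_species(records):
    """Group records by species, then reduce each group to a mean and a count."""
    groups = defaultdict(list)
    for record in records:  # split: one list of widths per species
        groups[record["species"]].append(record["petal_width"])
    # apply a reduction to each piece and combine the results into one mapping
    return {
        species: {"mean": sum(widths) / len(widths), "total": len(widths)}
        for species, widths in groups.items()
    }


summary = summarize_by_species(rows)
for species, stats in summary.items():
    print(species, stats)
```

Each key of summary corresponds to one row of the table that by produces, with one entry per named reduction.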

projection

../../_images/mosaic_projection.png

This option lets you select a subset of the fields from the data. To do this, use double brackets and a list of comma-separated field names. For instance, to select the ‘species’ and ‘sepal_length’ columns, just run:

x[['species', 'sepal_length']]

And the corresponding output is a two-column table:

../../_images/mosaic-userguide2.png
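As a rough illustration of what a projection does, the sketch below keeps only the requested fields of each record. It is plain Python rather than Blaze; the rows are illustrative and project is a hypothetical helper name.

```python
# Illustrative rows shaped like the Iris data (not the full dataset).
rows = [
    {"sepal_length": 5.1, "sepal_width": 3.5, "species": "setosa"},
    {"sepal_length": 7.0, "sepal_width": 3.2, "species": "versicolor"},
]


def project(records, fields):
    """Return new records holding only the requested fields, in the given order."""
    return [{name: record[name] for name in fields} for record in records]


projected = project(rows, ["species", "sepal_length"])
print(projected)
```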

merge

../../_images/mosaic_merge.png

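Use this to merge many fields together into a single table. For example, the following command combines the species column with a computed petal_area column: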

merge(x.species, petal_area=x.petal_length * x.petal_width)

transform

../../_images/mosaic_transform.png

Adds new named columns to an existing table. It takes as arguments the table, followed by the names of the new columns and the data for each one:

transform(x, petal_area=x.petal_width * x.petal_length)

The resulting table has a new column called “petal_area”:

../../_images/mosaic-userguide3.png
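The merge and transform examples above compute the same petal_area values but differ in which columns they keep. The following plain-Python sketch contrasts the two results. It is only an illustration of the resulting columns, not Blaze code; the rows are illustrative and petal_area is a hypothetical helper.

```python
# Illustrative rows shaped like the Iris data (not the full dataset).
rows = [
    {"species": "setosa", "petal_length": 1.4, "petal_width": 0.2},
    {"species": "virginica", "petal_length": 6.0, "petal_width": 2.5},
]


def petal_area(record):
    """Area approximated as length times width, as in the commands above."""
    return record["petal_length"] * record["petal_width"]


# merge-like result: only the listed field plus the computed one.
merged = [{"species": r["species"], "petal_area": petal_area(r)} for r in rows]

# transform-like result: every existing column plus the computed one.
transformed = [{**r, "petal_area": petal_area(r)} for r in rows]

print(sorted(merged[0]))
print(sorted(transformed[0]))
```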

relabel

../../_images/mosaic_relabel.png

Lets you change the labels of the columns. The data remains unaltered; only the labels are updated.

x.relabel({'sepal_width':  'width', 'sepal_length': 'length'})

There’s an alternative syntax to relabel the columns, using keyword arguments:

x.relabel(sepal_width='width', sepal_length='length')

In the resulting table, instead of the previous “sepal_length” and “sepal_width” labels the new labels “length” and “width” are shown.

../../_images/mosaic-userguide4.png
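The sketch below mimics the effect of relabel in plain Python: keys are renamed according to a mapping while values stay untouched. It is an illustration only; the row is illustrative and this relabel function is a hypothetical stand-in, not the Blaze method.

```python
# One illustrative row shaped like the Iris data.
rows = [{"sepal_length": 5.1, "sepal_width": 3.5, "species": "setosa"}]
labels = {"sepal_width": "width", "sepal_length": "length"}


def relabel(records, mapping):
    """Rename keys found in mapping; other keys and all values are kept as-is."""
    return [
        {mapping.get(name, name): value for name, value in record.items()}
        for record in records
    ]


relabeled = relabel(rows, labels)
print(relabeled)
```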

concat

../../_images/mosaic_concat.png

Stacks tables on common columns. It takes as arguments the collections to concatenate.

For example, if there are two tables with the same fields:

../../_images/mosaic-userguide5.png ../../_images/mosaic-userguide6.png

To combine them into a single table just run the following command:

concat(table1, table2)

A third parameter can be passed to determine the axis to concatenate on. The resulting table of the previous command is this:

../../_images/mosaic-userguide7.png
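Conceptually, stacking two tables that share the same fields appends the rows of the second table to the first. Here is a minimal plain-Python sketch of that idea. The table contents are invented for illustration, and concat_rows is a hypothetical helper, not the Blaze function.

```python
# Invented example tables with identical fields.
table1 = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
table2 = [{"id": 3, "name": "Carol"}]


def concat_rows(left, right):
    """Stack two tables row-wise after checking that their fields match."""
    if left and right and set(left[0]) != set(right[0]):
        raise ValueError("tables must share the same fields")
    return left + right


stacked = concat_rows(table1, table2)
print(stacked)
```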

join

../../_images/mosaic_join.png

Joins two tables on common columns. The simplest case is when both tables have a common column with unique values. For instance, suppose you have the following tables whose names are table1 and table2 respectively:

../../_images/mosaic-userguide08.png ../../_images/mosaic-userguide09.png

The “id” column can be used to join them with the following command:

join(table1, table2, 'id')

In this case there are three arguments: the left table, the right table, and the common column to use. It is possible to use more than one column in this command; just pass the additional columns as extra arguments.

In the case of the example, it creates a new table combining the previous two:

../../_images/mosaic-userguide10.png

Additional parameters can be used to select fields from the left and right tables, and a suffix can be set in case of duplicate columns. For more information see the Blaze documentation.
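To illustrate the simple case described above, the following plain-Python sketch joins two tables on a column with unique values. It assumes an inner join, where rows without a match are dropped. The table contents are invented, and inner_join is a hypothetical helper rather than the Blaze function.

```python
# Invented example tables sharing an "id" column with unique values.
table1 = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
table2 = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 200},
    {"id": 3, "amount": 300},
]


def inner_join(left, right, key):
    """Combine rows from both tables that have the same value in `key`."""
    # Index the right table by key; assumes key values are unique there.
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


joined = inner_join(table1, table2, "id")
print(joined)
```

The row with id 3 has no partner in table1, so it does not appear in the result.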

sort

../../_images/mosaic_sort.png

Lets you sort the table using one or several columns. For example, you can use the petal_width column in the “Iris” table to sort the data:

x.sort('petal_width')

You can further sort the data using the petal_length and sepal_width columns. In this case a list containing the column names is passed as the parameter:

x.sort(['petal_width', 'petal_length', 'sepal_width'])

Which produces the following table:

../../_images/mosaic-userguide11.png

Additionally, there is an optional boolean parameter ascending to determine the order of the sort.
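Sorting on several columns means that later columns only break ties left by earlier ones. The plain-Python sketch below shows this ordering, with reverse=True standing in for a descending sort. It is an illustration with a few illustrative rows, not Blaze code.

```python
from operator import itemgetter

# Illustrative rows shaped like the Iris data (not the full dataset).
rows = [
    {"petal_width": 0.2, "petal_length": 1.4, "sepal_width": 3.5},
    {"petal_width": 0.2, "petal_length": 1.3, "sepal_width": 3.2},
    {"petal_width": 1.4, "petal_length": 4.7, "sepal_width": 3.2},
    {"petal_width": 0.2, "petal_length": 1.4, "sepal_width": 3.0},
]

keys = ["petal_width", "petal_length", "sepal_width"]

# Ascending: compare petal_width first, then petal_length, then sepal_width.
ordered = sorted(rows, key=itemgetter(*keys))

# Descending: the same comparison, reversed.
descending = sorted(rows, key=itemgetter(*keys), reverse=True)

for row in ordered:
    print(row)
```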

select

../../_images/mosaic_select.png

Filters the elements of an expression based on a predicate or condition. The syntax is simple: just insert the condition inside brackets. For example, to keep only the rows whose petal length is greater than 4, run:

x[x.petal_length > 4]
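The selection keeps the rows for which the condition is true and drops the rest. A minimal plain-Python equivalent, using a few illustrative rows rather than Blaze, might look like this:

```python
# Illustrative rows shaped like the Iris data (not the full dataset).
rows = [
    {"species": "setosa", "petal_length": 1.4},
    {"species": "versicolor", "petal_length": 4.7},
    {"species": "versicolor", "petal_length": 4.0},
    {"species": "virginica", "petal_length": 6.0},
]

# Keep only the rows whose petal length is strictly greater than 4.
selected = [row for row in rows if row["petal_length"] > 4]

print(selected)
```

Note that the row with a petal length of exactly 4.0 is excluded because the comparison is strict.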

distinct

../../_images/mosaic_distinct.png

This command scans a particular column and removes from the table the rows whose value in that column is a duplicate. For example, the following command removes the duplicate values in the petal_length column:

x.distinct('petal_length')

All the entries in that column are unique in the resulting table:

../../_images/mosaic-userguide12.png
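The following plain-Python sketch shows one way to obtain a table with unique values in a column. It keeps the first row seen for each value, which is an assumption made for this illustration; the rows are illustrative and distinct_on is a hypothetical helper.

```python
# Illustrative rows shaped like the Iris data (not the full dataset).
rows = [
    {"species": "setosa", "petal_length": 1.4},
    {"species": "setosa", "petal_length": 1.4},
    {"species": "setosa", "petal_length": 1.3},
    {"species": "versicolor", "petal_length": 4.7},
]


def distinct_on(records, field):
    """Keep the first record for each distinct value of `field`."""
    seen = set()
    unique = []
    for record in records:
        if record[field] not in seen:
            seen.add(record[field])
            unique.append(record)
    return unique


unique_rows = distinct_on(rows, "petal_length")
print(unique_rows)
```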

The last two icons in the top bar are drop-down menus with extra commands to perform row-wise and column-wise operations, respectively.

rows

../../_images/mosaic_rows.png
  • head Prints the first n elements of the table. The default value for n is 10. To see the first 5 elements of the table, the following command will do the trick:

    x.head(5)
    

    This creates the following output:

    ../../_images/mosaic-userguide13.png
  • tail Prints the last n elements of the table. Again, the default value for n is 10. To see the last 7 elements of the table run the command below:

    x.tail(7)
    
  • slice Selects a subset of the data given the start and stop indexes. Optionally, a step parameter can be used. The following command selects 10 table entries, from row 10 up to row 30, with a step of 2:

    x[10:30:2]
    
    ../../_images/mosaic-userguide14.png
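The three row operations above behave much like slicing a Python list. The sketch below uses a list of 150 row numbers as a stand-in for a table, which is an assumption made purely for illustration.

```python
# Stand-in for a table: 150 row numbers, 0 through 149.
rows = list(range(150))

head = rows[:5]          # like x.head(5): the first 5 entries
tail = rows[-7:]         # like x.tail(7): the last 7 entries
sliced = rows[10:30:2]   # like x[10:30:2]: rows 10, 12, ..., 28

print(head)
print(tail)
print(sliced)
```

The stop index is exclusive, so the slice holds 10 entries and row 30 itself is not included.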

cols

../../_images/mosaic_cols.png
  • coerce Coerces an expression to a different type. The pane at the bottom of the right column shows the data types of the columns in the current dataset. In the case of the “Iris” table this is what it shows:

    ../../_images/mosaic-userguide15.png

    It is possible to coerce the numeric columns to a different type. For instance, to switch the sepal_length column to float16, just run:

    x.sepal_length.coerce('float16')
    

    Which generates the following output:

    ../../_images/mosaic-userguide16.png
  • isin Checks whether the values of a column belong to a given set of values. The following command tests which values of the petal_length column are in [1.3, 1.4, 1.5] and selects the corresponding rows:

    x[x.petal_length.isin([1.3, 1.4, 1.5])]
    
    ../../_images/mosaic-userguide17.png
  • shift Shifts a column backward or forward by N elements. Negative arguments shift it backward. For example, the following image shows the first 10 entries of the petal_length column:

    ../../_images/mosaic-userguide18.png

    The next command shifts the column 3 positions forward:

    x.petal_length.shift(3)
    

    This is what the new petal_length column looks like:

    ../../_images/mosaic-userguide19.png
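The isin and shift operations can be sketched in plain Python as follows. This is an illustration only: the values are illustrative, shift is a hypothetical helper, and filling the vacated positions with None is an assumption about how missing entries are represented.

```python
# Illustrative petal_length values (not the full column).
values = [1.4, 1.4, 1.3, 1.5, 1.7]

# isin-like filter: keep the values that belong to the wanted set.
wanted = {1.3, 1.4, 1.5}
matches = [value for value in values if value in wanted]


def shift(column, n):
    """Shift a column forward (n > 0) or backward (n < 0), padding with None."""
    size = len(column)
    if n >= 0:
        return [None] * min(n, size) + column[: max(size - n, 0)]
    return column[-n:] + [None] * min(-n, size)


forward = shift(values, 3)
backward = shift(values, -2)

print(matches)
print(forward)
print(backward)
```

The shifted column always has the same length as the original, so it can still be lined up with the other columns of the table.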

Explore your data

Mosaic provides tools for quick data visualization, a list of the transformations performed on the current dataset, and per-column statistical measures. Four tabs below the main command field allow you to switch between these tools.

Table

This is the default view and shows a table representing the current dataset. You can select the number of rows to be displayed by clicking the gear icon in the top right corner of the table:

../../_images/mosaic-userguide19-1.png

Plot

The Plot tab provides 5 different plot types to represent the data: Scatter, Histogram, BoxPlot, TimeSeries and Shader.

  • Scatter. This plot uses points to represent data on the plane. It can represent up to four variables by selecting them from the corresponding drop-down menu: X Axis, Y Axis, Color and Size.
../../_images/mosaic-userguide20.png
  • Histogram. This plot shows the frequencies for a set of values. Only one column can be selected for this plot type.
../../_images/mosaic-userguide21.png
  • BoxPlot. Also known as a box and whisker plot, this type displays the distribution of the data, grouping it by a second variable. Two columns can be selected: one for the Label and another for the Values.
../../_images/mosaic-userguide22.png
  • TimeSeries. This plot is suited to displaying data over time intervals. For the X axis, select the column containing the dates; for the Y axis, select the column with the data to plot. The following example shows the Euro-USD exchange rate.
../../_images/mosaic-userguide23.png
  • Shader. This is a special type of scatter plot that makes it easier to visualize large datasets that would otherwise look too cluttered. In this type of plot the points are smaller and an additional shading algorithm is used to improve the appearance. Two columns can be selected for the X and Y axes. For further customization you can select an aggregation type, an aggregation field and a transparency function that determines the shading algorithm to use for the plot.
../../_images/mosaic-userguide24.png

List

The List tab shows a list of the transformations performed on the current dataset. Every time you type something in the command field and click Apply, it is recorded in this tab.

../../_images/mosaic-userguide25.png

Stats

In this tab you will see a summary of the data with several statistical measures for each numeric column: min, max, mean and sum:

../../_images/mosaic-userguide25-2.png
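For reference, the four measures listed above can be computed per numeric column as in the following plain-Python sketch. It is an illustration, not Mosaic's implementation; the rows are illustrative and column_stats is a hypothetical helper.

```python
from statistics import mean

# Illustrative rows shaped like the Iris data (not the full dataset).
rows = [
    {"species": "setosa", "petal_length": 1.4, "petal_width": 0.2},
    {"species": "versicolor", "petal_length": 4.7, "petal_width": 1.4},
    {"species": "virginica", "petal_length": 6.0, "petal_width": 2.5},
]


def column_stats(records):
    """Compute min, max, mean and sum for every numeric column."""
    numeric = [
        name for name, value in records[0].items()
        if isinstance(value, (int, float))
    ]
    stats = {}
    for name in numeric:
        values = [record[name] for record in records]
        stats[name] = {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "sum": sum(values),
        }
    return stats


stats = column_stats(rows)
for name, measures in stats.items():
    print(name, measures)
```

The species column is skipped because its values are text rather than numbers.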

Explore your data in a Jupyter notebook

Mosaic provides several tools for quick data exploration. If you want to dive deeper and use more advanced tools, you can easily open your dataset in a Jupyter Notebook by clicking the corresponding button on the right-hand pane:

../../_images/mosaic-userguide26.png

Jupyter Notebook opens in a new tab with some commands that will automatically import the current dataset. Click Cell -> Run Cells or use the keyboard shortcut Ctrl + Enter, and you will be prompted to enter your Mosaic password:

../../_images/mosaic-userguide27.png

After that you can import any Python library you need to analyze the data.

Importing Datasets

Mosaic supports several data file types. As mentioned before, Mosaic uses Blaze as its backend, which provides a single interface for many different data computing systems.

To import a new dataset, click the plus icon next to the filter field on the left-hand pane:

../../_images/mosaic-userguide28.png

This opens the following dialog:

../../_images/mosaic-userguide29.png

The Data URI drop-down list lets you select the type of Uniform Resource Identifier (URI). The options are: File path, HDFStore, SQLite, PostgreSQL, MongoDB, Other SQL DB, HTTP, Lux and Blaze Server. Once you select the URI, choose a name to label it. Optionally, a short description can be provided.

The following screenshot shows a CSV file imported from the local file system:

../../_images/mosaic-userguide30.png

The new dataset will be displayed on the left-hand column. Click it to open it in a new tab:

../../_images/mosaic-userguide31.png