Lux Tutorial and Walkthrough

Lux is a backend for Anaconda Mosaic designed to efficiently handle nested directories of CSV or text files and represent them as a single table. With Anaconda Mosaic and Lux, we can run queries on directories of flat files as if they were a single dataset, extracting the data we need and transforming it as needed.

This tutorial will guide you through configuring and loading daily stock data with Lux. Once the data are loaded, we will then show you how to explore and transform them in Anaconda Mosaic.

For this example, the daily stock data we will use is provided by Stooq. Specifically, we are using the daily US data under the ASCII column, which, when downloaded, comes packaged as a zip archive of nested directories of CSV or text files. These data were selected as typical financial market data, and are structured in a way that makes them inconvenient to work with.

After downloading and unzipping, we see the following structure:

data
└── daily
    └── us
        ├── nasdaq etfs
        ├── nasdaq stocks
        │   ├── 1
        │   └── 2
        ├── nyse etfs
        ├── nyse stocks
        │   ├── 1
        │   └── 2
        ├── nysemkt etfs
        └── nysemkt stocks

Inside the nasdaq stocks/1 and nasdaq stocks/2 folders are several text files, one for each symbol, with daily bars. For example, here’s a sample from nasdaq stocks/1/aapl.us.txt:

$ head -5 data/daily/us/nasdaq\ stocks/1/aapl.us.txt
Date,Open,High,Low,Close,Volume,OpenInt
19840907,0.4379,0.4432,0.4326,0.4379,22476461,0
19840910,0.4379,0.43923,0.42735,0.43527,17445402,0
19840911,0.43923,0.45114,0.43923,0.4432,41137289,0
19840912,0.4432,0.44584,0.42995,0.42995,35936930,0

$ tail -5 data/daily/us/nasdaq\ stocks/1/aapl.us.txt
20160401,108.78,110,108.2,109.99,25114568,0
20160404,110.42,112.19,110.27,111.12,34791172,0
20160405,109.51,110.73,109.42,109.81,24789296,0
20160406,110.23,110.98,109.2,110.96,25152497,0
20160407,109.95,110.42,108.121,108.54,29499200,0

We see that the data for Apple spans more than 30 years, and has about 8000 rows. Other symbols may have fewer rows.

Several aspects of these datasets make them inconvenient to work with, and must be handled by any process that wants to collect these data into a single table:

  • The symbol name does not appear inside the file itself, so we must stitch the symbol name together with its data ourselves.
  • The vendor splits the data into two subdirectories named 1 and 2 to keep the number of files per directory within limits.
  • Altogether, the complete symbol set from nasdaq stocks and nyse stocks comprises 6477 separate files. We want to treat this as one dataset.
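To make the first point concrete, here is a small standard-library sketch of recovering the symbol from a file path. The example paths come from the directory listing above; the stitching itself is exactly the kind of bookkeeping Lux automates for us.

```python
from pathlib import PurePosixPath

# Example relative paths beneath "nasdaq stocks", as shown in the tree above.
paths = ["1/aapl.us.txt", "2/zyne.us.txt"]

# The symbol appears only in the filename, so it must be recovered from the
# path and attached to every row when stitching the files into one table.
for p in paths:
    symbol = PurePosixPath(p).name.removesuffix(".us.txt")
    print(symbol)  # -> aapl, then zyne
```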

We first need to describe these datasets to Lux so it knows how to load our 6477 files. To describe a dataset to Lux, three core things are required:

  • the root directory of the dataset to load,
  • an extractor template string to extract meaningful data from the filename and directory path to include in the final table, and
  • a name to identify the dataset in Anaconda Mosaic.

Extractor Strings

The Lux extractor string is modeled on Python's format string syntax. It uses the Parse package, which deviates from stock format strings in a few ways. A format specification in Python follows this pattern:

[fill][align][0][width][.precision][type]

The differences between parse string specifications and standard format string specifications are:

  • The align operators will cause spaces (or the specified fill character) to be stripped from the parsed value. The width is not enforced; it just indicates there may be whitespace or 0s to strip.
  • Numeric parsing will automatically handle a 0b, 0o or 0x prefix. That is, the # format character is handled automatically by the d, b, o and x formats.
  • For d any prefix will be accepted, but for the other formats the correct prefix must be present, if a prefix appears at all.
  • Numeric sign is handled automatically.
  • The thousands separator is handled automatically if the “n” type is used.
  • The types supported are a slightly different mix from the format() types. Some format() types come over directly: d, n, %, f, e, b, o and x. In addition, some regular expression character group types (D, w, W, s and S) are also available.
  • The e and g types are case-insensitive, so there is no need for the E or G types.
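The numeric rules above resemble a behavior already built into Python: int() with base 0 also auto-detects the 0b/0o/0x prefixes and the sign. This is only an analogy for illustration, not the Parse package itself.

```python
# Python's int() with base 0 auto-detects 0b/0o/0x prefixes and the sign,
# much like the Parse package's numeric handling described above.
for text in ["42", "+42", "0x2a", "0o52", "0b101010"]:
    print(int(text, 0))  # every line prints 42
```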

In our example, the extractor string is:

{}/{Symbol}.us.txt

The first empty set of braces ({}) matches the 1 or 2 directories. Because the braces are empty, no name is associated with these directories, and they are not included in the resulting table. We have to include the empty braces in the extractor string to ensure the extractor matches these directories and stitches together all the files into one logical table.

The /{Symbol}.us.txt part of the extractor string will match filenames like aapl.us.txt, zyne.us.txt, etc. All contents after the / directory separator and before the .us.txt suffix will be extracted and used as the Symbol field. Because this field has a name, it is included in the resulting table.
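For illustration, the extractor behaves roughly like the following regular expression. This is a standard-library sketch of our own; Lux itself uses the Parse package, and the regex here is only an approximation of the template {}/{Symbol}.us.txt.

```python
import re

# Rough regex equivalent of the extractor string {}/{Symbol}.us.txt:
# one anonymous directory segment, then a named Symbol capture before
# the .us.txt suffix.
pattern = re.compile(r"^[^/]+/(?P<Symbol>[^/]+)\.us\.txt$")

for path in ["1/aapl.us.txt", "2/zyne.us.txt"]:
    match = pattern.match(path)
    print(match.group("Symbol"))  # -> aapl, then zyne
```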

Loading Lux datasets in Anaconda Mosaic

If Anaconda Mosaic is not running, in your terminal or command window, navigate to the directory where stooq data is located, and run anaconda-mosaic as follows:

anaconda-mosaic

After initializing, Anaconda Mosaic will open a browser window. Log in to Anaconda Mosaic if you are not logged in already.

We will now add our two Lux data sets to Anaconda Mosaic. Click the + icon on the upper left hand side. In the “Add Dataset” dialogue that comes up, select “Lux” for the Data URI.

Modify the Data URI to read as follows:

lux://data/daily/us/nasdaq stocks

For the “Name” field, fill in stooq_nasdaq_lux. The “Description” field is optional.

For the “Extractor” field, enter:

{}/{Symbol}.us.txt

Click “OK”.

You should see a new stooq_nasdaq_lux dataset added to the left hand side. Selecting it will provide a preview of the full dataset, which may take several seconds to load.

We can repeat this process for the stooq_nyse_lux dataset. The Data URI for that dataset in the Add Dataset dialogue is:

lux://data/daily/us/nyse stocks

and we name it stooq_nyse_lux in the “Name” field.

For the “Extractor” field, enter:

{}/{Symbol}.us.txt

Click “OK”.

Querying Lux datasets

We can now select records from our datasets and perform other transformations with Anaconda Mosaic.

If we are interested in large volume stocks, for example, we can select that data using the select operation. Select the stooq_nasdaq_lux dataset from the left hand side, and click on select from the expression builder. Fill in the expression template as follows:

x[x.Volume > 10**6]

After previewing or applying this expression, we see the results of the selection in the table view.

Anaconda Mosaic coordinates all the file loading, parsing, and behind-the-scenes logic for you to compute this operation.
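Outside Mosaic, the same selection can be sketched with the standard library on a tiny in-memory sample in the Stooq format shown earlier. The 20160408 row below is invented purely to give the filter something to drop.

```python
import csv
import io

# A tiny in-memory sample in the Stooq file format; the last row is a
# made-up low-volume row so the filter has something to exclude.
sample = """Date,Open,High,Low,Close,Volume,OpenInt
20160406,110.23,110.98,109.2,110.96,25152497,0
20160407,109.95,110.42,108.121,108.54,29499200,0
20160408,108.00,108.50,107.50,108.10,500000,0
"""

# Equivalent of the Mosaic expression x[x.Volume > 10**6]: keep only
# rows whose Volume exceeds one million.
rows = [r for r in csv.DictReader(io.StringIO(sample))
        if int(r["Volume"]) > 10**6]
print([r["Date"] for r in rows])  # -> ['20160406', '20160407']
```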

If you haven’t already, click “Apply” to store this selection in the breadcrumb, allowing us to build more complex queries on top of it.

We can add other operations on top of this one; for instance, we can group these large volume records by the Date column and compute the maximum volume traded on that date.

Click on the by expression, and edit the expression template to read:

by(x.Date, max_volume=x.Volume.max())

Selecting preview or apply will compute this grouping operation on the result of the large volume selection.
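The grouping can likewise be sketched in plain Python. The rows below are illustrative (Date, Volume) pairs standing in for the output of the previous large-volume selection.

```python
from collections import defaultdict

# Illustrative (Date, Volume) pairs, standing in for the selection output.
rows = [
    ("20160406", 25152497),
    ("20160406", 1200000),
    ("20160407", 29499200),
]

# Equivalent of by(x.Date, max_volume=x.Volume.max()): group by Date and
# keep the maximum Volume seen on each date.
max_volume = defaultdict(int)
for date, volume in rows:
    max_volume[date] = max(max_volume[date], volume)

print(dict(max_volume))  # -> {'20160406': 25152497, '20160407': 29499200}
```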

We can “Commit” the result of this large-volume grouping, which stores the expression under a meaningful name so we can easily return to it later for inspection.

Summary

This is just a taste of the sorts of computations that Anaconda Mosaic and Lux enable. With a few steps, we are able to describe our potentially large repository of flat files and load them into unified logical datasets. We can then begin querying and computing on top of these data, to explore and analyze our data, without the tedious and error-prone steps often required to manage nested directories of flat files.