Lux Tutorial and Walkthrough

Lux is a backend for Blaze designed to handle nested directories of CSV files and represent them as a single table. With Lux and Blaze combined, we can run queries on directories of flat files as if they were a single dataset, efficiently extracting the data we need and transforming it as required.

This tutorial will guide you through configuring and loading daily stock data with Lux. Once the data are loaded, we will show you how to explore and transform them in Blaze.

For this example, the daily stock data we will use is provided by Stooq. Specifically, we are using the daily US data under the ASCII column, which, when downloaded, comes packaged as a zip archive of nested directories of CSV files. These data were selected because they are typical financial market data, and because they are structured in a way that makes them inconvenient to work with.

After downloading and unzipping, we see the following structure:

data
└── daily
    └── us
        ├── nasdaq etfs
        ├── nasdaq stocks
        │   ├── 1
        │   └── 2
        ├── nyse etfs
        ├── nyse stocks
        │   ├── 1
        │   └── 2
        ├── nysemkt etfs
        └── nysemkt stocks

Inside the nasdaq stocks/1 and nasdaq stocks/2 folders are many CSV files, one per symbol, each containing daily bars. For example, here’s a sample from nasdaq stocks/1/aapl.us.txt:

$ head -5 data/daily/us/nasdaq\ stocks/1/aapl.us.txt
Date,Open,High,Low,Close,Volume,OpenInt
19840907,0.4379,0.4432,0.4326,0.4379,22476461,0
19840910,0.4379,0.43923,0.42735,0.43527,17445402,0
19840911,0.43923,0.45114,0.43923,0.4432,41137289,0
19840912,0.4432,0.44584,0.42995,0.42995,35936930,0

$ tail -5 data/daily/us/nasdaq\ stocks/1/aapl.us.txt
20160401,108.78,110,108.2,109.99,25114568,0
20160404,110.42,112.19,110.27,111.12,34791172,0
20160405,109.51,110.73,109.42,109.81,24789296,0
20160406,110.23,110.98,109.2,110.96,25152497,0
20160407,109.95,110.42,108.121,108.54,29499200,0

We see that the data for Apple span more than 30 years and comprise about 8000 rows. Other symbols may have fewer rows.

Several aspects of these datasets make them inconvenient to work with, and must be handled by any process that wants to collect them into a single table:

  • The symbol name does not appear inside the file itself, so we must stitch the symbol name together with its data ourselves.
  • The vendor splits the data into two subdirectories, named 1 and 2, to keep the number of files per directory within limits.
  • Altogether, the complete symbol set from nasdaq stocks and nyse stocks comprises 6477 separate files, which we want to treat as one dataset.
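To make the inconvenience concrete, here is a minimal sketch of the manual stitching Lux automates, using only the Python standard library. The function name load_stocks is our own invention for illustration, not part of Lux:

```python
import csv
from pathlib import Path

def load_stocks(root):
    """Collect every <subdir>/<symbol>.us.txt under root into one list of
    dict rows, adding the symbol parsed from each filename to its rows."""
    rows = []
    for path in sorted(Path(root).glob("*/*.us.txt")):
        symbol = path.name[:-len(".us.txt")]  # "aapl.us.txt" -> "aapl"
        with path.open(newline="") as f:
            for rec in csv.DictReader(f):
                rec["Symbol"] = symbol
                rows.append(rec)
    return rows
```

Even this sketch must hard-code the directory layout and the .us.txt naming convention; as we will see, Lux captures both in a short extractor string instead.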

We first need to describe these datasets to Lux in a YAML file so it knows how to load our 6477 files. To describe a dataset to Lux, three core things are required:

  • the root directory of the dataset to load,
  • an extractor template string to extract meaningful data from the filename and directory path to include in the final table, and
  • a datashape describing the column names and their datatypes.

We will use Lux in conjunction with Blaze server, and will provide these three elements inside a Blaze server YAML file, named lux.yaml for this tutorial.

Because the data are split between two exchanges, NASDAQ and NYSE, we will represent them as two separate datasets in the yaml file. After loading, we will see how to combine them into one dataset with Blaze.

The yaml entries for these datasets are as follows:

stooq_nasdaq_lux:
    imports: ["lux"]
    source: "lux://data/daily/us/nasdaq stocks"
    extractor: "{}/{Symbol}.us.txt"
    dshape: "var * {Symbol: ?string,
                    Date: ?string,
                    Open: float64,
                    High: float64,
                    Low: float64,
                    Close: float64,
                    Volume: float64,
                    OpenInt: float64}"

stooq_nyse_lux:
    imports: ["lux"]
    source: "lux://data/daily/us/nyse stocks"
    extractor: "{}/{Symbol}.us.txt"
    dshape: "var * {Symbol: ?string,
                    Date: ?string,
                    Open: float64,
                    High: float64,
                    Low: float64,
                    Close: float64,
                    Volume: float64,
                    OpenInt: float64}"

Here we describe two Lux datasets, stooq_nasdaq_lux and stooq_nyse_lux. Both have many similarities:

  • The imports: ["lux"] entry tells Blaze server that it needs to import the lux module to access the Lux backend before loading these datasets;
  • the source: "lux://..." entry indicates the lux URI string that specifies the root directory for each dataset (nasdaq stocks and nyse stocks);
  • the extractor: "..." entry specifies how Lux is to extract meaningful data from the directory structure and filename to include in the final table, and
  • the dshape entry specifies the column names and datatypes to use for the resulting table.

Extractor Strings

The Lux extractor string is modeled on the Python format string syntax. It uses the parse package, which deviates from stock format strings in a few ways. A format specification in Python follows this pattern:

[fill][align][0][width][.precision][type]

The differences between parse string specifications and standard format string specifications are:

  • The align operators will cause spaces (or specified fill character) to be stripped from the parsed value. The width is not enforced; it just indicates there may be whitespace or 0s to strip.
  • Numeric parsing will automatically handle a 0b, 0o or 0x prefix. That is, the # format character is handled automatically by d, b, o and x formats.
  • For d, any of these prefixes will be accepted; for the other types, the prefix must match the type if one is present at all.
  • Numeric sign is handled automatically.
  • The thousands separator is handled automatically if the “n” type is used.
  • The types supported are a slightly different mix from the format() types. Some format() types carry over directly: d, n, %, f, e, b, o and x. In addition, some regular expression character group types (D, w, W, s and S) are also available.
  • The e and g types are case-insensitive so there is no need for the E or G types.

In our example, the extractor string is:

{}/{Symbol}.us.txt

The first empty set of braces ({}) matches the 1 and 2 subdirectories. Because the braces are empty, no name is associated with these directories, and they are not included in the resulting table. We must still include the empty braces in the extractor string so that the extractor matches these directories and stitches all the files together into one logical table.

The /{Symbol}.us.txt part of the extractor string will match filenames like aapl.us.txt, zyne.us.txt, etc. All contents after the / directory separator and before the .us.txt suffix will be extracted and used as the Symbol field. Because this field has a name, it is included in the resulting table.
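To illustrate how the pattern behaves, here is a rough regular-expression equivalent. This regex is our own approximation for demonstration, not how Lux compiles extractor strings internally:

```python
import re

# Rough regex analogue of the extractor "{}/{Symbol}.us.txt":
# one anonymous path component, a slash, then a named Symbol group.
pattern = re.compile(r"[^/]+/(?P<Symbol>[^/]+)\.us\.txt$")

m = pattern.match("1/aapl.us.txt")
print(m.group("Symbol"))  # prints "aapl"; the anonymous "1" is matched but not kept
```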

datashape

Blaze uses datashape as its type system. In our example, the datashape for both the NASDAQ and NYSE data sets is:

"var * {Symbol: ?string,
        Date: ?string,
        Open: float64,
        High: float64,
        Low: float64,
        Close: float64,
        Volume: float64,
        OpenInt: float64}"

This specifies a tabular datashape with an unknown (var) number of rows and with eight named columns. Each column has a specified type. For this example, the types are either ?string or float64, although datashape understands many other types, notably several integer and floating point types, fixed and variable length string types, and datetime types. The ? prefix in the ?string type indicates that this type may have missing values.
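For reference, here is a hypothetical datashape that uses a few of these other types. It is illustrative only and does not describe the Stooq files:

```
var * {Symbol: string[8],
       Date: date,
       Open: float32,
       Volume: int64}
```

Here string[8] is a fixed-length string of eight characters, date is a calendar date, and int64 is a 64-bit integer.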

Loading Lux datasets in Blaze and Anaconda Mosaic

Now that the yaml specification file has been created (named lux.yaml for this tutorial), we will walk you through loading it in Anaconda Mosaic.

Note that this process will change and improve in coming versions of Mosaic; many of the following steps will become unnecessary after those improvements are incorporated.

If Anaconda Mosaic is running, close the browser window and shut down the anaconda-mosaic process in your terminal or command window.

In your terminal or command window, navigate to the directory where lux.yaml is located, and run anaconda-mosaic as follows:

anaconda-mosaic -f lux.yaml

After initializing, Mosaic will open a browser window. Log in to Mosaic if you are not logged in already.

We will now add our two Lux datasets to Mosaic. Click the + icon on the upper left hand side. In the “Add Dataset” dialog that appears, select “blaze server” for the Data URI, which will pre-fill the next field with the following template:

blaze://{host}

Modify the template to read as follows:

blaze://127.0.0.1:6363::stooq_nasdaq_lux

For the “Name” field, fill in stooq_nasdaq_lux. The “Description” field is optional. Click “OK”.

You should see a new stooq_nasdaq_lux dataset added to the left hand side. Selecting it will provide a preview of the full dataset, which may take several seconds to load.

We can repeat this process for the stooq_nyse_lux dataset. The Data URI for that dataset in the “Add Dataset” dialog is:

blaze://127.0.0.1:6363::stooq_nyse_lux

and we name it stooq_nyse_lux in the “Name” field.

It is also possible to add the Lux datasets to Mosaic by means of the Lux URI. Click the + icon on the upper left hand side. In the “Add Dataset” dialog, select Lux; the field will show the following:

lux:///{path}/{to}/{folder}

Modify it accordingly. For example:

lux:///home/user/Downloads/nasdaq stocks

Name it stooq_nasdaq_lux in the “Name” field. For the extractor field, we use the same string as in the yaml file:

{}/{Symbol}.us.txt

and click “OK”. Repeat the process for stooq_nyse_lux. If successful, you should have two new Lux datasets available in Mosaic: stooq_nasdaq_lux and stooq_nyse_lux.

Querying Lux datasets

We can now select records from our datasets and perform other transformations with Mosaic.

If we are interested in large-volume stocks, for example, we can select those records using the select operation. Select the stooq_nasdaq_lux dataset from the left hand side, and click on select from the expression builder. Fill in the expression template as follows:

x[x.Volume > 10**6]

After previewing or applying this expression, we see the results of the selection in the table view.

Mosaic, Blaze, and Lux coordinate all the file loading, parsing, and behind-the-scenes logic for you to compute this Blaze operation.

If you haven’t already, click “Apply” to store this selection in the breadcrumb, which allows us to build more complex queries on top of it.

We can add other operations on top of this one; for instance, we can group these large volume records by the Date column and compute the maximum volume traded on that date.

Click on the by expression, and edit the expression template to read:

by(x.Date, max_volume=x.Volume.max())

Selecting preview or apply will compute this grouping operation on the result of the large volume selection.
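The pipeline we just built in Mosaic (select large-volume rows, then take the per-date maximum volume) is equivalent in spirit to the following plain-Python sketch. This is illustrative only; Blaze performs the actual computation for us behind the scenes:

```python
# Plain-Python analogue of x[x.Volume > 10**6] followed by
# by(x.Date, max_volume=x.Volume.max()); illustrative only.
def max_volume_by_date(rows):
    large = (r for r in rows if r["Volume"] > 10**6)  # the select step
    out = {}
    for r in large:                                   # the by/max step
        d = r["Date"]
        out[d] = max(out.get(d, 0.0), r["Volume"])
    return out

sample = [
    {"Date": "20160401", "Volume": 25114568.0},
    {"Date": "20160401", "Volume": 500000.0},   # filtered out by the select
    {"Date": "20160404", "Volume": 34791172.0},
]
print(max_volume_by_date(sample))
# {'20160401': 25114568.0, '20160404': 34791172.0}
```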

We can “Commit” the result of this large-volume grouping, which lets us store the expression under a meaningful name and return to it easily for later inspection.

Summary

This is just a taste of the sorts of computations that Mosaic, Blaze, and Lux enable. In a few steps, we described a potentially large repository of flat files and loaded it into unified logical datasets. We can then query and compute on these data to explore and analyze them, without the tedious and error-prone bookkeeping usually required to manage nested directories of flat files.