Lux Tutorial and Walkthrough¶
Lux is a backend for Blaze designed to handle nested directories of CSV files and represent them as a single table. With Lux and Blaze combined, we can efficiently run queries on directories of flat files as if they were a single dataset, extracting the data we need and transforming it as required.
This tutorial will guide you through configuring and loading daily stock data with Lux. Once the data are loaded, we will then show you how to explore and transform them in Blaze.
For this example, the daily stock data we will use is provided by Stooq. Specifically, we are using the daily US data under the ASCII column, which, when downloaded, comes packaged as a zip archive of nested directories of CSV files. These data were selected because they are typical financial market data, and because they are structured in a way that makes them inconvenient to work with.
After downloading and unzipping, we see the following structure:
data
└── daily
    └── us
        ├── nasdaq etfs
        ├── nasdaq stocks
        │   ├── 1
        │   └── 2
        ├── nyse etfs
        ├── nyse stocks
        │   ├── 1
        │   └── 2
        ├── nysemkt etfs
        └── nysemkt stocks
Inside the `nasdaq stocks/1` and `nasdaq stocks/2` folders are several CSV files, one for each symbol, with daily bars. For example, here’s a sample from `nasdaq stocks/1/aapl.us.txt`:
$ head -5 data/daily/us/nasdaq\ stocks/1/aapl.us.txt
Date,Open,High,Low,Close,Volume,OpenInt
19840907,0.4379,0.4432,0.4326,0.4379,22476461,0
19840910,0.4379,0.43923,0.42735,0.43527,17445402,0
19840911,0.43923,0.45114,0.43923,0.4432,41137289,0
19840912,0.4432,0.44584,0.42995,0.42995,35936930,0
$ tail -5 data/daily/us/nasdaq\ stocks/1/aapl.us.txt
20160401,108.78,110,108.2,109.99,25114568,0
20160404,110.42,112.19,110.27,111.12,34791172,0
20160405,109.51,110.73,109.42,109.81,24789296,0
20160406,110.23,110.98,109.2,110.96,25152497,0
20160407,109.95,110.42,108.121,108.54,29499200,0
We see that the data for Apple spans more than 30 years, and has about 8000 rows. Other symbols may have fewer rows.
Several aspects of these datasets make them inconvenient to work with and must be handled by any process that collects them into a single table:

- The symbol name does not appear inside the file itself, so we must stitch together the symbol name with its data ourselves.
- The vendor splits the data into two subdirectories named `1` and `2` to keep the number of files per directory within limits.
- Altogether, the complete symbol set from `nasdaq stocks` and `nyse stocks` comprises 6477 separate files. We want to treat these as one dataset.
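To see what that stitching involves, here is a minimal standard-library sketch of recovering the symbol from each filename and attaching it to the file's rows. Lux automates this over the whole tree; the `stitch_rows` helper below is purely illustrative and not part of Lux.

```python
import csv
from pathlib import Path

def stitch_rows(root):
    """Yield (symbol, row) pairs for every file under root/1 and root/2.

    The symbol lives only in the filename (e.g. "aapl.us.txt"), so we
    strip the ".us.txt" suffix and tag each parsed CSV row with it.
    """
    for path in sorted(Path(root).glob("[12]/*.us.txt")):
        symbol = path.name[: -len(".us.txt")]
        with path.open(newline="") as f:
            for row in csv.DictReader(f):
                yield symbol, row
```

Pointing `stitch_rows` at a directory laid out like `nasdaq stocks` yields one logical stream of rows, each tagged with its symbol.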
We first need to describe these datasets to Lux in a YAML file so it knows how to load our thousands of files. To describe a dataset to Lux, three core things are required:
- the root directory of the dataset to load,
- an extractor template string to extract meaningful data from the filename and directory path to include in the final table, and
- a datashape describing the column names and their datatypes.
We will use Lux in conjunction with Blaze server, and will provide these three elements inside a Blaze server YAML file, named `lux.yaml` for this tutorial.
Because the data are split between two exchanges, NASDAQ and NYSE, we will represent them as two separate datasets in the YAML file. After loading, we will see how to combine them into one dataset with Blaze. The YAML entries for these datasets are as follows:
stooq_nasdaq_lux:
  imports: ["lux"]
  source: "lux://data/daily/us/nasdaq stocks"
  extractor: "{}/{Symbol}.us.txt"
  dshape: "var * {Symbol: ?string,
                  Date: string,
                  Open: float64,
                  High: float64,
                  Low: float64,
                  Close: float64,
                  Volume: float64,
                  OpenInt: float64}"

stooq_nyse_lux:
  imports: ["lux"]
  source: "lux://data/daily/us/nyse stocks"
  extractor: "{}/{Symbol}.us.txt"
  dshape: "var * {Symbol: ?string,
                  Date: string,
                  Open: float64,
                  High: float64,
                  Low: float64,
                  Close: float64,
                  Volume: float64,
                  OpenInt: float64}"
Here we describe two Lux datasets, `stooq_nasdaq_lux` and `stooq_nyse_lux`. Both entries share the same structure:
- The `imports: ["lux"]` entry tells Blaze server that it needs to import the `lux` module to access the Lux backend before loading these datasets;
- the `source: "lux://..."` entry gives the Lux URI string that specifies the root directory for each dataset, `nasdaq stocks` and `nyse stocks`;
- the `extractor: "..."` entry specifies how Lux is to extract meaningful data from the directory structure and filename to include in the final table; and
- the `dshape` entry specifies the column names and datatypes to use for the resulting table.
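As a quick sanity check of this structure, the entry can be parsed with PyYAML (assumed to be installed; Blaze server reads the same file with a YAML parser). The string below reproduces the `stooq_nasdaq_lux` entry from above.

```python
import yaml  # PyYAML, assumed installed

# Reproduce the stooq_nasdaq_lux entry as a string and parse it.
entry_text = '''
stooq_nasdaq_lux:
  imports: ["lux"]
  source: "lux://data/daily/us/nasdaq stocks"
  extractor: "{}/{Symbol}.us.txt"
  dshape: "var * {Symbol: ?string, Date: string, Open: float64,
    High: float64, Low: float64, Close: float64, Volume: float64,
    OpenInt: float64}"
'''

entry = yaml.safe_load(entry_text)["stooq_nasdaq_lux"]
# All four required keys should be present.
missing = {"imports", "source", "extractor", "dshape"} - set(entry)
```

Checking a dataset entry this way before starting Blaze server catches indentation and quoting mistakes early.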
Extractor Strings¶
The Lux extractor string is modeled on Python's format string syntax. It uses the `parse` package, which has a few deviations from stock format strings. A format specification in Python follows this pattern:
[fill][align][0][width][.precision][type]
The differences between parse string specifications and standard format string specifications are:
- The `align` operators will cause spaces (or the specified fill character) to be stripped from the parsed value. The width is not enforced; it just indicates there may be whitespace or `0`s to strip.
- Numeric parsing will automatically handle a `0b`, `0o` or `0x` prefix. That is, the `#` format character is handled automatically by the `d`, `b`, `o` and `x` formats. For `d` any prefix will be accepted, but for the others the correct prefix must be present, if at all.
- Numeric sign is handled automatically.
- The thousands separator is handled automatically if the `n` type is used.
- The types supported are a slightly different mix to the format() types. Some format() types come directly over: `d`, `n`, `%`, `f`, `e`, `b`, `o` and `x`. In addition, some regular expression character group types (`D`, `w`, `W`, `s` and `S`) are also available.
- The `e` and `g` types are case-insensitive, so there is no need for the `E` or `G` types.
In our example, the extractor string is:
{}/{Symbol}.us.txt
The first empty set of braces (`{}`) matches the `1` or `2` directories. Because the braces are empty, no name is associated with these directories, and they are not included in the resulting table. We still have to include the empty braces in the extractor string so that the extractor matches these directories and stitches all the files together into one logical table.
The `/{Symbol}.us.txt` part of the extractor string will match filenames like `aapl.us.txt`, `zyne.us.txt`, etc. All content after the `/` directory separator and before the `.us.txt` suffix will be extracted and used as the `Symbol` field. Because this field has a name, it is included in the resulting table.
datashape¶
Blaze uses datashape as its type system. In our example, the datashape for both the NASDAQ and NYSE data sets is:
"var * {Symbol: ?string,
Date: ?string,
Open: float64,
High: float64,
Low: float64,
Close: float64,
Volume: float64,
OpenInt: float64}"
This specifies a tabular datashape with an unknown (`var`) number of rows and with eight named columns, each of a specified type. For this example, the types are either `?string` or `float64`, although datashape understands many other types, notably several integer and floating point types, fixed and variable length string types, and datetime types. The `?` prefix in the `?string` type indicates that this column may have missing values.
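To make the schema concrete, here is a small standard-library sketch that applies the declared column types to a raw CSV row. The `SCHEMA` mapping and `coerce` helper are illustrative, not Lux API; note that `Symbol` is absent from the file itself (Lux fills it in from the extractor), so under `?string` it may legitimately be `None`.

```python
import csv
import io

# Column types mirroring the dshape: string columns stay strings,
# the six numeric columns become float64 (Python floats).
SCHEMA = {"Symbol": str, "Date": str, "Open": float, "High": float,
          "Low": float, "Close": float, "Volume": float, "OpenInt": float}

def coerce(row):
    """Apply the dshape's types; absent or empty values become None."""
    return {name: (None if row.get(name) in (None, "") else cast(row[name]))
            for name, cast in SCHEMA.items()}

# One header line plus the first aapl row from the sample above.
sample = io.StringIO(
    "Date,Open,High,Low,Close,Volume,OpenInt\n"
    "19840907,0.4379,0.4432,0.4326,0.4379,22476461,0\n")
rows = [coerce(r) for r in csv.DictReader(sample)]
```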
Loading Lux datasets in Blaze and Anaconda Mosaic¶
Now that the YAML specification file has been created (named `lux.yaml` for this tutorial), we will walk you through loading it in Anaconda Mosaic.
Note that this process will change and improve in coming versions of Mosaic; many of the following steps will become unnecessary once those improvements are incorporated.
If Anaconda Mosaic is running, close the browser window and shut down the `anaconda-mosaic` process in your terminal or command window.
In your terminal or command window, navigate to the directory where `lux.yaml` is located, and run `anaconda-mosaic` as follows:
anaconda-mosaic -f lux.yaml
After initializing, Mosaic will open a browser window. Log in to Mosaic if you are not logged in already.
We will now add our two Lux datasets to Mosaic. Click the `+` icon on the upper left hand side. In the “Add Dataset” dialogue that comes up, select “blaze server” for the Data URI, which will pre-fill the next field with the following template:

blaze://{host}
Modify the template to read as follows:

blaze://127.0.0.1:6363::stooq_nasdaq_lux

For the “Name” field, fill in `stooq_nasdaq_lux`. The “Description” field is optional. Click “OK”.
You should see a new `stooq_nasdaq_lux` dataset added to the left hand side. Selecting it will provide a preview of the full dataset, which may take several seconds to load.
We can repeat this process for the `stooq_nyse_lux` dataset. The Data URI for that dataset in the Add Dataset dialogue is:

blaze://127.0.0.1:6363::stooq_nyse_lux

and we name it `stooq_nyse_lux` in the “Name” field.
It is also possible to add the Lux datasets to Mosaic by means of the Lux URI. Click the `+` icon on the upper left hand side. In the “Add Dataset” dialog, select Lux; the field will show the following:

lux:///{path}/{to}/{folder}

Modify it accordingly. For example:

lux:///home/user/Downloads/nasdaq stocks

Name it `stooq_nasdaq_lux` in the “Name” field. For the extractor field, we use the same string as in the YAML file:

{}/{Symbol}.us.txt

and click “OK”. Repeat the process for `stooq_nyse_lux`. If successful, you should have two new Lux datasets available in Mosaic: `stooq_nasdaq_lux` and `stooq_nyse_lux`.
Querying Lux datasets¶
We can now select records from our datasets and perform other transformations with Mosaic.
If we are interested in large volume stocks, for example, we can select that data using the `select` operation. Select the `stooq_nasdaq_lux` dataset from the left hand side, and click on `select` in the expression builder. Fill in the expression template as follows:
x[x.Volume > 10**6]
After previewing or applying this expression, we see the results of the selection in the table view.
Mosaic, Blaze, and Lux coordinate all the file loading, parsing, and behind-the-scenes logic for you to compute this Blaze operation.
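On plain Python rows, the same filter can be sketched as follows. This is a toy stand-in for the Blaze expression, not Blaze itself; the first row's numbers come from the aapl sample above, and the second row is invented to show a record being filtered out.

```python
# Toy version of the Blaze selection x[x.Volume > 10**6] over plain
# dicts. "xxxx" and its values are invented for illustration.
rows = [
    {"Symbol": "aapl", "Date": "20160407", "Volume": 29499200.0},
    {"Symbol": "xxxx", "Date": "20160407", "Volume": 52000.0},
]

large_volume = [r for r in rows if r["Volume"] > 10**6]
```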
If you haven’t already, click “Apply” to store this selection in the breadcrumb and allow us to build more complex queries on top of this selection.
We can add other operations on top of this one; for instance, we can group these large volume records by the `Date` column and compute the maximum volume traded on each date.
Click on the `by` expression, and edit the expression template to read:
by(x.Date, max_volume=x.Volume.max())
Selecting `preview` or `apply` will compute this grouping operation on the result of the large volume selection.
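The grouping can likewise be sketched in plain Python as a toy stand-in for `by(x.Date, max_volume=x.Volume.max())`; the row values here are partly invented.

```python
# Toy version of by(x.Date, max_volume=x.Volume.max()): group rows
# by Date and keep the largest Volume seen on each date.
rows = [
    {"Date": "20160406", "Volume": 25152497.0},
    {"Date": "20160407", "Volume": 29499200.0},
    {"Date": "20160407", "Volume": 31000000.0},  # invented value
]

max_volume = {}
for r in rows:
    d = r["Date"]
    max_volume[d] = max(max_volume.get(d, float("-inf")), r["Volume"])
```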
We can “Commit” the result of this large Volume grouping, which allows us to store this expression with a meaningful name and return to it easily for later inspection.
Summary¶
This is just a taste of the sorts of computations that Mosaic, Blaze, and Lux enable. With a few steps, we are able to describe a potentially large repository of flat files and load it into unified logical datasets. We can then query and transform these data to explore and analyze them, without the tedious and error-prone steps usually required to manage nested directories of flat files.