Lux Tutorial and Walkthrough¶
Lux is a backend for Anaconda Mosaic designed to efficiently handle nested directories of CSV or text files and represent them as a single table. With Anaconda Mosaic and Lux, we can run queries on directories of flat files as if they were a single dataset, extracting and transforming just the data we need.
This tutorial will guide you through configuring and loading daily stock data with Lux. Once the data are loaded, we will then show you how to explore and transform them in Anaconda Mosaic.
For this example, the daily stock data we will use is provided by Stooq. Specifically, we are using the daily US data under the ASCII column, which, when downloaded, comes packaged as a zip archive of nested directories of CSV or text files. These data were selected because they are typical financial market data and are structured in a way that makes them inconvenient to work with.
After downloading and unzipping, we see the following structure:
data
└── daily
    └── us
        ├── nasdaq etfs
        ├── nasdaq stocks
        │   ├── 1
        │   └── 2
        ├── nyse etfs
        ├── nyse stocks
        │   ├── 1
        │   └── 2
        ├── nysemkt etfs
        └── nysemkt stocks
Inside the nasdaq stocks/1 and nasdaq stocks/2 folders are several text files, one for each symbol, with daily bars. For example, here’s a sample from nasdaq stocks/1/aapl.us.txt:
$ head -5 data/daily/us/nasdaq\ stocks/1/aapl.us.txt
Date,Open,High,Low,Close,Volume,OpenInt
19840907,0.4379,0.4432,0.4326,0.4379,22476461,0
19840910,0.4379,0.43923,0.42735,0.43527,17445402,0
19840911,0.43923,0.45114,0.43923,0.4432,41137289,0
19840912,0.4432,0.44584,0.42995,0.42995,35936930,0
$ tail -5 data/daily/us/nasdaq\ stocks/1/aapl.us.txt
20160401,108.78,110,108.2,109.99,25114568,0
20160404,110.42,112.19,110.27,111.12,34791172,0
20160405,109.51,110.73,109.42,109.81,24789296,0
20160406,110.23,110.98,109.2,110.96,25152497,0
20160407,109.95,110.42,108.121,108.54,29499200,0
We see that the data for Apple spans more than 30 years, and has about 8000 rows. Other symbols may have fewer rows.
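The Date column uses a compact YYYYMMDD format with no separators. As a small illustrative sketch (using only the Python standard library, not part of Lux itself), one raw bar can be parsed like this:

```python
from datetime import datetime

# One raw bar from aapl.us.txt: Date,Open,High,Low,Close,Volume,OpenInt
row = "19840907,0.4379,0.4432,0.4326,0.4379,22476461,0"
date_s, open_s, high_s, low_s, close_s, volume_s, openint_s = row.split(",")

# The compact YYYYMMDD date parses with strptime's %Y%m%d directive
date = datetime.strptime(date_s, "%Y%m%d").date()
print(date)           # 1984-09-07
print(int(volume_s))  # 22476461
```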
Some aspects of these data sets make them inconvenient to work with, and must be handled by any process that wants to collect these data into a single table:
- The symbol name does not appear inside the file itself, so we must stitch the symbol name together with its data ourselves.
- The vendor splits the data into two subdirectories named 1 and 2 to keep the number of files per directory within limits.
- Altogether, the complete symbol set from nasdaq stocks and nyse stocks comprises 6477 separate files. We want to treat this as one dataset.
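To appreciate what Lux automates, here is a hand-rolled sketch of that stitching process using only the Python standard library. The directory layout comes from this tutorial; the function name and everything else is illustrative, not part of Lux:

```python
import csv
from pathlib import Path

def load_symbols(root):
    """Walk root's numbered subdirectories, yielding one dict per daily
    bar with the symbol name recovered from the filename."""
    for path in sorted(Path(root).glob("*/*.us.txt")):
        # aapl.us.txt -> aapl
        symbol = path.name.removesuffix(".us.txt")
        with path.open(newline="") as f:
            for row in csv.DictReader(f):
                row["Symbol"] = symbol  # stitch the symbol into each record
                yield row

# rows = list(load_symbols("data/daily/us/nasdaq stocks"))
```

Even this short sketch has to handle the 1 and 2 subdirectories and the symbol-in-filename convention by hand, which is exactly the bookkeeping Lux takes over.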
We first need to describe these datasets to Lux so it knows how to load our 6477 files. To describe a dataset to Lux, three core things are required:
- the root directory of the dataset to load,
- an extractor template string to extract meaningful data from the filename and directory path to include in the final table, and
- a name for the dataset.
Extractor Strings¶
The Lux extractor string is modeled on the Python Format string syntax. It uses the Parse package, which has a few deviations from stock format strings. A format specification in Python follows this pattern:
[fill][align][0][width][.precision][type]
The differences between parse string specifications and standard format string specifications are:
- The align operators will cause spaces (or the specified fill character) to be stripped from the parsed value. The width is not enforced; it just indicates there may be whitespace or 0s to strip.
- Numeric parsing will automatically handle a 0b, 0o or 0x prefix. That is, the # format character is handled automatically by the d, b, o and x types. For d any prefix will be accepted, but for the others the correct prefix must be present, if at all.
- Numeric sign is handled automatically.
- The thousands separator is handled automatically if the n type is used.
- The types supported are a slightly different mix than the format() types. Some format() types come over directly: d, n, %, f, e, b, o and x. In addition, some regular expression character group types, D, w, W, s and S, are also available.
- The e and g types are case-insensitive, so there is no need for the E or G types.
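As a rough stdlib analogue (this is an illustration, not Lux’s actual implementation), an extractor pattern like {}/{Symbol}.us.txt behaves much like a regular expression with one anonymous group and one named capture group:

```python
import re

# {} -> an unnamed match we discard; {Symbol} -> a named capture;
# the literal .us.txt suffix is escaped
pattern = re.compile(r"(?:[^/]+)/(?P<Symbol>[^/]+)\.us\.txt$")

m = pattern.search("1/aapl.us.txt")
print(m.group("Symbol"))  # aapl
```

The unnamed group consumes the 1 or 2 directory, while the named group becomes a column in the result — mirroring how Lux treats empty versus named braces.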
In our example, the extractor string is:
{}/{Symbol}.us.txt
The first empty set of braces ({}) matches the 1 or 2 directories. Because the braces are empty, no name is associated with these directories, and they are not included in the resulting table. We have to include the empty braces in the extractor string to ensure the extractor matches these directories and stitches together all the files into one logical table.
The /{Symbol}.us.txt part of the extractor string will match filenames like aapl.us.txt, zyne.us.txt, etc. All contents after the / directory separator and before the .us.txt suffix will be extracted and used as the Symbol field. Because this field has a name, it is included in the resulting table.
Loading Lux datasets in Anaconda Mosaic¶
If Anaconda Mosaic is not running, open a terminal or command window, navigate to the directory where the Stooq data is located, and run anaconda-mosaic as follows:
anaconda-mosaic
After initializing, Anaconda Mosaic will open a browser window. Log in to Anaconda Mosaic if you are not logged in already.
We will now add our two Lux datasets to Anaconda Mosaic. Click the + icon on the upper left-hand side. In the “Add Dataset” dialogue that comes up, select “Lux” for the Data URI, and modify the template to read as follows:
lux://data/daily/us/nasdaq stocks
For the “Name” field, fill in stooq_nasdaq_lux. The “Description” field is optional.
For the “Extractor” field, enter:
{}/{Symbol}.us.txt
Click “OK”.
You should see a new stooq_nasdaq_lux dataset added to the left-hand side. Selecting it will provide a preview of the full dataset, which may take several seconds to load.
We can repeat this process for the stooq_nyse_lux dataset. The Data URI for that dataset in the Add Dataset dialogue is:
lux://data/daily/us/nyse stocks
and we name it stooq_nyse_lux in the “Name” field.
For the “Extractor” field, enter:
{}/{Symbol}.us.txt
Click “OK”.
Querying Lux datasets¶
We can now select records from our datasets and perform other transformations with Anaconda Mosaic.
If we are interested in large-volume stocks, for example, we can select that data using the select operation. Select the stooq_nasdaq_lux dataset from the left-hand side, and click on select in the expression builder.
Fill in the expression template as follows:
x[x.Volume > 10**6]
After previewing or applying this expression, we see the results of the selection in the table view.
Anaconda Mosaic coordinates all the file loading, parsing, and behind-the-scenes logic for you to compute this operation.
If you haven’t already, click “Apply” to store this selection in the breadcrumb trail, allowing us to build more complex queries on top of it.
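Conceptually, the select expression filters rows by a boolean condition. Over plain Python records, the same filter looks like this (an illustrative stdlib sketch with made-up sample rows, not Mosaic’s engine):

```python
# Each record mirrors one row of the stitched table
rows = [
    {"Symbol": "aapl", "Date": "20160407", "Volume": 29_499_200},
    {"Symbol": "tiny", "Date": "20160407", "Volume": 12_000},
]

# Equivalent of x[x.Volume > 10**6]: keep only high-volume rows
large = [r for r in rows if r["Volume"] > 10**6]
print([r["Symbol"] for r in large])  # ['aapl']
```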
We can add other operations on top of this one; for instance, we can group these large-volume records by the Date column and compute the maximum volume traded on each date.
Click on the by expression, and edit the expression template to read:
by(x.Date, max_volume=x.Volume.max())
Selecting preview or apply will compute this grouping operation on the result of the large-volume selection.
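The by(x.Date, max_volume=x.Volume.max()) grouping can be mimicked over plain Python records with a dictionary of running maxima (again an illustrative sketch with made-up sample rows, not Mosaic’s implementation):

```python
rows = [
    {"Date": "20160406", "Volume": 25_152_497},
    {"Date": "20160407", "Volume": 29_499_200},
    {"Date": "20160407", "Volume": 31_000_000},
]

# Equivalent of by(x.Date, max_volume=x.Volume.max()):
# one entry per Date, holding the largest Volume seen for that date
max_volume = {}
for r in rows:
    d = r["Date"]
    max_volume[d] = max(max_volume.get(d, 0), r["Volume"])

print(max_volume["20160407"])  # 31000000
```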
We can “Commit” the result of this large-volume grouping, which stores the expression under a meaningful name so we can return to it easily for later inspection.
Summary¶
This is just a taste of the sorts of computations that Anaconda Mosaic and Lux enable. With a few steps, we can describe a potentially large repository of flat files and load it into unified logical datasets. We can then query and compute on top of these datasets to explore and analyze them, without the tedious and error-prone steps often required to manage nested directories of flat files.