TextAdapter First Steps
Basic Usage
Create a TextAdapter object for a data source:
>>> import iopro
>>> adapter = iopro.text_adapter('data.csv', parser='csv')
Define field dtypes (example: set field 0 to unsigned int and field 4 to float):
>>> adapter.set_field_types({0: 'u4', 4:'f4'})
Parse text and store records in NumPy array using slicing notation:
>>> # read all records
>>> array = adapter[:]
>>> # read first ten records
>>> array = adapter[0:10]
>>> # read last record
>>> array = adapter[-1]
>>> # read every other record
>>> array = adapter[::2]
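The slicing forms above follow ordinary Python sequence semantics. As a rough illustration only, the standard-library sketch below applies the same keys to a list of parsed CSV rows. The helper `read_records` is hypothetical and is not part of IOPro. It also reads the whole input into memory and returns plain tuples of strings rather than a NumPy array.

```python
import csv
import io


def read_records(text, key):
    """Return the rows selected by an int or slice (hypothetical helper, not IOPro)."""
    rows = [tuple(row) for row in csv.reader(io.StringIO(text))]
    if isinstance(key, slice):
        return rows[key]
    # A single index still yields a one-element list, like a one-record array.
    return [rows[key]]


sample = "1,a\n2,b\n3,c\n4,d\n"
first_two = read_records(sample, slice(0, 2))             # like adapter[0:2]
last = read_records(sample, -1)                           # like adapter[-1]
every_other = read_records(sample, slice(None, None, 2))  # like adapter[::2]
```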
JSON Support
Text data in JSON format can be parsed by specifying 'json' for the parser argument:
>>> adapter = iopro.text_adapter('data.json', parser='json')
Currently, each JSON object at the root level is interpreted as a single NumPy record. The objects can either be elements of one array or be separated by newlines. The following examples show valid JSON documents that IOPro can parse, together with the resulting NumPy array:
>>> from io import StringIO
>>> # Single JSON object
>>> data = StringIO('{"id":123, "name":"xxx"}')
>>> iopro.text_adapter(data, parser='json')[:]
array([(123L, 'xxx')],
dtype=[('f0', 'u8'), ('f1', 'O')])
>>> # Array of two JSON objects
>>> data = StringIO('[{"id":123, "name":"xxx"}, {"id":456, "name":"yyy"}]')
>>> iopro.text_adapter(data, parser='json')[:]
array([(123L, 'xxx'), (456L, 'yyy')],
dtype=[('f0', 'u8'), ('f1', 'O')])
>>> # Two JSON objects separated by newline
>>> data = StringIO('{"id":123, "name":"xxx"}\n{"id":456, "name":"yyy"}')
>>> iopro.text_adapter(data, parser='json')[:]
array([(123L, 'xxx'), (456L, 'yyy')],
dtype=[('f0', 'u8'), ('f1', 'O')])
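The three accepted layouts (a single object, an array of objects, and newline-separated objects) can be mimicked with the standard `json` module. The function `json_records` below is a hypothetical stand-in written for illustration; it is not how IOPro parses JSON. It returns a list of tuples instead of a NumPy record array.

```python
import json


def json_records(text):
    """Return one tuple of values per root-level JSON object (illustrative only)."""
    text = text.strip()
    if text.startswith("["):
        # The objects are elements of a single JSON array.
        objects = json.loads(text)
    else:
        # One JSON object per line (a single object is the one-line case).
        objects = [json.loads(line) for line in text.splitlines() if line.strip()]
    return [tuple(obj.values()) for obj in objects]


single = json_records('{"id":123, "name":"xxx"}')
as_array = json_records('[{"id":123, "name":"xxx"}, {"id":456, "name":"yyy"}]')
by_line = json_records('{"id":123, "name":"xxx"}\n{"id":456, "name":"yyy"}')
```

In this sketch the array layout and the newline-separated layout produce the same list of records, which mirrors the identical results shown above.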
Future versions of IOPro will have support for selecting specific JSON fields, using a query language similar to XPath for XML.
Advanced Usage
Define a custom converter function for field 0:
>>> import iopro
>>> import io
>>> data = '1, abc, 3.3\n2, xxx, 9.9'
>>> adapter = iopro.text_adapter(io.StringIO(data), parser='csv', field_names=False)
>>> # Override default converter for first field
>>> adapter.set_converter(0, lambda x: int(x)*2)
>>> adapter[:]
array([(2L, ' abc', 3.3), (4L, ' xxx', 9.9)],
dtype=[('f0', '<u8'), ('f1', 'S4'), ('f2', '<f8')])
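Conceptually, a converter is a function applied to the raw text of one field before the value is stored. The sketch below reproduces that idea with the standard `csv` module. `parse_with_converters` is a hypothetical helper, not an IOPro function. Fields without a converter are left as strings here, whereas IOPro infers their types.

```python
import csv
import io


def parse_with_converters(text, converters):
    """Apply converters[i] to field i of every row; other fields stay strings."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        records.append(
            tuple(converters.get(i, str)(value) for i, value in enumerate(row))
        )
    return records


data = '1, abc, 3.3\n2, xxx, 9.9'
# Double the first field and parse the third as a float.
records = parse_with_converters(data, {0: lambda x: int(x) * 2, 2: float})
```

As in the output above, the second field keeps its leading space because the delimiter is a bare comma.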
Override the default missing and fill values:
>>> import iopro
>>> import io
>>> data = '1,abc,inf\n2,NA,9.9'
>>> adapter = iopro.text_adapter(io.StringIO(data), parser='csv', field_names=False)
>>> adapter.set_field_types({1:'S3', 2:'f4'})
>>> # Define list of strings for each field that represent missing values
>>> adapter.set_missing_values({1:['NA'], 2:['inf']})
>>> # Set fill value for missing values in each field
>>> adapter.set_fill_values({1:'xxx', 2:999.999})
>>> adapter[:]
array([(' abc', 999.9990234375), ('xxx', 9.899999618530273)],
dtype=[('f0', 'S4'), ('f1', '<f4')])
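The relationship between missing values and fill values can be pictured as a per-field substitution. Whenever the raw text of a field matches one of that field's missing-value strings, the field's fill value is stored instead. The following standard-library sketch assumes exactly that behavior. `fill_missing` is a hypothetical name, and no dtype conversion is performed.

```python
import csv
import io


def fill_missing(text, missing_values, fill_values):
    """Replace per-field missing markers with the matching fill value."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        record = []
        for i, value in enumerate(row):
            if value in missing_values.get(i, ()):
                value = fill_values[i]
            record.append(value)
        records.append(tuple(record))
    return records


data = '1,abc,inf\n2,NA,9.9'
filled = fill_missing(data, {1: ['NA'], 2: ['inf']}, {1: 'xxx', 2: 999.999})
```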
Create and save a tuple of index arrays for a gzip file, then reload the indices:
>>> import iopro
>>> adapter = iopro.text_adapter('data.gz', parser='csv', compression='gzip')
>>> # build index of records and save index to NumPy array
>>> adapter.create_index('index_file')
>>> # reload index
>>> adapter = iopro.text_adapter('data.gz', parser='csv', compression='gzip', index_name='index_file')
>>> # Read last record
>>> adapter[-1]
array([(100, 101, 102)],dtype=[('f0', '<u4'), ('f1', '<u4'), ('f2', '<u4')])
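The purpose of an index is to locate a record without parsing everything before it. The sketch below shows the basic idea with the standard `gzip` module by recording the uncompressed byte offset at which each record starts. The format of IOPro's own index is not described in this section, so this is an assumption-laden illustration and not its implementation. The helper names are hypothetical. `GzipFile.seek` still decompresses from the start of the stream, so the sketch only avoids re-parsing and does not avoid decompression.

```python
import gzip
import io


def build_line_index(raw_gzip_bytes):
    """Return the uncompressed byte offset at which each record starts."""
    offsets = []
    with gzip.GzipFile(fileobj=io.BytesIO(raw_gzip_bytes)) as stream:
        position = 0
        for line in stream:
            offsets.append(position)
            position += len(line)
    return offsets


def read_record(raw_gzip_bytes, offsets, record_number):
    """Seek to one record via the index and return it as text."""
    with gzip.GzipFile(fileobj=io.BytesIO(raw_gzip_bytes)) as stream:
        stream.seek(offsets[record_number])
        return stream.readline().decode("ascii").rstrip("\n")


compressed = gzip.compress(b"1,2,3\n10,11,12\n100,101,102\n")
index = build_line_index(compressed)
last_record = read_record(compressed, index, -1)
```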
Use regular expression for finer control of extracting data:
>>> import iopro
>>> import io
>>> # Define regular expression to extract dollar amount, percentage, and month.
>>> # Each set of parentheses defines a field.
>>> data = '$2.56, 50%, September 20 1978\n$1.23, 23%, April 5 1981'
>>> regex_string = r'\$([0-9]\.[0-9][0-9]+),\s([0-9]+)%,\s([A-Za-z]+)'
>>> adapter = iopro.text_adapter(io.StringIO(data), parser='regex', regex_string=regex_string, field_names=False, infer_types=False)
>>> # set dtypes of the three extracted fields
>>> adapter.set_field_types({0:'f4', 1:'u4', 2:'S10'})
>>> adapter[:]
array([(2.56, 50L, 'September'), (1.23, 23L, 'April')],
dtype=[('f0', '<f8'), ('f1', '<u8'), ('f2', 'S9')])
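The rule that each set of parentheses defines a field can be checked independently with Python's `re` module. The pattern below was written for the sample data and is an assumption of this sketch. It anchors on the dollar sign and expects a single whitespace character after each comma. The conversion to `float` and `int` stands in for the field types set on the adapter.

```python
import re

# Three capture groups: dollar amount, percentage, month name.
pattern = re.compile(r'\$([0-9]+\.[0-9]{2}),\s([0-9]+)%,\s([A-Za-z]+)')

data = '$2.56, 50%, September 20 1978\n$1.23, 23%, April 5 1981'

records = []
for line in data.splitlines():
    match = pattern.search(line)
    if match:
        amount, percent, month = match.groups()
        records.append((float(amount), int(percent), month))
```

Text after the last group (the day and year) is ignored because no group captures it.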