NAME Table::BoxFormat - Parsing the tabular data format generated by database SELECTs VERSION Version 0.01 SYNOPSIS use Table::BoxFormat; # Reading input from a "dbox" temp file my $dbx = Table::BoxFormat->new( input_file => '/tmp/select_result.dbox' ); my $data = $self->data; # array of arrays, header in first row # Input dbox from a string my $dbx = Table::BoxFormat->new( input_data => $dboxes_string ); my $data = $self->data; # array of arrays, header in first row # input from dbox file, output directly to a tsv file my $dbx = Table::BoxFormat->new(); $dbx->output_to_tsv( '/tmp/select_result.dbox', '/tmp/select_result.tsv' ); # input dbox from a string, output directly to a tsv file $dbx = Table::BoxFormat->new( input_data => $dbox_string ); $dbx->output_to_tsv( $output_tsv_file ); DESCRIPTION Table::BoxFormat is a module to work with data in the tabular text format(s) commonly used in database client shells (postgresql's "psql", mysql's "mysql", or sqlite's "sqlite3"), where a SELECT will typical display data in a form such as this (mysql): +-----+------------+---------------+-------------+ | id | date | type | amount | +-----+------------+---------------+-------------+ | 11 | 2010-09-01 | factory | 146035.00 | | 15 | 2011-01-01 | factory | 191239.00 | | 16 | 2010-09-01 | marketing | 467087.00 | | 17 | 2010-10-01 | marketing | 409430.00 | +-----+------------+---------------+-------------+ Or this (postgresql's "ascii" form): id | date | type | amount ----+------------+-----------+-------- 1 | 2010-09-01 | factory | 146035 4 | 2011-01-01 | factory | 191239 6 | 2010-09-01 | marketing | 467087 7 | 2010-10-01 | marketing | 409430 These formats are human-readable, but not suitable for other purposes such as feeding to a graphics program, or inserting into another database table. This code presumes these text tables of "data boxes" are either stored in a string or saved to a file. This code works with at least three different formats: mysql, psql and unicode psql. implementation notes The main method here is read_dbox, which works by first looking for a horizontal ruler line near the top of the data, for example: +-----+------------+---------------+-------------+ ----+------------+-----------+-------- ────┼────────────┼───────────┼──────── These ruler lines are used to identify the boundary columns, afterwhich the header and data lines are treated as fixed-width fields. Leading and trailing whitespace are stripped from each value. An earlier (now deprecated) method named read_simple takes an opposite approach, ignoring the horizontal rules entirely and doing regular expression matches looking for data delimiters on each line. In comparison, the read_dbox should run faster and be able to handle strings with delimiter characters embedded in them. METHODS new Creates a new Table::BoxFormat object. Takes a list of attribute/setting pairs as an argument. input_encoding Default's to "UTF-8". Change to suit text encoding (e.g. "ISO-8859-1"). Must work as a perl ":encoding(...)" layer. output_encoding Like input_encoding. Default: "UTF-8". input_file File to input data from. Can be supplied later, e.g. when read_dbox is called. Only required if input_data was not defined directly. (( TODO change this: make it required ? )) input_data SQL SELECT output in the fixed-width-plus-delimiter form discussed above. the parsing regular expressions (type: RegexpRef) separator_rule The column separators (vertical bar) ruler_line_rule Matches the Horizontal ruler lines (typically just under the header line) cross_rule Match cross marks the horizontal bars typically use to mark column boundaries. left_edge_rule Left border delimiters (we strip these before processing). right_edge_rule Right border delimiters (we strip these before processing). slurp_input_data Example usage: $self->slurp_input_data( $input_file_name ); read_dbox Given data in tabular boxes from a multiline string, convert it into an array of arrays. my $data = $bxs->read_dbox(); Converts the boxdata from the object's input_data into an array of arrays, with the field names included in the first row. As a side-effect, copies the header (first row of returned data) in the object's header, and puts some format metadata in the object's meta. analyze_ruler Internal method that analyzes the given ruler line and location to determine column widths and the dbox format. Returns an ordered list like so: format: 'mysql', 'postgres', 'postgres_unicode', 'sqlite' header location: a row number: 0 or 1 first_data: the row number where data begins: 2 or 3 positions: a list of column boundary positions Example usage: ( $format, $header_loc, $first_data, @pos ) = $self->analyze_ruler( $line, $i ); read_simple This is DEPRECATED. See read_dbox. Given data in tabular boxes from a multiline string, convert it into an array of arrays. my $data = $bxs->read_simple(); Goes through the boxdata slurped into the object field input_data, returns it as an array of arrays, including the field names in the first row. As a side-effect, stores the header (first row of boxdata) in the object's header. output_to_tsv A convenience method that runs read_dbox and writes the data to a tsv file specified by the given argument. Returns a reference to the data (array of arrays). Example usage: $dbx->output_to_tsv( $input_dbox_file, $output_tsv_file ); Or: $dbx = Table::BoxFormat->new( input_file => $input_dbox_file ); $dbx->output_to_tsv( $output_tsv_file ); Or: $dbx = Table::BoxFormat->new( input_data => $dbox_string ); $dbx->output_to_tsv( $output_tsv_file ); output_to_csv A convenience method that runs read_dbox and writes the data to a csv file specified by the given argument. Example usage: $dbx->output_to_csv( $input_dbox_file, $output_csv_file ); Or: $dbx = Table::BoxFormat->new( input_file => $input_dbox_file ); $dbx->output_to_csv( $output_csv_file ); Or: $dbx = Table::BoxFormat->new( input_data => $dbox_string ); $dbx->output_to_csv( $output_csv_file ); AUTHOR Joseph Brenner, <doom@kzsu.stanford.edu>, 05 Jun 2016 LIMITATIONS memory limited As implemented, this presumes the entire data set can be held in memory. Future versions may be more stream-oriented: there's no technical reason this couldn't be done. what you get is what you get This code is only guaranteed to cover input formats from mysql, psql and some from sqlite3. It may work with other databases, but hasn't been tested. At present it is not easily extensible (implementing a plugin system ala DBI/DBD seemed like overkill). sqlite3 This code does not support the default output from sqlite3, only a variation with these settings: .header on .mode column While sqlite3 is very flexible, unfortunately the default output does not seem very useable: SELECT * from expensoids; |2010-09-01|factory|146035.0 |2010-11-01|factory|218866.0 |2011-01-01|factory|191239.0 |2010-10-01|marketing|409430.0 This is separated by the traditional ascii vertical bar, but without the usual bracketing spaces, and without any attempt at using fixed width columns. Somewhat oddly, the left edge has a vertical bar, but the right edge does not, but worse there's no header that provides column labels. If I were actually working with sqlite a lot I would turn on the header display and switch to fixed-width columns: .header on .mode column That yields output that looks like this: id date type amount ---------- ---------- ---------- ---------- 1 2010-09-01 factory 146035.0 2 2010-10-01 factory 208816.0 3 2010-11-01 factory 218866.0 That's very similar to the psql format using "\pset border 0" (which has one space column breaks instead of two): both are supported by read_dbox using the analyze_ruler routine. COPYRIGHT AND LICENSE Copyright (C) 2016 by Joseph Brenner This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License. See http://dev.perl.org/licenses/ for more information.