NAME

    Table::BoxFormat - Parsing the tabular data format generated by
    database SELECTs

VERSION

    Version 0.01

SYNOPSIS

       use Table::BoxFormat;
       # Reading input from a "dbox" temp file
       my $dbx = Table::BoxFormat->new( input_file => '/tmp/select_result.dbox' );
       my $data = $self->data;  # array of arrays, header in first row
    
       # Input dbox from a string
       my $dbx = Table::BoxFormat->new( input_data => $dboxes_string );
       my $data = $self->data;  # array of arrays, header in first row
    
       # input from dbox file, output directly to a tsv file
       my $dbx = Table::BoxFormat->new();
       $dbx->output_to_tsv( '/tmp/select_result.dbox', '/tmp/select_result.tsv' );
    
       # input dbox from a string, output directly to a tsv file
       $dbx = Table::BoxFormat->new( input_data => $dbox_string );
       $dbx->output_to_tsv( $output_tsv_file );

DESCRIPTION

    Table::BoxFormat is a module to work with data in the tabular text
    format(s) commonly used in database client shells (postgresql's "psql",
    mysql's "mysql", or sqlite's "sqlite3"), where a SELECT will typical
    display data in a form such as this (mysql):

      +-----+------------+---------------+-------------+
      | id  | date       | type          | amount      |
      +-----+------------+---------------+-------------+
      |  11 | 2010-09-01 | factory       |   146035.00 |
      |  15 | 2011-01-01 | factory       |   191239.00 |
      |  16 | 2010-09-01 | marketing     |   467087.00 |
      |  17 | 2010-10-01 | marketing     |   409430.00 |
      +-----+------------+---------------+-------------+

    Or this (postgresql's "ascii" form):

       id |    date    |   type    | amount
      ----+------------+-----------+--------
        1 | 2010-09-01 | factory   | 146035
        4 | 2011-01-01 | factory   | 191239
        6 | 2010-09-01 | marketing | 467087
        7 | 2010-10-01 | marketing | 409430

    These formats are human-readable, but not suitable for other purposes
    such as feeding to a graphics program, or inserting into another
    database table.

    This code presumes these text tables of "data boxes" are either stored
    in a string or saved to a file.

    This code works with at least three different formats: mysql, psql and
    unicode psql.

 implementation notes

    The main method here is read_dbox, which works by first looking for a
    horizontal ruler line near the top of the data, for example:

      +-----+------------+---------------+-------------+
      ----+------------+-----------+--------
      ────┼────────────┼───────────┼────────

    These ruler lines are used to identify the boundary columns, afterwhich
    the header and data lines are treated as fixed-width fields. Leading
    and trailing whitespace are stripped from each value.

    An earlier (now deprecated) method named read_simple takes an opposite
    approach, ignoring the horizontal rules entirely and doing regular
    expression matches looking for data delimiters on each line. In
    comparison, the read_dbox should run faster and be able to handle
    strings with delimiter characters embedded in them.

METHODS

    new

      Creates a new Table::BoxFormat object.

      Takes a list of attribute/setting pairs as an argument.

      input_encoding

	Default's to "UTF-8". Change to suit text encoding (e.g.
	"ISO-8859-1"). Must work as a perl ":encoding(...)" layer.

      output_encoding

	Like input_encoding. Default: "UTF-8".

      input_file

	File to input data from. Can be supplied later, e.g. when read_dbox
	is called. Only required if input_data was not defined directly. ((
	TODO change this: make it required ? ))

      input_data

	SQL SELECT output in the fixed-width-plus-delimiter form discussed
	above.

      the parsing regular expressions (type: RegexpRef)

	separator_rule

	  The column separators (vertical bar)

	ruler_line_rule

	  Matches the Horizontal ruler lines (typically just under the
	  header line)

	cross_rule

	  Match cross marks the horizontal bars typically use to mark
	  column boundaries.

	left_edge_rule

	  Left border delimiters (we strip these before processing).

	right_edge_rule

	  Right border delimiters (we strip these before processing).

    slurp_input_data

      Example usage:

        $self->slurp_input_data( $input_file_name );

    read_dbox

      Given data in tabular boxes from a multiline string, convert it into
      an array of arrays.

         my $data =
               $bxs->read_dbox();

      Converts the boxdata from the object's input_data into an array of
      arrays, with the field names included in the first row.

      As a side-effect, copies the header (first row of returned data) in
      the object's header, and puts some format metadata in the object's
      meta.

    analyze_ruler

      Internal method that analyzes the given ruler line and location to
      determine column widths and the dbox format.

      Returns an ordered list like so:

       format:
         'mysql', 'postgres', 'postgres_unicode', 'sqlite'
      
       header location:
         a row number: 0 or 1
      
       first_data:
         the row number where data begins:  2 or 3
      
       positions:
         a list of column boundary positions

      Example usage:

        ( $format, $header_loc, $first_data, @pos ) = $self->analyze_ruler( $line, $i );

    read_simple

      This is DEPRECATED. See read_dbox.

      Given data in tabular boxes from a multiline string, convert it into
      an array of arrays.

         my $data =
               $bxs->read_simple();

      Goes through the boxdata slurped into the object field input_data,
      returns it as an array of arrays, including the field names in the
      first row.

      As a side-effect, stores the header (first row of boxdata) in the
      object's header.

    output_to_tsv

      A convenience method that runs read_dbox and writes the data to a tsv
      file specified by the given argument.

      Returns a reference to the data (array of arrays).

      Example usage:

        $dbx->output_to_tsv( $input_dbox_file, $output_tsv_file );

      Or:

        $dbx = Table::BoxFormat->new( input_file => $input_dbox_file );
        $dbx->output_to_tsv( $output_tsv_file );

      Or:

        $dbx = Table::BoxFormat->new( input_data => $dbox_string );
        $dbx->output_to_tsv( $output_tsv_file );

    output_to_csv

      A convenience method that runs read_dbox and writes the data to a csv
      file specified by the given argument.

      Example usage:

        $dbx->output_to_csv( $input_dbox_file, $output_csv_file );

      Or:

        $dbx = Table::BoxFormat->new( input_file => $input_dbox_file );
        $dbx->output_to_csv( $output_csv_file );

      Or:

        $dbx = Table::BoxFormat->new( input_data => $dbox_string );
        $dbx->output_to_csv( $output_csv_file );

AUTHOR

    Joseph Brenner, <doom@kzsu.stanford.edu>, 05 Jun 2016

LIMITATIONS

 memory limited

    As implemented, this presumes the entire data set can be held in
    memory. Future versions may be more stream-oriented: there's no
    technical reason this couldn't be done.

 what you get is what you get

    This code is only guaranteed to cover input formats from mysql, psql
    and some from sqlite3. It may work with other databases, but hasn't
    been tested.

    At present it is not easily extensible (implementing a plugin system
    ala DBI/DBD seemed like overkill).

 sqlite3

    This code does not support the default output from sqlite3, only a
    variation with these settings:

      .header on
      .mode column

    While sqlite3 is very flexible, unfortunately the default output does
    not seem very useable:

      SELECT * from expensoids;
      |2010-09-01|factory|146035.0
      |2010-11-01|factory|218866.0
      |2011-01-01|factory|191239.0
      |2010-10-01|marketing|409430.0

    This is separated by the traditional ascii vertical bar, but without
    the usual bracketing spaces, and without any attempt at using fixed
    width columns. Somewhat oddly, the left edge has a vertical bar, but
    the right edge does not, but worse there's no header that provides
    column labels.

    If I were actually working with sqlite a lot I would turn on the header
    display and switch to fixed-width columns:

      .header on
      .mode column

    That yields output that looks like this:

      id          date        type        amount
      ----------  ----------  ----------  ----------
      1           2010-09-01  factory     146035.0
      2           2010-10-01  factory     208816.0
      3           2010-11-01  factory     218866.0

    That's very similar to the psql format using "\pset border 0" (which
    has one space column breaks instead of two): both are supported by
    read_dbox using the analyze_ruler routine.

COPYRIGHT AND LICENSE

    Copyright (C) 2016 by Joseph Brenner

    This program is free software; you can redistribute it and/or modify it
    under the terms of either: the GNU General Public License as published
    by the Free Software Foundation; or the Artistic License.

    See http://dev.perl.org/licenses/ for more information.