NAME

    Web::PageMeta - get page open-graph / meta data

SYNOPSIS

        use Web::PageMeta;
        my $page = Web::PageMeta->new(url => "https://www.apa.at/");
        say $page->title;
        say $page->image;

    async fetch previews and images:

        use Web::PageMeta;
        my @urls = qw(
            https://www.apa.at/
            http://www.diepresse.at/
            https://metacpan.org/
            https://github.com/
        );
        my @page_views = map { Web::PageMeta->new( url => $_ ) }
                @urls;
        Future->wait_all( map { $_->fetch_image_data_ft, } @page_views )->get;
        foreach my $pv (@page_views) {
            say 'title> '.$pv->title;
            say 'img_size> '.length($pv->image_data);
        }
    
        # alternativelly instead of Future->wait_all()
        use Future::Utils qw( fmap_void );
        fmap_void(
            sub { return $_[0]->fetch_image_data_ft },
            foreach    => [@page_views],
            concurrent => 3
        )->get;

DESCRIPTION

    Get (not only) open-graph web page meta data. can be used in both
    normal and async code.

    For any other than 200 http status codes during data downloads,
    HTTP::Exception is thrown.

ACCESSORS

 new

    Constructor, only "url" is required.

 url

    HTTP url to fetch data from.

 timeout

    In addition to AnyEvent::HTTP timeout will also check time during
    download as the data are being downloaded and dies when over the limit.
    Default 5 minutes.

 max_size

    Will die when the document or image size is greater than this limit.
    Default 100MB.

 user_agent

    User-Agent header to use for http requests. Default is one from Chrome
    89.0.4389.90.

 extra_headers

    HashRef with extra http request headers.

 cookie_jar

    Accepts optional HTTP::Cookies compatible object that must provide
    get_cookies() method. If set will send http cookie headers with each
    request.

 title

    Returns title of the page.

 description

    Returns description of the page.

 image

    Returns image location of the page.

 image_data

    Returns image binary data of "image" link.

    Will throw 404 exception if there is not "image" link.

 page_meta

    Returns hash ref with all open-graph data.

 extra_scraper

    Web::Scraper::LibXML object to fetch image, title or description from
    different than default location.

        use Web::Scraper::LibXML;
        use Web::PageMeta;
        my $escraper = scraper {
            process_first '.slider .camera_wrap div', 'image' => '@data-src';
        };
        my $wmeta = Web::PageMeta->new(
            url => 'https://www.meon.eu/',
            extra_scraper => $escraper,
        );

 page_body_hdr

    Returns array ref with page [$body,$headers]. Can be useful for
    post-processing or special/additional data extractions.

    Only text/html content-type is accepted for fetching.

 fetch_page_meta_ft

    Returns future object for fetching paga meta data. See "ASYNC USE". On
    done "page_meta" hash is returned.

 fetch_image_data_ft

    Returns future object for fetching image data. See "ASYNC USE" On done
    "image_data" scalar is returned.

 fetch_page_body_hdr_ft

    Returns future object for fetching page content and headers. See "ASYNC
    USE" On done "page_body_hdr" array ref is returned.

ASYNC USE

    To run multiple page meta data or image http requests in parallel or to
    be used in async programs "fetch_page_meta_ft" and fetch_image_data_ft
    returning Future object can be used. See "SYNOPSIS" or t/02_async.t for
    sample use.

SEE ALSO

    https://ogp.me/

AUTHOR

    Jozef Kutej, <jkutej at cpan.org>

LICENSE AND COPYRIGHT

    Copyright 2021 jkutej@cpan.org

    This program is free software; you can redistribute it and/or modify it
    under the terms of either: the GNU General Public License as published
    by the Free Software Foundation; or the Artistic License.

    See http://dev.perl.org/licenses/ for more information.