%========================================================================%
% @noweb-file{
% author = "C.P. Jobling",
% version = "$Revision: 1.5 $",
% date = "$Date: 1995/04/01 13:54:43 $",
% filename = "correct-refs.nw",
% address = "Control and computer aided engineering
%            Department of Electrical and Electronic Engineering
%            University of Wales, Swansea
%            Singleton Park
%            Swansea SA2 8PP
%            Wales, UK",
% telephone = "+44-792-295580",
% FAX = "+44-792-295686",
% checksum = "",
% email = "C.P.Jobling@Swansea.ac.uk",
% codetable = "ISO/ASCII",
% keywords = "",
% supported = "yes",
% abstract = "Postprocessing routines to correct link
%             errors introduced by \verb|rawhtml| when
%             anchors are moved into sub-nodes of the
%             main HTML document.",
% docstring = "The checksum field above contains a CRC-16
%              checksum as the first value, followed by the
%              equivalent of the standard UNIX wc (word
%              count) utility output of lines, words, and
%              characters. This is produced by Robert
%              Solovay's checksum utility.",
% }
%========================================================================
\documentclass[a4paper]{article}
\usepackage{noweb,html,multicol}

\newcommand{\noweb}{\texttt{noweb}}
\newcommand{\command}{\texttt{correct-xref}}

\title{\command \\ Correct HTML In-Document Anchors and Hyperlinks}
\author{C.P. Jobling \\ University of Wales, Swansea \\
        (C.P.Jobling@Swansea.ac.uk)}
\date{$Date: 1995/04/01 13:54:43 $ \\ Version $Revision: 1.5 $}

\begin{document}
\maketitle

\begin{abstract}
Postprocessing routines to correct link errors introduced by
\verb|rawhtml| when anchors are moved into sub-nodes of the main HTML
document.
\end{abstract}

\tableofcontents

<>=
# This file is part of the correct-refs.nw package which is
# Copyright (C) C.P. Jobling, 1995
#
# The code may be freely used for any purpose whatever
# provided that it distributed with
# the noweb source correct-refs.nw and that this copyright
# notice is kept intact.
#
# Please report any problems, errors or suggestions to
# Chris Jobling, University of Wales, Swansea
# email: C.P.Jobling@Swansea.ac.uk
# www-home page: http://faith.swan.ac.uk/chris.html/cpj.html
#
# $Id: correct-refs.nw,v 1.5 1995/04/01 13:54:43 eechris Exp $
@

\section{Purpose}

When a \LaTeX $+$ HTML document is processed by \LaTeX 2HTML the
resulting HTML document consists of a set of smaller documents, called
``nodes''. Cross-referencing between nodes works well if \LaTeX{}
\verb|\label| and \verb|\ref| commands have been used to establish the
cross-references. However, links to HTML anchors created with the
\verb|rawhtml| environment, e.g.
\begin{verbatim}
I want \begin{rawhtml}<a name="anchor">this</a>\end{rawhtml}
to be an anchor.
  :
  :
And now I want to go back to the anchor: follow this
\begin{rawhtml}<a href="#anchor">link</a>\end{rawhtml}
\end{verbatim}
will only work if the link and the anchor happen to be in the same
node. This can only be guaranteed when the \verb|-split 0| option is
used.

Why is this a problem? The answer is that I am interested in
``literate programming'' using \noweb{} \cite{ramsey94}, which has an
option to create a \LaTeX{} file (for its support of fancy mathematics,
tables and figures) with code chunks and code and variable
cross-references formatted using \verb|rawhtml|. These
cross-references are exactly as described above.

For largish programs, particularly those that make heavy use of
mathematics, tables and figures in the documentation sections (exactly
the kind that I write!), it is inconvenient to have a single HTML file,
so I want to have the document split into pieces. Hence, I have to
postprocess the resulting HTML files, changing links so that the node
name of the anchor is included.

For the previous example, say
\begin{verbatim}
I want \begin{rawhtml}<a name="anchor">this</a>\end{rawhtml}
to be an anchor.
\end{verbatim}
ends up as:
\begin{verbatim}
I want <a name="anchor">this</a> to be an anchor.
\end{verbatim}
in \texttt{node1.html}. Then any links to this anchor, which are of the
form
\begin{verbatim}
And now I want to go back to the anchor: follow this
<a href="#anchor">link</a>
\end{verbatim}
have to be changed to:
\begin{verbatim}
And now I want to go back to the anchor: follow this
<a href="node1.html#anchor">link</a>
\end{verbatim}
in every node (except \texttt{node1.html} itself).

In the best UNIX and \noweb{} tradition, the tools to achieve this are
written in a mixture of \texttt{csh} \cite[Chapter 5]{nutshell-unix}
and \texttt{awk} \cite{awk-book,gawk-manual}
\cite[Chapter 10]{nutshell-unix}, although one day I might redo them in
\texttt{perl} \cite{perl-llama,perl-camel} for better portability.

\section{Usage}

The command to correct the cross-references is
\begin{quote}
\command{} {\it html-dir}
\end{quote}
where {\it html-dir} is the document directory created by
\LaTeX 2HTML from {\it html-dir.tex}.

\section{The Code}

\subsection{Finding the anchors}

On examining the HTML created by \LaTeX 2HTML, I noticed that all
anchors created from \LaTeX{} \verb|\ref|, \verb|\tableofcontents|,
\verb|\index| and by the cross-referencing done by \LaTeX 2HTML itself
are of the form
\begin{verbatim}
<A NAME=label>blah, blah</A>
\end{verbatim}
That is, the label name is not enclosed in double quotes. So, provided
that the anchors that you create in your \verb|rawhtml| environments
are of the form
\begin{verbatim}
<A NAME="label">Blah, blah</A>
\end{verbatim}
\command{} will be able to distinguish between anchors and links
created by \LaTeX 2HTML and anchors and links created by \noweb{} or by
the author of the original \LaTeX{} document.
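
As a rough illustration of this distinction, here is a minimal sketch
that is not part of the \command{} tools (the file name
\texttt{classify-anchors.awk} is hypothetical). It reports whether each
\texttt{NAME} attribute it meets is quoted or not. It assumes that an
attribute contains no white space, so that it arrives as a single
\texttt{awk} field, and it makes no attempt to skip tags in the head of
a node.
\begin{verbatim}
# classify-anchors.awk --- illustrative sketch only (hypothetical file)
# usage: [gn]awk -f classify-anchors.awk *.html
# Assumption: each NAME attribute is one whitespace-separated field.
{
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^(name|NAME)="[^"]*"/)
            # quoted name: author or noweb anchor
            printf("%s: quoted   %s\n", FILENAME, $i)
        else if ($i ~ /^(name|NAME)=/)
            # unquoted name: taken to be a LaTeX2HTML anchor
            printf("%s: unquoted %s\n", FILENAME, $i)
    }
}
\end{verbatim}
Only the fields reported as quoted are candidates for correction; the
unquoted ones are left to \LaTeX 2HTML, which already links them
correctly across nodes.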

Here is an \texttt{awk} script to extract all the anchors from a series
of documents and to output them as a list in the form
\begin{verbatim}
filename:anchor-name1
filename:anchor-name2
\end{verbatim}

<>=
# list-anchors.awk --- process a set of .html files and list
#                      anchors in form filename:anchor-name
# usage: [gn]awk -f list-anchors.awk *.html
#
<>
{
  <>
    for (i = 1; i <= NF; i++)
      <>
}
@ \LaTeX 2HTML adds \verb|| tags to the head of all nodes. These could
confuse the anchor finder, so we have to throw them away.
<>=
if ($1 != ">= if ($i ~ /^(name|NAME)\=\".*\".*$/) {
  anchor = $i
  sub(/^(name|NAME)\=\"/,"",anchor)
  sub(/\".*$/,"",anchor)
  printf("%s:%s\n",FILENAME,anchor)
}
@ To use this program to create the anchor list:
<>=
cd $DOC
echo Creating list of anchors
gawk -f $LIB/list-anchors.awk *.html | sort -u >! anchor-list
@

\subsection{Generating link editing scripts}

In order to correct the links in the HTML files, we now use the
{\it anchor-list} to control the creation of a set of \texttt{awk}
scripts, one per HTML file, which will edit the links and replace
\begin{verbatim}
<a href="#anchor">link</a>
\end{verbatim}
by
\begin{verbatim}
<a href="node1.html#anchor">link</a>
\end{verbatim}
unless the anchor and link happen to be in the same file. The
processing is again done with a \texttt{[gn]awk} script.

<>=
# awk-scripts.awk --- process a file containing node:anchor
#                     information to create a set of awk
#                     scripts to correct the links in the nodes.
# usage: [gn]awk -f awk-scripts.awk anchor-list
#
<>
BEGIN {FS = ":"}
{
  <>
}
END {
  <>
}
@ The first part of this script reads each entry in the
{\it anchor-list} and stores the information in arrays for later use.
<>=
if (! ($1 in files)) {
  files[$1] = NR
}
prefix[NR] = $1
anchor[NR] = $2
@ To produce the \texttt{awk} scripts we loop over each unique file
name encountered:
<>=
for (file in files) {
  <>
  <>
}
@ To create a new \texttt{awk} script:
<>=
<>
<>
@ The first thing we need to do is to take the root of the HTML file
name and append \texttt{.awk}.
<>=
filename = file
sub(/\..*$/,".awk",filename)
@ Next we open the file using redirection.
<>=
printf("# awk script to correct HTML links in %s\n",file) > filename
@ To create the \texttt{awk} code, we have to loop over each item in
the list of anchor names. We throw away any that have the same file
name as the currently open file. It is fine to leave these as
\begin{verbatim}
<a href="#anchor">link</a>
\end{verbatim}
because the anchor is in the same file as the link. The rest get the
file name prepended to each use of the anchor name in all links. When
\texttt{awk-scripts.awk} is run, each generated script will look like:
\begin{verbatim}
{ gsub(/href=\"#anchor-1\"/,"href=\"node1.html#anchor-1\"") }
{ gsub(/href=\"#anchor-2\"/,"href=\"node1.html#anchor-2\"") }
  :
  :
{ gsub(/href=\"#anchor-n\"/,"href=\"nodem.html#anchor-n\"") }
{ print $0 }
\end{verbatim}
The actual string printing command doesn't look much like this because
of all the escaping that has to be done.
<>=
# anchors are stored at indices 1..NR, so the loop must include NR
for (i=1; i <= NR; i++) {
  <>
    printf("{ gsub(/href=\\\"#%s\\\"/,\"href=\\\"%s#%s\\\"\") }\n",
           anchor[i],prefix[i],anchor[i]) >> filename
}
printf("{ print $0 }") >> filename
close(filename)
@
<>=
if (prefix[i] != file)
@ To create the \texttt{awk} scripts:
<>=
echo Creating awk scripts
gawk -f $LIB/awk-scripts.awk anchor-list
@
<>=
echo Processing HTML nodes
foreach f (*.awk)
  set root=$f:r
  set tmpfile=`mktemp --suffix=.html`
  echo -n Processing $root.html
  gawk -f $f < $root.html >! $tmpfile
  echo "..." Done
  cp $root.html $root.html.bak
  cp $tmpfile $root.html
end
@

\subsection{\texttt{csh} Script to do anchor correction}

<>=
#!/usr/bin/csh
# correct-refs.csh - CSH script file to correct HTML links and anchors
# usage: correct-refs HTMLDIR
<>
unalias rm
set LIB = $HOME/lib
set DOC = $argv[1]
echo Correcting anchors in HTML dir $DOC
<>
<>
<>
<>
@
<>=
rm *.awk
rm anchor-list
echo Done!
@

\bibliographystyle{plain}
\bibliography{refs}

\section*{Code Chunks}

\nowebchunks

\section*{Revision History}

\begin{description}
\item[1.3 to 1.4] The {\tt sed} script didn't work for \noweb{}
  generated long anchor and link names. Apparently the substitute
  command string was too long! I changed the {\tt sed} scripts to
  {\tt awk} scripts. Also tidied a few bits of the documentation.
\end{description}

\end{document}