% TEXACCENTS.TEX
% Documentation for texaccents.sno / texaccents.spt
% version 1.0.1 -- 17th september 2022
% author: guido.milanese@unicatt.it
% tex file produced from markdown source
% using `pandoc texaccents.md -s -o texaccents.tex`
% and compiled with lualatex
% Line 35 was added: \microtypesetup{nopatch=item}
% See: https://github.com/schlcht/microtype/issues/4
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% Options for packages loaded elsewhere
\PassOptionsToPackage{unicode}{hyperref}
\PassOptionsToPackage{hyphens}{url}
%
\documentclass[
  12pt,
  english,
]{article}
\usepackage{amsmath,amssymb}
\usepackage[]{libertine}
\usepackage{iftex}
\ifPDFTeX
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc}
  \usepackage{textcomp} % provide euro and other symbols
\else % if luatex or xetex
  \usepackage{unicode-math}
  \defaultfontfeatures{Scale=MatchLowercase}
  \defaultfontfeatures[\rmfamily]{Ligatures=TeX,Scale=1}
\fi
% Use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\IfFileExists{microtype.sty}{% use microtype if available
  \usepackage[]{microtype}
  \UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
  \microtypesetup{nopatch=item}
}{}
\makeatletter
\@ifundefined{KOMAClassName}{% if non-KOMA class
  \IfFileExists{parskip.sty}{%
    \usepackage{parskip}
  }{% else
    \setlength{\parindent}{0pt}
    \setlength{\parskip}{6pt plus 2pt minus 1pt}}
}{% if KOMA class
  \KOMAoptions{parskip=half}}
\makeatother
\usepackage{xcolor}
\IfFileExists{xurl.sty}{\usepackage{xurl}}{} % add URL line breaks if available
\IfFileExists{bookmark.sty}{\usepackage{bookmark}}{\usepackage{hyperref}}
\hypersetup{
  pdftitle={TeXaccents version 1.0.1},
  pdfauthor={Guido Milanese},
  pdflang={en},
  hidelinks,
  pdfcreator={LaTeX via pandoc}}
\urlstyle{same} % disable monospaced font for URLs
\setlength{\emergencystretch}{3em} % prevent overfull lines
\providecommand{\tightlist}{%
  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\setcounter{secnumdepth}{-\maxdimen} % remove section numbering
\ifXeTeX
  % Load polyglossia as late as possible: uses bidi with RTL languages (e.g. Hebrew, Arabic)
  \usepackage{polyglossia}
  \setmainlanguage[]{english}
\else
  \usepackage[main=english]{babel}
% get rid of language-specific shorthands (see #6817):
\let\LanguageShortHands\languageshorthands
\def\languageshorthands#1{}
\fi
\ifLuaTeX
  \usepackage{selnolig}  % disable illegal ligatures
\fi

\title{TeXaccents\\
version 1.0.1}
\author{Guido Milanese\footnote{Università Cattolica d.S.C.,
  Dipartimento di scienze storiche e filologiche, via Trieste 17,
  I-25121 Brescia}}
\date{17\textsuperscript{th} September 2022}

\begin{document}
\maketitle
\begin{abstract}
TeXaccents is a standalone utility designed to convert legacy (La)TeX
ligatures and codes for ``accented'' characters to their Unicode
equivalents (text mode, no math). For example,
\texttt{\textbackslash{}=\{a\}} (`a' with macron) will be converted to
\texttt{ā}.
\end{abstract}

\hypertarget{general-information}{%
\section{General information}\label{general-information}}

Although modern compilers handle Unicode encoding, (La)TeX and BibTeX
files featuring ``legacy'' encoding for non-ASCII characters are still
very common, and users may need to incorporate old code into new texts
that make use of modern text encoding.

Several utilities are available online that claim to be able to convert
legacy (La)TeX encoding to standard Unicode. See:

\begin{itemize}
\item
  \emph{Simple LaTeX to Text Converter}. A complex programme, able to
  deal with maths. Where non-ASCII characters are concerned, it
  sometimes fails, at least according to my tests. See
  \url{https://pylatexenc.readthedocs.io/en/latest/latexwalker/}.
  Written in Python.
\item
  \emph{LaTeX handler}. Converts non-ASCII (La)TeX encoding to Unicode.
  However, it does not seem to be able to deal with the legacy encoding,
  e.g.~\texttt{\{\textbackslash{}a\}} instead of
  \texttt{\textbackslash{}\{a\}} or \texttt{\textbackslash{}a}. It does
  not convert simple ligatures such as \texttt{\textbackslash{}ae\{\}}
  or \texttt{\textbackslash{}oe\{\}}. I used the tables provided by this
  programme as a starting point. Written in Python. See
  \url{https://github.com/hayk314/LaTex-handler}.
\item
  \emph{Pandoc} is the standard programme for any text format conversion
  (\url{https://pandoc.org/}). It converts almost all the accents (thorn
  and eth missing?), but (if I have checked this correctly) it
  normalises files, stripping non-standard fields. This can be a problem
  for scholars who frequently use non-standard fields such as
  ``shorttitle'', which is required by quite a few bibliographic styles.
\end{itemize}

\emph{TeXaccents} should be able to transform (La)TeX text-mode
``accents'' (not ``math'' accents) to their Unicode equivalents. The
programme deals with the following codes (\emph{not all fonts are able
to output all the Unicode glyphs required by this table!}):

\begin{verbatim}
| NAME           | \tex   | Unicode |
|--------------- |------- |---------|
| umlaut         | \"{a}  | ä       |
| acute          | \'{a}  | á       |
| double acute   | \H{a}  | a̋       |
| grave          | \`{a}  | à       |
| circumflex     | \^{a}  | â       |
| caron (háček)  | \v{a}  | ǎ       |
| breve          | \u{a}  | ă       |
| cedilla        | \c{c}  | ç       |
| dot            | \.{a}  | ȧ       |
| dot under      | \d{a}  | ạ       |
| ogonek         | \k{a}  | ą       |
| tilde          | \~{a}  | ã       |
| macron         | \={a}  | ā       |
| bar under      | \b{a}  | a̱       |
| ring over      | \r{a}  | å       |
\end{verbatim}

The programme should recognize the following varieties:

\texttt{\textbackslash{}\textquotesingle{}a} --
\texttt{\textbackslash{}\textquotesingle{}\{a\}} --
\texttt{\{\textbackslash{}\textquotesingle{}a\}} --
\texttt{\{\{\textbackslash{}\textquotesingle{}a\}\}}

It also transforms the encoding for:
\texttt{æ~œ~Æ~Œ~ð~Ð~þ~Þ~ø~Ø~ł~Ł}.
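
As a minimal illustration, assuming only the mappings documented in the
table above and assuming that they apply to base letters other than `a'
and `c' (this is a sketch with invented sample text, not the output of
an actual run), an input line such as

\begin{verbatim}
Dvo\v{r}\'{a}k, \c{C}a\u{g}layan, M\"{u}ller, L\`{a}
\end{verbatim}

would be expected to come out as

\begin{verbatim}
Dvořák, Çağlayan, Müller, Là
\end{verbatim}

The sample deliberately uses only the \texttt{\textbackslash{}\textquotesingle{}\{a\}}
style: whether the outer braces of forms such as
\texttt{\{\textbackslash{}\textquotesingle{}a\}} are kept in the output
is not stated here and is best checked against the provided test file.
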
Checking the page
\url{https://www.utf8-chartable.de/unicode-utf8-table.pl?number=512},
I could not find a legacy text-mode encoding for:
\textbf{ƀ~Ƀ~đ~Đ~ǥ~Ǥ~ħ~Ħ~ɨ~Ɨ~ŧ~Ŧ~ƶ~Ƶ} (some characters are accessible in
math mode).

\hypertarget{setup}{%
\section{Setup}\label{setup}}

\hypertarget{from-source}{%
\subsection{From source}\label{from-source}}

The programme is written in Snobol
(\url{https://en.wikipedia.org/wiki/SNOBOL} or
\url{https://it.wikipedia.org/wiki/SNOBOL}) and should run on any
platform. Steps:

\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\item
  Install Snobol4 (version 2.3, March 2022) from
  \url{http://www.regressive.org/snobol4/csnobol4/curr/}. Make sure to
  install the compiler in a folder listed in your \texttt{PATH}, or add
  the folder to your \texttt{PATH}. On Linux the folder \texttt{snobol4}
  is installed under \texttt{/usr/local/bin/}, which is normally listed
  in the \texttt{PATH} of a standard Linux system.
\item
  Test the compiler by running \texttt{snobol4} from the command line.
  Exit the compiler with \texttt{Ctrl-C} or by typing \texttt{end}.
\item
  Copy \texttt{texaccents.sno} and all the provided \texttt{*.inc} files
\end{enumerate}

\begin{quote}
\texttt{compiler.inc}~\texttt{delete.inc}~\texttt{grepl.inc}~\texttt{newline.inc}~\texttt{systype.inc}
\end{quote}

to a folder of your choice
(e.g.~\texttt{/home/\textless{}user\textgreater{}/bin}).

\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{3}
\tightlist
\item
  In this folder, run
  \texttt{snobol4\ texaccents.sno\ testaccents-in\ testaccents-out} to
  test the programme. The test file contains all the accents listed
  above. View the result by typing \texttt{cat~testaccents-out}
  (Unix-like systems / PowerShell) or \texttt{type~testaccents-out}
  (Windows/DOS prompt), or open the file with your text editor. The
  output file name is just a suggestion, of course.
\end{enumerate}

\hypertarget{windows-standalone-version}{%
\subsection{Windows standalone version}\label{windows-standalone-version}}

A standalone Windows EXE file is also provided. It was compiled using
Spitbol (see \url{https://github.com/spitbol/windows-nt}); the source
code has been slightly adapted to Spitbol (basically only the
input/output syntax). From any directory, run
\texttt{texaccents.exe\ INPUT\ OUTPUT}. To test the programme, run
\texttt{texaccents.exe\ testaccents-in\ testaccents-out}. As above, the
output file name is just a suggestion.

\hypertarget{history}{%
\section{History}\label{history}}

\begin{itemize}
\tightlist
\item
  25\textsuperscript{th} July 2022. First version (after trying
  unsuccessfully to convert an old file with existing utilities).
\item
  17\textsuperscript{th} August 2022. First complete version (0.9).
\item
  27\textsuperscript{th} August 2022. Version 1.0, with documentation
  and comments.
\item
  17\textsuperscript{th} September 2022. This version (1.0.1). Windows
  standalone executable. Manual page written. Version message added;
  help message improved. A regular shebang was added to the source,
  following the CTAN recommendation
  (\url{https://tug.org/texlive/pkgcontrib.html}). Documentation updated
  accordingly.
\end{itemize}

\hypertarget{contacts-todo}{%
\section{Contacts / todo}\label{contacts-todo}}

Bugs / suggestions / improvements: please write to
\url{guido.milanese@unicatt.it} using \emph{TeXaccents} as the subject
of the email.

Genoa, Italy, 17\textsuperscript{th} September 2022

\end{document}