 # Annotation of /Spoonerism/spoonerism.old.tex

Revision 1.1.1.1 - (hide annotations) (vendor branch)
Mon Nov 6 09:49:08 2006 UTC (12 years, 4 months ago) by riso
Branch: riso, MAIN
Initial import.

 1 riso 1.1 \documentclass[11pt, a4paper]{article} 2 3 \usepackage{epsfig} 4 \usepackage{graphicx} 5 \usepackage{amssymb} 6 \usepackage{alltt} 7 \usepackage{amsmath} 8 9 %\renewcommand{\baselinestretch}{1.5} 10 11 \newtheorem{theorem}{Theorem} 12 \newtheorem{lemma}{Lemma} 13 \newtheorem{definition}{Problem} 14 \newtheorem{invariant}{Invariant} 15 \newtheorem{restr}{Restriction} 16 \newtheorem{fact}{Fact} 17 \newtheorem{ext}{Extension} 18 \newtheorem{property}{Property} 19 \newtheorem{observation}{Observation} 20 21 22 \setlength{\oddsidemargin}{0pt} 23 \setlength{\textwidth}{6.3in} 24 \setlength{\textheight}{610pt} 25 \newcommand{\keyw}{{\bf #1}} 26 27 \newenvironment{sketch}{\noindent {\em Sketch}.\ }{\proofbox\par\smallskip\par} 28 \newenvironment{proof}{\noindent {\em Proof}.\ }{\proofbox\par\smallskip\par} 29 \newcommand{\halmos}{\rule{1ex}{1.4ex}} 30 \newcommand{\proofbox}{\hspace*{\fill}\mbox{$\halmos$}} 31 32 \begin{document} 33 34 \title{The Spoonerism Problem} 35 36 \author{M. Bender, R. Clifford, K. Steinhofel and K. Tsichlas} 37 38 %\date{} 39 40 41 \maketitle 42 43 \begin{abstract} 44 A sentence has a Spoonerism if it is possible to swap the symbols at two positions in the sentence and leave a sentence that is still correctly spelt. This paper considers the complexity of finding such a Spoonerism when a) there are no spaces between words and b) the length of a sentence is not bounded. Algorithms that run in $O(n^2 k^2)$ and $O(\sigma n k (\sigma+ k \log{n}))$ time are presented where $n$ is the length of the sentence, $k$ is the length of longest word in the dictionary and $\sigma$ is the alphabet size. 45 46 \end{abstract} 47 48 49 \section{Introduction} 50 51 Natural language processing has provided many challenging problems for those involved string matching and analysis. We consider here the problem of finding \emph{Spoonerisms} in a text. The word comes from the Reverend W. A. Spooner (born 1844) and in normal English means an accidental transposition of the initial sounds, or other parts, of two or more words" \cite{OED}. Spooner was rather prone to making accidental mistakes in speech. Classic examples include changing a well-oiled bicycle" into "a well-boiled icicle" and once proposing a toast to Her Highness Victoria as: Three cheers for our queer old dean!". 52 53 54 To formalise the problem we assume a sentence $s$ of length $n$ where each symbol in the sentence represents a phoneme. This is accompanied by a dictionary $\mathcal{D}$ of correctly spelt words that have also been translated into phonetic form. We say that $s$ is \emph{correctly spelt} by $\mathcal{D}$ if $s$ is the concatenation of words from $\mathcal{D}$. The task is to find if the symbols at two positions of the text can be swapped leaving a string that is still correctly spelt by $D$. For example, consider the phrasewave the sails". By swapping the sounds at the beginning of the first and last words we get "save the whales" which is correctly spelt in English. 55 56 In this paper we consider only a single swap and allow it to be occur at any place in a word. Further, to generalise the problem we do not assume that there are spaces separating words. By allowing a swap to be made with the position between symbols in sentence, our methods can also be extended to a single transposition of sounds, such as from tease ears" to ease tears". 57 58 \subsection*{Results} 59 In this paper we present the following results. 60 61 \emph{Na\"{i}ve solutions} We first give a na\"{i}ve algorithm for solving the Spoonerism problem in Section~\ref{sec:21} that considers all $O(n^2)$ different possible swaps independently. By constructing a graph of the different possibles parses of the text, this simple approach gives an $O(n^3 k)$ time algorithm, where $k$ is the maximum length of a word in $\mathcal{D}$. 62 63 \emph{Paramerisation} In Sections~\ref{sec:22} and \ref{sec:23} the time complexity is improved to $O(n^2(k^2+nl))$, where $l\leq k$, and $O(n^2 k^2)$. This is achieved by observing is that a change in a single position results in only localised changes to the graph of the different possible parses. 64 65 \emph{Divide and conquer} Finally, a divide and conquer algorithm is given in Section~\ref{sec:24} that solves the Spoonerism problem in $O(\sigma n k^2 \log{n})$ time. Where $\sigma$ is small compared to $n$ and $k$ this is the fastest solution presented. 66 67 68 \section{The Spoonerism Problem} 69 70 In this problem we are given: 71 72 \begin{enumerate} 73 74 \item An alphabet $\Sigma=\{1,2,\ldots\,\sigma\}$ 75 76 \item A dictionary $\mathcal{D}=\{w_{1},w_{2},\ldots\}$, where $w_{i}\in \Sigma^*$ 77 78 \item A sentence $s=s_{1},s_{2},\ldots,s_{n}$, $1 \leq s_{j} \leq \sigma$. 79 80 \end{enumerate} 81 82 The sentence can be represented as a sequence of words from $\mathcal{D}$. Note that there may be many different word sequences from $\mathcal{D}$ that represent the same sentence $s$. A sentence is correctly spelled if there is a sequence of words that represents it. Assume that $\varepsilon$ is the empty string. 83 84 We are asked to find whether we can swap the symbols at two positions in a correctly spelled sentence $s$ and get a new correctly spelled sentence $s'$. 85 86 87 \section{Algorithms} 88 89 In the following, we are psoposing solutions for the spoonerism problem. The complexities of these solutions is $\Omega(n^4)$, but we will gradually parametarize them introducing variables such as the size of the alphabet and the length of the largest word in the dictionary. 90 91 The algorithms presented are based on the graph representation of a sentence $s$. In particular, a graph $G_{s}=(V,E)$ for sentence $s$ represents all possible words that occur in $s$. There is a node $v_{i}$ for each position $i$ of the sentence $s$. There is an edge between nodes $v_{i}$ and $v_{j}$ when $s[i,\ldots,i+j-1]\in \mathcal{D}$. Clearly, if $k$ is the largest word in the dictionary then the number of edges of the graph will be $O(nk)$. Similarly, if $\ell$ is the maximum number of words for which there is a total order of prefix relations, then the size of the graph can be at most $n\ell$. Finally we add one start node $start$ that points to node $v_{1}$ and one end node $end$ that is pointed by $v_{n}$. 92 93 94 95 \subsection{The Na\"{i}ve Algorithm}\label{sec:21} 96 97 The na\"{i}ve algorithm generates all possible sentences by one swap operation from sentence $s$ and then it checks each one until it finds one correctly spelled sentence. At most $\binom{n}{2}$ sentences are generated and each one must be checked. This na\"{i}ve method has a time complexity of $O(n^4)$. 98 99 However, we can reduce this complexity to $O(n^3k)$ by making use of the graph $G$. From each position at most $\ell$ words can start. We construct a graph $G_{s}$. This graph has $n$ vertices and $O(nl)$ edges. Then the goal is to find a path from vertex $start$ to vertex $end$. This can be accomplished in time $O(n\ell)$ by a simple depth first search traversal. The overall complexity of the algorithm is $O(n^3k)$, since there are $O(n^{2})$ swaps. 100 101 102 \subsection{Introducing $\ell$ in the Complexity}\label{sec:22} 103 104 We will build on the na\"{i}ve method. Our algorithm is based on the following property. 105 106 107 \begin{property}\label{pro:local} 108 Changing a character at a position of $s$ results in only localised changes for graph $G_{s}$. 109 \end{property} 110 111 First we construct the graph $G_{s}$ for sentence $s$ in $O(nk)$ time. The idea is that the construction time for $G_{s}$ is $O(nk)$ while its traversal to find a path from the start to the end is $O(n\ell)$, where $\ell\leq k$ if there are no repetitions of words in the dictionary. 112 113 Assume a swap operation between positions $i$ and $j$ in $s$ resulting in sentence $s'$. Graph $G_{s'}$ can be constructed by $G_{s}$ in $O(k^{2})$ time because the change of position $i$ can affect the edges of the positions $i-k+1,\ldots, i$ and each position needs at most $O(k)$ processing time to add the new edges and remove the old ones. After constructing $G_{s'}$ we apply the depth first search and in $O(nl)$ time we can find whether there is a path from $start$ to $end$. Then we go back from $G_{s'}$ to $G_{s}$ by the reverse swap operation in $O(k^{2})$ time. Consequently, the total time complexity is $O(n^{2})$ for all swaps times $O(k^{2})$ for the construction step and $O(nl)$ for the DFS. The total time complexity of the algorithm is therefore $O(n^{2}(k^{2}+nl))$. 114 115 116 \subsection{Replacing a Factor of $n$}\label{sec:23} 117 118 We try to parametarize even further the algorithm by exploiting the localised construction in the computation of the paths in graph $G_{s}=(V,E_{s})$ for the initial sentence $s$. First we construct $G_{s}$ in $O(nk)$ time. Then, we compute the transitive closure of $G_{s}$ by using one of the known $O(|V||E|)$ algorithms for static transitive closure. This is stored in an $O(n^{2})$ size matrix $T_{s}$. In our case the time complexity for this step is $O(n^{2}\ell)$ since $|V|=n+2$ and $|E|=n\ell$. 119 120 \begin{figure} 121 \centering 122 \includegraphics[scale=0.8]{swap_op.eps} 123 \caption{The setting around one of the swap positions.} 124 \label{fig:swap_op} 125 \end{figure} 126 127 As in the previous algorithm we construct the graph $G_{s'}$ from $G_{s}$ for the sentence $s'$ that comes from $s$ after a swap operation in $O(k^{2})$ time. Then we consider the two positions in $s$ that changed due to the swap operation. Assume that these positions are $i$ and $j$. 128 129 At most $k-1$ vertices before and after $i$ and $j$ are affected due to edge changes (see Figure~\ref{fig:swap_op}). With no loss of generality we assume that $i < j$. The $k-1$ vertices before $i$ are put in a set $B_{i}$. In the same way we define the set $A_{i}$. The following observation is straightforward (see Figure~\ref{fig:swap_op}). 130 131 \begin{observation} \label{obs:edges} 132 For all vertices in $B_{i}$ only outgoing edges will change. For all vertices in $A_{i}$ only incoming edges will change. Vertex $i$ is the only node at which both incoming and outgoing edges will change. 133 \end{observation} 134 135 By Observation~\ref{obs:edges} we deduce that matrix $T_{s}$ will correctly answer in $O(1)$ time reachability queries of the form $(start,p)$, where $start$ is the start of the graph and $p\in B_{i}$. The same goes for queries of the form $(q,end)$, where $q\in A_{i}$. There are two cases depending on the distance of $i$ and $j$. 136 137 \begin{itemize} 138 139 \item[$j-i>2k$:]{In this case one position does not affect the other. Take each possible edge $(p,q)$ from $p\in B_{i}$ to $q\in A_{i}$. Reachability from $1$ to $q$ is easy to find by making a query $(start,p)$ and then following the edge to $q$. All edges of $q$ that are connected to $start$ are marked. The time needed for this step is $O(k\ell)$. Take all edges $(p,i)$, $p\in B_{i}$. Then follow the edges $(i,q)$, $q\in A_{i}$. Again we mark all nodes $q$ where there is a pathfrom $start$. The time for this step is also $O(k\ell)$. Symmetrically we do the same for vertex $j$ and sets $B_{j}$ and $A_{j}$. However, we mark all nodes of $B_{j}$ so that there is a path from these nodes to $end$. The time needed for this procedure is also $O(kl)$. 140 141 The only remaining detail concerns reachability queries between $A_{i}$ and $B_{j}$. Since this part of the graph has not been affected by the swap operation we can use $T_{s}$ to answer reachability queries in $O(1)$ time. We do that for all marked combinations between $A_{i}$ and $B_{j}$ until we find one such path. In the worst-case, the total time complexity will be $O(k^{2})$. If we find a path from $start$ to $end$ then $s'$ is a correctly spelled sentence.} 142 143 \begin{figure} 144 \centering 145 \includegraphics[scale=0.7]{swap_nearop.eps} 146 \caption{The setting around the swap positions when they are near.} 147 \label{fig:swap_nearop} 148 \end{figure} 149 150 \item[$j-i\leq 2k$:]{This setting is depicted in Figure~\ref{fig:swap_nearop}. Let $M_{i,j}$ be the set of vertices between $i$ and $j$, where $|M_{i,j}| \leq 2k-1$. The problem here is that there is a set of vertices $M_{i,j} \cup \{i\} \cup \{j\}$ where incoming and outgoing edges change at the same time. 151 152 First, we find deduce reachability from $start$ to $p$, for each $p\in B_{i}$ in $O(k)$ time and we mark these nodes. Then, we deduce reachability from $q$ to $end$, for each $q \in A_{j}$ in $O(k)$ time. Note that $T_{s}$ can be used for reachability queries in these areas because they have not changed. From $B_{i}$, $O(k\ell)$ edges point to $A_{j}$. We check reachability in $O(k\ell)$ time and if we find one such edge that goes from a marked node in $B_{i}$ to a marked node in $A_{j}$ we knwo that there is a path. Otherwise, we move to edges from nodes in $B_{i}$ to nodes in set $M'=M_{i,j}\cup \{i\} \cup \{j\}$. 153 154 First we check for reachability from a node $r\in M'$ to $end$, starting from node $j$ and backwards until we reach $i$. We check all edges from $j$ to $A_{j}$ and we mark $j$ if there is an edge that points to a marked node in $A_{j}$. In general, we check all outgoing edges from $r$. If this edge points to a marked node (note that this node may be already in $M'$) we also check $r$ and move to $r-1$ (this is easy to do by just keeping backwards pointers from vertex $t$ to $t-1$). At thispoint all we have to do is to check whether the edges that leave vertices in $B_{i}$ point to marked vertices in $M'$. If this is the case then there is a path from $start$ to $end$ and sentence $s'$ is correctly spelled.} 155 156 \end{itemize} 157 158 All in all, the time complexity of this algorithm is $O(n^{2})$ (the total number of swaps) times the time to construct and compute the paths. As a result, the total time complexity will be $O(n^{2}k^{2})$. The space complexiti is $O(n^2)$ because we have to store the matrix for the reachability queries. 159 160 161 162 163 \subsection{Introducing the Size of the Alphabet in the Complexity} \label{sec:24} 164 165 The algorithm we propose in this section is a diveide and conquer algorithm with two stages, one preprocessing stage and one reporting stage in each recursion. Its time complexity is $O(n^{2}k+n|\Sigma|^{2}k+|\Sigma|nk^{2}\log{n})$. The $n^2k$ part of the complexity corresponds to the reporting stage. The $n|\Sigma|^{2}k$ corresponds to the preprocessing for small regions while the $|\Sigma|nk^{2}\log{n}$ part corresponds to the preprocessing for big regions. The algorithm is based on the following straightforward observation. 166 167 \begin{property}\label{pro:changes} 168 Each position of $s$ changes at most $|\Sigma|-1$ times as a result of all swap operations. 169 \end{property} 170 171 In a nutshell, we will preprocess the sentence $s$ for each possible change in the characters in all positions. Then, we will use this stored information to answer quickly (in $O(k)$ steps) reachability queries. This procedure is implemented by a divide and conquer approach. In fact, in each recursive level of the algorithm we do first the preprocessing and then the reporting. 172 173 \begin{figure} 174 \centering 175 \includegraphics[scale=0.9]{recursion.eps} 176 \caption{The algorithm recurses on both areas that have an overlapping area of size $k$.} 177 \label{fig:recursion} 178 \end{figure} 179 180 We recursively split the sentence into two subsentences that overlap. The overlapping region has length $k$. This region will be used to speed up reachability queries. The recursion stops when the size of the regions is less than $4k$. 181 182 For the description of the algorithm assume a region $R$ that will be split into two regions $R_{l}$ and $R_{r}$ (the left and the right region). The overlpapping region $V$ has length $k$. For each region, for each position and for each letter we precompute the reachability information from the first node of the $R_{l}$ to all nodes in $V$ and from all nodes in $V$ to the last node in $R_{r}$. We can do that for each position in $O(k^{2})$ time following a procedure similar to that of \ref{sec:23}. We store this information in a list attached to each position and identified by the new letter we put in. 183 184 In this recursion level, after we finish with the processing we move to the reporting stage. So, for each possible swap between $R_{l}$ and $R_{r}$ we check whether the new sentence is correctly spelled. This can be accomplished in $O(k)$ time because we only need to deduce reachability by using region $V$ between the two positions and it can be done by checking each position in $V$ in constant time by using the stored information. 185 186 We apply this procedure recursively until we end up with a region of size at most $4k$. In this region, we find the reachability information for each one of the $O(k^2)$ swaps and $|\Sigma|^{2}$ letters for each swap. This means that for each such region $O(|\Sigma|^{2}k^{2})$ steps are needed. 187 188 We will not elaborate on the proof of the time complexity of this algorithm but we will only provide a small sketch. For the reporting stage, since the number of swaps in total is $O(n^{2})$ and for each such swap we need $O(k)$ time, in total this stage has $O(n^{2}k)$ time complexity. For the preprocessing stage, the time complexity is the sume of the time complexities for the case where the regions are $>4k$ and for the case where they are $\leq 4k$. 189 190 In the latter case, the time complexity will be the time needed for each such region, which is $O(|\Sigma|^2k^2)$, mulptiplied by the number of regions, which is $O(\frac{n}{k})$. In total we get a complexity of $O(n|\Sigma|^{2}k)$. In the former case, a simple recursive formula will provide us with a complexity of $O(n|\Sigma|k^{2}\log{n})$. Intuitively, this time complexity is a straightforward effect of the divide and conquer algorithm. In each recursive level we need $O(n|\Sigma|k^{2})$ time to get the necessary reachability information. Since the recursion always breaks one region into two subregions of the same size with a relatively small overlap (recall that the recursion stops when the size of the region is $\leq 4k$), we get the logarithmic factor. The space complexity is dominated by the space usage of the first recursion level since we always reuse space. The size of the lists for each position in the first recursive level is $O(|\Sigma|k^{2})$. In total the space complexity is $O(n|\Sigma|k^{2})$. 191 192 All in all, we proposed a quite complicated algorithm with time complexity $O(n^{2}k+n|\Sigma|k(|\Sigma|+k\log{n}))$ and space complexity $O(n|\Sigma|k^{2})$. 193 194 \end{document} 195