% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Levenshtein.R
\name{Levenshtein}
\alias{Levenshtein}
\title{Levenshtein String/Sequence Comparator}
\usage{
Levenshtein(
  deletion = 1,
  insertion = 1,
  substitution = 1,
  normalize = FALSE,
  similarity = FALSE,
  ignore_case = FALSE,
  use_bytes = FALSE
)
}
\arguments{
\item{deletion}{positive cost associated with deletion of a character
or sequence element. Defaults to unit cost.}

\item{insertion}{positive cost associated with insertion of a character
or sequence element. Defaults to unit cost.}

\item{substitution}{positive cost associated with substitution of a
character or sequence element. Defaults to unit cost.}

\item{normalize}{a logical. If TRUE, distances are normalized to the
unit interval. Defaults to FALSE.}

\item{similarity}{a logical. If TRUE, similarity scores are returned
instead of distances. Defaults to FALSE.}

\item{ignore_case}{a logical. If TRUE, case is ignored when comparing
strings. Defaults to FALSE.}

\item{use_bytes}{a logical. If TRUE, strings are compared byte-by-byte
rather than character-by-character. Defaults to FALSE.}
}
\value{
A \code{Levenshtein} instance is returned, which is an object of an S4 class
inheriting from \code{\linkS4class{StringComparator}}.
}
\description{
The Levenshtein (edit) distance between two strings/sequences \eqn{x} and
\eqn{y} is the minimum cost of operations (insertions, deletions or
substitutions) required to transform \eqn{x} into \eqn{y}.
}
\details{
For simplicity, we assume \code{x} and \code{y} are strings in this section.
However, the comparator is also implemented for more general sequences.

A Levenshtein similarity is returned if \code{similarity = TRUE}, which
is defined as
\deqn{\mathrm{sim}(x, y) = \frac{w_d |x| + w_i |y| - \mathrm{dist}(x, y)}{2},}{sim(x, y) = (w_d |x| + w_i |y| - dist(x, y))/2}
where \eqn{|x|}, \eqn{|y|} are the number of characters in \eqn{x} and
\eqn{y} respectively, \eqn{\mathrm{dist}}{dist} is the Levenshtein distance,
\eqn{w_d} is the cost of a deletion and \eqn{w_i} is the cost of an
insertion.
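
As a worked illustration, assuming unit costs (\eqn{w_d = w_i = 1}), take
\code{x = "Iran"} and \code{y = "Iraq"}. A single substitution transforms
\code{x} into \code{y}, so \eqn{\mathrm{dist}(x, y) = 1}{dist(x, y) = 1} and
\deqn{\mathrm{sim}(x, y) = \frac{4 + 4 - 1}{2} = 3.5.}{sim(x, y) = (4 + 4 - 1)/2 = 3.5.}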

Normalization of the Levenshtein distance/similarity to the unit interval
is also supported by setting \code{normalize = TRUE}. The normalization approach
follows Yujian and Bo (2007), and ensures that the distance remains a metric
when the costs of insertion \eqn{w_i} and deletion \eqn{w_d} are equal.
The normalized distance \eqn{\mathrm{dist}_n}{dist_n} is defined as
\deqn{\mathrm{dist}_n(x, y) = \frac{2 \mathrm{dist}(x, y)}{w_d |x| + w_i |y| + \mathrm{dist}(x, y)},}{dist_n(x, y) = 2 * dist(x, y) / (w_d |x| + w_i |y| + dist(x, y)),}
and the normalized similarity \eqn{\mathrm{sim}_n}{sim_n} is defined as
\deqn{\mathrm{sim}_n(x, y) = 1 - \mathrm{dist}_n(x, y) = \frac{\mathrm{sim}(x, y)}{w_d |x| + w_i |y| - \mathrm{sim}(x, y)}.}{sim_n(x, y) = 1 - dist_n(x, y) = sim(x, y) / (w_d |x| + w_i |y| - sim(x, y)).}
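
As a worked illustration, assuming unit costs (\eqn{w_d = w_i = 1}), take
\code{x = "Iran"} and \code{y = "Iraq"}, for which
\eqn{\mathrm{dist}(x, y) = 1}{dist(x, y) = 1} and
\eqn{\mathrm{sim}(x, y) = 3.5}{sim(x, y) = 3.5}. The definitions above then give
\deqn{\mathrm{dist}_n(x, y) = \frac{2 \cdot 1}{4 + 4 + 1} = \frac{2}{9}, \qquad \mathrm{sim}_n(x, y) = \frac{3.5}{4 + 4 - 3.5} = \frac{7}{9} = 1 - \frac{2}{9}.}{dist_n(x, y) = 2 * 1 / (4 + 4 + 1) = 2/9, sim_n(x, y) = 3.5 / (4 + 4 - 3.5) = 7/9 = 1 - 2/9.}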
}
\note{
If the costs of deletion and insertion are equal, this comparator is
symmetric in \eqn{x} and \eqn{y}. In that case, both the normalized and
unnormalized distances satisfy the properties of a metric.
}
\examples{
## Compare names with potential typos
x <- c("Brian Cheng", "Bryan Cheng", "Kondo Onyejekwe", "Condo Onyejekve")
pairwise(Levenshtein(), x, return_matrix = TRUE)

## When a substitution costs more than a deletion plus an insertion, the
## Levenshtein distance reduces to the LCS distance
Levenshtein(substitution = 100)("Iran", "Iraq") == LCS()("Iran", "Iraq")
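
## A sketch, assuming unit costs (w_d = w_i = 1), that checks the similarity
## and the normalized distance against the formulas in the Details section
d <- Levenshtein()("Iran", "Iraq")
s <- Levenshtein(similarity = TRUE)("Iran", "Iraq")
dn <- Levenshtein(normalize = TRUE)("Iran", "Iraq")
total <- nchar("Iran") + nchar("Iraq")  # w_d |x| + w_i |y| under unit costs
all.equal(s, (total - d) / 2)           # sim = (w_d |x| + w_i |y| - dist) / 2
all.equal(dn, 2 * d / (total + d))      # dist_n = 2 dist / (w_d |x| + w_i |y| + dist)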

}
\references{
Navarro, G. (2001), "A guided tour to approximate string matching",
\emph{ACM Computing Surveys (CSUR)}, \strong{33}(1), 31-88.

Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric",
\emph{IEEE Transactions on Pattern Analysis and Machine Intelligence},
\strong{29}(6), 1091-1095.
}
\seealso{
Other edit-based comparators include \code{\link{Hamming}}, \code{\link{LCS}},
\code{\link{OSA}} and \code{\link{DamerauLevenshtein}}.
}
