It is necessary at times to censor certain characters in a string. In this case, censoring simply means replacing a set of characters with a set of meaningless characters of the same length. For example, I might wish to censor the substring “low” in the phrase “Flowers for a friend.”, which would result in “F***ers for a friend.”
Given a goal of preserving length, this could be accomplished when working with fixed search strings using mgsub.
library(mgsub)
string = "Time to flip this family into a fun pit of pudding!"
pattern = c("flip", "family", "fun")
replacement = vapply(pattern, function(x) {
paste(rep("*", nchar(x)), collapse = "")
}, FUN.VALUE = "")
mgsub(string, pattern, replacement)
## [1] "Time to **** this ****** into a *** pit of pudding!"
However, this can break down when using variable-length regular expression matching. The number of censor characters in the replacement is based on the length of the regular expression, not on the length of the match itself, so this approach fails to maintain character length.
string = "Time to flip this family into a fun pit of pudding!"
pattern = c("f[^ ]*i[^ ]*", "fun")
replacement = vapply(pattern, function(x) {
paste(rep("*", nchar(x)), collapse = "")
}, FUN.VALUE = "")
mgsub(string, pattern, replacement)
## [1] "Time to ************ this ************ into a *** pit of pudding!"
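The mismatch is easy to confirm. The following is a minimal sketch that uses only base R (regmatches and gregexpr, not mgsub) to compare the length of the regular expression with the lengths of the text it actually matches.

```r
string = "Time to flip this family into a fun pit of pudding!"
pattern = "f[^ ]*i[^ ]*"

# Length of the regular expression itself; the replacement above was sized from this
pattern_length = nchar(pattern)

# The substrings the expression actually matches, and their lengths
matches = regmatches(string, gregexpr(pattern, string))[[1]]
match_lengths = nchar(matches)

pattern_length  # 12
matches         # "flip" and "family"
match_lengths   # 4 and 6
```

The expression is 12 characters long, while its matches are 4 and 6 characters long, which is why each match above was replaced by 12 censor characters.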
Even if you have fixed matches, it shouldn’t be necessary to produce and maintain an equivalent vector of censor replacements.
This is where the idea of mgsub_censor comes in. mgsub_censor provides the same safe, simultaneous string substitution functionality as mgsub but with the narrower task of censoring strings. You provide the patterns to match as well as your desired censoring character, and the censoring is applied simultaneously.
string = "Time to flip this family into a fun pit of pudding!"
pattern = c("f[^ ]*i[^ ]*", "fun")
mgsub::mgsub_censor(string = string, pattern = pattern, censor = "*")
## [1] "Time to **** this ****** into a *** pit of pudding!"
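To make the idea of sizing the replacement from the match concrete, here is a rough base R sketch. The helper censor_by_match_length is hypothetical and is not how mgsub_censor is implemented; it handles only a single pattern and offers none of the simultaneous-substitution safety.

```r
# Hypothetical helper, not part of mgsub: censors every match of a single
# regular expression with a run of `censor` as long as the match itself.
censor_by_match_length = function(string, pattern, censor = "*") {
  m = gregexpr(pattern, string)
  # Build one replacement per match, sized from the matched text
  regmatches(string, m) = lapply(regmatches(string, m), function(x) {
    strrep(censor, nchar(x))
  })
  string
}

string = "Time to flip this family into a fun pit of pudding!"
censored = censor_by_match_length(string, "f[^ ]*i[^ ]*")
censored
## [1] "Time to **** this ****** into a fun pit of pudding!"

# The censored string has the same number of characters as the input
nchar(censored) == nchar(string)
```

Note that "fun" is untouched here because only one pattern was supplied; handling several patterns at once is the part mgsub_censor takes care of.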
You may wish to produce the comical censoring effects often used in comic strips. This is supported through multicharacter censors, which can be provided in multiple ways.
If the split argument is TRUE (the default in this case), the value will be split into individual characters and these will be sampled to produce the effect.
string = "Why don't you go flip a cookie?"
pattern = "flip"
censor = "?#!*"
print(mgsub::mgsub_censor(string, pattern, censor, split = TRUE))
## [1] "Why don't you go **?* a cookie?"
## [1] "Why don't you go ##!# a cookie?"
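Because the characters are sampled, repeated calls can give different results, as the two outputs above show. The result can be made reproducible by setting a seed.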
string = "Why don't you go flip a cookie?"
pattern = "flip"
censor = "?#!*"
mgsub::mgsub_censor(string, pattern, censor, split = TRUE, seed = 1002)
## [1] "Why don't you go *?!? a cookie?"
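The split-and-sample behavior can be approximated in base R. The helper sample_censor below is a hypothetical illustration, not mgsub's internal code, and it is not expected to reproduce the exact characters shown above for a given seed.

```r
# Hypothetical helper, not part of mgsub: splits `censor` into single
# characters and samples `n` of them with replacement.
sample_censor = function(censor, n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  chars = strsplit(censor, "")[[1]]
  paste(sample(chars, n, replace = TRUE), collapse = "")
}

first = sample_censor("?#!*", nchar("flip"), seed = 1002)
second = sample_censor("?#!*", nchar("flip"), seed = 1002)

nchar(first)              # 4, the same length as "flip"
identical(first, second)  # TRUE: the same seed gives the same censor
```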
It is also possible to produce output with more characters than the input by setting split to FALSE. In this case, the 4-character censor is repeated once per matched character, 4 times in total, so the replacement is 16 characters long and the output is 12 characters longer than the input. Use this with caution.
string = "Why don't you go flip a cookie?"
pattern = "flip"
censor = "?#!*"
mgsub::mgsub_censor(string, pattern, censor, split = FALSE)
## [1] "Why don't you go ?#!*?#!*?#!*?#!* a cookie?"
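The character counts can be checked with a couple of lines of base R. This sketch only reproduces the arithmetic of repeating the whole censor once per matched character; it does not call mgsub.

```r
censor = "?#!*"
matched = "flip"

# The whole censor is repeated once for each character of the match
replacement = strrep(censor, nchar(matched))

replacement                          # "?#!*?#!*?#!*?#!*"
nchar(replacement)                   # 16
nchar(replacement) - nchar(matched)  # 12 extra characters
```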
A vector of single characters behaves the same as a multicharacter vector of length one with split = TRUE. Because each element is already a single character, setting split = FALSE doesn’t change the output character count.
string = "Why don't you go flip a cookie?"
pattern = "flip"
censor = c("?", "#", "!", "*")
print(mgsub::mgsub_censor(string, pattern, censor, split = TRUE))
## [1] "Why don't you go !*?# a cookie?"
## [1] "Why don't you go ?!*# a cookie?"
## [1] "Why don't you go *?!? a cookie?"
In this case, when split = TRUE, the fact that the vector has a length greater than one doesn’t matter. Each vector element is split, then the set is unlisted.
string = "Why don't you go flip a cookie?"
pattern = "flip"
censor = c("?#", "!*")
print(mgsub::mgsub_censor(string, pattern, censor, split = TRUE))
## [1] "Why don't you go !*?# a cookie?"
## [1] "Why don't you go ?!*# a cookie?"
## [1] "Why don't you go *?!? a cookie?"
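The split-then-unlist step described above can be sketched in base R as follows. This is an illustration of the idea using strsplit and unlist, assumed to mirror the behavior rather than taken from the package source.

```r
censor = c("?#", "!*")

# Split every element into single characters, then flatten into one vector
pool = unlist(strsplit(censor, ""))

pool                                    # "?" "#" "!" "*"
identical(pool, c("?", "#", "!", "*"))  # TRUE
```

The resulting pool is the same one obtained from the single string "?#!*", which is why the vector length makes no difference when split = TRUE.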
When split is set to FALSE, this behaves like a length-one, multicharacter censor except that the vector elements are sampled. Here we sample between two 2-character elements four times, so we end up with 8 characters, 4 more than we started with. Use split = FALSE with caution.
string = "Why don't you go flip a cookie?"
pattern = "flip"
censor = c("?#", "!*")
mgsub::mgsub_censor(string, pattern, censor, split = FALSE)
## [1] "Why don't you go ?#!*?#!* a cookie?"
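The length arithmetic for this case can also be checked in base R. The sketch below samples whole elements without splitting them; it illustrates the counting only and makes no claim about which elements mgsub_censor would pick.

```r
censor = c("?#", "!*")
matched = "flip"

# Sample one whole element per matched character, without splitting
set.seed(1)
replacement = paste(sample(censor, nchar(matched), replace = TRUE), collapse = "")

nchar(replacement)                   # 8: four 2-character elements
nchar(replacement) - nchar(matched)  # 4 extra characters
```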
The most compelling feature of mgsub is its safety.
Here is a quick overview of what is meant by safety:
mgsub_censor maintains that safety, as demonstrated below. Note how the shorter kilo is ignored when matching kilogram despite being a substring.
string = "I'm selling 100 kilograms of bleach for $20/kilo"
pattern = c("kilo", "kilogram")
censor = "*"
mgsub::mgsub_censor(string, pattern, censor)
## [1] "I'm selling 100 ********s of bleach for $20/****"
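For contrast, here is what a naive sequential approach does with the same input, using base R's gsub with fixed strings. This is a sketch of the failure mode that simultaneous substitution avoids, not of anything mgsub does.

```r
string = "I'm selling 100 kilograms of bleach for $20/kilo"

# Censor the shorter pattern first, then the longer one
step1 = gsub("kilo", "****", string, fixed = TRUE)
step2 = gsub("kilogram", "********", step1, fixed = TRUE)

step2
## [1] "I'm selling 100 ****grams of bleach for $20/****"
```

Once kilo has been replaced, kilogram no longer exists in the string, so the second substitution never fires and "grams" is left exposed.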