I just finished reviewing a pull request in the knitr repo that tries to improve the error message when it fails to parse YAML, and I feel three base R functions are worth mentioning to more R users. I have been inspired by Maëlle Salmon’s 3 functions blog series, and finally started writing one by myself.
regexec(): get substrings with patterns
If you want to master string processing using regular expressions (regex) with
base R, the two help pages ?grep and ?regexp are pretty much all you need.
Although I had read them many times in the past, I did not discover regexec()
until about three years ago, while this function was first introduced in R
2.14.0 (2011-10-31).
This function gives you the positions of substring groups captured by your
regular expressions. It will be much easier to understand if you actually get
the substrings instead of their positions, which can be done via regmatches(),
another indispensable function when you work with functions like regexec() and
regexpr(). For example:
x = 'abbbbcdefg'
m = regexec('a(b+)', x) # positions
regmatches(x, m) # substrings
[[1]]
[1] "abbbb" "bbbb"
The length of the returned value depends on how many () groups you have in the
regular expression. In the above example, the first value is the whole match
(abbbb is matched by a(b+)), and the second value is for the first group
(b+) (any number of consecutive b’s).
If you do not know regexec() or regmatches(), it is natural to do substr()
like the aforementioned pull request originally did:
message = e$message
regex = "line (?<line>\\d+), column (?<column>\\d+)"
regex_result = regexpr(regex, message, perl = TRUE)
starts = attr(regex_result, "capture.start")
lengths = attr(regex_result, "capture.length")
line_index = substr(message, starts[,"line"], starts[,"line"] + lengths[,"line"] - 1)
column_index = substr(message, starts[,"column"], starts[,"column"] + lengths[,"column"] - 1)
line_index = as.integer(line_index)
column_index = as.integer(column_index)
Its goal is to extract a line and column number from a string of the form
"line x, column y". I rewrote the code (using my obnoxious
one-letter-variable-name style) as:
x = e$message
r = "line (?<row>\\d+), column (?<col>\\d+)"
m = regmatches(x, regexec(r, x, perl = TRUE))[[1]][-1]
row = as.integer(m['row'])
col = as.integer(m['col'])
Note that (<?NAME>...) means a named capture, so you could later extract the
substrings by names instead of numeric indices, e.g., m['row'] instead of
m[1]. But this is not important. It is okay to use a numeric index.
BTW, if you are new to regular expressions and not sure if you should use
perl = TRUE or FALSE (often the default) in the regex family of functions,
I’d recommend perl = TRUE. Perl-compatible regular expressions (PCRE) should
cause you fewer surprises and are more powerful.
strrep(): repeat a string for a number of times
How many times have you done this?
paste(rep('abc', 10), collapse = '')
I have done it for numerous times. Now, no more rep() or paste(). Use
strrep() instead:
strrep('abc', 10)
It is even vectorized like most other base R functions, e.g.,
strrep(c('abc', 'defg'), c(3, 4))
I do not want to pretend that I have always known everything—in fact, I did not discover this function until about two years ago.
It is common to generate N spaces like the original pull request did:
spaces = paste(rep(" ", column_index), collapse = "")
cursor = paste(spaces, "^~~~~~", collapse = "")
And I rewrote it as:
cursor = paste0(strrep(" ", col), "^~~~~~")
append(): insert elements to a vector
Maëlle has mentioned append() in her post. Interestingly, it could be used in
this pull request, too. Original code:
split_indexes = seq_along(meta) <= line_index
before_cursor = meta[split_indexes]
after_cursor = meta[!split_indexes]
error_message = c(
"Failed to parse YAML: ", e$message, "\n",
before_cursor,
cursor,
after_cursor
)
New code:
x = c("Failed to parse YAML: ", x, "\n", append(meta, cursor, row))
I remember when I first learned S-Plus in 2004, I was surprised to see a
classmate wrote a t.test() function by herself (which was actually cool) and
she was equally surprised when I told her that there was a built-in t.test()
function. I think similar things still happen today. If you are not aware of
regexec(), strrep(), or append(), it is easy and tempting to reinvent
them, which can make your code lengthy and complicated.
Donate
As a freelancer (currently working as a contractor) and a dad of three kids, I truly appreciate your donation to support my writing and open-source software development! Your contribution helps me cope with financial uncertainty better, so I can spend more time on producing high-quality content and software. You can make a donation through methods below.
-
Venmo:
@yihui_xie, or Zelle:xie@yihui.name -
Paypal
-
If you have a Paypal account, you can follow the link https://paypal.me/YihuiXie or find me on Paypal via my email
xie@yihui.name. Please choose the payment type as “Family and Friends” (instead of “Goods and Services”) to avoid extra fees. -
If you don’t have Paypal, you may donate through this link via your debit or credit card. Paypal will charge a fee on my side.
-
-
Other ways:
WeChat Pay (微信支付:谢益辉) Alipay (支付宝:谢益辉) 

When sending money, please be sure to add a note “gift” or “donation” if possible, so it won’t be treated as my taxable income but a genuine gift. Needless to say, donation is completely voluntary and I appreciate any amount you can give.
Please feel free to email me if you prefer a different way to give. Thank you very much!
I’ll give back a significant portion of the donations to the open-source community and charities. For the record, I received about $30,000 in total (before tax) in 2024-25, and gave back about $15,000 (after tax).