What are Regular Expressions and How to Use Them in R

외계어가 아니에요!

R
NLP
Author

Changjun Lee

Published

March 8, 2023

Regular expressions are a powerful tool for searching, matching, and manipulating text. They are a type of pattern language that can be used to describe sets of character strings, such as words or sentences, in a concise and flexible way. In this blog post, we will introduce what regular expressions are, why they are useful, and how to use them in R.

A regular expression is a sequence of characters that define a search pattern. The pattern can be used to match (and sometimes replace) strings, or to perform some other manipulation of strings. Regular expressions are commonly used for searching for patterns in text data, such as extracting specific information from a document, validating input data, or matching and replacing specific characters.

One of the most powerful features of regular expressions is their ability to specify patterns using a compact and flexible syntax. For example, the regular expression a.b will match any string that contains an a followed by any single character followed by a b. The dot (.) is a special character that matches any single character. Similarly, the regular expression [abc] will match any string that contains one of the characters a, b, or c.

In R, there are several functions for working with regular expressions, including grep, grepl, sub, and gsub. These functions allow you to search for, match, and replace patterns in strings. For example, the grep function can be used to search for patterns in a vector of strings:

text <- c("apple", "banana", "cherry")
grep("^b", text)
[1] 2

This code uses the grep function to search for strings in the text vector that start with the letter b. The ^ symbol is a special character that matches the start of a string. The result of this code will be a vector of the indices of the strings in the text vector that match the pattern:

The grepl function is similar to grep, but returns a logical vector indicating which elements of the input vector match the pattern, rather than the indices:

grepl("^b", text)
[1] FALSE  TRUE FALSE

The sub and gsub functions can be used to replace matches in a string:

sub("b", "B", text)
[1] "apple"  "Banana" "cherry"
gsub("b", "B", text)
[1] "apple"  "Banana" "cherry"

Regular expressions are a powerful tool for searching, matching, and manipulating text data. They are used to describe patterns in text data in a concise and flexible way, and can be used for a variety of purposes, such as extracting specific information from a document, validating input data, or matching and replacing specific characters. With the grep, grepl, sub, and gsub functions in R, you can easily search for and manipulate patterns in text data.

One of the key features of regular expressions is the ability to use special characters to define patterns. Here are some of the most commonly used special characters in regular expressions:

text <- c("apple", "banana", "cherry")
grep("^[abc]", text)
[1] 1 2 3

This code uses the grep function to search for strings in the text vector that start with the characters a, b, or c. The result of this code will be a vector of the indices of the strings in the text vector that match the pattern:

text <- c("apple", "banana", "cherry")
grep("^[^abc]", text)
integer(0)
text <- c("apple", "banana", "cherry")
grep("^b", text)
[1] 2
text <- c("apple", "banana", "cherry")
grep("b$", text)
integer(0)
text <- c("apple", "banana", "cherry")
grep("e$", text)
[1] 1
text <- c("apple", "banana", "cherry")
grep("a+", text)
[1] 1 2
text <- c("apple", "banana", "cherry")
grep("a?", text)
[1] 1 2 3

If you want to learn more about regular expressions in R, check out this video tutorial: https://www.youtube.com/watch?v=q8SzNKib5-4. The video provides a comprehensive explanation of regular expressions and how to use them in R, with practical examples and hands-on exercises. Whether you’re a beginner or an advanced user, this video is a great resource to deepen your understanding of regular expressions and their applications in R.

한글로 된 설명은 안도현 교수님의 eBook 참고: https://bookdown.org/ahn_media/bookdown-demo/cleantool.html#%EC%A0%95%EA%B7%9C%ED%91%9C%ED%98%84%EC%8B%9Dregular-expressions