How to calculate the number of characters and count the words in R
Count Number of Words in String using R
Method 1: Using strsplit and sapply methods
The strsplit() method in R splits each input string wherever the given regular expression matches and returns a list with one character vector per input string. Each element of that vector is a substring of the original string. When the string is split on spaces, the length of the vector is therefore equal to the number of words.
Syntax: strsplit(str, regex)
Arguments:
· str – The string (or character vector) to split into words
· regex – The character vector (or object which can be coerced to one) containing the regular expression to split on. For counting words, the pattern is simply a single space, " ".
sapply() method: The sapply() method applies a function over a vector or list and returns the results, simplified to a vector where possible. When the function passed as the second argument is length, the length of each split vector, that is, the word count, is returned.
Syntax: sapply(str, FUN)
The combined approach to count the words in a string is given by the following syntax in R:
sapply(strsplit(str, " "), length)
# declaring string
str <- "Counting the words in this R sentence? Try this approach in GFG! "
print("Original string")
print(str)
print("Total number of words")
# splitting the string on single spaces
split <- strsplit(str, " ")
# counting the words in the split result
sapply(split, length)
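As a hedged variation on the approach above, the sketch below splits on runs of whitespace ("\\s+") instead of a single space, so that repeated spaces do not produce empty "words". The variable names and the sample strings are illustrative assumptions, not part of the original example.

```r
# Illustrative input (assumed for this sketch): note the double space in the first string
sentences <- c("Counting  the words", "Try this approach in GFG!")

# Splitting on a single space leaves an empty string between the two spaces
single_space <- sapply(strsplit(sentences, " "), length)

# Splitting on one or more whitespace characters avoids that
whitespace_run <- sapply(strsplit(sentences, "\\s+"), length)

print(single_space)    # 4 5
print(whitespace_run)  # 3 5
```

Because strsplit() is vectorised, the same call handles a single string or a whole vector of strings.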
Method 2: Using gregexpr method.
This method uses functions available in base R to count the separators between words. The gregexpr() method returns a list with one element per input string; each element is an integer vector holding the starting positions of every match of the pattern in that string. Pattern matching is case-sensitive by default. The pattern in our case is \\W+.
Syntax: gregexpr(pattern, text)
The lengths() method is then applied to return the length of each element of the list returned by gregexpr(), that is, the number of matches in each string.
Syntax: lengths(x)
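This method uses the regular expression symbol \\W to match non-word characters, with + indicating one or more in a row. It returns the number of separators between the words, so the number of words is separators + 1 in most cases. The count is off when the string starts or ends with a non-word character, since that leading or trailing run is also counted as a separator.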
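# declaring string
str <- "Counting the words in this R sentence? Try this approach in GFG! "
print("Original string")
print(str)
print("Total number of words")
# counting the runs of non-word characters (separators) and adding 1
# note: the trailing "! " is itself a separator run, so this string is over-counted by one
lengths(gregexpr("\\W+", str)) + 1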
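One way to sidestep the separators + 1 adjustment is to match the words themselves with "\\w+" rather than the gaps between them. The sketch below is a minimal illustration under that assumption; count_words is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper for this sketch: count runs of word characters directly
count_words <- function(x) {
  matches <- gregexpr("\\w+", x)
  # gregexpr() returns -1 when there is no match, so count only positive positions
  sapply(matches, function(m) sum(m > 0))
}

word_counts <- count_words(c("Try this approach in GFG! ", "?!", ""))
print(word_counts)  # 5 0 0
```

Note that "\\w+" treats an apostrophe as a break, so a contraction such as "don't" would be counted as two words.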
Method 3: Using stringr package
The stringr package in R is used to perform string manipulations. It is not part of base R, so it must be installed once before its functions can be used.
install.packages("stringr")
The stringr package provides a str_count() method, which counts the number of occurrences of a pattern passed as an argument. The pattern may be a single character or a group of characters, and every match increments the count. The method can also be called on a vector of strings, in which case it returns a vector with one count per string. Note that with the pattern \\W+ the method counts separators rather than words, so the result only approximates the word count. If no matches are found, 0 is returned.
Syntax: str_count(str, pattern = "")
Arguments:
· str – The string in which to count the occurrences
· pattern – The pattern to match
library("stringr")
# declaring string
str <- "Counting the words in this R sentence? Try this approach in GFG!"
print("Original string")
print(str)
print("Total number of words")
# counting the runs of non-word characters (separators)
str_count(str, "\\W+")
How to calculate the number of occurrences of a character in each row of R Data Frame?
Method 1: Using stringr package
The stringr package in the R programming language can be used to perform string manipulation and extraction. It must be installed before it can be loaded into the working space.
The str_count() method is used to count the matches of the specified pattern in a vector of strings. It returns an integer vector with the number of instances of the pattern found in each element of the input vector. The str_count() method is case-sensitive.
Syntax: str_count(str, pattern = "")
Parameters:
· str – The vector of strings or a single string to search for the pattern
· pattern – The pattern to be searched for. Usually a regular expression.
The pattern may be a single character or a group of characters. It may even contain special symbols or digits, although symbols with a special meaning in regular expressions, such as ? or ., must be escaped to be matched literally. If the pattern is not found, an integer value of 0 is returned.
Example:
# loading the required library
library("stringr")
# creating a data frame
data_frame <- data.frame(
  col1 = c(1:5),
  col2 = c("Geeks", "for", "geeks", "CSE", "portal"))
# character to search for
ch <- "e"
# counting the occurrences of the character in each row
count <- str_count(data_frame$col2, ch)
print("Count of e :")
print(count)
Method 2: Using gregexpr method
The gregexpr() method of base R is used to find where a pattern is located within a specified character vector. It returns a list of integer vectors holding the starting positions of the matches in each component of the input character vector. The length of the returned list is equal to the length of the original string vector.
Syntax: gregexpr(pattern, str, ignore.case = FALSE)
Parameters:
· str – The vector of strings or a single string to search for the pattern
· pattern – The pattern to be searched for. Usually a regular expression.
· ignore.case – Indicator of whether to ignore case when matching
Here, the pattern is the character to search for and str is the column of strings to look for the pattern in. The regmatches() method is then applied to the output of this function to extract the matched substrings. If no match of the pattern is found in a string, an empty character vector is returned for that string.
Syntax: regmatches(str, m)
Parameters:
· m – The match data returned by gregexpr()
This is followed by the application of the lengths() method, which returns the number of matched substrings in each component of the regmatches() output.
Example:
# creating a data frame
data_frame <- data.frame(
  col1 = c(1:5),
  col2 = c("!?contains", "do!es!nt", "Contain", "cs!!!e", "circus?"))
print("Original DataFrame")
print(data_frame)
# character to search for
ch <- "!"
# extracting every match of the character in each row
# fixed = TRUE treats ch as a literal character rather than a regular expression
count <- regmatches(data_frame$col2,
                    gregexpr(ch, data_frame$col2, fixed = TRUE))
print("Count of !")
# returning the number of occurrences
lengths(count)
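To see why regmatches() sits between gregexpr() and lengths(), it may help to compare the two counts side by side. The short sketch below uses an assumed two-element vector x and is only meant to illustrate the no-match case.

```r
# Illustrative input (assumed for this sketch): the second string has no "!"
x <- c("do!es!nt", "Contain")
m <- gregexpr("!", x, fixed = TRUE)

# Without regmatches(), the no-match marker -1 is counted as one element
raw_lengths <- lengths(m)
# regmatches() turns a no-match into character(0), so the count is correct
match_lengths <- lengths(regmatches(x, m))

print(raw_lengths)    # 2 1
print(match_lengths)  # 2 0
```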
Method 3: Using sapply method
The sapply() method in R is used to apply a user-defined function over the specified input vector taken as the first argument.
Syntax: sapply(x, fun)
The user-defined function, in this case, consists of the following sequence of steps:
· The strsplit() method is applied to split each component of the input vector on the "" (empty string) delimiter, which breaks the string into its individual characters. This also works when a string consists of multiple words. It returns a list holding the vector of characters.
· The unlist() method is then applied to flatten this list into a vector of characters, and each character is compared with the character we wish to search for. The sum() method is then applied to count the matches.
Syntax: sum(unlist(str) == ch)
Example:
# creating a data frame
data_frame <- data.frame(
  col1 = c(1:5),
  col2 = c("!?contains", "do!es!nt", "Contain", "cs!!!e", "circus?"))
print("Original DataFrame")
print(data_frame)
# character to search for
ch <- "!"
count <- sapply(as.character(data_frame$col2),
                function(x, letter = ch) {
                  # splitting the string into individual characters
                  str <- strsplit(x, split = "")
                  # counting the characters equal to the search character
                  sum(unlist(str) == letter)
                })
print("Count of !")
# returning the number of occurrences
print(count)
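A possible variation, sketched here under the assumption that the counts should be stored next to the data, uses vapply() in place of sapply() and writes the result to a new column. The column name count is a hypothetical choice for this illustration.

```r
data_frame <- data.frame(
  col1 = 1:5,
  col2 = c("!?contains", "do!es!nt", "Contain", "cs!!!e", "circus?")
)
ch <- "!"

# vapply() works like sapply() but checks that every result is a single integer
data_frame$count <- vapply(
  strsplit(as.character(data_frame$col2), split = ""),
  function(chars) sum(chars == ch),
  integer(1)
)

print(data_frame$count)  # 1 2 0 3 0
```

Since strsplit() already returns one character vector per row, the unlist() step is not needed inside the function here.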