How to Count Characters and Words in R

 

Count the Number of Words in a String Using R

 

Method 1: Using strsplit and sapply methods

The strsplit() method in R splits a string into substrings wherever the given regular expression matches. It returns a list with one character vector per input string, and each element of that vector is a substring of the original string. When the string is split on spaces, the length of the vector is therefore equal to the number of words.

Syntax: strsplit( str , regex )

Arguments :

·         str – The string to split into words

·         regex – The character vector (or an object that can be coerced to one) containing the regular expression to split on. To count words, the pattern is simply a single space, " ".

sapply() method: It applies a function to each element of a vector or list and returns the results, simplified to a vector where possible. When the second argument, the function, is length, the length of each split vector is returned, that is, the word count of each string.

sapply (str , FUN)

The two calls are combined with the following syntax to count the words:

sapply(strsplit(str, " "), length)

# declaring string
str <- "Counting the words in this R sentence? Try this approach in GFG! "
print("Original string")
print(str)
print("Total number of words")
# splitting the string on single spaces (returns a list with one vector per string)
words <- strsplit(str, " ")
# length of each split vector, i.e. the word count (12 here)
sapply(words, length)
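Splitting on a single space assumes that words are separated by exactly one space. As a minimal sketch, assuming base R only, the variant below trims the ends and splits on runs of whitespace instead. The helper name count_words_ws and the sample strings are hypothetical and not part of the original example.

```r
# Hypothetical helper: counts whitespace-separated words in each string.
count_words_ws <- function(x) {
  # trimws() removes leading and trailing whitespace;
  # "\\s+" then splits on one or more whitespace characters.
  lengths(strsplit(trimws(x), "\\s+"))
}

sentences <- c("Counting  the words", " Try this approach in GFG! ")
ws_counts <- count_words_ws(sentences)
print(ws_counts)
```

lengths() is used here instead of sapply(x, length); both return one count per input string.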

Method 2: Using gregexpr method

This method uses functions available in base R to count the separators between the words of a string. The gregexpr() method returns a list with one element per input string, each holding the starting positions of all matches of the pattern in that string. The pattern matching is case-sensitive by default. The pattern in our case is \\W+

Syntax: gregexpr(pattern, text)

The lengths() method is then applied to return the length of each element of that list, that is, the number of matches found in each string.

Syntax: lengths(x)

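This method uses the regular expression symbol \\W to match non-word characters, using + to indicate one or more in a row. Each match is one separator between words, so the number of words is separators + 1, but only when the string neither starts nor ends with a separator. The example string ends with "! ", which would count as one extra separator, so leading and trailing separators are stripped before counting.

str <- "Counting the words in this R sentence? Try this approach in GFG! "
print("Original string")
print(str)
print("Total number of words")
# stripping leading and trailing separators so they are not counted
trimmed <- gsub("^\\W+|\\W+$", "", str)
# counting the runs of non-word characters (separators) and adding 1
lengths(gregexpr("\\W+", trimmed)) + 1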
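Counting the words themselves, rather than the separators, avoids the + 1 adjustment and also copes with empty strings. The following is a sketch under that assumption, using only base R. The name count_word_runs is hypothetical, and a "word" here means a run of \\w characters, so a contraction such as "don't" would count as two.

```r
# Hypothetical helper: counts runs of word characters in each string.
count_word_runs <- function(x) {
  # gregexpr() locates every run of word characters;
  # regmatches() extracts them (an empty vector when nothing matches).
  matches <- regmatches(x, gregexpr("\\w+", x))
  lengths(matches)
}

samples <- c(
  "Counting the words in this R sentence? Try this approach in GFG! ",
  "Hello, world",
  ""
)
run_counts <- count_word_runs(samples)
print(run_counts)
```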

 

Method 3: Using stringr package 

 

The stringr package in R is used to perform string manipulations. It needs to be installed explicitly before its methods and routines can be used.

install.packages("stringr")

The stringr package provides a str_count() method which counts the number of occurrences of a pattern specified as an argument to the function. The pattern may be a single character or a group of characters, and every match of the expression increments the count. This method can also be invoked over a vector of strings, in which case it returns a vector with one count per string. For word counting the result is only an approximation, because it depends on how the pattern defines a word. If no matches are found, 0 is returned.

Syntax: str_count(str, pattern = "")

Arguments :

·         str – The string, or vector of strings, to search

·         pattern – the pattern to match to

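library("stringr")
# declaring string
str <- "Counting the words in this R sentence? Try this approach in GFG! "
print("Original string")
print(str)
print("Total number of words")
# counting runs of word characters, one per word
# ("\\W+" would count the separators instead, which only equals
# the word count when the string ends with a separator)
str_count(str, "\\w+")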
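The methods above count words. For the number of characters, base R provides nchar(), which returns the length of each string with spaces and punctuation included. The snippet below is a brief sketch; the variable names and sample strings are illustrative only.

```r
phrases <- c("GFG!", "R sentence", "")

# Number of characters in each string, including spaces and punctuation.
char_counts <- nchar(phrases)
print(char_counts)

# Number of characters after removing all whitespace.
non_space_counts <- nchar(gsub("\\s", "", phrases))
print(non_space_counts)
```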

 

How to calculate the number of occurrences of a character in each row of an R Data Frame?

 

Method 1: Using stringr package

 

The stringr package in R programming language can be used to perform string manipulations and extraction. It must be installed before it can be loaded.

The str_count() method counts the matches of the specified pattern in each element of a vector of strings. It returns an integer vector with the number of instances of the pattern found in each input string. The str_count() method is case-sensitive. 

Syntax:

str_count(str, pattern = "")

Parameter : 

·         str – The vector of strings or a single string to search for the pattern

·         pattern – The pattern to be searched for. Usually a regular expression.

The pattern may be a single character or a group of characters. It may also contain special symbols or digits. If the pattern is not found in a string, an integer value of 0 is returned for that string.

Example:

# loading the required library
library("stringr")
# creating a data frame
data_frame <- data.frame(
  col1 = 1:5,
  col2 = c("Geeks", "for", "geeks", "CSE", "portal"))
# character to search for
ch <- "e"
# counting the occurrences of the character in each row (case-sensitive)
count <- str_count(data_frame$col2, ch)
print("Count of e :")
print(count)
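Since str_count() treats the pattern as a regular expression, a character such as "." does not match literally by default. A minimal sketch, assuming stringr is installed, that wraps the pattern in stringr's fixed() to count the literal character; the sample vector files is made up for illustration.

```r
library(stringr)

# Hypothetical sample data containing literal dots.
files <- c("report.final.csv", "notes")

# As a regular expression, "." matches every character in the string.
regex_counts <- str_count(files, ".")

# fixed() makes str_count() compare the pattern literally.
dot_counts <- str_count(files, fixed("."))
print(dot_counts)
```

The same wrapper is worth considering whenever the character to count is a regular expression metacharacter such as ?, *, + or (.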

Method 2: Using gregexpr method

The gregexpr() method of base R is used to indicate where a pattern is located within a specified character vector. It returns a list with one element per input string, each containing the starting positions of all matches in that string. The length of the returned list is equal to the length of the original string vector. 

Syntax:

gregexpr(pattern, str, ignore.case=FALSE)

Parameter :

·         str – The vector of strings or a single string to search for the pattern

·         pattern – The pattern to be searched for. Usually a regular expression.

·         ignore.case – Indicator to ignore case or not

Here, the pattern is the character to search for and str is the column of strings to search in. The regmatches() method is then applied to the output of gregexpr() to extract the matched substrings. If a string contains no match, an empty character vector is returned for it, so its count is 0.

Syntax:

regmatches(str, m)

Parameter : 

·         m – The match data returned by gregexpr(). 

This is followed by the application of the lengths() method, which returns the number of matches extracted by regmatches() for each string. 

Example:

# creating a data frame
data_frame <- data.frame(
  col1 = 1:5,
  col2 = c("!?contains", "do!es!nt", "Contain", "cs!!!e", "circus?"))
print("Original DataFrame")
print(data_frame)
# character to search for
ch <- "!"
# extracting every match of the character from each string
count <- regmatches(data_frame$col2, gregexpr(ch, data_frame$col2))
print("Count of !")
# returning the number of occurrences in each row
lengths(count)
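gregexpr() also interprets the pattern as a regular expression, so a metacharacter such as ? is not treated as a literal character unless it is escaped or fixed = TRUE is set. A minimal sketch, assuming the same col2 values as in the example above, that counts "?" literally:

```r
# Same strings as col2 in the example; kept as a plain vector for brevity.
col2 <- c("!?contains", "do!es!nt", "Contain", "cs!!!e", "circus?")

# "?" is a regex metacharacter, so it is matched literally with fixed = TRUE.
ch <- "?"
positions <- gregexpr(ch, col2, fixed = TRUE)

# One count per string; strings without a match give 0.
question_counts <- lengths(regmatches(col2, positions))
print(question_counts)
```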

 

Method 3: Using sapply method

 

·         The sapply() method in R applies a function to each element of the input vector given as its first argument. The user-defined function, in this case, consists of the following steps:

Syntax:

sapply ( x , fun)

·         The strsplit() method splits each string into its individual characters by using the empty string "" as the delimiter. It returns a list containing a vector of single characters.

·         The unlist() method then flattens this list into a vector of characters, and each character is compared with the character we wish to search for. The sum() method adds up the TRUE values, which gives the number of matches.

Syntax:

sum ( unlist( str) == ch)

Example:

# creating a data frame
data_frame <- data.frame(
  col1 = 1:5,
  col2 = c("!?contains", "do!es!nt", "Contain", "cs!!!e", "circus?"))
print("Original DataFrame")
print(data_frame)
# character to search for
ch <- "!"
# counting the matches in each string of the column
count <- sapply(as.character(data_frame$col2), function(x, letter = ch) {
  # splitting the string into single characters
  chars <- strsplit(x, split = "")
  # summing the characters equal to the search character
  sum(unlist(chars) == letter)
})
print("Count of !")
# returning the number of occurrences (named by the original strings)
print(count)
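Because the counts are per row, it can be convenient to store them next to the data. A sketch, assuming the same data frame as above; the column name excl_count is hypothetical, and USE.NAMES = FALSE drops the string names that sapply() would otherwise attach to the result.

```r
data_frame <- data.frame(
  col1 = 1:5,
  col2 = c("!?contains", "do!es!nt", "Contain", "cs!!!e", "circus?"),
  stringsAsFactors = FALSE
)

# character to search for
ch <- "!"

# One count per row, stored as a new (hypothetical) column.
data_frame$excl_count <- sapply(
  data_frame$col2,
  function(x) sum(strsplit(x, split = "")[[1]] == ch),
  USE.NAMES = FALSE
)
print(data_frame)
```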

 

 
