Steps in Analysing Data

Here are the steps involved in analysing data:

  1. Loading the datasets
  2. Viewing the data
  3. Cleaning the data
  4. Visualizing the data

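1. Loading the Datasets

# For the example below, we will be exploring the tidyverse package.

# Activate the tidyverse library. This also loads ggplot2, which provides the diamonds dataset used below.
library(tidyverse)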

2. Viewing the Data

# To preview the data we will use the head() function. This function
# returns the first 6 rows by default. To view more rows, pass the number of rows as a second argument, such as head(diamonds, 10), which returns the first 10 rows.
head(diamonds) # the head() function returns the first 6 rows by default.
str(diamonds) # the structure function returns each column's name, type, and first few values.
summary(diamonds) # provides a summary of each column; for numeric columns this includes the minimum, maximum, mean, median, and quartiles.
View(diamonds) # opens the rows and columns in a spreadsheet-style viewer.
dim(diamonds) # returns the dimensions of the dataset (rows x columns).
colnames(diamonds) # or names(diamonds): returns the names of the variables.

dplyr functions

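summarize() # collapses the data to one row of summary statistics (one row per group when used after group_by()).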
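
As a minimal sketch of summarize() on the diamonds data, the snippet below collapses the whole dataset to a single row. The output column names (n, mean_price, max_carat) are chosen here for illustration only.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Collapse the whole dataset to one row of summary statistics.
overall <- diamonds %>%
  summarize(
    n          = n(),          # number of rows
    mean_price = mean(price),  # average price
    max_carat  = max(carat)    # largest carat value
  )

print(overall)
```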

3. Cleaning Data

Here is a list of some commonly used R packages and relevant functions for data cleaning:

dplyr: This package provides a set of functions for data manipulation.

filter(): For subsetting data based on conditions.
select(): To select specific columns.
mutate(): To create new variables.
arrange(): For sorting data.
group_by(): For grouping data.
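
The dplyr verbs above are usually chained together with the pipe. Here is a small sketch of how that might look. The orders table, its columns, and the 10% tax rate are all made up for this example.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only.
orders <- tibble(
  customer = c("ana", "ben", "ana", "cy", "ben"),
  amount   = c(20, 35, 15, 50, 5),
  status   = c("paid", "paid", "refunded", "paid", "paid")
)

paid_totals <- orders %>%
  filter(status == "paid") %>%                 # keep only paid orders
  select(customer, amount) %>%                 # keep just the columns we need
  mutate(amount_with_tax = amount * 1.1) %>%   # new column (assumed 10% tax)
  group_by(customer) %>%                       # one group per customer
  summarize(total = sum(amount_with_tax)) %>%  # one row per customer
  arrange(desc(total))                         # largest total first

print(paid_totals)
```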

tidyr: This package is used for tidying and reshaping data.

gather(): For converting wide data to long data (superseded by pivot_longer() in current tidyr).
spread(): For converting long data to wide data (superseded by pivot_wider() in current tidyr).
separate(): To split a single column into multiple columns.
unite(): To combine multiple columns into one.
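
To see the four tidyr functions in action, here is a rough sketch on two tiny tables. The tables and their column names are hypothetical and exist only to show the reshaping.

```r
library(tidyr)

# Hypothetical wide table: one column per year.
wide <- data.frame(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(11, 21),
  check.names = FALSE   # keep "2019" and "2020" as column names
)

# Wide -> long: one row per country-year pair.
long <- gather(wide, key = "year", value = "cases", -country)

# Long -> wide: back to one column per year.
wide_again <- spread(long, key = year, value = cases)

# Split one column into several, then combine them again.
dates  <- data.frame(date = c("2020-01-15", "2021-06-30"))
parts  <- separate(dates, date, into = c("year", "month", "day"), sep = "-")
joined <- unite(parts, "date", year, month, day, sep = "-")

print(long)
print(parts)
```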

stringr: This package helps with string manipulation.

str_replace(): For replacing text within strings.
str_detect(): To detect patterns within strings.
str_split(): For splitting strings.
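
A short sketch of the three stringr functions on a made-up vector of phone numbers (the values are invented for illustration):

```r
library(stringr)

# Hypothetical messy values.
phones <- c("555-1234", "n/a", "555-9876")

# Which entries look like a phone number?
is_phone <- str_detect(phones, "^\\d{3}-\\d{4}$")

# Replace the first "-" in each string with a space.
spaced <- str_replace(phones, "-", " ")

# Split a string on commas; the result is a list with one element per input string.
pieces <- str_split("a,b,c", ",")

print(is_phone)
print(spaced)
print(pieces)
```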

data.table: This package provides fast and memory-efficient data manipulation.

:=: For creating or modifying columns by reference.
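.SD: Refers to the subset of data for each group; by default it holds all columns except the grouping columns.
.I: Refers to the row numbers (an integer vector of row indices).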
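
The data.table symbols are easier to follow with a tiny example. This is only a sketch; the table dt and its columns are hypothetical.

```r
library(data.table)

# Hypothetical toy table.
dt <- data.table(group = c("a", "a", "b"), value = c(1, 2, 10))

# := adds (or modifies) a column by reference, without copying the table.
dt[, double := value * 2]

# .I holds the row numbers, so this returns the rows where value > 1.
rows_over_1 <- dt[, .I[value > 1]]

# .SD is the subset of data for each group; .SDcols limits which columns it holds.
group_means <- dt[, lapply(.SD, mean), by = group, .SDcols = c("value", "double")]

print(dt)
print(rows_over_1)
print(group_means)
```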

readr: Part of the tidyverse, this package is used for reading and parsing data files.

read_csv(): For reading CSV files.
read_excel(): For reading Excel files. Note that this function comes from the readxl package, not readr.
read_table(): For reading whitespace-separated tabular data.
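
Here is a minimal sketch of read_csv(). To keep it self-contained it first writes a small, made-up CSV to a temporary file; in practice you would pass the path to your own file.

```r
library(readr)

# Write a small hypothetical CSV to a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,ana,9.5", "2,ben,7"), path)

# Read it back, stating the column types explicitly instead of letting readr guess.
scores <- read_csv(
  path,
  col_types = cols(
    id    = col_integer(),
    name  = col_character(),
    score = col_double()
  )
)

print(scores)
```

Spelling out col_types is optional, but it makes the import fail loudly if a column is not what you expect.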

sqldf: Allows you to run SQL queries on data frames.

sqldf(): To execute SQL queries on data frames.
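
A quick sketch of sqldf(), assuming a hypothetical sales data frame. The data frame name is used directly as the table name in the query.

```r
library(sqldf)

# Hypothetical toy data.
sales <- data.frame(
  region = c("north", "south", "north"),
  amount = c(100, 50, 25)
)

# Run an ordinary SQL query against the data frame.
totals <- sqldf("
  SELECT region, SUM(amount) AS total
  FROM sales
  GROUP BY region
  ORDER BY region
")

print(totals)
```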

janitor: A package for data cleaning and tabulation.

clean_names(): To clean column names.
remove_empty(): For removing empty rows and columns.
get_dupes(): To find duplicate rows.
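
The three janitor functions can be sketched on a small messy table. The raw data frame below is invented for this example.

```r
library(janitor)

# Hypothetical messy data: awkward column names, an empty row, an empty column.
raw <- data.frame(
  `First Name`  = c("Ana", "Ben", "Ana", NA),
  `Total Score` = c(9, 7, 9, NA),
  Notes         = c(NA, NA, NA, NA),
  check.names = FALSE
)

# Convert column names to snake_case.
cleaned <- clean_names(raw)

# Drop rows and columns that contain only NA.
cleaned <- remove_empty(cleaned, which = c("rows", "cols"))

# List rows that appear more than once (all columns are compared here).
dupes <- get_dupes(cleaned)

print(cleaned)
print(dupes)
```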

stringi: For advanced string processing.

stri_trans_tolower(): Convert text to lowercase.
stri_trans_toupper(): Convert text to uppercase.
stri_replace_all_regex(): For regex-based string replacements.
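
A brief sketch of the stringi functions on a made-up character vector:

```r
library(stringi)

# Hypothetical input with mixed case and repeated spaces.
x <- c("Hello World", "R  is   FUN")

lower <- stri_trans_tolower(x)   # all lowercase
upper <- stri_trans_toupper(x)   # all uppercase

# Collapse every run of whitespace into a single space.
squeezed <- stri_replace_all_regex(x, "\\s+", " ")

print(lower)
print(upper)
print(squeezed)
```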

forcats: Part of the tidyverse, this package deals with factors (categorical variables).

fct_reorder(): Reorder factor levels.
fct_relevel(): Manually move factor levels to a chosen position.
fct_collapse(): Combine factor levels.
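
To make the difference between the forcats functions concrete, here is a small sketch. The size factor and the price vector are hypothetical.

```r
library(forcats)

# Hypothetical factor; by default the levels are alphabetical.
size  <- factor(c("small", "large", "medium", "small"))
price <- c(5, 20, 10, 7)

# Put the levels in an order chosen by hand.
by_hand <- fct_relevel(size, "small", "medium", "large")

# Order the levels by another variable (median price per level by default).
by_price <- fct_reorder(size, price)

# Merge "medium" and "large" into a single level called "big".
collapsed <- fct_collapse(size, big = c("medium", "large"))

print(levels(by_hand))
print(levels(by_price))
print(collapsed)
```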

zoo: This package is used for handling regular and irregular time series data.

na.locf(): Last observation carried forward for missing values.
na.approx(): Linear interpolation for missing values.
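
The two ways of filling gaps can be compared on a short series. This is just a sketch; the dates and values are made up.

```r
library(zoo)

# Hypothetical daily series with missing values.
z <- zoo(c(1, NA, NA, 4, NA), order.by = as.Date("2024-01-01") + 0:4)

# Carry the last known value forward into each gap.
filled_locf <- na.locf(z)

# Interpolate linearly between known values.
# na.rm = FALSE keeps the trailing NA, which has no later value to interpolate towards.
filled_approx <- na.approx(z, na.rm = FALSE)

print(filled_locf)
print(filled_approx)
```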

tidytext: For text mining and analysis.

unnest_tokens(): Tokenize text data.
anti_join(): Remove stopwords from text data. This is a dplyr function, typically used together with tidytext's stop_words table.
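
A minimal sketch of tokenizing text and dropping stopwords. The docs table is hypothetical; stop_words is the stopword table that ships with tidytext, and anti_join() comes from dplyr.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy documents.
docs <- tibble(
  id   = 1:2,
  text = c("The cat sat on the mat", "A dog and a cat")
)

# One row per word; words are lowercased by default.
tokens <- docs %>%
  unnest_tokens(word, text)

# Keep only the words that do not appear in the stopword table.
content_words <- tokens %>%
  anti_join(stop_words, by = "word")

print(tokens)
print(content_words)
```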
