UNIT-6
Exploring R
Introduction:
- R is an integrated suite of software facilities for data manipulation, calculation and graphical display.
- Among other things it has
-an effective data handling and storage facility,
-a suite of operators for calculations on arrays, in particular matrices,
-a large, coherent, integrated collection of intermediate tools for data analysis
-graphical facilities for data analysis and display either directly at the computer or on hardcopy
-a well developed, simple and effective programming language (called ‘S’) which includes conditionals, loops, user defined recursive functions and input and output facilities. (Indeed most of the system supplied functions are themselves written in the S language.)
- The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.
- R is very much a vehicle for newly developing methods of interactive data analysis. It has developed rapidly, and has been extended by a large collection of packages.
- However, most programs written in R are essentially ephemeral, written for a single piece of data analysis.
- R Software:
- R is a programming language and free software developed by Ross Ihaka and Robert Gentleman in 1993.
- R possesses an extensive catalog of statistical and graphical methods. It includes machine learning algorithms, linear regression, time series, and statistical inference to name a few.
- Most of the R libraries are written in R, but for heavy computational tasks, C, C++ and FORTRAN codes are preferred.
- R is trusted not only in academia; many large companies also use the R programming language, including Uber, Google, Airbnb and Facebook.
- Data analysis with R is done in a series of steps: programming, transforming, discovering, modeling and communicating the results.
- Program: R is a clear and accessible programming tool
- Transform: R is made up of a collection of libraries designed specifically for data science
- Discover: Investigate the data, refine your hypothesis and analyze them
- Model: R provides a wide array of tools to capture the right model for your data
- Communicate: Integrate codes, graphs, and outputs to a report with R Markdown or build Shiny apps to share with the world
Features of R:
The R programming language offers a wide range of useful features.
There are many things R can do for data scientists and analysts. These key features are what set R apart from the crowd of statistical languages:
1. Open-source:
- R is an open-source software environment. It is free of cost and can be adjusted and adapted according to the user’s and the project’s requirements.
- You can make improvements and add packages for additional functionalities.
- R is freely available: you can download it, install it and start practicing right away.
2. Strong Graphical Capabilities
- R can produce static graphics with production quality visualizations and has extended libraries providing interactive graphic capabilities.
- This makes data visualization and data representation very easy.
- From concise charts to elaborate and interactive flow diagrams, all are well within R’s repertoire.
Fig 1: Data Visualization in R
3. Highly Active Community
- R has an open-source library which is supported by its growing number of users.
- The R environment is continuously growing. This growth is due to its large user-base.
4. A Wide Selection of Packages
- CRAN or Comprehensive R Archive Network houses more than 10,000 different packages and extensions that help solve all sorts of problems in data science.
- High-quality interactive graphics, web application development, quantitative analysis or machine learning procedures, there is a package for every scenario available.
- R contains a sea of packages for all the forms of disciplines like astronomy, biology, etc. While R was originally used for academic purposes, it is now being used in industries as well.
5. Comprehensive Environment
- R has a very comprehensive development environment meaning it helps in statistical computing as well as software development.
- R supports object-oriented programming. It also has a robust package called Shiny, which can be used to produce full-fledged web apps.
- Combined with data analysis and data visualization, R can be used for highly interactive online data-driven storytelling.
6. Can Perform Complex Statistical Calculations
- R can be used to perform simple and complex mathematical and statistical calculations on data objects of a wide variety.
- It can also perform such operations on large data sets.
7. Distributed Computing
- In distributed computing, tasks are split between multiple processing nodes to reduce processing time and increase efficiency.
- R has packages like ddR and multidplyr that enable it to use distributed computing to process large data sets.
8. Running Code without a Compiler
- R is an interpreted language which means that it does not need a compiler to make a program from the code.
- R directly interprets provided code into lower-level calls and pre-compiled code
9. Interfacing with Databases
- R has several packages that enable it to interact with databases, such as ROracle, RODBC (for the Open Database Connectivity protocol) and RMySQL.
10. Data Variety
- R can handle a variety of structured and unstructured data. It also provides various data modeling and data operation facilities due to its interaction with databases.
11. Machine Learning
- R can be used for machine learning as well. The best use of R when it comes to machine learning is in case of exploration or when building one-off models.
12. Data Wrangling
- Data wrangling is the process of cleaning complex and inconsistent data sets to enable convenient computation and further analysis. This is a very time-consuming process.
- R with its extensive library of tools can be used for database manipulation and wrangling.
13. Cross-platform Support
- R is machine-independent. It supports the cross-platform operation. Therefore, it can be used on many different operating systems.
14. Compatible with Other Programming Languages
- While most of its functions are written in R itself, C, C++ or FORTRAN can be used for computationally heavy tasks. Java, .NET, Python, C, C++, and FORTRAN can also be used to manipulate objects directly.
15. Data Handling and Storage
- R works with many common data storage formats, which makes data handling easy.
16. Vector Arithmetic
- Vectors are the most basic data structure in R, and most other data structures are derived from vectors.
- R uses vectors and vector arithmetic and does not need a lot of looping to process a large set of values. This makes R much more efficient.
17. Compatibility with Other Data Processing Technologies
- R can be easily paired with other data processing and distributed computing technologies like Hadoop and Spark. It is possible to remotely use a Spark cluster to process large datasets using R.
- R and Hadoop can be paired as well to combine Hadoop’s large scale data processing and distributing computing capabilities with R’s statistical computing power.
18. Generates Report in any Desired Format
- R’s rmarkdown package covers most report generation needs when working with R. It can help produce web pages.
- It can also generate reports in the form of Word documents or PowerPoint presentations, all with your R code and results embedded in them.
6.1.1. Some Unique Features of R Programming
Due to a large number of packages available, there are many other handy features as well:
- Since R can perform operations directly on vectors, it doesn’t require too much looping.
- R can pull data from APIs, servers, SPSS files, and many other formats.
- R is useful for web scraping.
- It can perform multiple complex mathematical operations with a single command.
- Using R Markdown, it can create attractive reports that combine plain text with code and visualizations of the results.
- Due to a large number of researchers and statisticians using it, new ideas and technologies often appear in the R community first.
Key Takeaways
- The key features of the R programming language are:
- Open-source
- Strong Graphical Capabilities
- Highly Active Community
- A Wide Selection of Packages
- Comprehensive Environment
- Can Perform Complex Statistical Calculations
- Distributed Computing
- Running Code without a Compiler
- Interfacing with Databases
- Data Variety
- Machine Learning
- Data Wrangling
- Cross-platform Support
- Compatible with Other Programming Languages
- Data Handling and Storage
- Vector Arithmetic
- Compatibility with Other Data Processing Technologies
- Generates Report in any Desired Format
- As part of the process of downloading and installing R, you get the standard graphical user interface (GUI), called RGui.
- RGui gives you some tools to manage your R environment — most important, a console window.
- The console is where you type instructions, or scripts, and generally get R to do useful things for you.
6.2.1. Explore the R console
- The standard installation process creates useful menu shortcuts (although this may not be true if you use Linux, because there is no standard RGui editor for Linux).
- In the menu system, look for a folder called R, and then find an icon called R followed by a version number.
- When you open RGui for the first time, you see the R Console screen, which lists some basic information such as your version of R and the licensing conditions.
- Below all this information is the R prompt, denoted by a > symbol. The prompt indicates where you type your commands to R; you see a blinking cursor to the right of the prompt.
- Use the console to issue a very simple command to R, such as a piece of arithmetic. R responds immediately to your command.
- One of the clever things about R is that it can calculate many values at the same time; these are called vector operations. For now, all you need to know is that R can handle more than one value at a time.
- To quit your R session, type q() in the console, after the command prompt (>).
- R asks you a question to make sure that you meant to quit. Click No, because you have nothing to save. This action closes your R session (as well as RGui, if you’ve been using RGui as your code editor).
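The console steps above can be sketched as follows. The numbers are arbitrary illustrations, not taken from the original text, and the quit command is left commented out so that running the example does not close the session.

```r
# A simple calculation typed at the prompt; R prints the result immediately
simple_sum <- 24 + 7 + 11
print(simple_sum)

# A vector operation: add 6 to each of the values 1 to 5 in a single step
shifted <- 1:5 + 6
print(shifted)

# To end the session you would type q() at the prompt.
# It is commented out here so that the example does not close R.
# q()
```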
6.2.2. Windows Installation
- You can download the Windows installer version of R from R-3.2.2 for Windows (32/64 bit) and save it in a local directory.
- The Windows installer (.exe) has a name of the form "R-version-win.exe". You can just double-click it and run the installer, accepting the default settings.
- If your Windows is a 32-bit version, it installs the 32-bit version. If your Windows is 64-bit, it installs both the 32-bit and 64-bit versions.
- After installation, you can locate the icon to run the program in the directory structure "R\R3.2.2\bin\i386\Rgui.exe" under the Windows Program Files. Clicking this icon brings up the R GUI, which provides the R console for R programming.
6.2.3. Linux Installation
- R is available as a binary for many versions of Linux at the location R Binaries.
- The instructions for installing R on Linux vary from flavor to flavor. These steps are mentioned under each type of Linux version in the mentioned link. However, if you are in a hurry, then you can use the yum command to install R as follows:
$ yum install R
- The above command installs the core functionality of R along with the standard packages. If you need an additional package, launch the R prompt by running R from the shell.
- You can then use the install.packages() function at the R prompt to install the required package.
Key Takeaways
- As part of the process of downloading and installing R, you get the standard graphical user interface (GUI), called RGui.
- RGui gives you some tools to manage your R environment — most important, a console window.
- You can download the Windows installer version of R from R-3.2.2 for Windows (32/64 bit) and save it in a local directory.
- R is available as a binary for many versions of Linux at the location R Binaries.
- The instructions for installing R on Linux vary from flavor to flavor. These steps are mentioned under each type of Linux version in the mentioned link. However, if you are in a hurry, then you can use the yum command to install R as follows:
$ yum install R
- The Windows installer (.exe) has a name of the form "R-version-win.exe". You can just double-click it and run the installer, accepting the default settings.
- If your Windows is a 32-bit version, it installs the 32-bit version. If your Windows is 64-bit, it installs both the 32-bit and 64-bit versions.
A vector is a sequence of elements that share the same data type. These elements are known as components of a vector.
R vectors come in two kinds: atomic vectors and lists. They have three common properties:
- Type, reported by typeof() – what it is.
- Length, reported by length() – how many elements it contains.
- Attributes, reported by attributes() – extra arbitrary metadata.
The two kinds differ in the type of their elements: all elements of an atomic vector must be of the same type, whereas the elements of a list can have different types.
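As a minimal sketch of these three properties, the example below builds one atomic vector and one list and inspects them. The variable names and values are made up for illustration.

```r
# An atomic vector: every element has the same type
atomic_vec <- c(1.5, 2.5, 3.5)

# A list: elements may have different types
mixed_list <- list(1.5, "two", TRUE)

# Type and length of each
typeof(atomic_vec)
typeof(mixed_list)
length(atomic_vec)
length(mixed_list)

# Attributes hold extra metadata, such as element names
names(atomic_vec) <- c("a", "b", "c")
attributes(atomic_vec)

# Mixing types inside c() does not fail: the values are coerced to one common type
coerced <- c(1, "two", TRUE)
typeof(coerced)
```

The last step shows why a list is needed for mixed content: c() silently converts everything to a single type.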
6.3.1. Atomic Vectors in R
There are four common types of R atomic vectors:
1. Numeric Data Type
Decimal values are referred to as numeric data types in R. If we assign a decimal value to a variable g, for example g <- 62.4, then g will be of numeric type.
2. Integer Data Type
A numeric value with no fractional part is integer data; R reports its type as “integer”. -54 and 23 are two examples of integers. R stores integers as 32-bit (4-byte) values.
In order to assign an integer to a variable, there are two ways:
- The first way is to use the as.integer() function, for example as.integer(23)
- The second way is to append L to the value, for example 23L
3. Character Data Type
Character data holds text strings. There are two ways to create a character data type value in R:
- The first method is to type a string between quotation marks " "
- The second is to convert a number into character with the as.character() function
4. Logical Data Type
A logical value is either TRUE or FALSE, typically the result of testing whether a condition is satisfied.
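The four atomic types can be sketched as below. The variable names and values are illustrative assumptions; class() is used to show how R reports each type.

```r
g  <- 62.4                 # numeric (stored as a double)
i1 <- as.integer(23)       # integer, created with as.integer()
i2 <- 23L                  # integer, created with the L suffix
s1 <- "hello"              # character, typed between quotation marks
s2 <- as.character(62.4)   # a number converted to character
b  <- g > 50               # logical, the result of testing a condition

class(g)
class(i1)
class(i2)
class(s1)
class(s2)
class(b)
```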
6.3.2. Creating Vectors in R:
The c() function is used for creating a vector in R. This function returns a one-dimensional array, also known as vector.
There are several other ways of creating a vector:
1. Using the colon (:) operator
2. Using the seq() function
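A short sketch of the three ways of creating a vector mentioned above, using made-up values:

```r
# c() combines individual values into a vector
v1 <- c(2, 4, 6, 8)

# The colon operator builds a sequence of consecutive integers
v2 <- 1:5

# seq() builds a sequence with a chosen step ...
v3 <- seq(1, 10, by = 3)

# ... or with a chosen number of elements
v4 <- seq(0, 1, length.out = 5)

print(v1)
print(v2)
print(v3)
print(v4)
```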
6.3.3. Access Elements of R Vectors:
With the help of vector indexing, we can access the elements of vectors. Indexing denotes the position where the values in a vector are stored. This indexing can be performed with an integer, character or logical vector.
1. Indexing with Integer Vector
Unlike many programming languages such as Python, C++ and Java, where indexing starts from 0, the indexing of vectors in R starts with 1. We can perform indexing by specifying an integer value in square brackets [ ] next to our vector.
2. Indexing with Character Vector
Character vector indexing can be done when the elements of the vector are named: the name of the element is given in the square brackets.
3. Indexing with Logical Vector
In logical indexing, the elements at positions where the logical vector is TRUE are returned. For example, indexing a three-element vector with c(TRUE, FALSE, TRUE) returns the elements at positions 1 and 3.
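The three indexing styles can be sketched on one small named vector. The vector scores and its names are hypothetical.

```r
# A named numeric vector
scores <- c(math = 90, art = 75, music = 82)

# Integer indexing: positions start at 1
second <- scores[2]
first_and_third <- scores[c(1, 3)]

# Character indexing: select by element name
art_score <- scores["art"]

# Logical indexing: keep the positions marked TRUE
picked <- scores[c(TRUE, FALSE, TRUE)]

# A logical vector is often produced by a condition
high <- scores[scores > 80]

print(second)
print(picked)
print(high)
```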
6.3.4. Operations in R Vector:
1. Combining Vector in R
2. Arithmetic Operations on Vectors in R
3. Logical Index Vector in R
4. Numeric Index
5. Duplicate Index
6. Range Indexes
7. Out-of-order Indexes
8. Named Vectors Members
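The operations listed above can each be sketched in a line or two. The vectors below are made-up examples, and the numbering in the comments follows the list.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")

# 1. Combining: numeric values are coerced to character to share one type
combined <- c(n, s)

# 2. Arithmetic works element by element
a <- c(1, 3, 5, 7)
b <- c(1, 2, 4, 8)
total <- a + b
product <- a * b

# 3. Logical index: keep positions marked TRUE
by_logical <- s[c(FALSE, TRUE, FALSE, TRUE, FALSE)]

# 4. Numeric index: one position, or a negative index to drop a position
third <- s[3]
without_third <- s[-3]

# 5. Duplicate index: the same position may be requested more than once
repeated <- s[c(2, 3, 3)]

# 6. Range index: consecutive positions via the colon operator
middle <- s[2:4]

# 7. Out-of-order index: positions in any order
shuffled <- s[c(2, 1, 3)]

# 8. Named members: assign names, then select by name
v <- c("Mary", "Sue")
names(v) <- c("First", "Last")
first_name <- v["First"]
```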
Key Takeaways
- A vector is a sequence of elements that share the same data type. These elements are known as components of a vector.
- R vectors come in two kinds: atomic vectors and lists.
- There are four common types of R atomic vectors:
1. Numeric Data Type
2. Integer Data Type
3. Character Data Type
4. Logical Data Type
- The c() function is used for creating a vector in R. This function returns a one-dimensional array, also known as vector.
- With the help of vector indexing, we can access the elements of vectors. Indexing denotes the position where the values in a vector are stored. This indexing can be performed with an integer, character or logical vector.
- As a beginning R user, it's OK to consider your workspace "real". Very soon, I urge you to evolve to the next level, where you consider your saved R scripts as "real". (In either case, of course the input data is very much real and requires preservation!)
- With the input data and the R code you used, you can reproduce everything.
- You can make your analysis fancier. You can get to the bottom of puzzling results and discover and fix bugs in your code.
- You can reuse the code to conduct similar analyses in new projects. You can remake a figure with a different aspect ratio or save it as TIFF instead of PDF. Etc etc.
- First, let's imagine that you regard your workspace as "real". You save it and reload it over and over again (consciously or unconsciously). It's probably heartbreaking when R or your whole machine crashes and you need to start over.
- You're going to either redo a lot of typing (making mistakes all the way) or will have to mine your R history for the commands you used. Rather than becoming an expert on managing the R history, a better use of your time and psychic energy is to keep your "good" R code in a script for future reuse.
- But, because it can be useful sometimes, go ahead and note that the commands you've recently executed appear in the History tab of the upper right pane.
- You don't have to choose right now and the two strategies are not incompatible. First, let's demo the save / reload the workspace approach.
- Upon quitting R, you have to decide if you want to save your workspace, for potential restoration the next time you launch R. Depending on your set up, R or your IDE, eg RStudio, will probably prompt you to make this decision.
- Before proceeding, make sure your workspace contains a few objects. If you cleaned out your workspace above, you could find some assignments in your command history and use the "To Console" button or copy/paste to resubmit.
- Quit R/RStudio, either from the menu, using a keyboard shortcut, or by typing q() in the Console. You'll get a prompt like this:
Save workspace image to ~/.RData?
- Note where the workspace image is to be saved and then click Save. This will probably happen in your home directory, but the exact details will be machine- and OS-dependent.
- Using your favorite method, visit the directory where the image was saved and verify there is a file named .RData with a very recent modification timestamp. It's a binary file, specific to R, so nothing good will come of trying to open and view this file in, e.g., a text editor.
- You will also see a file .Rhistory, holding the commands submitted in your recent session. This is plain text and feel free to open and view it.
- Restart RStudio. In the Console you will see a line like this: [Workspace loaded from ~/.RData] indicating that your workspace has been restored. Look in the Workspace pane and you'll see the same objects as before.
- In the History tab of the same pane, you should also see your command history. You're back in business. This way of starting and stopping analytical work will not serve you well for long but it's a start.
6.4.1. Working Directory:
- Any process running on your computer has a notion of its "working directory". In R, this is where R will look, by default, for files you ask it to load. It is also where, by default, any files you write to disk will go.
- Chances are your current working directory is the directory we inspected above, i.e. the one where RStudio wanted to save the workspace, which is probably also your home directory.
- You can explicitly check your working directory with:
getwd()
- It is also displayed at the top of the RStudio console.
- As a beginning R user, it's OK to let your home directory or any other weird directory on your computer be R's working directory.
- Very soon, I urge you to evolve to the next level, where you organize your analytical projects into directories and, when working on a project, set R's working directory to the associated directory.
- Although I do not recommend it, in case you're curious, you can set R's working directory at the command line like so:
setwd("~/myCoolProject")
- Although I do not recommend it, you can also use RStudio's Files pane to navigate to a directory and then set it as working directory from the menu: Session --> Set Working Directory --> To Files Pane Location. (You'll see even more options there). Or within the Files pane, choose More and Set As Working Directory.
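For the curious, here is a minimal sketch of checking and changing the working directory from code. It assumes a throwaway project folder under tempdir(); the folder name is hypothetical, and the original directory is restored at the end so the example leaves no side effects.

```r
# Remember where we started
old_wd <- getwd()

# Create a throwaway project directory (hypothetical name, under tempdir())
project_dir <- file.path(tempdir(), "myCoolProject")
dir.create(project_dir, showWarnings = FALSE)

# Switch to it and confirm
setwd(project_dir)
current_name <- basename(getwd())
print(current_name)

# Go back to the original working directory
setwd(old_wd)
```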
6.4.2. Shortcuts:
To access a menu displaying all the shortcuts in RStudio you can use Option + Shift + K on a Mac (Alt + Shift + K on Windows and Linux). Within RStudio you can also access them in the Help menu » Keyboard Shortcuts.
Key Takeaways
- Any process running on your computer has a notion of its "working directory". In R, this is where R will look, by default, for files you ask it to load. It is also where, by default, any files you write to disk will go.
- Chances are your current working directory is the directory we inspected above, i.e. the one where RStudio wanted to save the workspace, which is probably also your home directory.
- You can explicitly check your working directory with:
getwd()
6.5.1. Functions for Reading Data into R:
There are a few very useful functions for reading data into R.
- read.table() and read.csv() are two popular functions used for reading tabular data into R.
- readLines() is used for reading lines from a text file.
- source() is a very useful function for reading in R code files from another R program.
- dget() is also used for reading in R code files; it is the inverse of dput().
- load() is used for reading in saved workspaces.
- unserialize() is used for reading single R objects in binary format.
6.5.2. Functions for Writing Data to Files:
There are similar functions for writing data to files
- write.table() is used for writing tabular data to text files (e.g. CSV).
- writeLines() is useful for writing character data line-by-line to a file or connection.
- dump() is a function for dumping a textual representation of multiple R objects.
- dput() is used for outputting a textual representation of an R object.
- save() is useful for saving an arbitrary number of R objects in binary format to a file.
- serialize() is used for converting an R object into a binary format for outputting to a connection (or file).
6.5.3. Reading Data Files with read.table():
The read.table() function is one of the most commonly used functions for reading data in R. To get the help file for read.table(), just type ?read.table in the R console.
The read.table() function has a few important arguments:
- file, the name of a file, or a connection
- header, a logical indicating if the file has a header line
- sep, a string indicating how the columns are separated
- colClasses, a character vector indicating the class of each column in the dataset
- nrows, the number of rows in the dataset. By default read.table() reads an entire file.
- comment.char, a character string indicating the comment character. This defaults to “#”. If there are no commented lines in your file, it’s worth setting this to be the empty string “”.
- skip, the number of lines to skip from the beginning
- stringsAsFactors, should character variables be coded as factors? In versions of R before 4.0.0 this defaulted to TRUE because back in the old days, if you had data that were stored as strings, it was because those strings represented levels of a categorical variable. Since R 4.0.0 the default is FALSE.
- Now we have lots of data that is text data and they don’t always represent categorical variables. So in older versions of R you may want to set this to be FALSE in those cases. If you always want this to be FALSE there, you can set a global option via options(stringsAsFactors = FALSE).
- I’ve never seen so much heat generated on discussion forums about an R function argument as about the stringsAsFactors argument.
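The arguments above can be sketched on a small file written to a temporary location. The file contents and column names are made up for illustration; only read.table() and read.csv() themselves come from the text.

```r
# Write a tiny comma-separated file to a temporary path (hypothetical data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age,city",
             "Asha,31,Pune",
             "Ben,27,Leeds"), csv_path)

# Read it back, stating the header, separator and column classes explicitly
people <- read.table(csv_path,
                     header = TRUE,
                     sep = ",",
                     colClasses = c("character", "integer", "character"),
                     nrows = 2,
                     stringsAsFactors = FALSE)
str(people)

# read.csv() is read.table() with defaults suited to comma-separated files
same_people <- read.csv(csv_path, stringsAsFactors = FALSE)
```

Stating colClasses and nrows is optional for a file this small; on large files it can save R the work of guessing column types.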
6.5.4. readLines() and writeLines() function in R:
The readLines() function is mainly used for reading lines from a text file. The writeLines() function is useful for writing character data line-by-line to a file or connection.
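A minimal round trip with these two functions might look like this, using a temporary file and made-up text:

```r
# Write three lines of text to a temporary file
txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), txt_path)

# Read every line back as a character vector, one element per line
all_lines <- readLines(txt_path)

# Read only the first two lines
first_two <- readLines(txt_path, n = 2)

print(all_lines)
print(first_two)
```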
6.5.5. dput() and dget() Function in R:
- You can create a more descriptive representation of an R object by using the dput() or dump() functions.
- Unlike writing out a table or CSV file, dump() and dput() preserve the metadata, so that another user doesn’t have to specify it all over again. For example, we can preserve the class of each column of a table or the levels of a factor variable.
- You can dump() several R objects to a file by passing their names as a character vector.
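As a sketch of how the metadata survives, the example below writes a small data frame with dput(), reads it back with dget(), and then dumps two objects by name. All object names and file paths are illustrative assumptions.

```r
# A data frame with a factor column (hypothetical data)
df <- data.frame(id = 1:3, grade = factor(c("a", "b", "a")))

# dput() with no file prints the textual representation to the console
dput(df)

# Write that representation to a file, then read it back with dget()
dput_path <- tempfile(fileext = ".R")
dput(df, file = dput_path)
df_copy <- dget(dput_path)

# The column classes and factor levels are preserved
str(df_copy)

# dump() takes the NAMES of one or more objects as a character vector
x <- "foo"
y <- data.frame(a = 1L, b = "a")
dump_path <- tempfile(fileext = ".R")
dump(c("x", "y"), file = dump_path)
```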
6.5.6. source() Function in R:
The inverse of dump() is the source() function. A file written by dump(), such as dump_data.R, can be read back into R with source(), which recreates the dumped objects.
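A minimal sketch of the dump() and source() round trip follows. The file name dump_data.R comes from the text; its location under tempdir() and the objects x and y are assumptions made so the example is self-contained.

```r
# Two objects to export (hypothetical)
x <- "foo"
y <- data.frame(a = 1L, b = "a")

# Dump both objects, by name, into dump_data.R in the temporary directory
dump_path <- file.path(tempdir(), "dump_data.R")
dump(c("x", "y"), file = dump_path)

# Remove them, then recreate them by sourcing the dumped file
rm(x, y)
source(dump_path)

print(x)
str(y)
```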
6.5.7. Binary Formats in R:
- The complement to the textual format is the binary format. Binary format is sometimes useful for efficiency purposes. Sometimes there is no useful way to represent your data in a textual manner; in that case the binary format helps to import and export data in R.
- The main functions for converting R objects into a binary format are save(), save.image(), and serialize(). Individual R objects can be saved to a file using the save() function.
- If you have a lot of objects that you want to save to a file in one run, you can save all objects in your workspace using the save.image() function.
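The following sketch saves two objects with save(), restores them with load(), and then writes the whole workspace with save.image(). The object names and temporary file paths are illustrative.

```r
# Two objects to save (hypothetical data)
a <- data.frame(x = rnorm(100), y = runif(100))
b <- c(3, 4.4, 1 / 3)

# Save both objects to one binary file
rdata_path <- tempfile(fileext = ".RData")
save(a, b, file = rdata_path)

# Remove them, then restore them; load() returns the names it restored
rm(a, b)
loaded_names <- load(rdata_path)
print(loaded_names)

# save.image() writes every object in the workspace in one call
image_path <- tempfile(fileext = ".RData")
save.image(file = image_path)
```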
6.5.8. serialize() and unserialize() function in R:
- The serialize() function is used to convert individual R objects into a binary format that can be communicated across an arbitrary connection.
- When you call serialize() on an R object, the output will be a raw vector coded in hexadecimal format.
- The benefit of the serialize() function is that it is the only way to perfectly represent an R object in an exportable format, without losing precision or any metadata. If that is what you need, then serialize() is the function for you.
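A minimal sketch of serializing an object to a raw vector and restoring it. Passing connection = NULL asks serialize() to return the bytes instead of writing them to a connection; the object itself is a made-up example.

```r
# An object to serialize (hypothetical)
obj <- list(1, 2, 3)

# With connection = NULL, serialize() returns a raw vector
raw_bytes <- serialize(obj, connection = NULL)
class(raw_bytes)

# Printing a raw vector shows its bytes in hexadecimal
head(raw_bytes)

# unserialize() turns the bytes back into the original object
restored <- unserialize(raw_bytes)
identical(obj, restored)
```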
6.5.9. saveRDS() and readRDS() in R:
- Now you are familiar with save() and load() function in R. They allow you to save a named R object to a file or other connection and restore that object again.
- When loaded the named object is restored to the current environment with the same name it had when saved.
- This is annoying for example when you have a saved model object resulting from a previous fit and you want to compare it with the model object returned when the R code is rerun.
- Unless you change the name of the model fit object in your script you can’t have both the saved object and the newly created one available in the same environment at the same time.
- saveRDS() provides a far better solution to this problem and to the general one of saving and loading objects created with R. saveRDS() serializes an R object into a format that can be saved.
- save() does the same thing, but with one important difference: saveRDS() doesn’t save both the object and its name; it just saves a representation of the object.
- As a result, the saved object can be loaded into a named object within R that is different from the name it had when originally serialized.
- Another difference is that save() can save many objects to a file in a single call, whilst saveRDS(), being a lower-level function, works with a single object at a time.
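The model-comparison scenario described above can be sketched as follows. The built-in mtcars data and the two model formulas are assumptions chosen for illustration; the point is that readRDS() lets the saved fit come back under a new name.

```r
# Fit a model and save it (mtcars ships with R; the formula is illustrative)
fit <- lm(mpg ~ wt, data = mtcars)
rds_path <- tempfile(fileext = ".rds")
saveRDS(fit, rds_path)

# Later, the script is rerun and produces a new object with the same name
fit <- lm(mpg ~ wt + hp, data = mtcars)

# readRDS() returns the object, so it can be assigned to any name we like
fit_old <- readRDS(rds_path)

# Both fits are now available side by side
coef(fit_old)
coef(fit)
```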
Key Takeaways
- Reading data functions:
read.table(), read.csv(), readLines(), source(), dget(), load(), unserialize()
- Writing data functions:
write.table(), writeLines(), dump(), dput(), save(), serialize()
- Data structures provide the way to represent data in data analytics. We can manipulate data in R for analysis and visualization.
- One of the most important aspects of computing with data in R is its ability to manipulate data and enable its subsequent analysis and visualization. Let us see few basic data structures in R:
A. Vectors in R
- These are ordered containers of primitive elements and are used for 1-dimensional data.
- Types – integer, numeric, logical, character, complex
B. Matrices in R
- These are rectangular collections of elements and are useful when all data is of a single class, such as numeric or character.
- Dimensions – two; arrays generalize matrices to three or more dimensions.
C. Lists in R
- These are ordered containers for arbitrary elements and are used for higher dimension data, like customer data information of an organization.
- When data cannot be represented as an array or a data frame, list is the best choice. This is so because lists can contain all kinds of other objects, including other lists or data frames, and in that sense, they are very flexible.
D. Data frames
- These are two-dimensional containers for records and variables and are used for representing data from spreadsheets etc. It is similar to a single table in the database.
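The four structures can be sketched side by side. All names and values below are invented for illustration.

```r
# A. Vector: one-dimensional, all elements of one type
v <- c(10, 20, 30)

# B. Matrix: two-dimensional, all elements of one type
m <- matrix(1:6, nrow = 2, ncol = 3)

# C. List: ordered container for arbitrary objects, including other structures
l <- list(name = "Asha", scores = v, grid = m)

# D. Data frame: a table whose columns may have different types
df <- data.frame(id = 1:3, city = c("Pune", "Leeds", "Oslo"))

dim(m)
str(l)
str(df)
```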
6.6.1. Creating Subsets of Data in R
- As we know, data size is increasing exponentially and doing analysis on complete data is very time-consuming. So data is divided into small sized samples and analysis of samples is done. The process of creating samples is called subsetting.
- Different methods of subsetting in R are:
$
The dollar sign operator selects a single element of data. When you use this operator with a data frame, the result is always a vector.
[[
Similar to $ in R, the double square brackets operator in R also returns a single element, but it offers the flexibility of referring to the elements by position rather than by name. It can be used for data frames and lists.
[
The single square bracket operator in R returns multiple elements of data. The index within the square brackets can be a numeric vector, a logical vector, or a character vector.
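The three operators can be compared on one small data frame. The data frame and its column names are hypothetical.

```r
df <- data.frame(id = 1:4, score = c(88, 92, 79, 95))

# $ selects a single column by name and returns a vector
by_dollar <- df$score

# [[ also returns a single element, by name or by position
by_name <- df[["score"]]
by_position <- df[[2]]

# [ can return several elements; on a data frame it keeps the data frame shape
one_column_df <- df["score"]

# [ with a logical vector selects rows that meet a condition
high_rows <- df[df$score > 85, ]

# [ with numeric row indexes and character column names
picked <- df[c(1, 3), c("id", "score")]
```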
6.6.2. sample() command in R
- As we have seen, samples are created from data for analysis. To create samples, the sample() command is used and the number of samples to be drawn is specified.
- sample() produces random values, so its result changes from one run to the next. That is inconvenient for test code, where the same result is needed every time. If a seed value is set first with set.seed(), the sample() command produces the same random sample on every run.
- The seed value is the starting point for any random number generator formula. It defines both the initialization of the random number generator and the path that the formula will follow.
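A minimal sketch of reproducible sampling: setting the same seed before each call gives the same draw. The seed value 1 and the use of the built-in mtcars data are arbitrary choices for illustration.

```r
# Draw 10 values from 1..6 with replacement, after fixing the seed
set.seed(1)
first_draw <- sample(1:6, 10, replace = TRUE)

# Resetting the same seed reproduces exactly the same draw
set.seed(1)
second_draw <- sample(1:6, 10, replace = TRUE)
identical(first_draw, second_draw)

# Sampling rows of a data frame: pick 5 random row numbers, then subset
row_ids <- sample(nrow(mtcars), 5)
sampled_rows <- mtcars[row_ids, ]
```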
6.6.3. Applications of Subsetting Data
Let us now see few applications of subsetting data in R:
Duplicate data can be removed during analysis using the duplicated() function in R
- The duplicated() function finds duplicate values and returns a logical vector that tells you whether each value is a duplicate of a previous value.
- For all those values which are duplicates in the sample, TRUE is returned.
Missing data can be identified using the complete.cases() function in R
- The complete.cases() command in R is used to find rows which are complete. It gives a logical vector with the value TRUE for rows that are complete, and FALSE for rows that have some NA values.
- Rows which have NA values can be removed using the na.omit() function as below:
Row_name <- na.omit(file_name)
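Both applications can be sketched on small made-up data:

```r
# Duplicates: TRUE marks a value that already appeared earlier in the vector
values <- c(1, 2, 3, 2, 1, 4)
dup_flags <- duplicated(values)
unique_values <- values[!dup_flags]

# Missing data: a data frame with NA in two different rows (hypothetical records)
records <- data.frame(id = 1:4,
                      score = c(10, NA, 30, 40),
                      city = c("Pune", "Leeds", NA, "Oslo"))

# TRUE for rows without any NA
complete_flags <- complete.cases(records)

# Drop every row that contains an NA
clean <- na.omit(records)

print(dup_flags)
print(complete_flags)
print(clean)
```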
6.6.4. Adding Calculated Fields to Data
After you have created the appropriate subset of your data, the next step in your analysis is to perform some calculations. R makes it easy to perform calculations on columns of a data frame because each column is itself a vector.
Let us discuss some variations of the operations performed on data frames in R.
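A. with() function in R
- The with() function allows you to refer to columns inside a data frame without explicitly using the dollar sign or even the name of the data frame itself. It gives the same result as the equivalent expression written with $, but reduces the typing.
B. within() function in R
- The within() function also lets you refer to columns by name, but it is meant for assignments: it returns a modified copy of the data frame, which makes it convenient for adding calculated columns.
- with() and within() are closely related but not interchangeable: with() returns the value of the expression, while within() returns the data frame.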
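A short sketch of adding a calculated field three ways: with the dollar sign, with with(), and with within(). The data frame and column names are invented for illustration.

```r
trees_small <- data.frame(height = c(70, 65, 63),
                          girth = c(8.3, 8.6, 8.8))

# Using $ for every column reference
ratio_dollar <- trees_small$height / trees_small$girth

# with(): same calculation, columns referred to by name only; returns the value
ratio_with <- with(trees_small, height / girth)

# within(): the assignment happens inside the data frame;
# the result is a copy of the data frame with the new column added
trees_small <- within(trees_small, ratio <- height / girth)

print(ratio_with)
print(trees_small)
```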
6.6.5. Creating Subgroups or Bins of Data
- Statisticians often draw histograms to investigate their data. Because binning values in this way is common in statistics, R has some functions for it.
A. cut() function in R
- The cut() function groups values of a variable into larger bins. Given a number of bins, it creates bins of equal width and classifies each element into its appropriate bin.
- For example, asking for three bins gives the result as a factor with three levels.
- By default, the cut() function creates mathematical interval labels for the bins. The label names can also be provided by the user through the labels argument.
- With three user-supplied labels, the result shows three labels in the output.
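The sketch below shows both forms of labelling. It assumes that frost holds the Frost column of the built-in state.x77 matrix; the definition of frost is not shown in this part of the notes, so this is an assumption.

```r
# Assumption: frost is the Frost column of the built-in state.x77 data
frost <- state.x77[, "Frost"]

# Three equal-width bins with the default interval labels
bins_default <- cut(frost, 3, include.lowest = TRUE)
levels(bins_default)   # three interval labels

# The same bins with user-supplied labels
bins_named <- cut(frost, 3, include.lowest = TRUE,
                  labels = c("Low", "Med", "High"))
levels(bins_named)     # "Low" "Med" "High"
```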
B. table() function in R
- To count the number of observations in each level of a factor, the table() command can be used as below:
x <- cut(frost, 3, include.lowest=TRUE, labels=c("Low", "Med", "High"))
table(x)
- The result shows the output as a table containing the number of elements in each factor.
6.6.6. Combining and Merging Datasets in R
If you want to combine data from different sources in R, you can combine different sets of data in three ways:
A. By Adding Columns using cbind() in R
If the two sets of data have the same number of rows, and the order of the rows is identical, then adding columns makes sense. This can be done by using the data.frame() or cbind() function.
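A minimal sketch of adding columns is shown below. The data frames ids and marks are hypothetical and are assumed to list the same observations in the same order.

```r
# Two hypothetical data frames with the same number of rows, in the same order
ids   <- data.frame(id = 1:3, name = c("Ann", "Bob", "Cy"))
marks <- data.frame(score = c(88, 92, 79))

combined <- cbind(ids, marks)   # columns of marks are added to the right
dim(combined)                   # 3 rows, 3 columns
```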
B. By Adding Rows using rbind() function in R
If both sets of data have the same columns and you want to add rows to the bottom, use rbind().
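Adding rows can be sketched in the same way. The data frames batch1 and batch2 are hypothetical and share the same column names.

```r
# Two hypothetical data frames with identical columns
batch1 <- data.frame(id = 1:2, score = c(88, 92))
batch2 <- data.frame(id = 3:4, score = c(79, 85))

all_rows <- rbind(batch1, batch2)   # rows of batch2 are added at the bottom
nrow(all_rows)                      # 4
```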
C. By Combining Data With Different Shapes using merge() function in R
- The merge() function combines data frames by matching values in common columns or row names. In database language, this is usually called joining data.
- The merge() function is useful for merging existing data. You can use merge() to combine data only when certain matching conditions are satisfied.
6.6.7. merge() Function in R
- The merge() function is used to combine data frames.
6.6.8. Different types of merge().
The merge() function allows four ways of combining data:
A. Natural join in R
To keep only rows that match from the data frames, specify the argument all=FALSE
B. Full outer join in R
To keep all rows from both data frames, specify all=TRUE
C. Left outer join in R
To include all the rows of your data frame x and only those from y that match, specify all.x=TRUE
D. Right outer join in R
To include all the rows of your data frame y and only those from x that match, specify all.y=TRUE
The merge() function takes a large number of arguments. The most important ones are:
- x: A data frame
- y: A data frame
- by, by.x, by.y: Names of the columns common to both x and y. By default, it uses columns with common names between the two data frames.
- all, all.x, all.y: Logical values that specify the type of merge. The default value is all = FALSE
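The four kinds of join can be compared on two small data frames. The data frames x and y and their id column are hypothetical, chosen so that only ids 2 and 3 occur in both.

```r
# Hypothetical data frames sharing the key column "id"
x <- data.frame(id = c(1, 2, 3), name  = c("Ann", "Bob", "Cy"))
y <- data.frame(id = c(2, 3, 4), score = c(92, 79, 85))

inner <- merge(x, y, by = "id")                 # natural join: matching ids only
full  <- merge(x, y, by = "id", all = TRUE)     # full outer join: all ids
left  <- merge(x, y, by = "id", all.x = TRUE)   # left outer join: all rows of x
right <- merge(x, y, by = "id", all.y = TRUE)   # right outer join: all rows of y

nrow(inner)   # 2
nrow(full)    # 4
```

Rows without a match on the other side are filled with NA in the columns that come from the other data frame.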
6.6.9. match() function in R
- The R match() function returns the matching positions of two vectors or, more specifically, the positions of the first matches of one vector in the second vector.
index <- match(cold.states$Name, large.states$Name)
- This command looks up each state in cold.states and returns its position in large.states. States that do not occur in large.states give NA.
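The behaviour of match() is easiest to see on two short vectors. The vectors cold and large below are made-up stand-ins for the Name columns of cold.states and large.states.

```r
# Made-up name vectors standing in for cold.states$Name and large.states$Name
cold  <- c("Alaska", "Maine", "Montana")
large <- c("Texas", "Alaska", "California", "Montana")

index <- match(cold, large)          # 2 NA 4
both  <- large[index[!is.na(index)]] # "Alaska" "Montana"
```

Dropping the NA positions before subsetting leaves only the names that occur in both vectors.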
6.6.10. Sorting and Ordering Data in R using sort() and order()
- A common task in data analysis and reporting is sorting information. You can answer many everyday questions with sorted tables of data that tell you the best or worst of specific things.
- For example, parents want to know which school in their area is the best, and businesses need to know the most productive factories or the most lucrative sales areas.
- Let us first create a data frame and then sort it.
some.states <- data.frame(Region = state.region, state.x77)
- This is the command to create the data frame some.states.
some.states <- some.states[1:10, 1:3]
- This creates a subset containing the first ten rows and the first three columns.
- By default, sorting is done in ascending order.
sort(some.states$Population)
- Command to sort Population in ascending order
sort(some.states$Population, decreasing=TRUE)
- Command to sort Population in descending order
- This is how a single vector of data can be sorted in R.
- Data frames can also be sorted as below:
order.pop <- order(some.states$Population)
- The above command returns the order of the elements of Population, that is, the row positions of some.states arranged from the smallest to the largest population.
- Now to sort the data frame in ascending order, the command below is used:
some.states[order.pop, ]
- To sort in descending order, specify decreasing=TRUE as below:
some.states[order(some.states$Population, decreasing=TRUE), ]
- This is how order() and sort() functions are used.
6.6.11. Traversing Data with the Apply() Function in R
- To traverse the data, R uses apply functions. The output of the apply() function depends on the data structure being traversed.
Array or matrix
The apply() function traverses either the rows or columns of a matrix, applies a function to each resulting vector, and returns a vector of summarized results.
List
The lapply() function traverses a list, applies a function to each element, and returns a list of the results. Sometimes it is possible to simplify the resulting list into a matrix or vector. lapply() returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
- The apply() function is used as below:
apply(X, MARGIN, FUN, ...)
- The apply() function takes four arguments as below:
- X: This is the data, an array (or matrix)
- MARGIN: This is a numeric vector that indicates the dimension over which to traverse; 1 means rows and 2 means columns
- FUN: This is the function to apply (for example, sum or mean)
- ... (dots): If the FUN function requires any additional arguments, they can be added here.
- In essence, the apply function allows us to make entry-by-entry changes to data frames and matrices.
- If MARGIN=1, the function accepts each row of X as a vector argument, and returns a vector of the results. Similarly, if MARGIN=2 the function acts on the columns of X. Most impressively, when MARGIN=c(1,2) the function is applied to every entry of X.
- Let us now discuss the variations of the apply() function:
a. lapply() function in R
- We have already seen it above.
b. sapply() function in R
- It works on a list or vector and, where possible, simplifies the result to a vector or matrix instead of returning a list.
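The difference between lapply() and sapply() can be sketched as follows. The list scores is made-up sample data.

```r
# Made-up list of numeric vectors of different lengths
scores <- list(math  = c(90, 80, 70),
               art   = c(60, 75),
               music = c(88, 92, 96, 100))

means_list <- lapply(scores, mean)   # a list with one mean per element
means_vec  <- sapply(scores, mean)   # the same means as a named numeric vector

is.list(means_list)   # TRUE
is.list(means_vec)    # FALSE
```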
c. tapply() function in R
- It is used to create tabular summaries of data.
- This function takes three arguments:
- X: Refers to a vector
- INDEX: Refers to a factor or list of factors
- FUN: Refers to a function
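A short sketch of tapply() with these three arguments is given below. The vectors sales and region are made-up sample data.

```r
# Made-up data: one sales figure per observation and the region it belongs to
sales  <- c(10, 20, 30, 40, 50)
region <- factor(c("East", "West", "East", "West", "East"))

# X = sales, INDEX = region, FUN = sum: total sales per region
totals <- tapply(sales, region, sum)
totals[["East"]]   # 90
totals[["West"]]   # 60
```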
- An illustrative example
Consider the code below:
# Create the matrix
m <- matrix(seq(from=-98, to=100, by=2), nrow=10, ncol=10)
# Return the product of each of the rows
apply(m, 1, prod)
# Return the sum of each of the columns
apply(m, 2, sum)
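The effect of the MARGIN argument is easier to check by hand on a smaller matrix. The matrix m_small below is a made-up example and is not part of the original illustration.

```r
# A small made-up 2 x 3 matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m_small <- matrix(1:6, nrow = 2)

row_sums <- apply(m_small, 1, sum)   # 9 12
col_sums <- apply(m_small, 2, sum)   # 3 7 11

# MARGIN = c(1, 2) applies the function to every single entry
squared <- apply(m_small, c(1, 2), function(v) v^2)
dim(squared)                         # 2 3, same shape as m_small
```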
6.6.12. Introduction to the Formula Interface in R
- The R formula interface allows you to concisely specify which columns to use when fitting a model, as well as the behavior of the model.
- You need the operators when you start building models. Formula notation refers to statistical formulae, as opposed to mathematical formulae.
- The formula operator + means to include a column, not to mathematically add two columns together.
Operator | Example | Meaning |
~ | y ~ x | Model y as a function of x |
+ | y ~ a + b | Include columns a as well as b |
- | y ~ a - b | Include a but exclude b |
: | y ~ a : b | Estimate the interaction of a and b |
* | y ~ a * b | Include columns as well as their interaction |
| | y ~ a | b | Estimate y as a function of a conditional on b |
The table above shows the meanings of the different operators in the formula interface.
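The operators ~, + and * can be tried with the lm() function on the built-in mtcars data set. The choice of lm() and of the columns mpg, wt and hp is an assumption made for illustration; the text above does not name a particular model or data set.

```r
# y ~ a + b: model mpg using the columns wt and hp (no arithmetic addition)
fit_main <- lm(mpg ~ wt + hp, data = mtcars)
names(coef(fit_main))    # "(Intercept)" "wt" "hp"

# y ~ a * b: include both columns as well as their interaction
fit_inter <- lm(mpg ~ wt * hp, data = mtcars)
names(coef(fit_inter))   # "(Intercept)" "wt" "hp" "wt:hp"
```

The extra coefficient wt:hp in the second model is the interaction term added by the * operator.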
6.6.13. Variables in R
The two types of R variables are:
Identifier variables in R
Identifier or ID variables identify the observations. These act as the keys that identify the observations.
Measured variables in R
These represent the measurements to be observed.
6.6.14. Getting started with reshape2 Package in R
- Base R has a function, reshape(), that works fine for reshaping longitudinal data.
- The problem of data reshaping is far more generic than simply dealing with longitudinal data.
- For this reason, the reshape2 package was released. It contains several functions to convert data between long and wide format.
install.packages("reshape2")
- This is the command to install the reshape2 package
library("reshape2")
- This is the command to load the reshape2 package
- The reshape2 package is based on two key functions:
- melt() takes wide-format data and melts it into long-format data.
- dcast() (for data frames) and acast() (for vectors, matrices and arrays) take long-format data and cast it into wide-format data. The plain cast() function belongs to the older reshape package.
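A round trip between the two formats can be sketched as below. The sketch assumes the reshape2 package is already installed. The data frame wide is hypothetical, with id as the identifier variable and height and weight as the measured variables. In reshape2 the casting function for data frames is dcast().

```r
library(reshape2)   # assumes reshape2 has been installed beforehand

# Hypothetical wide-format data: one row per id, one column per measurement
wide <- data.frame(id = 1:2, height = c(150, 160), weight = c(50, 64))

# Wide to long: one row per (id, measured variable) pair
long <- melt(wide, id.vars = "id")
names(long)         # "id" "variable" "value"

# Long back to wide: one column per level of 'variable'
wide_again <- dcast(long, id ~ variable, value.var = "value")
names(wide_again)   # "id" "height" "weight"
```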
Key Takeaways
- Data structures provide the way to represent data in data analytics. We can manipulate data in R for analysis and visualization.
- One of the most important aspects of computing with data in R is its ability to manipulate data and enable its subsequent analysis and visualization.
- Data structures in R:
Data frames, Lists, Vectors, Matrices
- Different methods of subsetting in R are:
1. $
2. [[
3. [
- If you want to combine data from different sources in R, you can combine different sets of data in three ways:
1. By Adding Columns using cbind() in R
2. By Adding Rows using rbind() function in R
3. By Combining Data With Different Shapes using merge() function in R
- The merge() function is used to combine data frames.
- Different types of merge():
A. Natural join in R
B. Full outer join in R
C. Left outer join in R
D. Right outer join in R
- The two types of R variables are:
1. Identifier variables in R
2. Measured variables in R