UNIT 6
Exploring R
Q1) Short note on R-software.
A1)
1. R is a programming language and free software developed by Ross Ihaka and Robert Gentleman in 1993.
2. R possesses an extensive catalog of statistical and graphical methods. It includes machine learning algorithms, linear regression, time series, and statistical inference to name a few.
3. Most of the R libraries are written in R, but for heavy computational tasks, C, C++ and FORTRAN codes are preferred.
4. R is trusted not only in academia; many large companies also use the R programming language, including Uber, Google, Airbnb and Facebook.
5. Data analysis with R is done in a series of steps: programming, transforming, discovering, modeling and communicating the results.
6. Program: R is a clear and accessible programming tool
7. Transform: R is made up of a collection of libraries designed specifically for data science
8. Discover: Investigate the data, refine your hypothesis and analyze them
9. Model: R provides a wide array of tools to capture the right model for your data
10. Communicate: Integrate codes, graphs, and outputs to a report with R Markdown or build Shiny apps to share with the world
Q2) Explain the Basic Features of R.
A2)
R offers many features for data scientists and analysts. These key features set R apart from other statistical languages:
1. Open-source:
a) R is an open-source software environment. It is free of cost and can be adjusted and adapted according to the user’s and the project’s requirements.
b) You can make improvements and add packages for additional functionalities.
c) R is freely available. You can download it, install it and start practicing.
2. Strong Graphical Capabilities
a) R can produce static graphics with production quality visualizations and has extended libraries providing interactive graphic capabilities.
b) This makes data visualization and data representation very easy.
c) From concise charts to elaborate and interactive flow diagrams, all are well within R’s repertoire. Look at the attractive graphical visualizations in R.
Fig 1: Data Visualization in R
3. Highly Active Community
a) R has an open-source library which is supported by its growing number of users.
b) The R environment is continuously growing. This growth is due to its large user-base.
4. A Wide Selection of Packages
a) CRAN or Comprehensive R Archive Network houses more than 10,000 different packages and extensions that help solve all sorts of problems in data science.
b) Whether you need high-quality interactive graphics, web application development, quantitative analysis or machine learning procedures, there is a package available for every scenario.
c) R contains a sea of packages for all the forms of disciplines like astronomy, biology, etc. While R was originally used for academic purposes, it is now being used in industries as well.
5. Comprehensive Environment
a) R has a very comprehensive development environment meaning it helps in statistical computing as well as software development.
b) R supports object-oriented programming. It also has a robust package called Shiny, which can be used to produce full-fledged web apps.
c) Combined with data analysis and data visualization, R can be used for highly interactive online data-driven storytelling.
6. Can Perform Complex Statistical Calculations
a) R can be used to perform simple and complex mathematical and statistical calculations on data objects of a wide variety.
b) It can also perform such operations on large data sets.
7. Distributed Computing
a) In distributed computing, tasks are split between multiple processing nodes to reduce processing time and increase efficiency.
b) R has packages like ddR and multidplyr that enable it to use distributed computing to process large data sets.
8. Running Code without a Compiler
a) R is an interpreted language which means that it does not need a compiler to make a program from the code.
b) R directly interprets provided code into lower-level calls and pre-compiled code
9. Interfacing with Databases
R has several packages that enable it to interact with databases, such as ROracle, RODBC (for the Open Database Connectivity protocol) and RMySQL.
10. Data Variety
R can handle a variety of structured and unstructured data. It also provides various data modeling and data operation facilities due to its interaction with databases.
11. Machine Learning
R can be used for machine learning as well. The best use of R when it comes to machine learning is in case of exploration or when building one-off models.
12. Data Wrangling
a) Data wrangling is the process of cleaning complex and inconsistent data sets to enable convenient computation and further analysis. This is a very time-consuming process.
b) R with its extensive library of tools can be used for database manipulation and wrangling.
13. Cross-platform Support
R is machine-independent. It supports the cross-platform operation. Therefore, it can be used on many different operating systems.
14. Compatible with Other Programming Languages
While most of its functions are written in R itself, C, C++ or FORTRAN can be used for computationally heavy tasks. Java, .NET, Python, C, C++, and FORTRAN can also be used to manipulate objects directly.
15. Data Handling and Storage
R works with most common data storage formats, which makes data handling easy.
16. Vector Arithmetic
a) Vectors are the most basic data structure in R, and most other data structures are derived from vectors.
b) R uses vectors and vector arithmetic and does not need a lot of looping to process a large set of values. This makes R much more efficient.
17. Compatibility with Other Data Processing Technologies
a) R can be easily paired with other data processing and distributed computing technologies like Hadoop and Spark. It is possible to remotely use a Spark cluster to process large datasets using R.
b) R and Hadoop can be paired as well to combine Hadoop’s large scale data processing and distributing computing capabilities with R’s statistical computing power.
18. Generates Report in any Desired Format
a) The rmarkdown package covers most report generation needs when working with R. It can produce web pages.
b) It can also generate reports in the form of Word documents or PowerPoint presentations, all with your R code and results embedded in them.
Q3) What are the unique features of R programming?
A3)
Due to a large number of packages available, there are many other handy features as well:
1. Since R can perform operations directly on vectors, it doesn’t require too much looping.
2. R can pull data from APIs, servers, SPSS files, and many other formats.
3. R is useful for web scraping.
4. It can perform multiple complex mathematical operations with a single command.
5. Using R Markdown, it can create attractive reports that combine plain text with code and visualizations of the results.
6. Due to a large number of researchers and statisticians using it, new ideas and technologies often appear in the R community first.
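The first and fourth points above can be sketched in a few lines. This is a minimal illustration only; the vector prices and its values are hypothetical and do not come from the notes.

```r
# Hypothetical prices, used only for illustration
prices <- c(10, 20, 30, 40)

# Vectorized: one expression is applied to every element
with_tax <- prices * 1.1

# The equivalent explicit loop, shown only for comparison
with_tax_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
  with_tax_loop[i] <- prices[i] * 1.1
}

# A single command summarises the whole vector
total <- sum(prices)
average <- mean(prices)
```

Both approaches give the same values; the vectorized form is shorter and is the idiomatic choice in R.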
Q4) Explain the types of R atomic vector.
A4)
There are four common types of R atomic vectors:
1. Numeric Data Type
Decimal values are referred to as numeric data types in R. If we assign a decimal value to a variable g, for example g <- 62.4, then g is of numeric type.
2. Integer Data Type
A numeric value with no fractional part is called an integer, and its class in R is "integer". -54 and 23 are two examples of integers. R stores an integer as a 32-bit (4-byte) value.
In order to assign an integer to a variable, there are two ways:
a) The first way is to use the as.integer() function
b) The second way is the appending of L to the value
3. Character Data Type
A character value holds a string of text. There are two ways to create a character data type value in R:
a) The first method is to type a string between double quotes " "
b) To convert a number into a character value, use the as.character() function
4. Logical Data Type
A logical data type returns either of the two values – TRUE or FALSE based on which condition is satisfied.
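The four types can be checked with class(). The following is a small sketch; the variable names and values are chosen only for illustration.

```r
g  <- 62.4                 # numeric: a decimal value
i1 <- as.integer(3)        # integer, created with as.integer()
i2 <- 3L                   # integer, created with the L suffix
s1 <- "hello"              # character, typed between double quotes
s2 <- as.character(62.4)   # character, converted from a number
b  <- g > 50               # logical, the result of a comparison

# class() reports the type of each object
types <- c(class(g), class(i1), class(i2), class(s1), class(s2), class(b))
print(types)
```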
Q5) Explain
1. Windows installation of R
2. Linux Installation of R
A5)
1. Windows installation of R:
a) You can download the Windows installer version of R from the link "R-3.2.2 for Windows (32/64 bit)" and save it in a local directory.
b) It is a Windows installer (.exe) with a name of the form "R-version-win.exe", so you can just double-click it and run the installer, accepting the default settings.
c) If your Windows is the 32-bit version, it installs the 32-bit version of R. If your Windows is 64-bit, it installs both the 32-bit and 64-bit versions.
d) After installation you can locate the icon to run the Program in a directory structure "R\R3.2.2\bin\i386\Rgui.exe" under the Windows Program Files. Clicking this icon brings up the R-GUI which is the R console to do R Programming.
2. Linux Installation of R:
a) R is available as a binary for many versions of Linux at the location R Binaries.
b) The instructions for installing R vary from one Linux flavor to another. The steps are listed under each type of Linux version at the link mentioned above. However, if you are in a hurry, you can use the yum command to install R as follows:
c) $ yum install R
d) The above command installs the core functionality of R along with the standard packages. If you still need an additional package, launch the R prompt.
e) At the R prompt, use the install.packages() function to install the required package.
Q6) What are the Applications of subsetting Data.
A6)
1. Duplicate data can be removed during analysis using the duplicated() function in R.
2. The duplicated() function finds duplicate values and returns a logical vector that tells you whether each value is a duplicate of a previous value.
3. TRUE is returned for every value that duplicates an earlier value in the sample.
4. Missing data can be identified using the complete.cases() function in R.
5. complete.cases() is used to find rows that are complete. It returns a logical vector with the value TRUE for rows that are complete and FALSE for rows that contain NA values.
6. Rows that contain NA values can be removed using the na.omit() function as below:
Row_name <- na.omit(file_name)
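A short sketch of the three functions on made-up data follows. The vector values and the data frame survey are hypothetical and serve only to show what each call returns.

```r
# Hypothetical values containing repeats
values <- c(1, 2, 2, 3, 1)

# TRUE marks a value that already appeared earlier in the vector
dup <- duplicated(values)

# Keep only the first occurrence of each value
unique_values <- values[!dup]

# Hypothetical data frame with missing entries
survey <- data.frame(
  id    = 1:4,
  score = c(10, NA, 30, 40),
  group = c("a", "b", NA, "d")
)

# TRUE for rows without any NA
complete <- complete.cases(survey)

# Drop every row that contains an NA
cleaned <- na.omit(survey)
```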
Q7) Explain Basic GUI of R
A7)
1. As part of the process of downloading and installing R, you get the standard graphical user interface (GUI), called RGui.
2. RGui gives you some tools to manage your R environment — most important, a console window.
3. The console is where you type instructions, or scripts, and generally get R to do useful things for you.
4. The standard installation process creates useful menu shortcuts (although this may not be true if you use Linux, because there is no standard RGui editor for Linux).
5. In the menu system, look for a folder called R, and then find an icon called R followed by a version number.
6. When you open RGui for the first time, you see the R Console screen, which lists some basic information such as your version of R and the licensing conditions.
7. Below all this information is the R prompt, denoted by a > symbol. The prompt indicates where you type your commands to R; you see a blinking cursor to the right of the prompt.
8. Use the console to issue a very simple command to R, such as a sum of a few numbers. R responds immediately to your command.
9. One of the clever things about R is that it can calculate many values at the same time, which is called vector operations. All you need to know for now is that R can handle more than one value at a time.
10. To quit your R session, type quit() in the console, after the command prompt (>).
11. R asks you a question to make sure that you meant to quit. Click No, because you have nothing to save. This action closes your R session (as well as RGui, if you’ve been using RGui as your code editor).
Q8) How can you access elements of R vectors?
A8)
With the help of vector indexing, we can access the elements of vectors. An index denotes the position at which a value is stored in a vector. Indexing can be performed with an integer, character or logical vector.
1. Indexing with Integer Vector
Unlike many programming languages such as Python, C++ and Java, where indexing starts from 0, the indexing of vectors in R starts from 1. We can index a vector by specifying an integer value in square brackets [ ] next to the vector name.
2. Indexing with Character Vector
Character vector indexing can be done when the elements of the vector are named: the names are given in the square brackets in place of positions.
3. Indexing with Logical Vector
In logical indexing, the elements whose corresponding position in the logical vector is TRUE are returned. For example, indexing a three-element vector with c(TRUE, FALSE, TRUE) returns the elements at positions 1 and 3, where the logical vector is TRUE.
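The three kinds of indexing can be sketched as follows. The named vector scores is a hypothetical example, not part of the original notes.

```r
# Hypothetical named vector
scores <- c(math = 90, art = 75, music = 60)

# Integer indexing: positions start at 1
second <- scores[2]

# Character indexing: select by element name
music_score <- scores["music"]

# Logical indexing: keep positions where the index is TRUE
first_and_third <- scores[c(TRUE, FALSE, TRUE)]

# A comparison produces a logical vector that can be used as an index
above_70 <- scores[scores > 70]
```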
Q9) Give some of the operation of R vectors
A9)
1. Combining Vector in R
2. Arithmetic Operations on Vectors in R
3. Logical Index Vector in R
4. Numeric Index
5. Duplicate Index
6. Range Indexes
7. Out-of-order Indexes
8. Named Vectors Members
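Each of the operations listed above can be illustrated briefly. This is a sketch with hypothetical vectors a, b, s and v.

```r
a <- c(1, 3, 5)
b <- c(2, 4, 6)

# 1. Combining vectors
combined <- c(a, b)

# 2. Arithmetic is applied element by element
sums     <- a + b
products <- a * b

s <- c("aa", "bb", "cc", "dd", "ee")

# 3. Logical index: TRUE keeps the element
by_logical <- s[c(FALSE, TRUE, FALSE, TRUE, FALSE)]

# 4. Numeric index: a single position
third <- s[3]

# 5. Duplicate index: the same position may be repeated
repeated <- s[c(2, 3, 3)]

# 6. Range index: consecutive positions
middle <- s[2:4]

# 7. Out-of-order index: positions in any order
shuffled <- s[c(2, 1, 3)]

# 8. Named vector members
v <- c(first = "Mary", last = "Sue")
first_name <- v["first"]
```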
Q10) Write the functions for “Reading data in R”.
A10)
There are a few very useful functions for reading data into R.
- read.table() and read.csv() are two popular functions used for reading tabular data into R.
- readLines() is used for reading lines from a text file.
- source() is a very useful function for reading in R code files from another R program.
- dget() is also used for reading in R code files.
- load() is used for reading in saved workspaces.
- unserialize() is used for reading single R objects in binary format.
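A few of these reading functions can be tried without preparing any data file in advance. In the sketch below the text is supplied through the text argument and a text connection, and the script is written to a temporary file; the data and the variable names are hypothetical.

```r
# Hypothetical CSV content, one string per line
csv_lines <- c("name,age", "Asha,31", "Ben,27")

# read.csv() parses tabular text into a data frame
people <- read.csv(text = csv_lines)

# readLines() returns one character string per line
con <- textConnection(csv_lines)
raw_lines <- readLines(con)
close(con)

# source() runs the R code stored in a file
script_path <- tempfile(fileext = ".R")
writeLines("answer <- 6 * 7", script_path)
source(script_path, local = TRUE)
```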
Q11) Write the functions for “Writing data in R”
A11)
There are similar functions for writing data to files
- write.table() is used for writing tabular data to text files (e.g. CSV).
- writeLines() is useful for writing character data line-by-line to a file or connection.
- dump() is a function for dumping a textual representation of multiple R objects.
- dput() is used for outputting a textual representation of an R object.
- save() is useful for saving an arbitrary number of R objects in binary format to a file.
- serialize() is used for converting an R object into a binary format for outputting to a connection (or file).
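Each writing function pairs naturally with a reading function. The round trips below are a minimal sketch using temporary files; the data frame df is hypothetical.

```r
# Hypothetical data frame to write out
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# write.table() / read.csv(): tabular text
csv_path <- tempfile(fileext = ".csv")
write.table(df, csv_path, sep = ",", row.names = FALSE)
df_csv <- read.csv(csv_path)

# writeLines() / readLines(): plain lines of text
txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), txt_path)
lines_back <- readLines(txt_path)

# dput() / dget(): textual representation of one object
dput_path <- tempfile()
dput(df, dput_path)
df_dget <- dget(dput_path)

# save() / load(): binary file holding named objects
rda_path <- tempfile(fileext = ".RData")
save(df, file = rda_path)
restored <- new.env()
load(rda_path, envir = restored)

# serialize() / unserialize(): binary form of a single object
raw_bytes <- serialize(df, connection = NULL)
df_unser <- unserialize(raw_bytes)
```

Loading into a separate environment (restored) avoids overwriting the df that already exists in the session.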
Q12) Give some data structures in R
A12)
One of the most important aspects of computing with data in R is its ability to manipulate data and enable its subsequent analysis and visualization. Let us see few basic data structures in R:
1. Vectors in R
a) These are ordered containers of primitive elements and are used for 1-dimensional data.
b) Types – integer, numeric, logical, character, complex
2. Matrices in R
a) These are rectangular collections of elements and are useful when all data is of a single class, such as numeric or character.
b) Dimensions – two; arrays extend the same idea to three or more dimensions.
3. Lists in R
a) These are ordered containers for arbitrary elements and are used for higher dimension data, like customer data information of an organization.
b) When data cannot be represented as an array or a data frame, list is the best choice. This is so because lists can contain all kinds of other objects, including other lists or data frames, and in that sense, they are very flexible.
4. Data frames in R
These are two-dimensional containers for records and variables and are used for representing data from spreadsheets etc. It is similar to a single table in the database.
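The four structures can be created as follows. This is a small sketch; all names and values are hypothetical.

```r
# Vector: one-dimensional, all elements of one type
v <- c(1.5, 2.5, 3.5)

# Matrix: two-dimensional, all elements of one type
m <- matrix(1:6, nrow = 2, ncol = 3)

# List: elements may have different types and lengths
customer <- list(name = "Asha", orders = c(3, 7), active = TRUE)

# Data frame: rows are records, columns are variables
df <- data.frame(id = 1:3, city = c("Pune", "Delhi", "Goa"))

# Inspect the shape of the rectangular structures
m_dim  <- dim(m)
df_dim <- dim(df)
```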
Q13) Explain different types of merge in R
A13)
The merge() function allows four ways of combining data:
1. Natural join in R
To keep only rows that match from the data frames, specify the argument all=FALSE
2. Full outer join in R
To keep all rows from both data frames, specify all=TRUE
3. Left outer join in R
To include all the rows of your data frame x and only those from y that match, specify all.x=TRUE
4. Right outer join in R
To include all the rows of your data frame y and only those from x that match, specify all.y=TRUE
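The four joins can be compared on two small data frames. The frames x and y below are hypothetical and share the key column id.

```r
# Hypothetical data frames sharing the key column "id"
x <- data.frame(id = c(1, 2, 3), name = c("A", "B", "C"))
y <- data.frame(id = c(2, 3, 4), score = c(80, 90, 70))

# Natural join: only ids present in both (2 and 3)
inner <- merge(x, y, by = "id", all = FALSE)

# Full outer join: every id from either frame (1 to 4)
full <- merge(x, y, by = "id", all = TRUE)

# Left outer join: every id from x (1, 2, 3)
left <- merge(x, y, by = "id", all.x = TRUE)

# Right outer join: every id from y (2, 3, 4)
right <- merge(x, y, by = "id", all.y = TRUE)
```

Rows without a match on the other side are filled with NA, for example the score for id 1 in the left join.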
Q14) Give some operators used in R
A14)
Some of the frequently used operators in R model formulas are:
Operator | Example | Meaning |
~ | y ~ x | Model y as a function of x |
+ | y ~ a + b | Include columns a as well as b |
- | y ~ a - b | Include a but exclude b |
: | y ~ a : b | Estimate the interaction of a and b |
* | y ~ a * b | Include columns a and b as well as their interaction |
| | y ~ a | b | Estimate y as a function of a conditional on b |
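One way to see what the operators +, -, : and * mean is to ask R which terms a formula expands to. The sketch below uses terms() for this; the variable names y, a and b are placeholders and no data is needed.

```r
# Extract the term labels that a formula expands to
term_labels <- function(f) attr(terms(f), "term.labels")

plus_terms  <- term_labels(y ~ a + b)      # a and b
minus_terms <- term_labels(y ~ a + b - b)  # b is removed again
colon_terms <- term_labels(y ~ a:b)        # the interaction only
star_terms  <- term_labels(y ~ a * b)      # a, b and their interaction

print(star_terms)
```

The helper term_labels is defined here only for illustration and is not a built-in function.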
Q15) Explain variables in R
A15)
When reshaping data in R, variables are divided into two types:
1. Identifier variables in R
Identifier or ID variables identify the observations. These act as the keys that identify the observations.
2. Measured variables in R
These represent the measurements to be observed.