It is not the beauty of a building you should look at; it is the construction of the foundation that will stand the test of time. ~ David Allan Coe
This week we focus on getting data from MySQL, HDF5, APIs and the web.
Extracting Data from MySQL
MySQL is one of the most widely used open-source databases. The data is stored in tabular format (similar to data frames in R), with rows representing records and columns representing variables.
Connecting to MySQL
> install.packages("RMySQL")   # install the required package
> library(RMySQL)              # load the package into R
> ucscDb <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.cse.ucsc.edu")   # connect to the database server
> result <- dbGetQuery(ucscDb, "show databases;")   # extract the result of a query
> dbDisconnect(ucscDb)         # disconnect from the database
Accessing tables and fields within a database
> DbCon <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.cse.ucsc.edu", db = "ailMel1")   # connect to the database named ailMel1
> Tables <- dbListTables(DbCon)           # fetch the names of the tables within the database
> Tables                                  # display the table names
> dbListFields(DbCon, "nestedRepeats")    # list all the fields within a table
Extracting a table directly
> DbCon <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.cse.ucsc.edu", db = "ailMel1")   # connect to the database
> Data <- dbReadTable(DbCon, "nestedRepeats")   # read the whole table into a data frame
> head(Data)                                    # display the top 6 rows
Using a SELECT query to fetch rows
> DbCon <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.cse.ucsc.edu", db = "ailMel1")   # connect to the database
> query <- dbSendQuery(DbCon, "SELECT * FROM chainCanFam2;")   # send the query to the server
> Result <- fetch(query)   # fetch the result
> head(Result)             # display the top 6 rows
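If the table is large you may not want to pull every row at once. Here is a small sketch of the usual pattern: resend the query, fetch only the first ten rows, then release the result set and disconnect. The object name top10 is just for illustration; fetch(n = ...), dbClearResult() and dbDisconnect() are standard DBI/RMySQL functions.
> query <- dbSendQuery(DbCon, "SELECT * FROM chainCanFam2;")   # resend the query
> top10 <- fetch(query, n = 10)   # fetch only the first 10 rows of the result set
> dbClearResult(query)            # release the pending result set on the server
> dbDisconnect(DbCon)             # always disconnect when you are done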
Extracting Data from HDF5
HDF5 stands for Hierarchical Data Format v5 and is closely modeled on file systems. In HDF5, a “group” is analogous to a directory, a “dataset” is like a file. HDF5 also uses “attributes” to associate metadata with a particular group or dataset. HDF5 uses ASCII names for these different objects, and objects can be accessed by UNIX-like pathnames, e.g., “/sample1/tempsensor/firsttrial” for a top-level group “sample1”, a subgroup “tempsensor”, and a dataset “firsttrial”.
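As a quick sketch of how such a pathname maps onto rhdf5 calls (the file name sensors.h5 and the random data below are purely illustrative; installing the rhdf5 package itself is covered in the next step):
> library(rhdf5)
> h5createFile("sensors.h5")                          # top-level HDF5 file
> h5createGroup("sensors.h5", "sample1")              # group /sample1
> h5createGroup("sensors.h5", "sample1/tempsensor")   # subgroup /sample1/tempsensor
> h5write(rnorm(10), "sensors.h5", "sample1/tempsensor/firsttrial")   # dataset /sample1/tempsensor/firsttrial
> h5ls("sensors.h5")                                  # list the groups and datasets in the file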
Getting the required packages
> source("https://bioconductor.org/biocLite.R")   # load the Bioconductor installer
> biocLite("rhdf5")                                # install the rhdf5 package
Creating an HDF5 File
> sample <- h5createFile("exampleHDF5.h5")                        # create an HDF5 file
> sample <- h5createGroup("exampleHDF5.h5", "History")            # create a group called History
> sample <- h5createGroup("exampleHDF5.h5", "English")            # create a group called English
> sample <- h5createGroup("exampleHDF5.h5", "English/English1")   # create a group called English1 within English
> h5ls("exampleHDF5.h5")                                          # display all the components of the HDF5 file
Loading Data into Files
> A <- matrix(1901:1910, nrow = 5, ncol = 2)   # generate some data
> h5write(A, "exampleHDF5.h5", "History/A")    # write the matrix into the History group
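h5write() can also handle data frames and vectors. A small sketch writing an illustrative data frame into the English group created above (the object name marks and its contents are made up for the example):
> marks <- data.frame(student = c("A", "B", "C"), score = c(72, 85, 90))   # illustrative data frame
> h5write(marks, "exampleHDF5.h5", "English/marks")                        # write it as a dataset under /English
> h5ls("exampleHDF5.h5")                                                   # the new dataset now appears in the listing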
Reading from an HDF5 file
> result_HDF5 <- h5read("exampleHDF5.h5", "History/A")   # read the dataset back into R
> result_HDF5                                            # display the result
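For large datasets you can read just a slice of a dataset with the index argument of h5read(), and close any open HDF5 handles once you are finished. A brief sketch (the index values below are illustrative, chosen for the 5 x 2 matrix A written earlier):
> h5read("exampleHDF5.h5", "History/A", index = list(1:3, 1))   # read rows 1 to 3 of the first column only
> H5close()                                                     # close all open HDF5 handles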
Extracting Data from the Web
Web scraping means programmatically extracting data from the HTML code of websites.
httr package: the aim of httr is to provide a wrapper around the curl package, customised to the demands of modern web APIs.
Key features:
- Functions for the most important HTTP verbs: GET(), HEAD(), PATCH(), PUT(), DELETE() and POST().
- Automatic connection sharing across requests to the same website (by default, curl handles are managed automatically), cookies are maintained across requests, and an up-to-date root-level SSL certificate store is used.
> install.packages("XML")
> install.packages("httr")
> library(httr)
> library(XML)
> html2 <- GET("https://www.datazar.com/")           # fetch the page
> content2 <- content(html2, as = "text")            # extract the body as a character string
> parsedHtml <- htmlParse(content2, asText = TRUE)   # parse the HTML
> xpathSApply(parsedHtml, "//title", xmlValue)       # pull out the page title
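The same parsed document can be queried for other nodes. For instance, here is a sketch that pulls the href attribute of every link and checks the HTTP status of the request; the XPath expression and xmlGetAttr() usage are standard XML-package idioms rather than part of the original example.
> links <- xpathSApply(parsedHtml, "//a", xmlGetAttr, "href")   # collect the href attribute of every <a> node
> head(links)                                                   # show the first few URLs
> status_code(html2)                                            # check that the request itself succeeded (200 = OK)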
Extracting Data from APIs
An API is a software intermediary that makes it possible for application programs to interact with each other and share data. It’s often an implementation of REST that exposes a specific software functionality while protecting the rest of the application.
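Many public REST APIs return JSON, which httr can fetch and parse directly. A minimal sketch against the public GitHub API is shown below; the endpoint and the field accessed are just an illustration, not part of the Twitter example that follows.
> library(httr)
> resp <- GET("https://api.github.com/users/hadley")   # call a public REST endpoint
> stop_for_status(resp)                                 # raise an error if the request failed
> user <- content(resp, as = "parsed")                  # parse the JSON response into an R list
> user$name                                             # access individual fields of the response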
Getting Data from the Twitter API
Loading the libraries
> library(wordcloud)
> library(ggmap)
> library(stringr)
> library(tm)
> library(RCurl)
> library(dplyr)
> library(plyr)
> library(ROAuth)
> library(twitteR)
Creating a Twitter developer account
Step 1: Go to https://apps.twitter.com/ and create a new application.
Step 2: Once the app is created, open its Keys and Tokens tab.
Step 3: Click the Create Access Token button; you can then view your consumer key, consumer secret, access token and access token secret.
Authentication in R
> consumer_key <- "obtained from twitter"
> consumer_secret <- "obtained from twitter"
> access_token <- "obtained from twitter"
> access_secret <- "obtained from twitter"
> setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
If you want to test your authentication, just try to get some tweets with:
> searchTwitter("iphone")
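The result is a list of status objects; if you would rather work with a data frame, twitteR provides twListToDF(). A short sketch (the search term and the value of n are illustrative):
> tweets <- searchTwitter("iphone", n = 25)   # fetch 25 recent tweets matching the term
> tweets_df <- twListToDF(tweets)             # convert the list of status objects to a data frame
> head(tweets_df$text)                        # look at the tweet text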
For a detailed walkthrough of using APIs for sentiment analysis, please refer to the project here.
References: https://www.wikipedia.org/, https://leanpub.com/u/rdpeng
Mastering the art of Data Preparation II was originally published in the Datazar Blog on Medium.