SQL Saturday statistics – Web Scraping with R and SQL Server
I wanted to check a simple query: how many times has a particular topic been presented, and by how many different presenters?
Sounds interesting, and tackling it should not be a problem; just note that the final numbers may vary, since some text analysis is involved.
First of all, some web scraping to get the information from the SQLSaturday web page. With the R/Python integration in SQL Server, reading the information from the website is a fairly straightforward task:
EXEC sp_execute_external_script
     @language = N'R'
    ,@script = N'
        library(rvest)
        library(XML)
        library(dplyr)

        # URL to schedule
        url_schedule <- ''http://www.sqlsaturday.com/687/Sessions/Schedule.aspx''

        # Read HTML
        webpage <- read_html(url_schedule)

        # Event schedule cells
        schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'')

        # Extract text content
        ht <- html_text(schedule_info)
        df <- data.frame(data = ht)

        # Create empty data frame for results
        df_res <- data.frame(title = c(), speaker = c())

        # Titles and speakers alternate: odd rows hold titles, even rows hold speakers
        for (i in 1:nrow(df)){
            if (i %% 2 != 0) # odd row: title
                print(paste0("title is: ", df$data[i]))
            if (i %% 2 == 0) # even row: speaker
                print(paste0("speaker is: ", df$data[i]))
            df_res <- rbind(df_res, data.frame(title = df$data[i], speaker = df$data[i+1]))
        }

        # Keep every other row, so each title is paired with its speaker
        df_res_new <- df_res[seq(1, nrow(df_res), 2), ]
        OutputDataSet <- df_res_new'
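A side note: when run ad hoc like this, sp_execute_external_script returns the result set with unnamed columns. A minimal sketch of naming them with WITH RESULT SETS, appended to the same EXEC call as above:

EXEC sp_execute_external_script
     @language = N'R'
    ,@script = N'...' -- same R script as above
WITH RESULT SETS ((SqlSatTitle NVARCHAR(500), SQLSatSpeaker NVARCHAR(200)));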
Python offers the BeautifulSoup library, which will do pretty much the same (or even better) job as the rvest and XML packages combined. Nevertheless, once we have the data from a test page (in this case I am reading the Slovenian SQLSaturday 2017 schedule, simply because it is awesome), we can “walk through” the whole website and generate all the needed information.
The SQLSaturday website has every event enumerated, making it very easy to parametrize the web scraping process:
So we will scrape through the last 100 events by simply incrementing the event number; the input parameter will be parsed as:
http://www.sqlsaturday.com/600/Sessions/Schedule.aspx
http://www.sqlsaturday.com/601/Sessions/Schedule.aspx
http://www.sqlsaturday.com/602/Sessions/Schedule.aspx
and so on, regardless of whether the event page exists or not. Results will be returned to the SQL Server database.
Creating a stored procedure will do the job:
USE SqlSaturday;
GO

CREATE OR ALTER PROCEDURE GetSessions
    @eventID SMALLINT
AS

DECLARE @URL VARCHAR(500)
SET @URL = 'http://www.sqlsaturday.com/' + CAST(@eventID AS NVARCHAR(5)) + '/Sessions/Schedule.aspx'
PRINT @URL

DECLARE @TEMP TABLE
(
     SqlSatTitle NVARCHAR(500)
    ,SQLSatSpeaker NVARCHAR(200)
)

DECLARE @RCODE NVARCHAR(MAX)
SET @RCODE = N'
    library(rvest)
    library(XML)
    library(dplyr)
    library(httr)
    library(curl)
    library(selectr)

    # URL to schedule
    url_schedule <- "'

DECLARE @RCODE2 NVARCHAR(MAX)
SET @RCODE2 = N'"

    # Read HTML
    webpage <- html_session(url_schedule) %>% read_html()

    # Event schedule cells
    schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'')

    # Extract text content
    ht <- html_text(schedule_info)
    df <- data.frame(data = ht)

    # Create empty data frame for results
    df_res <- data.frame(title = c(), speaker = c())

    # Titles and speakers alternate: odd rows hold titles, even rows hold speakers
    for (i in 1:nrow(df)){
        if (i %% 2 != 0) # odd row: title
            print(paste0("title is: ", df$data[i]))
        if (i %% 2 == 0) # even row: speaker
            print(paste0("speaker is: ", df$data[i]))
        df_res <- rbind(df_res, data.frame(title = df$data[i], speaker = df$data[i+1]))
    }

    # Keep every other row, so each title is paired with its speaker
    df_res_new <- df_res[seq(1, nrow(df_res), 2), ]
    OutputDataSet <- df_res_new
';

DECLARE @FINAL_RCODE NVARCHAR(MAX)
SET @FINAL_RCODE = CONCAT(@RCODE, @URL, @RCODE2)

INSERT INTO @TEMP
EXEC sp_execute_external_script
     @language = N'R'
    ,@script = @FINAL_RCODE;

INSERT INTO SQLSatSessions (sqlSat, SqlSatTitle, SQLSatSpeaker)
SELECT
     @eventID AS sqlSat
    ,SqlSatTitle
    ,SQLSatSpeaker
FROM @TEMP
Before you run this, just a little environment setup:
USE [master];
GO

CREATE DATABASE SQLSaturday;
GO

USE SQLSaturday;
GO

CREATE TABLE SQLSatSessions
(
     id SMALLINT IDENTITY(1,1) NOT NULL
    ,SqlSat SMALLINT NOT NULL
    ,SqlSatTitle NVARCHAR(500) NOT NULL
    ,SQLSatSpeaker NVARCHAR(200) NOT NULL
)
There you go! Now you can run the stored procedure for a particular event (in this case SQL Saturday Slovenia 2017):
EXECUTE GetSessions @eventID = 687
or you can run this procedure against multiple SQLSaturday events and scrape data from the SQLSaturday.com website in one go.
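For instance, a plain WHILE loop over the event numbers will do; this is a minimal sketch (the 600 to 690 range matches the question below, and the TRY/CATCH, which skips events whose schedule page cannot be scraped, is my assumption):

DECLARE @event SMALLINT = 600;

WHILE @event <= 690
BEGIN
    BEGIN TRY
        EXECUTE GetSessions @eventID = @event;
    END TRY
    BEGIN CATCH
        -- skip events where the schedule page does not exist or cannot be parsed
        PRINT CONCAT('Event ', @event, ' skipped: ', ERROR_MESSAGE());
    END CATCH;
    SET @event = @event + 1;
END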
For the Slovenian SQLSaturday, I get the following list of sessions and speakers:
Please note that if you are running this code behind a firewall or proxy, some additional proxy or firewall configuration might be needed!
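If that is your situation, one option is to set the proxy inside the R script itself via httr (which the stored procedure already loads); a minimal sketch, where the proxy address and port are placeholders:

# add near the top of the R @script; proxy address and port are placeholders
set_config(use_proxy(url = "http://your.proxy.server", port = 8080))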
So, returning to the original question (how many times has Query Store been presented at SQLSaturdays, from SQLSat600 until SQLSat690), here is the frequency table:
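Such a frequency table boils down to a simple aggregate over the collected sessions; a minimal sketch that answers the original question (the LIKE filter is a crude stand-in for the text analysis mentioned at the beginning):

SELECT
     COUNT(*) AS TimesPresented
    ,COUNT(DISTINCT SQLSatSpeaker) AS DistinctSpeakers
FROM SQLSatSessions
WHERE SqlSatTitle LIKE N'%query store%';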
Or presented as a pandas graph:
Query Store is popular, ahead of all the R, Python, and Azure ML topics, but PowerShell is gaining popularity like crazy. Good work, PowerShell people!
As always, the code is available on GitHub.