SAS: “The query requires remerging summary statistics back with the original data”
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Coming from a background writing SQL code directly for “real” RDBMS (Microsoft SQL Server, MySQL, and SQLite), I was initially confused when SAS would give me the following ‘note’ for a simple summary PROC SQL query:
429 proc sql; 430 create table undel_monthly as 431 select 432 year(date) as year, 433 month(date) as month, 434 count(*) as n_undel 435 from 436 comments2 437 group by 438 year(date), 439 month(date) 440 order by 441 year(date), 442 month(date) 443 ; NOTE: The query as specified involves ordering by an item that doesn't appear in its SELECT clause. NOTE: The query requires remerging summary statistics back with the original data. NOTE: Table WORK.UNDEL_MONTHLY created, with 31004 rows and 3 columns.
The same code executes fine in “real” RDBMS. As is often the case, SAS has its own way of doing things. SAS expects its special keyword calculated in each summary/grouping statistic like below. The calculated keyword simply refers to the value calculated in the SELECT section.
proc sql; create table undel_monthly as select year(date) as year, month(date) as month, count(*) as n_undel from comments2 group by calculated year, calculated month order by calculated year, calculated month ; quit;
So how does SAS optimize such a simple “group by” query when sending it to a RDBMS? It doesn’t! Below is a simple query to pull monthly phone call volume from CIC 2.4. Notice in the log I turned on SQL tracing, and SAS shows it pulls individual records instead of letting Microsoft SQL Server produce the summary. SAS complains about an error, but it doesn’t describe it. In SAS’s defense, it is probably confused because the date-type is a date-time type instead of a date-only data type, but Microsoft SQL Server doesn’t support date-only data types. A pass-through query would be faster for large data sets, but then you lose some of the usefulness of PROC SQL.
532 options sastrace=',,,d' sastraceloc=saslog nostsuffix; 533 534 proc sql; 535 create table calls_volume as 536 select 537 year(DatePart(InitiatedDate)) as year, 538 month(DatePart(InitiatedDate)) as month, 539 count(*) as number_of_calls 540 from 541 i3eic.CallDetail 542 where 543 InitiatedDate > '01jan09'd 544 group by 545 calculated year, 546 calculated month 547 ; ODBC: AUTOCOMMIT is NO for connection 10 ODBC: AUTOCOMMIT turned ON for connection id 10 ODBC_4: Prepared: SELECT * FROM "dbo"."CallDetail" ODBC: AUTOCOMMIT is NO for connection 11 SAS_SQL: Unable to convert the query to a DBMS specific SQL statement due to an error. ACCESS ENGINE: SQL statement was not passed to the DBMS, SAS will do the processing. ODBC: COMMIT performed on connection 11. ODBC_5: Prepared: SELECT "InitiatedDate" FROM "dbo"."CallDetail" WHERE ( "InitiatedDate" > {d '2009-01-01' } ) ODBC_6: Executed: Prepared statement ODBC_5 NOTE: Table WORK.CALLS_VOLUME created, with 9 rows and 3 columns. 548 quit; NOTE: PROCEDURE SQL used (Total process time): real time 2.86 seconds user cpu time 1.81 seconds system cpu time 0.31 seconds Memory 34697k
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.