Site icon R-bloggers

SAS: “The query requires remerging summary statistics back with the original data”

[This article was first published on Heuristic Andrew, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Coming from a background writing SQL code directly for “real” RDBMS (Microsoft SQL Server, MySQL, and SQLite), I was initially confused when SAS would give me the following ‘note’ for a simple summary PROC SQL query:

429  proc sql;
430      create table undel_monthly as
431      select
432          year(date) as year,
433          month(date) as month,
434          count(*) as n_undel
435      from
436          comments2
437      group by
438          year(date),
439          month(date)
440      order by
441          year(date),
442          month(date)
443      ;
NOTE: The query as specified involves ordering by an item that doesn't appear in its SELECT clause.
NOTE: The query requires remerging summary statistics back with the original  data.
NOTE: Table WORK.UNDEL_MONTHLY created, with 31004 rows and 3 columns.

The same code executes fine in “real” RDBMS. As is often the case, SAS has its own way of doing things. SAS expects its special keyword calculated in each summary/grouping statistic like below. The calculated keyword simply refers to the value calculated in the SELECT section.

proc sql;
	create table undel_monthly as
	select
		year(date) as year,
		month(date) as month,
		count(*) as n_undel
	from
		comments2
	group by
		calculated year,
		calculated month
	order by
		calculated year,
		calculated month
	;
quit;

So how does SAS optimize such a simple “group by” query when sending it to a RDBMS? It doesn’t! Below is a simple query to pull monthly phone call volume from CIC 2.4. Notice in the log I turned on SQL tracing, and SAS shows it pulls individual records instead of letting Microsoft SQL Server produce the summary. SAS complains about an error, but it doesn’t describe it. In SAS’s defense, it is probably confused because the date-type is a date-time type instead of a date-only data type, but Microsoft SQL Server doesn’t support date-only data types. A pass-through query would be faster for large data sets, but then you lose some of the usefulness of PROC SQL.

532  options sastrace=',,,d' sastraceloc=saslog nostsuffix;
533
534  proc sql;
535      create table calls_volume as
536      select
537          year(DatePart(InitiatedDate)) as year,
538          month(DatePart(InitiatedDate)) as month,
539          count(*) as number_of_calls
540      from
541          i3eic.CallDetail
542      where
543          InitiatedDate > '01jan09'd
544      group by
545          calculated year,
546          calculated month
547      ;
ODBC: AUTOCOMMIT is NO for connection 10
ODBC: AUTOCOMMIT turned ON for connection id 10

ODBC_4: Prepared:
SELECT * FROM "dbo"."CallDetail"

ODBC: AUTOCOMMIT is NO for connection 11
SAS_SQL:  Unable to convert the query to a DBMS specific SQL statement due to
an error.
ACCESS ENGINE:  SQL statement was not passed to the DBMS, SAS will do the
processing.
ODBC: COMMIT performed on connection 11.

ODBC_5: Prepared:
SELECT  "InitiatedDate"  FROM "dbo"."CallDetail"  WHERE  ( "InitiatedDate" > {d
'2009-01-01' } )

ODBC_6: Executed:
Prepared statement ODBC_5

NOTE: Table WORK.CALLS_VOLUME created, with 9 rows and 3 columns.

548  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           2.86 seconds
      user cpu time       1.81 seconds
      system cpu time     0.31 seconds
      Memory                            34697k

To leave a comment for the author, please follow the link and comment on their blog: Heuristic Andrew.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.