Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
< size="2" face="arial,helvetica,sans-serif">ore.stepwise for high-performance stepwise regression< >
< size="2" face="arial,helvetica,sans-serif">This guest post from Alexandre Ardelea describes the Oracle R Enterprise deployment in the Oracle micro-processor tools environment.< >< size="2" face="arial,helvetica,sans-serif"> < >
< !--[if gte mso 9]>< xml> < w:WordDocument> < w:View>Normal < w:Zoom>0 < w:TrackMoves/> < w:TrackFormatting/> < w:PunctuationKerning/> < w:ValidateAgainstSchemas/> < w:SaveIfXMLInvalid>false < w:IgnoreMixedContent>false < w:AlwaysShowPlaceholderText>false < w:DoNotPromoteQF/> < w:LidThemeOther>EN-US < w:LidThemeAsian>X-NONE < w:LidThemeComplexScript>X-NONE < w:Compatibility> < w:BreakWrappedTables/> < w:SnapToGridInCell/> < w:WrapTextWithPunct/> < w:UseAsianBreakRules/> < w:DontGrowAutofit/> < w:SplitPgBreakAndParaMark/> < w:DontVertAlignCellWithSp/> < w:DontBreakConstrainedForcedTables/> < w:DontVertAlignInTxbx/> < w:Word11KerningPairs/> < w:CachedColBalance/> < m:mathPr> < m:mathFont m:val="Cambria Math"/> < m:brkBin m:val="before"/> < m:brkBinSub m:val="--"/> < m:smallFrac m:val="off"/> < m:dispDef/> < m:lMargin m:val="0"/> < m:rMargin m:val="0"/> < m:defJc m:val="centerGroup"/> < m:wrapIndent m:val="1440"/> < m:intLim m:val="subSup"/> < m:naryLim m:val="undOvr"/> < ![endif]-->< !--[if gte mso 9]>< xml> < w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 2"/> < w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 3"/> < w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 4"/> < w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 5"/> < w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 6"/> < w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 7"/> < w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 8"/> < w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 9"/> < w:LsdException Locked="false" Priority="39" Name="toc 1"/> < w:LsdException Locked="false" Priority="39" Name="toc 2"/> < w:LsdException Locked="false" Priority="39" Name="toc 3"/> < w:LsdException Locked="false" Priority="39" Name="toc 4"/> < w:LsdException Locked="false" Priority="39" Name="toc 5"/> < w:LsdException Locked="false" Priority="39" Name="toc 6"/> < w:LsdException Locked="false" Priority="39" Name="toc 7"/> < w:LsdException Locked="false" Priority="39" Name="toc 8"/> < w:LsdException Locked="false" Priority="39" Name="toc 9"/> < w:LsdException Locked="false" Priority="35" QFormat="true" Name="caption"/> < w:LsdException Locked="false" Priority="1" Name="Default Paragraph Font"/> < w:LsdException Locked="false" UnhideWhenUsed="false" Name="Placeholder Text"/> < w:LsdException Locked="false" UnhideWhenUsed="false" Name="Revision"/> < w:LsdException Locked="false" Priority="37" Name="Bibliography"/> < w:LsdException Locked="false" Priority="39" QFormat="true" Name="TOC Heading"/> < ![endif]-->< !--[if gte mso 10]> < ![endif]-->
< size="2" face="arial,helvetica,sans-serif">The deployment of ORE within the Oracle micro-processor tools environment introduced a technology that significantly expands our data analysis capabilities and opens the door to new applications.< >
< size="2" face="arial,helvetica,sans-serif">Oracle R Enterprise (ORE) has been recently deployed in the Oracle micro-processor tools environment, replacing a popular commercial tool as a production engine for data analysis. < >< size="2" face="arial,helvetica,sans-serif"> Fit/response models are important components of the simulation flows in the Oracle microprocessor tools environment; such models are used for a variety of purposes ranging from library generation to design yield prediction and optimization. < >< size="2" face="arial,helvetica,sans-serif">Several tools were targeted for the migration to ORE; t< >< size="2" face="arial,helvetica,sans-serif">hese tools are constructed around an iterative loop processing hundreds of data structures. At each iteration, a simulator engine generates data for multiple figures of metrics (targets), and a fit engine is called to construct response models for each target. The fit models are assembled into libraries for subsequent simulation within other flows or used on the fly.< >
< size="2" face="arial,helvetica,sans-serif">A common characteristic of these models is that they need to express strong nonlinear relations between the targets and large sets of explanatory variables. Multiple interactions and many non-linear dependencies are considered as candidate effects for the model construction; they often result from an automatic generation procedure attempting to cover a large variety of anticipated relations between the dependent and independent variables. For example, < >< size="2" face="arial,helvetica,sans-serif" style="color: #000000;">for a case with < >< size="2" face="arial,helvetica,sans-serif" style="color: #000000;">O[10^2] main < >< size="2" face="arial,helvetica,sans-serif" style="color: #000000;">explanatory variables, the < >< size="2" face="arial,helvetica,sans-serif" style="color: #000000;"> total number of candidate regressors, < face="courier new,courier,monospace">nregs_tot< >, could quickly rise to< >< size="2" face="arial,helvetica,sans-serif"> O[10^3-10^4]. Linear regression models with such a high number of terms are not only too expensive to use, as they have, potentially, a large number of terms with negligible coefficients, but are also likely to lead to instability and inaccuracy problems. < >< size="2" face="arial,helvetica,sans-serif"> For example, overfitting is an important concern < style="color: #000000;">with models expressing< > fluctuations in the data rather than < style="color: #000000;">capturing< > the trend. Moreover, < style="color: #000000;">if strong quasi-linear interactions occur between large numbers of regressors, the variance of the calculated (model) coefficients can < >< >< size="2" style="color: #000000;">< face="arial,helvetica,sans-serif">be massively inflated.< >< >
< size="2" face="arial,helvetica,sans-serif">In order to reduce the size of the fit models while retaining significant dependencies, we use stepwise regression. Stepwise regression is an iterative fitting method which builds the model incrementally by adding and eliminating regressors (from a specified candidate set) using a pre-defined selection mechanism/test of statistical significance – the model converges when all regressors retained in the model satisfy the significance test criteria. Commercial and open source regression packages offer various flavors of stepwise regression which differ in multiple ways through the choice of the selection mechanism (F-test, Information Criteria: AIC, BIC, etc), choice of the ‘direction’ (forward, backward, or both), flexibility for specifying the model scope, the significance threshold(s), the handling of interaction terms, etc.< >
< size="2" face="arial,helvetica,sans-serif">ORE has developed a proprietary in-database algorithm for stepwise regression, < face="courier new,courier,monospace">ore.stepwise< >, which complements the functionality of R’s < face="courier new,courier,monospace">step< > and,
especially, offers a very significant performance improvement through faster, scalable algorithms and in-database execution. The basic syntax is the following:
< >
< size="2" face="arial,helvetica,sans-serif"> < face="courier new,courier,monospace"> < >< face="courier new,courier,monospace">ore.stepwise(start_model, DB_data, model_scope, direction, add.p, drop.p)< >
< >
< size="2" face="arial,helvetica,sans-serif">ore.stepwise allows the user to specify a model scope and stepwise direction, and uses the F-test for regressor selection with the < face="courier new,courier,monospace">add.p< > and < face="courier new,courier,monospace">drop.p< > significance levels for adding and removing regressors while the model is iteratively constructed.< >
< size="2" face="arial,helvetica,sans-serif">A typical usage for < face="courier new,courier,monospace">ore.stepwise< > is illustrated below:< >
< size="2" face="arial,helvetica,sans-serif"> < >
< size="2" face="arial,helvetica,sans-serif">< >
list_modnames <- c("mdlA","mdlB",...) Effects_mdlA <- ... Effects_mdlB <- ... ... for(modname in list_modnames){ Upper_model <- as.formula(sprintf("%s ~ %s", modname,...) get(sprintf("Effects_%s", modname))) Lower_model <- as.formula(sprintf("%s ~ 1", modname)) Scope_model <- list(lower = Lower_model, upper = Upper_model) ... ... assign(sprintf("%s_model", modname), ore.stepwise(Lower_model, data = dataDB, scope = Scope_model, direction="both", add.p=..., drop.p=...)) ... The in-database ore frame dataDB contains observations for several target models mdlA,mdlB,..and a list with all target names (list_modnames) is assembled for iterative processing. For each target, a model scope is specified within lower and upper bounds. In the example above the lower bound is the intercept but the upper bound is customized so that each target model can be constructed from its own collection of regressors. The results shown in Figure1 illustrate the performance difference between ore.stepwise in using ORE and base R’s step function for both a bi-linear and a fully quadratic model constructed from 34 independent variables and 10k data observations.
< size="2" face="arial,helvetica,sans-serif"> < >
< size="2" face="arial,helvetica,sans-serif"> < >
< size="2" face="arial,helvetica,sans-serif"> < >
< size="2" face="arial,helvetica,sans-serif">Bilinear model< >
< size="2" face="arial,helvetica,sans-serif">method< > |
< size="2" face="arial,helvetica,sans-serif">R^2< > |
< size="2" face="arial,helvetica,sans-serif">Number of Regressors |
< size="2" face="arial,helvetica,sans-serif">mean(rel_error)< > |
< size="2" face="arial,helvetica,sans-serif">Elapsed Time (seconds)< > |
< size="2" face="arial,helvetica,sans-serif">step< > |
< size="2" face="arial,helvetica,sans-serif">0.9658 < > |
< size="2" face="arial,helvetica,sans-serif">86< > |
< size="2" face="arial,helvetica,sans-serif">3.52e-02 < > |
< size="2" face="arial,helvetica,sans-serif">2110.0< > |
< size="2" face="arial,helvetica,sans-serif">ore.stepwise< > |
< size="2" face="arial,helvetica,sans-serif">0.9966 < > |
< size="2" face="arial,helvetica,sans-serif">124< > |
< size="2" face="arial,helvetica,sans-serif">3.50e-02 < > |
< size="2" face="arial,helvetica,sans-serif">32.1< > |
< size="2" face="arial,helvetica,sans-serif">performance difference< > |
< size="2" face="arial,helvetica,sans-serif"> < > |
< size="2" face="arial,helvetica,sans-serif"> < > |
< size="2" face="arial,helvetica,sans-serif"> < > < face="arial,helvetica,sans-serif">< > |
< size="2" face="arial,helvetica,sans-serif">< face="courier new,courier,monospace">< face="arial,helvetica,sans-serif">ore.stepwise < >< >is approx. 65X faster than step at similar R^2 and relative error as stepwise.< > |
< size="2" face="arial,helvetica,sans-serif"> < >
< size="2" face="arial,helvetica,sans-serif">Quadratic model< >
< size="2" face="arial,helvetica,sans-serif">method< > |
< size="2" face="arial,helvetica,sans-serif">R^2< > |
< size="2" face="arial,helvetica,sans-serif">Number of Regressors |
< size="2" face="arial,helvetica,sans-serif">mean(rel_error)< > |
< size="2" face="arial,helvetica,sans-serif">Elapsed Time (seconds)< > |
< size="2" face="arial,helvetica,sans-serif">step< > |
< size="2" face="arial,helvetica,sans-serif">0.9962 < > |
< size="2" face="arial,helvetica,sans-serif">154< > |
< size="2" face="arial,helvetica,sans-serif">1.05e-02 < > |
< size="2" face="arial,helvetica,sans-serif">12600.0< > |
< size="2" face="arial,helvetica,sans-serif">ore.stepwise< > |
< size="2" face="arial,helvetica,sans-serif">0.9963 < > |
< size="2" face="arial,helvetica,sans-serif">210< > |
< size="2" face="arial,helvetica,sans-serif">1.04e-02 < > |
< size="2" face="arial,helvetica,sans-serif">69.5< > |
< size="2" face="arial,helvetica,sans-serif">performance difference< > |
< size="2" face="arial,helvetica,sans-serif"> < > |
< size="2" face="arial,helvetica,sans-serif"> < > |
< size="2" face="arial,helvetica,sans-serif"> < > |
< size="2" face="arial,helvetica,sans-serif">ore.stepwise is approx. 180X faster than step at similar R^2 relative error.< > |
< size="2" face="arial,helvetica,sans-serif">< size="2">Figure 1: Comparison of results for R’s step function and ORE’s ore.stepwise function for both bi-linear and quadratic models< >< >
< size="2" face="arial,helvetica,sans-serif">
< face="courier new,courier,monospace">ore.stepwise< > is faster than < face="courier new,courier,monospace">R’s step< > by a factor of 66-180X.
The larger the data set and the number
of regressors, we observed greater performance with ore.stepwise compared to
R’s < face="courier new,courier,monospace">step< >. The models produced by R’s < face="courier new,courier,monospace">step< > and < face="courier new,courier,monospace">ore.stepwise< > have a different number of regressors because both the selection mechanisms and interaction terms are handled differently. < face="courier new,courier,monospace">step< > favors the main terms – < face="courier new,courier,monospace">x1:x2< > will be added only if < face="courier new,courier,monospace">x1< > and < face="courier new,courier,monospace">x2< > were previously added, and, reversibly, < face="courier new,courier,monospace">x1:x2< > will be eliminated before < face="courier new,courier,monospace">x1< > and < face="courier new,courier,monospace">x2< > are eliminated, whereas < face="courier new,courier,monospace">ore.stepwise< > does not differentiate between main terms and interactions. With respect to collinearity, < face="courier new,courier,monospace">ore.stepwise< > detects strict linear dependencies and eliminates from start the regressors involved in multi-collinear relations.< >
< size="2" face="arial,helvetica,sans-serif">In summary, the ORE capabilities for stepwise regression < >< size="2" face="arial,helvetica,sans-serif">far surpass similar functionality in tools we considered as alternatives to ORE< >< size="2" face="arial,helvetica,sans-serif">. The deployment of ORE within the Oracle micro-processor tools environment introduced a technology which significantly expands the data analysis capabilities through the R ecosystem combined with in-database high performance algorithms and opens the door to new applications. This technology leverages the flexibility and extensibility of the R environment and allows massive and complex data analysis sustained by the scalability and performance of the Oracle database for Big Data. < >
< size="2" face="arial,helvetica,sans-serif">Alexandre Ardelea is a principal hardware engineer at Oracle Corporation. Alex has a PhD and MS in Physics < >< size="2" face="arial,helvetica,sans-serif">from
Ecole polytechnique fédérale de Lausanne and post-doctoral research in
non-linear physics, CFD and parallel methods. Alex’s specialities
include response surface modeling, optimization strategies for
multi-parametric/objective/constraint problems, statistical process
characterization, circuit analysis and RF algorithms. < >
< size="2" face="arial,helvetica,sans-serif">_______________________________________________________________________________________________________________< >
< size="2" face="tahoma,arial,helvetica,sans-serif">Oracle R Enterprise (ORE) implements a transparency layer on top of the R engine that allows R computations specified in the R environment to be pushed for execution in Oracle Database. A mapping is established between a special R object called an ore frame and a corresponding database table or view, allowing < face="courier new,courier,monospace">< face="arial,helvetica,sans-serif">analytical functions such as < >ore.stepwise< > to be executed on these mapped objects in Oracle Database. The overloaded functions in the ORE packages generate SQL statements in the background, export the expensive computations to the database for execution, and return results to the R environment. Here is a simple example using < face="courier new,courier,monospace">ore.stepwise< > with the longley data, which is shipped in the datasets package with R:< >
# load the Oracle R Enterprise library and connect to Oracle Database R> library(ORE) R> ore.connect(user = "rquser", host = "localhost", sid = "oracle", password = "rquser") R> # push the longley data to a database table: R> LONGLEY <- ore.push(longley) R> class(LONGLEY) [1] "ore.frame" attr(,"package") [1] "OREbase" # Fit full model R> oreFit <- ore.lm(Employed ~ ., data = LONGLEY) R> summary(oreFit) Call: ore.lm(formula = Employed ~ ., data = LONGLEY) Residuals: Min 1Q Median 3Q Max -0.41011 -0.15980 -0.02816 0.15681 0.45539 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.482e+03 8.904e+02 -3.911 0.003560 ** GNP.deflator 1.506e-02 8.492e-02 0.177 0.863141 GNP -3.582e-02 3.349e-02 -1.070 0.312681 Unemployed -2.020e-02 4.884e-03 -4.136 0.002535 ** Armed.Forces -1.033e-02 2.143e-03 -4.822 0.000944 *** Population -5.110e-02 2.261e-01 -0.226 0.826212 Year 1.829e+00 4.555e-01 4.016 0.003037 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.3049 on 9 degrees of freedom Multiple R-squared: 0.9955, Adjusted R-squared: 0.9925 F-statistic: 330.3 on 6 and 9 DF, p-value: 4.984e-10 # perform stepwise variable selection R> oreStep <- ore.stepwise(Employed ~ .^2, data = LONGLEY, add.p = 0.1, drop.p = 0.1) # View a summary of ore.stepwise object R> summary(oreStep) Call: ore.stepwise(formula = Employed ~ .^2, data = LONGLEY, add.p = 0.1, drop.p = 0.1) Residuals: Min 1Q Median 3Q Max 18.42 22.08 24.08 29.40 33.19 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.539e-01 8.455e-03 -41.852 < 2e-16 *** Year 3.589e-05 1.821e-05 1.971 0.0675 . GNP.deflator:GNP -2.978e-03 3.039e-04 -9.800 6.51e-08 *** GNP.deflator:Unemployed 2.326e-04 5.720e-06 40.656 < 2e-16 *** GNP.deflator:Population 2.303e-05 2.293e-06 10.044 4.72e-08 *** GNP:Armed.Forces 6.875e-06 8.415e-07 8.170 6.64e-07 *** GNP:Year 2.007e-04 1.890e-05 10.618 2.26e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.9829 on 15 degrees of freedom Multiple R-squared: 0.9972, Adjusted R-squared: 0.9961 F-statistic: 3242 on 6 and 15 DF, p-value: < 2.2e-16
< size="2" face="arial,helvetica,sans-serif">For more information on how Oracle R Enterprise leverages Oracle Database as computational engine, s< >< size="2" face="arial,helvetica,sans-serif">ee the Oracle R Training Series. Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.< >< size="2" face="arial,helvetica,sans-serif"> We encourage you download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop. As always, we welcome comments and questions on the Oracle R Forum.< >
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.