
solr – an R interface to Solr

[This article was first published on rOpenSci Blog - R, and kindly contributed to R-bloggers.]

A number of the APIs we interact with expose Solr endpoints, e.g., the PLOS full text API (used in rplos) and USGS's BISON API (used in rbison). Solr is an Apache-hosted project and a powerful search server. Since at least two of our data providers, and possibly more in the future, provide Solr endpoints, it made sense to create an R package with robust functions that work across any Solr endpoint. This is useful to us, and hopefully to others.

The following are a few examples covering some of the things you can do in Solr. They fall into six categories: search, groups, facets, highlighting, stats, and more like this.

The solr package generally takes two steps for any query: a) send the request given your inputs, and b) parse the output into a useful R data structure. Part a) is quite easy; part b) is harder. We are working on making parsers as general as possible for each of the data formats returned by group, facet, highlight, etc., but they will still fail in some cases. Please submit bug reports to our issue tracker so we can make the parsers work better.
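To make the two steps concrete, here is a minimal base R sketch. It is not the package's internal code: build_query_url and docs_to_df are hypothetical helpers, example.org is a placeholder host, and the nested list stands in for what a JSON parser might hand back.

```r
# Step a) build a request URL from named parameters (hypothetical helper).
build_query_url <- function(base, params) {
  # Drop parameters the caller left as NULL
  params <- Filter(Negate(is.null), params)
  values <- vapply(
    params,
    function(p) utils::URLencode(as.character(p), reserved = TRUE),
    character(1)
  )
  paste0(base, "?", paste(paste0(names(params), "=", values), collapse = "&"))
}

# Step b) turn a list of documents into a data.frame (hypothetical helper).
# Assumes every document has the same fields, each of length one.
docs_to_df <- function(docs) {
  do.call(rbind, lapply(docs, function(d) {
    as.data.frame(d, stringsAsFactors = FALSE)
  }))
}

request <- build_query_url(
  "http://example.org/search",
  list(q = "ecology", rows = 2, fl = NULL)
)

# Stand-in for a parsed JSON response
docs <- list(
  list(id = "10.1371/journal.pone.0040927/title"),
  list(id = "10.1371/journal.pone.0040927/abstract")
)
parsed <- docs_to_df(docs)
print(request)
print(parsed)
```

The hard part in practice is that real responses are rarely this regular: fields can be missing or multi-valued, which is where a general parser needs the most care.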


Installation

solr is on CRAN, so you can install the more stable version from there, along with some dependencies.

install.packages("solr")
install.packages(c("rjson", "plyr", "httr", "XML", "data.table", "assertthat"))

You can install the development version from GitHub as follows. Below we'll use the GitHub version; most of what follows is available in the CRAN version too, except solr_group.

install.packages("devtools")
library(devtools)
install_github("ropensci/solr")

Load the library

library(solr)

Define url endpoint and key

As solr is a general interface to Solr endpoints, you need to define the url. Here, we'll work with the Public Library of Science full text search API (docs here). Some Solr endpoints require authentication. Note that we don't yet handle authentication schemes other than passing a key in the url, but that's on the to-do list.

url <- "http://api.plos.org/search"

Search

solr_search(q = "*:*", rows = 2, fl = "id", url = url)

## http://api.plos.org/search?q=*:*&start=0&rows=2&fl=id&wt=json

##                                      id
## 1    10.1371/journal.pone.0040927/title
## 2 10.1371/journal.pone.0040927/abstract

Search for words "sports" and "alcohol" within seven words of each other

solr_search(q = "everything:\"sports alcohol\"~7", fl = "title", rows = 3, url = url)

## http://api.plos.org/search?q=everything:"sports alcohol"~7&start=0&rows=3&fl=title&wt=json

##                                                                                                                                                                         title
## 1 “Like Throwing a Bowling Ball at a Battle Ship” Audience Responses to Australian News Stories about Alcohol Pricing and Promotion Policies: A Qualitative Focus Group Study
## 2                                            Development and Validation of a Risk Score Predicting Substantial Weight Gain over 5 Years in Middle-Aged European Men and Women
## 3                                                                                                      Adolescent Lifestyle and Behaviour: A Survey from a Developing Country
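The proximity syntax used above is field:"word1 word2"~N. As a small sketch, the string can be assembled in base R like this; proximity_query is a hypothetical helper, not part of the solr package.

```r
# Build a Solr proximity query of the form field:"w1 w2"~distance.
# Hypothetical helper for illustration only.
proximity_query <- function(field, words, distance) {
  stopifnot(length(words) >= 2, distance >= 0)
  sprintf('%s:"%s"~%d', field, paste(words, collapse = " "), as.integer(distance))
}

q <- proximity_query("everything", c("sports", "alcohol"), 7)
cat(q, "\n")
```

The resulting string is what gets passed as the q argument in the solr_search call above.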

Groups

Most recent publication by journal

solr_group(q = "*:*", group.field = "journal", rows = 5, group.limit = 1, group.sort = "publication_date desc", 
    fl = "publication_date, score", url = url)

## http://api.plos.org/search?group.field=journal&q=*:*&start=0&rows=5&wt=json&group.limit=1&group.sort=publication_date desc&group.sort=publication_date desc&group=true&fl=publication_date, score

##                         groupValue numFound start     publication_date
## 1                         plos one   687745     0 2014-01-27T00:00:00Z
## 2                    plos genetics    33801     0 2014-01-23T00:00:00Z
## 3 plos neglected tropical diseases    19140     0 2014-01-23T00:00:00Z
## 4                     plos biology    24165     0 2014-01-21T00:00:00Z
## 5                             none    62556     0 2012-10-23T00:00:00Z
##   score
## 1     1
## 2     1
## 3     1
## 4     1
## 5     1

First publication by journal

solr_group(q = "*:*", group.field = "journal", group.limit = 1, group.sort = "publication_date asc", 
    fl = "publication_date, score", fq = "publication_date:[1900-01-01T00:00:00Z TO *]", 
    url = url)

## http://api.plos.org/search?group.field=journal&q=*:*&start=0&fq=publication_date:[1900-01-01T00:00:00Z TO *]&wt=json&group.limit=1&group.sort=publication_date asc&group.sort=publication_date asc&group=true&fl=publication_date, score

##                          groupValue numFound start     publication_date
## 1                          plos one   687745     0 2006-12-01T00:00:00Z
## 2                     plos genetics    33801     0 2005-06-17T00:00:00Z
## 3  plos neglected tropical diseases    19140     0 2007-08-30T00:00:00Z
## 4                      plos biology    24165     0 2003-08-18T00:00:00Z
## 5                              none    57574     0 2012-07-17T00:00:00Z
## 6        plos computational biology    25196     0 2005-06-24T00:00:00Z
## 7                     plos medicine    17149     0 2004-09-07T00:00:00Z
## 8                    plos pathogens    29691     0 2005-07-22T00:00:00Z
## 9                  plos collections        5     0 2013-12-04T00:00:00Z
## 10             plos clinical trials      521     0 2006-04-21T00:00:00Z
##    score
## 1      1
## 2      1
## 3      1
## 4      1
## 5      1
## 6      1
## 7      1
## 8      1
## 9      1
## 10     1
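For intuition about what group.limit = 1 combined with group.sort does, the following base R sketch performs the same kind of "one top row per group" selection on a small made-up data frame. Solr does this on the server; the data and the first_per_group helper here are assumptions for illustration.

```r
# Toy records: one row per article (made-up sample, not real API output)
pubs <- data.frame(
  journal = c("plos one", "plos biology", "plos one",
              "plos biology", "plos genetics"),
  publication_date = as.Date(c("2006-12-01", "2003-08-18", "2014-01-27",
                               "2014-01-21", "2005-06-17")),
  stringsAsFactors = FALSE
)

# Sort within the whole table, then keep the first row seen for each journal.
# decreasing = TRUE mimics "publication_date desc", FALSE mimics "asc".
first_per_group <- function(df, decreasing) {
  ordered <- df[order(df$publication_date, decreasing = decreasing), ]
  ordered[!duplicated(ordered$journal), ]
}

latest <- first_per_group(pubs, decreasing = TRUE)
earliest <- first_per_group(pubs, decreasing = FALSE)
print(latest)
print(earliest)
```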

Facet

solr_facet(q = "*:*", facet.field = "journal", facet.query = "cell,bird", url = url)

## http://api.plos.org/search?q=*:*&facet.query=cell&facet.query=bird&facet.field=journal&wt=json&fl=DOES_NOT_EXIST&facet=true

## $facet_queries
##   term value
## 1 cell 81733
## 2 bird  8194
## 
## $facet_fields
## $facet_fields$journal
##                                  X1     X2
## 1                          plos one 687745
## 2                     plos genetics  33801
## 3                    plos pathogens  29691
## 4        plos computational biology  25196
## 5                      plos biology  24165
## 6  plos neglected tropical diseases  19140
## 7                     plos medicine  17149
## 8              plos clinical trials    521
## 9                      plos medicin      9
## 10                 plos collections      5
## 
## 
## $facet_dates
## NULL
## 
## $facet_ranges
## NULL
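Note in the printed URL that facet.query = "cell,bird" became two separate facet.query parameters. A rough base R sketch of that expansion follows; expand_param is a hypothetical helper, and the package's actual implementation may differ.

```r
# Split a comma-separated value into repeated URL parameters.
# Hypothetical helper mirroring what the printed URL shows.
expand_param <- function(name, value) {
  parts <- trimws(strsplit(value, ",", fixed = TRUE)[[1]])
  paste0(name, "=", parts, collapse = "&")
}

expanded <- expand_param("facet.query", "cell,bird")
print(expanded)
# [1] "facet.query=cell&facet.query=bird"
```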

Range faceting with > 1 field

out <- solr_facet(q = "*:*", url = url, facet.range = "counter_total_all,alm_twitterCount", 
    facet.range.start = 5, facet.range.end = 1000, facet.range.gap = 10)

## http://api.plos.org/search?q=*:*&facet.range=counter_total_all&facet.range=alm_twitterCount&facet.range.start=5&facet.range.end=1000&facet.range.gap=10&wt=json&fl=DOES_NOT_EXIST&facet=true

lapply(out$facet_ranges, head)

## $counter_total_all
##   X1  X2
## 1  5   0
## 2 15 421
## 3 25 800
## 4 35 867
## 5 45 690
## 6 55 789
## 
## $alm_twitterCount
##   X1    X2
## 1  5 37537
## 2 15  7940
## 3 25  3559
## 4 35  1627
## 5 45  1416
## 6 55   795
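The X1 column above holds the lower bound of each bucket (5, 15, 25, ...) and X2 the number of documents falling in it. The sketch below reproduces that bucketing locally on made-up values. bucket_starts is a hypothetical helper, and for simplicity values outside [start, end) are dropped, which may not match how Solr treats the last bucket.

```r
# Lower bounds of the buckets implied by start, end and gap
bucket_starts <- function(start, end, gap) {
  n <- ceiling((end - start) / gap)
  start + gap * (seq_len(n) - 1)
}

starts <- bucket_starts(5, 1000, 10)

# Made-up values standing in for a numeric field such as counter_total_all
values <- c(3, 7, 12, 18, 21, 26, 29, 55, 1200)
in_range <- values[values >= 5 & values < 1000]

# findInterval gives the index of the bucket each value falls into
idx <- findInterval(in_range, starts)
buckets <- data.frame(
  start = starts,
  count = tabulate(idx, nbins = length(starts))
)
print(head(buckets))
```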

Highlight

solr_highlight(q = "alcohol", hl.fl = "abstract", rows = 2, url = url)

## http://api.plos.org/search?wt=json&q=alcohol&start=0&rows=2&hl=true&hl.fl=abstract&fl=DOES_NOT_EXIST

## $`10.1371/journal.pmed.0040151`
## $`10.1371/journal.pmed.0040151`$abstract
## [1] "Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting"
## 
## 
## $`10.1371/journal.pone.0027752`
## $`10.1371/journal.pone.0027752`$abstract
## [1] "Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking"
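Highlighting wraps each matched term in <em> tags, as the snippets above show. As a rough local analogue (Solr's highlighter is far more sophisticated), the same wrapping can be sketched with gsub. The highlight function is hypothetical and assumes the term is a plain word with no regex metacharacters.

```r
# Wrap whole-word, case-insensitive matches of `term` in <em> tags.
# Assumes `term` contains no regular expression metacharacters.
highlight <- function(text, term) {
  pattern <- paste0("\\b(", term, ")\\b")
  gsub(pattern, "<em>\\1</em>", text, ignore.case = TRUE, perl = TRUE)
}

snippet <- highlight("Alcohol consumption and alcohol policy", "alcohol")
print(snippet)
```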

Stats

out <- solr_stats(q = "ecology", stats.field = "counter_total_all,alm_twitterCount", 
    stats.facet = "journal,volume", url = url)

## http://api.plos.org/search?q=ecology&stats.field=counter_total_all&stats.field=alm_twitterCount&stats.facet=journal&stats.facet=volume&start=0&rows=0&wt=json&stats=true

out$data

##                   min    max count missing      sum sumOfSquares     mean
## counter_total_all   0 294080 18696       0 61103092    1.020e+12 3268.244
## alm_twitterCount    0   1345 18696       0    63470    8.682e+06    3.395
##                    stddev
## counter_total_all 6623.95
## alm_twitterCount    21.28

out$facet

## $counter_total_all
## $counter_total_all$journal
##    min    max count missing      sum sumOfSquares  mean stddev
## 1  669  38118   414       0  2160463    1.870e+10  5219   4241
## 2  582  42775   543       0  3170356    2.951e+10  5839   4505
## 3    0 294080 14463       0 37291578    5.716e+11  2578   5734
## 4 4401   8325     2       0    12726    8.867e+07  6363   2775
## 5  157  54715   373       0  1971606    2.074e+10  5286   5268
## 6  507  83913   209       0  2288958    5.031e+10 10952  11016
## 7  893 161403   751       0  8647100    2.228e+11 11514  12820
## 8  211 160180   690       0  2252119    3.722e+10  3264   6584
##                        facet_field
## 1                   plos pathogens
## 2                    plos genetics
## 3                         plos one
## 4             plos clinical trials
## 5       plos computational biology
## 6                    plos medicine
## 7                     plos biology
## 8 plos neglected tropical diseases
## 
## $counter_total_all$volume
##     min    max count missing      sum sumOfSquares  mean stddev
## 1   840 108024   741       0  5138959    9.339e+10  6935   8834
## 2  1144  85924   482       0  3999437    7.887e+10  8298   9746
## 3   157  78129   104       0   870304    2.037e+10  8368  11270
## 4  1385 109254    81       0  1075033    3.657e+10 13272  16698
## 5   362 179012  4823       0 12626395    1.785e+11  2618   5492
## 6   539 160180  2946       0 10172036    1.292e+11  3453   5652
## 7   491  74271  1538       0  7411041    8.495e+10  4819   5660
## 8   507 294080  1010       0  6330604    1.848e+11  6268  11994
## 9     0 161403   812       0  2260500    4.591e+10  2784   6989
## 10   78 166682  6093       0 10519943    1.515e+11  1727   4678
## 11 1979  70376    62       0   670098    1.534e+10 10808  11521
## 12  893  24278     4       0    28742    5.969e+08  7186  11407
##    facet_field
## 1            3
## 2            2
## 3           10
## 4            1
## 5            7
## 6            6
## 7            5
## 8            4
## 9            9
## 10           8
## 11          11
## 12          12
## 
## 
## $alm_twitterCount
## $alm_twitterCount$journal
##   min  max count missing   sum sumOfSquares   mean stddev
## 1   0   73   414       0  1255        31525  3.031  8.193
## 2   0   99   543       0  1444        36060  2.659  7.710
## 3   0  741 14463       0 43477      4699037  3.006 17.773
## 4   0    3     2       0     3            9  1.500  2.121
## 5   0  102   373       0  1140        36082  3.056  9.361
## 6   0  456   209       0  2098       357306 10.038 40.207
## 7   0 1345   751       0  5871      2518003  7.818 57.412
## 8   0  785   690       0  1803       628569  2.613 30.091
##                        facet_field
## 1                   plos pathogens
## 2                    plos genetics
## 3                         plos one
## 4             plos clinical trials
## 5       plos computational biology
## 6                    plos medicine
## 7                     plos biology
## 8 plos neglected tropical diseases
## 
## $alm_twitterCount$volume
##    min  max count missing   sum sumOfSquares     mean  stddev facet_field
## 1    0   18   741       0   308         2380   0.4157   1.744           3
## 2    0   36   482       0   276         4426   0.5726   2.979           2
## 3    0  456   104       0  2580       370684  24.8077  54.566          10
## 4    0   28    81       0    82         1630   1.0123   4.397           1
## 5    0  734  4823       0 17106      1585234   3.5468  17.781           7
## 6    0  785  2946       0  2710       761766   0.9199  16.057           6
## 7    0  110  1538       0  1058        39912   0.6879   5.049           5
## 8    0  147  1010       0   505        27497   0.5000   5.196           4
## 9    0  206   812       0  5509       297295   6.7845  17.902           9
## 10   0  741  6093       0 28449      3110039   4.6691  22.107           8
## 11   1 1345    62       0  4292      2185830  69.2258 175.962          11
## 12  14  543     4       0   595       295767 148.7500 262.844          12
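The mean and stddev columns can be recovered from count, sum and sumOfSquares. The sketch below checks this against the smallest row above (alm_twitterCount for plos clinical trials: count 2, sum 3, sumOfSquares 9). The sample (n - 1) formula is an assumption, chosen because it matches the printed value for that row; stats_from_sums is a hypothetical helper.

```r
# Recompute mean and standard deviation from running totals.
# Uses the sample (n - 1) variance, which is an assumption about Solr's output.
stats_from_sums <- function(count, sum, sum_of_squares) {
  avg <- sum / count
  variance <- (sum_of_squares - sum^2 / count) / (count - 1)
  c(mean = avg, stddev = sqrt(variance))
}

check <- stats_from_sums(count = 2, sum = 3, sum_of_squares = 9)
print(round(check, 3))
```

This gives a mean of 1.5 and a standard deviation of about 2.121, in line with the reported row.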

More like this

solr_mlt returns documents similar to the ones matched by your search.

out <- solr_mlt(q = "title:\"ecology\" AND body:\"cell\"", mlt.fl = "title", 
    mlt.mindf = 1, mlt.mintf = 1, fl = "counter_total_all", rows = 5, url = url)

## http://api.plos.org/search?q=title:"ecology" AND body:"cell"&mlt=true&fl=id,counter_total_all&mlt.fl=title&mlt.mintf=1&mlt.mindf=1&start=0&rows=5&wt=json

out$docs

##                             id counter_total_all
## 1 10.1371/journal.pbio.0020440             16052
## 2 10.1371/journal.pone.0040117              1654
## 3 10.1371/journal.pone.0072525               681
## 4 10.1371/journal.ppat.1002320              4708
## 5 10.1371/journal.pone.0015143             11161

Raw data?

You can optionally get back raw JSON or XML from all functions by setting the parameter raw=TRUE. You can then parse after the fact with solr_parse, or process the output as you wish. For example:

(out <- solr_highlight(q = "alcohol", hl.fl = "abstract", rows = 2, url = url, 
    raw = TRUE))

## http://api.plos.org/search?wt=json&q=alcohol&start=0&rows=2&hl=true&hl.fl=abstract&fl=DOES_NOT_EXIST

## [1] "{\"response\":{\"numFound\":11628,\"start\":0,\"docs\":[{},{}]},\"highlighting\":{\"10.1371/journal.pmed.0040151\":{\"abstract\":[\"Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting\"]},\"10.1371/journal.pone.0027752\":{\"abstract\":[\"Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking\"]}}}\n"
## attr(,"class")
## [1] "sr_high"
## attr(,"wt")
## [1] "json"

Then parse

solr_parse(out, "df")

##                          names
## 1 10.1371/journal.pmed.0040151
## 2 10.1371/journal.pone.0027752
##                                                                                                    abstract
## 1   Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting
## 2 Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking
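The raw output printed earlier carries a class ("sr_high") and a "wt" attribute. One plausible way such tags could drive parsing is S3 dispatch, sketched below. This is an assumption about the design, not the package's actual code; parse_raw and its methods are hypothetical names.

```r
# A raw response tagged with a class and the response format,
# shaped like the raw output shown above
raw <- structure('{"highlighting":{}}', class = "sr_high", wt = "json")

# Hypothetical generic: picks a parser based on the class of the raw object
parse_raw <- function(x, ...) UseMethod("parse_raw")

parse_raw.sr_high <- function(x, ...) {
  # A real method would parse the JSON or XML here, guided by the wt attribute
  paste("highlight parser for", attr(x, "wt"), "input")
}

parse_raw.default <- function(x, ...) {
  stop("no parser available for this class")
}

result <- parse_raw(raw)
print(result)
# [1] "highlight parser for json input"
```

Tagging the raw string this way means a single parse entry point can handle output from any of the query functions.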

Verbosity

As you have noticed, each function prints the actual call made to the Solr endpoint, so you know exactly what was submitted to the remote or local Solr instance. You can suppress the message with verbose=FALSE. This message isn't in the CRAN version.


Advanced: Function Queries

Function Queries allow you to query on actual numeric fields in the Solr database, and do addition, multiplication, etc. on one or many fields to sort results. For example, here we search on the product of counter_total_all and alm_twitterCount, using a new temporary field "_val_".

solr_search(q = "_val_:\"product(counter_total_all,alm_twitterCount)\"", rows = 5, 
    fl = "id,title", fq = "doc_type:full", url = url)

## http://api.plos.org/search?q=_val_:"product(counter_total_all,alm_twitterCount)"&start=0&rows=5&fq=doc_type:full&fl=id,title&wt=json

##                             id
## 1 10.1371/journal.pmed.0020124
## 2 10.1371/journal.pone.0046362
## 3 10.1371/journal.pntd.0001969
## 4 10.1371/journal.pone.0069841
## 5 10.1371/journal.pbio.1001535
##                                                                                                           title
## 1                                                                Why Most Published Research Findings Are False
## 2            The Power of Kawaii: Viewing Cute Images Promotes a Careful Behavior and Narrows Attentional Focus
## 3 An In-Depth Analysis of a Piece of Shit: Distribution of Schistosoma mansoni and Hookworm Eggs in Human Stool
## 4                                       Facebook Use Predicts Declines in Subjective Well-Being in Young Adults
## 5                                                                An Introduction to Social Media for Scientists
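To see what ranking by a product of two fields means, here is a local analogue in base R on made-up numbers. Solr computes this score on the server; the ids and counts below are invented for illustration.

```r
# Made-up documents with the two numeric fields used in the query above
docs <- data.frame(
  id = c("a", "b", "c", "d"),
  counter_total_all = c(1000, 200, 5000, 50),
  alm_twitterCount = c(2, 30, 0, 10),
  stringsAsFactors = FALSE
)

# Score each document by the product of the two fields, then sort descending
docs$score <- docs$counter_total_all * docs$alm_twitterCount
ranked <- docs[order(docs$score, decreasing = TRUE), ]
print(ranked)
```

Document "c" has the largest counter_total_all but no tweets, so its product is zero and it ranks last. A product rewards documents that do well on both fields.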

Here, we search for the papers with the most views (counter_total_all)

solr_search(q = "_val_:\"max(counter_total_all)\"", rows = 5, fl = "id,counter_total_all", 
    fq = "doc_type:full", url = url)

## http://api.plos.org/search?q=_val_:"max(counter_total_all)"&start=0&rows=5&fq=doc_type:full&fl=id,counter_total_all&wt=json

##                             id counter_total_all
## 1 10.1371/journal.pmed.0020124            802095
## 2 10.1371/journal.pmed.0050045            301752
## 3 10.1371/journal.pone.0007595            294080
## 4 10.1371/journal.pone.0044864            234574
## 5 10.1371/journal.pone.0033288            200899

Or with the most tweets

solr_search(q = "_val_:\"max(alm_twitterCount)\"", rows = 5, fl = "id,alm_twitterCount", 
    fq = "doc_type:full", url = url)

## http://api.plos.org/search?q=_val_:"max(alm_twitterCount)"&start=0&rows=5&fq=doc_type:full&fl=id,alm_twitterCount&wt=json

##                             id alm_twitterCount
## 1 10.1371/journal.pbio.1001535             1345
## 2 10.1371/journal.pone.0046362             1127
## 3 10.1371/journal.pmed.0020124             1016
## 4 10.1371/journal.pntd.0001969              785
## 5 10.1371/journal.pone.0065263              741

Further reading on Solr
