Site icon R-bloggers

A Limited-but-Functional Couchbase Free Text Search & Retrieval Un-package; or, “How I Abused Couchbase & R to Perform Bulk IP Whois Full-text Searches” (a Cobbler’s Tale)

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Researching “the internet” (i.e. $DAYJOB) means having to deal with a ton of “unique” (I’m being kind) data formats. This is ultimately a tale of how I performed full-text searches across one of them.

It all started off innocently enough. This past week I need to be able to do full-text searches across metadata about who is using which parts of the internet. Normally I don’t need to do that at scale and can just go to RIPE’s excellent resource and manage to find what I need on the first page. However, this time I needed all the resultant info and noticed an interesting foible on that full text search interface. To reproduce it. Enter something like “domino's” (for the record, I’m not researching Domino’s Pizza — nor would I ever consume it — but a Twitter ad happened to fly by for Domino’s and I just typed it for kicks) into the field and page around, keeping an eye on the results. I think they still use Solr for indexing/searching and aren’t passing in all they need to keep session context or something. Anyway, suffice it to say it was fairly useless (I filed a bug report, so I’m not just complaining, and I wish more sites had the same easy error reporting filing capability the RIPE folks do).

If it were just searching for precise data in one field, that’s not really an issue since we have ALL THE WHOIS IP THINGS in Parquet. But:

Since I have a modest analytics server setup at home, I figured I’d take the opportunity to re-brush-up on either Elasticsearch or Couchbase since both are pretty great at free text searching JSON data. Except…this isn’t JSON data, It’s records formatted like this:

#
# The contents of this file are subject to 
# RIPE Database Terms and Conditions
#
# http://www.ripe.net/db/support/db-terms-conditions.pdf
#

as-block:       AS7 - AS7
descr:          RIPE NCC ASN block
remarks:        These AS Numbers are assigned to network operators in the RIPE NCC service region.
mnt-by:         RIPE-NCC-HM-MNT
created:        2018-11-22T15:27:05Z
last-modified:  2018-11-22T15:27:05Z
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.ripe.net/whois
remarks:        ****************************

as-block:       AS28 - AS28
descr:          RIPE NCC ASN block
remarks:        These AS Numbers are assigned to network operators in the RIPE NCC service region.
mnt-by:         RIPE-NCC-HM-MNT
created:        2018-11-22T15:27:05Z
last-modified:  2018-11-22T15:27:05Z
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.ripe.net/whois
remarks:        ****************************

They “keys” (the colon-ified line prefixes) vary and there are other record types (which I don’t need) that have other prefixes in them plus those #-prefixed comments are not necessarily only at the top. But, after judicious use of stringi::stri::stri_enc_toutf8(), stringi::stri_split_regex() and some vectorized record targeting they’re pretty easily converted to lovely ndjson data like this (random selection further in the conversion):

{"descr":"Reseau Teleinformatique de l'Education Nationale Educational and research network for Luxembourg","admin_c":"DUMY-RIPE","as_set":"AS-RESTENA","members":"AS2602, AS42909, AS51966, AS49624","mnt_by":"AS2602-MNT","notify":"noc@restena.lu","tech_c":"DUMY-RIPE"}
{"descr":"CWIX ASes announced to EBONE","admin_c":"DUMY-RIPE","as_set":"AS-TMPEBONECWIX","members":"AS3727, AS4445, AS4610, AS4624, AS4637, AS4654, AS4655, AS4656, AS4659 AS4681, AS4696, AS4714, AS4849, AS5089, AS5090, AS5532, AS5551, AS5559 AS5655, AS6081, AS6255, AS6292, AS6618, AS6639","mnt_by":"EBONE-MNT","notify":"staff@ebone.net","tech_c":"DUMY-RIPE"}
{"descr":"ASs accepted by DFN from the University of Cologne","admin_c":"DUMY-RIPE","as_set":"AS-DFNFROMCOLOGNE","members":"AS5520 AS6733","mnt_by":"DFN-MNT","tech_c":"DUMY-RIPE"}
{"descr":"NetMatters UK","admin_c":"DUMY-RIPE","as_set":"AS-NETMATTERS","members":"AS6765 AS3344","mnt_by":"AS8407-MNT","tech_c":"DUMY-RIPE"}

I went with Couchbase since it handles ndjson import by default and — as you know since you read the comparison in the aforelinked article — it can easily index all fields by default without you having to do virtually anything. Plus, Couchbase has been around long enough that it generally installs without pain and has a fairly decent web admin panel. Here’s a snapshot of the final import:

and here’s the config for the “all” full text index:

{
  "type": "fulltext-index",
  "name": "all",
  "uuid": "481bc7ed642dddfb",
  "sourceType": "couchbase",
  "sourceName": "ripe",
  "sourceUUID": "3ffbbe0c0923f233ffe0fc96c652262d",
  "planParams": {
    "maxPartitionsPerPIndex": 171
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "type_field",
      "type_field": "type"
    },
    "mapping": {
      "analysis": {},
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": true
      },
      "default_type": "_default",
      "docvalues_dynamic": true,
      "index_dynamic": true,
      "store_dynamic": false,
      "type_field": "_type"
    },
    "store": {
      "indexType": "scorch",
      "kvStoreName": ""
    }
  },
  "sourceParams": {}
}

You Said This Is A Post With R Code

Very true! We’ll get to that in a minute.

Going with Couchbase introduced a different problem: there’s almost no R support for Couchbase. Sure, Couchbase has a gnarly, two-year old, raw httr::-prefixed bit of a tutorial post but that’s not really as cool as if there were a library(couchbase). I mean, you can check GitUgh or CRAN or a more general search yourself if you’d like but it’s going to come up bupkis.

If you were expecting a big reveal, right now, that I’ve got a feature-packed, full R Couchbase package ready to roll…you didn’t actually read the title of the post. What I do have is a set of functions that — given server/connection metadata, a bucket, a full text index, and a query — will return all matching documents (I still do not like that term for “record”) for said set of parameters:

# function code is in: https://paste.sr.ht/~hrbrmstr/051f5d5400644952a3ad2cf8664b84e2cbb9ac6b

cb_fts("domino's", "all", "ripe")
## # A tibble: 120 x 9
##    admin_c   country descr                      inetnum                  mnt_by      netname  status    tech_c  notify         
##    <chr>     <chr>   <chr>                      <chr>                    <chr>       <chr>    <chr>     <chr>   <chr>          
##  1 DUMY-RIPE FR      OPEN IP DOMINO'S PIZZA     79.141.8.44 - 79.141.8.… ALPHALINK-… OPEN-IP  ASSIGNED… DUMY-R… NA             
##  2 DUMY-RIPE NL      Domino's Pizza TILBURG     62.21.176.160 - 62.21.1… AS286-MNT   OTS2634… ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  3 DUMY-RIPE NL      Domino's Pizza EINDHOVEN   62.132.252.168 - 62.132… AS286-MNT   OTS2270… ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  4 DUMY-RIPE NL      Domino's Pizza SPYKENISSE  194.123.233.232 - 194.1… AS286-MNT   OTS69259 ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  5 DUMY-RIPE NL      Domino's AMSTERDAM         37.74.38.188 - 37.74.38… AS286-MNT   OTS6103… ASSIGNED… DUMY-R… kpn-ip-office@…
##  6 DUMY-RIPE NL      Domino's Pizza VOORSCHOTEN 92.66.116.136 - 92.66.1… AS286-MNT   OTS1914… ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  7 DUMY-RIPE NL      Domino's Pizza Doetinchem… 212.241.42.136 - 212.24… AS286-MNT   OTS2301… ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  8 DUMY-RIPE NL      Domino's Pizza AMSTERDAM   194.120.45.224 - 194.12… AS286-MNT   OTS82906 ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  9 DUMY-RIPE NL      Domino's Pizza [Woerden] … 62.41.228.80 - 62.41.22… AS286-MNT   OTS2024… ASSIGNED… DUMY-R… ip-reg@kpn.net 
## 10 DUMY-RIPE NL      Domino's Pizza GRONINGEN   188.203.128.0 - 188.203… AS286-MNT   OTS3767… ASSIGNED… DUMY-R… kpn-ip-office@…
## # … with 110 more rows

It’s not fancy.

It’s meets the needs of a narrow use-case.

It’s not in a standalone package (which is triggering my R code OCD something fierce).

But, it’s seriously fast, got me back to “work mode” with a minimum of hassle, and now there’s some google-able Couchbase R code that isn’t just bare httr calls that may help someone else who’s on a quest for how to work with Couchbase in R.

The first primary function – cb_fts() — uses the /api/index/{index-name}/query API endpoint to paginate through results of the full text search and retrieves all matching record doc id keys, then calls the last primary function — cb_get_records_from_keys() — which uses the /query/service API endpoint, issues a SELECT * FROM {bucket} USE KEYS {keys} query with all the found document (record) key ids and returns the result set. Nothing more fancy than that.

FIN

While I do not have these functions in a standalone, Couchbase-focused package I do have them in the package associated with this particular project. If you do know of a Couchbase R package (please don’t link to JDBC/ODBC drivers as I’m not going to buy) please link to them in the comments.

If you have other strategies for how to deal with these “un-packages”, please blog about it and post a link as well! I’m curious how others balance the package/not-a-package/un-package tension, especially when you may need to depend on a series of functions across projects.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.