A Limited-but-Functional Couchbase Free Text Search & Retrieval Un-package; or, “How I Abused Couchbase & R to Perform Bulk IP Whois Full-text Searches” (a Cobbler’s Tale)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Researching “the internet” (i.e. $DAYJOB) means having to deal with a ton of “unique” (I’m being kind) data formats. This is ultimately a tale of how I performed full-text searches across one of them.
It all started off innocently enough. This past week I need to be able to do full-text searches across metadata about who is using which parts of the internet. Normally I don’t need to do that at scale and can just go to RIPE’s excellent resource and manage to find what I need on the first page. However, this time I needed all the resultant info and noticed an interesting foible on that full text search interface. To reproduce it. Enter something like “domino's
” (for the record, I’m not researching Domino’s Pizza — nor would I ever consume it — but a Twitter ad happened to fly by for Domino’s and I just typed it for kicks) into the field and page around, keeping an eye on the results. I think they still use Solr for indexing/searching and aren’t passing in all they need to keep session context or something. Anyway, suffice it to say it was fairly useless (I filed a bug report, so I’m not just complaining, and I wish more sites had the same easy error reporting filing capability the RIPE folks do).
If it were just searching for precise data in one field, that’s not really an issue since we have ALL THE WHOIS IP THINGS in Parquet. But:
- I really hate giving Amazon money (even if it’s $WORK money) for Athena queries
- Full text search across all columns is not one of Parquet’s strengths
- This is a third bullet b/c I feel compelled to have a minimum of three points in bullet lists likely thanks to an overbearing middle-school English teacher
Since I have a modest analytics server setup at home, I figured I’d take the opportunity to re-brush-up on either Elasticsearch or Couchbase since both are pretty great at free text searching JSON data. Except…this isn’t JSON data, It’s records formatted like this:
# # The contents of this file are subject to # RIPE Database Terms and Conditions # # http://www.ripe.net/db/support/db-terms-conditions.pdf # as-block: AS7 - AS7 descr: RIPE NCC ASN block remarks: These AS Numbers are assigned to network operators in the RIPE NCC service region. mnt-by: RIPE-NCC-HM-MNT created: 2018-11-22T15:27:05Z last-modified: 2018-11-22T15:27:05Z source: RIPE remarks: **************************** remarks: * THIS OBJECT IS MODIFIED remarks: * Please note that all data that is generally regarded as personal remarks: * data has been removed from this object. remarks: * To view the original object, please query the RIPE Database at: remarks: * http://www.ripe.net/whois remarks: **************************** as-block: AS28 - AS28 descr: RIPE NCC ASN block remarks: These AS Numbers are assigned to network operators in the RIPE NCC service region. mnt-by: RIPE-NCC-HM-MNT created: 2018-11-22T15:27:05Z last-modified: 2018-11-22T15:27:05Z source: RIPE remarks: **************************** remarks: * THIS OBJECT IS MODIFIED remarks: * Please note that all data that is generally regarded as personal remarks: * data has been removed from this object. remarks: * To view the original object, please query the RIPE Database at: remarks: * http://www.ripe.net/whois remarks: ****************************
They “keys” (the colon-ified line prefixes) vary and there are other record types (which I don’t need) that have other prefixes in them plus those #
-prefixed comments are not necessarily only at the top. But, after judicious use of stringi::stri::stri_enc_toutf8()
, stringi::stri_split_regex()
and some vectorized record targeting they’re pretty easily converted to lovely ndjson data like this (random selection further in the conversion):
{"descr":"Reseau Teleinformatique de l'Education Nationale Educational and research network for Luxembourg","admin_c":"DUMY-RIPE","as_set":"AS-RESTENA","members":"AS2602, AS42909, AS51966, AS49624","mnt_by":"AS2602-MNT","notify":"[email protected]","tech_c":"DUMY-RIPE"} {"descr":"CWIX ASes announced to EBONE","admin_c":"DUMY-RIPE","as_set":"AS-TMPEBONECWIX","members":"AS3727, AS4445, AS4610, AS4624, AS4637, AS4654, AS4655, AS4656, AS4659 AS4681, AS4696, AS4714, AS4849, AS5089, AS5090, AS5532, AS5551, AS5559 AS5655, AS6081, AS6255, AS6292, AS6618, AS6639","mnt_by":"EBONE-MNT","notify":"[email protected]","tech_c":"DUMY-RIPE"} {"descr":"ASs accepted by DFN from the University of Cologne","admin_c":"DUMY-RIPE","as_set":"AS-DFNFROMCOLOGNE","members":"AS5520 AS6733","mnt_by":"DFN-MNT","tech_c":"DUMY-RIPE"} {"descr":"NetMatters UK","admin_c":"DUMY-RIPE","as_set":"AS-NETMATTERS","members":"AS6765 AS3344","mnt_by":"AS8407-MNT","tech_c":"DUMY-RIPE"}
I went with Couchbase since it handles ndjson import by default and — as you know since you read the comparison in the aforelinked article — it can easily index all fields by default without you having to do virtually anything. Plus, Couchbase has been around long enough that it generally installs without pain and has a fairly decent web admin panel. Here’s a snapshot of the final import:
and here’s the config for the “all
” full text index:
{ "type": "fulltext-index", "name": "all", "uuid": "481bc7ed642dddfb", "sourceType": "couchbase", "sourceName": "ripe", "sourceUUID": "3ffbbe0c0923f233ffe0fc96c652262d", "planParams": { "maxPartitionsPerPIndex": 171 }, "params": { "doc_config": { "docid_prefix_delim": "", "docid_regexp": "", "mode": "type_field", "type_field": "type" }, "mapping": { "analysis": {}, "default_analyzer": "standard", "default_datetime_parser": "dateTimeOptional", "default_field": "_all", "default_mapping": { "dynamic": true, "enabled": true }, "default_type": "_default", "docvalues_dynamic": true, "index_dynamic": true, "store_dynamic": false, "type_field": "_type" }, "store": { "indexType": "scorch", "kvStoreName": "" } }, "sourceParams": {} }
You Said This Is A Post With R Code
Very true! We’ll get to that in a minute.
Going with Couchbase introduced a different problem: there’s almost no R support for Couchbase. Sure, Couchbase has a gnarly, two-year old, raw httr::
-prefixed bit of a tutorial post but that’s not really as cool as if there were a library(couchbase)
. I mean, you can check GitUgh or CRAN or a more general search yourself if you’d like but it’s going to come up bupkis.
If you were expecting a big reveal, right now, that I’ve got a feature-packed, full R Couchbase package ready to roll…you didn’t actually read the title of the post. What I do have is a set of functions that — given server/connection metadata, a bucket, a full text index, and a query — will return all matching documents (I still do not like that term for “record”) for said set of parameters:
# function code is in: https://paste.sr.ht/~hrbrmstr/051f5d5400644952a3ad2cf8664b84e2cbb9ac6b cb_fts("domino's", "all", "ripe") ## # A tibble: 120 x 9 ## admin_c country descr inetnum mnt_by netname status tech_c notify ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 DUMY-RIPE FR OPEN IP DOMINO'S PIZZA 79.141.8.44 - 79.141.8.… ALPHALINK-… OPEN-IP ASSIGNED… DUMY-R… NA ## 2 DUMY-RIPE NL Domino's Pizza TILBURG 62.21.176.160 - 62.21.1… AS286-MNT OTS2634… ASSIGNED… DUMY-R… [email protected] ## 3 DUMY-RIPE NL Domino's Pizza EINDHOVEN 62.132.252.168 - 62.132… AS286-MNT OTS2270… ASSIGNED… DUMY-R… [email protected] ## 4 DUMY-RIPE NL Domino's Pizza SPYKENISSE 194.123.233.232 - 194.1… AS286-MNT OTS69259 ASSIGNED… DUMY-R… [email protected] ## 5 DUMY-RIPE NL Domino's AMSTERDAM 37.74.38.188 - 37.74.38… AS286-MNT OTS6103… ASSIGNED… DUMY-R… kpn-ip-office@… ## 6 DUMY-RIPE NL Domino's Pizza VOORSCHOTEN 92.66.116.136 - 92.66.1… AS286-MNT OTS1914… ASSIGNED… DUMY-R… [email protected] ## 7 DUMY-RIPE NL Domino's Pizza Doetinchem… 212.241.42.136 - 212.24… AS286-MNT OTS2301… ASSIGNED… DUMY-R… [email protected] ## 8 DUMY-RIPE NL Domino's Pizza AMSTERDAM 194.120.45.224 - 194.12… AS286-MNT OTS82906 ASSIGNED… DUMY-R… [email protected] ## 9 DUMY-RIPE NL Domino's Pizza [Woerden] … 62.41.228.80 - 62.41.22… AS286-MNT OTS2024… ASSIGNED… DUMY-R… [email protected] ## 10 DUMY-RIPE NL Domino's Pizza GRONINGEN 188.203.128.0 - 188.203… AS286-MNT OTS3767… ASSIGNED… DUMY-R… kpn-ip-office@… ## # … with 110 more rows
It’s not fancy.
It’s meets the needs of a narrow use-case.
It’s not in a standalone package (which is triggering my R code OCD something fierce).
But, it’s seriously fast, got me back to “work mode” with a minimum of hassle, and now there’s some google-able Couchbase R code that isn’t just bare httr
calls that may help someone else who’s on a quest for how to work with Couchbase in R.
The first primary function – cb_fts()
— uses the /api/index/{index-name}/query
API endpoint to paginate through results of the full text search and retrieves all matching record doc id keys, then calls the last primary function — cb_get_records_from_keys()
— which uses the /query/service
API endpoint, issues a SELECT * FROM {bucket} USE KEYS {keys}
query with all the found document (record) key ids and returns the result set. Nothing more fancy than that.
FIN
While I do not have these functions in a standalone, Couchbase-focused package I do have them in the package associated with this particular project. If you do know of a Couchbase R package (please don’t link to JDBC/ODBC drivers as I’m not going to buy) please link to them in the comments.
If you have other strategies for how to deal with these “un-packages”, please blog about it and post a link as well! I’m curious how others balance the package/not-a-package/un-package tension, especially when you may need to depend on a series of functions across projects.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.