Hands-On Differential Privacy [book review]
Hands-On Differential Privacy was published just a few months ago (in September 2024) by (the US publisher) O’Reilly, famous for its programming and technical books with animal covers! A slate pencil sea urchin in the present case. The book is indeed classic O’Reilly, with lots of notes, little theory (or maths!) or symbols, a loose structuring of the chapters (no section numbers), highly detailed examples, and of course plenty of OpenDP code inserts. For instance, a case study about the privatization of a sample average x̄ runs over about ten pages. Terrible equation rendering btw (what’s wrong with LaTeX?!). Overall, I am quickly lost in most of the chapters due to the lack of a driving narrative, facing instead a catalogue of possible scenarios and procedures, appearing one after the other as in a fashion show.
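For comparison, the gist of that ten-page privatized average fits in a few lines. This is my own bare-bones NumPy sketch (not the book's OpenDP pipeline), assuming the clipping bounds and the sample size n are public:

```python
import numpy as np

def dp_mean(x, lower, upper, epsilon, n, rng=None):
    """epsilon-DP estimate of the mean of x via clipping + Laplace noise.

    Assumes the bounds [lower, upper] and the sample size n are public.
    """
    rng = np.random.default_rng() if rng is None else rng
    clipped = np.clip(x, lower, upper)      # bound each record's influence
    sensitivity = (upper - lower) / n       # L1 sensitivity of the clipped mean
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise
```

With a large privacy budget the estimate is essentially the clipped sample mean; as ε shrinks, the Laplace noise (of scale proportional to 1/ε) dominates.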
Hands-On Differential Privacy is written by Ethan Cowan, Michael Shoemate, and Mayana Pereira. I came across the book during the OpenDP workshop at Harvard [that took place right after my return from the Pacific Northwest] and it is definitely linked with OpenDP, all authors being involved at one stage or another in the OpenDP team. The style of the book is once again in tune with the O’Reilly manuals, which sort of clashes with my preferences. For instance, the introduction of differential privacy (Chapter 2) is quite extensive. Chapter 3 proceeds to teach about private data transform(ation)s and stability (a rewording of Lipschitz continuity), with code illustrations, often repeating an earlier derivation (see, e.g., p203), while Chapter 4 is its equivalent for private mechanisms. (With the diagrams of Figures 3-1 and 4-1 differing only in highlighting/bolding different functions in a privatized data processing pipeline.) Returning to differential privacy with a privacy loss parameter and to the Laplace and exponential mechanisms, Chapter 5 proposes several notions of privacy, all closed under post-processing. This includes Wasserman and Zhou’s (2010) interpretation of privacy as hypothesis testing, except it is not exploited further than connecting type I and type II error rates with the (ε,δ) parameters. Chapter 6 concludes Part I (on concepts) with a series of (fearless) combinators preserving stability and privacy, with an increasing proportion of coding excerpts that I [imho] did not find particularly helpful.
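To fix ideas on the exponential mechanism mentioned above, here is a minimal sketch of my own (plain NumPy, not the book's OpenDP code): a candidate r is selected with probability proportional to exp(εu(r)/2Δu), where u is a utility function of sensitivity Δu:

```python
import numpy as np

def exponential_mechanism(candidates, utilities, epsilon, sensitivity, rng=None):
    """Select one candidate with probability ∝ exp(ε·u / (2·Δu))."""
    rng = np.random.default_rng() if rng is None else rng
    u = np.asarray(utilities, dtype=float)
    scores = epsilon * u / (2.0 * sensitivity)
    scores -= scores.max()                  # stabilise the exponentials
    probs = np.exp(scores)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```

As ε grows, the selection concentrates on the highest-utility candidate; as ε goes to zero, it becomes uniform over the candidates, which is the usual privacy/utility trade-off.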
Nothing about statistical loss of information or efficiency, bias, &tc. until Chapter 8 (p199), and even then so little. Part II is about practice, with a first Chapter 7 on setting a privacy unit (e.g., a person-month) before ensuring its privacy is protected, and on discussing unbounded contributions (not unbounded data!). Chapter 8 very thinly covers statistical modelling, while remaining agnostic about the choice of statistical procedures (Bayes being solely and naïvely mentioned for classification, furthermore with data-based evaluation of the class “prior” probabilities, p211). At this stage, procedures are often only defined through snippets of code, like the private Theil–Sen estimator (pp204-205). The continuous case boils down to a Normality assumption, with its pmf being defined (p212) by a formula that contains at least three errors! Chapter 9 is the equivalent of Chapter 8 for machine learning, mostly centred on private gradient descent, with a PyTorch section (pp232-235). It is completed by a light Chapter 10 on synthetic data, which does not seem to broach the issue of high-dimensional covariates, providing instead a list of GAN synthesizers.
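The private gradient descent at the core of Chapter 9 can be summarised in a few lines. This is my own NumPy illustration of the generic DP-SGD recipe (clip each per-example gradient to norm C, then add Gaussian noise before the update), not the book's PyTorch code, here for a least-squares objective:

```python
import numpy as np

def dp_sgd(X, y, epochs, lr, clip, noise_mult, rng=None):
    """DP gradient descent for least squares: per-example gradients are
    clipped to norm `clip`, then Gaussian noise of scale noise_mult*clip
    is added to their sum before each update."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        grads = (X @ w - y)[:, None] * X    # per-example gradients, shape (n, d)
        norms = np.linalg.norm(grads, axis=1)
        scale = np.minimum(1.0, clip / np.maximum(norms, 1e-12))
        clipped = grads * scale[:, None]    # each row now has norm <= clip
        noisy = clipped.sum(axis=0) + rng.normal(0.0, noise_mult * clip, d)
        w -= lr * noisy / n
    return w
```

The clipping bounds each individual's influence on the update (the sensitivity), so that the added Gaussian noise yields a quantifiable privacy guarantee per iteration, the total budget then following from a composition argument over the epochs.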
Part III (Deploying differential privacy) is even more about practice, with Chapter 11 on privacy attacks, Chapter 12 on calibrating a privacy mechanism (co-written with Jayshree Sarathy) and on good practice (like codebooks and data annotations), with the appearance of contextual integrity, which I discovered, if not perfectly understood, last year at the BIRS workshop in Kelowna. And Chapter 13 on planning a privacy project, with an 11-step checklist, most steps of which are quite vague [imho], although they do include strategies for making data owners confident that their privacy is safe.
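As a toy illustration of the kind of threat Chapter 11 addresses (my own example, not the book's), a simple differencing attack shows why releasing exact aggregates is not private: two sum queries differing in a single record reveal that record exactly.

```python
import numpy as np

def sum_query(values):
    """An exact (non-private) aggregate release."""
    return float(np.sum(values))

# Hypothetical records for illustration
salaries = {"alice": 52_000, "bob": 48_000, "carol": 75_000}

with_carol = sum_query(list(salaries.values()))
without_carol = sum_query([v for k, v in salaries.items() if k != "carol"])
leaked = with_carol - without_carol     # exactly carol's salary
```

Adding calibrated noise to each released sum, as in the Laplace mechanism, is precisely what defeats this subtraction trick.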
[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Books Review section in CHANCE]