Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Last week, OpenAI released version 2 of Whisper, an updated neural net that approaches human-level robustness and accuracy on speech recognition. You can now call a C/C++ inference engine directly from R, which allows you to transcribe .wav audio files.
To make this easy to do in R, BNOSAC created an R wrapper around the whisper.cpp code. This R package is available at https://github.com/bnosac/audio.whisper and can be installed as follows.
remotes::install_github("bnosac/audio.whisper")
The following code shows how to transcribe an example 16-bit WAV file containing a fragment of a speech by JFK.
Audio sample: https://user-images.githubusercontent.com/1991296/199337465-dbee4b5e-9aeb-48a3-b1c6-323ac4db5b2c.mp4
library(audio.whisper)
model <- whisper("tiny")
path <- system.file(package = "audio.whisper", "samples", "jfk.wav")
trans <- predict(model, newdata = path, language = "en", n_threads = 2)
trans
$n_segments
[1] 1
$data
segment from to text
1 00:00:00.000 00:00:11.000 And so my fellow Americans ask not what your country can do for you ask what you can do for your country.
$tokens
segment token token_prob
1 And 0.7476438
1 so 0.9042299
1 my 0.6872202
1 fellow 0.9984470
1 Americans 0.9589157
1 ask 0.2573057
1 not 0.7678108
1 what 0.6542882
1 your 0.9386917
1 country 0.9854987
1 can 0.9813995
1 do 0.9937403
1 for 0.9791515
1 you 0.9925495
1 ask 0.3058807
1 what 0.8303462
1 you 0.9735528
1 can 0.9711444
1 do 0.9616748
1 for 0.9778513
1 your 0.9604713
1 country 0.9923630
1 . 0.4983074
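The token probabilities give a rough indication of where the model was unsure. As a minimal sketch, the snippet below rebuilds a few rows of the tokens table shown above as a plain data frame and keeps the tokens below a cut-off. The `threshold` value of 0.5 is an arbitrary assumption for illustration, not something recommended by the package; on a real result you would use `trans$tokens` instead of the hand-built data frame.

```r
# A few rows copied from the trans$tokens output shown above
tokens <- data.frame(
  segment    = 1L,
  token      = c("And", "so", "my", "fellow", "Americans", "ask", "not"),
  token_prob = c(0.7476438, 0.9042299, 0.6872202, 0.9984470,
                 0.9589157, 0.2573057, 0.7678108),
  stringsAsFactors = FALSE
)

# Hypothetical cut-off, chosen only for illustration; tune it for your own audio
threshold <- 0.5

# Keep the tokens the model was least sure about
low_confidence <- tokens[tokens$token_prob < threshold, ]
print(low_confidence)
```

Tokens flagged this way are candidates for a manual check against the audio.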
Another example is based on a Micro Machines commercial from the 1980s.
Audio sample: https://user-images.githubusercontent.com/1991296/199337504-cc8fd233-0cb7-4920-95f9-4227de3570aa.mp4
I’ve always wanted to transcribe the performances of Francis E. Dec available on UbuWeb Sound, such as this one: https://www.ubu.com/media/sound/dec_francis/Dec-Francis-E_rant1.mp3. This is how you can now do that from R.
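library(av)
# Download the MP3 and convert it to a 16 kHz WAV file
download.file(url = "https://www.ubu.com/media/sound/dec_francis/Dec-Francis-E_rant1.mp3",
              destfile = "rant1.mp3", mode = "wb")
av_audio_convert("rant1.mp3", output = "output.wav", format = "wav", sample_rate = 16000)
# Transcribe 30 seconds of audio starting at second 7 (both given in milliseconds),
# reusing the model loaded in the first example
trans <- predict(model, newdata = "output.wav", language = "en",
                 duration = 30 * 1000, offset = 7 * 1000,
                 token_timestamps = TRUE)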
trans
$n_segments
[1] 11
$data
segment from to text
1 00:00:07.000 00:00:09.000 Look at the picture.
2 00:00:09.000 00:00:11.000 See the skull.
3 00:00:11.000 00:00:13.000 The part of bone removed.
4 00:00:13.000 00:00:16.000 The master race Frankenstein radio controls.
5 00:00:16.000 00:00:18.000 The brain thoughts broadcasting radio.
6 00:00:18.000 00:00:21.000 The eyesight television. The Frankenstein earphone radio.
7 00:00:21.000 00:00:25.000 The threshold brain wash radio. The latest new skull reforming.
8 00:00:25.000 00:00:28.000 To contain all Frankenstein controls.
9 00:00:28.000 00:00:31.000 Even in thin skulls of white pedigree males.
10 00:00:31.000 00:00:34.000 Visible Frankenstein controls.
11 00:00:34.000 00:00:37.000 The synthetic nerve radio, directional and an alloop.
$tokens
segment token token_prob token_from token_to
1 Look 0.4281234 00:00:07.290 00:00:07.420
1 at 0.9485379 00:00:07.420 00:00:07.620
1 the 0.9758387 00:00:07.620 00:00:07.940
1 picture 0.9734664 00:00:08.150 00:00:08.580
1 . 0.9688568 00:00:08.680 00:00:08.910
2 See 0.9847929 00:00:09.000 00:00:09.420
2 the 0.7588121 00:00:09.420 00:00:09.840
2 skull 0.9989663 00:00:09.840 00:00:10.310
2 . 0.9548351 00:00:10.550 00:00:11.000
3 The 0.9914295 00:00:11.000 00:00:11.170
3 part 0.9789217 00:00:11.560 00:00:11.600
3 of 0.9958754 00:00:11.600 00:00:11.770
3 bone 0.9759618 00:00:11.770 00:00:12.030
3 removed 0.9956936 00:00:12.190 00:00:12.710
3 . 0.9965582 00:00:12.710 00:00:12.940
...
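The token timestamps come back as `HH:MM:SS.mmm` strings, so a little parsing is needed before you can compute with them. The sketch below uses a hypothetical helper, `to_seconds()` (not part of audio.whisper), to turn such strings into seconds, and applies it to the first five token rows copied from the output above to get the duration of each token. It assumes the timestamp format is exactly the one shown in that output.

```r
# Hypothetical helper (not part of audio.whisper):
# converts "HH:MM:SS.mmm" strings to seconds
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] * 3600 + p[2] * 60 + p[3]
  }, numeric(1))
}

# First five rows copied from the trans$tokens output shown above
tokens <- data.frame(
  segment    = 1L,
  token      = c("Look", "at", "the", "picture", "."),
  token_from = c("00:00:07.290", "00:00:07.420", "00:00:07.620",
                 "00:00:08.150", "00:00:08.680"),
  token_to   = c("00:00:07.420", "00:00:07.620", "00:00:07.940",
                 "00:00:08.580", "00:00:08.910"),
  stringsAsFactors = FALSE
)

# Duration of each token in seconds, rounded to milliseconds
tokens$duration <- round(to_seconds(tokens$token_to) - to_seconds(tokens$token_from), 3)
print(tokens[, c("token", "duration")])
```

On a real result you would apply the same helper to `trans$tokens$token_from` and `trans$tokens$token_to`.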
We may put the package on CRAN in the near future; for now it is only available at https://github.com/bnosac/audio.whisper.
Get in touch if you are interested in this and let us know what you plan to use it for.
