Site icon R-bloggers

BythMusteR: Mobile Phone Evolution

[This article was first published on Jkunst - R category, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

TL DRs

Time ago I changed my humble moto G 1st gen with a more big Mate Ascend 7 and my thumb did things I didn’t know to reach the opposite top corner. In one of this efforts to reach a far icon I remembered the phone evolution image. And then I asked how true is this image. How similiar is this trend in reality? So, just for fun (and to show what highcharter can do too!) I coded to kwow the truth.

You can the see the standalone resukting charts here

The data

When you have doubts about the cellphones specficiaction you always finish in the gsmarena.com site. The gsmarena people did an analisys about this topic. It was very descriptive but I think they could do better in terms of visualization:

About the code: basically I wrote a script to get all the phone brand first. Then for each of them download all brands’ phones (and not only phones, but some AIOs and some watches)

url <- "http://www.gsmarena.com"

tabletd <- file.path(url, "makers.php3") %>% 
  read_html() %>% 
  html_nodes("table td")

dfbrands <- data_frame(
  td1 = tabletd[seq(1, length(tabletd), 2)],
  td2 = tabletd[seq(2, length(tabletd), 2)]
  ) %>%  
  mutate(brand_name = html_node(td2, "a") %>% html_text(),
         brand_url = html_node(td1, "a") %>% html_attr("href"),
         brand_image_url = html_node(td1, "img") %>% html_attr("src"),
         brand_n_phn = str_extract(brand_name, "\(\d+\)"),
         brand_n_phn = str_replace_all(brand_n_phn, "\(|\)", ""),
         brand_n_phn = as.numeric(brand_n_phn),
         brand_name = str_replace_all(brand_name, " phones \(\d+\)", "")) %>% 
  select(-td1, -td2) %>% 
  arrange(-brand_n_phn)

head(dfbrands)
brand_name brand_url brand_image_url brand_n_phn
Samsung samsung-phones-9.php http://cdn2.gsmarena.com/vv/logos/lg_samsu.gif 1067
LG lg-phones-20.php http://cdn2.gsmarena.com/vv/logos/lg_lg.gif 560
Nokia nokia-phones-1.php http://cdn2.gsmarena.com/vv/logos/lg_nokia.gif 441
Motorola motorola-phones-4.php http://cdn2.gsmarena.com/vv/logos/lg_motor.gif 425
Alcatel alcatel-phones-5.php http://cdn2.gsmarena.com/vv/logos/lg_alcat2.gif 346
HTC htc-phones-45.php http://cdn2.gsmarena.com/vv/logos/lg_htc.gif 233

We have the brand logo so we can extract the main color of image via caTools::read.gif. And then we will ready to show our fist chart. Here we use the htmltootls package to code the tooltip using the tags which the package provide.

brand_color <- map_chr(dfbrands$brand_image_url, function(url){
  # url <- sample(dfbrands$brand_image_url, size = 1)
  # url <- "http://cdn2.gsmarena.com/vv/logos/lg_mmax.gif"
  img <- caTools::read.gif(url)
  
  colors <- count(data_frame(col = as.vector(img$image)), col) %>% 
    arrange(desc(n)) %>% 
    left_join(data_frame(hex = img$col, col = seq(length(img$col))),
              by = "col") %>% 
    filter(!is.na(hex) & !str_detect(hex, "#F[A-z0-9]F[A-z0-9]F[A-z0-9]"))
  
  str_sub(colors$hex[1], 0, 7)
  
})

dfbrands <- dfbrands %>% mutate(brand_color = brand_color)

n <- 30

dsbrands <- dfbrands %>% 
  head(n) %>% 
  mutate(x = brand_name,
         y = brand_n_phn) %>% 
  list.parse3()

tooltip <- tagList(
  tags$span(style = "float:right;color:#3C3C3C", "{point.y} models"),
  tags$br(),
  tags$img(src = '{point.brand_image_url}')
) %>% as.character()

highchart() %>% 
  hc_title(text = sprintf("Top %s Brands with more phone models", n)) %>% 
  hc_subtitle(text = "data from: http://www.gsmarena.com/") %>% 
  hc_chart(zoomType = "x") %>% 
  hc_tooltip(
    useHTML = TRUE,
    backgroundColor = "white",
    borderWidth = 2,
    headerFormat = "<table style ='width:92px;height:22px' >",
    pointFormat = tooltip,
    footerFormat = "</table>"
  ) %>% 
  hc_yAxis(title = list(text = "Models")) %>% 
  hc_xAxis(categories = map_chr(dsbrands, function(x) x$brand_name)) %>% 
  hc_add_series(data = dsbrands,
                showInLegend = FALSE,
                colorByPoint = TRUE,
                name = "phones models",
                type = "bar") %>% 
  hc_add_theme(
    hc_theme_merge(
      hc_theme_538(),
      hc_theme(colors = map_chr(dsbrands, function(x) x$brand_color))
      )
  )

I know I know! Too many colors. I’m sorry but I wanted to represent each bar with the associated brand color. I think the chart look less boring.

Back to the data: Samsung have over 1000 models! This don’t say too much because there are over 10 Galaxy 5 version. Besides gsmarena don’t have data about the marketshare. But we can make and idea about the status in this market.

Phones Data

Now we’ll scrape the phones data, brand by brand. This part of the code took a little long time but does its job.

dfphones <- map_df(sample(dfbrands$brand_url), function(burl){
  
  # burl <- "dell-phones-61.php" # burl <- "samsung-phones-9.php"
  
  frdata <- paste0(burl, ".RData")
  
  if(file.exists(frdata))
    return(data_frame(burl))
  
  extract_page_info <- function(pburl) {
    message(pburl)
    phns <- read_html(pburl) %>% 
      html_nodes(".makers > ul > li")
    data_frame(
      phn = html_node(phns, "a") %>% html_text(),
      phn_url = html_node(phns, "a") %>% html_attr("href"),
      phn_image_url = html_node(phns, "img") %>% html_attr("src")  
    )
  }
  
  # check if have pages
  pages <- file.path(url, burl) %>% 
    read_html() %>% 
    html_nodes(".nav-pages a")
  
  if (length(pages) > 0) {
    
    dres <- pages %>% 
      html_attr("href") %>% 
      file.path(url, .) %>% 
      map_df(extract_page_info) %>% 
      rbind(extract_page_info(file.path(url, burl)))
    
  } else {
    
    dres <- extract_page_info(file.path(url, burl))
    
  }
  
  dres <- dres %>% 
    mutate(brand_url = burl)
  
  dres2 <- map_df(dres$phn_url, function(purl){
    # purl <- sample(dfphones$phn_url, size = 1);purl <- "samsung_galaxy_s5_mini-6252.php"
    message(purl)
    dfphn <- file.path(url, purl) %>% 
      read_html() %>% 
      html_table(fill = TRUE) %>% 
      map_df(function(t){
        c(t[1, 1]) %>% 
          cbind(rbind(as.matrix(t[1, 2:3]),
                      as.matrix(t[2:nrow(t), 1:2]))) %>% 
          as.data.frame(stringsAsFactors = FALSE) %>% 
          setNames(c("spec", "spec2", "value")) %>% 
          mutate(spec2 = str_replace_all(spec2, "Â", ""),
                 key = paste(spec, spec2, sep = "_") %>% str_to_lower(),
                 key = str_trim(key),
                 key = str_replace(key, "_$", "_other"),
                 key = str_replace_all(key, "\.", "")) %>% 
          select(key, value)
      }) %>%
      distinct(key) %>% 
      spread(key, value) %>%
      mutate(phn_url = purl) 
  })
  
  dfphonebrand <- left_join(dres, dres2, by = "phn_url")
  
  save(dfphonebrand, file = frdata)
  
  data_frame(burl)
  
  
})

dfphones <- map_df(dfphones$burl, function(burl){
  load(paste0(burl, ".RData"))
  dfphonebrand
})

With this data ready we can join with the previous brand data to get the big one data_frame.

dfphns <- dfbrands %>% 
  right_join(dfphones, by = "brand_url") 

I once read something like:

The code to clean data is a dirty code

This is so true. Here we separate some variables like body_dimensions which have value in the form: 12 x 12 x 12 and we need these values separately so we use the tidyr::separate function and parse differents time formats in the same variable among other secrets.

dfphns <- dfphns %>% 
  mutate(body_dimensions = str_replace(body_dimensions, "mm \(.*\)", ""),
         weight = as.numeric(str_extract(body_weight, "\d+"))) %>% 
  separate(body_dimensions, into = c("height", "width", "depth"), sep = " x ",
           remove = FALSE, convert = TRUE) %>% 
  mutate(height = as.numeric(height),
         width = as.numeric(width),
         depth = as.numeric(depth),
         r = t(col2rgb(brand_color))[, 1],
         g = t(col2rgb(brand_color))[, 2],
         b = t(col2rgb(brand_color))[, 3],
         brand_color_2 = paste("rgba(", r, ", ", g, ", ", b, ", 0.2)"),
         screen_body_ratio = str_extract(display_size, "\d+\.\d+%"),
         screen_body_ratio = str_replace(screen_body_ratio, "%", ""),
         screen_body_ratio = as.numeric(screen_body_ratio),
         screen_ppi = str_extract(display_resolution, "~\d+"),
         screen_ppi = as.numeric(str_replace(screen_ppi, "~", "")),
         talk_time = as.numeric(str_extract(`battery_talk time`, "\d+")),
         camera = str_extract(camera_primary, ".* MP"),
         camera = as.numeric(str_replace(camera," MP", "")),
         year = str_extract(launch_announced, "\d{4}"),
         month = str_extract(launch_announced, paste(month.abb, collapse = "|")),
         month = ifelse(str_detect(launch_announced, "1Q|Q1"), "Jan", month),
         month = ifelse(str_detect(launch_announced, "2Q|Q2"), "Apr", month),
         month = ifelse(str_detect(launch_announced, "3Q|Q3"), "Jul", month),
         month = ifelse(str_detect(launch_announced, "4Q|Q4"), "Oct", month),
         month = ifelse(is.na(month), "Jan", month)) %>% 
  # Cancelled Not officially announced yet 
  left_join(data_frame(month = month.abb, monthn = seq(12)), by = "month") %>% 
  mutate(launch_date = paste(year, monthn, 1, sep = "-"),
         launch_date = ymd(launch_date)) %>% 
  filter(!(is.na(year) | is.na(month) | is.na(height)),
         screen_body_ratio < 100)


dfbrandcolors <- dfphns %>% 
  select(brand_name, brand_color) %>% 
  distinct() %>% 
  {setNames(.$brand_color, .$brand_name)}

Now we have a more tidier data. Nice!

We’ll extract some features/specifications like the pixels camera, screen_body_ratio, height and plot them vs time.

dfphns %>%
  select(launch_date, brand_name, height, 
         depth, screen_body_ratio, camera) %>% 
  gather(key, value, -launch_date, -brand_name) %>% 
  ggplot(aes(launch_date, value)) + 
  geom_point(aes(color = brand_name), alpha = 0.25) +
  geom_smooth(color = "black", size = 1.2, alpha = 0.5) + 
  scale_color_manual(values = dfbrandcolors) +
  facet_wrap(~key, scales = "free") + 
  ggtitle("Release date vs phone specifications")

Clearly the megapixels are getting bigger and the phones are getting more and more thinners nothing to worry (right?) about.

Now, the screen body ratio start to growth near of 2007 same date the first iPhone was realeased, coincidence? Nah. It suppose in that date the touch sreens were the new fancy technology so all brands followed the stream.

But what we see in height? We see a similiar trend as the first image. But this trend it’s seem so slight but this is scale effect beacuse as I said before, there are some NO phones in the data.

Now we’ll transform the data to list to chart using highcharter to get a more interactive chart.

dsphns <- dfphns %>% 
  filter(!is.na(height), !is.na(launch_date)) %>% 
  select(launch_date, height, brand_name, brand_color_2, 
         brand_image_url,
         phn, phn_image_url) %>% 
  mutate(x = datetime_to_timestamp(launch_date),
         y = height,
         color = brand_color_2) %>% 
  list.parse3()

dsphnsiphones <- dsphns[map_lgl(dsphns, function(x) str_detect(x$phn, "^iPhone") )]

glxys <- c("I9000 Galaxy S", "I9100G Galaxy S II", "I9300 Galaxy S III",
           "I9500 Galaxy S4", "Galaxy S5", "Galaxy S6", "Galaxy S6 edge",
           "Galaxy S7", "Galaxy S7 edge")

dsphnsgalaxy <- dsphns[map_lgl(dsphns, function(x) x$phn %in% glxys )]

To get the loess fit data we’ll use the broom package to get a tidy data frame to then transform it a list format.

fit <- loess(height ~ datetime_to_timestamp(launch_date),
             data = dfphns) %>% 
  augment() %>% 
  tbl_df()

head(fit)
height datetime_to_timestamp.launch_date. .fitted .se.fit .resid
132 1.37e+12 141 0.645 -9.3
126 1.37e+12 141 0.645 -15.0
124 1.37e+12 141 0.645 -16.9
194 1.37e+12 141 0.645 53.4
148 1.37e+12 141 0.645 7.0
152 1.37e+12 141 0.645 11.0
dssmooth <- fit %>% 
  select(x = datetime_to_timestamp.launch_date.,
         y = .fitted) %>% 
  distinct(x) %>% # imporant!
  arrange(x) %>% # really important!
  list.parse3()

dssarea <- fit %>% 
  mutate(x = datetime_to_timestamp.launch_date.,
         y = .fitted - .se.fit,
         z = .fitted + .se.fit) %>% 
  select(x, y, z) %>% 
  distinct(x) %>% # imporant!
  arrange(x) %>% # really important!
  list.parse2()

The data is ready. Now make the tooltip and chart!

tooltip <- tagList(
  tags$span(style = "color:#3C3C3C", "{point.phn}"),
  tags$hr(),
  tags$img(src = '{point.brand_image_url}'),
  tags$br(),
  tags$img(src = '{point.phn_image_url}', width = "95%")
  ) %>% as.character()

highchart() %>% 
  hc_title(text = "Release date vs Heigth") %>% 
  hc_subtitle(text = "data from: http://www.gsmarena.com/") %>% 
  hc_chart(zoomType = "xy") %>% 
  hc_plotOptions(series = list(
    turboThreshold = 8000,
    marker = list(symbol = "circle", radius = 2)
    )) %>% 
  hc_yAxis(title = list(text = "Heigth"),
           min = 60, max = 170, maxZoom = 10) %>%
  hc_xAxis(type = "datetime", maxZoom = 2 * 30 * 24 * 3600 * 1000) %>% 
  hc_add_serie(data = dsphnsgalaxy, type = "scatter",
               dataLabels = list(enabled = TRUE, format = "{point.phn}"),
               name = "Galaxy S",zIndex = 1, color = "rgba( 64 , 54 , 132 , 0.5)",
               marker = list(radius = 3)) %>%
  hc_add_serie(data = dsphnsiphones, type = "scatter",
               dataLabels = list(enabled = TRUE, format = "{point.phn}"),
               name = "IPhones",zIndex = 1, color = "rgba( 51 , 51 , 51 , 0.5)",
               marker = list(radius = 3)) %>%
  hc_add_serie(data = dsphns, type = "scatter", legendIndex = -1,
               name = "All Phones", zIndex = -5) %>%
  hc_add_serie(data = dssmooth, name = "Trend",
               type = "spline", lineWidth = 3, color = "#000",
               enableMouseTracking = FALSE,
               marker = list(enabled = FALSE)) %>%
  hc_add_serie(data = dssarea, 
               type = "arearange", fillOpacity = 0.25, color = "#c3c3c3",
               linkedTo = 'previous', name = "se",
               lineWidth = 1.5, enableMouseTracking = FALSE) %>% 
  hc_tooltip(
    useHTML = TRUE,
    backgroundColor = "white",
    borderWidth = 4,
    headerFormat = "<table style ='width:160px;height:200px' >",
    pointFormat = tooltip,
    footerFormat = "</table>"
  ) %>% 
  hc_add_theme(hc_theme_538())

Do you see the same trend? I do! And what do you think about the chart? IMHO t looks nice to be craeted using only R :D.

Remember: You can the see the standalone charts here

To leave a comment for the author, please follow the link and comment on their blog: Jkunst - R category.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.