Scraping a table from any web page with R or CloudStat

[This article was first published on PR, and kindly contributed to R-bloggers].


If you need data from the internet, there is no need to retype it: as long as you know the page URL, you can extract (scrape) the table directly.

This is possible thanks to the XML package for R, which provides the readHTMLTable() function. It reads the HTML tables on a page and returns them as data frames.
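
To see what readHTMLTable() does without depending on a live web page, here is a minimal sketch that parses a small hand-written HTML string. The HTML content, the names `html`, `doc` and `scores`, and the sample values are made up for illustration; only htmlParse() and readHTMLTable() come from the XML package.

```r
library(XML)

# A tiny hand-written HTML page (hypothetical data, for illustration only).
html <- '
<html><body>
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alpha</td><td>78</td></tr>
  <tr><td>Beta</td><td>72</td></tr>
</table>
</body></html>'

# Parse the string as HTML, then extract the first table as a data frame.
doc <- htmlParse(html, asText = TRUE)
scores <- readHTMLTable(doc, header = TRUE, which = 1, stringsAsFactors = FALSE)

print(scores)
str(scores)  # shows the column types of the extracted table
```

With a real page, the URL string takes the place of `doc`, as in the examples that follow.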

As a case study, I want to scrape two tables:

  1. US Airline Customer Score.
  2. World Top Chess Players (Men).

A. Scraping the US Airline Customer Score table from
http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines

Code:

library(XML)  # provides readHTMLTable()
airline = "http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines"
# which=1 selects the first table on the page
airline.table = readHTMLTable(airline, header=TRUE, which=1, stringsAsFactors=FALSE)

Result:

> library(XML)
Warning message:
package "XML" was built under R version 2.14.1
> airline = "http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines"
> airline.table = readHTMLTable(airline, header=T, which=1,stringsAsFactors=F)
> airline.table
                     Base-line 95 96 97 98 99 00 01 02 03 04 05 06 07 08 09 10
1          Southwest        78 76 76 76 74 72 70 70 74 75 73 74 74 76 79 81 79
2         All Others        NM 70 74 70 62 67 63 64 72 74 73 74 74 75 75 77 75
3           Airlines        72 69 69 67 65 63 63 61 66 67 66 66 65 63 62 64 66
4        Continental        67 64 66 64 66 64 62 67 68 68 67 70 67 69 62 68 71
5           American        70 71 71 62 67 64 63 62 63 67 66 64 62 60 62 60 63
6             United        71 67 70 68 65 62 62 59 64 63 64 61 63 56 56 56 60
7         US Airways        72 67 66 68 65 61 62 60 63 64 62 57 62 61 54 59 62
8              Delta        77 72 67 69 65 68 66 61 66 67 67 65 64 59 60 64 62
9 Northwest Airlines        69 71 67 64 63 53 62 56 65 64 64 64 61 61 57 57 61
  11 PreviousYear%Change FirstYear%Change
1 81                 2.5              3.8
3 65                -1.5             -9.7
4 64                -9.9             -4.5
5 63                 0.0            -10.0
7 61                -1.6            -15.3
8 56                -9.7            -27.3
9  #                 N/A              N/A
>
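
The `which` argument picks one table by its position on the page, so it helps to know how many tables a page contains. One way to find the right index, sketched here on a made-up HTML string rather than a live page, is to read every table and compare their sizes. The HTML content and the names `tables`, `n_rows` and `wanted` are assumptions for illustration.

```r
library(XML)

# Hypothetical page with two tables; only the second holds the data of interest.
html <- '
<html><body>
<table><tr><td>Home</td><td>About</td></tr></table>
<table>
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Alpha</td></tr>
  <tr><td>2</td><td>Beta</td></tr>
  <tr><td>3</td><td>Gamma</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Without `which`, readHTMLTable() returns a list with one element per <table>.
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
print(length(tables))

# Row counts per table help spot the one that holds the data.
n_rows <- sapply(tables, function(tb) if (is.null(tb)) 0L else nrow(tb))
print(unname(n_rows))

# Once the position is known, request that table directly.
wanted <- readHTMLTable(doc, header = TRUE, which = 2, stringsAsFactors = FALSE)
print(wanted)
```

Real pages often use extra tables for layout or navigation, which is a likely reason the data table is not always the first one.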

B. Scraping the World Top Chess Players (Men) table from http://ratings.fide.com/top.phtml?list=men

Code:

chess = "http://ratings.fide.com/top.phtml?list=men"
# which=5 selects the fifth table on the page (the XML package must already be loaded)
chess.table = readHTMLTable(chess, header=TRUE, which=5, stringsAsFactors=FALSE)

Result:

> chess = "http://ratings.fide.com/top.phtml?list=men"
> chess.table = readHTMLTable(chess, header=T, which=5,stringsAsFactors=F)
> chess.table
     Rank                       Name Title Country Rating Games B-Year
1      1           Carlsen, Magnus    g    NOR  2835   17  1990
2      2            Aronian, Levon    g    ARM  2805   25  1982
3      3         Kramnik, Vladimir    g    RUS  2801   17  1975
4      4        Anand, Viswanathan    g    IND  2799   17  1969
5      5         Radjabov, Teimour    g    AZE  2773    9  1987
6      6          Topalov, Veselin    g    BUL  2770    9  1975
7      7          Karjakin, Sergey    g    RUS  2769   16  1990
8      8         Ivanchuk, Vassily    g    UKR  2766   16  1969
9      9     Morozevich, Alexander    g    RUS  2763    6  1977
10    10           Gashimov, Vugar    g    AZE  2761    9  1986
11    11       Grischuk, Alexander    g    RUS  2761    8  1983
12    12          Nakamura, Hikaru    g    USA  2759   17  1987
13    13            Svidler, Peter    g    RUS  2749   17  1976
14    14    Mamedyarov, Shakhriyar    g    AZE  2747    9  1985
15    15       Tomashevsky, Evgeny    g    RUS  2740    0  1987
16    16            Gelfand, Boris    g    ISR  2739    9  1968
17    17          Caruana, Fabiano    g    ITA  2736   19  1992
18    18       Nepomniachtchi, Ian    g    RUS  2735   16  1990
19    19                 Wang, Hao    g    CHN  2733    6  1989
20    20              Kamsky, Gata    g    USA  2732    0  1974
21    21  Dominguez Perez, Leinier    g    CUB  2730    6  1983
22    22         Jakovenko, Dmitry    g    RUS  2729    0  1983
23    23        Ponomariov, Ruslan    g    UKR  2727   13  1983
24    24          Vitiugov, Nikita    g    RUS  2726    1  1987
25    25            Adams, Michael    g    ENG  2724   17  1971
26    26               Leko, Peter    g    HUN  2720    9  1979
27    27            Almasi, Zoltan    g    HUN  2717    8  1976
28    28               Giri, Anish    g    NED  2714   15  1994
29    29            Le, Quang Liem    g    VIE  2714    0  1991
30    30             Navara, David    g    CZE  2712    8  1985
31    31            Shirov, Alexei    g    LAT  2710   13  1972
32    32             Polgar, Judit    g    HUN  2710    0  1976
33    33     Riazantsev, Alexander    g    RUS  2710    0  1985
34    34       Wojtaszek, Radoslaw    g    POL  2706    8  1987
35    35      Moiseenko, Alexander    g    UKR  2706    7  1980
36    36   Vallejo Pons, Francisco    g    ESP  2705   15  1982
37    37        Malakhov, Vladimir    g    RUS  2705    0  1980
38    38            Jobava, Baadur    g    GEO  2704   23  1983
39    39           Bacrot, Etienne    g    FRA  2704   14  1983
40    40          Laznicka, Viktor    g    CZE  2704    8  1988
41    41            Sutovsky, Emil    g    ISR  2703    8  1977
42    42        Naiditsch, Arkadij    g    GER  2702   14  1985
43    43         Movsesian, Sergei    g    ARM  2700    9  1978
44    44       Sasikiran, Krishnan    g    IND  2700    9  1981
45    45   Vachier-Lagrave, Maxime    g    FRA  2699   13  1990
46    46            Dreev, Aleksey    g    RUS  2698    6  1969
47    47           Efimenko, Zahar    g    UKR  2695    8  1985
48    48         Volokitin, Andrei    g    UKR  2695    0  1986
49    49                 Wang, Yue    g    CHN  2694    6  1987
50    50        Fressinet, Laurent    g    FRA  2693   17  1981
51    51                Li, Chao b    g    CHN  2693    6  1989
52    52            Grachev, Boris    g    RUS  2693    0  1986
53    53      Nielsen, Peter Heine    g    DEN  2693    0  1973
54    54            Van Wely, Loek    g    NED  2692   13  1972
55    55    Bruzon Batista, Lazaro    g    CUB  2691   19  1982
56    56           McShane, Luke J    g    ENG  2691    8  1984
57    57            Eljanov, Pavel    g    UKR  2690   10  1983
58    58      Kasimdzhanov, Rustam    g    UZB  2689   14  1979
59    59         Inarkiev, Ernesto    g    RUS  2689    6  1985
60    60         Zvjaginsev, Vadim    g    RUS  2688    8  1976
61    61         Andreikin, Dmitry    g    RUS  2688    0  1990
62    62    Areshchenko, Alexander    g    UKR  2688    0  1986
63    63         Rublevsky, Sergei    g    RUS  2686    0  1974
64    64         Akopian, Vladimir    g    ARM  2685    8  1971
65    65          Potkin, Vladimir    g    RUS  2684    0  1982
66    66       Sargissian, Gabriel    g    ARM  2683   15  1983
67    67            Berkes, Ferenc    g    HUN  2682   16  1985
68    68           Bologan, Viktor    g    MDA  2680   15  1971
69    69          Bauer, Christian    g    FRA  2679   24  1977
70    70          Tiviakov, Sergei    g    NED  2677   22  1973
71    71            Short, Nigel D    g    ENG  2677   15  1965
72    72        Motylev, Alexander    g    RUS  2677    6  1979
73    73         Gharamian, Tigran    g    FRA  2676    0  1984
74    74          Kobalia, Mikhail    g    RUS  2673    0  1978
75    75              Meier, Georg    g    GER  2671    9  1987
76    76       Onischuk, Alexander    g    USA  2670   13  1975
77    77              Bu, Xiangzhi    g    CHN  2670    6  1985
78    78          Alekseev, Evgeny    g    RUS  2670    0  1985
79    79            Azarov, Sergei    g    BLR  2667    0  1983
80    80        Kryvoruchko, Yuriy    g    UKR  2666    0  1986
81    81             Balogh, Csaba    g    HUN  2665    8  1987
82    82           Harikrishna, P.    g    IND  2665    6  1986
83    83       Khismatullin, Denis    g    RUS  2664    8  1984
84    84   Nguyen, Ngoc Truong Son    g    VIE  2662    6  1990
85    85           Fridman, Daniel    g    GER  2660   11  1976
86    86              Smirin, Ilia    g    ISR  2660    7  1968
87    87               Ding, Liren    g    CHN  2660    6  1992
88    88         Sadler, Matthew D    g    ENG  2660    3  1974
89    89            Korobov, Anton    g    UKR  2660    0  1985
90    90          Cheparinov, Ivan    g    BUL  2659   18  1986
91    91          Timofeev, Artyom    g    RUS  2659    0  1985
92    92           Georgiev, Kiril    g    BUL  2658   17  1965
93    93           Bartel, Mateusz    g    POL  2658    9  1985
94    94          Zhigalko, Sergei    g    BLR  2658    8  1989
95    95         Feller, Sebastien    g    FRA  2658    0  1991
96    96            Ragger, Markus    g    AUT  2655   17  1988
97    97         Jones, Gawain C B    g    ENG  2653   27  1987
98    98                So, Wesley    g    PHI  2653    5  1993
99    99              Milov, Vadim    g    SUI  2653    0  1972
100  100           Gupta, Abhijeet    g    IND  2652    9  1989
101  101            Postny, Evgeny    g    ISR  2652    8  1981
102  102             Roiz, Michael    g    ISR  2652    6  1983
103  103           Gyimesi, Zoltan    g    HUN  2652    4  1977
104  104          Nikolic, Predrag    g    BIH  2652    2  1960
>

Done. You have successfully scraped data from a web page with R or CloudStat.

Now you can analyze the data as usual, with no retyping. Enjoy!
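
One step is usually needed before analysis: with stringsAsFactors=F the scraped columns are character strings, and the airline output contains non-numeric markers such as NM, # and N/A. The sketch below shows one possible way to convert such columns, using three rows copied from the airline output. The column names `airline`, `baseline` and `y1995` and the helper `to_number` are my own and are not part of the scraped table.

```r
# Three rows copied from the airline output, stored as character strings
# the way stringsAsFactors = FALSE leaves them.
scores <- data.frame(
  airline  = c("Southwest", "All Others", "Continental"),
  baseline = c("78", "NM", "67"),
  y1995    = c("76", "70", "64"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn known non-numeric markers into NA, then convert.
to_number <- function(x) {
  x[x %in% c("NM", "N/A", "#", "")] <- NA
  as.numeric(x)
}

num_cols <- c("baseline", "y1995")
scores[num_cols] <- lapply(scores[num_cols], to_number)

str(scores)
print(mean(scores$baseline, na.rm = TRUE))  # 72.5, the mean of 78 and 67
```

Replacing the markers explicitly, rather than relying on as.numeric() to coerce them, avoids coercion warnings and makes it clear which values were treated as missing.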

Tags: scrape, scraping, data collection
