Scraping tables from any web page with R or CloudStat
When you need data from the internet, you don't have to retype it: if you know the page's URL, you can extract (scrape) the data directly.
Thanks go to the XML package for R, which provides the amazing readHTMLTable() function.
As a case study, I want to scrape the following data:
A. Scraping the US airline customer score table from
http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines
Code:
library(XML)  # provides readHTMLTable()
airline = "http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines"
airline.table = readHTMLTable(airline, header=TRUE, which=1, stringsAsFactors=FALSE)
Result:
> library(XML)
Warning message:
package "XML" was built under R version 2.14.1
> airline = "http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines"
> airline.table = readHTMLTable(airline, header=T, which=1,stringsAsFactors=F)
> airline.table
                     Base-line 95 96 97 98 99 00 01 02 03 04 05 06 07 08 09 10
1 Southwest                 78 76 76 76 74 72 70 70 74 75 73 74 74 76 79 81 79
2 All Others                NM 70 74 70 62 67 63 64 72 74 73 74 74 75 75 77 75
3 Airlines                  72 69 69 67 65 63 63 61 66 67 66 66 65 63 62 64 66
4 Continental               67 64 66 64 66 64 62 67 68 68 67 70 67 69 62 68 71
5 American                  70 71 71 62 67 64 63 62 63 67 66 64 62 60 62 60 63
6 United                    71 67 70 68 65 62 62 59 64 63 64 61 63 56 56 56 60
7 US Airways                72 67 66 68 65 61 62 60 63 64 62 57 62 61 54 59 62
8 Delta                     77 72 67 69 65 68 66 61 66 67 67 65 64 59 60 64 62
9 Northwest Airlines        69 71 67 64 63 53 62 56 65 64 64 64 61 61 57 57 61
  11 PreviousYear%Change FirstYear%Change
1 81                 2.5              3.8
3 65                -1.5             -9.7
4 64                -9.9             -4.5
5 63                 0.0            -10.0
7 61                -1.6            -15.3
8 56                -9.7            -27.3
9  #                 N/A              N/A
>
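Because the call above passes stringsAsFactors=F, the scraped scores arrive as text, and the result contains non-numeric markers such as NM, N/A and #. Before analysis these columns usually need converting to numbers. Here is a minimal sketch using a small hand-copied subset of the table. The column names Airline, Baseline and Y2010, the helper to_number(), and the reading of the column headed 10 as the year 2010 are my own assumptions, not part of the original output.

```r
# Hand-copied subset of the airline result above; column names are illustrative.
scores <- data.frame(
  Airline  = c("Southwest", "All Others", "Airlines"),
  Baseline = c("78", "NM", "72"),
  Y2010    = c("79", "75", "66"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: treat the table's placeholder markers as missing,
# then convert the remaining text to numbers.
to_number <- function(x) {
  x[x %in% c("NM", "N/A", "#")] <- NA
  as.numeric(x)
}

score_cols <- c("Baseline", "Y2010")
scores[score_cols] <- lapply(scores[score_cols], to_number)

# Change from the baseline score to the 2010 score, in points.
scores$Change <- scores$Y2010 - scores$Baseline
print(scores)
```

Rows whose baseline is marked NM end up with a missing Change value, so functions such as mean() need na.rm = TRUE when summarising these columns.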
B. Scraping the world top chess players (men) table from http://ratings.fide.com/top.phtml?list=men
Code:
chess = "http://ratings.fide.com/top.phtml?list=men"
# which=5 selects the fifth table on the page, which holds the ratings
chess.table = readHTMLTable(chess, header=TRUE, which=5, stringsAsFactors=FALSE)
Result:
> chess = "http://ratings.fide.com/top.phtml?list=men"
> chess.table = readHTMLTable(chess, header=T, which=5,stringsAsFactors=F)
> chess.table
    Rank Name                     Title Country Rating Games B-Year
  1    1 Carlsen, Magnus          g     NOR       2835    17   1990
  2    2 Aronian, Levon           g     ARM       2805    25   1982
  3    3 Kramnik, Vladimir        g     RUS       2801    17   1975
  4    4 Anand, Viswanathan       g     IND       2799    17   1969
  5    5 Radjabov, Teimour        g     AZE       2773     9   1987
  6    6 Topalov, Veselin         g     BUL       2770     9   1975
  7    7 Karjakin, Sergey         g     RUS       2769    16   1990
  8    8 Ivanchuk, Vassily        g     UKR       2766    16   1969
  9    9 Morozevich, Alexander    g     RUS       2763     6   1977
 10   10 Gashimov, Vugar          g     AZE       2761     9   1986
 11   11 Grischuk, Alexander      g     RUS       2761     8   1983
 12   12 Nakamura, Hikaru         g     USA       2759    17   1987
 13   13 Svidler, Peter           g     RUS       2749    17   1976
 14   14 Mamedyarov, Shakhriyar   g     AZE       2747     9   1985
 15   15 Tomashevsky, Evgeny      g     RUS       2740     0   1987
 16   16 Gelfand, Boris           g     ISR       2739     9   1968
 17   17 Caruana, Fabiano         g     ITA       2736    19   1992
 18   18 Nepomniachtchi, Ian      g     RUS       2735    16   1990
 19   19 Wang, Hao                g     CHN       2733     6   1989
 20   20 Kamsky, Gata             g     USA       2732     0   1974
 21   21 Dominguez Perez, Leinier g     CUB       2730     6   1983
 22   22 Jakovenko, Dmitry        g     RUS       2729     0   1983
 23   23 Ponomariov, Ruslan       g     UKR       2727    13   1983
 24   24 Vitiugov, Nikita         g     RUS       2726     1   1987
 25   25 Adams, Michael           g     ENG       2724    17   1971
 26   26 Leko, Peter              g     HUN       2720     9   1979
 27   27 Almasi, Zoltan           g     HUN       2717     8   1976
 28   28 Giri, Anish              g     NED       2714    15   1994
 29   29 Le, Quang Liem           g     VIE       2714     0   1991
 30   30 Navara, David            g     CZE       2712     8   1985
 31   31 Shirov, Alexei           g     LAT       2710    13   1972
 32   32 Polgar, Judit            g     HUN       2710     0   1976
 33   33 Riazantsev, Alexander    g     RUS       2710     0   1985
 34   34 Wojtaszek, Radoslaw      g     POL       2706     8   1987
 35   35 Moiseenko, Alexander     g     UKR       2706     7   1980
 36   36 Vallejo Pons, Francisco  g     ESP       2705    15   1982
 37   37 Malakhov, Vladimir       g     RUS       2705     0   1980
 38   38 Jobava, Baadur           g     GEO       2704    23   1983
 39   39 Bacrot, Etienne          g     FRA       2704    14   1983
 40   40 Laznicka, Viktor         g     CZE       2704     8   1988
 41   41 Sutovsky, Emil           g     ISR       2703     8   1977
 42   42 Naiditsch, Arkadij       g     GER       2702    14   1985
 43   43 Movsesian, Sergei        g     ARM       2700     9   1978
 44   44 Sasikiran, Krishnan      g     IND       2700     9   1981
 45   45 Vachier-Lagrave, Maxime  g     FRA       2699    13   1990
 46   46 Dreev, Aleksey           g     RUS       2698     6   1969
 47   47 Efimenko, Zahar          g     UKR       2695     8   1985
 48   48 Volokitin, Andrei        g     UKR       2695     0   1986
 49   49 Wang, Yue                g     CHN       2694     6   1987
 50   50 Fressinet, Laurent       g     FRA       2693    17   1981
 51   51 Li, Chao b               g     CHN       2693     6   1989
 52   52 Grachev, Boris           g     RUS       2693     0   1986
 53   53 Nielsen, Peter Heine     g     DEN       2693     0   1973
 54   54 Van Wely, Loek           g     NED       2692    13   1972
 55   55 Bruzon Batista, Lazaro   g     CUB       2691    19   1982
 56   56 McShane, Luke J          g     ENG       2691     8   1984
 57   57 Eljanov, Pavel           g     UKR       2690    10   1983
 58   58 Kasimdzhanov, Rustam     g     UZB       2689    14   1979
 59   59 Inarkiev, Ernesto        g     RUS       2689     6   1985
 60   60 Zvjaginsev, Vadim        g     RUS       2688     8   1976
 61   61 Andreikin, Dmitry        g     RUS       2688     0   1990
 62   62 Areshchenko, Alexander   g     UKR       2688     0   1986
 63   63 Rublevsky, Sergei        g     RUS       2686     0   1974
 64   64 Akopian, Vladimir        g     ARM       2685     8   1971
 65   65 Potkin, Vladimir         g     RUS       2684     0   1982
 66   66 Sargissian, Gabriel      g     ARM       2683    15   1983
 67   67 Berkes, Ferenc           g     HUN       2682    16   1985
 68   68 Bologan, Viktor          g     MDA       2680    15   1971
 69   69 Bauer, Christian         g     FRA       2679    24   1977
 70   70 Tiviakov, Sergei         g     NED       2677    22   1973
 71   71 Short, Nigel D           g     ENG       2677    15   1965
 72   72 Motylev, Alexander       g     RUS       2677     6   1979
 73   73 Gharamian, Tigran        g     FRA       2676     0   1984
 74   74 Kobalia, Mikhail         g     RUS       2673     0   1978
 75   75 Meier, Georg             g     GER       2671     9   1987
 76   76 Onischuk, Alexander      g     USA       2670    13   1975
 77   77 Bu, Xiangzhi             g     CHN       2670     6   1985
 78   78 Alekseev, Evgeny         g     RUS       2670     0   1985
 79   79 Azarov, Sergei           g     BLR       2667     0   1983
 80   80 Kryvoruchko, Yuriy       g     UKR       2666     0   1986
 81   81 Balogh, Csaba            g     HUN       2665     8   1987
 82   82 Harikrishna, P.          g     IND       2665     6   1986
 83   83 Khismatullin, Denis      g     RUS       2664     8   1984
 84   84 Nguyen, Ngoc Truong Son  g     VIE       2662     6   1990
 85   85 Fridman, Daniel          g     GER       2660    11   1976
 86   86 Smirin, Ilia             g     ISR       2660     7   1968
 87   87 Ding, Liren              g     CHN       2660     6   1992
 88   88 Sadler, Matthew D        g     ENG       2660     3   1974
 89   89 Korobov, Anton           g     UKR       2660     0   1985
 90   90 Cheparinov, Ivan         g     BUL       2659    18   1986
 91   91 Timofeev, Artyom         g     RUS       2659     0   1985
 92   92 Georgiev, Kiril          g     BUL       2658    17   1965
 93   93 Bartel, Mateusz          g     POL       2658     9   1985
 94   94 Zhigalko, Sergei         g     BLR       2658     8   1989
 95   95 Feller, Sebastien        g     FRA       2658     0   1991
 96   96 Ragger, Markus           g     AUT       2655    17   1988
 97   97 Jones, Gawain C B        g     ENG       2653    27   1987
 98   98 So, Wesley               g     PHI       2653     5   1993
 99   99 Milov, Vadim             g     SUI       2653     0   1972
100  100 Gupta, Abhijeet          g     IND       2652     9   1989
101  101 Postny, Evgeny           g     ISR       2652     8   1981
102  102 Roiz, Michael            g     ISR       2652     6   1983
103  103 Gyimesi, Zoltan          g     HUN       2652     4   1977
104  104 Nikolic, Predrag         g     BIH       2652     2   1960
>
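Both calls above pass a which index (1 for the airline page, 5 for the chess page) to pick one table out of all the table elements on a page. One way to find that index is to read every table first and look at their sizes. The sketch below does this on a small made-up page parsed from a string, so it runs without network access. The HTML content and the variable names are illustrative assumptions; only htmlParse() and readHTMLTable() come from the XML package.

```r
library(XML)

# A small made-up page with two tables: a navigation strip and a data table.
html <- '<html><body>
<table><tr><td>Home</td><td>About</td></tr></table>
<table>
  <tr><th>Rank</th><th>Name</th><th>Rating</th></tr>
  <tr><td>1</td><td>Carlsen, Magnus</td><td>2835</td></tr>
  <tr><td>2</td><td>Aronian, Levon</td><td>2805</td></tr>
</table>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Without `which`, readHTMLTable() returns a list with one entry per table.
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Compare the dimensions of each entry to spot the data table.
sizes <- lapply(tables, dim)
print(sizes)

# Once the position is known, request only that table.
ratings <- readHTMLTable(doc, header = TRUE, which = 2, stringsAsFactors = FALSE)
print(ratings)
```

On a real page the same approach applies with the URL in place of doc. Layout tables often come first, which is presumably why the ratings sit at position 5 on the chess page.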
Done. You have successfully scraped data from a web page with R or CloudStat.
Now you can analyze the data as usual. Great! No more retyping. Enjoy!
Tags: scrape, scraping, data collection