Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
AI in R and Python
Greetings, humanists, social and data scientists!
Welcome to this journey into extracting historical data from the Web using AI! In this lesson, we look at how to use ChatGPT to gather data for your analyses. Specifically, you’ll learn how to instruct ChatGPT to turn historical data copied from a Wikipedia article into a neat R or Pandas data frame.
ChatGPT is an advanced large language model that has the remarkable capability to generate human-like text based on the prompts it receives. With ChatGPT, you can ask questions or provide instructions, and it will respond with relevant and coherent written answers. By the end of this post, the vast historical archives of Wikipedia will be just a prompt away, all thanks to the capabilities of ChatGPT.
Data source
The data used in this tutorial is presented in the Wikipedia article German casualties in World War II. This article offers an overview of the casualties sustained by the German Army during the Second World War.
1. Use ChatGPT to retrieve data in a Wikipedia table
The first step in gathering the data is to open the Wikipedia article German casualties in World War II. This article reports the number of deaths, wounded, and sick people during World War II in Germany.
Suppose you are interested in studying how the war differed across fronts and geographical regions. You see the second table presented in the article and would like to make a visualization showing the number of deaths in the German Army across the different war fronts. For small tables like this, you can simply select and copy the data you would like to transform into a data frame, as shown in the figure below.
In this example, we will only gather the Army section of the table. Once you have copied the data, you are ready to use ChatGPT.
2. A good ChatGPT prompt
When writing prompts for ChatGPT, clarity and specificity are key. Start by providing enough context so the model understands what you’re asking. Being explicit about the format or type of answer you want can also help guide ChatGPT in generating a response that meets your expectations.
Additionally, including examples of desired responses can be beneficial. Always consider the “garbage in, garbage out” principle; a well-crafted prompt is more likely to result in a useful and coherent response.
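As a rough illustration of the advice above, the sketch below assembles a prompt from labelled parts: some context, the task, the desired output format, and the pasted table text. The function name `build_prompt`, its parameters, and the wording of each part are assumptions made for this example; this is not the prompt used later in this lesson, and the table text is a placeholder.

```python
def build_prompt(context: str, task: str, output_format: str, table_text: str) -> str:
    """Join the parts of a prompt into one block of text (hypothetical helper)."""
    parts = [
        context.strip(),
        task.strip(),
        f"Desired output: {output_format.strip()}",
        "Data:",
        table_text.strip(),
    ]
    # Blank lines between the parts keep them visually separate for the model.
    return "\n\n".join(parts)


# Placeholder table text, NOT the real Wikipedia figures.
example_table = "Description\tTotal\nFront A\t100\nFront B\t200"

prompt = build_prompt(
    context="The text below was copied from a table in a Wikipedia article.",
    task="Transform it into a data frame, keeping the column names as they appear.",
    output_format="Python code that builds a Pandas data frame.",
    table_text=example_table,
)
print(prompt)
```

Keeping the parts separate makes it easy to swap only the output format when moving between the R and Python versions of the request.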
For the R prompt, see step 3 below; for the Python prompt, jump to step 4.
3. AI for R
Below you can see a suggested prompt that asks ChatGPT to transform the text copied from Wikipedia into an R data frame. In the figure, we highlight each part of the prompt. ChatGPT’s response is also provided.
If you would like, you can copy the prompt here:
And here is the model’s answer:
And that’s it. In a few seconds you have your data formatted in a nice dataframe, ready to be analyzed.
Now we will make a simple bar plot to visualize the total number of deaths by front. To do that, we use ggplot2. Note that we reorder the variable Description so that it is ordered from the largest number of deaths to the smallest. We also divide Total by one million so that the x axis shows deaths in millions. To add a customized ggplot2 theme, like the one in the plot presented below, check the lesson ‘Climate data visualization with ggplot2’.
It is remarkable that most casualties occurred on the Eastern Front, while the least lethal front was in the Nordic countries.
4. AI for Python
The prompt for achieving the same result in Python is almost the same, but we ask for a Pandas data frame, instead of an R data frame.
This prompt correctly organizes the data in a Pandas data frame, as shown below:
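The exact code ChatGPT returns will vary between sessions. As a minimal sketch of the kind of data frame involved, and of the preparation described for the plot in step 3 (ordering by the largest total and expressing totals in millions), something like the following could be used. The column names `Description` and `Total` come from this lesson; the row labels, the values, and the `Total_millions` column are placeholders assumed for illustration, not the Wikipedia figures.

```python
import pandas as pd

# Placeholder values, NOT the real Wikipedia figures.
df = pd.DataFrame(
    {
        "Description": ["Front A", "Front B", "Front C"],
        "Total": [1_500_000, 250_000, 4_000_000],
    }
)

# Order rows from the largest total to the smallest
# and add a column with the totals expressed in millions.
plot_data = (
    df.assign(Total_millions=df["Total"] / 1_000_000)
    .sort_values("Total", ascending=False)
    .reset_index(drop=True)
)
print(plot_data)
```

From here, `plot_data` could be passed to any plotting library to draw a bar chart comparable to the ggplot2 one.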
Keep in mind that accuracy is not one of ChatGPT’s strengths. While I have not seen any mistakes in the code above, it is always a good idea to review ChatGPT’s work against the original table. Let me know if you have any suggestions on how to improve this method. Also, leave any questions or concerns you might have in the comments below.
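One possible way to support that review, sketched here under assumptions, is to check that every number in the generated data frame also occurs in the text you copied from Wikipedia. The helper names `numbers_in_text` and `unmatched_values` are hypothetical, and the inputs are placeholders with a deliberate typo rather than real figures.

```python
import re


def numbers_in_text(text: str) -> set[int]:
    """Collect every integer in the text, ignoring thousands separators."""
    return {int(m.replace(",", "")) for m in re.findall(r"\d[\d,]*", text)}


def unmatched_values(copied_text: str, generated_values: list[int]) -> list[int]:
    """Return generated values that do not occur anywhere in the copied text."""
    source_numbers = numbers_in_text(copied_text)
    return [v for v in generated_values if v not in source_numbers]


# Placeholder inputs, NOT the real Wikipedia figures.
copied_text = "Description\tTotal\nFront A\t1,000\nFront B\t2,500"
generated_totals = [1000, 2050]  # the second value contains a deliberate typo

print(unmatched_values(copied_text, generated_totals))  # prints [2050]
```

A check like this only catches altered or invented numbers; it does not confirm that each value ended up in the right row, so a visual comparison with the article is still worthwhile.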
Conclusions
- ChatGPT is a powerful tool that can be utilized to efficiently gather and organize historical data from the web, specifically Wikipedia, into structured data frames for analysis in R and Python;
- Clarity and specificity are essential when crafting prompts for ChatGPT to ensure accurate and relevant responses;
- While this method is highly effective for small tables, it’s important to note that ChatGPT may suggest alternative solutions such as web scraping for larger and more complex tables, and to always review ChatGPT’s output for accuracy.