Why R and Python
Context and problem statement
Data analysis is part of the lifecycle of a research project. To achieve our reproducibility aim, we need to use tools that are code-based rather than GUI-based, since GUI-based tools are inherently difficult to make reproducible. Using an open source tool greatly simplifies the process of sharing and reproducing results, but for this decision post we will compare closed source tools as well. So we need to decide on the software we will use for data analysis. The question is:
What is the best software we can use for data analysis for our purposes, principles, and aims?
Decision drivers
The main things driving this decision are:
- Comparatively easy way to reproduce results we obtain.
- A tool we can all standardize on and use to help support each other and more effectively collaborate.
- A tool that is widely available and is already installed, or can easily be installed, on our computers or servers.
- Still needs to follow our core principles and aims.
- Is not a GUI-based tool.
- Should be relatively familiar to most of us or easy to learn.
Considered options
Based on our context, question, and drivers, we will consider the following options: R, Python, SAS, and Stata.
Common GUI-based tools are excluded from consideration.
R
R is an open-source programming language and software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data analysts for developing statistical methods and doing data analysis.
Benefits
- Widely used in the research community
- Massive, friendly, and supportive community
- Large number of packages that can do nearly any task
- Has all the latest statistical methods that can be installed as packages
- Has a large number of resources and learning material to use (including the DDEA’s R courses that run every year)
- Beginner friendly data analysis tools via the tidyverse
- Open source, free to use, easy to install
- Because it is open source, it has more support with AI helper tools like ChatGPT or GitHub Copilot
- Has a lot of templating tools available to simplify many aspects of data analysis work through packages like usethis
Drawbacks
- Has a steep learning curve, so can be difficult to learn
- Still not as widely used in the research environment as Stata, at least in the Danish context, so doesn’t have as much institutional support
Python
Python is an interpreted, high-level, general-purpose programming language. Python’s design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects. Because it is a general-purpose language, there are packages available for doing data analysis tasks.
Benefits
- Massive community
- Is a general-purpose programming language, so can be used for more than just data analysis
- Has a large number of packages for many different programming tasks
- Is more commonly used in the data science and machine learning community
- Has all necessary machine learning models available as packages
- Because it is open source, it has more support with AI helper tools like ChatGPT or GitHub Copilot
- Open source, free to use, and easy to install
Drawbacks
- Because it is a more formal and general-purpose programming language, it can be more difficult to learn for non-software engineers/developers
- Not as many statistical packages as R
- Not as easy to get started and set up when first making a project (need to create virtual environments)
- Community, and thus the packages available, take a more “set it up yourself” approach to common tasks, which can be a major barrier to entry for beginners
- There are fewer beginner-friendly or beginner-focused resources and learning materials available
- Still not as widely used in the research environment as Stata, at least in the Danish context, so doesn’t have as much institutional support
- There are few in-person courses that teach Python for data analysis, at least within the Danish context
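To illustrate the virtual environment step mentioned in the drawbacks above, here is a minimal sketch using the `venv` module from Python's standard library. The helper name `create_project_env` and the `.venv` folder name are assumptions made for illustration, not something we have standardized on.

```python
import tempfile
import venv
from pathlib import Path


def create_project_env(project_dir: Path) -> Path:
    """Create an isolated virtual environment in `<project_dir>/.venv`."""
    env_dir = project_dir / ".venv"
    # with_pip=False keeps this sketch quick; a real project would usually
    # set with_pip=True so that packages can be installed into the environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


# Use a throwaway directory as a stand-in for a new analysis project.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_project_env(Path(tmp))
    # Every virtual environment contains a pyvenv.cfg marker file.
    env_created = (env_dir / "pyvenv.cfg").is_file()

print("Environment created:", env_created)
```

This is the kind of setup a new Python project typically needs before any packages are installed, which is part of why getting started can feel less direct for beginners.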
SAS
SAS is a software suite developed by SAS Institute for advanced analytics, multivariate analysis, business intelligence, data management, and predictive analytics.
Benefits
- Widely used in certain industries, such as the pharmaceutical industry, as well as some Danish research institutions like Statistics Denmark
- Is a very old language and so has many of the established statistical methods built-in
- Because it is paid software, there is more personal support available
Drawbacks
- Is closed source, so you need a license to install and use it, which makes it difficult to share and reproduce results
- Was designed in a different time period when software development practices were less advanced, so its language is less modern and more difficult to use
- Nearly no new PhD student is familiar with SAS
- Very few courses teach SAS now
- Because it is closed source, there are far fewer free online resources for learning and using it
- Because it is closed source, it has less support with AI helper tools like ChatGPT or GitHub Copilot
- The whole ecosystem is closed, which means having to learn features that are only available through SAS
- Because it is closed source, it is more difficult to integrate with other tools and software
Stata
Stata is a general-purpose statistical software package created by StataCorp. It is used by many researchers in the social sciences, public health, and economics.
Benefits
- Has strong institutional support within Danish universities
- Is widely used in the research community, especially in the Danish context
- Many established researchers (in Denmark) are familiar with Stata
- Has many of the established statistical methods built-in
- Within the Danish context, there are many courses at the university level that teach Stata
Drawbacks
- Has all of the same drawbacks as SAS
- In Denmark, fewer courses are teaching Stata compared to previous years
Decision outcome
We decided on using R for most of our data analysis tasks because of the benefits we listed above. It is also the tool most new PhD students are familiar with or learn to use during their studies. However, because tools such as Quarto make it very easy to mix R and Python code together, we also decided to use Python for some tasks where it is more appropriate, such as machine learning analyses.
Consequences
- Stata has been used, supported, taught, and recommended at institutional levels in Danish universities for many years, so many established researchers will be more familiar with Stata. Because of this, we may initially need to do more onboarding and training for those working on DP-Next.
- Few people may be familiar with mixing Python and R, so we may need to do more onboarding and training on this.