This page documents the general design of fastreg. It covers some requirements, the public-facing interface, and some diagrams highlighting the general flow of the main functions.
Note
Reading SAS files in R can’t guarantee perfect preservation of the SAS values, because it relies on haven, which is built on ReadStat, a reverse-engineered reader for the proprietary SAS file format.
However, haven and the underlying ReadStat are mature packages and explicitly support reading sas7bdat files, which is the register format used by Statistics Denmark (DST).
Requirements
The core requirements of fastreg are to:
- Convert Danish register data from SAS files to the modern and efficient Parquet format.
- Read register Parquet files into R as a DuckDB table.
- Provide a targets pipeline template to convert multiple registers in parallel.
- Provide helper functions to list available SAS or Parquet register files directly from R.
Interface
The interface (the functions and objects that are exposed to users) is based on some specific naming conventions. Specifically, we generally name functions by the action they perform and the object(s) they perform it on in the format {action}_{object}(). Actions are verbs that describe what a function does, while objects are nouns that represent the objects that the functions operate on. Below is an overview of the main actions and objects within fastreg.
The actions are:
- get: Get or guess some information, e.g., the project ID, workdata directory, or rawdata directory from the current working directory, or a register name or year from a file name.
- list: List files in a directory, e.g., SAS or Parquet files.
- convert: Convert one or more register SAS files to Parquet.
- read: Read a Parquet register into R as a DuckDB table.
- use: Set up _targets.R and a Quarto log template.
The objects are:
- chunk_size: Number of rows to read per chunk during conversion.
- path: A character vector of one or more paths.
- project_id: A number indicating the project ID on DST.
- output_dir: The directory to save the Parquet output to.
The settings are:
- fastreg.project_rawdata_dir: The directory where either the SAS or Parquet files are stored. The rawdata/ directory is read-only on the DST server and contains the original SAS files. A project manager with the correct permissions can move (or request to move) Parquet files into this directory.
- fastreg.project_workdata_dir: The workdata/ directory, where Parquet files are stored for projects that have no project manager and whose users don’t have permission to save the converted files into rawdata/. This directory is usually used to store and edit R scripts, documents, and other files, but it can also store data files (e.g., SAS or Parquet files).
These two settings make it smoother to manage the conversion and reading of registers.
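To illustrate how such a setting could be resolved, here is a minimal sketch. It assumes the settings are ordinary R options (set with options()) and that the project ID shows up as a purely numeric folder name in the working directory path. The helpers guess_project_id() and resolve_rawdata_dir() and the example project ID are hypothetical and are not fastreg's actual implementation.

```r
# Hypothetical helper: treat the first all-digit path component as the project ID.
guess_project_id <- function(path = getwd()) {
  parts <- strsplit(gsub("\\\\", "/", path), "/", fixed = TRUE)[[1]]
  ids <- parts[grepl("^[0-9]+$", parts)]
  if (length(ids) == 0) {
    stop("Could not guess a project ID from: ", path)
  }
  ids[1]
}

# Hypothetical helper: prefer the setting, otherwise fall back to a guessed default.
resolve_rawdata_dir <- function(working_dir = getwd()) {
  setting <- getOption("fastreg.project_rawdata_dir")
  if (!is.null(setting)) {
    return(setting)
  }
  # Assumed default layout: E:/rawdata/<project-id>/
  file.path("E:/rawdata", guess_project_id(working_dir))
}

# No setting: the project ID (made up here) is guessed from the path.
options(fastreg.project_rawdata_dir = NULL)
resolve_rawdata_dir("E:/workdata/701234/scripts")
#> [1] "E:/rawdata/701234"

# With the setting, nothing is guessed.
options(fastreg.project_rawdata_dir = "D:/custom/rawdata")
resolve_rawdata_dir("E:/workdata/701234/scripts")
#> [1] "D:/custom/rawdata"
```

Falling back to a guess only when the option is unset keeps the zero-argument case convenient while still letting projects with a non-standard layout override it.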
Tip
For a list of all the public functions, see the Reference page.
Converting multiple registers in parallel
Reading Parquet files
fastreg provides three ways to read Parquet registers depending on the use case.
read_register() is the main read function. We wanted a function that makes it really easy to read in a particular register (with data from all available years if it is in a partitioned Parquet format). For example, reading in bef (the population register) as a DuckDB table should be as simple as read_register("bef"). The function should automatically find the relevant partitioned Parquet dataset and read all of its partitions in as a single DuckDB table.
However, we can’t guarantee that the read_register() function will correctly guess and/or find the register as a Parquet dataset. So we also provide two more flexible functions: read_parquet_dataset() and read_parquet_file().
read_parquet_dataset() underlies read_register(), but it doesn’t guess the path and doesn’t depend on the settings being set. It takes a direct path to the Parquet dataset (the directory containing the Hive-partitioned Parquet files), applies some settings so the dataset is read in more smoothly, and reads it as a DuckDB table. This function can be used if read_register() fails to find or read the right dataset.
read_parquet_file() is the simplest read function. It takes a direct path to a .parquet file (not a partitioned dataset) and reads it as a DuckDB table. This can be used if the register isn’t in a partitioned format.
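The idea of reading a Hive-partitioned dataset as a single DuckDB table can be sketched with the DBI and duckdb packages directly. This is an illustration of the general technique, not fastreg's own code: the column name id, the values, and the temporary folder are made up, and the dataset is written by DuckDB itself only so the example is self-contained.

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())

# Made-up data: write a tiny dataset partitioned by year into a fresh temp folder,
# which yields <dataset_dir>/year=2020/ and <dataset_dir>/year=2021/.
dataset_dir <- gsub("\\\\", "/", tempfile("bef_"))
dbExecute(con, sprintf(
  "COPY (SELECT * FROM (VALUES (1, 2020), (2, 2020), (3, 2021)) AS t(id, year))
   TO '%s' (FORMAT PARQUET, PARTITION_BY (year))",
  dataset_dir
))

# Expose all partitions as one view. hive_partitioning turns the year= folder
# names back into a year column.
dbExecute(con, sprintf(
  "CREATE VIEW bef AS
   SELECT * FROM read_parquet('%s/*/*.parquet', hive_partitioning = true)",
  dataset_dir
))

# Query across all years as if it were a single table.
bef <- dbGetQuery(con, "SELECT id, year FROM bef ORDER BY id")

dbDisconnect(con)
```

Because the view only points at the files, nothing is loaded into R until a query is actually run against it.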
Listing SAS and Parquet files
To help with management as well as discovery of available registers, we also provide helper functions to list the available SAS and Parquet files and partitioned datasets.
list_parquet_files() takes the directories given in the settings and lists all Parquet files within them that follow the part-*.parquet pattern. If no setting is given, the project ID is guessed from the working directory path and the default locations are the project’s rawdata/ and workdata/ directories (e.g., E:/rawdata/<project-id>/ on DST). If the files are stored somewhere other than these defaults, the settings must be set explicitly. That way, users can call list_parquet_files() without any arguments and it will automatically find and list all the Parquet files within the project. We look in both rawdata/ (where the original SAS files are also kept) and workdata/ because some projects have managers who can save files (like Parquet files) to rawdata/, while other projects don’t and need to save files in workdata/.
list_parquet_datasets() builds on top of list_parquet_files(). It takes the output of list_parquet_files(), goes to the Parquet partition root (hard-coded as two levels back from each file, i.e., the folder above the year= folders), and lists all the datasets. We use this function internally in read_register() to check whether the register name provided by the user matches any of the available Parquet datasets. The function is also useful for interactively discovering the Parquet datasets that are available within the project.
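The file pattern and the "two levels back" rule can be sketched in base R. The functions sketch_list_parquet_files() and sketch_list_parquet_datasets() are hypothetical stand-ins written for this example, and the folder tree is made up; fastreg's real functions may differ in their details.

```r
# Build a throwaway folder tree that mimics one partitioned register (made-up names).
root <- file.path(tempfile("rawdata_"), "bef")
for (year in c("year=2020", "year=2021")) {
  dir.create(file.path(root, year), recursive = TRUE)
  file.create(file.path(root, year, "part-0.parquet"))
}

# Hypothetical stand-in: keep only files whose name matches part-*.parquet.
sketch_list_parquet_files <- function(dir) {
  list.files(
    dir,
    pattern = "^part-.*\\.parquet$",
    recursive = TRUE,
    full.names = TRUE
  )
}

# Hypothetical stand-in: go two levels back from each part file, i.e. past the
# file name and past the year= folder, and keep each dataset root once.
sketch_list_parquet_datasets <- function(files) {
  unique(dirname(dirname(files)))
}

files <- sketch_list_parquet_files(dirname(root))
datasets <- sketch_list_parquet_datasets(files)
basename(datasets)
#> [1] "bef"
```

Hard-coding two levels assumes exactly one partition level (year=) between the dataset root and the part files, which matches the layout described above.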
list_sas_files() lists all SAS files found within the rawdata/ directory given in the settings. We only look in rawdata/ because DST stores the original SAS files there. Like list_parquet_files(), if the setting isn’t set, it guesses the project ID and looks in that project’s rawdata/ for any SAS files.
Conversion log
The purpose of the conversion log is to describe the details of the conversion to provide an audit trail. Since we can’t be sure that the SAS files within the same register contain exactly the same columns and data types, the conversion log helps identify any differences between these files. It also includes any warnings produced by the targets pipeline.
Note
Discrepancies (different columns or incompatible data types) between files within the same register do not stop the conversion, but they are included in the log.
convert() returns a metadata tibble with one row per written chunk. The returned tibble can be queried with dplyr directly or rendered into a Quarto log.
The tibble has the following format:
| Column | Description |
|---|---|
| input_path | Path to the source SAS file |
| output_path | Path to the written Parquet part file |
| row_count | Number of rows in the chunk |
| schema | Nested tibble with columns name and type |
The information is derived from the chunk already in memory, not by reading the converted Parquet file, so the schema reflects the types as read by haven rather than as stored in Parquet.
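As a rough sketch of how one metadata row per chunk can be derived from the chunk in memory, the example below uses a base R data frame with a list column in place of a tibble. The helper describe_schema(), the example data, and the file paths are all hypothetical, and the types come from class() here rather than from haven.

```r
# Hypothetical helper: column names and types of one in-memory chunk.
describe_schema <- function(chunk) {
  data.frame(
    name = names(chunk),
    type = vapply(chunk, function(col) class(col)[1], character(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Made-up register data, split into chunks of chunk_size rows.
register <- data.frame(
  pnr = c("a", "b", "c", "d", "e"),
  age = c(31L, 45L, 27L, 60L, 52L),
  stringsAsFactors = FALSE
)
chunk_size <- 2
chunk_id <- ceiling(seq_len(nrow(register)) / chunk_size)
chunks <- unname(split(register, chunk_id))

# One row per chunk, described without re-reading any written file.
metadata <- data.frame(
  input_path = "rawdata/bef2020.sas7bdat",
  output_path = sprintf("bef/year=2020/part-%d.parquet", seq_along(chunks) - 1),
  row_count = vapply(chunks, nrow, integer(1)),
  stringsAsFactors = FALSE
)
metadata$schema <- lapply(chunks, describe_schema)  # nested schema per chunk

metadata$row_count
#> [1] 2 2 1
```

Keeping the schema as a nested table per row means later steps can compare schemas across chunks and files without touching the Parquet output.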
use_template() copies both a targets pipeline, _targets.R, and a conversion log template, conversion-log.qmd, into the current working directory. The conversion log (a PDF) is created as the final step of the targets pipeline.
The conversion log has the following sections:
- A table of contents providing an overview of the converted registers.
- A warnings section, if the targets pipeline produced any warnings.
- One section per converted register, consisting of:
  - A subsection listing each Parquet chunk and its row count.
  - A subsection showing the most common schema and how many converted files share it.
  - A subsection showing schema differences, if any occur.
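The schema subsections can be illustrated with a small base R sketch that finds the most common schema and reports how other files deviate from it. The file names, column names, and types are made up for the example, and this is only one possible way to do the comparison, not necessarily how conversion-log.qmd does it.

```r
# Made-up schemas for three converted files of one register.
schemas <- list(
  "bef2019.sas7bdat" = data.frame(
    name = c("pnr", "age"),
    type = c("character", "numeric"),
    stringsAsFactors = FALSE
  ),
  "bef2020.sas7bdat" = data.frame(
    name = c("pnr", "age"),
    type = c("character", "numeric"),
    stringsAsFactors = FALSE
  ),
  "bef2021.sas7bdat" = data.frame(
    name = c("pnr", "age", "koen"),
    type = c("character", "character", "numeric"),
    stringsAsFactors = FALSE
  )
)

# Collapse each schema into one comparable string, e.g. "pnr:character, age:numeric".
signature <- vapply(
  schemas,
  function(s) paste(paste0(s$name, ":", s$type), collapse = ", "),
  character(1)
)

# Most common schema and the number of files sharing it.
counts <- sort(table(signature), decreasing = TRUE)
most_common <- names(counts)[1]
n_sharing <- counts[[1]]

# Files that deviate, with the columns that are missing, extra, or typed differently.
differing_files <- names(signature)[signature != most_common]
reference <- schemas[[which(signature == most_common)[1]]]
differences <- lapply(schemas[differing_files], function(s) {
  merged <- merge(reference, s, by = "name", all = TRUE,
                  suffixes = c("_common", "_file"))
  changed <- is.na(merged$type_common) | is.na(merged$type_file) |
    merged$type_common != merged$type_file
  merged[changed, ]
})

differing_files
#> [1] "bef2021.sas7bdat"
```

In this made-up case, differences holds two rows for bef2021.sas7bdat: age, whose type changed, and koen, which only exists in that file.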
If you want to customise the log, e.g., its output format or sections, you can edit conversion-log.qmd.