🌪 weather-sp – Weather Splitter

Splits NetCDF and Grib files into several files by variable (alpha).

Features

  • Format-Aware Processing: Care is taken to ensure that each input file format is handled appropriately. For example, for Grib files, overlapping variables with different levels are kept separate. In addition, buckets with mixtures of NetCDF and Grib files can be processed on the same run.

Usage

usage: weather-sp [-h] -i INPUT_PATTERN -o OUTPUT_DIR [-d]

Split weather data file into files by variable.

Common options:

  • -i, --input-pattern: Pattern for input weather data.

  • --output-template: Path to template using Python formatting, see Output section below. Mutually exclusive with --output-dir.

  • --output-dir: Path to base folder for output files, see Output section below. Mutually exclusive with --output-template

  • -d, --dry-run: Test the input file matching and the output file scheme without splitting.

Invoke with -h or --help to see the full range of options.

Usage examples:

weather-sp --input-pattern 'gs://test-tmp/era5/2017/**' \
           --output-dir 'gs://test-tmp/era5/splits'

Preview splits with a dry run:

weather-sp --input-pattern 'gs://test-tmp/era5/2017/**' \
           --output-dir 'gs://test-tmp/era5/splits' \
           --dry-run

Using DataflowRunner

weather-sp --input-pattern 'gs://test-tmp/era5/2015/**' \
           --output-dir 'gs://test-tmp/era5/splits'
           --runner DataflowRunner \
           --project $PROJECT \
           --temp_location gs://$BUCKET/tmp  \
           --job_name $JOB_NAME

For a full list of how to configure the Dataflow pipeline, please review this table.

Specifying input files

The input file pattern matching is done by the Apache Beam filesystems module, see this documentation under ‘pattern syntax’

Notably, ** Is equivalent to .*, so to match all files in a directory, use

--input-pattern 'gs://test-tmp/era5/2017/**'

On the other hand, to specify a specific pattern use

--input-pattern 'gs://test-tmp/era5/2017/*/*.nc'

Output

The base output file names are specified using the --output-template or --output-dir flags. These flags are mutually exclusive, and one of them is required.

Output directory

Based on the output directory path, the directory structure of the input pattern is replicated.

To create the directory structure, the common path of the input pattern is removed from the input file path and replaced with the output path. A ‘_’ is added to separate the split level / variable.

Example:

--input-pattern 'gs://test-input/era5/2020/**' \
--output-dir 'gs://test-output/splits'

For a file gs://test-input/era5/2020/02/01.nc the output file pattern is gs://test-output/splits/2020/02/01_ and if the temperature is a variable in that data, the output file for that split will be gs://test-output/splits/2020/02/01_t.nc

Output template with Python-style formatting

Using Python-style substitution (e.g. {1}) allows for more flexibility when creating the output files. The substitutions are based on the directory structure of the input file, where each {<x>} stands for one directory name, counting backwards from the end, i.e. the file name is {0}, the immediate directory in which it is located is {1}, and so on.

Example:

--input-pattern 'gs://test-input/era5/2020/**' \
--output-template 'gs://test-output/splits/{2}.{0}.{1}T00.'

For a file gs://test-input/era5/2020/02/01.nc the output file pattern is gs://test-output/splits/2020.01.02T00. and if the temperature is a variable in that data, the output file for that split will be gs://test-output/splits/2020/01/02T00.t.nc

Note that in this case no ‘_’ is added, the template is used as is.

Dry run

To verify the input file matching and the output naming scheme, weather-sp can be run with the --dry-run option. This does not read the files, so it will not check whether the files are readable and in the correct format. It will only list the input files with the corresponding output file schemes.