# 🌪 `weather-sp` – Weather Splitter Splits NetCDF and Grib files into several files by variable (_alpha_). ## Features * **Format-Aware Processing**: Care is taken to ensure that each input file format is handled appropriately. For example, for Grib files, overlapping variables with different levels are kept separate. In addition, buckets with mixtures of NetCDF and Grib files can be processed on the same run. ## Usage ``` usage: weather-sp [-h] -i INPUT_PATTERN -o OUTPUT_DIR [-d] Split weather data file into files by variable. ``` _Common options_: * `-i, --input-pattern`: Pattern for input weather data. * `--output-template`: Path to template using Python formatting, see [Output section](#output) below. Mutually exclusive with `--output-dir`. * `--output-dir`: Path to base folder for output files, see [Output section](#output) below. Mutually exclusive with `--output-template` * `-d, --dry-run`: Test the input file matching and the output file scheme without splitting. Invoke with `-h` or `--help` to see the full range of options. _Usage examples_: ```bash weather-sp --input-pattern 'gs://test-tmp/era5/2017/**' \ --output-dir 'gs://test-tmp/era5/splits' ``` Preview splits with a dry run: ```bash weather-sp --input-pattern 'gs://test-tmp/era5/2017/**' \ --output-dir 'gs://test-tmp/era5/splits' \ --dry-run ``` Using DataflowRunner ```bash weather-sp --input-pattern 'gs://test-tmp/era5/2015/**' \ --output-dir 'gs://test-tmp/era5/splits' --runner DataflowRunner \ --project $PROJECT \ --temp_location gs://$BUCKET/tmp \ --job_name $JOB_NAME ``` For a full list of how to configure the Dataflow pipeline, please review [this table](https://cloud.google.com/dataflow/docs/reference/pipeline-options). ## Specifying input files The input file pattern matching is done by the Apache Beam filesystems module, see [this documentation](https://beam.apache.org/releases/pydoc/2.12.0/apache_beam.io.filesystems.html#apache_beam.io.filesystems.FileSystems.match) under 'pattern syntax' Notably, `**` Is equivalent to `.*`, so to match all files in a directory, use ```bash --input-pattern 'gs://test-tmp/era5/2017/**' ``` On the other hand, to specify a specific pattern use ```bash --input-pattern 'gs://test-tmp/era5/2017/*/*.nc' ``` ## Output The base output file names are specified using the `--output-template` or `--output-dir` flags. These flags are mutually exclusive, and one of them is required. ### Output directory Based on the output directory path, the directory structure of the input pattern is replicated. To create the directory structure, the common path of the input pattern is removed from the input file path and replaced with the output path. A '_' is added to separate the split level / variable. Example: ```bash --input-pattern 'gs://test-input/era5/2020/**' \ --output-dir 'gs://test-output/splits' ``` For a file `gs://test-input/era5/2020/02/01.nc` the output file pattern is `gs://test-output/splits/2020/02/01_` and if the temperature is a variable in that data, the output file for that split will be `gs://test-output/splits/2020/02/01_t.nc` ### Output template with Python-style formatting Using Python-style substitution (e.g. `{1}`) allows for more flexibility when creating the output files. The substitutions are based on the directory structure of the input file, where each `{}` stands for one directory name, counting backwards from the end, i.e. the file name is `{0}`, the immediate directory in which it is located is `{1}`, and so on. Example: ```bash --input-pattern 'gs://test-input/era5/2020/**' \ --output-template 'gs://test-output/splits/{2}.{0}.{1}T00.' ``` For a file `gs://test-input/era5/2020/02/01.nc` the output file pattern is `gs://test-output/splits/2020.01.02T00.` and if the temperature is a variable in that data, the output file for that split will be `gs://test-output/splits/2020/01/02T00.t.nc` Note that in this case no '_' is added, the template is used as is. ## Dry run To verify the input file matching and the output naming scheme, `weather-sp` can be run with the `--dry-run` option. This does not read the files, so it will not check whether the files are readable and in the correct format. It will only list the input files with the corresponding output file schemes.