This product contains a subset of the Parquet files published in https://github.com/apache/parquet-testing. It includes both correct (data) and bad (bad_data) Parquet files. The data/geospatial subdirectory contains test files for the new GEOMETRY logical type.
This directory contains binary artifacts encoded using the Parquet Variant binary encoding. These files are not valid Parquet files, but rather raw binary data.
data_dictionary.json - contains the JSON representation for each example

Each example consists of 2 files:
- .metadata -- the binary contents of the metadata field
- .value -- the binary contents of the value field

- primitive_<type> -- Examples of primitives (basic_type = 1), one for each of the primitive types listed in the spec
- short_string -- Example of short string (basic_type = 2)
- object_empty -- Example of object (basic_type = 3) with no fields
- object_primitive -- Example of object with only primitive fields
- object_nested -- Example of object with other objects in fields
- array_empty -- Example of array (basic_type = 4) with no elements
- array_primitive -- Example of array with only primitive elements
- array_nested -- Example of an array with objects and other arrays in the elements

The files in this directory were initially generated by running the regen.py
script, which used Apache Spark to generate the files. The files have been subsequently modified
when necessary to ensure that they conform to the Parquet spec.
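
The two-file layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the repository: the names VariantExample, load_example, and list_examples are hypothetical, and it assumes each example is stored as `<example>.metadata` and `<example>.value` side by side in one directory.

```python
from pathlib import Path
from typing import List, NamedTuple


class VariantExample(NamedTuple):
    """Hypothetical container for one example's raw bytes."""
    name: str
    metadata: bytes  # raw contents of the .metadata file
    value: bytes     # raw contents of the .value file


def load_example(directory, name: str) -> VariantExample:
    """Read the .metadata/.value pair for one example as raw bytes."""
    directory = Path(directory)
    metadata = (directory / f"{name}.metadata").read_bytes()
    value = (directory / f"{name}.value").read_bytes()
    return VariantExample(name, metadata, value)


def list_examples(directory) -> List[str]:
    """Return the sorted names that have both a .metadata and a .value file."""
    directory = Path(directory)
    metadata_names = {p.stem for p in directory.glob("*.metadata")}
    value_names = {p.stem for p in directory.glob("*.value")}
    return sorted(metadata_names & value_names)
```

Because the files are raw binary data rather than Parquet files, they are read with read_bytes and not with a Parquet reader.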

primitive_null as a single byte (0x01)

Per https://github.com/apache/parquet-testing/issues/81, Spark did not generate
any metadata for null and left primitive_null.metadata empty.
The metadata for primitive_null should be the same 3 bytes as other primitive types:
- header: 0x01
- dictionary_size: 0x00
- dictionary_size + 1 = 1 byte values: 0x00

The value for primitive_null should be a value_header and no value_data,
resulting in a single 0 byte.
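
The 3-byte metadata described above can be checked with a small decoder. This is a sketch under stated assumptions, not the project's reference implementation: the function and class names are hypothetical, and the layout (version in the low 4 bits of the header, a sorted_strings flag in bit 4, offset_size_minus_one in the top 2 bits, then little-endian dictionary_size and dictionary_size + 1 offsets) follows my reading of the Variant encoding spec.

```python
from typing import List, NamedTuple


class VariantMetadata(NamedTuple):
    """Hypothetical decoded view of a Variant metadata buffer."""
    version: int
    sorted_strings: bool
    offset_size: int
    dictionary: List[str]


def parse_metadata(buf: bytes) -> VariantMetadata:
    """Decode a metadata buffer (layout assumed from the Variant spec)."""
    header = buf[0]
    version = header & 0x0F                     # low 4 bits
    sorted_strings = bool((header >> 4) & 0x1)  # bit 4
    offset_size = ((header >> 6) & 0x3) + 1     # top 2 bits, stored minus one

    pos = 1
    dictionary_size = int.from_bytes(buf[pos:pos + offset_size], "little")
    pos += offset_size

    # dictionary_size + 1 offsets delimit the dictionary strings.
    offsets = []
    for _ in range(dictionary_size + 1):
        offsets.append(int.from_bytes(buf[pos:pos + offset_size], "little"))
        pos += offset_size

    # String bytes start right after the offsets.
    strings = [
        buf[pos + start:pos + end].decode("utf-8")
        for start, end in zip(offsets, offsets[1:])
    ]
    return VariantMetadata(version, sorted_strings, offset_size, strings)
```

For the bytes 0x01 0x00 0x00, this yields version 1, an offset size of 1 byte, and an empty dictionary, which matches the three fields listed above.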

TimeNTZ/Timestamp with timezone nanos/Timestamp without timezone nanos/UUID with Iceberg test code

Currently, Spark does not support Variant values containing UUID, Time, or nanosecond-precision Timestamp. The primitive_time.[metadata/value], primitive_timestamp_nanos.[metadata/value], primitive_timestampntz_nanos.[metadata/value], and primitive_uuid.[metadata/value] files were generated by Iceberg test code.