HdfsBridge::recordReaderFillBuffer – Unexpected error encountered filling record reader buffer: IllegalArgumentException: Must be 12 bytes

Parquet is my preferred format for storing files in the data lake. Parquet’s columnar storage and compression make it very efficient for in-memory processing tasks like Spark/Databricks notebooks, while also saving on storage cost. Parquet also supports almost all of the common encoding schemes. Perhaps the coolest thing about Parquet is that, unlike CSV, there is no such thing as a column/row separator, so there is no need to escape those characters when they appear in the data.

Azure SQL Data Warehouse supports the Parquet format for External (PolyBase) tables. External tables reference the underlying storage blobs and give you the option to query the data lake using SQL; in fact, this is recommended in Microsoft’s reference architecture. With some Parquet files, this error gets thrown when the External Table is queried:

HdfsBridge::recordReaderFillBuffer – Unexpected error encountered filling record reader buffer: IllegalArgumentException: Must be 12 bytes

In the absence of a clear exception message it took a while to figure this out. The error usually happens on a Timestamp column, specifically when the data is in yyyy/MM/dd hh:mm:ss format. For some reason SQL Data Warehouse expects the Timestamp data to be in yyyy-MM-dd hh:mm:ss format. Changing the date separator from / to - resolved the issue, although it must be mentioned that the underlying file is a perfectly valid Parquet file.
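If the slash-separated values are coming out of Spark, one way to apply that fix is to parse the string into a proper timestamp before writing the Parquet file. The sketch below is a minimal PySpark example of that idea; the paths, the event_time column name and the yyyy/MM/dd HH:mm:ss pattern are illustrative assumptions, not from the original post.

    # Hedged sketch: re-parse a slash-separated timestamp string before writing Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical source path in the data lake.
    df = spark.read.parquet("abfss://lake@account.dfs.core.windows.net/raw/events")

    # Parse the yyyy/MM/dd string into a real timestamp, which serializes
    # in the yyyy-MM-dd hh:mm:ss form that SQL Data Warehouse expects.
    fixed = df.withColumn(
        "event_time",
        F.to_timestamp("event_time", "yyyy/MM/dd HH:mm:ss"),
    )

    fixed.write.mode("overwrite").parquet(
        "abfss://lake@account.dfs.core.windows.net/curated/events"
    )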

8 thoughts on “HdfsBridge::recordReaderFillBuffer – Unexpected error encountered filling record reader buffer: IllegalArgumentException: Must be 12 bytes”

  1. Hi guys, for me the problem is related to Spark 3; it didn’t happen before.
    The problem shows up when I try to df.write.format().option():

    df.write.format("com.databricks.spark.sqldw")\
    .option("url", self.jdbc_string)\
    .option("tempDir", self.tmp_dir)\
    .option("forwardSparkAzureStorageCredentials", "false")\
    .option("useAzureMSI", "true")\
    .option("maxStrLength", "4000")\
    .option("dbTable", output_table)\
    .mode(mode)\
    .save()

    It seems to happen when the “mode” option is used.

    1. Interesting that you mentioned Spark 3. I recently had an issue that occurred only when the cluster was upgraded to Spark 3. It turns out Spark 3 does not implicitly type cast, whereas the earlier versions did. In my case, I had an int and a timestamp column in the database, but the corresponding data type in the data frame was string, which caused the issue even though the values were valid ints and timestamps. Explicitly casting the columns to match the target solved the issue (a sketch follows below). Maybe something similar is happening in your case.
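    Here is a minimal, hedged sketch of that explicit cast; the column names (order_id, created_at) and the sample row are made up for illustration.

    # Hedged sketch: cast string columns to the target database types before writing.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType, TimestampType

    spark = SparkSession.builder.getOrCreate()

    # Strings that look like an int and a timestamp; Spark 2 would have
    # cast these implicitly on write, Spark 3 will not.
    df = spark.createDataFrame(
        [("42", "2021-03-01 10:15:00")],
        ["order_id", "created_at"],
    )

    typed_df = (
        df.withColumn("order_id", F.col("order_id").cast(IntegerType()))
          .withColumn("created_at", F.col("created_at").cast(TimestampType()))
    )

    typed_df.printSchema()  # order_id: int, created_at: timestamp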

  2. Hi, for me it was the length of the varchar fields being smaller than the data. As I didn’t know the exact size of the varchar fields, I declared them as varchar(100), but for some fields a few records were longer than that, which caused the issue.
    How I located it: I started querying a subset of columns in the select clause and kept ruling them out; the query worked fine until the problematic column was added to the select clause. (A sketch for sizing the columns up front follows after this comment.)

    Hope it helps.
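    One way to avoid guessing the varchar sizes is to measure the data first. The following PySpark sketch computes the maximum string length per column of a Parquet source; the path is a placeholder and this is only one possible approach, not the method from the comment above.

    # Hedged sketch: find the longest value in each string column to size varchar columns.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical source path.
    df = spark.read.parquet("abfss://lake@account.dfs.core.windows.net/raw/customers")

    string_cols = [f.name for f in df.schema.fields if f.dataType.simpleString() == "string"]

    # One row containing max(length(col)) for every string column; any value
    # above the declared varchar(100) points at the failing column.
    df.select([F.max(F.length(c)).alias(c) for c in string_cols]).show()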

  3. Hi Benny, I’m facing a similar issue, but the error shows up while using external tables pointing to Azure Blob Storage as the source for a tabular cube. There is no issue while querying the table itself. Any insights on this are much appreciated.

      1. I cannot go directly to the source to make any changes. The customer provides the files and cannot change them. I just kept the column type as varchar to make it work, and then I convert to datetime downstream.
