HdfsBridge::recordReaderFillBuffer – Unexpected error encountered filling record reader buffer: IllegalArgumentException: Must be 12 bytes

Parquet is my preferred format for storing files in a data lake. Its columnar storage and compression make it very efficient for in-memory processing tasks like Spark/Databricks notebooks while saving on storage cost. Parquet also supports a wide range of encoding schemes. Perhaps the coolest thing about Parquet is that, unlike CSV, there is no column/row separator, so there is no need to escape those characters when they appear in the data.

Azure SQL Data Warehouse supports the Parquet format for external (PolyBase) tables. External tables reference the underlying storage blobs and give you the option to query the data lake using SQL; in fact, this is the approach recommended in Microsoft’s reference architecture. With some Parquet files, however, this error gets thrown when the external table is queried:

HdfsBridge::recordReaderFillBuffer – Unexpected error encountered filling record reader buffer: IllegalArgumentException: Must be 12 bytes

In the absence of a clear exception message, it took a while to figure this out. The error usually happens on a timestamp column, specifically when the data is in yyyy/MM/dd hh:mm:ss format. For some reason SQL Data Warehouse expects the timestamp data to be in yyyy-MM-dd hh:mm:ss format. Changing the date separator from / to - resolved the issue, although it must be said that the underlying file is a perfectly valid Parquet file.
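
If the files are produced from Spark or Databricks, the fix can also be applied upstream, before the Parquet file is written. The snippet below is only a rough sketch: the column name, the paths, and the assumption that the offending value arrives as a slash-separated string are illustrative, not taken from the actual scenario.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp, date_format

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths and column name.
df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/raw/")

# Option 1: parse the slash-separated string into a proper timestamp type.
fixed = df.withColumn("event_ts", to_timestamp(col("event_ts"), "yyyy/MM/dd HH:mm:ss"))

# Option 2: keep the column as a string but rewrite it with dashes.
# fixed = df.withColumn(
#     "event_ts",
#     date_format(to_timestamp(col("event_ts"), "yyyy/MM/dd HH:mm:ss"), "yyyy-MM-dd HH:mm:ss"),
# )

fixed.write.mode("overwrite").parquet("abfss://container@account.dfs.core.windows.net/curated/")

In my case, simply switching to the dash-separated format was enough to make the error go away.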

12 thoughts on “HdfsBridge::recordReaderFillBuffer – Unexpected error encountered filling record reader buffer: IllegalArgumentException: Must be 12 bytes”

  1. Hi Austin, I am also facing a similar kind of issue, but my data has datetime values in yyyy-MM-dd hh:mm:ss format. I am trying to load data from Parquet into a Synapse table from a Copy activity using the COPY command/PolyBase option.

    With the Bulk insert option of the Copy activity I am able to load the data into the target Synapse table successfully.

    Can you suggest a reason and a way to achieve this with the COPY command and PolyBase options of the Copy activity?

  2. Hey Austin, I had a similar failure when trying to load data from the data lake to the data warehouse using Databricks with PolyBase. My data has a JSON-like structure in one of its columns. When I filter out those records, the load to the DW works, but it fails for the records with the JSON-like structure. One sample is {"filters":[{"logic":"and","field":"Name","operator":"contains","value":"*D-86-M-94"}],"pageSize":200,"page":1,"ignoreQueryBySelectedConfig":false,"relatedItemFilters":[],"entityType":"ZACFDWDocuments"}. Is there any way you can suggest to cast this to a string and load it into the DW?

    1. Anudeep, one option I can think of is to flatten the JSON column in Databricks using explode and then concat the key-value pairs into a string column before loading into the DW (a rough sketch of this idea is included after the comments below).

  3. Hi guys, for me the problem is related to Spark 3; it didn’t happen before.
    The problem occurs when I try df.write.format().option():

    df.write.format("com.databricks.spark.sqldw")\
    .option("url", self.jdbc_string)\
    .option("tempDir", self.tmp_dir)\
    .option("forwardSparkAzureStorageCredentials", "false")\
    .option("useAzureMSI", "true")\
    .option("maxStrLength", "4000")\
    .option("dbTable", output_table)\
    .mode(mode)\
    .save()

    It seems to happen when the mode() call is used.

    1. Interesting you mentioned Spark 3. I recently had an issue that occurred only when the cluster was upgraded to Spark 3. It turns out Spark 3 does not implicitly type cast, whereas the earlier versions did. In my case, the database had an int and a timestamp column, but the corresponding data types in the data frame were string, which caused the issue even though the values were valid ints and timestamps. Explicitly casting the columns to match the target solved the issue (a short cast example is included after these comments). Maybe something similar is happening in your case.

  4. Hi, for me it was the length of the varchar fields being smaller than the data. As I didn’t know the exact size of the varchar fields, I declared them as varchar(100), but for some fields a few records were longer than that, which caused the issue.
    How I located it: I queried a subset of columns in the select clause and kept ruling out the ones that worked fine; the query would fail whenever the problematic column was added to the select clause (a small helper for sizing the columns up front is sketched after these comments).

    Hope it helps.

  5. Hi Benny, I’m facing a similar issue, but the error shows up when external tables pointing to Azure Blob Storage are used as the source for a tabular cube. There is no issue when querying the table itself. Any insights on this are much appreciated.

      1. I cannot go directly to the source to make any changes; the customer is providing the files and cannot change them. I just kept the column type as varchar to make it work, and then I convert to datetime downstream.
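
Regarding the JSON-like column in comment 2, here is a rough sketch of the explode-and-concatenate idea from the reply. The payload, schema, column names, and the pipe delimiter are assumptions made for illustration; adjust them to the actual data.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, explode, concat_ws

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: a string column "payload" holding JSON like the sample in the comment.
df = spark.createDataFrame(
    [('{"filters":[{"logic":"and","field":"Name","operator":"contains","value":"*D-86-M-94"}],"pageSize":200,"page":1}',)],
    ["payload"],
)

schema = "filters ARRAY<STRUCT<logic: STRING, field: STRING, operator: STRING, value: STRING>>, pageSize INT, page INT"

parsed = df.withColumn("parsed", from_json(col("payload"), schema))

# Explode the nested array and concatenate each filter's values into one string column.
flattened = (
    parsed
    .withColumn("filter", explode(col("parsed.filters")))
    .withColumn(
        "filter_str",
        concat_ws("|", col("filter.logic"), col("filter.field"), col("filter.operator"), col("filter.value")),
    )
    .select("filter_str", col("parsed.pageSize").alias("pageSize"), col("parsed.page").alias("page"))
)

flattened.show(truncate=False)

Alternatively, if the column is already a struct in the DataFrame, to_json(col("payload")) turns it back into a plain string that can land in the DW as varchar.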

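On the Spark 3 point in the reply to comment 3: the fix described there amounts to explicitly casting the DataFrame columns to the types of the target table before the write. The column names and types below are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame where both columns arrive as strings even though the
# target table expects int and datetime.
df = spark.createDataFrame([("42", "2021-03-15 08:30:00")], ["order_id", "order_ts"])

typed = (
    df.withColumn("order_id", col("order_id").cast("int"))
      .withColumn("order_ts", col("order_ts").cast("timestamp"))
)

typed.printSchema()  # order_id is now int, order_ts is now timestamp

The typed DataFrame can then be written with the same com.databricks.spark.sqldw options shown in the comment above.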
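
For the varchar sizing problem in comment 4, one way to skip the column-by-column elimination is to measure the longest value in every string column before declaring the table. This is not the approach the commenter used, just a shortcut to the same information, and the sample DataFrame is made up.

from pyspark.sql import SparkSession
from pyspark.sql.functions import length, max as max_
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Stand-in for the DataFrame being loaded into the DW.
df = spark.createDataFrame([("short", "a much longer value than expected")], ["col_a", "col_b"])

string_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]

# Longest value per string column: use these numbers to size each varchar(n).
df.select([max_(length(c)).alias(c) for c in string_cols]).show()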