Microsoft

Scenario  

An Azure Synapse Analytics customer received the following error while querying an external table.

HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Too many columns in the line

 

Background

The customer has employee data in CSV format with the following columns.

 

Column1

NAME(1)

STREET(2)

CITY, ST ZIP(3)

 

The following code executes without any errors.

CREATE DATABASE SCOPED CREDENTIAL sccred
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = {access key};

CREATE EXTERNAL DATA SOURCE dsblob
WITH
(
    TYPE = Hadoop,
    CREDENTIAL = sccred,
    LOCATION = 'wasbs://poccon@pocsa.blob.core.windows.net'
);

CREATE EXTERNAL FILE FORMAT FFTextFile
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

CREATE EXTERNAL TABLE stg.EmpData
(
    [Column1] NVARCHAR(MAX),
    [NAME(1)] NVARCHAR(MAX),
    [STREET(2)] NVARCHAR(MAX),
    [CITY, ST ZIP(3)] NVARCHAR(MAX)
)
WITH (
    LOCATION = '/datafile.csv',
    DATA_SOURCE = dsblob,
    FILE_FORMAT = FFTextFile,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);

However, SELECT * FROM stg.EmpData returns the error above.

 

Here are some points to verify:

  1. Verify that you are not pointing to a wrong container that happens to contain files with the same name.
  2. Verify that the number of columns defined in the external table matches the number of columns in the file.
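For the second check, the column list that PolyBase expects can be read from the catalog views. A minimal sketch, using the table and schema names from the example above:

```sql
-- List the columns defined on the external table stg.EmpData
SELECT c.column_id, c.name, ty.name AS data_type
FROM sys.columns AS c
JOIN sys.external_tables AS t ON c.object_id = t.object_id
JOIN sys.types AS ty ON c.user_type_id = ty.user_type_id
WHERE t.name = 'EmpData'
  AND SCHEMA_NAME(t.schema_id) = 'stg'
ORDER BY c.column_id;
```

Compare this count against the number of fields produced by splitting a line of the file on the field terminator.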

If you take a closer look at the file's columns, you will notice that one column name contains a comma: CITY, ST ZIP(3). This is what causes the issue. Because we specified FIELD_TERMINATOR = ',', the header row is split at that comma, so this one column is parsed as two separate columns and the line appears to have more columns than the external table defines. Hence the error: "Too many columns in the line."
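One way to work around this, assuming the header row is the only line with the stray comma (or that fields containing commas are enclosed in double quotes), is to skip the header row and declare a string delimiter in the external file format. FIRST_ROW and STRING_DELIMITER are standard FORMAT_OPTIONS for DelimitedText in Synapse dedicated SQL pools; the format name below is illustrative:

```sql
-- Illustrative file format: skip the header row and honor double quotes
CREATE EXTERNAL FILE FORMAT FFTextFileFixed
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"', -- quoted fields keep embedded commas intact
        FIRST_ROW = 2           -- skip the header line containing the comma
    )
);
```

Alternatively, the header row in the source file can be edited so that no column name contains the field terminator.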

 

The column definitions, including data types and the number of columns, must match the data in the files. If there is a mismatch, the data rows are rejected when the actual data is queried.
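Note that with REJECT_TYPE = VALUE and REJECT_VALUE = 0, as in the example above, a single mismatched row aborts the whole query. If occasional malformed rows are acceptable, the rejection threshold can be relaxed instead. A sketch, with an illustrative table name:

```sql
-- Illustrative: tolerate up to 10 rejected rows instead of failing on the first
CREATE EXTERNAL TABLE stg.EmpDataTolerant
(
    [Column1] NVARCHAR(MAX),
    [NAME(1)] NVARCHAR(MAX),
    [STREET(2)] NVARCHAR(MAX),
    [CITY, ST ZIP(3)] NVARCHAR(MAX)
)
WITH (
    LOCATION = '/datafile.csv',
    DATA_SOURCE = dsblob,
    FILE_FORMAT = FFTextFile,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 10  -- fail only after more than 10 rows are rejected
);
```

This does not fix the underlying header problem, but it keeps a query from failing outright while the file format is being corrected.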