[AWS] Using Chardet in AWS Lambda - Linc Hu

I am using AWS Lambda to import data from text files.

It works for most files, but fails for some of them with this error:

'utf-8' codec can't decode byte 0x80 in position 22006: invalid start byte

After checking the character encoding of the failing files, I found they are "ANSI" (a Windows code page rather than UTF-8). My Lambda Python code, however, assumes every file is encoded as UTF-8:

  txtContent = obj['Body'].read().decode('utf-8')
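To see why this fails, here is a minimal sketch that reproduces the error with a made-up byte string. It assumes the "ANSI" files are Windows-1252 (cp1252), which is a common case but not something I verified for every file; the sample bytes are hypothetical.

```python
# Hypothetical sample: 0x80 is the euro sign in Windows-1252,
# but it is not a valid first byte of a UTF-8 sequence.
raw = b"Total: \x80 100"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    reason = err.reason  # the same "invalid start byte" as in the Lambda error

# Decoding with the assumed Windows code page succeeds.
as_cp1252 = raw.decode("cp1252")
print(utf8_ok, reason, as_cp1252)
```

The same bytes are either an error or readable text depending only on which codec is used, which is why the encoding has to be known or guessed before decoding.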

How to detect the file encoding

Please be aware of one caveat: correctly detecting the encoding every time is impossible!

However, we can still use the Python module "chardet" to make a best guess at the character encoding.
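A minimal sketch of how the guess could be used in the handler follows. The function name decode_body, the "try UTF-8 first" order, and the cp1252 fallback are my own assumptions, not part of chardet. chardet.detect() takes bytes and returns a dict with "encoding" and "confidence" keys, and "encoding" can be None when it has no guess.

```python
def decode_body(raw, detect=None, fallback="cp1252"):
    """Decode bytes: try UTF-8 first, then the encoding chardet guesses."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass  # not UTF-8, fall through to detection

    if detect is None:
        # Imported lazily; chardet must be bundled in the deployment package.
        import chardet
        detect = chardet.detect

    guess = detect(raw)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73}
    encoding = guess.get("encoding") or fallback  # assumed fallback, see above

    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # The guess was wrong or unknown to Python: decode lossily instead of failing.
        return raw.decode(fallback, errors="replace")
```

In the Lambda code this would replace the hard-coded decode, for example txtContent = decode_body(obj['Body'].read()). The detect parameter exists only so the function can be tried with a stand-in detector when chardet is not installed.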

To use it in AWS Lambda, we have to install the module locally and upload it to Lambda together with the function code.

Here is how to make it work:

1) Install chardet
mkdir for-chardet
pip install chardet -t ./for-chardet/

2) Zip the files (the chardet folder must sit at the root of the zip)
cd for-chardet
zip -r ../lambda-4-chardet.zip ./chardet*
cd ..
zip lambda-4-chardet.zip lambda_function.py

3) Upload the zip file to the AWS Lambda function.
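If the zip command is not available, step 2 can also be sketched in Python with the standard zipfile module. The function name build_lambda_zip is hypothetical. The point it illustrates is that the installed package is stored relative to the install folder, so that "chardet" ends up at the top level of the archive next to lambda_function.py, where Lambda can import it.

```python
import os
import zipfile


def build_lambda_zip(package_dir, handler_path, zip_path):
    """Zip the contents of package_dir plus the handler file at the archive root."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to package_dir, not including it.
                zf.write(full, os.path.relpath(full, package_dir))
        # The handler goes at the root of the archive as well.
        zf.write(handler_path, os.path.basename(handler_path))
    return zip_path
```

For the layout used above, the call would be build_lambda_zip("for-chardet", "lambda_function.py", "lambda-4-chardet.zip").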
