I am using AWS Lambda to import data from .txt files.
It works for most files, but for some of them it fails with:
'utf-8' codec can't decode byte 0x80 in position 22006: invalid start byte
After checking the character encoding of those files, I found they are "ANSI" (on Windows this usually means a legacy code page such as Windows-1252), but my Lambda Python code assumes they are encoded as UTF-8:
txtContent = obj['Body'].read().decode('utf-8')
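The byte 0x80 is a useful hint: it can never start a UTF-8 sequence, but in Windows-1252 it is the euro sign. Below is a minimal sketch of a fallback, assuming the "ANSI" files really are Windows-1252 (cp1252). The helper name decode_with_fallback and the order of encodings are made up for illustration:

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in order and return the first successful decode."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    # Last resort: decode with the first encoding and mark every
    # undecodable byte with U+FFFD instead of raising.
    return raw.decode(encodings[0], errors="replace")
```

In the handler this could stand in for the hard-coded decode, for example txtContent = decode_with_fallback(obj['Body'].read()). Trying UTF-8 first matters, because cp1252 accepts almost any byte sequence and would silently mangle real UTF-8 text.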
How to detect the file encoding
https://stackoverflow.com/questions/436220/determine-the-encoding-of-text-in-python
Please be aware that one line there says: Correctly detecting the encoding all times is impossible!
However, we can still use the Python module "chardet" to make a best guess at the character encoding.
Its documentation is here: https://chardet.readthedocs.io/en/latest/usage.html
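Here is a minimal sketch of how chardet could be used for this. chardet.detect() takes bytes and returns a dict with 'encoding' and 'confidence' keys. The function name decode_detected, the 0.5 threshold, the UTF-8 default and the injectable detect parameter are assumptions for illustration, not part of chardet:

```python
def decode_detected(raw, default="utf-8", min_confidence=0.5, detect=None):
    """Decode bytes using a detected encoding, falling back to a default."""
    if detect is None:
        # Imported lazily: chardet is third-party and has to be packaged
        # together with the Lambda function code.
        import chardet
        detect = chardet.detect

    # Result looks like {'encoding': ..., 'confidence': ..., ...}
    guess = detect(raw)
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0

    # The detector may return no encoding or only a low-confidence guess.
    if not encoding or confidence < min_confidence:
        encoding = default

    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Wrong guess or unknown codec name: decode with the default and
        # replace undecodable bytes rather than failing the whole import.
        return raw.decode(default, errors="replace")
```

With chardet installed, the call would be txtContent = decode_detected(obj['Body'].read()). The detect parameter is only there so the function can be tried out with a stand-in detector when chardet is not available.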
To use it in AWS Lambda, we have to install the module locally and upload it together with the function code.
Here is how to make it work:
1) Install chardet
mkdir for-chardet
pip install chardet -t ./for-chardet/
2) Zip the files
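# run zip from inside for-chardet so the chardet package ends up at the root of the archive
(cd for-chardet && zip -r ../lambda-4-chardet.zip ./chardet*)
zip lambda-4-chardet.zip lambda_function.py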
3) Upload the zip file to the AWS Lambda function.
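Lambda imports modules from the root of the deployment package, so both lambda_function.py and the chardet/ directory need to sit at the top level of the zip. Here is a small sketch to check this before uploading, using only the standard library. The function name missing_from_archive is made up for illustration:

```python
import zipfile


def missing_from_archive(zip_path,
                         required=("lambda_function.py", "chardet/__init__.py")):
    """Return the required entries that are not at the root of the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        names = set(archive.namelist())
    return [entry for entry in required if entry not in names]
```

Calling missing_from_archive("lambda-4-chardet.zip") should return an empty list. If it reports chardet/__init__.py as missing, the package was probably zipped under a parent directory such as for-chardet/.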
For reference, here is the list of standard encodings supported by Python:
https://docs.python.org/3/library/codecs.html#standard-encodings
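Encoding names reported by a detector do not always match Python's own spelling. Here is a small sketch, using the standard codecs module, that checks whether a name is a codec Python knows and returns its canonical name. The helper name canonical_codec is hypothetical:

```python
import codecs


def canonical_codec(name):
    """Return Python's canonical codec name, or None if it is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None
```

For example, canonical_codec("Windows-1252") gives "cp1252", and an unknown name gives None.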