Automatically determine and process the encoding of the text file

To find out the encoding of a text file, it seems the practical approach is to try decoding the data with each candidate encoding in turn and use the first one that decodes successfully.

python


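# Written for Python 3: the file is read and written as raw bytes.
def conv_encoding(data):
    """Decode bytes with the first candidate codec that succeeds.

    Returns (decoded_text, codec_name); raises LookupError if none fits.
    """
    # Order matters: the first codec that decodes without error wins.
    # 'latin_1' accepts every byte sequence, so it acts as a catch-all
    # and 'ascii' after it is never reached.
    lookup = ('utf_8', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213',
              'shift_jis', 'shift_jis_2004', 'shift_jisx0213',
              'iso2022jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_3',
              'iso2022_jp_ext', 'latin_1', 'ascii')
    for encoding in lookup:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise LookupError('no candidate encoding could decode the data')

# File reading and encoding investigation
# (path is the path of the target file, defined elsewhere)
with open(path, 'rb') as fp:
    text, encoding = conv_encoding(fp.read())

# Edit content
...  # <Arbitrary code>


# Write file in original encoding
with open(path, 'wb') as fp:
    fp.write(text.encode(encoding))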
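
As a rough check of the approach, the sketch below runs the same kind of trial-decoding loop on in-memory byte strings instead of a file. It assumes Python 3, and `detect` with its `candidates` parameter is a hypothetical helper introduced only for this illustration. It also shows two things to watch for when relying on candidate order: `latin_1` never fails, and ISO-2022-JP data is 7-bit only, so a `utf_8` candidate placed first accepts it too.

```python
def detect(data, candidates):
    """Hypothetical helper: return (text, codec) for the first codec that decodes data."""
    for codec in candidates:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue
    raise LookupError('no candidate encoding could decode the data')

sample = '\u3053\u3093\u306b\u3061\u306f'  # "konnichiwa" written in hiragana

# UTF-8 bytes of Japanese text are not valid ASCII, so the second candidate wins.
text, codec = detect(sample.encode('utf_8'), ('ascii', 'utf_8'))

# latin_1 maps every byte value to a code point, so it never raises.
all_bytes = bytes(range(256))
_, fallback = detect(all_bytes, ('latin_1',))

# ISO-2022-JP output contains only 7-bit bytes, so utf_8 accepts it as well
# (the result then contains raw escape sequences, not the original text).
jis = sample.encode('iso2022_jp')
wrong_text, wrong_codec = detect(jis, ('utf_8', 'iso2022_jp'))
right_text, right_codec = detect(jis, ('iso2022_jp', 'utf_8'))
```

If ISO-2022-JP files are expected, moving those codecs ahead of `utf_8` may be worth considering, with the trade-off that plain ASCII files would then be reported as ISO-2022-JP.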
