Nandakumar Edamana
Share on:
@ R t f

How to Prepare XLS and CSV Files for Processing


Banks and other fintech websites usually let you download reports as Microsoft Excel files or CSV files. While a novice would use this feature for offline viewing and archiving, an advanced user can feed such files into custom scripts that perform additional analyses, resulting in better decision making.

While highly useful, these files usually require some pre-processing before you can pass them to your scripts. Let's see what and how.

What is CSV and how to get it?

CSV stands for Comma-Separated Values, and as the name suggests, they are spreadsheets stored in the form of plaintext files. Each line represents a row, and each cell is separated by a comma (or a Tab). Being plaintext, they can be viewed using any basic text editor, and processed with simple scripts written using Python, AWK, or even Bash.

Unfortunately, sometimes tables exported from some sites are in Microsoft Excel format, which is both complex and proprietary. We need to convert it to CSV in order to make the processing simpler.

Perhaps the easiest way to convert XLS to CSV is to open the file with a spreadsheet program and save it as CSV file. LibreOffice Calc has this feature. Since we are going for automation, let's run this command to perform the same:

libreoffice --headless --convert-to csv --outdir . "Equity_Summary_Details.xls"

This would've created the file Equity_Summary_Details.csv in the same directory. Now that we've the textual version, let's move on to cleaning up the data.

Removing headers and footers

Tabular data downloaded from banks and similar sources usually have additional rows in the top and the bottom containing information like account details, date, summation, etc. This always gets in the way because they break the structure of the main table. So it's better to cut those out unless you really need them and know how to deal with them.

As an example, consider this sample file:

Account No: 123456
Name: John Doe
Date: 2025-01-01 10:00 AM

Date,Description,Credit,Debit,Balance
XXXX-XX-XX,ATM WDL,0,1000,40000
YYYY-YY-YY,Interest,160,0,40160

Monthly avg.: 38000

Here we need to start with the sixth line and cut out the last two lines. This can be done with the head and tail commands:

$ cat example.csv | tail -n +6 | head -n -2
XXXX-XX-XX,ATM WDL,0,1000,40000
YYYY-YY-YY,Interest,160,0,40160

If you want to save the cleaned-up data as a new file, just use Bash redirection:

$ cat example.csv | tail -n +6 | head -n -2 > example_tableonly.csv

(Please refer to the manuals of head and tail to know more about those commands.)

So finally, we have a proper CSV file that can be processed using standard CSV libraries.


Click here to read more like this. Click here to send a comment or query.