Search

Recent Posts

Tags


« | Main | »

More Awk with CSV/DSV files

By Dale Reagan | January 13, 2013

CSV/DSV Basics

Why Use Awk?

This post will cover basics for using CSV/DSV data with Awk (under Linux). If your need is immediate then one solution might be a toolbox tool like CSVFix – from Google Code – it has these features:

Awk – Basics for working with DSV/CSV Data

While there are differences between Awk implementations across various operating systems, for the most part, the approaches discussed here should be portable (work for any implementation of awk.)  Consult your OS  man pages for local quirks (i.e. man awk.)  Some caveats for this post:

awk -F “,” ‘{print}’  some_file.csv   ## this will simply print whatever is in your input file

awk -F “,” ‘{print $2}’  some_file.csv   ## only print the 2nd field

awk -F “,” ‘{printf “%s | %s\n”, NR, $2}’  some_file.csv   ## only print the line number and the 2nd field

awk -F “,” ‘{printf “%s | %s\n”, NR, $0}’  some_file.csv   ## only print the line number and the entire record (all fields)

Some Dummy CSV Data

The data below contains a Header line, 6 ‘good’ data entries and one ‘bad’ entry.  The last entry is ‘bad’ because the data contains too many commas.  Finding these types of ‘problem lines’ in thousands of lines of data can be tricky –  using just a little Awk code we can identify the ‘bad lines’ and either exclude them from output or edit the data and re-process.

Title,First Name,Middle Name,Last Name
ContactOne,Entry1,Middle1,Last_1
Contacttwo,Entry2,Middle2,Last_2
ContactThree,Entry3,Middle3,Last_3
ContactFour,Entry4,Middle4,Last_4
ContactFive,Entry5,Middle5,Last_5
ContactSix,Entry6,Middle6,Last_6
Dr.,SomeDoc, B. Bubba SomeDoc, D.M.D, FAGD,123 Some Street


The remainder of this post will explore functions from a Bash script that execute small Awk programs.


### this function prints:
# the Default Field count (based on the header line)
# the field count per record
# a line for EACH field
# an Error message if data contains too many/few fields
function simple_csv_1 {
stars Begin_${NUM_LINES}_lines
head -${NUM_LINES} ${IN_FILE} | \
awk -F "," '{
        if(NR == 1)
                DEF_NF=NF
        printf "NF/Def_NF %d/%d\n", NF,DEF_NF

        for (FLD=1 ; FLD <= DEF_NF ; FLD++)
                printf "\tRec: %3d | Fld: %3d | %s\n", NR, FLD, $FLD

        if(NF != DEF_NF) printf "Too many fields for record: %d\n\t%s\n", NR, $0
        }'
}

Here is the output (using the sample data above):
        ### Begin_10_lines ###
NF/Def_NF 4/4
        Rec:   1 | Fld:   1 | Title
        Rec:   1 | Fld:   2 | First Name
        Rec:   1 | Fld:   3 | Middle Name
        Rec:   1 | Fld:   4 | Last Name
NF/Def_NF 4/4
        Rec:   2 | Fld:   1 | ContactOne
        Rec:   2 | Fld:   2 | Entry1
        Rec:   2 | Fld:   3 | Middle1
        Rec:   2 | Fld:   4 | Last_1 
NF/Def_NF 4/4
        Rec:   3 | Fld:   1 | Contacttwo
        Rec:   3 | Fld:   2 | Entry2
        Rec:   3 | Fld:   3 | Middle2
        Rec:   3 | Fld:   4 | Last_2 
NF/Def_NF 4/4
        Rec:   4 | Fld:   1 | ContactThree
        Rec:   4 | Fld:   2 | Entry3
        Rec:   4 | Fld:   3 | Middle3
        Rec:   4 | Fld:   4 | Last_3 
NF/Def_NF 4/4
        Rec:   5 | Fld:   1 | ContactFour
        Rec:   5 | Fld:   2 | Entry4
        Rec:   5 | Fld:   3 | Middle4
        Rec:   5 | Fld:   4 | Last_4 
NF/Def_NF 4/4
        Rec:   6 | Fld:   1 | ContactFive
        Rec:   6 | Fld:   2 | Entry5
        Rec:   6 | Fld:   3 | Middle5
        Rec:   6 | Fld:   4 | Last_5 
NF/Def_NF 4/4
        Rec:   7 | Fld:   1 | ContactSix
        Rec:   7 | Fld:   2 | Entry6
        Rec:   7 | Fld:   3 | Middle6
        Rec:   7 | Fld:   4 | Last_6 
NF/Def_NF 6/4
        Rec:   8 | Fld:   1 | Dr.
        Rec:   8 | Fld:   2 | SomeDoc
        Rec:   8 | Fld:   3 |  B. Bubba SomeDoc
        Rec:   8 | Fld:   4 |  D.M.D
Too many fields for record: 8
        Dr.,SomeDoc, B. Bubba SomeDoc, D.M.D, FAGD,123 Some Street

The next function only prints records with ‘errors’.

### this function looks for 'errors' in the CSV data too many/few fields based on FS
function simple_csv_2 {
stars Begin_${NUM_LINES}_lines_Print_Err_Only
head -${NUM_LINES} ${IN_FILE} | \
awk -F "${D_FS}" -v D_FS="${D_FS}" '{
        if(NR == 1)
                DEF_NF=NF

        if(NF != DEF_NF) {
                ERR_CNT+=1
                printf "Should have: %d but found: %d Fields using delimiter: (%s).\n", DEF_NF, NF, FS
                printf "Too many fields for record: %d\n\t%s\n", NR, $0
                for (FLD=1 ; FLD <= DEF_NF ; FLD++)
                        printf "Rec: %3d | Fld: %3d | %s\n", NR, FLD, $FLD }
        }
        END {
                printf "\n\tError count: %d\n", ERR_CNT
        }'
}

The output:

        ### Begin_10_lines_Print_Err_Only ###
Should have: 4 but found: 6 Fields using delimiter: (,).
Too many fields for record: 8
        Dr.,SomeDoc, B. Bubba SomeDoc, D.M.D, FAGD,123 Some Street
Rec:   8 | Fld:   1 | Dr.
Rec:   8 | Fld:   2 | SomeDoc
Rec:   8 | Fld:   3 |  B. Bubba SomeDoc
Rec:   8 | Fld:   4 |  D.M.D

        Error count: 1

Next up, a function to show some ‘stats’ for the CSV data.

### this function uses an array and
### shows 'stats' for the CSV data, counts by fields and CSV Error count
function simple_array_1 {
stars Begin_${NUM_LINES}_lines_Simple_Array
head -${NUM_LINES} ${IN_FILE} | \
awk -F "${D_FS}" -v D_FS="${D_FS}" \
    '{
        if(NR == 1) {
                DEF_NF=NF
                for (FLD=1 ; FLD <= DEF_NF ; FLD++)
                        csv_stats[FLD, 0] = $FLD
        }
        for (FLD=1 ; FLD <= DEF_NF ; FLD++) {
                if($FLD != "")
                        csv_stats[FLD, 1] += 1
        }
        if(NF != DEF_NF) {
                NUM_ERR++
        }
    }
        END {
        printf "\n\n"
        printf " Def # Flds: %5d (based on header from 1st line of CSV file)\n", DEF_NF
        printf " Fld Errors: %5d (too many/few Fields)\n", NUM_ERR
        printf " Record Cnt: %5d\n", NR
        printf "\n"
        printf "  Fld | Description          | Non-null\n"
        printf "------|----------------------|-------------\n"
        for (FLD=1 ; FLD <= DEF_NF ; FLD++)
                        printf "%5d | %-20s | %5d\n", FLD, csv_stats[FLD, 0], csv_stats[FLD, 1]
        printf "\n"
        }'
}

The Output:

        ### Begin_10_lines_Simple_Array ###

 Def # Flds:     4 (based on header from 1st line of CSV file)
 Fld Errors:     1 (too many/few Fields)
 Record Cnt:     8

  Fld | Description          | Non-null
------|----------------------|-------------
    1 | Title                |     8
    2 | First Name           |     8
    3 | Middle Name          |     8
    4 | Last Name            |     8

The final function simply adds a new array element and store the current line of data; then the script ends this element contains the last line of data.

### this function shows 'stats' for the CSV data,
### counts by fields and CSV Error count AND
### includes the last data set
function simple_array_2 {
stars Begin_${NUM_LINES}_lines_Array_2
cat ${IN_FILE} | \
awk -F "${D_FS}" -v D_OFS="${D_FS}" \
    '{
        if(NR == 1) {
                DEF_NF=NF
                for (FLD=1 ; FLD <= DEF_NF ; FLD++)
                        csv_stats[FLD, 0] = $FLD
        }
        for (FLD=1 ; FLD <= DEF_NF ; FLD++) {
                if($FLD != "")
                        csv_stats[FLD, 1] += 1
                        csv_stats[FLD, 2] = $FLD
        }
        if(NF != DEF_NF) {
                NUM_ERR++
        }
    }
        END {
        printf "\n\n"
        printf " Def # Flds: %5d (based on header from 1st line of CSV file)\n", DEF_NF
        printf " Fld Errors: %5d (too many/few Fields)\n", NUM_ERR
        printf " Record Cnt: %5d\n", NR
        printf "\n"
        printf "  Fld | Description          | Non-null | Last Data\n"
        printf "------|----------------------|----------|-----------------\n"
        for (FLD=1 ; FLD <= DEF_NF ; FLD++)
                        printf "%5d | %-20s | %8d | %s\n", FLD, csv_stats[FLD, 0], csv_stats[FLD, 1], csv_stats[FLD, 2]
        printf "\n"
        }'
}

Bash Script: simple.csv.awk

I combined all of the above into a Bash script called simple.csv.awk – it includes a small function to ‘print stars’ and uses both Cat and Head commands to ‘send’ (pipe) data to Awk.

Some command line Awk to try (convert delimiters on the fly…)

cat small.csv |  awk 'BEGIN { FS=","; OFS="\t"} {print $1,$2,$3}'
cat small.csv |  awk 'BEGIN { FS=","; OFS=":"} {print $1,$2,$3}'
cat small.csv |  awk 'BEGIN { FS=","; OFS="|"} {print $1,$2,$3}'
cat small.csv |  awk 'BEGIN { FS=","; OFS="|||"} {print $1,$2,$3}'

Topics: Computer Technology, Problem Solving, Unix-Linux-Os | Comments Off on More Awk with CSV/DSV files

Comments are closed.


________________________________________________
YOUR GeoIP Data | Ip: 73.21.121.1
Continent: NA | Country Code: US | Country Name: United States
Region: | State/Region Name: | City:
(US only) Area Code: 0 | Postal code/Zip:
Latitude: 38.000000 | Longitude: -97.000000
Note - if using a mobile device your physical location may NOT be accurate...
________________________________________________

Georgia-USA.Com - Web Hosting for Business
____________________________________