pivoting in Python
I needed to do some pre-processing of some data which involved transposing column names to values. The condition was that the value for each respective column (frequency count) had to be > 1.
My input was a csv file, and my goal was an output csv file which would feed into a batch database import process.
ID,DA,NL,PHENOM1,PHENOM2,PHENOM3,PHENOM4 233,99,44,0.00,27.00,12.00,0.00
The other interesting bit was that only a range of columns applied to the condition; the other columns represented ancillary data.
Enter Python:
#!/usr/bin/python import sys import csv # open file and read headers fPhenomenon = open("phenomenon.txt","r") sHeaders = fPhenomenon.readline().replace(r'"','') aHeaders = sHeaders.split(",") # feed the rest to csv csvIn = csv.reader(fPhenomenon) csvOut = csv.writer(sys.stdout) for sRowIn in csvIn: aRowOut = [] aPhenomenon = [] aRowOut.append(sRowIn[0]) # procedure ID aRowOut.append(sRowIn[1]) # major drainage area ID for nIndexTupleVal, tupleVal in enumerate(sRowIn[3:-1]): if (float(tupleVal) > 0): # phenomenon measured at least once # add phenomenon name to list aPhenomenon.append(aHeaders[nIndexTupleVal+3]) # add phenomenon list to record aRowOut.append(",".join(aPhenomenon)) csvOut.writerow(aRowOut)
Notes
- hooray for raw strings!
- enumerate() is great and saves you the trouble of declaring your own counter
- like any language, modules/libraries makes things so easy to work with
- I wish the header stuff was a bit cleaner (I should look further into the csv module w.r.t. headers
That’s my hack for the day. Have a good weekend!
UPDATE: ah, the csv module has a .next() method, which can be used instead of the shoemaker attempt I made above to regularize / split / store the header list.