python - Extracting unrecognized information from many CSV files

- June 15, 2015

I am new to programming and have many CSV files that need to be processed. Each CSV file has an 8-line header. After the header, there are rows of names and data. The second column has a bunch of names of the products we are concerned with right now. Each product has a set of names recognized by our computers. For example, a shoe is recognized as: shoe, sneaker, heel, loafer, etc. Over time, other names have sneaked into the CSV files that the computers cannot recognize. I want to get these names out of the CSV files and populate a text file that I can go through, sort, and add to the computers. There is also information at the bottom of the CSVs, separated from the rest of the information by an empty line.

I know I should use the glob module and numpy or pandas, but I don't know how to incorporate what I need into any sort of working program. Here is my initial attempt at the code.

    import csv
    import glob
    import os
    import numpy as np
    from StringIO import StringIO

    fns = glob.glob('*.csv')

    for fn in fns:
        data = np.genfromtxt(fns, delimiter=',')

        if 'shoe' or 'heel' or 'loafer' or 'sneaker':

        elif 'shirt' or 'tee' or 'tank' or 'polo':

        else:

If anyone has bits of code that would help, it would be appreciated. Thank you.

The CSVs look like this:

    name    bunch of stuff
    header stuff    stuff
    header stuff    stuff
    header stuff    stuff
    header stuff    stuff
    header stuff    stuff
    count   5
    number  item    more    price1  price2  eta    faulty  other
    n1  shoe    stuff
    n2  heel    stuff
    n3  tee                                       k
    n4  polo    other   stuff               g       j
    n5  sneaker other   stuff               h       n

Your data format is a little hard to make sense of (is the actual data tab-separated?), so I've turned it into a simpler example:

    name    bunch of stuff
    header stuff    stuff
    header stuff    stuff
    header stuff    stuff
    header stuff    stuff
    header stuff    stuff
    count   5
    number,item
    n1,shoe
    n2,heel
    n3,tee
    n4,polo
    n5,sneaker

You can read in the CSV file using pandas, skipping the header with skiprows:

    import pandas as pd
    prod_df = pd.read_csv('prod.csv', skiprows=7)
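The question also mentions extra information at the bottom of each file, separated by an empty line. pandas skips blank lines by default, so with skiprows alone those trailing lines would be parsed as if they were product rows. One way around that, as a minimal sketch: cut the text at the first blank line before handing it to pandas. The helper name `read_products`, the sample text, and the footer row are all made up for illustration, and the sketch assumes 7 free-form header lines followed by the column-name row, as in the simplified example.

```python
import io

import pandas as pd

# Hypothetical sample mirroring the simplified file, plus a made-up footer
# separated from the product rows by an empty line.
SAMPLE = """name,bunch of stuff
header stuff,stuff
header stuff,stuff
header stuff,stuff
header stuff,stuff
header stuff,stuff
count,5
number,item
n1,shoe
n2,heel
n3,tee
n4,polo
n5,sneaker

footer,not a product
"""


def read_products(text, header_lines=7):
    """Parse the product table, ignoring the header block and any footer."""
    lines = text.splitlines()[header_lines:]
    body = []
    for line in lines:
        if not line.strip():
            break  # first blank line: everything after it is footer
        body.append(line)
    # The first remaining line ("number,item") becomes the column names.
    return pd.read_csv(io.StringIO("\n".join(body)))


prod_df = read_products(SAMPLE)
print(prod_df["item"].tolist())
```

For a file on disk, the same helper could be fed the result of `open(path).read()`.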

Then you can find all the values in the data (note the unique() call means you'll get each value only once, even if there are hundreds of duplicates of each):

    data_products = prod_df['item'].unique()
    data_products
    Out[22]: array(['shoe', 'heel', 'tee', 'polo', 'sneaker'], dtype=object)

And compare them to the values you should have:

    valid_products = ['shoe', 'sneaker']
    invalid_data = [x for x in data_products if x not in valid_products]
    invalid_data
    Out[25]: ['heel', 'tee', 'polo']
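To apply the same comparison to many files and end up with a text file of unknown names, as the question asks, something like the following might work. It is a sketch using only the standard library (csv and glob) rather than pandas, and it rests on several assumptions: the files are comma-separated, the first 8 rows (including the column-name row) are header, the product name is in the second column, a blank line marks the start of the footer, and names are compared case-insensitively. `KNOWN_NAMES`, `unrecognized_names`, and `collect_unrecognized` are hypothetical names, and the list of known names is only a placeholder built from the examples in the question.

```python
import csv
import glob

# Placeholder list of recognized names; replace it with the real one.
KNOWN_NAMES = {"shoe", "sneaker", "heel", "loafer",
               "shirt", "tee", "tank", "polo"}


def unrecognized_names(path, header_lines=8, column=1):
    """Return the set of names in one file that are not in KNOWN_NAMES."""
    found = set()
    with open(path, newline="") as handle:
        for index, row in enumerate(csv.reader(handle)):
            if index < header_lines:
                continue  # still inside the header block
            if not any(cell.strip() for cell in row):
                break  # blank line: the footer starts here
            if len(row) > column and row[column].strip():
                name = row[column].strip().lower()
                if name not in KNOWN_NAMES:
                    found.add(name)
    return found


def collect_unrecognized(pattern, out_path):
    """Scan every file matching pattern; write the unknown names, sorted."""
    unknown = set()
    for path in sorted(glob.glob(pattern)):
        unknown |= unrecognized_names(path)
    with open(out_path, "w") as out:
        for name in sorted(unknown):
            out.write(name + "\n")
    return sorted(unknown)
```

Calling `collect_unrecognized('*.csv', 'unknown_names.txt')` from the folder holding the files would then write one unrecognized name per line. Using a set means each name appears only once across all files, much like unique() does for a single file.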
