python - Extracting unrecognized information from many CSV files
I am new to programming and have many CSV files that need to be processed. Each CSV file has an 8-line header. After the header, there are rows of names and data. The second column has a bunch of names of the products I am concerned with right now. Each product has a set of names recognized by our computers. For example, a shoe is recognized as: shoe, sneaker, heel, loafer, etc. Over time, other names have sneaked into the CSV files that the computers cannot recognize. I want to pull these names out of the CSV files and populate a text file that I can go through, sort, and add to the computers. There is also information at the bottom of the CSVs, separated from the rest of the information by an empty line.
I know I should use the glob module and NumPy or pandas, but I don't know how to incorporate what I need into some sort of working program. Here is my initial attempt at the code.
    import csv
    import glob
    import os
    import numpy as np
    from StringIO import StringIO

    fns = glob.glob('*.csv')
    for fn in fns:
        data = np.genfromtxt(fns, delimiter=',')
        if 'shoe' or 'heel' or 'loafer' or 'sneaker':
        elif 'shirt' or 'tee' or 'tank' or 'polo':
        else:
If anyone has bits of code that would help, it would be appreciated. Thank you.
The CSVs look like this:
    name bunch of stuff
    header stuff stuff
    header stuff stuff
    header stuff stuff
    header stuff stuff
    header stuff stuff
    count 5
    number item more price1 price2 eta faulty other
    n1 shoe stuff
    n2 heel stuff
    n3 tee k
    n4 polo other stuff g j
    n5 sneaker other stuff h n
Your data format is a little hard to make sense of (is the actual data tab-separated?), so I've turned it into a simpler example:
    name bunch of stuff
    header stuff stuff
    header stuff stuff
    header stuff stuff
    header stuff stuff
    header stuff stuff
    count 5
    number,item
    n1,shoe
    n2,heel
    n3,tee
    n4,polo
    n5,sneaker
You can read in the CSV file using pandas, skipping the header lines with skiprows:
    import pandas as pd
    prod_df = pd.read_csv('prod.csv', skiprows=7)
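The question also mentions extra information at the bottom of each CSV, separated from the data by an empty line. Here is a minimal sketch of one way to keep that trailing block out of the table, assuming the layout in the example above (7 lines before the column-name row) and assuming the first blank line after the header marks the end of the data. The helper name read_product_table is made up for illustration, and treating a line of only commas as blank is also an assumption.

```python
import io
import pandas as pd

HEADER_LINES = 7  # assumption: 7 lines precede the column-name row, as in the example


def read_product_table(text, header_lines=HEADER_LINES):
    """Parse only the data table: drop the header, stop at the first blank line."""
    lines = text.splitlines()[header_lines:]
    table = []
    for line in lines:
        # Assumption: an empty line (or one holding only commas) starts the trailing block.
        if not line.strip().strip(","):
            break
        table.append(line)
    return pd.read_csv(io.StringIO("\n".join(table)))


# Small made-up sample in the same shape as the example data.
sample = "\n".join(
    ["name bunch of stuff"]
    + ["header stuff stuff"] * 5
    + ["count 5", "number,item", "n1,shoe", "n2,heel", "", "trailing,notes"]
)
df = read_product_table(sample)
print(df["item"].tolist())
```

For a file on disk you would pass in the result of open(path).read(). If your files are really tab-separated, pd.read_csv would also need sep='\t'.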
Then you can find the values in the data (note that the unique() call means you'll get each value only once, even if there are hundreds of duplicates of each):
    data_products = prod_df['item'].unique()
    data_products
    Out[22]: array(['shoe', 'heel', 'tee', 'polo', 'sneaker'], dtype=object)
And compare them to the values you should have:
    valid_products = ['shoe', 'sneaker']
    invalid_data = [x for x in data_products if x not in valid_products]
    invalid_data
    Out[25]: ['heel', 'tee', 'polo']
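To cover the "many CSV files" and "populate a text file" parts of the question, the same steps can be wrapped in a loop over glob. This is only a rough sketch under a few assumptions: every file has the same 7 lines before the column-name row, the product column is called item as in the simplified example, and VALID_PRODUCTS is a placeholder for your real list of recognized names. The function names and the output file name are made up. It does not deal with the trailing block at the bottom of the files, so those rows may need to be cut off first.

```python
import glob
import pandas as pd

# Placeholder list of recognized names; replace with the real one.
VALID_PRODUCTS = {"shoe", "sneaker", "heel", "loafer", "shirt", "tee", "tank", "polo"}


def unknown_products(pattern="*.csv", valid=VALID_PRODUCTS, skiprows=7, column="item"):
    """Collect every name in `column`, across all matching files, that is not in `valid`."""
    unknown = set()
    for path in sorted(glob.glob(pattern)):
        df = pd.read_csv(path, skiprows=skiprows)
        # Drop missing cells and strip stray whitespace before comparing.
        names = df[column].dropna().astype(str).str.strip()
        unknown.update(set(names) - set(valid))
    return sorted(unknown)


def write_report(names, out_path="unknown_products.txt"):
    """Write one unrecognized name per line."""
    with open(out_path, "w") as fh:
        for name in names:
            fh.write(name + "\n")
```

Calling write_report(unknown_products('*.csv')) in the folder that holds the files should leave you with a sorted text file of names to review. Using a set means each unknown name is listed once, no matter how many files it shows up in.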