Friday, January 4, 2013

Doing a Group By and count in PIG

suppose you want to convert this SQL to PIG latin.


 select  count(*)
 from   gpl_limits gl
 group by grouptypeid

In PIG, you'll have to do this:
--begin pig script

rmf /user/tsusanto/test;

--1) load data
A = LOAD '/user/tsusanto/full_load/gpl_limits/part*' USING PigStorage('\t') AS (groupid:chararray, originallimit:chararray, groupdesc:chararray, spentprior:chararray, actuallimit:chararray, effectivedate:chararray, grouptypeid:chararray, stacklim:chararray, displaytypeid:chararray);
 
--2) project required fields
B = FOREACH A GENERATE originallimit, groupid, grouptypeid;

--3)group by groupetypeid
C = GROUP B BY (grouptypeid);

D = FOREACH C generate FLATTEN(group), COUNT(B) as count;

--4)write final resultset
STORE D INTO '/user/tsusanto/test' USING PigStorage('\t');

--end pig script

If you then view the file, you should see 2 columns, the grouptypeid and the count

hadoop fs -text /user/tsusanto/test/part*



No comments:

Post a Comment