Wednesday, January 9, 2013

Pig null handling

When I sqooped in data from our Netezza database into HDFS..the null values also translated into nulls in the hadoop file.

To replace null with an empty string, do it in the GENERATE statement.

A = LOAD 'your file' USING PigStorage('\t') AS (col1:charray, col2:charray);

B = FOREACH A GENERATE col1, (col2 is null ? '' : TRIM(col2)) as col2;

--Note: I also sneaked in a TRIM to get rid of any trailing spaces in col2, otherwise if it's a null, I set it to empty string.

Friday, January 4, 2013

Doing a Group By and count in PIG

suppose you want to convert this SQL to PIG latin.


 select  count(*)
 from   gpl_limits gl
 group by grouptypeid

In PIG, you'll have to do this:
--begin pig script

rmf /user/tsusanto/test;

--1) load data
A = LOAD '/user/tsusanto/full_load/gpl_limits/part*' USING PigStorage('\t') AS (groupid:chararray, originallimit:chararray, groupdesc:chararray, spentprior:chararray, actuallimit:chararray, effectivedate:chararray, grouptypeid:chararray, stacklim:chararray, displaytypeid:chararray);
 
--2) project required fields
B = FOREACH A GENERATE originallimit, groupid, grouptypeid;

--3)group by groupetypeid
C = GROUP B BY (grouptypeid);

D = FOREACH C generate FLATTEN(group), COUNT(B) as count;

--4)write final resultset
STORE D INTO '/user/tsusanto/test' USING PigStorage('\t');

--end pig script

If you then view the file, you should see 2 columns, the grouptypeid and the count

hadoop fs -text /user/tsusanto/test/part*