When I sqooped data in from our Netezza database into HDFS, the null values also translated into nulls in the Hadoop file.
To replace null with an empty string, do it in the GENERATE statement.
A = LOAD 'your file' USING PigStorage('\t') AS (col1:chararray, col2:chararray);
B = FOREACH A GENERATE col1, (col2 is null ? '' : TRIM(col2)) as col2;
--Note: I also sneaked in a TRIM to get rid of any trailing spaces in col2; if col2 is null, I set it to an empty string instead.
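For context, here is a minimal end-to-end sketch of that same pattern applied to a whole row and written back out; the input/output paths and column names below are just placeholders for illustration.
--begin pig script
rmf /user/tsusanto/cleaned;
--load the tab-delimited file that was sqooped in from Netezza (hypothetical path and schema)
A = LOAD '/user/tsusanto/full_load/some_table/part*' USING PigStorage('\t') AS (col1:chararray, col2:chararray, col3:chararray);
--replace nulls with empty strings (and trim trailing spaces) in the GENERATE statement
B = FOREACH A GENERATE
      (col1 is null ? '' : TRIM(col1)) as col1,
      (col2 is null ? '' : TRIM(col2)) as col2,
      (col3 is null ? '' : TRIM(col3)) as col3;
--write the cleaned rows back out, tab-delimited
STORE B INTO '/user/tsusanto/cleaned' USING PigStorage('\t');
--end pig script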
Wednesday, January 9, 2013
Friday, January 4, 2013
Doing a Group By and count in PIG
Suppose you want to convert this SQL to Pig Latin:
select grouptypeid, count(*)
from gpl_limits gl
group by grouptypeid
In PIG, you'll have to do this:
--begin pig script
rmf /user/tsusanto/test;
--1) load data
A = LOAD '/user/tsusanto/full_load/gpl_limits/part*' USING PigStorage('\t') AS (groupid:chararray, originallimit:chararray, groupdesc:chararray, spentprior:chararray, actuallimit:chararray, effectivedate:chararray, grouptypeid:chararray, stacklim:chararray, displaytypeid:chararray);
--2) project required fields
B = FOREACH A GENERATE originallimit, groupid, grouptypeid;
--3) group by grouptypeid
C = GROUP B BY (grouptypeid);
D = FOREACH C GENERATE FLATTEN(group), COUNT(B) as count;
--4) write final result set
STORE D INTO '/user/tsusanto/test' USING PigStorage('\t');
--end pig script
If you then view the output file, you should see two columns: the grouptypeid and the count.
hadoop fs -text /user/tsusanto/test/part*
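As a side note, the two output fields can be given explicit names so that any later steps can refer to them directly. Here is a sketch of step 3 written that way (same logic, just named fields; the alias cnt is only for illustration):
D = FOREACH C GENERATE FLATTEN(group) AS grouptypeid, COUNT(B) AS cnt;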