Friday, February 22, 2013

Pig Latin - How to do a group by, count distinct

I had to convert this sql into pig recently, took me a few tries and it's not very intuitive to me, so if you ever come across this problem..here's a template you can copy.


SQL:
select month_year, osname, appname, count (distinct sessionid) as distinct_sessions
from tableA
where starttime >= '2013-01-01'
group by month_year, osname, appname;

PIG:

A = LOAD '/user/yourfolder/*/part*' USING PigStorage('\t') AS (startday:chararray, appname:chararray, osname:chararray, uniquesessionid:chararray);

B = FILTER A by startday matches '2013.*';

--project startday to month
C = FOREACH B GENERATE SUBSTRING(TRIM(startday), 0, 7) as monthyear, appname, osname, uniquesessionid;

D = GROUP C BY (monthyear, appname, osname);

SESSIONS = foreach D {
    unique_sessions = DISTINCT C.uniquesessionid;
    generate FLATTEN(group), COUNT(unique_sessions) as sessions;
};

STORE SESSIONS INTO '/user/yourfolder' USING PigStorage('\t');

No comments:

Post a Comment