Friday, February 22, 2013

Pig Latin - How to do a group by, count distinct

I had to convert this SQL into Pig recently. It took me a few tries and it's not very intuitive, so if you ever come across this problem, here's a template you can copy.


SQL:
select month_year, osname, appname, count (distinct sessionid) as distinct_sessions
from tableA
where starttime >= '2013-01-01'
group by month_year, osname, appname;

PIG:

A = LOAD '/user/yourfolder/*/part*' USING PigStorage('\t') AS (startday:chararray, appname:chararray, osname:chararray, uniquesessionid:chararray);

--filter to 2013 (the regex version of the SQL's starttime >= '2013-01-01'; note it only matches 2013 dates)
B = FILTER A by startday matches '2013.*';

--project startday to month
C = FOREACH B GENERATE SUBSTRING(TRIM(startday), 0, 7) as monthyear, appname, osname, uniquesessionid;

D = GROUP C BY (monthyear, appname, osname);

SESSIONS = foreach D {
    unique_sessions = DISTINCT C.uniquesessionid;
    generate FLATTEN(group), COUNT(unique_sessions) as sessions;
};

STORE SESSIONS INTO '/user/yourfolder' USING PigStorage('\t');
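If you want to sanity-check the distinct-count logic outside of Pig, the same idea can be sketched in the shell. The rows below are made up for illustration (monthyear, osname, appname, sessionid): dedupe whole rows first, then count what's left per group.

```shell
# Made-up sample rows: monthyear, osname, appname, sessionid (tab-separated).
# sort -u drops the duplicate s1 session, cut keeps the group-by columns,
# and uniq -c counts the remaining distinct sessions per group.
printf '2013-01\tios\tapp1\ts1\n2013-01\tios\tapp1\ts1\n2013-01\tios\tapp1\ts2\n' \
  | sort -u \
  | cut -f1-3 \
  | uniq -c
```

This prints the group with a count of 2, since s1 appears twice but is only counted once, which is exactly what the nested DISTINCT inside the FOREACH buys you.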

Thursday, February 21, 2013

netezza nzsql command


nzsql -host YourServer -d YourDatabase -u YourUserName -pw YourPassword

Once you connect, you can just execute SQL statements: end each one with a semicolon, hit enter, and it will display the results.

prompt => select  * from yourtableA limit 10;

Wednesday, January 9, 2013

Pig null handling

When I sqooped data from our Netezza database into HDFS, the null values also came through as literal nulls in the Hadoop file.

To replace null with an empty string, do it in the GENERATE statement.

A = LOAD 'your file' USING PigStorage('\t') AS (col1:chararray, col2:chararray);

B = FOREACH A GENERATE col1, (col2 is null ? '' : TRIM(col2)) as col2;

--Note: I also sneaked in a TRIM to get rid of any trailing spaces in col2; if it's null, I set it to an empty string instead.
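The same trim-or-empty idea can be checked locally in the shell; the two input rows below are invented, with the second row's col2 empty to stand in for a sqooped null.

```shell
# Made-up two-column input; col2 of the first row has padding, and the
# second row's col2 is empty (standing in for a null from sqoop).
# awk trims surrounding spaces from col2 and leaves empty values as ''.
printf 'a\t x \nb\t\n' \
  | awk -F'\t' -v OFS='\t' '{gsub(/^ +| +$/, "", $2); print $1, $2}'
```

The first row comes out as a<TAB>x and the second as b<TAB> with an empty second field, mirroring what the Pig ternary above produces.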

Friday, January 4, 2013

Doing a Group By and count in PIG

Suppose you want to convert this SQL to Pig Latin.


 select  grouptypeid, count(*)
 from   gpl_limits gl
 group by grouptypeid

In PIG, you'll have to do this:
--begin pig script

rmf /user/tsusanto/test;

--1) load data
A = LOAD '/user/tsusanto/full_load/gpl_limits/part*' USING PigStorage('\t') AS (groupid:chararray, originallimit:chararray, groupdesc:chararray, spentprior:chararray, actuallimit:chararray, effectivedate:chararray, grouptypeid:chararray, stacklim:chararray, displaytypeid:chararray);
 
--2) project required fields
B = FOREACH A GENERATE originallimit, groupid, grouptypeid;

--3)group by grouptypeid
C = GROUP B BY (grouptypeid);

D = FOREACH C generate FLATTEN(group), COUNT(B) as count;

--4)write final resultset
STORE D INTO '/user/tsusanto/test' USING PigStorage('\t');

--end pig script

If you then view the file, you should see 2 columns, the grouptypeid and the count

hadoop fs -text /user/tsusanto/test/part*
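As a quick local analogue, a grouped count is just sort plus uniq -c in the shell; the grouptypeid values below are made up.

```shell
# Each line is a made-up grouptypeid; sort groups identical ids together
# and uniq -c prefixes each one with its count, much like
# GROUP ... then COUNT in the Pig script above.
printf '1\n1\n2\n2\n2\n3\n' | sort | uniq -c
```

This prints three rows with counts 2, 3, and 1, one per grouptypeid, the same shape as the two-column Pig output.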



Tuesday, December 18, 2012

Hadoop etc

Cheat sheet of Hadoop stuff

SQOOP

IMPORT:
sqoop import --connect jdbc:netezza://host/dbname --username yourusername --password yourpassword --query "SELECT * FROM tableA where \$CONDITIONS and columnA >= '2012-08-01'"  --direct --fields-terminated-by '\t' --split-by columnB --num-mappers 1 --target-dir /user/tsusanto/input_data/20120803 --compression-codec org.apache.hadoop.io.compress.GzipCodec


DELETE. You can also delete records using sqoop:
sqoop eval --connect jdbc:netezza://host/dbname --username yourusername --password yourpassword -e "delete from yourtable where columnA = 926"


Hadoop FS SHELL

write a value to a file in HDFS:
echo 'test123' | hadoop fs -put - yourfilepath

display a few rows of your data file stored in the hadoop file system
hadoop fs -text /user/tsusanto/yourdatafolder/part* | less


to search for a text string
hadoop fs -text /user/tsusanto/yourdatafolder/part* | grep -P '\byoursearchstring\b'
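The \b markers in that pattern are word boundaries, so only the whole word matches. You can check the behavior locally without hadoop (the sample lines below are made up); note -P requires GNU grep.

```shell
# 'foo' and 'my foo here' match, but 'foobar' does not, because \b
# requires a word boundary on each side of the search string.
printf 'foo\nfoobar\nmy foo here\n' | grep -P '\bfoo\b'
```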

copy a file from the hadoop file system to your local current folder
hadoop fs -get /user/tsusanto/jobs/yourfolder/workflow.xml .

delete a file from the hadoop file system. You must delete the file before you can push a new file to the same location; the file must not already exist
hadoop fs -rm -r /user/tsusanto/jobs/yourfolder/workflow.xml

copy a file to the hadoop file system from your current folder
hadoop fs -put workflow.xml /user/tsusanto/jobs/yourfolder/

Wednesday, December 12, 2012

Netezza SQL Extension Toolkit

There are many useful string manipulation functions available in Netezza if you install the Netezza SQL Extension Toolkit.

To get the Right 3 characters of a string:

select strright( 'this is a string abc',  3)

The above will return abc.
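If you just want to eyeball the right-3 logic without the toolkit, the shell can do the same thing on a plain string:

```shell
# tail -c 3 keeps the last three bytes of the string; printf adds no
# trailing newline, so exactly 'abc' comes back.
printf 'this is a string abc' | tail -c 3
```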


Tuesday, December 11, 2012

Microstrategy Dashboard

Microstrategy Dashboard built using Microstrategy 9.3

Features:
  1. Row level security based on user login - Microsite IDs. The selectors displayed on the left side of the screen below change based on the user login
  2. Cascading selectors. The Offer Type selector filters the Retailer selector, which in turn filters the Microsite selector
  3. Multiple panels, so the user can easily tab between panels to view different data