Friday, February 22, 2013
Pig Latin - How to do a group by, count distinct
I had to convert this SQL into Pig recently. It took me a few tries and it's not very intuitive to me, so if you ever come across this problem, here's a template you can copy.
SQL:
select month_year, osname, appname, count(distinct sessionid) as distinct_sessions
from tableA
where starttime >= '2013-01-01'
group by month_year, osname, appname;
PIG:
A = LOAD '/user/yourfolder/*/part*' USING PigStorage('\t') AS (startday:chararray, appname:chararray, osname:chararray, uniquesessionid:chararray);
B = FILTER A by startday matches '2013.*';
--project startday to month
C = FOREACH B GENERATE SUBSTRING(TRIM(startday), 0, 7) as monthyear, appname, osname, uniquesessionid;
D = GROUP C BY (monthyear, appname, osname);
SESSIONS = FOREACH D {
    unique_sessions = DISTINCT C.uniquesessionid;
    GENERATE FLATTEN(group), COUNT(unique_sessions) as distinct_sessions;
};
STORE SESSIONS INTO '/user/yourfolder' USING PigStorage('\t');
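The key trick is the nested FOREACH block: DISTINCT runs on each group's bag separately, so you get a per-group distinct count instead of a global one. For illustration (with made-up data), the stored output is one tab-separated row per group, something like:
2013-01	appA	ios	1542
2013-01	appA	android	1203
2013-02	appB	ios	987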
Thursday, February 21, 2013
netezza nzsql command
nzsql -host YourServer -d YourDatabase -u YourUserName -pw YourPassword
Once you connect, you can just execute SQL statements: end each with a semicolon, hit Enter, and it will display the results of your SQL.
prompt => select * from yourtableA limit 10;
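nzsql is based on PostgreSQL's psql client, so the psql-style backslash commands should also work (assuming your nzsql version supports them):
prompt => \dt (list the tables in the current database)
prompt => \d yourtableA (describe the columns of a table)
prompt => \q (quit the session)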
Wednesday, January 9, 2013
Pig null handling
When I sqooped data from our Netezza database into HDFS, the null values also translated into nulls in the Hadoop file.
To replace null with an empty string, do it in the GENERATE statement.
A = LOAD 'your file' USING PigStorage('\t') AS (col1:chararray, col2:chararray);
B = FOREACH A GENERATE col1, (col2 is null ? '' : TRIM(col2)) as col2;
--Note: I also sneaked in a TRIM to get rid of any trailing spaces in col2; if col2 is null, it's set to an empty string instead.
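The same bincond (? :) pattern works for numeric fields too. A minimal sketch, assuming col2 is an int you want to default to 0 instead of null:
A = LOAD 'your file' USING PigStorage('\t') AS (col1:chararray, col2:int);
B = FOREACH A GENERATE col1, (col2 is null ? 0 : col2) as col2;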
Friday, January 4, 2013
Doing a Group By and count in PIG
Suppose you want to convert this SQL to Pig Latin:
select grouptypeid, count(*)
from gpl_limits gl
group by grouptypeid;
In PIG, you'll have to do this:
--begin pig script
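--remove the output folder first; STORE fails if the target folder already exists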
rmf /user/tsusanto/test;
--1) load data
A = LOAD '/user/tsusanto/full_load/gpl_limits/part*' USING PigStorage('\t') AS (groupid:chararray, originallimit:chararray, groupdesc:chararray, spentprior:chararray, actuallimit:chararray, effectivedate:chararray, grouptypeid:chararray, stacklim:chararray, displaytypeid:chararray);
--2) project required fields
B = FOREACH A GENERATE originallimit, groupid, grouptypeid;
--3) group by grouptypeid
C = GROUP B BY (grouptypeid);
D = FOREACH C GENERATE FLATTEN(group), COUNT(B) as count;
--4) write final result set
STORE D INTO '/user/tsusanto/test' USING PigStorage('\t');
--end pig script
If you then view the file, you should see two columns: the grouptypeid and the count.
hadoop fs -text /user/tsusanto/test/part*
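One gotcha worth knowing: Pig's COUNT skips tuples whose first field is null, like SQL's count(column). If you want every tuple counted, like count(*), use COUNT_STAR instead:
D = FOREACH C GENERATE FLATTEN(group), COUNT_STAR(B) as count;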
Tuesday, December 18, 2012
Hadoop etc
Cheat sheet of Hadoop stuff
SQOOP
IMPORT:
sqoop import --connect jdbc:netezza://host/dbname --username yourusername --password=yourpassword --query "SELECT * FROM tableA where \$CONDITIONS and columnA >= '2012-08-01'" --direct --fields-terminated-by '\t' --split-by columnB --num-mappers 1 --target-dir /user/tsusanto/input_data/20120803 --compression-codec org.apache.hadoop.io.compress.GzipCodec
DELETE. You can also delete records using sqoop:
sqoop eval --connect jdbc:netezza://host/dbname --username yourusername --password=yourpassword -e "delete from yourtable where columnA = 926"
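EXPORT. You can also push an HDFS folder back into a database table with sqoop export; a minimal sketch (the table name and folder here are made up):
sqoop export --connect jdbc:netezza://host/dbname --username yourusername --password=yourpassword --table yourtable --export-dir /user/tsusanto/output_data --input-fields-terminated-by '\t'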
Hadoop FS SHELL
write a value to a file in HDFS:
echo 'test123' | hadoop fs -put - yourfilepath
display a few rows of your data file stored in the Hadoop file system
hadoop fs -text /user/tsusanto/yourdatafolder/part* | less
search for a text string
hadoop fs -text /user/tsusanto/yourdatafolder/part* | grep -P '\byoursearchstring\b'
copy a file from the Hadoop file system to your local current folder
hadoop fs -get /user/tsusanto/jobs/yourfolder/workflow.xml .
delete a file from the Hadoop file system. You must delete the file before you can push a new file to the same location; the file must not already exist
hadoop fs -rm -r /user/tsusanto/jobs/yourfolder/workflow.xml
copy a file to the Hadoop file system from your current folder
hadoop fs -put workflow.xml /user/tsusanto/jobs/yourfolder/
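a few more everyday commands (exact flags can vary by Hadoop version):
list the contents of a folder
hadoop fs -ls /user/tsusanto/jobs/yourfolder
create a new folder
hadoop fs -mkdir /user/tsusanto/jobs/newfolder
show how much space a folder uses
hadoop fs -du /user/tsusanto/yourdatafolder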
Wednesday, December 12, 2012
Netezza SQL Extension Toolkit
Netezza has many useful string manipulation functions available if you install the Netezza SQL Extension Toolkit.
To get the right 3 characters of a string:
select strright('this is a string abc', 3);
The above will return abc.
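There is also a strleft counterpart for the left side of a string (assuming your toolkit version includes it):
select strleft('this is a string abc', 4);
The above will return this.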
Tuesday, December 11, 2012
Microstrategy Dashboard
A Microstrategy dashboard built using Microstrategy 9.3.
Features:
- Row-level security based on user login (Microsite IDs). The selectors displayed on the left side of the screen change based on the user login
- Cascading selectors. The Offer Type selector filters the Retailer selector, which in turn filters the Microsite selector
- Multiple panels, so the user can easily tab between panels to view different data