Tuesday, December 3, 2013

bash script to extract dates from sql statements from history table

We had to figure out the usage pattern against one of our biggest tables to decide how to archive it. We keep a history of all queries issued against the table, and I wanted to extract the date filter used in each SQL statement.

There are many ways to do it, but I wanted to do it in a bash script just for fun. It's probably not the most efficient approach, but I'm posting it anyway, mostly for my own benefit in case I need to remember how to do something in bash later: how to find the position of a word in a string, how to extract words from a line, and so on.
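For example, these are the little string tricks the script below leans on (a minimal sketch; the sample line is made up):

line="select * from sales where adddate >= '2013-11-01'"

#find the 1-based position of a word in a string (0 if not found) using awk's match()
pos=$(echo "$line" | awk '{print match(tolower($0), "where")}')
echo "$pos"                         #prints 21

#extract a substring by offset with ${var:offset} (0-based)
echo "${line:$((pos - 1))}"         #prints: where adddate >= '2013-11-01'

#pull a delimited field out of a line with cut
echo "a|b|c" | cut -d"|" -f2        #prints: b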



#! /bin/bash

#goal: to extract the dates used in where condition in sql ie adddate >= '2013-11-01' or daystring=20131105
nzsql -t -host serverhere -d databasehere -u usernamehere -W passwordhere -c "select QH_SESSIONID, QH_TSUBMIT, QH_USER, QH_SQL from QUERYHISTORY" > queryhistory.txt 2>&1

rm -f lhqueryhistory1.txt

while read line
do
  if [ -z "$line" ];
  then
    echo "line is empty"
    break
  fi

  sessionid=`echo $line | cut -d"|" -f1`
  starttime=`echo $line | cut -d"|" -f2`
  username=`echo $line | cut -d"|" -f3`
  sql=`echo $line | cut -d"|" -f4`

  #cut out sql starting from the where word
  wherepos=$(echo $sql | awk '{print match(tolower($0), "where")}')

  #if no where condition skip that sql
  if [ ${wherepos:-0} -eq 0 ];
  then
    continue
  fi
  wheresql=$(echo ${sql:${wherepos}})

  #cut out sql starting from the adddate or daystring word
  wherepos2=$(echo $wheresql | awk '{print match(tolower($0), "adddate|daystring")}')
  if [ ${wherepos2:-0} -eq 0 ];
  then
    wheresql2="none and"
  else
    wherepos2=$(($wherepos2 - 1))
    wheresql2=$(echo ${wheresql:${wherepos2}})
  fi

  #cut out sql up until the and, group or limit word
  wherepos3=$(echo $wheresql2 | awk '{print match(tolower($0), "and|group|limit")}')
  if [ ${wherepos3:-0} -eq 0 ];
  then
    wheresql3=$wheresql2
  else
    wherepos3=$(($wherepos3 - 1))
    wheresql3=$(echo ${wheresql2:0:${wherepos3:-0}})
  fi

  #extract the date portion (it usually sits inside single quotes, e.g. '2013-11-01')
  wherepos4=$(expr index "$wheresql3" "'") #position of the first single quote, 0 if none
  if [ ${wherepos4:-0} -eq 0 ];
  then
    wheresql4=$wheresql3 #no quotes, e.g. daystring=20131105
  else
    wherepos4=$(($wherepos4 - 1))
    wheresql4=$(echo ${wheresql3:${wherepos4}}) #keep everything from the quote onwards
  fi
  finalrow="$sessionid|$starttime|$username|$wheresql4"
  echo $finalrow >> lhqueryhistory1.txt


done < queryhistory.txt

rm -f queryhistory.txt

Friday, November 22, 2013

Netezza ERROR: 65798 : no more space for TXMgr transactions array

Every now and then, in Netezza, we'll get an error message : ERROR:  65798 : no more space for TXMgr transactions array.

When this happens, nobody can issue a query against the database, and the database is unusable.

There is a limit of approximately 65,000 transaction objects in Netezza. Once that limit is reached, the system will not accept any new queries. It usually isn't a problem, because as transactions complete, the slots they took up are freed and given to new queries. Sometimes, however, queries get stuck and hang onto their transaction objects until we go into the system and manually abort those sessions.
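A quick way to see how many sessions are currently active before deciding what to abort (a sketch; it uses the same _v_session view the script below queries):

nzsql -t -c "select count(*) from _v_session where status = 'active';"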

The script below looks for queries that have been running for more than 24 hours (the threshold is an argument, so it's configurable), sends an email showing how long each one has been running, the username, the database, and the SQL command, and automatically kills the offending sessions.




#! /bin/bash

MaxMinutes="${1}"

if [ -z "$1" ]
  then MaxMinutes=1440
fi

rowcount=0

nzsql -t -c "select id, conntime, username, dbname, command from _v_session where status='active' and command is not null
;" > sessionstokill.txt 2>&1

CurrentTimeEpoch=$(date +%s)
message="Long Running Queries:\n----------------\n"

while read line
do

  id=`echo $line | cut -d"|" -f1`
  starttime=`echo $line | cut -d"|" -f2`
  username=`echo $line | cut -d"|" -f3`
  dbname=`echo $line | cut -d"|" -f4`
  command=`echo $line | cut -d"|" -f5`
  StartTimeEpoch=$(date -d "$starttime" +%s)
  DifferenceInMinutes=$((($CurrentTimeEpoch - $StartTimeEpoch)/(60)))

  #if it has been running for more than X minutes (argument 1)
  if [ $DifferenceInMinutes -ge $MaxMinutes ] && [ "$id" -gt 0 ];
  then
     rowcount=$((rowcount+1));
     message+="SESSIONID: ${id} has been running for ${DifferenceInMinutes} minutes.\nUSERNAME:${username}\nDATABASE:${dbname}\nSQL:${command}\n"
   
     #we only want to kill it if the queries came from these users
     if [[ "$username" == *INTERNAL*_W* ]] || [[ "$username" == *ADMIN* ]];
     then
        message+="\n!!This session will be killed!!\n";

        nzsession abortTxn -force -id $id
     fi

     message+="-----------------\n\n"
  fi

done < sessionstokill.txt

if [ $rowcount -gt 0 ]
then
  sendMail -dst "youremail@yourcompany.com" -msg "Long Running Queries / Hung Transactions NZ" -bodyText "$message"

fi
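To run it on a schedule, a crontab entry like this works (the script name and paths here are hypothetical; the single argument is the threshold in minutes and defaults to 1440 when omitted):

#check every 30 minutes for anything that has been running longer than 24 hours
*/30 * * * * /home/myfolder/kill_long_queries.sh 1440 >> /home/myfolder/kill_long_queries.log 2>&1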

Wednesday, November 20, 2013

Exporting resultset using Impala and chaining multiple commands


Using Hue's Impala UI, you can download a resultset as CSV or Excel. But the Impala shell offers a much more flexible way of exporting resultsets, since you can specify any delimiter you want. Another benefit over the Hue Impala UI (as of Impala 1.1) is that the UI cannot execute multiple statements: only the first statement is executed, even though it doesn't report a failure.

If you want to chain a few commands, create a script file (it can have any extension, although the convention is to use .hql).

Put all your sql commands in a file. The last command should select the output you want to save. To execute the file, type:

impala-shell -f myscript.hql -B -o results.txt --output_delimiter='\t';

The above command invokes the Impala shell, runs the commands stored in myscript.hql, writes the results to a file called "results.txt", and uses tab as the field delimiter.

Contents of myscript.hql:

drop table if exists mytable;

create table mytable (adddate string, rowcount bigint);

insert into mytable
select to_date(adddate), count(*)
from transactionmaster
group by to_date(adddate);

select * from mytable;


-------------------------------------------------------------------------------------------

You can then view the output in the file you specified, in my case:
cat results.txt
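For a quick one-off statement you don't even need a script file: impala-shell also accepts a single statement with -q, combined with the same output flags as above (double-check impala-shell --help on your version):

impala-shell -q "select * from mytable limit 10" -B -o results.txt --output_delimiter='\t'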

Thursday, November 14, 2013

Hive / Impala - create external tables with data from subfolders

At my workplace, we already store a lot of files in our HDFS, and I wanted to create Impala tables against them. But our files are stored as LZO-compressed files, and as of Impala 1.1 you cannot create tables that use LZO files through Impala; you can, however, create them in Hive and then query them in Impala.

Our typical folder structure looks like this:

/data/mylogfiles/20130101
/data/mylogfiles/20130102
/data/mylogfiles/20130103

So basically we create a folder for each day of data.

At first I tried this and this doesn't work:
create external table mylogfile
    (
    keyword string,              
    campaign_id string,            
    mydate string,                 
    mytime string,                 
    display_site string,           
    was_clicked int,               
    cpc int,                       
    country string,                
    placement string) 
    PARTITIONED BY (log_date STRING)         
    ROW FORMAT DELIMITED           
    fields terminated by '\t' 
    stored as textfile             
    location '/data/mylogfiles/*/';   -- a wildcard in the location doesn't work!!!

Then I read about partitions, and this works, although I do think it's a hassle to have to execute a statement to add a new partition for each folder of data in our HDFS (there's a rough bash sketch below the ALTER TABLE examples that scripts this). If you know of any better way, please feel free to leave it in the comments section.


 create external table mylogtable
    (
    keyword string,             
    campaign_id string,           
    mydate string,                
    mytime string,                
    display_site string,          
    was_clicked int,              
    cpc int,                      
    country string,               
    placement string)
    PARTITIONED BY (folderdate BIGINT)        
    ROW FORMAT DELIMITED          
    fields terminated by '\t';   -- leave the location empty


Note: when you first go to impala after creating the table in Hive, you will need to issue these 2 commands or the table will not be visible in Impala:

invalidate metadata;
refresh mylogtable;

You only need to create the table in Hive, then you add the partitions in impala shell.


ALTER TABLE mylogtable ADD PARTITION (folderdate=20130101) LOCATION '/data/mylogfiles/20130101';

ALTER TABLE mylogtable ADD PARTITION (folderdate=20130102) LOCATION '/data/mylogfiles/20130102';
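Since adding one partition per folder by hand gets old quickly, here is a rough bash sketch (folder layout and table name as in the examples above) that generates and runs the ALTER TABLE statement for every folder it finds:

for dir in $(hadoop fs -ls /data/mylogfiles | awk '{print $NF}' | grep '^/data/mylogfiles/')
do
  folderdate=$(basename "$dir")
  impala-shell -q "ALTER TABLE mylogtable ADD PARTITION (folderdate=${folderdate}) LOCATION '${dir}';"
done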


You can do a quick sanity check after adding each partition; the rowcount should go up.

select count(*) 
from mylogtable;


Notice that the partition key is not an existing column of the table. You just give the partition a value of your choosing and point it at the appropriate folder. In my example above, I am effectively naming the partition "20130101" for the data located in the HDFS folder "/data/mylogfiles/20130101"; I could have used any other value, as long as it matches the partition column's type.

The real use of partitions, though, is to speed up queries by reading only the necessary data when you know which partitions the data resides in.

For example, if you are looking for data with mydate = 2013-11-02, and you know that the data resides in the 20131102 and 20131103 partitions, you can issue a statement like this which will speed up the query:

select count(*)
from mylogtable
where to_date(mydate) >= '2013-11-02' and to_date(mydate) < '2013-11-03'
and folderdate in (20131102,20131103); --> folderdate is the partition key, not an actual columnname



Regular expression

In Pig, say you want to match only February through May 2013:

/*

 * Include only records where the 'order_date' field matches
 * the regular expression pattern:
 *
 *   ^       = beginning of string
 *   2013    = literal value '2013'
 *   0[2345] = 0 followed by 2, 3, 4, or 5
 *   -       = a literal character '-'
 *   \\d{2}  = exactly two digits
 *   \\s     = a single whitespace character
 *   .*      = any number of any characters
 *   $       = end of string
 *
 */
A = FILTER data by order_date matches '^2013-0[2345]-\\d{2}\\s.*$';

Wednesday, November 13, 2013

More Pig stuff

Given a 2 column dataset - Salesperson,  SalesAmount
Mary  20000
Tom 1200
Mary 6000
Jane 8000
Jim 9000
Tom 20000
Amy 28000
Barry 35000
Charles 2400
Dawson 7384
Haley 2847
Kelly 29495
Lucy 3648



Here's how you get the top 3 sales amounts for each salesperson in Pig:

salespeople = LOAD 'data' AS (salesperson:chararray, salesamount:int);
salespeople_group = GROUP salespeople BY salesperson;
top_sales = FOREACH salespeople_group {
    sorted = ORDER salespeople BY salesamount DESC;
    highest_sales = LIMIT sorted 3;
    GENERATE group AS salesperson, highest_sales;
};

Monday, November 11, 2013

Datastage stop and reset long hanging jobs

We run a lot of loading jobs from our source SQL Server databases into Netezza using Datastage. These are just simple table-to-table loads with no data transformation, and they run every hour.

Every now and then, some of these jobs will hang but not abort, and they will stay in a perpetual running state until one of us comes in, manually stops the job, and recompiles it so that the next hourly scheduled run kicks off successfully.

I wrote a little shell script to check for Datastage jobs that have been running for more than a certain interval; if a job is on the "okay to stop and reset" list (stored in a text file), the script stops and resets it using dsjob commands.

#! /bin/bash

PROG=`basename ${0}`
EXIT_STATUS=0
ProjectName="${1}"
MaxMinutesBeforeReset="${2}"

if [ ${#} -ne 2 ]; then
   echo "${PROG}    : Invalid parameter list. The script needs 2 parameters:"
   echo "Param 1    : DS Project Name "
   echo "Param 2    : MinutesBeforeReset"

   EXIT_STATUS=99
   echo ${NOW} ${PROG} Exiting without completion with status [${EXIT_STATUS}]
   exit ${EXIT_STATUS}
fi

#go to /opt/IBM/InformationServer/Server/DSEngine
BinFileDirectory=`cat /.dshome`/bin
cd ${BinFileDirectory}

#Get current epochtime to as a baseline
CurrentTimeEpoch=$(date +%s)

#check for current running Datastage jobs
ps aux | grep 'DSD.RUN' | tr -s ' ' | cut -d" " -f13 | tr -d '\.' | while read JobName;
do

   #check if it is in the jobs to monitor & reset file, if not skip it
   if grep -Fxq "$JobName" /home/myfolder/JobsToMonitorAndReset.txt
   then
     #Get starttime which is on the 3rd row after the colon
     StartTime=$(./dsjob -jobinfo $ProjectName $JobName | sed -n 3p | grep -o ": .*" | grep -o " .*")
     StartTimeEpoch=$(date -d "$StartTime" +%s)
     DifferenceInMinutes=$((($CurrentTimeEpoch - $StartTimeEpoch)/(60)))
     echo "$JobName has been running for $DifferenceInMinutes minutes"

     #if it has been running more than x (2nd argument) minutes, stop and reset job
     if [ $DifferenceInMinutes -ge $MaxMinutesBeforeReset ];
     then
       echo "$JobName will be stopped and reset."
       ./dsjob -stop $ProjectName $JobName
       ./dsjob -run -mode RESET -wait -jobstatus $ProjectName $JobName
       exit 0
     fi
   fi
done

exit 0

---------------------------------------------------------------
If you want to monitor only specific jobs, just add the Datastage JobName to your file; I stored mine in /home/myfolder/JobsToMonitorAndReset.txt.

You can send an email to yourself too with the jobs that were stopped and reset.
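A crontab entry like this keeps it running on a schedule (the script name and paths are hypothetical; the two arguments are the Datastage project name and the minutes-before-reset threshold):

#check every 15 minutes, stop and reset anything running longer than 2 hours
*/15 * * * * /home/myfolder/reset_hung_dsjobs.sh MYPROJECT 120 >> /home/myfolder/reset_hung_dsjobs.log 2>&1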

Thursday, November 7, 2013

Convert string to JSON array using Perl

original json file (note: the string values are delimited by a colon)
[
{"cxbrandid":"100158:100159","cxbrandname":"APPLE, ORANGE & BANANA:JUICE JUICE"},
{"cxbrandid":"3000","cxbrandname":"JUNGLE, FOREST"}
]

desired output json file:
[{"cxbrandid":["100158","100159"],"cxbrandname":["APPLE, ORANGE & BANANA","JUICE JUICE"]},{"cxbrandid":["3000"],"cxbrandname":["JUNGLE, FOREST"]}]


Perl script to do that:
========================================================================
use JSON;

my $json;
{
  local $/; #Enable 'slurp' mode
  open my $fh, "<", "t.json";
  $json = <$fh>;
  close $fh;
}
my $data = decode_json($json);

#count number of elements in array and minus one
$t = scalar(@{$data});
$t = $t - 1;

for (my $i = 0; $i <= $t; $i++) {

  $cxbrandname = $data->[$i]->{'cxbrandname'};
  $cxbrandid   = $data->[$i]->{'cxbrandid'};

  my $brandidtable   = [];
  my $brandnametable = [];

  @cxbrandidarray   = split(/:/, $cxbrandid);
  @cxbrandnamearray = split(/:/, $cxbrandname);

  foreach my $val (@cxbrandidarray) {
    push @{$brandidtable}, $val;
  }

  foreach my $val (@cxbrandnamearray) {
    push @{$brandnametable}, $val;
  }

  $data->[$i]->{'cxbrandid'}   = $brandidtable;
  $data->[$i]->{'cxbrandname'} = $brandnametable;
}

open my $fh, ">", "data_out.json";
print $fh encode_json($data);
close $fh;
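To run it (the script name here is hypothetical; the input and output file names t.json and data_out.json are hard-coded in the script above):

perl convertbrands.pl
cat data_out.json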

Wednesday, November 6, 2013

Mongodb find locations within certain miles given a coordinate

Note: this has been tested with Mongodb 2.4

I recently got to work on a side project that needed to look for nearby stores within 10 miles of a given coordinate.

The stores collection already stored the latitude and longitude, but not in the format MongoDB's geospatial (2d) index expects, which is longitude first, then latitude:

{ "loc": [ -73.946922, 40.780182 ] }

or this is also acceptable:

{ "loc" : { "lng": -73.946922, "lat": 40.780182 } }


Instead the collection I was working with had this:

{"lat" : 40.780182, "lng" : -73.946922 }

So the first thing I had to do was update the collection to add a new field called "loc" holding the coordinates array (longitude first):

db.stores.find().forEach(function(doc){db.stores.update({_id: doc._id}, { $set: {loc: [doc.lng, doc.lat] }})});

Then build a 2d index:

db.stores.ensureIndex( { loc : "2d" } );


Then, say you want to find all stores that are within 5 miles of a given coordinate; in MongoDB 2.4, all you have to do is issue this command:

db.stores.find( { loc: { $geoWithin: { $centerSphere: [ [ -122.0751549, 37.386812 ] , 5 / 3959 ] } } } );

Note: why divide by 3959? $centerSphere expects the radius in radians, so you divide the distance in miles by the radius of the Earth, which is approximately 3,959 miles.
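The same query also runs fine non-interactively with mongo --eval, which is handy for quick checks from the shell (the database name here is hypothetical):

mongo yourdatabase --eval 'printjson(db.stores.find( { loc: { $geoWithin: { $centerSphere: [ [ -122.0751549, 37.386812 ] , 5 / 3959 ] } } } ).count())'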

Thursday, August 8, 2013

My first experience with Splunk

I've been tasked to explore the Splunk tool, and these are my findings after playing with it for a couple of hours...

Here's what I did...

Downloaded the free trial Enterprise version from splunk.com to install on my windows laptop (splunk-5.0.4-172409-x64-release.msi). With this trial version I get to index 500 MB of data per day, for 60 days.

Our main goal is to use Splunk to explore data sitting on our Hadoop file system, but to start off, all I wanted to do was upload a simple tab-delimited file and write some queries against it... which wasn't as straightforward as I thought! The Splunk folks did mention that ideally the input file should be key-value pairs, but most of the data I work with is CSV/tab-delimited. Anyway, to make the fields show up so you can include them in your Splunk query, you have to define the column names in two files called props.conf and transforms.conf.

I installed Splunk using all the defaults, so I had to create the two files props.conf and transforms.conf here:
C:\Program Files\Splunk\etc\system\local.

Just right-click, create a new text file called props.conf, and type this:

[mydataname]
NO_BINARY_CHECK = 1
pulldown_type = 1
REPORT-myname =  myfieldnames
----------------------------------------------------
--comments:
-- you can replace "mydataname" with anything you want
-- you can replace "myfieldnames" with anything you want, as long as it matches the stanza name in transforms.conf below
----------------------------------------------------

Create another new text file called transforms.conf and type this:

[myfieldnames]
DELIMS = "\t"
FIELDS = "ACTIONID","CALLID","GIQDEVICEID","APPSESSIONID","ACTION","INFO1","INFO2","STARTTIME","ENDTIME","ADDDATE"


-----------------------------------------
--comments
my input file had 10 columns (no header row). So I just typed out the column names in the order they came in.

Then  I went to Splunk Manager, System, and clicked on Server controls, and it gives me an option to Restart Splunk, which I did.

Then I went to Data and imported my tab-delimited file from my local machine using the Add Data to Splunk wizard: select the file, and in the window where it says "Set source type", click on the drop-down list; you should now see "mydataname" in the list. Your data should now appear in the source types on the summary page. I clicked on "mydataname", which brought me to the search page and gave me the default query:
sourcetype="mydataname" (which is equivalent to select * from mydataname).

And now I can further filter my results using the column names I specified in the transforms.conf file.
I wanted to write the equivalent of this:
select count(*)
from mydataname
where ACTION="AppStarted"

Splunk:
sourcetype="giquseraction" AND ACTION="AppStarted" | stats COUNT


So there you go...my first experiment with Splunk...will keep you guys posted on what I find next..



We had the Splunk folks come down to give us a demo a few weeks ago, and I asked them what happens if one of us indexes a huge 1 GB data file by mistake and it pushes us over the limit... (oops). They said it happens all the time, and they give their customers 5 oopsies per month, which means you have 5 chances per month to call your sales representative and ask them to "reset" the quota for that day.

Friday, August 2, 2013

Netezza regexp_like. Regex to find integers only

Regex is very handy, and being able to use it as part of a SQL statement is just plain awesome, I think.

I needed to find all records that have only integers in a particular field, and this is how I did it:

select *
from TABLENAME
where regexp_like(INPUTCOLUMNNAME ,'^\d+$');


Thursday, June 6, 2013

Netezza Regex

I had to update a column on a Netezza table to the digits (up to five of them) that came right after a string like "TID=", but "TID" could be lowercase, uppercase, and so on.

So for example, given Tid=31564abc, I wanted to extract the "31564" portion. Here's how I did it.

update MYTABLE
set MYCOLUMN=  regexp_extract_sp(ColA, '(tid|tID|ntID|ntid|NtID|ntid)=([0-9]{1,5})(&|\b)',1,3)
where adddate >= '2013-03-26 00:00:00'
and adddate < '2013-03-27 00:00:00'
and lower(colA) like '%tid%'
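A quick way to eyeball what the pattern extracts before committing to an update like the one above (a sketch; connection flags omitted, same as in the scripts in the other posts):

nzsql -t -c "select ColA, regexp_extract_sp(ColA, '(tid|tID|ntID|ntid|NtID|ntid)=([0-9]{1,5})(&|\b)', 1, 3)
from MYTABLE
where lower(ColA) like '%tid%'
limit 20;"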

Note: this feature is only available if you installed SQL Extension Toolkit

Friday, March 22, 2013

Mongodb group by count, update

The new Aggregation Framework in MongoDB (available in version 2.2 and above) is pretty useful: rather than writing mapreduce functions, you can use the aggregation framework to do simple group by, count(*), max, and sum types of queries.


--This is how you insert into a collection based on a query on another collection
db.basecollection.aggregate([
  { $group: { _id: "$c1",
              rowcount: { $sum: 1 },
              lastupdated: { $max: "$c2" } } },
  { $sort: { lastupdated: 1 } }
]).result.forEach(function(doc){
  db.aggcollection.insert(doc);
});

The equivalent sql would be something like:
insert into your_aggregate_table (category, rowcount, last_updated)
select category, count(*), max(update_date)
from your_base_table
group by category


--update (insert if it doesn't exist) a collection based on a query on another collection
db.basecollection.aggregate([
  { $group: { _id: "$c1",
              rowcount: { $sum: 1 },
              lastupdated: { $max: "$c2" } } }
]).result.forEach(function(doc){
  db.aggcollection.update({_id: doc._id},
                          { $set: {rowcount: doc.rowcount, lastupdated: doc.lastupdated}},
                          {upsert: true});
});


Thursday, March 21, 2013

useful unix commands for data processing

I use IBM Datastage for ETL at work. Datastage has an "Execute Command Activity" which allows us to issue a command on the operating system, in our case Linux.

Below are commands that have come in very handy for processing data or augmenting my data workflow.

To convert a single column into a comma separated row of values:
paste -s -d, your_input_file
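A quick illustration: if your_input_file contains the three lines 101, 102 and 103 (one value per line, made up for this example), then

paste -s -d, your_input_file

returns:

101,102,103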


To get last row:
tail -1 your_input_file

Friday, February 22, 2013

MongoDB cheatsheet

I've been dabbling quite a bit with MongoDB lately due to my new hobby (developing iPhone apps).

Here's my MongoDB cheatsheet for the commands I run pretty often:

DROPPING COLLECTIONS
db.yourcollection.drop();

DELETIONS
--To remove documents from a collection with ObjectId greater than a given id.
db.yourcollection.remove({"_id":{$gt:ObjectId("51212d98e4b07347c88e0f6f")}})

--To remove documents from a collection with a Column greater than a value
db.yourcollection.remove({yourcolumn:{$gt:50}})

--To remove a document from a collection given its ObjectId.
db.yourcollection.remove({"_id":ObjectId("50ff1185e4b0f9e7d1c23586")});

IMPORT csv file into a collection in MongoDB
mongoimport -h yourmongodbhost -d database -c collection -u user -p password --type csv -f column1name,column2name --file c:\filename.csv

QUERYING
--find a document where ObjectId = an id
db.yourcollection.find({"_id":ObjectId("5101c346ade1db93e6f8bdff")})

--return first 10 rows
db.yourcollection.find().sort({_id:1}).limit(10);

--return last 10 rows; if you want to return just the last row, change to limit(1)
db.yourcollection.find().sort({_id:-1}).limit(10);



Pig Latin - How to do a group by, count distinct

I had to convert this SQL into Pig recently; it took me a few tries and it's not very intuitive to me, so if you ever come across this problem, here's a template you can copy.


SQL:
select month_year, osname, appname, count (distinct sessionid) as distinct_sessions
from tableA
where starttime >= '2013-01-01'
group by month_year, osname, appname;

PIG:

A = LOAD '/user/yourfolder/*/part*' USING PigStorage('\t') AS (startday:chararray, appname:chararray, osname:chararray, uniquesessionid:chararray);

B = FILTER A by startday matches '2013.*';

--project startday to month
C = FOREACH B GENERATE SUBSTRING(TRIM(startday), 0, 7) as monthyear, appname, osname, uniquesessionid;

D = GROUP C BY (monthyear, appname, osname);

SESSIONS = foreach D {
    unique_sessions = DISTINCT C.uniquesessionid;
    generate FLATTEN(group), COUNT(unique_sessions) as sessions;
};

STORE SESSIONS INTO '/user/yourfolder' USING PigStorage('\t');

Thursday, February 21, 2013

netezza nzsql command


nzsql -host YourServer -d YourDatabase -u YourUserName -pw YourPassword

Once you connect, you can just execute SQL statements; end each one with a semicolon, hit enter, and it will display the results of your SQL.

prompt => select  * from yourtableA limit 10;
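You can also skip the interactive prompt entirely and run a single statement with -c (the same flag the scripts in the earlier posts use), redirecting the output to a file:

nzsql -host YourServer -d YourDatabase -u YourUserName -pw YourPassword -t -c "select * from yourtableA limit 10;" > results.txt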

Wednesday, January 9, 2013

Pig null handling

When I sqooped in data from our Netezza database into HDFS, the null values also translated into nulls in the Hadoop file.

To replace null with an empty string, do it in the GENERATE statement.

A = LOAD 'your file' USING PigStorage('\t') AS (col1:chararray, col2:chararray);

B = FOREACH A GENERATE col1, (col2 is null ? '' : TRIM(col2)) as col2;

--Note: I also sneaked in a TRIM to get rid of any trailing spaces in col2; if col2 is null, I set it to an empty string instead.

Friday, January 4, 2013

Doing a Group By and count in PIG

Suppose you want to convert this SQL to Pig Latin.


 select   grouptypeid, count(*)
 from     gpl_limits gl
 group by grouptypeid

In PIG, you'll have to do this:
--begin pig script

rmf /user/tsusanto/test;

--1) load data
A = LOAD '/user/tsusanto/full_load/gpl_limits/part*' USING PigStorage('\t') AS (groupid:chararray, originallimit:chararray, groupdesc:chararray, spentprior:chararray, actuallimit:chararray, effectivedate:chararray, grouptypeid:chararray, stacklim:chararray, displaytypeid:chararray);
 
--2) project required fields
B = FOREACH A GENERATE originallimit, groupid, grouptypeid;

--3) group by grouptypeid
C = GROUP B BY (grouptypeid);

D = FOREACH C generate FLATTEN(group), COUNT(B) as count;

--4)write final resultset
STORE D INTO '/user/tsusanto/test' USING PigStorage('\t');

--end pig script

If you then view the file, you should see 2 columns, the grouptypeid and the count

hadoop fs -text /user/tsusanto/test/part*