Thursday, June 6, 2013

Netezza Regex

I had to update a column on a Netezza table to the first five digits that came after a string like "TID=". The catch: "TID" could be lowercase, uppercase, or any mix of the two.

For example, given Tid=31564abc, I wanted to extract the "31564" portion. Here's how I did it.

update MYTABLE
set MYCOLUMN = regexp_extract_sp(ColA, '(TID|TId|TiD|Tid|tID|tId|tiD|tid)=([0-9]{1,5})(&|\b)', 1, 3)
where adddate >= '2013-03-26 00:00:00'
and adddate < '2013-03-27 00:00:00'
and lower(colA) like '%tid%'

Note: this feature is only available if you installed the SQL Extensions Toolkit.
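If you want to sanity-check the pattern outside the database first, here's a quick Python sketch of the same extraction (this is not the Netezza function, just a stand-in; the case-insensitive flag replaces spelling out every case variant of "TID", and the terminator group is dropped since the greedy quantifier already stops at five digits):

```python
import re

# Case-insensitive match for "TID=" followed by one to five digits.
TID_PATTERN = re.compile(r'tid=([0-9]{1,5})', re.IGNORECASE)

def extract_tid(value):
    """Return the digits after TID= (any case), or None if absent."""
    match = TID_PATTERN.search(value)
    return match.group(1) if match else None

print(extract_tid("Tid=31564abc"))    # 31564
print(extract_tid("x=1&TID=42&y=2"))  # 42
print(extract_tid("no match here"))   # None
```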

Friday, March 22, 2013

Mongodb group by count, update

The new Aggregation Framework from MongoDB (available in version 2.1 and above) is pretty useful: rather than writing map-reduce functions, you can use the aggregation framework for simple group-by, count(*), max, and sum type queries.


--This is how you insert into a collection based on a query on another collection
--(note that $sort is a pipeline stage, not an option to aggregate)
db.basecollection.aggregate([
  { $group: { _id: "$c1",
              rowcount: { $sum: 1 },
              lastupdated: { $max: "$c2" } } },
  { $sort: { lastupdated: 1 } }
]).result.forEach(function(doc) {
  db.aggcollection.insert(doc);
});

The equivalent sql would be something like:
insert into your_aggregate_table (category, rowcount, last_updated)
select category, count(*), max(update_date)
from your_base_table
group by category


--update (insert if it doesn't exist) a collection based on a query on another collection
db.basecollection.aggregate([
  { $group: { _id: "$c1",
              rowcount: { $sum: 1 },
              lastupdated: { $max: "$c2" } } }
]).result.forEach(function(doc) {
  db.aggcollection.update(
    { _id: doc._id },
    { $set: { rowcount: doc.rowcount, lastupdated: doc.lastupdated } },
    { upsert: true });
});
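The $group stage with $sum: 1 and $max is the same shape as a plain dictionary fold. Here's a hypothetical Python sketch of what it computes over a list of documents (field names c1/c2 follow the example above):

```python
# Reimplements { $group: { _id: "$c1", rowcount: { $sum: 1 },
#                          lastupdated: { $max: "$c2" } } } over plain dicts.
def group_count_max(docs):
    groups = {}
    for doc in docs:
        key = doc["c1"]
        if key not in groups:
            groups[key] = {"_id": key, "rowcount": 0, "lastupdated": doc["c2"]}
        entry = groups[key]
        entry["rowcount"] += 1
        entry["lastupdated"] = max(entry["lastupdated"], doc["c2"])
    return list(groups.values())

docs = [
    {"c1": "a", "c2": "2013-03-01"},
    {"c1": "a", "c2": "2013-03-05"},
    {"c1": "b", "c2": "2013-02-11"},
]
print(group_count_max(docs))
```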


Thursday, March 21, 2013

useful unix commands for data processing

I use IBM DataStage for ETL at work. DataStage has an "Execute Command Activity" which lets us issue a command on the operating system (Linux, in our case).

Below are the commands that have proven very handy and efficient in helping me process data or augment my data workflow.

To convert a single column into a comma separated row of values:
paste -s -d, your_input_file


To get last row:
tail -1 your_input_file
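If you'd rather do the same thing inside a script than shell out, both commands have one-line Python equivalents (the sample lines below stand in for the file's contents):

```python
# Stand-in for the lines of your_input_file.
lines = ["alpha", "beta", "gamma"]

# paste -s -d, your_input_file : join all lines into one comma-separated row
print(",".join(lines))  # alpha,beta,gamma

# tail -1 your_input_file : the last line
print(lines[-1])  # gamma
```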

Friday, February 22, 2013

MongoDB cheatsheet

I've been dabbling quite a bit with MongoDB lately due to my new hobby (developing iPhone apps).

Here's my MongoDB cheatsheet for the commands I run most often:

DROPPING COLLECTIONS
db.yourcollection.drop();

DELETIONS
--To remove documents from a collection with ObjectId greater than a given id.
db.yourcollection.remove({"_id":{$gt:ObjectId("51212d98e4b07347c88e0f6f")}})

--To remove documents from a collection where a field is greater than a value
db.yourcollection.remove({yourcolumn:{$gt:50}})

--To remove documents from a collection with ObjectId given an id. 
db.yourcollection.remove({"_id":ObjectId("50ff1185e4b0f9e7d1c23586")});

IMPORT csv file into a collection in MongoDB
mongoimport -h yourmongodbhost -d database -c collection -u user -p password --type csv -f column1name,column2name --file c:\filename.csv

QUERYING
--find a document where ObjectId = an id
db.yourcollection.find({"_id":ObjectId("5101c346ade1db93e6f8bdff")})

--return first 10 rows
db.yourcollection.find().sort({_id:1}).limit(10);

--return last 10 rows; to return just the last row, change limit(10) to limit(1)
db.yourcollection.find().sort({_id:-1}).limit(10);



Pig Latin - How to do a group by, count distinct

I had to convert this SQL into Pig recently. It took me a few tries and wasn't very intuitive to me, so if you ever come across this problem, here's a template you can copy.


SQL:
select month_year, osname, appname, count (distinct sessionid) as distinct_sessions
from tableA
where starttime >= '2013-01-01'
group by month_year, osname, appname;

PIG:

A = LOAD '/user/yourfolder/*/part*' USING PigStorage('\t') AS (startday:chararray, appname:chararray, osname:chararray, uniquesessionid:chararray);

B = FILTER A by startday matches '2013.*';

--project startday to month
C = FOREACH B GENERATE SUBSTRING(TRIM(startday), 0, 7) as monthyear, appname, osname, uniquesessionid;

D = GROUP C BY (monthyear, appname, osname);

SESSIONS = foreach D {
    unique_sessions = DISTINCT C.uniquesessionid;
    generate FLATTEN(group), COUNT(unique_sessions) as sessions;
};

STORE SESSIONS INTO '/user/yourfolder' USING PigStorage('\t');
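The Pig idiom above (a DISTINCT inside a nested FOREACH) is easy to misread, so here's a plain-Python sketch of the same count-distinct-per-group logic; the column names follow the example, and the filter/substring steps mirror the FILTER and GENERATE lines:

```python
from collections import defaultdict

def distinct_sessions(rows):
    """rows: (startday, appname, osname, sessionid) tuples.
    Returns {(month_year, appname, osname): distinct session count}."""
    sessions = defaultdict(set)
    for startday, appname, osname, sessionid in rows:
        if not startday.startswith("2013"):
            continue                       # FILTER A BY startday matches '2013.*'
        month_year = startday.strip()[:7]  # SUBSTRING(TRIM(startday), 0, 7)
        sessions[(month_year, appname, osname)].add(sessionid)
    return {key: len(ids) for key, ids in sessions.items()}

rows = [
    ("2013-01-05", "app1", "ios", "s1"),
    ("2013-01-09", "app1", "ios", "s1"),  # same session counted once
    ("2013-01-12", "app1", "ios", "s2"),
    ("2012-12-30", "app1", "ios", "s3"),  # filtered out by the date check
]
print(distinct_sessions(rows))
```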

Thursday, February 21, 2013

netezza nzsql command


nzsql -host YourServer -d YourDatabase -u YourUserName -pw YourPassword

Once you connect, you can just execute SQL statements: end each with a semicolon, hit Enter, and it will display the results of your SQL.

prompt => select  * from yourtableA limit 10;

Wednesday, January 9, 2013

Pig null handling

When I Sqooped data from our Netezza database into HDFS, the null values also translated into nulls in the Hadoop file.

To replace null with an empty string, do it in the GENERATE statement.

A = LOAD 'your file' USING PigStorage('\t') AS (col1:chararray, col2:chararray);

B = FOREACH A GENERATE col1, (col2 is null ? '' : TRIM(col2)) as col2;

--Note: I also sneaked in a TRIM to get rid of trailing spaces in col2; if col2 is null, I set it to an empty string instead.
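The null-vs-trim coalescing in that GENERATE line, sketched in Python (None stands in for Pig's null; this is just an illustration of the ternary, not Pig itself):

```python
def clean(col2):
    # Mirrors (col2 is null ? '' : TRIM(col2)) from the Pig GENERATE statement.
    return "" if col2 is None else col2.strip()

print(repr(clean(None)))     # ''
print(repr(clean(" abc ")))  # 'abc'
```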