Tuesday, December 3, 2013
We had to figure out the usage against one of our biggest tables to decide how to archive it. We keep a history of all queries issued against the table, and I wanted to extract the date filter used in the SQL.
There are many ways to do it, but I wanted to do it in a bash script just for fun... It's probably not the most efficient way, but I'm posting it anyway, mostly for my own benefit in case I need to figure out how to do something in bash again, like how to find the position of a word in a string or how to extract fields from a line.
#! /bin/bash
#goal: to extract the dates used in where condition in sql ie adddate >= '2013-11-01' or daystring=20131105
nzsql -t -host serverhere -d databasehere -u usernamehere -W passwordhere -c "select QH_SESSIONID, QH_TSUBMIT, QH_USER, QH_SQL from QUERYHISTORY" > queryhistory.txt 2>&1
rm -f lhqueryhistory1.txt
while read line
do
if [ -z "$line" ];
then
echo "line is empty"
break
fi
sessionid=`echo $line | cut -d"|" -f1`
starttime=`echo $line | cut -d"|" -f2`
username=`echo $line | cut -d"|" -f3`
sql=`echo $line | cut -d"|" -f4`
#cut out sql starting from the where word
wherepos=$(echo $sql | awk '{print match(tolower($0), "where")}')
#if no where condition skip that sql
if [ ${wherepos:-0} -eq 0 ];
then
continue
fi
wheresql=$(echo ${sql:${wherepos}})
#cut out sql starting from the adddate or daystring word
wherepos2=$(echo $wheresql | awk '{print match(tolower($0), "adddate|daystring")}')
if [ ${wherepos2:-0} -eq 0 ];
then
wheresql2="none and"
else
wherepos2=$(($wherepos2 - 1))
wheresql2=$(echo ${wheresql:${wherepos2}})
fi
#cut out sql up until the and, group or limit word
wherepos3=$(echo $wheresql2 | awk '{print match(tolower($0), "and|group|limit")}')
if [ ${wherepos3:-0} -eq 0 ];
then
wheresql3=$wheresql2
else
wherepos3=$(($wherepos3 - 1))
wheresql3=$(echo ${wheresql2:0:${wherepos3:-0}})
fi
#extract the date portion, starting at the first single quote if there is one
wherepos4=$(expr index "$wheresql3" "'") #position of the first single quote, 0 if none
if [ ${wherepos4:-0} -eq 0 ];
then
wheresql4=$wheresql3
else
wherepos4=$(($wherepos4 - 1))
wheresql4=$(echo ${wheresql3:${wherepos4}})
fi
finalrow="$sessionid|$starttime|$username|$wheresql4"
echo $finalrow >> lhqueryhistory1.txt
done < queryhistory.txt
rm -f queryhistory.txt
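For reference, here are the two building blocks I always have to look up, finding the position of a word in a string and extracting a field from a delimited line, in isolation with a made-up sample row:
#! /bin/bash
#made-up sample row in the same pipe-delimited layout nzsql produces
line="1234 | 2013-11-05 09:15:00 | jdoe | select col1 from sales where adddate >= '2013-11-01'"
sql=$(echo "$line" | cut -d"|" -f4)                                 #grab the 4th pipe-delimited field
wherepos=$(echo "$sql" | awk '{print match(tolower($0), "where")}') #1-based position of "where", 0 if absent
echo "position of where: $wherepos"
echo "text from where onwards: ${sql:$((wherepos - 1))}"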
Friday, November 22, 2013
Netezza ERROR: 65798 : no more space for TXMgr transactions array
Every now and then in Netezza, we'll get this error message: ERROR: 65798 : no more space for TXMgr transactions array.
When this happens, nobody can issue a query against the database, and the database is unusable.
There is a limit of approximately 65,000 transaction objects in Netezza. Once that limit is reached, the system will not accept any new queries. It usually isn't a problem, because as transactions complete, the slots they took up are freed and given to new queries. However, sometimes queries get stuck and hang onto their transaction objects until we go into the system and manually abort those sessions.
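Doing that by hand looks roughly like this (same nzsql / _v_session / nzsession commands the script below uses; the session id 123456 is just a placeholder):
#list active sessions and how long they have been connected
nzsql -t -c "select id, conntime, username, dbname, command from _v_session where status='active';"
#abort the offending session by id
nzsession abortTxn -force -id 123456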
This script looks for queries that have been running for more than 24 hours (the threshold is an argument, so it's configurable), sends an email showing how long each one has been running, the username, and the SQL command, and automatically kills the offending sessions.
#! /bin/bash
MaxMinutes="${1}"
if [ -z "$1" ]
then MaxMinutes=1440
fi
rowcount=0
nzsql -t -c "select id, conntime, username, dbname, command from _v_session where status='active' and command is not null
;" > sessionstokill.txt 2>&1
CurrentTimeEpoch=$(date +%s)
message="Long Running Queries:\n----------------\n"
while read line
do
id=`echo $line | cut -d"|" -f1`
starttime=`echo $line | cut -d"|" -f2`
username=`echo $line | cut -d"|" -f3`
dbname=`echo $line | cut -d"|" -f4`
command=`echo $line | cut -d"|" -f5`
StartTimeEpoch=$(date -d "$starttime" +%s)
DifferenceInMinutes=$((($CurrentTimeEpoch - $StartTimeEpoch)/(60)))
#if it has been running for more than X minutes (argument 1)
if [ $DifferenceInMinutes -ge $MaxMinutes ] && [ $id -gt 0 ];
then
rowcount=$((rowcount+1));
message+="SESSIONID: ${id} has been running for ${DifferenceInMinutes} minutes.\nUSERNAME:${username}\nDATABASE:${dbname}\nSQL:${command}\n"
#we only want to kill it if the queries came from these users
if [[ "$username" == *INTERNAL*_W* ]] || [[ "$username" == *ADMIN* ]];
then
message+="\n!!This session will be killed!!\n";
nzsession abortTxn -force -id $id
fi
message+="-----------------\n\n"
fi
done < sessionstokill.txt
if [ $rowcount -gt 0 ]
then
sendMail -dst "youremail@yourcompany.com" -msg "Long Running Queries / Hung Transactions NZ" -bodyText "$message"
fi
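If you want the check to run unattended, a cron entry will do; here's a sketch assuming the script is saved as /home/myfolder/kill_long_queries.sh (a made-up path), checking every 30 minutes with the default 1440-minute threshold:
*/30 * * * * /home/myfolder/kill_long_queries.sh >> /home/myfolder/kill_long_queries.log 2>&1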
Wednesday, November 20, 2013
Exporting resultset using Impala and chaining multiple commands
Using Hue's Impala UI, you can download the resultset as CSV or Excel. But the Impala shell offers a much more flexible way of exporting a resultset, since you can specify any delimiter you want. Another benefit over the Hue Impala UI (as of Impala 1.1) is that in the UI you cannot execute multiple statements, even though it doesn't report a failure; only the first statement gets executed.
If you want to chain a few commands, create a script file (it can have any extension, although the convention is to give it a .hql extension).
Put all your SQL commands in the file. The last command should select the output you want to save. To execute the file, type:
impala-shell -f myscript.hql -B -o results.txt --output_delimiter='\t';
The above command invokes the Impala shell, runs the commands stored in myscript.hql, and writes the results to a file called "results.txt" with tab as the field delimiter.
Contents of myscript.hql:
drop table if exists mytable;
create table mytable (adddate string, rowcount bigint);
insert into mytable
select to_date(adddate), count(*)
from transactionmaster
group by to_date(adddate);
select * from mytable;
You can then view the output in the file you specified, in my case:
cat results.txt
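Since the delimiter is just a flag, the same command can produce a comma-delimited file instead; for example:
impala-shell -f myscript.hql -B -o results.csv --output_delimiter=',';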
Thursday, November 14, 2013
Hive / Impala - create external tables with data from subfolders
At my workplace, we already store a lot of files in our HDFS, and I wanted to create Impala tables against them. But our files are stored as LZO-compressed files, and as of Impala 1.1 you cannot create tables that use LZO files through Impala; you can, however, create them in Hive and then query them in Impala.
Our typical folder structure looks like this:
/data/mylogfiles/20130101
/data/mylogfiles/20130102
/data/mylogfiles/20130103
So basically we create a folder for each day of data.
At first I tried this, and it doesn't work:
create external table mylogfile
(
keyword string,
campaign_id string,
mydate string,
mytime string,
display_site string,
was_clicked int,
cpc int,
country string,
placement string)
PARTITIONED BY (log_date STRING)
ROW FORMAT DELIMITED
fields terminated by '\t'
stored as textfile
location '/data/mylogfiles/*/';  -- doesn't work!!!
Then I read about partitions, and this works... although I do think it's a hassle to have to execute a statement to add a new partition for each folder of data in our HDFS (see the shell loop sketch after the ALTER TABLE examples below for one way to script it). If you know of a better way, please feel free to leave it in the comments section.
create external table mylogtable
(
keyword string,
campaign_id string,
mydate string,
mytime string,
display_site string,
was_clicked int,
cpc int,
country string,
placement string)
PARTITIONED BY (folderdate BIGINT)
ROW FORMAT DELIMITED
fields terminated by '\t';  -- leave the location empty
Note: when you first go to impala after creating the table in Hive, you will need to issue these 2 commands or the table will not be visible in Impala:
invalidate metadata;
refresh mylogtable;
You only need to create the table in Hive; then you add the partitions in the Impala shell.
ALTER TABLE mylogtable ADD PARTITION (folderdate=20130101) LOCATION '/data/mylogfiles/20130101';
ALTER TABLE mylogtable ADD PARTITION (folderdate=20130102) LOCATION '/data/mylogfiles/20130102';
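If you have a lot of daily folders, a small shell loop can generate the ALTER TABLE statements for you. This is just a rough sketch, assuming the folder name is the folderdate value and impala-shell is on the PATH (it doesn't check whether a partition already exists):
#! /bin/bash
#add one partition per daily folder under /data/mylogfiles
for dir in $(hadoop fs -ls /data/mylogfiles | awk '{print $NF}' | grep '^/data/mylogfiles/')
do
folderdate=$(basename $dir)
impala-shell -q "ALTER TABLE mylogtable ADD PARTITION (folderdate=${folderdate}) LOCATION '${dir}';"
done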
You can do a quick sanity check after adding each partition... the rowcount should go up.
select count(*)
from mylogtable;
Notice that the partition key is not an existing column of the table. You just give the partition your own name and assign the appropriate folder to that partition. In my example above, I am effectively naming the partition "20130101" for the files located in the HDFS folder "/data/mylogfiles/20130101". I could have named the partition "mypartition1" if I wanted to.
The real use of partitions, though, is to speed up queries by reading only the necessary data, if you know which partitions the data resides in.
For example, if you are looking for data with mydate = 2013-11-02, and you know that the data resides in the 20131102 and 20131103 partitions, you can issue a statement like this which will speed up the query:
select count(*)
from mylogtable
where to_date(mydate) >= '2013-11-02' and to_date(mydate) < '2013-11-03'
and folderdate in (20131102,20131103); -- folderdate is the partition key, not an actual column name
Regular expression
In Pig, say you want to match only February through May of 2013:
/*
* Include only records where the 'order_dtm' field matches
* the regular expression pattern:
*
* ^ = beginning of string
* 2013 = literal value '2013'
* 0[2345] = 0 followed by 2, 3, 4, or 5
* - = a literal character '-'
* \\d{2} = exactly two digits
* \\s = a single whitespace character
* .* = any number of any characters
* $ = end of string
*
* If you are not familiar with regular expressions and would
* like to know more about them, see the Regular Expression
* Reference at the end of the Exercise Manual.
*/
A = FILTER data by order_date matches '^2013-0[2345]-\\d{2}\\s.*$';
Wednesday, November 13, 2013
More Pig stuff
Given a 2 column dataset - Salesperson, SalesAmount
Mary 20000
Tom 1200
Mary 6000
Jane 8000
Jim 9000
Tom 20000
Amy 28000
Barry 35000
Charles 2400
Dawson 7384
Haley 2847
Kelly 29495
Lucy 3648
Here's how you get each salesperson's top 3 sales amounts in Pig, using a nested FOREACH with ORDER and LIMIT:
salespeople = LOAD 'data' AS (salesperson:chararray, salesamount:int);
salespeople_group = GROUP salespeople BY salesperson;
top_sales = FOREACH salespeople_group {
sorted = ORDER salespeople BY salesamount DESC;
highest_sales = LIMIT sorted 3;
GENERATE group as salesperson, highest_sales;
};
Monday, November 11, 2013
Datastage stop and reset long hanging jobs
We run a lot of load jobs from our source SQL Server databases into Netezza using DataStage... these are just simple table-to-table loads with no data transformation, and they run every hour.
Every now and then, some of these jobs will hang but not abort, and they will sit in a perpetual running state until one of us comes in, manually stops the job, and recompiles it; then the next hourly scheduled run will kick off successfully.
I wrote a little shell script to check for DataStage jobs that have been running for more than a certain interval; if a job is on the "okay to reset and kill" list (stored in a text file), the script stops and resets it using dsjob commands.
#! /bin/bash
PROG=`basename ${0}`
NOW=`date +"%Y-%m-%d %H:%M:%S"`
EXIT_STATUS=0
ProjectName="${1}"
MaxMinutesBeforeReset="${2}"
if [ ${#} -ne 2 ]; then
echo "${PROG} : Invalid parameter list. The script needs 2 parameters:"
echo "Param 1 : DS Project Name "
echo "Param 2 : MinutesBeforeReset"
EXIT_STATUS=99
echo ${NOW} ${PROG} Exiting without completion with status [${EXIT_STATUS}]
exit ${EXIT_STATUS}
fi
#go to /opt/IBM/InformationServer/Server/DSEngine
BinFileDirectory=`cat /.dshome`/bin
cd ${BinFileDirectory}
#Get current epochtime to as a baseline
CurrentTimeEpoch=$(date +%s)
#check for current running Datastage jobs
ps aux | grep 'DSD.RUN' | tr -s ' ' | cut -d" " -f13 | tr -d '\.' | while read JobName;
do
#check if it is in the jobs to monitor & reset file, if not skip it
if grep -Fxq "$JobName" /home/myfolder/JobsToMonitorAndReset.txt
then
#Get starttime which is on the 3rd row after the colon
StartTime=$(./dsjob -jobinfo $ProjectName $JobName | sed -n 3p | grep -o ": .*" | grep -o " .*")
StartTimeEpoch=$(date -d "$StartTime" +%s)
DifferenceInMinutes=$((($CurrentTimeEpoch - $StartTimeEpoch)/(60)))
echo "$JobName has been running for $DifferenceInMinutes minutes"
#if it has been running more than x (2nd argument) minutes, stop and reset job
if [ $DifferenceInMinutes -ge $MaxMinutesBeforeReset ];
then
echo "$JobName will be stopped and reset."
./dsjob -stop $ProjectName $JobName
./dsjob -run -mode RESET -wait -jobstatus $ProjectName $JobName
exit 0
fi
fi
done
exit 0
---------------------------------------------------------------
If you want to monitor only specific jobs, just add the DataStage JobName to your file; I stored mine in /home/myfolder/JobsToMonitorAndReset.txt.
You can send an email to yourself too with the jobs that were stopped and reset.
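One low-tech way to do that is to append each reset job name to a temp file inside the loop (e.g. echo "$JobName" >> /tmp/jobs_reset.txt right after the dsjob reset commands) and mail the file at the end; a rough sketch, assuming mailx is available (the /tmp path is just a placeholder):
#mail the list of reset jobs, if there were any
if [ -s /tmp/jobs_reset.txt ]; then
mailx -s "DataStage jobs stopped and reset" youremail@yourcompany.com < /tmp/jobs_reset.txt
rm -f /tmp/jobs_reset.txt
fi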
Thursday, November 7, 2013
Convert string to JSON array using Perl
Original JSON file (note: the string values are delimited by colons):
[
{"cxbrandid":"100158:100159","cxbrandname":"APPLE, ORANGE & BANANA:JUICE JUICE"},
{"cxbrandid":"3000","cxbrandname":"JUNGLE, FOREST"}
]
Desired output JSON file:
[{"cxbrandid":["100158","100159"],"cxbrandname":["APPLE, ORANGE & BANANA","JUICE JUICE"]},{"cxbrandid":["3000"],"cxbrandname":["JUNGLE, FOREST"]}]
Perl script to do that:
========================================================================
use JSON;
my $json;
{
local $/; #Enable 'slurp' mode
open my $fh, "<", "t.json";
$json = <$fh>;
close $fh;
}
my $data = decode_json($json);
#count number of elements in array and minus one
$t = scalar(@{$data});
$t = $t - 1;
for (my $i=0; $i <= $t; $i++) {
$cxbrandname = $data->[$i]->{'cxbrandname'};
$cxbrandid = $data->[$i]->{'cxbrandid'};
my $brandidtable=[];
my $brandnametable=[];
@cxbrandidarray=split(/:/, $cxbrandid);
@cxbrandnamearray=split(/:/, $cxbrandname);
foreach my $val (@cxbrandidarray) {
push @{$brandidtable}, $val;
}
foreach my $val (@cxbrandnamearray) {
push @{$brandnametable}, $val;
}
$data->[$i]->{'cxbrandid'}= $brandidtable;
$data->[$i]->{'cxbrandname'}= $brandnametable;
}
open my $fh, ">", "data_out.json";
print $fh encode_json($data);
close $fh;
Wednesday, November 6, 2013
Mongodb find locations within certain miles given a coordinate
Note: this has been tested with Mongodb 2.4
I recently got to work on a side project that needed to look for nearby stores within 10 miles of a given coordinate.
The stores collection already stored the latitude and longitude, but not in the coordinate format the geospatial index expects, which is this:
{ "loc": [ -73.946922, 40.780182 ] }
or this, which is also acceptable:
{ "loc" : { "lng": -73.946922, "lat": 40.780182 } }
Instead the collection I was working with had this:
{"lat" : 40.780182, "lng" : -73.946922 }
So first thing, I had to update the collection to add a new field, and set a new property called "loc" to store a coordinates array:
db.stores.find().forEach( function(doc){db.stores.update({_id: doc._id}, { $set: {loc: [doc.longitude, doc.latitude] }})});
Then build an index
db.stores.ensureIndex( { loc : "2d" } );
Then, after that, say you want to find all stores that are within 5 miles of a given coordinate; in MongoDB 2.4, all you have to do is issue this command:
db.stores.find( { loc: { $geoWithin: { $centerSphere: [ [ -122.0751549, 37.386812 ] , 5 / 3959 ] } } } );
Note: why divide by 3959? The radius of the Earth is approximately 3,959 miles.
Thursday, August 8, 2013
My first experience with Splunk
I've been tasked to explore the Splunk tool, and these are my findings after playing with it for a couple of hours...
Here's what I did...
Downloaded the free trial Enterprise version from splunk.com to install on my windows laptop (splunk-5.0.4-172409-x64-release.msi). With this trial version I get to index 500 MB of data per day, for 60 days.
Our main goal is to use Splunk to explore data sitting on our Hadoop file system, but to start off, all I wanted to do was upload a simple tab-delimited file and write some queries against it... it wasn't as straightforward as I thought! The Splunk folks did mention that ideally the input file should be key-value pairs of data, but most of the data I work with is CSV/tab delimited. Anyway... to make the fields show up so you can include them in your Splunk query, you have to define the column names in two files called props.conf and transforms.conf.
I installed splunk using all the defaults so I had to create the 2 files props.conf and transforms.conf here:
C:\Program Files\Splunk\etc\system\local.
Just right-click to create a new text file, and in props.conf, type this:
[mydataname]
NO_BINARY_CHECK = 1
pulldown_type = 1
REPORT-myname = myfieldnames
----------------------------
--comments:
you can replace "mydataname" with anything you want
you can replace "myfieldnames" with anything you want
----------------------------------------------------
Create a new text file and in transforms.conf, type this:
[myfieldnames]
DELIMS = "\t"
FIELDS = "ACTIONID","CALLID","GIQDEVICEID","APPSESSIONID","ACTION","INFO1","INFO2","STARTTIME","ENDTIME","ADDDATE"
-----------------------------------------
--comments
my input file had 10 columns (no header row). So I just typed out the column names in the order they came in.
Then I went to Splunk Manager, System, and clicked on Server controls, and it gives me an option to Restart Splunk, which I did.
Then I went to Data and imported my tab-delimited file sitting on my local machine using the Add Data to Splunk wizard. I selected the file, and in the window where it says "Set source type", if you click on the drop-down list, you should now see "mydataname" in it. Your data should now appear in the source types of the summary page. I clicked on "mydataname", which brought me to the search page and gave me the default query:
sourcetype="mydataname" (which is equivalent to select * from mydataname).
And now I can further filter my results using the column names I specified in the transforms.conf file.
I wanted to write the equivalent of this:
select count(*)
from mydataname
where ACTION="AppStarted"
Splunk:
sourcetype="giquseraction" AND ACTION="AppStarted" | stats COUNT
So there you go...my first experiment with Splunk...will keep you guys posted on what I find next..
We had the Splunk folks come down to give us a demo a few weeks ago, and I asked them what happens if one of us indexes a huge 1 GB data file by mistake and it puts us over the limit... (oops)... they say it happens all the time, and they give their customers 5 oopsies per month, which means you have 5 chances per month to call your sales representative and ask them to "reset" the quota for that day.
Friday, August 2, 2013
Netezza regexp_like. Regex to find integers only
Regex is very handy, and being able to use it as part of a SQL statement is just plain awesome, I think.
I needed to find all records that have only integers in a particular field, and this is how I did it:
select *
from TABLENAME
where regexp_like(INPUTCOLUMNNAME ,'^\d+$');
Thursday, June 6, 2013
Netezza Regex
I had to update a column on a Netezza table to the first 5 digits that came after a string like "TID=", but "TID" could be lowercase, uppercase, etc.
So, for example, given Tid=31564abc, I wanted to extract the "31564" portion. Here's how I did it:
update MYTABLE
set MYCOLUMN= regexp_extract_sp(ColA, '(tid|tID|ntID|ntid|NtID|ntid)=([0-9]{1,5})(&|\b)',1,3)
where adddate >= '2013-03-26 00:00:00'
and adddate < '2013-03-27 00:00:00'
and lower(colA) like '%tid%'
Note: this feature is only available if you installed SQL Extension Toolkit
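Before running the update, it's worth previewing what the regex pulls out with a plain select; here's a sketch wrapping the same regexp_extract_sp call in nzsql (server, database, and credentials are placeholders):
nzsql -host yourserver -d yourdatabase -u yourusername -pw yourpassword -c "
select ColA, regexp_extract_sp(ColA, '(tid|tID|ntID|ntid|NtID|ntid)=([0-9]{1,5})(&|\b)',1,3) as extracted_tid
from MYTABLE
where adddate >= '2013-03-26 00:00:00'
and adddate < '2013-03-27 00:00:00'
and lower(colA) like '%tid%'
limit 20;"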
Friday, March 22, 2013
Mongodb group by count, update
The new Aggregation Framework in MongoDB (available in version 2.1 and above) is pretty useful... rather than writing mapreduce functions, you can use the aggregation framework to do simple group by, count(*), max, and sum types of queries.
--This is how you insert into a collection based on query on another collection
db.basecollection.aggregate( [
{ $group: { _id: "$c1",
rowcount: { $sum: 1 },
lastupdated: { $max: "$c2" } } }
], {$sort:{'lastupdated':1}} ).result.forEach(function(doc){
db.aggcollection.insert(doc)});
The equivalent sql would be something like:
insert into your_aggregate_table (category, rowcount, last_updated)
select category, count(*), max(update_date)
from your_base_table
group by category
--update (insert if don't exist) a collection based on query on another collection
db.basecollection.aggregate( [
{ $group: { _id: "$c1",
rowcount: { $sum: 1 },
lastupdated: { $max: "$c2" } } }
]).result.forEach(function(doc){
db.aggcollection.update({_id: doc._id}, { $set: {rowcount: doc.rowcount, lastupdated: doc.lastupdated}}, {upsert:true})
});
Thursday, March 21, 2013
useful unix commands for data processing
I use IBM DataStage for ETL at work. DataStage has an "Execute Command Activity" that allows us to issue a command on the operating system, in our case Linux.
Below are the commands that have come in very handy and efficient in helping me process data or augment my data workflows.
To convert a single column into a comma separated row of values:
paste -s -d, your_input_file
To get last row:
tail -1 your_input_file
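A quick illustration of both, using a throwaway input file:
printf "10\n20\n30\n" > your_input_file
paste -s -d, your_input_file    #prints 10,20,30
tail -1 your_input_file         #prints 30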
Friday, February 22, 2013
MongoDB cheatsheet
I've been dabbling quite a bit with MongoDB lately due to my new hobby (developing iPhone apps).
Here's my MongoDB cheatsheet with the commands I run pretty often:
DROPPING COLLECTIONS
db.yourcollection.drop();
DELETIONS
--To remove documents from a collection with ObjectId greater than a given id.
db.yourcollection.remove({"_id":{$gt:ObjectId("51212d98e4b07347c88e0f6f")}})
--To remove documents from a collection with a Column greater than a value
db.yourcollection.remove({yourcolumn:{$gt:50}})
--To remove documents from a collection with ObjectId given an id.
db.yourcollection.remove({"_id":ObjectId("50ff1185e4b0f9e7d1c23586")});
IMPORT csv file into a collection in MongoDB
mongoimport -h yourmongodbhost -d database -c collection -u user -p password --type csv -f column1name,column2name --file c:\filename.csv
QUERYING
--find a document where ObjectId = an id
db.yourcollection.find({"_id":ObjectId("5101c346ade1db93e6f8bdff")})
--return first 10 rows
db.yourcollection.find().sort({_id:1}).limit(10);
--return last 10 rows; to return only the last row, just change it to limit(1)
db.yourcollection.find().sort({_id:-1}).limit(10);
Pig Latin - How to do a group by, count distinct
I had to convert this SQL into Pig recently. It took me a few tries and it's not very intuitive to me, so if you ever come across this problem, here's a template you can copy.
SQL:
select month_year, osname, appname, count (distinct sessionid) as distinct_sessions
from tableA
where starttime >= '2013-01-01'
group by month_year, osname, appname;
PIG:
A = LOAD '/user/yourfolder/*/part*' USING PigStorage('\t') AS (startday:chararray, appname:chararray, osname:chararray, uniquesessionid:chararray);
B = FILTER A by startday matches '2013.*';
--project startday to month
C = FOREACH B GENERATE SUBSTRING(TRIM(startday), 0, 7) as monthyear, appname, osname, uniquesessionid;
D = GROUP C BY (monthyear, appname, osname);
SESSIONS = foreach D {
unique_sessions = DISTINCT C.uniquesessionid;
generate FLATTEN(group), COUNT(unique_sessions) as sessions;
};
STORE SESSIONS INTO '/user/yourfolder' USING PigStorage('\t');
Thursday, February 21, 2013
netezza nzsql command
nzsql -host YourServer -d YourDatabase -u YourUserName -pw YourPassword
Once you connect, you can just execute SQL statements; end each one with a semicolon, hit enter, and it will display the results of your SQL.
prompt => select * from yourtableA limit 10;
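You can also skip the interactive prompt and run a statement directly with -c (add -t if you just want the rows with no headers), which is handy in shell scripts; for example, writing a rowcount to a file:
nzsql -host YourServer -d YourDatabase -u YourUserName -pw YourPassword -t -c "select count(*) from yourtableA;" > rowcount.txt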
Wednesday, January 9, 2013
Pig null handling
When I sqooped data from our Netezza database into HDFS, the null values also translated into nulls in the Hadoop files.
To replace null with an empty string, do it in the GENERATE statement.
A = LOAD 'your file' USING PigStorage('\t') AS (col1:chararray, col2:chararray);
B = FOREACH A GENERATE col1, (col2 is null ? '' : TRIM(col2)) as col2;
--Note: I also sneaked in a TRIM to get rid of any trailing spaces in col2, otherwise if it's a null, I set it to empty string.
Friday, January 4, 2013
Doing a Group By and count in PIG
Suppose you want to convert this SQL to Pig Latin:
select grouptypeid, count(*)
from gpl_limits gl
group by grouptypeid
In PIG, you'll have to do this:
--begin pig script
rmf /user/tsusanto/test;
--1) load data
A = LOAD '/user/tsusanto/full_load/gpl_limits/part*' USING PigStorage('\t') AS (groupid:chararray, originallimit:chararray, groupdesc:chararray, spentprior:chararray, actuallimit:chararray, effectivedate:chararray, grouptypeid:chararray, stacklim:chararray, displaytypeid:chararray);
--2) project required fields
B = FOREACH A GENERATE originallimit, groupid, grouptypeid;
--3) group by grouptypeid
C = GROUP B BY (grouptypeid);
D = FOREACH C generate FLATTEN(group), COUNT(B) as count;
--4)write final resultset
STORE D INTO '/user/tsusanto/test' USING PigStorage('\t');
--end pig script
If you then view the file, you should see 2 columns, the grouptypeid and the count
hadoop fs -text /user/tsusanto/test/part*