Here are some code snippets for doing data manipulation in your *nix environment.
Say you have a text file test.txt containing:
a
b
c
and you want to convert it to a,b,c (in other words, convert rows to columns). You can run this:
tr -s '\n' ',' < test.txt
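Note that tr leaves a trailing comma on the output (a,b,c,). If you don't want it, paste from coreutils joins the lines without one:
paste -sd, test.txt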
Say your text file contains a,b,c and you want to single-quote each value, like 'a','b','c'. You can run this:
awk '{n=split($0,a,","); for (i = 1; i <= n; i++) print "'\''" a[i] "'\''," }' /tmp/your_csv_row.txt | tr -s '\n' ',' | sed 's/.$//'
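This works because tr -s squeezes the doubled commas down to a single comma and sed trims the trailing one. A shorter sed-only alternative should also work, assuming no empty fields:
sed "s/[^,]*/'&'/g" /tmp/your_csv_row.txt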
Wednesday, October 8, 2014
Impala create table, add partitions etc cheatsheet
I used Pig to convert the incoming CSV file to Parquet format, then in Hive created the external table:
create external table salestransactions
(
PartnerCode string,
TransactionNumber string,
TransactionEndDateTime string,
TransactionType string,
TransactionSubType string,
StoreCode string
)
PARTITIONED BY (batchdate BIGINT)
ROW FORMAT DELIMITED
fields terminated by '\t'
STORED AS
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';
Store the Parquet data files in this HDFS folder: /user/data/salestransactions_parquet/20141002
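If you need to copy the files into HDFS yourself (Pig usually writes there directly; the local file name below is illustrative), something like this works:
hdfs dfs -mkdir -p /user/data/salestransactions_parquet/20141002
hdfs dfs -put part-m-00000.parquet /user/data/salestransactions_parquet/20141002/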
Then, again in Hive, issue this command:
ALTER TABLE salestransactions ADD PARTITION (batchdate=20141002) LOCATION '/user/data/salestransactions_parquet/20141002';
Then in your impala-shell, if this is your first time accessing the table you just created in Hive, type "invalidate metadata", then type "refresh salestransactions". If you then type "show partitions salestransactions", you should see the partition you just added.
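In impala-shell, that sequence looks like this:
invalidate metadata;                   -- reload the catalog so Impala sees the new table
refresh salestransactions;             -- pick up the newly added partition and files
show partitions salestransactions;     -- batchdate=20141002 should now be listed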
To drop a partition:
alter table salestransactions drop partition (batchdate=20141002);
Thursday, February 13, 2014
Configuring Microstrategy 9.3.1 to work with Impala 1.2.3
We have Microstrategy 9.3.1 installed on CentOS 6.
Installing the driver and setting it up was relatively easy, as long as you follow the instructions exactly.
I downloaded the driver from Cloudera's page here:
http://www.cloudera.com/content/support/en/downloads/download-components/download-products/downloads-listing/connectors/microstrategy.html
I followed Cloudera's instructions here to install the ODBC connector: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Connectors/MicroStrategy/Cloudera-Connector-for-MicroStrategy/ccms_linux.html
Then I followed the instructions here to set things up on our Microstrategy linux box:
https://resource.microstrategy.com/support/Authoring/DisplayTN.aspx?formatted=1&tnkey=43595
I did have to open a ticket to get a PDS file from Microstrategy and install it so that I could see 'Impala 1.x' in the drop-down list when setting up a new database instance in Microstrategy admin.
odbc.ini file:
[ODBC Data Sources]
IMPALA-SERVER=Hive ODBC Driver
[IMPALA-SERVER]
Driver=/usr/lib/libhiveodbc.so.1
Description=Hive ODBC Driver
DATABASE=default
HOST=impalaservername
PORT=21050
FRAMED=0
Trace=Yes
TraceFile=/tmp/odbc.log
Type=HS2NoSasl
ODBC.sh in env folder:
#
# ODBC Driver for Hive
#
HIVE_CONFIG='/usr'
if [ "${HIVE_CONFIG}" != '<HIVE_CONFIG>' ]; then
export HIVE_CONFIG
mstr_append_path LD_LIBRARY_PATH "${HIVE_CONFIG:?}"/lib
export LD_LIBRARY_PATH
fi
It works and it's very exciting! I then created a quick free-form SQL report against our Impala table in Hadoop. We can finally have our users run reports directly against our Hadoop cluster. We'll have to do some more testing, but this is a very good start.
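As a quick sanity check outside Microstrategy, if unixODBC's isql utility is installed (and the DSN is visible to it, which depends on your ODBC manager setup), you can test the connection directly:
isql -v IMPALA-SERVER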
Tuesday, January 28, 2014
Deduping rows in Netezza / Deleting rows in Netezza.
Netezza has a hidden rowid for every row in all tables. To dedup, you can do this:
delete from omszip_stage
where rowid in (
select max(rowid) as rowid
from omszip_stage
group by couponid, zip, pid
having count(*) > 1);
Note that each run deletes only one duplicate per group (the one with the max rowid), so if a group has more than two copies, rerun the delete until no rows are affected.
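To verify, this check on the same grouping keys should eventually return no rows:
select couponid, zip, pid, count(*)
from omszip_stage
group by couponid, zip, pid
having count(*) > 1;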
Tuesday, December 3, 2013
bash script to extract dates from sql statements from history table
We had to figure out the usage against one of our biggest tables to decide how to archive it. We keep a history of all queries issued against the table, and I wanted to extract the date filter used in the SQL.
There are many ways to do it, but I wanted to do it in a bash script just for fun. It's probably not the most efficient approach, but I'm posting it anyway, mostly for my own benefit in case I need to remember how to do something in bash: how to find the position of a word in a string, how to extract words from a line, and so on.
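The two core tricks used below, in isolation (the sample string is made up):
s="select * from t where adddate >= '2013-11-01'"
pos=$(echo "$s" | awk '{print match(tolower($0), "where")}')   # 1-based position of "where", 0 if absent
echo "${s:pos-1}"                                              # bash substring from that (0-based) offset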
#! /bin/bash
#goal: to extract the dates used in where condition in sql ie adddate >= '2013-11-01' or daystring=20131105
nzsql -t -host serverhere -d databasehere -u usernamehere -W passwordhere -c "select QH_SESSIONID, QH_TSUBMIT, QH_USER, QH_SQL from QUERYHISTORY" > queryhistory.txt 2>&1
rm -f lhqueryhistory1.txt
while read line
do
if [ -z "$line" ];
then
echo "line is empty"
break
fi
sessionid=`echo $line | cut -d"|" -f1`
starttime=`echo $line | cut -d"|" -f2`
username=`echo $line | cut -d"|" -f3`
sql=`echo $line | cut -d"|" -f4`
#cut out sql starting from the where word
wherepos=$(echo $sql | awk '{print match(tolower($0), "where")}')
#if no where condition skip that sql
if [ ${wherepos:-0} -eq 0 ];
then
continue
fi
wheresql=${sql:wherepos-1}
#cut out sql starting from the adddate or daystring word
wherepos2=$(echo $wheresql | awk '{print match(tolower($0), "adddate|daystring")}')
if [ ${wherepos2:-0} -eq 0 ];
then
wheresql2="none and"
else
wherepos2=$(($wherepos2 - 1))
wheresql2=$(echo ${wheresql:${wherepos2}})
fi
#cut out sql up until the and, group or limit word
wherepos3=$(echo $wheresql2 | awk '{print match(tolower($0), "and|group|limit")}')
if [ ${wherepos3:-0} -eq 0 ];
then
wheresql3=$wheresql2
else
wherepos3=$(($wherepos3 - 1))
wheresql3=$(echo ${wheresql2:0:${wherepos3:-0}})
fi
#extract the value starting at the first single quote, if any
wherepos4=$(expr index "$wheresql3" "'")
if [ ${wherepos4:-0} -eq 0 ];
then
wheresql4=$wheresql3
else
wheresql4=${wheresql3:wherepos4-1}
fi
finalrow="$sessionid|$starttime|$username|$wheresql4"
echo $finalrow >> lhqueryhistory1.txt
done < queryhistory.txt
rm -f queryhistory.txt
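Assuming the script is saved as extract_dates.sh (the name is mine), run it and inspect the result:
chmod +x extract_dates.sh
./extract_dates.sh
head lhqueryhistory1.txt    # sessionid|starttime|username|extracted date filter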
Friday, November 22, 2013
Netezza ERROR: 65798 : no more space for TXMgr transactions array
Every now and then in Netezza, we'll get this error message: ERROR: 65798 : no more space for TXMgr transactions array.
When this happens, nobody can issue any query against the database, and the database is unusable.
There is a limit of approximately 65,000 transaction objects in Netezza. Once that limit is reached, the system will not accept any new queries. This usually isn't a problem, because as transactions complete, the slots taken up by those queries are freed and given to new queries. However, sometimes queries get stuck and hang onto their transaction objects until we go into the system and manually abort those sessions.
This script looks for queries that have been running for more than 24 hours (the threshold is an argument, so it's configurable), sends an email showing how long each has been running, the username, and the SQL command, and automatically kills the offending sessions.
#! /bin/bash
MaxMinutes="${1}"
if [ -z "$1" ]
then MaxMinutes=1440
fi
rowcount=0
nzsql -t -c "select id, conntime, username, dbname, command from _v_session where status='active' and command is not null;" > sessionstokill.txt 2>&1
CurrentTimeEpoch=$(date +%s)
message="Long Running Queries:\n----------------\n"
while read line
do
id=`echo $line | cut -d"|" -f1`
starttime=`echo $line | cut -d"|" -f2`
username=`echo $line | cut -d"|" -f3`
dbname=`echo $line | cut -d"|" -f4`
command=`echo $line | cut -d"|" -f5`
StartTimeEpoch=$(date -d "$starttime" +%s)
DifferenceInMinutes=$((($CurrentTimeEpoch - $StartTimeEpoch)/(60)))
#if it has been running for more than X minutes (argument 1)
if [ $DifferenceInMinutes -ge $MaxMinutes ] && [ $id -gt 0 ];
then
rowcount=$((rowcount+1));
message+="SESSIONID: ${id} has been running for ${DifferenceInMinutes} minutes.\nUSERNAME:${username}\nDATABASE:${dbname}\nSQL:${command}\n"
#we only want to kill it if the queries came from these users
if [[ "$username" == *INTERNAL*_W* ]] || [[ "$username" == *ADMIN* ]];
then
message+="\n!!This session will be killed!!\n";
nzsession abortTxn -force -id $id
fi
message+="-----------------\n\n"
fi
done < sessionstokill.txt
if [ $rowcount -gt 0 ]
then
sendMail -dst "youremail@yourcompany.com" -msg "Long Running Queries / Hung Transactions NZ" -bodyText "$message"
fi
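If you want to run it on a schedule, a hypothetical cron entry (the script path and name are mine) that checks hourly with the 24-hour threshold:
# m h dom mon dow command
0 * * * * /opt/scripts/kill_long_queries.sh 1440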
Wednesday, November 20, 2013
Exporting resultset using Impala and chaining multiple commands
Using Hue's Impala UI, you can download a resultset as CSV or Excel, but the Impala shell offers a much more flexible way of exporting resultsets, since you can specify any delimiter you want. Another benefit over the Hue Impala UI (as of Impala 1.1) is that the UI cannot execute multiple commands; only the first statement is executed, even though no failure is reported.
If you want to chain a few commands, create a script file (it can have any extension, although the convention is .hql).
Put all your SQL commands in the file. The last command should select the output you want to save. To execute the file, type:
impala-shell -f myscript.hql -B -o results.txt --output_delimiter='\t';
The above command invokes the Impala shell, runs the commands stored in myscript.hql, writes the results to a file called "results.txt", and uses tab as the field delimiter.
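For a quick one-off query, you can also skip the script file and pass a single statement inline with -q, using the same output flags:
impala-shell -q "select to_date(adddate), count(*) from transactionmaster group by to_date(adddate)" -B -o results.txt --output_delimiter='\t'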
Contents of myscript.hql:
drop table if exists mytable;
create table mytable (adddate string, rowcount bigint);
insert into mytable
select to_date(adddate), count(*)
from transactionmaster
group by to_date(adddate);
select * from mytable;
You can then view the output in the file you specified, in my case:
cat results.txt