Tuesday, December 18, 2012

Hadoop etc

Cheat sheet of Hadoop stuff

SQOOP

IMPORT:
sqoop import --connect jdbc:netezza://host/dbname --username yourusername --password=yourpassword --query "SELECT * FROM tableA where \$CONDITIONS and columnA >= '2012-08-01'"  --direct --fields-terminated-by '\t' --split-by columnB--num-mappers 1 --target-dir /user/tsusanto/input_data/20120803 --compression-codec org.apache.hadoop.io.compress.GzipCodec


DELETE. You can also delete records using sqoop:
sqoop eval --connect jdbc:netezza://host/dbname --username yourusername --password=yourpassword# -e "delete from yourtable where columnA= 926"


Hadoop FS SHELL

write a value to a file in HDFS:
echo 'test123' | hadoop fs -put - yourfilepath

display a few rows of your data file stored in hadoop file system
hadoop fs -text /user/tsusanto/yourdatafolder/part* | less


to search a text string
hadoop fs -text /user/tsusanto/yourdatafolder/part* | grep -P '\byoursearchstring\b'

move a file from hadoop file system to your local current folder
hadoop fs -get /user/tsusanto/jobs/yourfolder/workflow.xml .

delete a file from hadoop file system.  You must delete the file before you can push a new file to the same location, file must not exists
hadoop fs -rm -r /user/tsusanto/jobs/yourfolder/workflow.xml

move a file to the hadoop file system from your current folder
hadoop fs -put workflow.xml /user/tsusanto/jobs/yourfolder/

Wednesday, December 12, 2012

Netezza SQL Extension Toolkit

There are many useful string manipulation functions in Netezza available if you install Netezza SQL Extension Toolkit.

To get the Right 3 characters of a string:

select strright( 'this is a string abc',  3)

The above will return abc.


Tuesday, December 11, 2012

Microstratgy Dashboard

Microstrategy Dashboard built using Microstrategy 9.3

Features:
  1. Row level security based on user login - Microsite IDs. The selectors displayed on the left side of the screen below changes based on the user login
  2. Cascading selectors. Selector Offer Type filters the Retailer selector, which in turn filters the Microsite selector
  3. Multi Panels, so user can easily tab between panels to view different data




Thursday, September 27, 2012

Microstrategy filter on dataset in document

On Microstrategy Web, if your document/dashboard is based on a dataset that is returning you too much data and you want to apply a filter on your dashboard..right click on your grid, click "Edit View Filter", then you can select what attribute/metric you want to filter on and the conditions.



Wednesday, September 19, 2012

Datastage Date Manipulation

I always have a hard time trying to add or minus days  to a given date in Datastage. You can use icon, ocon too....but I like this method when I need to manipulate a date column in Transformer Stage.

To add one day to a given date (an incoming date column):

DateFromJulianDay(JulianDayFromDate(DSLink1.your_date_column) + 1)

So basically you convert a date to a julian day, then add 1, then convert the julian day back to a date.

Friday, September 7, 2012

Microstrategy Modifying Tooltips on graphs

By default, when you have a graph, the tooltip on the web will display:
Series: Electronics Sales Amt
Category: 2012-09-07
Value: $2,509.24

But what if you want your tooltip to say:

Electronics Sales Amt: $2,509.24
Date: 2012-09-07?

Under Graph, Preferences, Series, select one of your series, show tooltip, remove the default value of {&TOOLTIP}
and instead type this:
{&SERIESLABEL}:{&VALUE}&#13Date:{&GROUPLABEL}




Friday, August 24, 2012

How to run Microstrategy Tutorial documents / dashboards

When you install Microstrategy Desktop, you can select to install the Microstrategy Tutorial, which by default is a 2-tier set up. By 2-tier, it means it does not require an I-Server. It's just Microstrategy Desktop - Database/Metadata.

The Microstrategy installation will create 3 Microsite Access databases in your local directory (C:\Program Files (x86)\MicroStrategy\Tutorial Reporting\TUTORIAL_DATA_7200.mdb  and TUTORIAL_METADATA_7200.mdb) and C:\Program Files (x86)\MicroStrategy\Tutorial Reporting\CSVFiles\Operational_datamart.mdb). The default Microstrategy Tutorial (Direct) project is simply pointing to the 3 local MS Access databases for the data and metadata..so you can still create and execute reports, but cannot run the documents/dashboards.

Since it's a 2-tier set up (as indicated by the word "Direct" is next to the project name) , if you go to your Microstrategy Tutorial project and try to run one of the dashboard documents, you will get an error message that says " Documents can only be executed through server connections ". Why? Documents can only be executed in a 3-tier environment.


2 Tier: Microstrategy Desktop - Database/Metadata DB
3 Tier: Microstrategy Desktop - Intelligence Server/Web Server - Database/Metadata DB


Suppose you already have a 3-tier set up, but your Intelligence Server tier is set up on a Linux/Unix machine,   and you wish to move the Microstrategy Tutorial project to your existing 3 tier set up so that you can execute the dashboards and play with it, this is what you'll have to do...

Step 1:
You cannot simply copy the Microsoft Access database to your Linux/Unix server where you have installed your I-Server...there is no ODBC driver available in Linux/Unix for MS Access database. The next closest thing to MS Access is MS SQL Server, so create 2 empty databases on a SQL Server (I used SQL Server 2008 R2). Call the first database MSTR_TutorialData, and the second database MSTR_OperationalDataMart. Using the import wizard in SQL Server, import the data from the 2 local MS Access databases on your machine (TUTORIAL_DATA_7200.mdb and Operational_datamart.mdb). You don't have to import the TUTORIAL_METADATA_7200.mdb database. Create a SQL logins for those 2 databases with read permissions.

Step 2:
Then on your Linux/Unix machine, navigate to your microstrategy installation path. then go to the bin folder. INSTALL_PATH/MicroStrategy/bin. Then type ./mstrconnectwiz. This will bring up the Microstrategy Connection wizard. Or if you wish, you can directly modify the odbc.ini file which sits in INSTALL_PATH/MicroStrategy folder. Select "Microstrategy ODBC Driver for SQL Server", click next, give it a name in the Data Source Name. Follow the rest of the instructions in the wizard.

Step 3:
Now, we need to move the metadata of the Microstrategy Tutorial (Direct) into your existing 3-tier's Metadata. To do that, all we have to do is duplicate the project. Connect to your Microstrategy Tutorial (Direct) project source, right click on the Microstrategy Tutorial project, and click "Duplicate Project". Follow the instructions and copy the project to your 3-tier project source. Be careful to select the option "None" under users and user groups. You do not want to bring over the existing users from that Microstrategy Tutorial into your 3-tier project (they will take up licenses and you'll have to clean up the mess later). By doing this you have thus moved over your metadata.

Step 4:
Finally, connect to your 3-tier project source, you should now see Microstrategy Tutorial project.  Go to Administration, Configurations, Database Instance. Look for "Tutorial Data" and "Operational Datamart". These 2 database instance names should have been brought over when you duplicated the Microstrategy Tutorial project. Edit them, change the Database Connection Type drop downlist to "Microsoft SQL Server 2008 with Native Client Driver". Select the Data Source Names that you created in Step 2 above. Change the login information to the SQL Server login/password you created in Step 1. Restart Microstrategy Intelligence Server.

Step 5:
To restart the I-Server, go to INSTALL_PATH/MicroStrategy/bin and type ./mstrsvcmgr. It will bring up the Microstrategy Service Manager UI, and just click the Restart button.

Now go to your Microstrategy Desktop or Web, and execute the tutorial documents, and voila...it works!! Enjoy...I love building dashboards, and their sample dashboards are like candies to me!






Tuesday, August 21, 2012

Custom Group vs Attribute ApplySimple in Microstrategy

You might have heard that creating a custom group in Microstrategy may result in bad database performance. To decide for yourself whether you should or should not use custom group, you must first understand the SQL executed by Microstrategy when  you create a custom group.

Say you want to create a custom group called "Sales Region" based on the State attribute:

  1. Tom sales region - California, Oregon, Nevada
  2. Mary sales region - Atlanta, New York
  3. Jame sales region - Florida, North Carolina

You then create a sales report based on the custom group you created above.

Microstrategy will execute 3 sql passes, 1 for each custom group element you created. It will then union the resultset of the 3 sql passes in the final report. 

sql pass 1 - 
select sum(salesamt)  WJXBFS1
from tbl_FACT a11
join region_DIM a12
 on (a11.region_SID = a12.region_SID)
where a12.state in ('California, 'Oregon', 'Nevada')

sql pass 2 - 
select sum(salesamt)  WJXBFS1
from tbl_FACT a11
join region_DIM a12
  on  (a11.region_SID = a12.region_SID)
where a12.state in ('Atlanta, 'New York')

sql pass 3 - 
select sum(salesamt)  WJXBFS1
from tbl_FACT a11
join region_DIM a12
  on  (a11.region_SID = a12.region_SID)
where a12.state in ('Florida', 'North Carolina')

Say you have 10 custom group elements, guess what? It's going to execute a similar query 10 times, with your attribute filter conditions being the only difference. But if the query executes really fast and you don't mind the many sql calls, custom group can be useful because it's simplifies your development effort.

You can achieve the same output using another method, which is to create an attribute based on an expression involving ApplySimple. Think of it sql case statement. Your attribute expression will look something like this:

ApplySimple("Case When #0 in ('Florida',  North Carolina') then 'James sales region' When #0  in ('Atlanta',  New York') then 'Mary sales region'  When #0 in ('California', 'Oregon', 'Nevada') then  'Tom sales region' end", [state] )

If you then create a report using this attribute, Microstrategy will general a sql statement that involves only 1 sql pass using case statement. The generated SQL looks something like this:

select Case When state in ('Florida',  North Carolina') then 'James sales region'
                         When state in ('Atlanta',  New York ') then ' Mary sales region' 
                         When state in ('California', 'Oregon', 'Nevada') then  'Tom sales region' end  StateCol_2,
sum(salesamt)  WJXBFS1
from tbl_FACT a11
join tbl_DIM a12
 on  (a11.region_SID = a12.region_SID)
group by Case When state in ('Florida',  North Carolina') then 'James sales region'
                         When state in ('Atlanta',  New York ') then ' Mary sales region'
                         When state in ('California', 'Oregon', 'Nevada') then  'Tom sales region' end












Wednesday, August 15, 2012

Sort a graph on a Microstrategy dashboard

If you created a graph on a Microstrategy dashboard (document) based on a dataset, and you need to sort the graph in descending order, you need to first switch the graph to grid view.

In my case I had a horizontal bar chart, with my attribute on the y axis and metric on my x axis. I had created the graph on my document based on an existing dataset. To sort my graph in descending order, I had to switch my graph to grid view (right click on graph, click "View Mode", then choose "Grid View"). Then go into the grid edit mode by clicking the square in the upper left corner. On the menu bar, click "Data", then click "Advanced Sorting", and in the "Rows" tab, choose the attribute/metric you want to sort by and select Ascending or Descending order. Switch you grid back to "Graph View".

Below is my graph after I sorted it.




Remove "metric" from report in Microstrategy

By default when you add a metric to a microstrategy report, it has the word "Metrics" in the column heading. To remove it from view, in microstrategy desktop, go to  Grid on the menu, and click Options, then check the option "Remove extra column (Web only)".

So like it says, checking the checkbox above only removes it on the Web view...when you run the report from your desktop, you'll still see the "Metrics" in the column heading.

Tuesday, August 7, 2012

Datastage - Sequence with Retries

Have you ever come across a scenario in which you need to build a datastage sequence that is smart enough to start calling the actual jobs only if all the required data exists? Then you need create a sequence that can poll the source and retry until it gets the signal that the data is ready or it reaches the retry attempts you configured.

Very frequently, a data warehouse sources its data from another operational data store (ODS), which is usually a database in which data from various other production databases/data sources are replicated to or pushed to on a regular basis (but usually not real time data syncing).

Suppose you have a nightly data warehouse ETL job that processes yesterday's data. You have a dimension table which sources data from 2 tables that originally came from 2 different databases. The 2 databases are replicated to your company's ODS, but for database A, the data is pushed to the ODS around 7 pm, while the other is pushed to the ODS around the same time but could be a couple of minutes behind.  There is also a very likely chance that replication to the ODS broke down. You want to call your datawarehouse ETL job only if both tables have the complete data for  yesterday's date.

In order to solve this issue, I built a datastage sequence with a retry attempt built in. The generic sequence flow is this. Create a loop that takes in the number of retries as a variable. Then inside the loop, you call a job that polls the ODS to see if your data is ready. Perhaps you can write a sql to check the last row_create_data in your source tables, or do a rowcount check and write the output to a textfile. Every environment is different so you'll have to come up with your own strategy as regards to what signals your data is ready. You then create a job to read the textfile output of the job above, and use a Nested Condition stage to decide if it meets your data completion criteria. If so, exit out of the loop and move on to your main ETL job, if not, stay inside the loop and call a simple routine to sleep for a certain amount of time (also use a variable), then continue the loop.

To create the sleep routine:
Create a Server Routine, select Transform Function. It should take in 1 argument, which is the number of seconds it should sleep for. In code, just type:


vSeconds = sleep_seconds
vArray = 0

sleep vSeconds
Ans=vArray

Then inside your loop, use a Routine Activity stage, and call the above routine. And pass in a job parameter for the sleep_seconds as the argument to this routine.


Below is the image of how the sequence should be designed:



Thursday, August 2, 2012

Netezza update based on a table join

I have worked with SQL Server for more than 8 years, and I'm used to writing queries like this when I'm trying to update data based on query involving table joins:

update tableA
set col1 = 0
from tableA A
join tableB B on A.id = B.bid
where b.name = 'x'

So after years of writing this kind of sql, which comes so natural to me, imagine my surprise when I moved to Netezza and wrote a query like this to update some data and I get an error message that says:

ERROR [HY000] ERROR:  Update canceled: attempt to update a target row with values from multiple join rows

Reading the message, it's trying to tell me that based on my join, I'm getting a one to many match on the rows I'm trying to delete from. So? SQL Server never complained about it. There are many cases where we do a join on tables with 1 to many relationships and do deletes/updates.

But Netezza doesn't like it. You can of course rewrite the query to have an IN clause. But there's also another funky way (at least funky to me) to achieve the desired results:


update tableA A
set col1 = 0
from tableB B
where A.id = B.bid
and b.name = 'x'

For a 3 table join:

update tableA f
set msid = 0
from tableB d, tableC t
where A.p_sid = D.d_sid
and A.t_sid = t.t_sid
and D.day_start = '2012-01-03'
and T.time_minute = 10



Useful Netezza Date Functions

If you haven't already done so, I highly recommend installing the SQL Extension Toolkit package on your Netezza box. Upon installation, you will have access to many very useful additional functions to make your life easier when writing sql.

Here are some useful date functions I tend to use:

--first of this year
select date_trunc('year', CURRENT_DATE)

--how to get first of the month given a date
select date_trunc('month', current_timestamp)

--how to display monthname-YYYY format - ie December-2012
select trim(to_char(now(),'Month')) || '-' || year(current_date);

-how to display YYYY-MM format - ie 2012-12

select TO_CHAR(current_date,'YYYY-MM');


--truncate timestamp given a date
select date(current_timestamp)--this does not require SQL Extension Toolkit
or
select date_trunc('day', current_timestamp) --requires SQL Extension Toolkit

--select next Monday's date based on today's date
select next_day(DATE(current_timestamp), 'Monday') --if your date is a timestamp and you just want the truncated date for next monday
select next_day(current_date, 'Monday')--if your date is a truncated date

--subtract 1 month from today's date
select date(current_date -  cast('1 month' as interval));

--add 6 days  from today's date
select date(current_date +  cast('6 days' as interval))
or
select  current_date - 7

--if you have a timestamp column and want to subtract say 5 days
select date(yourtimestampcolumn) - 5



--Difference in days between 2 dates (if you have sql exension toolkit installed) 
select EXTRACT (DAY FROM DateCol1 -DateCol2) as difference_in_days
 from tableA
 limit 100;
OR
-- epoch returns number of seconds
select  EXTRACT(epoch FROM DateCol1 -DateCol2)/86400 as difference_in_days
 from TableA
limit 100;

--Difference in hours between 2 dates (epoch returns number of seconds)
select  EXTRACT(epoch FROM DateCol1 -DateCol2)/3600 as difference_in_hours
 from TableA
limit 100;


--for those of you who are used to SQL SERVER's datepart function, you are in luck!
select date_part('day', current_date) as day_of_month, date_part('week', current_date) as week_of_year

--Add 1 hour to  a timestamp
select current_timestamp, current_timestamp +  cast('1 hour' as interval)

--Add 1 hour to a date
select TO_CHAR(datetime(date('2013-02-26') +  cast('1 hour' as interval)),'YYYY-MM-DD HH24:MI:SS')

Netezza External Tables


You can use Netezza's external table to view data from an external file and use it like a database table. When you create an external table, the actual data still sits in that physical external file, but you can query it from Netezza like you can query a normal database table.

You can also use external tables to export data out of a Netezza table into a file.

From Netezza's Data Loading Manual  - An external table allows Netezza to treat an external file as a database table. An external table has a definition (a table schema), but the actual data exists outside of the Netezza appliance database. External tables can be used to access files which are stored on the Netezza host server or, in the case of a remote external table, Netezza can treat a file on a client system as an external table (see REMOTESOURCE option).

Below are 2 examples to show you how to do both data export and "import".

I am using Aginity Workbench for Netezza on my windows machine (which comes with a netezza odbc driver), and I have a text file stored in my local drive C:\temp\testfile.txt which has the following column header and 3 rows:


employeeid,employeename,salary
1,'John Lee',100000
2,'Marty Short', 120000
3,'Jane Mars', 150000


CREATE EXTERNAL TABLE ext_tbl_employees(
employee_id integer, 
employee_name character varying(100), 
salary decimal (10,2))
USING (
dataobject('c:\temp\testfile.txt') 
remotesource 'odbc'
delimiter ','
skiprows 1);

Then in your Aginity workbench object browser, expand the folder External Tables, and you will your new external table listed there.

You can then query the table just like a normal database table:


select *
from ext_tbl_employees


You can also create a transient external table, in which it only exists for the duration of your query, which means the external table definition does not persist in the system catalog.


--transient external table
SELECT 
employee_id,
employee_name,
salary
FROM EXTERNAL 'c:\temp\testfile.txt' 
(
employee_id integer, 
employee_name character varying(100), 
salary decimal (10,2))
USING (
remotesource 'odbc'
delimiter ','
skiprows 1);

Transient external table is also a very useful way to export data from a netezza database out to a text file. You can export not just an entire table, but the output of any sql statement. The beauty of it is you don't have to specify the schema definition of the data, which can save you a lot of typing:

create table mytesttable (
studentid integer,
studentname character varying(100)
);

insert into mytesttable 
select 1, 'Jane Doe'
union
select 2, 'John Smith';

CREATE EXTERNAL TABLE 'c:\Temp\ExportDataTest.csv' USING (remotesource 'ODBC' DELIM ',') AS
SELECT *
from mytesttable;

drop table mytesttable;


Note: if the file already exists, any existing data will be deleted by Netezza before it inserts data.

If there's comma in your data, and you want to export to csv, use escapeChar option:

CREATE EXTERNAL TABLE 'c:\Temp\OfferAttributesNightlyFeed.csv' USING (remotesource 'ODBC' DELIM ',' ESCAPECHAR '\') AS
SELECT *
from COUPONATTRIBUTESTEST;