
Sellkart Case Study - Using Data Strategically

August 13, 2015

Sellkart is a leading Indian e-commerce website and retail store, and like every other e-commerce website it has to store user data and an inventory of all the items it sells. It uses a MySQL database, one of the most common data storage choices; MySQL is a safe, secure and reliable data store. But as Sellkart grew larger, more and more customers started using the portal. Sellkart now collects user data such as location, behaviour, demographic logs, items purchased, items returned, and payment options such as COD or credit/debit card. Analysts use this data to predict customer value, offer better payment options, and expand the business through better ad campaigns and streamlined processes.

To show how the collected data can be used strategically, we will work with a synthetic dataset of consumers consisting of usernames, their city, their occupation, the total number of items purchased and the total number of items returned due to defects.

The sample dataset (data.csv) contains the customers who have bought more than 25 items.

Fields in the data.csv

  • Username (String)
  • City (String)
  • Occupation (String)
  • Purchased (Integer)
  • Returned (Integer)
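For illustration, a few rows of data.csv look like this (these example rows are taken from the query output shown later in this post):

Ablay1993,Seattle,Ergonomist,250,7
Amelia.Skiles,Westfield,Mediator,250,0
Annalise_Langosh,East Northport,Fitness director,250,24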
Sellkart uses this data, which has multiple columns and amounts to a considerably large dataset of several gigabytes, to predict and understand key insights about its business, such as:

  • Customer lifetime value
  • Product success
  • Demographic profile of a product's purchasers
  • Regions of concern
  • Real-time pricing of products
  • Upselling to engaged customers

Now we will show how a small-scale analysis can be carried out in Hadoop. Using this data, we will find:

a) The top consumers, based on the quantity of products they have bought.

Uses: giving special privileges to the best buyers, upselling to average customers.

b) The job sector that purchases the most products.

Uses: stocking more products in the bestselling range, recommending similar products.

c) The cities where the most products are purchased.

Uses: targeting buyers region-wise, focusing on areas with fewer customers, trying new products in strong markets.

1. Copying the data into HDFS.

a) Make a directory in Hadoop HDFS

[training@localhost ~]$ hadoop fs -mkdir /user/training/case1

[training@localhost ~]$ hadoop fs -ls

Found 3 items

drwxr-xr-x - training supergroup 0 2015-07-21 16:48 /user/training/case1

-rw-r--r-- 1 training supergroup 44009445 2015-07-20 12:39 /user/training/crime

drwxr-xr-x - training supergroup 0 2015-07-20 14:21 /user/training/csvfile


b) Copying the file for further usage

[training@localhost ~]$ hadoop fs -copyFromLocal ~/Desktop/case1/data.csv /user/training/case1

[training@localhost ~]$ hadoop fs -ls /user/training/case1

Found 1 items

-rw-r--r-- 1 training supergroup 2489638 2015-07-21 16:49 /user/training/case1/data.csv


After the file is copied, it is processed using Hive to obtain the various results.

2. Using Hive to process the data.

a) Creating a table in Hive to store this data.

[training@localhost ~]$ hive

Hive history file=/tmp/training/hive_job_log_training_201507211652_319213433.txt

hive> create table shopping (username string, city string, occupation string, purchaseditem int, returneditem int) row format delimited fields terminated by ',' stored as textfile;


Time taken: 2.892 seconds

hive> show tables;




Time taken: 0.171 seconds


b) Load the file into the Hive table shopping

hive> load data inpath '/user/training/case1/data.csv' overwrite into table shopping;

Loading data to table

Deleted hdfs://localhost/user/hive/warehouse/shopping


Time taken: 2.798 seconds


c) Examining the table's structure and metadata.

hive> describe formatted shopping;


# col_name              data_type               comment             

username                string                  None                
city                    string                  None                
occupation              string                  None                
purchaseditem           int                     None                
returneditem            int                     None                

# Detailed Table Information             
Database:               default                  
Owner:                  training                 
CreateTime:             Tue Jul 21 16:56:38 IST 2015     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://localhost/user/hive/warehouse/shopping    
Table Type:             MANAGED_TABLE            
Table Parameters:                
        transient_lastDdlTime   1437478210          

# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe    
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             ,                   
        serialization.format    ,                   
Time taken: 0.302 seconds

3. Now we need to see how this data can be used.

a) Finding the best customers so they can be offered more discounts. First we will find the highest quantity ordered, then we will match that quantity against all the records to find the top customers.

hive> select * from shopping limit 5;


Seattle Podiatric doctor 187 5

Detroit Lakes Engineering and natural sciences manager 168 11

Jackson Map editor 187 17

Miami Clinical manager 193 5

Santa Fe Springs Radiographer 207 20

Time taken: 0.335 seconds

hive> select max(purchaseditem) from shopping;

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapred.reduce.tasks=<number>

Starting Job = job_201507212155_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201507212155_0001

Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201507212155_0001

2015-07-21 23:16:01,887 Stage-1 map = 0%, reduce = 0%

2015-07-21 23:16:06,988 Stage-1 map = 100%, reduce = 0%

2015-07-21 23:16:16,075 Stage-1 map = 100%, reduce = 100%

Ended Job = job_201507212155_0001

250

Time taken: 18.794 seconds


hive> select * from shopping where purchaseditem = 250;

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_201507212155_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201507212155_0002

Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201507212155_0002

2015-07-21 23:18:31,586 Stage-1 map = 0%, reduce = 0%

2015-07-21 23:18:33,598 Stage-1 map = 100%, reduce = 0%

2015-07-21 23:18:34,608 Stage-1 map = 100%, reduce = 100%

Ended Job = job_201507212155_0002


Ablay1993 Seattle Ergonomist 250 7

Ablent71 Sugar Land Professional property manager 250 10

Affew1942 Alquippa Human resources consultant 250 13

Agar1976 Bell Oak Zookeeper 250 5

Akeem Manchester Marine 250 11

Alat1954 Columbus Assembler 250 7

Albertha Baton Rouge Mixing and blending machine operator 250 21

Amelia.Skiles Westfield Mediator 250 0

Amir_Spencer Polk City Epidemiologist 250 11

Angel.Thiel Hamilton Keying machine operator 250 21

Annalise_Langosh East Northport Fitness director 250 24


And so the list continues. The company can now identify its top customers. This step could also be performed in MySQL, but only when the data is very small; when the data runs into gigabytes, Hadoop comes in to save the day.
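As an aside, the two steps above can also be collapsed into a single query by simply sorting on the purchased quantity; a minimal sketch against the same shopping table (the limit of 20 rows is an arbitrary illustration):

hive> select * from shopping order by purchaseditem desc limit 20;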

b) Now the analyst can find which job sector is their best client, i.e. analyse the occupations of the people who purchase the most, so as to stock more products for that sector.

hive> select distinct occupation from shopping where purchaseditem = 250;

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapred.reduce.tasks=<number>

Starting Job = job_201507220045_0006, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201507220045_0006

Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201507220045_0006

2015-07-22 01:09:52,904 Stage-1 map = 0%, reduce = 0%

2015-07-22 01:09:54,914 Stage-1 map = 100%, reduce = 0%

2015-07-22 01:10:01,968 Stage-1 map = 100%, reduce = 100%

Ended Job = job_201507220045_0006


Account clerk

Accounting clerk


Administrative assistant.

And so on.

Now, knowing that account clerks form one of the top job sectors from which people are buying, the company can increase its sales efforts in that segment.
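A more direct way to rank job sectors is to aggregate purchases across all customers rather than looking only at the top buyers. A minimal sketch, again assuming the same shopping table (the column alias total_purchased is illustrative):

hive> select occupation, sum(purchaseditem) as total_purchased from shopping group by occupation order by total_purchased desc limit 10;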


4. Using Pig

Now we will use Pig to perform other functions, such as collecting the usernames from each of the top consuming cities into groups.

a) Load the data into Pig

grunt> fulldata = load '/user/training/case1/' using PigStorage(',') as (username:chararray, city:chararray, occupation:chararray, purchaseditem:int, returneditem:int);

grunt> dump fulldata;


b) Now we will group the usernames of the top consumers according to cities.

Segregating the records of the top consumers:

grunt> top = filter fulldata by purchaseditem == 250;

grouping data in Pig


Now we group the top consumers by city, to obtain the usernames as a collection.

 grouping data in Pig

The data is grouped but is unordered.

 grouping data in Pig

So we need to order it now.

 changing order of data in pig

So in this way the final data is ordered.

 changing order of data in pig
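Since the Pig statements above appear only as screenshots, here is a minimal sketch of the whole sequence under the schema declared while loading fulldata; the relation names top, bycity, citygroups and ordered are illustrative:

grunt> top = filter fulldata by purchaseditem == 250;  -- keep only the top consumers
grunt> bycity = group top by city;  -- group the top consumers by city
grunt> citygroups = foreach bycity generate group as city, top.username as usernames, COUNT(top) as consumers;  -- collect the usernames per city and count them
grunt> ordered = order citygroups by consumers desc;  -- order cities by number of top consumers
grunt> dump ordered;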

Now we can use this data. Similarly, we can extract the cities with the fewest consumers; in those cities the company can allocate a larger budget for advertising and publicity campaigns so that more people interact with the portal.
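For the cities with the fewest consumers, the grouped relation from the sketch above can simply be ordered the other way round; the relation names are again illustrative:

grunt> fewest = order citygroups by consumers asc;  -- cities with the fewest top consumers first
grunt> bottom = limit fewest 10;  -- take the bottom 10 cities
grunt> dump bottom;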

And in the cities and consumer segments with a high consumption rate, the company can launch new products and expand its reach.

If you have any doubts or queries, mention them in the comments box below.


About the Author

Saharsh Jain is a Big Data and programming enthusiast. He is pursuing a B.Tech in Information Technology Engineering at Maharaja Agrasen Institute of Technology. He loves to get his hands dirty with different programming languages.

