Data Exploration and Us

LivePerson Tech Blog
10 min read · May 20, 2017


Introduction

When was the last time you found yourself exploring your data ad hoc? Were you using your own scripts, tools, and technology stack, or were they shared among fellow IT members and teams? Did you use version control for your data exploration and troubleshooting codebase? What is the relationship between your ad-hoc data exploration scripts, the reporting provided by your product, and your backoffice?

In this post, we would like to suggest both the relevant tools and the key processes for exploring and assimilating troubleshooting knowledge across the various tiers, namely data scientists, developers, support staff, and business analysts.

By key processes, we mean taking data-driven exploration and troubleshooting and lifting it up so that any tier can easily reuse it. Not only do we want the various tiers to reuse shareable data exploration code, but we also hope that, once we have discovered enough important exploration processes, those get reflected in the product itself, or at least in the product backoffice. It’s a simple fact that, in most cases, what you want to know about your data, other teams want to know as well. So, how do we formalize and simplify these processes? If you think about it, this concept resembles lifting unstructured data into structured data: the unstructured side is the informal data exploration process, while the structured side is a formal methodology that includes UI, scripts, and reporting. In our case, we have on one end a developer trying to make sense of his data, and on the other end a customer trying to get better analysis and reporting from the product he is using. Now is the time to take a small step toward a better connection between those ends.

The tools at hand

There are many tools for data exploration, analysis, and visualization, and all of them are great: Microsoft Excel, Google Spreadsheets, Apache Hive, R, Zeppelin, Python’s scikit-learn, Spark Notebook, Spark SQL, Spark ML, Impala, Graphite, Dashing, ELK, Breeze, D3, custom-made tools, and, last but not least, simple piped shell commands (awk, cut, sort, etc.). Note that shell scripts should not be taken lightly: in his seminal book, Data Science at the Command Line, Jeroen Janssens describes full data science flows using nothing more than the command line and small shell scripts! In my own data exploration, and in that of my peers, we often found the command line to be the most effective, immediate, simple, and straightforward way to explore data. In order to choose the right tools for our needs, let’s first imagine our requirements and how we want to see the end results. From this thought process, we should be able to derive which tools to choose.
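As a small teaser of what such a flow looks like (the full exercise appears below), here is the kind of one-liner we have in mind; the file name and layout here are made up for illustration, assuming a pipe-delimited events.csv with a status code in its third column:

$ cat events.csv | awk -F'|' '{print $3}' | sort | uniq -c | sort -rn | head

One pipeline, no cluster, and you already have a frequency distribution of your status codes.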

The requirements

  1. Reduce to a bare minimum the process of manually or explicitly moving data.
  2. Reuse of the proper tools among various tiers.
  3. Version control for anything we code.
  4. No duplication of code among various tools.
  5. Simple code: so simple and so minimal that a quick glance makes it obvious what it explores! We want to eliminate any need for further troubleshooting or maintenance of the exploration code itself!

The right stack for the right task

Let’s start by checking which technology is suitable for our tasks at hand and also reads well for all tiers. Consider this section a personal opinion, from myself and my peers, describing the tools and processes we found optimal to work with. If you choose a different stack and it plays well with the process and requirements outlined here, then great.

Choosing the tools. While Microsoft Excel and Google Spreadsheets have immense benefits, mainly the fact that you can share them with any IT or business personnel for seamless use and immediate benefit, there is another side to the coin. Spreadsheet sources are less manageable (think revision control; we prefer to have source code to work with), and when digesting large amounts of data such tools may simply be too slow or crash your browser. D3 is focused on visualization; while it can be used to create worthy visualizations, it solves only one aspect of the problem, and it would be too complex for reuse by non-developers. We are attempting to have the smallest number of tools that solve the maximum number of problems and are useful to the maximum number of users.

We mentioned ELK, Graphite, and other similar tools. These are focused on metrics and logging and are less relevant for our needs, although they can serve as endpoints for data exploration whose results are streamed into plots. Hive, Spark SQL, and Impala are worthy, but we do not always have access to them; the question is whether we can find a lower-level option that is more commonplace. Scala, Python, and R are excellent higher-level languages but are mainly developers’ tools. Having evaluated these options, we arrive at an inevitable conclusion, with some compromises, and offer a sane process with three layers:

  1. Command line for all users. Fast data exploration that can be utilized and shared by all users (could incorporate Hive and other similar libraries if accessible; still, the entry point is the command line).
  2. Higher-level scriptology for slightly more advanced data exploration (should be kept simple; as mentioned, we don’t want to be in a situation in which we need to maintain code).
  3. Formal jobs and UI.

The process

Figure 1: The process, from low-level scripts up to reporting and the backoffice, which in turn feed back into the low-level scripts

On the left side of Figure 1, we show various users exploring data using simple shell scripts, manageable in revision control, shared, and reusable by all. The scripts in this section are straightforward and simple, mostly one-liners. If we need an enhanced version of the scripts using more advanced tools and languages (R, Scala, and Zeppelin), we move to the middle column of Figure 1 where we are still using simple exploration methods but with more power at hand. Here we can plot using higher-level languages, but we still want to keep the scripts simple.

At some point, you may find that you need complex scripts in order to analyze your data. Again, the idea is that the exploration scripts should not be too complex. If they are, it means we need to enhance our product to either write a more simplified version of our data to HDFS, or to locate a simplified version that we were not aware of. This is where we identify the need to write actual code in order to simplify our data and support the exploration scripts. We show this on the right side of Figure 1, which is where our business logic resides; this is the territory of unit tests, integration tests, system tests, end-to-end tests, and complex logic.

On the front with the command line

The point of the following section is not the specifics of the commands we are going to use, but rather to show that one-liners, as the first layer of data exploration, can be extremely fast and productive. No high-level tools are needed, the shell is the common denominator, and you can use it in every environment you work with.

Note: We did not invent this exploration exercise; it comes from the book Machine Learning with Spark, where the exploration is done with Scala and Apache Spark. In this article, we will run the same exploration with only the command line and HDFS at hand!

Exercise start!

Let’s download the MovieLens 100k dataset:

$ curl -O http://files.grouplens.org/datasets/movielens/ml-100k.zip

% Total % Received % Xferd Average Speed Time Time Time Current

Dload Upload Total Spent Left Speed

100 4808k 100 4808k 0 0 498k 0 0:00:09 0:00:09 –:–:– 609k
$ unzip ml-100k.zip

Archive: ml-100k.zip

creating: ml-100k/

inflating: ml-100k/allbut.pl

inflating: ml-100k/mku.sh

inflating: ml-100k/README

Upload it to HDFS:

$ hdfs dfs -put u.user

View the first few users:

$ hdfs dfs -cat u.user | head

1|24|M|technician|85711

2|53|F|other|94043

3|23|M|writer|32067

4|24|M|technician|43537

Now we’ll count the number of distinct users:

$ hdfs dfs -cat u.user | awk -F'|' '{print $1}' | sort -u | wc -l

943

The number of unique genders:

$ hdfs dfs -cat u.user | awk -F'|' '{print $3}' | sort -u | wc -l

2

Not surprised! Hopefully… 🙂

Count by gender:

$ hdfs dfs -cat u.user | awk -F'|' '{print $3}' | sort | uniq -c

273 F

670 M

Unique occupations:

$ hdfs dfs -cat u.user | awk -F'|' '{print $4}' | sort -u | wc -l

21

Now, let’s check how many of the users are between the ages of 40 and 45 (the same pattern is excellent for filtering events by timeframe when the value being checked is a timestamp instead; see the variant right after the output below):

$ hdfs dfs -cat u.user | awk -F'|' '$2 >= 40 && $2 <= 45 {print $2}' | wc -l

103
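The timestamp variant looks almost identical. Assuming a hypothetical events file whose first column is an epoch timestamp in seconds, something along these lines would count the events that fall inside a given one-day window (the file name and boundary values are made up for illustration):

$ hdfs dfs -cat events.log | awk -F'|' '$1 >= 1490000000 && $1 < 1490086400' | wc -l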

A distribution of the occupations:

$ hdfs dfs -cat u.user | awk -F'|' '{print $4}' | sort | uniq -c | sort -nk1

7 doctor

7 homemaker

9 none

12 lawyer

12 salesman

14 retired

16 healthcare

18 entertainment

26 marketing

27 technician

28 artist

31 scientist

32 executive

45 writer

51 librarian

66 programmer

67 engineer

79 administrator

95 educator

105 other

196 student

Note: You must issue the sort command before uniq, because uniq works on adjacent lines (after all, it scans the file line by line).
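You can see the difference on a tiny made-up input:

$ printf 'a\nb\na\n' | uniq -c

1 a

1 b

1 a

$ printf 'a\nb\na\n' | sort | uniq -c

2 a

1 b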

Now for the ratings dataset (u.data). Let’s find the min rating:

$ hdfs dfs -cat u.data | sort -nk3 | head -1 | awk '{print "Min rating: " $3}'

Min rating: 1

Max rating:

$ hdfs dfs -cat u.data | sort -nk3 | tail -1 | awk '{print "Max rating: " $3}'

Max rating: 5

Let’s compute the average rating:

$ hdfs dfs -cat u.data | sort -nk3 | awk '{ total += $3; count++ } END { print "Average rating: " total/count }'

Average rating: 3.52986

Let’s compute the median rating:

$ index=$(hdfs dfs -cat u.data | echo $((`wc -l`/2))) && hdfs dfs -cat u.data | sort -nk3 | tail -$index | head -1 | awk '{print "Median rating: " $3}'

Median rating: 4
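The construct above (a pipe feeding a nested command substitution) works in bash but is easy to misread. A possibly clearer equivalent, assuming the same u.data layout, is to count the lines first and then pick the middle one:

$ total=$(hdfs dfs -cat u.data | wc -l)

$ hdfs dfs -cat u.data | sort -nk3 | awk -v mid=$((total / 2)) 'NR == mid {print "Median rating: " $3; exit}'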

Let’s compute the average number of ratings per user:

num_users=$(hdfs dfs -cat u.user | awk -F'|' '{print $1}' | sort -u | wc -l)

num_ratings=$(hdfs dfs -cat u.data | sort -u | wc -l)

$ echo $((num_ratings/num_users))

106

And the average number of ratings per movie:

num_movies=$(hdfs dfs -cat u.item | sort -u | wc -l)

$ echo $((num_ratings/num_movies))

59
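Note that $(( ... )) is integer arithmetic, so both averages above are truncated. If you want the fractional value, one option is to let awk do the division; a small variation on the commands above, reusing the same variables:

$ echo "$num_ratings $num_users" | awk '{printf "Average ratings per user: %.2f\n", $1/$2}'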

Wow! No Spark, Hive, or MapReduce, and no higher-level language at all: with simple shell commands we have achieved quite a lot of exploration of the data we had in HDFS. I have to say that such simple tools are not only useful for exercises; they are also extremely effective for exploration and troubleshooting in many real-life scenarios, and real life is usually not much more complex than these exercises. Which is super!

Higher level scriptology

There are actually some really great entry points into writing data exploration scripts that can be shared and nicely packaged in a UI for the various tiers to work with. Apache Zeppelin is a great tool that assists in that process: it accepts many kinds of scripting languages and can plot their output nicely. Once you have it up and running, you can cover most higher-level scripting demands; the only thing left to do is write the script and share it. You can write code in Scala, pass the data to R and back, and plot using various plotting libraries. A whole new world of data exploration becomes available to you. Zeppelin even has a small icon on its toolbar which, when you hover over it, says "version control". What it actually does is create a local git repository for your notebooks; you can then connect it to a remote repository (manually at this stage) and have version control up and running for your notebooks. It is useful to know where Zeppelin actually creates that local git repository:

With your browser pointing to your notebook URL, you would see:

http://zeppelin-host:8080/#/notebook/2BMRMTWNM

On the disk, you would see the directory 2BMRMTWNM under: zeppelin/notebook

Beneath that, you would see a file called: note.json
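Connecting that local repository to a remote is then plain git. A rough sketch follows; the remote URL is of course hypothetical, and the exact layout may differ between Zeppelin versions:

$ cd zeppelin/notebook

$ git remote add origin git@your-git-server:your-team/zeppelin-notebooks.git

$ git push -u origin master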

Let us show you how to bootstrap with Apache Zeppelin. We’ll use a local CSV file with a local installation of Apache Zeppelin (which is cool anyway: running Scala/R code on local data files). We used the latest Apache Zeppelin (incubating) source code, compiled with the -Pr maven profile enabled (to include R).
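For reference, the build we used looked roughly like the following; the exact profiles to enable depend on the Zeppelin version and the Spark version you target, so treat this as a sketch rather than a recipe:

$ git clone https://github.com/apache/zeppelin.git

$ cd zeppelin

$ mvn clean package -DskipTests -Pr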

First, we load the library to ease our work with the CSV file:

%dep

z.reset()

z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")

z.load("com.databricks:spark-csv_2.10:1.2.0")

DepInterpreter(%dep) deprecated. Remove dependencies and repositories through GUI interpreter menu instead.

DepInterpreter(%dep) deprecated. Add repository through GUI interpreter menu instead.

DepInterpreter(%dep) deprecated. Load dependency through GUI interpreter menu instead.

res0: org.apache.zeppelin.dep.Dependency = org.apache.zeppelin.dep.Dependency@2565720c

In a new section, use the loaded library and start exploring your data:

%spark.r

users <- read.csv("ml-100k/u.user", sep="|")

Let’s examine the users by showing an example of the first few:

%spark.r

users <- read.csv("ml-100k/u.user", sep="|")

head(users)

X1 X24 M technician X85711

2 53 F other 94043

3 23 M writer 32067

4 24 M technician 43537

5 33 F other 15213

6 42 M executive 98101

7 57 M administrator 91344

Count the number of distinct users (note that read.csv defaults to header=TRUE, so the first user row was consumed as column names above; that is why the count comes out as 942 here rather than the 943 we saw on the command line. Passing header=FALSE to read.csv keeps all the rows):

%spark.r

length(unique(users[,1]))

[1] 942

If some of your scripts are already coded in another language, such as Scala, you can pass data from Scala to R and back. Let’s show a mini example of that:

%spark

val usersText = sc.textFile("file:///tmp/ml-100k/u.user")

val ageIndex = 1

val genderIndex = 2

val occupationIndex = 3

case class User(age:Integer, gender: String, occupation: String)

val users = usersText.map(s=>s.split("\\|")).map(

s=>User(s(ageIndex).toInt,

s(genderIndex),

s(occupationIndex))

)

z.put("numusers", users.collect().size)

usersText: org.apache.spark.rdd.RDD[String] = file:///tmp/post/ml-100k/u.user MapPartitionsRDD[122] at textFile at <console>:29

ageIndex: Int = 1

genderIndex: Int = 2

occupationIndex: Int = 3

defined class User

users: org.apache.spark.rdd.RDD[User] = MapPartitionsRDD[124] at map at <console>:39

usersDF: org.apache.spark.sql.DataFrame = [age: int, gender: string, occupation: string]

usersTempTable: Unit = ()

And then read it from R:

%spark.r

numusers <- .z.get("numusers")

numusers

[1] 943

Summary

We started with an awareness that we can do a better job with our data troubleshooting and exploration code. We agreed that we should keep the exploration process simple, with the least amount of maintenance required, and that we want shareability among the various tiers. We have multiple layers, each with a higher level of complexity and thus involving more UI, business logic, and tests. That awareness also extended into choosing the right technology for the right task: today we have an immense number of data exploration tools and languages to choose from, and we chose our stack accordingly, with lower-level shell one-liners for simple data exploration, and Scala, Spark, and R for higher-level scriptology. We closed the loop by having our higher-level code feed back into the simple processes, writing simpler data into HDFS so that our troubleshooting code can do its job more easily.

-Tomer Ben David
