Kaggler.com

I started a new blog, Kaggler.com, to write mainly about Data Science competitions at Kaggle.

I’ve enjoyed participating in Kaggle competitions since 2011.  In every competition, I learned new things – new algorithms (Factorization Machine, Follow-the-Regularized-Leader), new tools (Vowpal Wabbit, XGBoost), and/or new domains.  It’s been really helpful for staying up to date in the fast-evolving fields of Machine Learning and Data Science.

With Kaggler.com, I’d like to share my learning and experiences with others.  I hope it will be useful to someone.

60 Day Journey of Deloitte Churn Prediction Competition

Competition

Last December, I teamed up with Michael once again to participate in the Deloitte Churn Prediction competition at Kaggle, where the goal was to predict which customers would leave an insurance company in the next 12 months.

It was a master competition, open only to Master-level Kagglers (the top 0.2% of 138K competitors), with $70,000 in cash prizes for the top three finishers.

Result

We managed to do well and finished in 4th place out of 37 teams, even though we did not have much time due to projects at work and family events (especially for Michael, who became a dad during the competition).

Although we fell a little short of earning a prize, it was a fun experience working together with Michael, competing with other top competitors across the world, and climbing the leaderboard day by day.

Visualization

I visualized our 60-day journey during the competition below, and here are some highlights (for us):

  • Days 22-35: Dove into the competition, set up the GitHub repo and S3 for collaboration, and climbed up the leaderboard quickly.
  • Days 41-45: Second spurt.  Dug into GBM and NN models.  Michael’s baby girl was born on Day 48.
  • Days 53-60: Last spurt.  Ensembled all models.  Improved our score every day, but didn’t have time to train the best models.

Motion Chart - Deloitte Churn Prediction Leaderboard

Once you click the image above, it will show a motion chart where:

  • X-axis: Competition day.  From day 0 to day 60.
  • Y-axis: AUC score.
  • Colored circles: each team.  Clicking a circle shows which team it represents.
  • Rightmost legend: competition day.  You can drag the number up and down to see the chart on a specific day.

The initial positions of the circles show the scores of each team’s first submission.

For the chart, I reused the rCharts code published by Tony Hirst on GitHub: https://github.com/psychemedia (he also wrote a tutorial on his blog about creating a motion chart using rCharts).
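If you would rather build a similar animated leaderboard in Python, a minimal sketch with Plotly Express is below.  This is only an illustrative alternative, not the code behind the chart above, and the leaderboard data frame with its team, day, and auc columns is a made-up stand-in.

```python
# Minimal sketch (not the original rCharts code): an animated leaderboard
# with Plotly Express. The data frame below is a made-up stand-in; in practice
# you would load one row per team per day with that team's best AUC so far.
import pandas as pd
import plotly.express as px

leaderboard = pd.DataFrame({
    "team": ["A", "A", "A", "B", "B", "B"],
    "day":  [0, 30, 60, 0, 30, 60],
    "auc":  [0.60, 0.72, 0.78, 0.62, 0.70, 0.80],
})

# One frame per competition day; each circle is one team at (day, AUC),
# mirroring the X-axis, Y-axis, and day slider described above.
fig = px.scatter(
    leaderboard, x="day", y="auc", color="team",
    animation_frame="day", range_x=[0, 60], range_y=[0.5, 0.9],
)
fig.show()
```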

Closing

We took a rain check on this one, but we will win next time!  🙂


Machine Learning as a Service vs Feature – BigML vs. Infer


This is a personal follow-up to the LA Machine Learning meetup (David Gerster @ eHarmony) on October 8th, 2013.

David Gerster, VP of Data Science at BigML, also former Director of Data Science at Groupon, gave an overview of BigML’s Machine Learning (ML) platform for predictive analytics. Here are my thoughts about its business model and other alternatives offering ML.

1. BigML

BigML offers so-called Machine-Learning-as-a-Service (MLaaS), which allows users (e.g. a wine seller) to upload their data (e.g. historical wine sales) and get predictions for the variable of interest (e.g. the right prices for new wines, or expected future sales), while shielding them from the complicated Machine Learning algorithms that are the key components of predictive analytics.

2. Machine Learning as a Service

As of November 2013, there are a few start-up companies, plus Google, offering (or claiming to offer) MLaaS besides BigML.

MLaaS is similar to Analytics-as-a-Service (AaaS) in that it delivers the power of predictive analytics to users, but it differs from AaaS in that it leaves the data ETL (Extract-Transform-Load) and problem identification steps to users.

The pros and cons of MLaaS over AaaS would be:

  • Pros – Cheaper: $150-300 / month for MLaaS from BigML vs. $200-300 / hour / person (+ hardware and licensing fees) for AaaS from analytics consulting firms.
  • Cons – Harder: ETL and problem identification are by far the hardest parts of predictive modeling.  No matter how good your algorithm is, garbage in leads to garbage out (ETL), and aiming at the wrong target leads to wrong predictions (problem definition).

If you know what you’d like to predict and how to clean up your data, then MLaaS may be the right solution for you.  However, I suspect that is not the case for many prospective users of MLaaS (those who are inexperienced in data analytics).

3. Machine-Learning-as-a-Feature, Infer.com

Another business model, besides MLaaS and AaaS, for providing users with the power of ML for their predictive modeling needs is Machine-Learning-as-a-Feature (MLaaF; don’t google it, I just made it up).

Infer, a startup founded in 2010, is on this track, offering ML plugins for popular CRM software (Salesforce, Marketo, and Eloqua) to predict sales leads.

By focusing on a specific need (sales lead prediction) and specific users (those who use popular CRM software), Infer manages to deliver the power of ML with a painless user experience and an affordable price tag.

Closing Thought

To be fair, BigML’s user interface and visualization are quite impressive, and it is equipped with the Random Forest algorithm, one of the most popular algorithms with good out-of-the-box performance.  For data scientists who do not have much ML experience, it is worth trying out.

However, I believe that for most users, a well-defined MLaaF will work better than MLaaS (investors seem to agree, given that Infer raised $10M in funding compared to BigML’s $1M, according to their CrunchBase profiles).


Data Science Career for Neuroscientists + Tips for Kaggle Competitions


Recently Prof. Konrad Koerding at Northwestern University asked on Facebook for advice for one of his Ph.D. students, who studies Computational Neuroscience but wants to pursue a career in Data Science.  It reminded me of the time I was looking for such opportunities, so I shared my thoughts (now posted on his lab’s webpage here).  I decided to post it here too (with a few fixes) so that it can help others.

First, I’d like to say that Data Science is a relatively new field (like Computational Neuroscience), and you don’t need to feel bad about making the transition after your Ph.D.  When I went out to the job market, I didn’t have any analytics background at all either.

I started my industrial career at an analytics consulting company, Opera Solutions in San Diego, where one of Nicolas‘ friends, Jacob, runs the company’s R&D team.  Jacob also did his Ph.D. in Computational Neuroscience, under the supervision of Prof. Michael Arbib at the University of Southern California.  During the interview, I was tested on my thought process, basic knowledge of statistics and Machine Learning, and programming, all of which I had practiced every day throughout my Ph.D.

So, if he has a good Machine Learning background along with programming skills (I’m sure he does, given that he’s your student), he is well equipped to pursue a career in Data Science.

Tools in Data Science

Back in graduate school, I mostly used MATLAB, with some SPSS and C.  In the Data Science field, Python and R are the most popular languages, and SQL is a kind of necessary evil.

R is similar to MATLAB except that it’s free.  It is not a hardcore programming language and doesn’t take much time to learn.  It comes with the latest statistical libraries and provides powerful plotting functions.  There are many IDEs that make it easy to use R, but my favorite is RStudio.  If you run R on a server with RStudio Server, you can access it from anywhere via your web browser, which is really cool.  Although the native R plotting functions are excellent by themselves, the ggplot2 library provides even more eye-catching visualization.

For Python, the NumPy and SciPy packages provide vector-matrix computation functionality similar to MATLAB’s.  For Machine Learning algorithms you need scikit-learn, and for data handling, pandas will make your life easy.  For debugging and prototyping, the IPython Notebook is really handy and useful.
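As a rough sketch of how these pieces fit together (using synthetic data rather than any particular competition’s), a minimal workflow might look like this:

```python
# Minimal sketch of a Python workflow with NumPy, pandas, and scikit-learn.
# The data is synthetic; in practice you would load a CSV with pandas instead.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# pandas: hold the data in a DataFrame, as you would after pd.read_csv().
df = pd.DataFrame({"x1": rng.normal(size=1000), "x2": rng.uniform(size=1000)})
df["y"] = ((df["x1"] + df["x2"] + rng.normal(scale=0.5, size=1000)) > 1).astype(int)

# scikit-learn: split, fit a model, and check accuracy on held-out data.
X_train, X_valid, y_train, y_valid = train_test_split(
    df[["x1", "x2"]], df["y"], test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Validation accuracy:", model.score(X_valid, y_valid))
```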

SQL is an old technology but still widely used.  Most data are stored in data warehouses that can be accessed only via SQL or SQL equivalents (Oracle, Teradata, Netezza, etc.).  Postgres and MySQL are powerful yet free, so they are perfect to practice with.
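To practice SQL without setting up a data warehouse, Python’s built-in sqlite3 module together with pandas is enough; the table and query below are only a toy example.

```python
# Toy example of practicing SQL from Python with the built-in sqlite3 module.
# A real warehouse (Postgres, MySQL, etc.) would only change the connection.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE sales (customer_id INTEGER, amount REAL);
    INSERT INTO sales VALUES (1, 10.0), (1, 25.0), (2, 7.5);
""")

# Run a SQL query and get the result back as a pandas DataFrame.
query = """
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales
    GROUP BY customer_id
"""
print(pd.read_sql_query(query, conn))
conn.close()
```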

Hints for Kaggle Data Mining Competitions

Fortunately, I had a chance to work with many top competitors, such as the 1st and 2nd place teams at the Netflix competitions, and to learn how they approach competitions.  Here are some tips I found helpful.

1. Don’t jump into algorithms too fast.

Spend enough time to understand the data.  Algorithms are important, but no matter how good an algorithm you use, garbage in only leads to garbage out.  Many classification/regression algorithms assume Gaussian-distributed variables and fail to make good predictions if you feed them non-Gaussian-distributed variables.  So standardization, normalization, non-linear transformation, discretization, and binning are very important.
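For example, a minimal scikit-learn sketch of some of these steps (a log transform for a skewed feature, standardization, and binning), with made-up feature values, might look like this:

```python
# Minimal sketch of common feature transformations with NumPy and scikit-learn.
# The features (income, age) are synthetic and only for illustration.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

rng = np.random.RandomState(0)
income = rng.lognormal(mean=10, sigma=1, size=(1000, 1))  # heavily right-skewed
age = rng.normal(loc=40, scale=12, size=(1000, 1))

# Non-linear transformation: log1p tames the heavy right tail of income.
income_log = np.log1p(income)

# Standardization: zero mean and unit variance, which many models expect.
income_std = StandardScaler().fit_transform(income_log)

# Discretization / binning: turn age into 5 ordinal buckets.
age_binned = KBinsDiscretizer(n_bins=5, encode="ordinal").fit_transform(age)

print(income_std.mean(), income_std.std(), np.unique(age_binned))
```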

2. Try different algorithms and blend.  

There is no universally optimal algorithm.  Most of the time (if not always), the winning solutions are ensembles of many individual models built with tens of different algorithms.  Combining different kinds of models can improve prediction performance a lot.  For individual models, I found Random Forest, Gradient Boosting Machine, Factorization Machine, Neural Network, Support Vector Machine, logistic/linear regression, Naive Bayes, and collaborative filtering to be the most useful.  Gradient Boosting Machine and Factorization Machine are often the best individual models.
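A minimal sketch of this kind of blending, using synthetic data and a plain average of predicted probabilities, could look like the following (real winning ensembles are of course far more elaborate):

```python
# Minimal sketch of blending: average the predicted probabilities of a few
# diverse models. The data is synthetic; replace it with your competition data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

models = [
    RandomForestClassifier(n_estimators=200, random_state=0),
    GradientBoostingClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
]

# Fit each model and collect its predicted probability of the positive class.
preds = []
for model in models:
    model.fit(X_tr, y_tr)
    p = model.predict_proba(X_va)[:, 1]
    preds.append(p)
    print(type(model).__name__, "AUC:", roc_auc_score(y_va, p))

# Simple unweighted blend: the mean of the individual predictions.
blend = np.mean(preds, axis=0)
print("Blend AUC:", roc_auc_score(y_va, blend))
```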

3. Optimize at last.

Each competition has a different evaluation metric, and optimizing your algorithms for that metric improves your chance of winning.  The two most popular metrics are RMSE and AUC (area under the ROC curve).  An algorithm optimized for one metric is not optimal for the other.  Many open-source implementations provide only RMSE optimization, so for AUC (or any other metric) you need to implement the optimization yourself.
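As a small illustration of why the metric matters, the toy example below scores two sets of hand-made predictions with both RMSE and AUC: the one with the better RMSE is not the one with the better AUC, because AUC depends only on how the predictions are ranked.

```python
# Toy example showing that RMSE and AUC can disagree: AUC only depends on the
# ranking of the predictions, while RMSE depends on their actual values.
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])

# Model A: perfect ranking, but all probabilities clustered around 0.5.
pred_a = np.array([0.45, 0.46, 0.47, 0.48, 0.49, 0.50])
# Model B: values closer to 0 and 1 on average, but one badly mis-ranked pair.
pred_b = np.array([0.10, 0.20, 0.80, 0.30, 0.90, 0.95])

for name, pred in [("A", pred_a), ("B", pred_b)]:
    rmse = np.sqrt(mean_squared_error(y_true, pred))
    auc = roc_auc_score(y_true, pred)
    print(f"Model {name}: RMSE={rmse:.3f}, AUC={auc:.3f}")
# Model A wins on AUC (1.000 vs ~0.889) while Model B wins on RMSE.
```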


For Teenagers to Learn Programming – Where and How to Start

code monkey, code!

When I talk to teenagers or their parents, I always recommend that teenagers learn a programming language because:

  • It is a fun toy like LEGO that allows you to create whatever you like instead of playing with what others created.
  • It is a universal language like music, arts, and sports that allows you to communicate with and reach out to other people around the world. 
  • It is an effective tool that increases your productivity to infinity by automating and delegating your work to a computer (or a “cloud” of computers).
  • Since it is a language like other spoken human languages, the earlier you start, the easier and faster you can learn.

In return, I’ve often been asked how and where to start.  Here are some good resources to start with:

1. Get Motivated

  • Leaders and trend-setters all agree on one thing – Leaders in business, government, arts and entertainment, and education share why they think programming is important.
  • Why software is eating the world – Wall Street Journal article by Marc Andreessen, one of the top venture capitalists, about how and why the software industry has been taking over other industries such as healthcare, finance, telecom, arts and entertainment, etc.

2. Resources

3. Which language to learn

Here I list some languages that, I think, are relatively easy to learn, quick to produce working results, and useful even after you become a professional (except Scratch).

  • For kids to get familiar with programming – Scratch, which is designed for ages 8 to 16 to create programs without using programming languages (check out the TED video below to see what it looks like).
  • For high schoolers or beginners with an interest in developing working programs – Python, which is an easy yet powerful language that can cover almost every vertical (from web pages to scientific programs).
  • For designers with an interest in developing interactive artworks – Processing, which offers beautiful and interactive visualization capabilities.
  • For those who are interested in finance or business intelligence – R, which was originally developed as a statistical computing language but is widely used for predictive modeling, data exploration, and data visualization.

I hope everyone who comes across this article gives programming a try and discovers how fun and useful it is. 🙂


Faith and Suffering

Inspired by today’s Family Life Today, “Why Me”.

Gerald Sittser lost his wife, mother and daughter in an accident that left him wondering, “Why?”. Jerry came to realize that God didn’t promise him a pain-free life, but promised instead to be with him in his loss and suffering.


We hold a simple faith that if we keep our faith and live rightly, God will allow us a good life, one without suffering; simple in the sense that we are not hoping, as some people do, for the “blessing” of great “success.”  So when we meet hardship in life, we first look for what we did wrong, and we try to escape the suffering by correcting it.

In many cases, however, we face what might be called undeserved suffering even though we have done nothing particularly wrong: illness, accidents, failure that follows diligence, hardship that follows doing the right thing.  Then we are bewildered, we ask “Why me?”, and we pray, sometimes even argue, that God would set the wrong situation right.

This reaction comes from the perception that suffering is abnormal, that it is something wrong.

In the Bible, however, suffering is shown to be something our lives truly need.  The Bible says that suffering naturally follows those who believe.

  • Join with me in suffering for the gospel, by the power of God. (from 2 Timothy 1:8)
  • It is better, if it is God’s will, to suffer for doing good. (from 1 Peter 3:17)

What God has promised us is the Holy Spirit and the resurrection, not a life without suffering.  Instead, the Bible says that God is our comfort in the midst of suffering and that we will be rewarded later (or after the resurrection).  It also says that suffering itself is beneficial, because through it we become more mature and able to comfort others who suffer.

  • Who comforts us in all our troubles, so that we can comfort those in any trouble with the comfort we ourselves receive from God. (2 Corinthians 1:4)
  • This is my comfort in my suffering, for your word has given me life. (Psalm 119:50)
  • Our present sufferings are not worth comparing with the glory that will be revealed in us. (Romans 8:18)
  • Before I was afflicted I went astray, but now I keep your word.  It was good for me to be afflicted, so that I might learn your statutes. (Psalm 119:67, 71)

I, too, met God in the hardest period of my life, a God who knew my sorrow and pain and who comforted and encouraged me.  And after overcoming that period, I found myself incomparably more mature and stronger than before.  The suffering in my life is a precious asset that let me experience God more closely and drew me nearer to Him.

  • But he knows the way that I take; when he has tested me, I shall come forth as gold. (Job 23:10)

Then how should we respond when we face suffering in life?

The Bible tells us to endure and to pray when we suffer.

  • Be joyful in hope, patient in affliction, faithful in prayer. (Romans 12:12)
  • Is anyone among you suffering? Let him pray.  Is anyone cheerful? Let him sing praise. (James 5:13)

Paul, arguably the most afflicted figure in the New Testament, tells us to rejoice in every circumstance and confesses that he himself rejoices even in suffering.

  • Rejoice in the Lord always.  I will say it again: rejoice! (Philippians 4:4)
  • We also rejoice in our sufferings, because we know that suffering produces perseverance; perseverance, character; and character, hope. (Romans 5:3-4)

Paul says that the secret of his constant joy is not to be anxious but to give thanks and to pray, and that when he does, the peace of God guards his thoughts and his heart.  He also confesses that because of this he can do all things, whatever circumstances he is in.

  • Do not be anxious about anything, but in every situation, by prayer and petition, with thanksgiving, present your requests to God.  And the peace of God, which transcends all understanding, will guard your hearts and your minds in Christ Jesus. (Philippians 4:6-7)
  • I know what it is to be in need, and I know what it is to have plenty.  I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want.  I can do all this through him who gives me strength. (Philippians 4:12-13)

What a wonderful confession this is.

Finally, as Jesus closes the Sermon on the Mount, He says that both the house built on the rock (the person who lives by His word) and the house built on the sand (the person who does not) will face the rain, the wind, and the flood.

  • Everyone who hears these words of mine and puts them into practice is like a wise man who built his house on the rock.  The rain came down, the streams rose, and the winds blew and beat against that house; yet it did not fall, because it had its foundation on the rock.  But everyone who hears these words of mine and does not put them into practice is like a foolish man who built his house on sand.  The rain came down, the streams rose, and the winds blew and beat against that house, and it fell with a great crash. (Matthew 7:24-27)

But He says that the house built on the rock stands firm even through the rain, the wind, and the flood.

I hope that when we meet suffering in our lives, we can all respond with composure like Paul and stand firm like the house built on the rock.

To that end, I hope we live by the word when there is no suffering, and respond to suffering not with anxiety but with prayer and thanksgiving.

Restarting the Blog

A while ago I cleaned up all the domains and web services I had bought over the years, keeping only the oldest ones, ethiel.org and youngnjeong.com.  In the process I gathered my old posts, few as they are, that had been scattered here and there, and collected them in this place.

127 posts, from 1994 to 2010.

Sometimes I wrote only once over several years; sometimes I wrote diligently for days in a row.  It is not much for such a long stretch of time, but seeing the posts gathered like this makes me feel as if I have found a great treasure.

Writing has always felt burdensome to me, as if it lays bare how little I have to offer.  Lately, though, I have come to think that my future self may want to look back on, and search through, even this meager store.

I don't know how much more there will be, but from time to time I intend to come back here and leave traces of my life and of my thoughts, meager when they are meager, abundant when they are abundant.

Good night.

Prayer of Serenity by Reinhold Niebuhr

God, grant me the serenity
to accept the things I cannot change;
courage to change the things I can;
and wisdom to know the difference.

Living one day at a time;
Enjoying one moment at a time;
Accepting hardships as the pathway to peace;

Taking, as He did, this sinful world
as it is, not as I would have it;
Trusting that He will make all things right
if I surrender to His Will;

That I may be reasonably happy in this life
and supremely happy with Him
Forever in the next.
Amen.

–Reinhold Niebuhr

Amen

What is leadership – Jack Welch in 1979

Leadership is

helping the people around you,

even if they do not always take the initiative,
to work harder
and to enjoy their work more,

and, in the end,
by helping them achieve more than they thought possible,
to gain more respect for and confidence in themselves.

From a letter to former chairman Reg Jones in 1979,
when Welch was competing to become GE's next CEO.

Mentoring diary – What Is Success

Recently God has led me to places where I meet awesome mentors and many mentees in Christ.  He has also given me ideas about how to mentor mentees.  I think God wants to use me either to be a good mentor to mentees or to help them meet good mentors.  I want to share some of those ideas here.

First, I want to talk about what success is, which I learned from Elder Brian Chun’s sermon at KPM.

Why do we mentor mentees?  Why do mentees seek mentoring?  I would say it is to lead mentees to succeed in their lives.  Then, what is success?  Depending on the definition of success, the way we mentor should differ.  As Christians, our success should not be the same as success in the world: wealth, fame, and prosperity.

Elder Chun defines the success of Christians as being faithful to God’s calling.  It is a process rather than a result.  It is about the Giver rather than the receiver of talents.

If we read the Bible, some generations get only a brief description, or even none at all.  There must have been people in those generations who were successful in the world’s terms as well, but God does not count that.  It was not important to God.

We all, as Christians, want to be called “good and faithful servants” when we stand before God at the end of our lives.  To do that, we first need to examine our perception of success.

I think that’s the very first thing to remind mentees of.