Lean In

I finished Sheryl Sandberg's “Lean In” as my second book of 2018.

When the book first came out and was the talk of the town, I passed on it, thinking it was “a book for women,” but I finally picked it up after my wife recently quit her job to take care of our kids.

The author, Facebook's COO, shares personal stories and research on the discrimination, bias, and self-blame that women face at work and at home, and offers practical advice toward a society where women run half of our workplaces and men run half of our homes.

While pregnant with her first child at Google, Sheryl complained to her husband Dave that it was hard to park far away in a full lot and walk all the way to the office. Dave told her that his employer, Yahoo, had reserved parking spots for expectant mothers. The next day Sheryl went to Google co-founder Sergey and told him that pregnancy parking was needed; Sergey agreed right away and admitted that, until then, he had “never thought about it.”

Through this episode, Sheryl says she was ashamed that she herself had never thought about pregnancy parking until she experienced the need firsthand, and she began to wonder why other pregnant women had not spoken up.

I, too, was ignorant of the problems and worries my wife faced between work and home.

When our first child was born, I had just landed my first job in the US, while my wife, a tax accountant with six years of experience, was already well established at Deloitte. After giving birth she took maternity leave and soon moved to an accounting position at a small company three minutes from home. Even that she quit after a year or so to become a full-time homemaker. Later, after our triplets were born and we had fought through the whirlwind of raising them, she took a job at Honda near our home, only to quit again a year later.

Every time my wife made one of these decisions that put family before her career, I felt grateful and thought her talent was going to waste, but not once did I seriously ask myself, “Why is it my wife, and not me, who has to make this decision?”

Last year, after I moved to Microsoft and started working from home, I resolved for the first time to take on more of the work at home so that my wife could focus properly on her job. But once I actually started, it was more than I could handle, and within a few months I threw up my hands. I could not cover everything at home without seriously sacrificing my work, and I had neither the readiness nor the courage to make that sacrifice. And once again, it was my wife who ended up making that hard decision.

(From “Descendants of the Sun”: “That difficult thing, I keep pulling it off. Me.”)

Reading “Lean In,” I felt regretful, ashamed, and sorry.

Starting now, I need to pull myself together and try again. I will help and support my wife so that the day comes when I carry half of our home and she carries half of a workplace, and so that both of us can realize our full potential at home and at work.

Kaggler.com

I started a new blog, Kaggler.com, to write mainly about Data Science competitions at Kaggle.

I’ve enjoyed participating in Kaggle competitions since 2011.  In every competition, I learned new things – new algorithms (Factorization Machine, Follow-the-Regularized-Leader), new tools (Vowpal Wabbit, XGBoost), and/or new domains.  It has been really helpful for keeping me up to date in the fast-evolving fields of Machine Learning and Data Science.

With Kaggler.com, I’d like to share what I have learned and experienced with others.  I hope it can be useful to someone.

60 Day Journey of Deloitte Churn Prediction Competition

Competition

Last December, I teamed up with Michael once again to participate in the Deloitte Churn Prediction competition at Kaggle, where the goal was to predict which customers would leave an insurance company in the next 12 months.

It was a master competition, open only to master-level Kagglers (the top 0.2% of 138K competitors), with $70,000 in cash prizes for the top 3 finishers.

Result

We managed to do well and finished in 4th place out of 37 teams, even though we did not have much time due to projects at work and family events (especially for Michael, who became a dad during the competition).

Although we fell a little short of the prizes, it was a fun experience working together with Michael, competing with other top competitors across the world, and climbing the leaderboard day by day.

Visualization

I visualized our 60-day journey through the competition below, and here are some highlights (for us):

  • Day 22-35: Dived into the competition, set up the GitHub repo and S3 for collaboration, and climbed up the leaderboard quickly.
  • Day 41-45: Second spurt.  Dug into GBM and NN models.  Michael’s baby girl was born on Day 48.
  • Day 53-60: Last spurt.  Ensembled all models.  Improved our score every day, but didn’t have time to train the best models.

Motion Chart - Deloitte Churn Prediction Leaderboard

Once you click the image above, it will show a motion chart where:

  • X-axis: Competition day.  From day 0 to day 60.
  • Y-axis: AUC score.
  • Colored circles: Each circle is a team.  Clicking a circle shows which team it represents.
  • Rightmost legend: Competition day.  You can drag the number up and down to see the chart on a specific day.

Initial positions of circles show the scores of their first submissions.

For the chart, I reused the rCharts code published by Tony Hirst on GitHub: https://github.com/psychemedia (he also wrote a tutorial on his blog about creating a motion chart using rCharts).

Closing

We took a rain check on this, but will win next time!  🙂


Machine Learning as a Service vs Feature – BigML vs. Infer


This is a personal follow-up to the LA Machine Learning meetup (“David Gerster @ eHarmony”) on October 8th, 2013.

David Gerster, VP of Data Science at BigML and former Director of Data Science at Groupon, gave an overview of BigML’s Machine Learning (ML) platform for predictive analytics. Here are my thoughts on its business model and on other alternatives for offering ML.

1. BigML

BigML offers so-called Machine-Learning-as-a-Service (MLaaS), which allows users (e.g., a wine seller) to upload their data (e.g., historical wine sales) to BigML and get predictions for the variable of interest (e.g., the right prices for new wines, or expected future sales) while shielding them from the complicated Machine Learning algorithms that are the key components of predictive analytics.

2. Machine Learning as a Service

As of November 2013, there are a few start-up companies, plus Google, offering (or claiming to offer) MLaaS other than BigML:

MLaaS is similar to Analytics-as-a-Service (AaaS) in that it delivers the power of predictive analytics to users, but it differs from AaaS in that it leaves the data ETL (Extract-Transform-Load) and problem identification steps to the users.

The pros and cons of MLaaS over AaaS would be:

  • Pros – Cheaper: $150-300 / month for MLaaS from BigML vs. $200-300 / hour / person (+ hardware, licensing fees) for AaaS from analytics consulting firms
  • Cons – Harder: ETL and problem identification are by far the hardest parts of predictive modeling.  No matter how good your algorithm is, garbage in leads to garbage out (ETL), and aiming at the wrong target leads to wrong predictions (problem definition).

If you know what you’d like to predict and how to clean up your data, then MLaaS could be the right solution for you.  However, for many prospective users of MLaaS (those who are inexperienced in data analytics), I suspect that is not the case.

3. Machine-Learning-as-a-Feature, Infer.com

Another business model, besides MLaaS and AaaS, for providing users with the power of ML for their predictive modeling needs is Machine-Learning-as-a-Feature (MLaaF; don’t google it, I just made it up).

Infer, a startup founded in 2010, is on this track and offers ML plugins for popular CRM software (Salesforce, Marketo, and Eloqua) to predict sales leads.

By focusing on a specific need (sales lead prediction) and specific users (those who use popular CRM software), Infer manages to provide the power of ML with a painless user experience and an affordable price tag.

Closing Thought

To be fair, BigML’s user interface and visualization are quite impressive, and it is equipped with the Random Forest algorithm, one of the most popular algorithms with good out-of-the-box performance.  For data scientists who do not have much ML experience, it is worth trying out.

However, I believe that for most users, the power of ML is better delivered as a well-defined MLaaF than as MLaaS (investors seem to agree with me: according to their CrunchBase profiles, Infer has raised $10M in funding compared to BigML’s $1M).


Data Science Career for Neuroscientists + Tips for Kaggle Competitions


Recently, Prof. Konrad Koerding at Northwestern University asked on his Facebook for advice for one of his Ph.D. students, who studies Computational Neuroscience but wants to pursue a career in Data Science.  It reminded me of the time when I was looking for such opportunities, so I shared my thoughts (now posted on his lab’s webpage here).  I decided to post them here too (with a few fixes) so that they can help others.

First, I’d like to say that Data Science is a relatively new field (like Computational Neuroscience), and you don’t need to feel bad about making the transition after your Ph.D.  When I went out to the job market, I didn’t have any analytics background at all either.

I started my industry career at an analytics consulting company, Opera Solutions in San Diego, where one of Nicolas‘ friends, Jacob, runs the company’s R&D team.  Jacob also did his Ph.D. in Computational Neuroscience, under the supervision of Prof. Michael Arbib at the University of Southern California.  During the interview, I was tested on my thought process, basic knowledge of statistics and Machine Learning, and programming, all of which I had practiced every day throughout my Ph.D.

So, if he has a good Machine Learning background and programming skills (I’m sure he does, given that he’s your student), he is well positioned to pursue a career in Data Science.

Tools in Data Science

Back in graduate school, I used mostly MATLAB, with some SPSS and C.  In Data Science, Python and R are the most popular languages, and SQL is a kind of necessary evil.

R is similar to MATLAB except that it’s free.  It is not a hardcore programming language and doesn’t take much time to learn.  It comes with the latest statistical libraries and provides powerful plotting functions.  There are many IDEs that make R easy to use, but my favorite is R Studio.  If you run R on a server with R Studio Server, you can access it from anywhere via your web browser, which is really cool.  Although the native R plotting functions are excellent by themselves, the ggplot2 library provides more eye-catching visualizations.

For Python, the NumPy and SciPy packages provide vector-matrix computation functionality similar to MATLAB’s.  For Machine Learning algorithms, you need Scikit-Learn, and for data handling, Pandas will make your life easy.  For debugging and prototyping, the IPython Notebook is really handy and useful.
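
As a minimal sketch (not from the original post; the file name and column names below are made up for illustration), here is how those packages typically fit together:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a (hypothetical) CSV file into a Pandas DataFrame.
df = pd.read_csv("train.csv")

# Split into features and a (made-up) target column.
X = df.drop(columns=["target"])
y = df["target"]

# Hold out a validation set and fit a Scikit-Learn model.
X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_trn, y_trn)
print(clf.score(X_val, y_val))
```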

SQL is an old technology but still widely used.  Most data is stored in data warehouses, which can be accessed only via SQL or SQL equivalents (Oracle, Teradata, Netezza, etc.).  Postgres and MySQL are powerful yet free, so they are perfect to practice with.
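
If you want to try SQL before setting up a database server, a minimal sketch (my own illustration; the table and column names are made up) using Python’s built-in SQLite and Pandas could look like this:

```python
import sqlite3
import pandas as pd

# SQLite ships with Python, so it is an easy way to practice SQL locally
# before moving on to Postgres or MySQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, 10.0), (1, 25.0), (2, 7.5)])

# Pull an aggregated result straight into a Pandas DataFrame.
df = pd.read_sql("SELECT customer_id, SUM(amount) AS total "
                 "FROM sales GROUP BY customer_id", con)
print(df)
```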

Hints for Kaggle Data Mining Competitions

Fortunately, I have had the chance to work with many top competitors, such as members of the 1st and 2nd place teams in the Netflix competition, and to learn how they approach competitions.  Here are some tips I found helpful.

1. Don’t jump into algorithms too fast.

Spend enough time understanding the data.  Algorithms are important, but no matter how good an algorithm you use, garbage in only leads to garbage out.  Many classification/regression algorithms assume Gaussian-distributed variables and fail to make good predictions if you feed them non-Gaussian-distributed variables.  So standardization, normalization, non-linear transformation, discretization, and binning are very important.
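
As an illustration (a minimal sketch with made-up data, not code from any actual competition), the kinds of transformations mentioned above look like this in Scikit-Learn:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer

# A made-up, heavily skewed (non-Gaussian) feature.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Non-linear (log) transformation to reduce the skew.
x_log = np.log1p(x)

# Standardization: zero mean, unit variance.
x_std = StandardScaler().fit_transform(x_log)

# Discretization / binning into quantile-based buckets.
x_bin = KBinsDiscretizer(n_bins=10, encode="ordinal",
                         strategy="quantile").fit_transform(x)
```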

2. Try different algorithms and blend.  

There is no universally optimal algorithm.  Most of the time (if not always), the winning solutions are ensembles of many individual models built with tens of different algorithms.  Combining different kinds of models can improve prediction performance a lot.  For individual models, I found Random Forest, Gradient Boosting Machine, Factorization Machine, Neural Network, Support Vector Machine, logistic/linear regression, Naive Bayes, and collaborative filtering to be the most useful.  Gradient Boosting Machine and Factorization Machine are often the best individual models.
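
The simplest form of blending is averaging the predictions of several different models.  Here is a minimal sketch of that idea (the model choices and synthetic data are illustrative only, not any actual competition pipeline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data, just to make the sketch runnable.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

models = [
    GradientBoostingClassifier(random_state=42),
    RandomForestClassifier(n_estimators=200, random_state=42),
    LogisticRegression(max_iter=1000),
]

# Blend by averaging the predicted probabilities of the individual models.
preds = np.mean([m.fit(X_trn, y_trn).predict_proba(X_val)[:, 1] for m in models],
                axis=0)
print("Blend AUC:", roc_auc_score(y_val, preds))
```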

3. Optimize at last.

Each competition has a different evaluation metric, and optimizing your algorithms for that metric can improve your chances of winning.  The two most popular metrics are RMSE and AUC (area under the ROC curve).  An algorithm optimized for one metric is not optimal for the other.  Many open-source implementations provide only RMSE optimization, so for AUC (or another metric) you may need to implement the optimization yourself.
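
To see why the metric matters, here is a minimal sketch (with made-up labels and predictions) where one set of predictions wins on RMSE while the other wins on AUC:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

# Made-up binary labels and two sets of predicted probabilities.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
pred_a = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.6, 0.3])
pred_b = np.array([0.3, 0.3, 0.4, 0.6, 0.6, 0.3, 0.5, 0.3])

# pred_a has the lower RMSE, but pred_b ranks every positive above every
# negative and therefore achieves the perfect AUC of 1.0.
for name, pred in [("A", pred_a), ("B", pred_b)]:
    rmse = np.sqrt(mean_squared_error(y_true, pred))
    auc = roc_auc_score(y_true, pred)
    print(name, "RMSE:", round(rmse, 4), "AUC:", round(auc, 4))
```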


For Teenagers to Learn Programming – Where and How to Start

code monkey, code!

When I talk to teenagers or their parents, I always recommend that teenagers learn a programming language because:

  • It is a fun toy like LEGO that allows you to create whatever you like instead of playing with what others created.
  • It is a universal language like music, arts, and sports that allows you to communicate with and reach out to other people around the world. 
  • It is an effective tool that increases your productivity to infinity by automating and delegating your work to a computer (or a “cloud” of computers).
  • Since it is a language, like other spoken human languages, the earlier you start, the easier and faster you can learn it.

In return, I’ve been asked how and where to start.  Here are some good resources to start with:

1. Get Motivated

  • Leaders and trend-setters all agree on one thing – Leaders in business, government, arts and entertainment, and education share why they think programming is important.
  • Why software is eating the world – a Wall Street Journal article by Marc Andreessen, one of the top venture capitalists, about how and why the software industry has been taking over other industries such as healthcare, finance, telecom, arts and entertainment, etc.

2. Resources

3. Which language to learn

Here I list some languages that, I think, are relatively easy to learn, quick to produce working results, and useful even after you turn professional (except Scratch).

  • For kids to get familiar with programming – Scratch, which is designed for ages 8 to 16 to create programs without writing code in a textual programming language (check out the TED video below to see what it looks like).

[ted id=1657]

  • For high schoolers or beginners with interests in developing working programs – Python, which is an easy yet powerful language that can cover almost every vertical (from web pages to scientific programs).
  • For designers with interests in developing interactive art works – Processing, which offers beautiful and interactive visualization capabilities.
  • For those who are interested in finance or business intelligence – R, which was originally developed as a statistical computing language but is widely used in predictive modeling, data exploration, and data visualization.

I hope everyone who comes across this article gives programming a try and discovers how fun and useful it is. 🙂


[Scrap] The PhD Attitude

Below is a post that Janghyun put up on our friends’ board.

Listening to Mario Gabelli explain what he looks for in people, you can tell that the poverty he experienced as a child actually became a major stepping stone toward living life in earnest.

“I hire Ph.D.s. The people I want are Poor, Hungry, and Driven to succeed.”

I looked up the original text and found the passage below. So, let’s get the PhD attitude. But keep the ‘P’ and the ‘H’ only until graduation -_-;

“Get the PHD Attitude”
Excerpted from “Success is a Choice” © 1997 by Rick Pitino

When I was helping Jamal Mashburn, one of my top players, find someone to manage his money a couple of years back, I was looking for someone who was conservative, who had a lot of experience, and who had withstood tough times. My research took me to a man named Mario Gabelli, who handles millions of dollars in corporate investments and endowments.

“Mario gave me a tour of his company in Rye, New York; he must have had between seventy-five and one hundred employees in the back room all handling various accounts. I asked him what he looks for in hiring people, how he created his employee base. Was it a Wharton diploma? Harvard Business School? What was it?

“I hire PHDs,” he said.

“I don’t understand,” I said. “I would think in your business you would hire people with expertise in managing money, not PHDs.”

“Not in an academic sense,” he said. “I’m looking for Poor, Hungry and Driven people.”

I’ve never forgotten Mario’s phrase, and now I, too, look for people who are poor, hungry, and driven.

Now I’m not talking about people being poor economically. I’m talking about being poor in terms of knowledge, about people who are constantly searching to learn more, to find more wisdom. And hungry in this context refers to those with a tremendous desire to succeed, people who won’t ever be satisfied with an ordinary level of accomplishment. And driven people are the ones who set ambitious goals and then pursue them with real ferocity.”

WordPress – Spiffty 2.0 Theme Black

I modified the Spiffty 2.0 theme I am currently using into a black color scheme.

It was fairly easy to do by editing only the style.css file and the images under the ./img folder.

To use the Spiffty 2.0 Black theme you see here:

  1. Install the Spiffty 2.0 theme, then
  2. Download spiffty-20-black-patch.tar.gz and extract it into that theme’s folder.

It probably isn’t needed right now, but I am also uploading a screenshot in case I switch themes later.

screenshot.gif

New Plugins for WordPress

These days I keep wavering between TiStory and WordPress.

One day I post on TiStory, the next day I post on WordPress, and then I end up trying to update both…

Today was apparently a WordPress update day.

It is now 4:40am. My excuse is that I still haven’t gotten over the jet lag… cough;

Anyway, below are the plugins/themes I newly installed and applied today.

Well then, that’s it for now.

As a bonus, a Tag Cloud

{flashtagcloud}