Life Is a Vector

Life is a vector.  It is not only about “how hard you live” (magnitude); “how you live” (direction) matters even more.

I go by the name of Vector. It’s a mathematical term, represented by an arrow with both direction and magnitude. Vector! That’s me, because I commit crimes with both direction and magnitude. Oh yeah!

Starting in a Hurry

This morning I tossed and turned in bed long after the alarm went off. Feeling pressed, I hurried through getting ready and left the house fifteen minutes after getting up.

How long it takes men vs. women to get ready

I cut through the cold morning air and reached the car in one dash, and then.

But no.  The key wasn’t there.  OTL

I often rush into things without preparing properly and end up in trouble, or, even if not outright trouble, I get so anxious that I cannot work efficiently.

Doing it right matters more than doing it fast.  And to do it right, it is important to plan and review carefully before starting.

For which of you, desiring to build a tower, does not first sit down and count the cost, whether he has enough to complete it? – Luke 14:28

Let’s not rush.  Let’s not be frantic.

 

Kaggler’s Toolbox – Setup (from Kaggler.com)

This article was originally posted on Kaggler.com.


I’d like to open up the toolbox I’ve built for data mining competitions and share it with you.

Let me start with my setup.

System

I have access to 2 machines:

  • Laptop – Macbook Pro Retina 15″, OS X Yosemite, i7 2.3GHz 4 Core CPU, 16GB RAM, GeForce GT 750M 2GB, 500GB SSD
  • Desktop – Ubuntu 14.04, i7 5820K 3.3GHz 6 Core CPU, 64GB RAM, GeForce GT 620 1GB, 120GB SSD + 3TB HDD

I purchased the desktop on eBay for around $2,000 a year ago (September 2014).

Git

For my code repository and version control system, I use Git.

It’s useful for collaborating with other team members.  It makes it easy to share the code base, keep track of changes, and resolve conflicts when two people change the same code.

It’s useful even when I work by myself.  It helps me reuse and improve code from previous competitions I participated in.

For competitions, I use GitLab instead of GitHub because it offers an unlimited number of private repositories.

S3 / Dropbox

I use S3 to share files between my machines.  It is cheap – it costs me about $0.1 per month on average.

To access S3, I use the AWS CLI.  I also used s3cmd before and liked it.
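For example, here is a minimal sketch of moving a feature file between machines with the AWS CLI; the bucket name is just a placeholder, and the file paths follow the folder layout from the makefiles below.

[code lang="bash"]
# upload a feature file from the desktop to S3 (replace the bucket name with yours)
aws s3 cp build/feature/feature3.trn.sps s3://my-kaggle-bucket/feature/

# download it on the laptop
aws s3 cp s3://my-kaggle-bucket/feature/feature3.trn.sps build/feature/

# or keep the whole feature folder in sync
aws s3 sync build/feature s3://my-kaggle-bucket/feature
[/code]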

I use Dropbox to share files between team members.

Makefile

For flow control or pipelining, I use makefiles (or GNU make).

It modularizes the long process of a data mining competition into feature extraction, single model training, and ensemble model training, and controls workflow between components.

For example, I have a top level makefile that defines the raw data file locations, folder hierarchies, and target variable.

[code title="Makefile" lang="bash"]
# directories
DIR_DATA := data
DIR_BUILD := build
DIR_FEATURE := $(DIR_BUILD)/feature
DIR_VAL := $(DIR_BUILD)/val
DIR_TST := $(DIR_BUILD)/tst

DATA_TRN := $(DIR_DATA)/train.csv
DATA_TST := $(DIR_DATA)/test.csv

Y_TRN := $(DIR_DATA)/y.trn.yht

$(Y_TRN): $(DATA_TRN)
	cut -d, -f2 $< | tail -n +2 > $@
[/code]

Then, I have makefiles for features that include the top-level makefile and define how to generate training and test feature files in various formats (CSV, libSVM, VW, libFFM, etc.).

[code title="Makefile.feature.feature3" lang="bash"]
include Makefile

FEATURE_NAME := feature3

FEATURE_TRN := $(DIR_FEATURE)/$(FEATURE_NAME).trn.sps
FEATURE_TST := $(DIR_FEATURE)/$(FEATURE_NAME).tst.sps

FEATURE_TRN_FFM := $(DIR_FEATURE)/$(FEATURE_NAME).trn.ffm
FEATURE_TST_FFM := $(DIR_FEATURE)/$(FEATURE_NAME).tst.ffm

$(FEATURE_TRN) $(FEATURE_TST): $(DATA_TRN) $(DATA_TST) | $(DIR_FEATURE)
	src/generate_feature3.py --train-file $< \
	                         --test-file $(lastword $^) \
	                         --train-feature-file $(FEATURE_TRN) \
	                         --test-feature-file $(FEATURE_TST)

%.ffm: %.sps
	src/svm_to_ffm.py --svm-file $< \
	                  --ffm-file $@ \
	                  --feature-name $(FEATURE_NAME)

[/code]

Then, I have makefiles for single model training that include a feature makefile and define how to train a single model and produce CV and test predictions.

[code title="Makefile.xg" lang="bash"]
include Makefile.feature.feature3

N = 400
DEPTH = 8
LRATE = 0.05
ALGO_NAME := xg_$(N)_$(DEPTH)_$(LRATE)
MODEL_NAME := $(ALGO_NAME)_$(FEATURE_NAME)

PREDICT_VAL := $(DIR_VAL)/$(MODEL_NAME).val.yht
PREDICT_TST := $(DIR_TST)/$(MODEL_NAME).tst.yht
SUBMISSION_TST := $(DIR_TST)/$(MODEL_NAME).sub.csv

all: validation submission
validation: $(METRIC_VAL)
submission: $(SUBMISSION_TST)
retrain: clean_$(ALGO_NAME) submission

$(PREDICT_TST) $(PREDICT_VAL): $(FEATURE_TRN) $(FEATURE_TST) \
| $(DIR_VAL) $(DIR_TST)
	./src/train_predict_xg.py --train-file $< \
	                          --test-file $(word 2, $^) \
	                          --predict-valid-file $(PREDICT_VAL) \
	                          --predict-test-file $(PREDICT_TST) \
	                          --depth $(DEPTH) \
	                          --lrate $(LRATE) \
	                          --n-est $(N)

$(SUBMISSION_TST): $(PREDICT_TST) $(ID_TST) | $(DIR_TST)
	paste -d, $(lastword $^) $< > $@

[/code]

Then, I have makefiles for ensemble features that define which single-model predictions to include for ensemble training.

[code title="Makefile.feature.esb9" lang="bash"]
include Makefile

FEATURE_NAME := esb9

BASE_MODELS := xg_600_4_0.05_feature9 \
               xg_400_4_0.05_feature6 \
               ffm_30_20_0.01_feature3

PREDICTS_TRN := $(foreach m, $(BASE_MODELS), $(DIR_VAL)/$(m).val.yht)
PREDICTS_TST := $(foreach m, $(BASE_MODELS), $(DIR_TST)/$(m).tst.yht)

FEATURE_TRN := $(DIR_FEATURE)/$(FEATURE_NAME).trn.csv
FEATURE_TST := $(DIR_FEATURE)/$(FEATURE_NAME).tst.csv

$(FEATURE_TRN): $(Y_TRN) $(PREDICTS_TRN) | $(DIR_FEATURE)
	paste -d, $^ > $@

$(FEATURE_TST): $(Y_TST) $(PREDICTS_TST) | $(DIR_FEATURE)
	paste -d, $^ > $@
[/code]

Finally, I can (re)produce the submission from the XGBoost ensemble of the 9 single models described in Makefile.feature.esb9 by (1) replacing include Makefile.feature.feature3 in Makefile.xg with include Makefile.feature.esb9, and (2) running:

$ make -f Makefile.xg

SSH Tunneling

When I’m connected to the Internet, I always SSH into the desktop for its computational resources (mainly the RAM).

I followed Julian Simioni’s tutorial to allow remote SSH connections to the desktop.  It needs an additional system with a publicly accessible IP address; you can set up an AWS micro (or free-tier) EC2 instance for it.
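In a nutshell, it is a reverse SSH tunnel through that public machine. A minimal sketch follows, with placeholder host names, user names, and port; the details may differ from the tutorial.

[code lang="bash"]
# On the desktop (behind NAT): keep a reverse tunnel open to the EC2 instance.
# Port 2222 on the EC2 instance will forward back to the desktop's SSH port 22.
ssh -N -R 2222:localhost:22 ec2-user@my-ec2-host.example.com

# On the laptop: SSH into the EC2 instance first, then into the desktop through the tunnel.
ssh ec2-user@my-ec2-host.example.com
ssh -p 2222 desktop-user@localhost
[/code]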

tmux

tmux lets you keep your SSH sessions alive even when you get disconnected.  It also lets you split and add terminal panes in various ways and switch easily between them.

The documentation might look overwhelming, but all you need is:
# If there is no tmux session:
$ tmux

or

# If you created a tmux session, and want to connect to it:
$ tmux attach

Then to create a new pane/window and navigate in between:

  • Ctrl + b + " – to split the current window horizontally.
  • Ctrl + b + % – to split the current window vertically.
  • Ctrl + b + o – to move to next pane in the current window.
  • Ctrl + b + c – to create a new window.
  • Ctrl + b + n – to move to next window.

To close a pane/window, just type exit in the pane/window.

 

Hope this helps.

Next up: the machine learning tools I use.

Please share your setups and thoughts too. 🙂

Kaggler 0.4.0 Released (from Kaggler.com)

This article was originally posted on Kaggler.com.

UPDATE on 9/15/2015

I found a bug in OneHotEncoder and fixed it.  The fix is not available on pip yet, but you can update Kaggler to the latest version from source as follows:

$ git clone https://github.com/jeongyoonlee/Kaggler.git
$ cd Kaggler
$ python setup.py build_ext --inplace
$ sudo python setup.py install

If you find a bug, please submit a pull request on GitHub or comment here.


I’m glad to announce the release of Kaggler 0.4.0.

Kaggler is a Python package that provides utility functions and online learning algorithms for classification.  I use it for Kaggle competitions along with scikit-learn, Lasagne, XGBoost, and Vowpal Wabbit.

Kaggler 0.4.0 added a scikit-learn-like interface for preprocessing, metrics, and online learning algorithms.

kaggler.preprocessing

Classes in kaggler.preprocessing now support fit, fit_transform, and transform methods. Currently 2 preprocessing classes are available as follows:

  • Normalizer – aligns distributions of numerical features into a normal distribution. Note that it’s different from sklearn.preprocessing.Normalizer, which only scales features without changing distributions.
  • OneHotEncoder – transforms categorical features into dummy variables.  It is similar to sklearn.preprocessing.OneHotEncoder except that it groups infrequent values into a dummy variable.

[code language="python"]
from kaggler.preprocessing import OneHotEncoder

# values appearing less than min_obs are grouped into one dummy variable.
enc = OneHotEncoder(min_obs=10, nan_as_var=False)
X_train = enc.fit_transform(train)
X_test = enc.transform(test)
[/code]
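Normalizer follows the same interface; here is a minimal sketch, assuming the default constructor and the same train and test data as above:

[code language="python"]
from kaggler.preprocessing import Normalizer

# map each numerical feature to a normal distribution
nm = Normalizer()
X_train = nm.fit_transform(train)
X_test = nm.transform(test)
[/code]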

kaggler.metrics

3 metrics are available as follows:

  • logloss – calculates the bounded log loss error for classification predictions.
  • rmse – calculates the root mean squared error for regression predictions.
  • gini – calculates the gini coefficient for regression predictions.

[code language="python"]
from kaggler.metrics import gini

score = gini(y, p)
[/code]

kaggler.online_model

Classes in kaggler.online_model (except ClassificationTree) now support fit and predict methods. Currently, five online learning algorithms are available as follows:

  • SGD – stochastic gradient descent algorithm with hashing trick and interaction
  • FTRL – follow-the-regularized-leader algorithm with hashing trick and interaction
  • FM – factorization machine algorithm
  • NN (or NN_H2) – neural network algorithm with a single (or double) hidden layer(s)
  • ClassificationTree – decision tree algorithm

[code language="python"]
from kaggler.online_model import FTRL
from kaggler.data_io import load_data

# load a libsvm format sparse feature file
X, y = load_data('train.sparse', dense=False)

# FTRL
clf = FTRL(a=.1,              # alpha in the per-coordinate rate
           b=1,               # beta in the per-coordinate rate
           l1=1.,             # L1 regularization parameter
           l2=1.,             # L2 regularization parameter
           n=2**20,           # number of hashed features
           epoch=1,           # number of epochs
           interaction=True)  # use feature interaction or not

# training and prediction
clf.fit(X, y)
p = clf.predict(X)
[/code]

The latest code is available on GitHub.
Package documentation is available at https://pythonhosted.org/Kaggler/.

Please let me know if you have any comments or want to contribute. 🙂

Catching Up (from Kaggler.com)

This article was originally posted on Kaggler.com.

Many things have happened since the last post in February.

1. Kaggle and other competitions

2. Kaggler package

  • Kaggler 0.3.8 was released.
  • Fellow Kaggler Jiming Ye added an online tree learner to the package.

I will post about each update soon.  Stay tuned! 🙂

Restarting the Blog

A while ago, I cleaned up all the domains and web services I had been holding onto, keeping only the oldest ones, ethiel.org and youngnjeong.com. In the process, I gathered here the old posts, few as they are, that had been scattered all over.

127 posts, from 1994 to 2010.

Sometimes I wrote only once over several years; sometimes I wrote diligently for days in a row. It is not much for such a long stretch of time, but seeing the posts gathered like this feels like discovering a great treasure.

Writing makes me quite uneasy, as if it lays bare how meager my stock of thoughts is. Lately, though, I suspect that my future self will want to look back on, and look up, this meager stock of today.

I don’t know how much more there will be, but I intend to visit here from time to time and leave traces of my life and of my thoughts, poor when they are poor and abundant when they are abundant.

Good night.

Prayer of Serenity by Reinhold Niebuhr

God, grant me the serenity
to accept the things I cannot change;
courage to change the things I can;
and wisdom to know the difference.

Living one day at a time;
Enjoying one moment at a time;
Accepting hardships as the pathway to peace;

Taking, as He did, this sinful world
as it is, not as I would have it;
Trusting that He will make all things right
if I surrender to His Will;

That I may be reasonably happy in this life
and supremely happy with Him
Forever in the next.
Amen.

–Reinhold Niebuhr

Amen

What is leadership – Jack Welch in 1979

Leadership is

helping the people around you,

even if not always taking the lead,
to work harder
and to enjoy their work more,

and, in the end,
to accomplish more than they thought possible,
so that they gain more respect for and confidence in themselves.

From a letter sent in 1979 to then-Chairman Reg Jones,
while Welch was competing to become GE’s next CEO

Mentoring diary – What Is Success

Recently, God has led me to places where I have met awesome mentors and many mentees in Christ. He has also given me ideas about how to mentor. I think God wants to use me either to be a good mentor to mentees or to help them meet good mentors. I want to share some of those ideas here.

First, I want to talk about what success is, which I learned from Elder Brian Chun’s sermon at KPM.

Why do we mentor? Why do mentees seek mentoring? I would say it is to lead mentees to succeed in their lives. Then what is success? Depending on how we define success, the way we mentor should differ. As Christians, our success should not be the same as success in the world: wealth, fame, prosperity.

Elder Chun defines the success of Christians as being faithful to God’s calling. It is a process rather than a result. It is about the giver of the talents rather than the receiver.

When we read the Bible, some generations receive only a brief description, or none at all. There must have been people in those generations who were successful in the world’s terms, too. However, God does not count that; it was not important to Him.

We all, as Christians, want to be called “good and faithful servants” when we stand before God at the end of our lives. Then we need to examine our perception of success first.

I think that’s the very first thing to remind mentees of.

Recursive vs. Iterative Method for 4 Year Old Kid

Today, I had a phone interview with a software engineering company. I really like one of my answers to the interview questions and want to share it here.

Question: How can you explain the difference between the recursive and iterative methods of solving a problem to a 4 year old kid?

My answer was:

We have a Russian doll with many other dolls (more than a thousand!) nested inside it one by one, and each of them has a unique name. Only one of those dolls is a boy; all the others are girls. We want to find out the name of the boy doll. How are you going to do this? 🙂

Well, I think we can do that in two different ways.
First, we can open up one doll at a time until we see the boy. Then we can ask his name! Right? But then, we might be opening dolls all day long and will get tired soon. 🙁
Instead, I think we can also find his name as follows. We don’t open up all the dolls ourselves; we ask only the first doll whether she has the boy doll right inside her. If she says yes, we ask her to ask him his name and let us know. If she says no, we ask her to ask exactly the same questions to the doll inside her, i.e.:
  1. if she has the boy doll inside.
  2. if not, ask the same questions to the doll inside.
  3. if yes, then ask his name and let me know.
Right?
We call the first way the iterative method, and the second way the recursive method.
Do you think a 4 year old kid can get the idea?
Or am I too demanding? 🙂
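For the grown-ups, here is a minimal sketch of the two approaches in Python; the Doll class and the names are made up for illustration.

[code language="python"]
class Doll:
    def __init__(self, name, is_boy, inner=None):
        self.name = name
        self.is_boy = is_boy
        self.inner = inner  # the doll nested right inside, or None

def find_boy_iterative(doll):
    # open one doll at a time until we see the boy
    while doll is not None:
        if doll.is_boy:
            return doll.name
        doll = doll.inner
    return None

def find_boy_recursive(doll):
    # ask the current doll; if she is not the boy, pass the same question inward
    if doll is None:
        return None
    if doll.is_boy:
        return doll.name
    return find_boy_recursive(doll.inner)

# a tiny nest of dolls: Carol holds Bob, who holds Alice
dolls = Doll('Carol', False, Doll('Bob', True, Doll('Alice', False)))
print(find_boy_iterative(dolls))  # Bob
print(find_boy_recursive(dolls))  # Bob
[/code]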