Life is a vector. What matters is not only "how hard you live" (magnitude); "how you live" (direction) matters even more.
This morning I tossed and turned in bed long after the alarm went off. Feeling rushed, I hurried to get ready and left the house fifteen minutes after getting up.
I cut through the cold morning air and made it to the car in one dash.
And then? No key. OTL
I often start things in a rush without preparing properly, and either end up in trouble or, even short of outright trouble, work inefficiently out of anxiety.
Doing things right matters more than doing them fast. And doing things right means planning and reviewing carefully before starting.
"For which of you, desiring to build a tower, does not first sit down and count the cost, whether he has enough to complete it?" – Luke 14:28
Don't rush. Don't be hasty.
This article was originally posted on Kaggler.com.
I'd like to open up the toolbox I've built for data mining competitions and share it with you.
Let me start with my setup.
I have access to 2 machines:
- Laptop – Macbook Pro Retina 15″, OS X Yosemite, i7 2.3GHz 4 Core CPU, 16GB RAM, GeForce GT 750M 2GB, 500GB SSD
- Desktop – Ubuntu 14.04, i7 5820K 3.3GHz 6 Core CPU, 64GB RAM, GeForce GT 620 1GB, 120GB SSD + 3TB HDD
I purchased the desktop on eBay for around $2,000 a year ago (September 2014).
For the code repository and version control system, I use git.
It's useful for collaborating with other team members: it makes it easy to share the code base, keep track of changes, and resolve conflicts when two people change the same code.
It's also useful when I work by myself, since it helps me reuse and improve code from previous competitions.
S3 / Dropbox
I use S3 to share files between my machines. It is cheap – it costs me about $0.1 per month on average.
I use Dropbox to share files between team members.
For flow control, or pipelining, I use makefiles (GNU Make).
They modularize the long process of a data mining competition into feature extraction, single model training, and ensemble model training, and control the workflow between the components.
For example, I have a top level makefile that defines the raw data file locations, folder hierarchies, and target variable.
[code title="Makefile" lang="bash"]
DIR_DATA := data
DIR_BUILD := build
DIR_FEATURE := $(DIR_BUILD)/feature
DIR_VAL := $(DIR_BUILD)/val
DIR_TST := $(DIR_BUILD)/tst

DATA_TRN := $(DIR_DATA)/train.csv
DATA_TST := $(DIR_DATA)/test.csv
Y_TRN := $(DIR_DATA)/y.trn.yht

$(DIR_FEATURE) $(DIR_VAL) $(DIR_TST):
	mkdir -p $@

# extract the target variable (2nd column, skipping the header)
$(Y_TRN): $(DATA_TRN)
	cut -d, -f2 $< | tail -n +2 > $@
[/code]
Then, I have makefiles for features that include the top-level makefile and define how to generate training and test feature files in various formats (CSV, libSVM, VW, libFFM, etc.).
[code title="Makefile.feature.feature3" lang="bash"]
include Makefile

FEATURE_NAME := feature3

FEATURE_TRN := $(DIR_FEATURE)/$(FEATURE_NAME).trn.sps
FEATURE_TST := $(DIR_FEATURE)/$(FEATURE_NAME).tst.sps
FEATURE_TRN_FFM := $(DIR_FEATURE)/$(FEATURE_NAME).trn.ffm
FEATURE_TST_FFM := $(DIR_FEATURE)/$(FEATURE_NAME).tst.ffm

$(FEATURE_TRN) $(FEATURE_TST): $(DATA_TRN) $(DATA_TST) | $(DIR_FEATURE)
	src/generate_feature3.py --train-file $< \
	                         --test-file $(lastword $^) \
	                         --train-feature-file $(FEATURE_TRN) \
	                         --test-feature-file $(FEATURE_TST)

# convert libSVM format features into libFFM format
$(DIR_FEATURE)/%.ffm: $(DIR_FEATURE)/%.sps
	src/svm_to_ffm.py --svm-file $< \
	                  --ffm-file $@
[/code]
Then, I have makefiles for single model training that include a feature makefile and define how to train a single model and produce CV and test predictions.
[code title="Makefile.xg" lang="bash"]
include Makefile.feature.feature3

N = 400
DEPTH = 8
LRATE = 0.05
ALGO_NAME := xg_$(N)_$(DEPTH)_$(LRATE)
MODEL_NAME := $(ALGO_NAME)_$(FEATURE_NAME)

PREDICT_VAL := $(DIR_VAL)/$(MODEL_NAME).val.yht
PREDICT_TST := $(DIR_TST)/$(MODEL_NAME).tst.yht
SUBMISSION_TST := $(DIR_TST)/$(MODEL_NAME).sub.csv

all: validation submission
validation: $(PREDICT_VAL)
submission: $(SUBMISSION_TST)
retrain: clean_$(ALGO_NAME) submission

$(PREDICT_TST) $(PREDICT_VAL): $(FEATURE_TRN) $(FEATURE_TST) \
                               | $(DIR_VAL) $(DIR_TST)
	./src/train_predict_xg.py --train-file $< \
	                          --test-file $(word 2, $^) \
	                          --predict-valid-file $(PREDICT_VAL) \
	                          --predict-test-file $(PREDICT_TST) \
	                          --n-est $(N) \
	                          --depth $(DEPTH) \
	                          --lrate $(LRATE)

$(SUBMISSION_TST): $(PREDICT_TST) $(ID_TST) | $(DIR_TST)
	paste -d, $(lastword $^) $< > $@
[/code]
Then, I have makefiles for ensemble features that define which single-model predictions are included in ensemble training.
[code title="Makefile.feature.esb9" lang="bash"]
include Makefile

FEATURE_NAME := esb9

# the first of 9 base models; the remaining 8 are omitted here
BASE_MODELS := xg_600_4_0.05_feature9

PREDICTS_TRN := $(foreach m, $(BASE_MODELS), $(DIR_VAL)/$(m).val.yht)
PREDICTS_TST := $(foreach m, $(BASE_MODELS), $(DIR_TST)/$(m).tst.yht)

FEATURE_TRN := $(DIR_FEATURE)/$(FEATURE_NAME).trn.csv
FEATURE_TST := $(DIR_FEATURE)/$(FEATURE_NAME).tst.csv

$(FEATURE_TRN): $(Y_TRN) $(PREDICTS_TRN) | $(DIR_FEATURE)
	paste -d, $^ > $@

$(FEATURE_TST): $(Y_TST) $(PREDICTS_TST) | $(DIR_FEATURE)
	paste -d, $^ > $@
[/code]
Finally, I can (re)produce the submission from the XGBoost ensemble of the 9 single models described in Makefile.feature.esb9 by (1) replacing include Makefile.feature.feature3 in Makefile.xg with include Makefile.feature.esb9, and (2) running:

$ make -f Makefile.xg
When I'm connected to the Internet, I always ssh into the desktop for its computational resources (mainly for the RAM).
I followed Julian Simioni's tutorial to allow remote SSH connections to the desktop. It requires an additional system with a publicly accessible IP address; you can set up an AWS micro (or free-tier) EC2 instance for this.
tmux keeps your SSH sessions alive even when you get disconnected. It also lets you split and add terminal screens in various ways and switch easily between them.
The documentation might look overwhelming, but all you need is:
# If there is no tmux session yet:
$ tmux

# If you already created a tmux session and want to connect to it:
$ tmux attach
Then to create a new pane/window and navigate in between:
- Ctrl + b + " – split the current window horizontally.
- Ctrl + b + % – split the current window vertically.
- Ctrl + b + o – move to the next pane in the current window.
- Ctrl + b + c – create a new window.
- Ctrl + b + n – move to the next window.
To close a pane/window, just type exit in the pane/window.
Hope this helps.
Next up is about machine learning tools I use.
Please share your setups and thoughts too. 🙂
UPDATE on 9/15/2015
I found a bug in OneHotEncoder and fixed it. The fix is not available on pip yet, but you can update Kaggler to the latest version from source as follows:
$ git clone https://github.com/jeongyoonlee/Kaggler.git
$ cd Kaggler
$ python setup.py build_ext --inplace
$ sudo python setup.py install
If you find a bug, please submit a pull request on GitHub or comment here.
I’m glad to announce the release of Kaggler 0.4.0.
Kaggler is a Python package that provides utility functions and online learning algorithms for classification. I use it for Kaggle competitions along with scikit-learn, Lasagne, XGBoost, and Vowpal Wabbit.
Preprocessing classes in kaggler.preprocessing now support fit, fit_transform, and transform methods. Currently two preprocessing classes are available:
- Normalizer – transforms the distributions of numerical features into a normal distribution. Note that it's different from sklearn.preprocessing.Normalizer, which only scales features without changing their distributions.
- OneHotEncoder – transforms categorical features into dummy variables. It is similar to sklearn.preprocessing.OneHotEncoder, except that it groups infrequent values into a single dummy variable.
from kaggler.preprocessing import OneHotEncoder
# values appearing less than min_obs are grouped into one dummy variable.
enc = OneHotEncoder(min_obs=10, nan_as_var=False)
X_train = enc.fit_transform(train)
X_test = enc.transform(test)
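To illustrate the idea behind Normalizer, here is a minimal, self-contained sketch of a rank-based normal transform: map each value to its rank, then push the rank quantile through the inverse normal CDF. This is only the concept; the function name rank_gauss is mine, and Kaggler's actual implementation may differ in detail.

```python
from statistics import NormalDist

def rank_gauss(xs):
    """Map values to a standard normal via their (tie-averaged) ranks."""
    nd = NormalDist()  # standard normal, for the inverse CDF
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # find the run of tied values and give them their average rank
        j = i
        while j + 1 < n and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # 1-based average rank
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    # rank -> quantile in (0, 1) -> inverse normal CDF
    return [nd.inv_cdf(r / (n + 1)) for r in ranks]
```

Whatever the input distribution, the output is symmetric around 0 and roughly normally distributed, which is what makes such transforms useful for linear models and neural networks.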
Three metrics are available:
- logloss – calculates the bounded log loss for classification predictions.
- rmse – calculates the root mean squared error for regression predictions.
- gini – calculates the gini coefficient for regression predictions.
from kaggler.metrics import gini
score = gini(y, p)
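As a rough illustration of what these metrics compute, here is a small sketch of a bounded log loss and a normalized gini coefficient in plain Python. This is my own minimal version for exposition; Kaggler's implementations may differ in details such as the clipping bound.

```python
import math

def logloss(y, p, eps=1e-15):
    """Bounded log loss: clip predictions away from 0 and 1 so that a
    single confident miss cannot return infinity."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)
        total += -(yi * math.log(pi) + (1.0 - yi) * math.log(1.0 - pi))
    return total / len(y)

def gini(y, p):
    """Normalized gini coefficient: 1.0 for a perfect ranking of the
    targets, around 0.0 for a random one, negative for a reversed one."""
    def gini_sum(y_true, score):
        # accumulate targets in order of decreasing score
        order = sorted(range(len(score)), key=lambda i: -score[i])
        cum, total = 0.0, 0.0
        for i in order:
            cum += y_true[i]
            total += cum
        return total / sum(y_true) - (len(y_true) + 1) / 2.0
    return gini_sum(y, p) / gini_sum(y, y)
```

Note that gini depends only on the ordering of the predictions, not their scale, which is why it is popular for insurance-style regression competitions.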
Online learning algorithms (SGD, FTRL, FM, NN, NN_H2, and ClassificationTree) now support fit and predict methods. Currently five online learning algorithms are available:
- SGD – stochastic gradient descent with the hashing trick and feature interactions
- FTRL – follow-the-regularized-leader with the hashing trick and feature interactions
- FM – factorization machine
- NN (or NN_H2) – neural network with a single (or double) hidden layer(s)
- ClassificationTree – decision tree
from kaggler.online_model import FTRL
from kaggler.data_io import load_data

# load a libsvm format sparse feature file
X, y = load_data('train.sparse', dense=False)

clf = FTRL(a=.1,              # alpha in the per-coordinate learning rate
           b=1,               # beta in the per-coordinate learning rate
           l1=1.,             # L1 regularization parameter
           l2=1.,             # L2 regularization parameter
           n=2**20,           # number of hashed features
           epoch=1,           # number of epochs
           interaction=True)  # whether to use feature interactions

# training and prediction
clf.fit(X, y)
p = clf.predict(X)
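For the curious, the core of the FTRL-proximal update (per-coordinate learning rates plus lazy, L1-sparse weights) can be sketched in a few lines of plain Python. The parameter names mirror the constructor above, but the class itself is a hypothetical illustration of the algorithm family, not Kaggler's actual implementation:

```python
import math

class FTRLProximal:
    """Sketch of FTRL-proximal logistic regression for binary features
    given as lists of hashed feature indices."""

    def __init__(self, a=.1, b=1., l1=1., l2=1., n=2**20):
        self.a, self.b, self.l1, self.l2, self.n = a, b, l1, l2, n
        self.z = {}  # FTRL dual weights
        self.s = {}  # per-coordinate sums of squared gradients

    def _w(self, i):
        # lazy weight: exactly zero while |z| <= l1 (sparsity from L1)
        z = self.z.get(i, 0.)
        if abs(z) <= self.l1:
            return 0.
        sign = -1. if z < 0 else 1.
        return -(z - sign * self.l1) / (
            (self.b + math.sqrt(self.s.get(i, 0.))) / self.a + self.l2)

    def predict_one(self, x):
        wx = sum(self._w(i % self.n) for i in x)
        return 1. / (1. + math.exp(-max(min(wx, 35.), -35.)))

    def update_one(self, x, y):
        p = self.predict_one(x)
        g = p - y  # log loss gradient w.r.t. the score, binary features
        for i in (j % self.n for j in x):
            s = self.s.get(i, 0.)
            sigma = (math.sqrt(s + g * g) - math.sqrt(s)) / self.a
            self.z[i] = self.z.get(i, 0.) + g - sigma * self._w(i)
            self.s[i] = s + g * g
```

With l1 large enough, rarely seen features keep exactly zero weight, which is why FTRL works so well on sparse, hashed data.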
Please let me know if you have any comments or want to contribute. 🙂
Many things have happened since the last post in February.
1. Kaggle and other competitions
- KDD Cup 2015 – My team InterContinental Ensemble (Andreas, Michael, Tam, Mert, Kohei, Song, Xiaocong, Peng, and me) finished first out of 821 teams!
- Otto Group Product Classification Challenge at Kaggle – Michael, Abhishek and I finished 3rd out of 3,514 teams.
- Liberty Mutual Group: Property Inspection Prediction at Kaggle – Michael and I finished 5th out of 2,236 teams while we were 1st on the public leaderboard.
- Countable Care: Modeling Women’s Health Care Decisions at Driven Data – Abhishek and I finished 3rd out of 491 teams.
2. Kaggler package
- Kaggler 0.3.8 was released.
- Fellow Kaggler Jiming Ye added an online tree learner to the package.
I will post about each update soon. Stay tuned! 🙂
127 posts, from 1994 to 2010.
Sometimes years passed between posts; other times I wrote diligently for days in a row. It isn't much for such a long stretch of time, but looking at these collected writings, I feel as if I've found a great treasure.
Writing has always felt burdensome to me, as if it lays bare how little I have to offer. Lately, though, I've begun to think that someday I will want to look back on, and search through, today's meager store.
I don't know how much more there will be, but I intend to stop by here now and then to leave traces of my life and my thoughts: lean when they are lean, abundant when they are abundant.
God, grant me the serenity
to accept the things I cannot change;
courage to change the things I can;
and wisdom to know the difference.
Living one day at a time;
Enjoying one moment at a time;
Accepting hardships as the pathway to peace;
Taking, as He did, this sinful world
as it is, not as I would have it;
Trusting that He will make all things right
if I surrender to His Will;
That I may be reasonably happy in this life
and supremely happy with Him
Forever in the next.
Even if not always leading the way,
helping people work harder,
enjoy their work more,
and achieve more than they thought possible,
so that they gain greater self-respect and self-confidence.
From a letter sent in 1979 to former chairman Reg Jones,
while competing as a candidate to be GE's next CEO
Recently God has been leading me to places where I meet awesome mentors and many mentees in Christ. He has also given me ideas about how to mentor mentees. I think God wants to use me either to be a good mentor to mentees or to help them meet good mentors. I want to share some of those ideas here.
First, I want to talk about what success is, which I learned from Elder Brian Chun’s sermon at KPM.
Why do we mentor mentees? Why do mentees seek mentoring? I would say it's to lead mentees to succeed in their lives. Then, what is success? Depending on how we define success, the way we mentor should differ. As Christians, our success should not be the same as success in the world: wealth, fame, prosperity.
Elder Chun defines the success of Christians as being faithful to God’s calling. It’s rather a process than a result. It’s rather about the giver than about the receiver of talents.
If we read the Bible, some generations get only a brief description, or even none at all. There must have been people in those generations who were successful in the world's terms as well, but God doesn't count that. It was not important to God.
We all, as Christians, want to be called "good and faithful servants" by God at the end of our lives. Then we need to examine our perception of success first.
I think that’s the very first thing to remind mentees of.
Today, I had a phone interview with a software engineering company. I really liked one of my answers to the interview questions and want to share it here.
Question: How would you explain the difference between recursive and iterative methods of solving a problem to a 4-year-old kid?
My answer was:
We have a Russian doll with a lot of other dolls (more than a thousand!) nested inside, one by one, and each of them has a unique name. Only one of the dolls is a boy; all the others are girls. We want to find out the boy doll's name. How are you going to do this? 🙂

Well, I think we can do it in two different ways.

First, we can open up one doll at a time until we see the boy. Then we can ask his name! Right? But then we might be asking the same question all day long and get tired soon. 🙁

Instead, we don't open up all the dolls ourselves. We ask only the first doll whether she has the boy doll right inside her. If she says yes, we ask her to ask him his name and let us know. If she says no, we ask her to pass exactly the same questions to the doll inside her, i.e.:
- whether she has the boy doll right inside her;
- if yes, ask his name and let me know;
- if not, ask the same questions to the doll inside her.

Right? We call the first way the iterative method and the second way the recursive method.
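For readers past the age of four, the two ways of searching the nested dolls can be sketched in Python. The doll structure and function names here are mine, purely for illustration:

```python
def find_boy_iterative(dolls):
    """Open one doll at a time until we see the boy, then ask his name."""
    for doll in dolls:
        if doll["gender"] == "boy":
            return doll["name"]
    return None  # no boy doll found

def find_boy_recursive(dolls):
    """Ask the outermost doll; if the boy isn't right inside her,
    she passes the same question to the doll inside her."""
    if not dolls:
        return None  # no dolls left to ask
    outer, inside = dolls[0], dolls[1:]
    if outer["gender"] == "boy":
        return outer["name"]
    return find_boy_recursive(inside)

# the nesting is represented outermost-first
dolls = [{"name": "Anya", "gender": "girl"},
         {"name": "Ivan", "gender": "boy"},
         {"name": "Olga", "gender": "girl"}]
```

Both return the same answer; the difference is who does the work, us in a loop or the dolls passing the question inward.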