Kan Li 2023-11-10T19:19:01+00:00 kan.li@uth.tmc.edu Just too lazy!! (nothing in it) 2022-10-28T00:00:00+00:00 Kan Li http://li-kan.com/2022/10/Update Rcpp - Using C++ and R code Together 2016-11-07T00:00:00+00:00 Kan Li http://li-kan.com/2016/11/Rcpp-tutorial This tutorial was presented as part of PH 1930 Statistical Computing. The examples for this session are available on Github at the link.

Overview
Configure the environment
Rcpp and RcppArmadillo
Getting started
First example
Parallelization
MCMC for GLMM

While it is possible to write C or Fortran code for use in R as covered previously in this course, Rcpp provides a clean, approachable API that lets we write high-performance code much easily and joyfully.

##Overview

In general, R is not a fast or most memory efficient language – but it is very easy to use, to share, and makes very pretty output. C++ is very fast, but not so easy to use, and debugging could be painful. Rcpp supports implementing R functions in C++ for high performance computing and to easily interface R with external libraries.

##Configure the environment for using C++ code with R

I only test the tool on Windows and Linux. No matter which platform you prefer, RStudio is highly recommended to do the work. For a Windows user, you will need to make sure you have the latest release of R (3.2.0+) and will also need to install the Rtools library before you can use any packages with C++ code in them. Usually RStudio will help to check whether the Rtools are installed and install it for you when you first use Rcpp. If you are Linux user, everything probably just works. You may also want to visit this blog post which has more information on making C++ work with R under Windows/Mac if you end up encountering any weirdness.

##Rcpp and RcppArmadillo

In this session we’ll learn how to improve our R code performance by rewriting key functions in C++. Typical bottlenecks that C++ can address include:

  • Loops that can’t be easily vectorized because subsequent iterations depend on previous ones.
  • Recursive functions, or problems which involve calling functions millions of times. The overhead of calling a function in C++ is much lower than that in R.
  • Problems that require advanced data structures and algorithms that R doesn’t provide. Through the standard template library (STL), C++ has efficient implementations of many important data structures, from ordered maps to double-ended queues.

To use Rcpp, a working knowledge of C++ is helpful, but not essential. Many good tutorials and references are freely available, including http://www.learncpp.com/ and http://www.cplusplus.com/. You may also enjoy Dirk Eddelbuettel’s Seamless R and C++ integration with Rcpp, which goes into much detail into all aspects of Rcpp.

RcppArmadillo is an add-on package that gives you access to tons of useful linear algebra functionality in C++. In particular it makes passing in and working with arrays, matrices and vectors pretty easy. You can check out the Armadillo docs here: http://arma.sourceforge.net/docs.html. This is where I go to look up functions to see how to use them or if they exist.

Although the base Rcpp package provides its own data structures (e.g., array, matrix, list) that can be passed easily between R and C++, the Armadillo data structures provided by the RcppArmadillo package are really nice and easier to use. Additionally, the Armadillo data structures are native C++ data structures, have multi-thread support and less unpredictable bad errors are encountered in practice. So I suggest using Armadillo data structures in most cases.

##Getting started

Before we start the first example, we install the two R package using


install.packages("Rcpp")
install.packages("RcppArmadillo")

in Rstudio command line.

We start by creating a new file in RStudio by clicking on the icon in the top left corner of the screen and select C++ File from the drop down menu.

A sample timesTwo* C++ program will pop up. We will now need to save that file somewhere. Once we have done so, we can click on the Source button and you will see that RStudio automatically calls Rcpp::sourceCpp() on the file.

This will compile the function and make it so we can access the C++ code from R - much easier than Calling C from R we taught before. The example can help you to check everything needed are correctly installed on your computer.

Some other things to note:

  • This statement include Rcpp header file in the program. The function is similar as library( ) in R. It also tells us that we can call any Rcpp constructs by their given name without the Rcpp:: prefix.

#include <Rcpp.h>
using namespace Rcpp;
  • We also put this statement before each function we want to make available to R:

//[[Rcpp::export]]

We can also define multiple C++ functions in the same file (not necessarily recommended unless some of them will be used by the main function), so we can put one in front of each one we want to make visible.

  • We can include R code blocks in C++ files processed with sourceCpp . The R code will be automatically run after the compilation.

/*** R
timesTwo(42)
*/

##First example We introduce our first example of calculating inner product of two vectors, and use data structure of Armadillo. The C++ code is:


#include <RcppArmadillo.h>
// [[Rcpp::depends( RcppArmadillo)]]
using namespace arma;

// [[Rcpp::export]]
arma::mat inner(arma::vec x, arma::vec y){
  arma::mat ip=x.t()*y;
  return(ip);
}

/*** R
vecA = rnorm(100000)
*/

Note that we no longer use the:

#include <Rcpp.h>

statement as in the example code provided by RStudio, but replace it with:


#include <RcppArmadillo.h>
//[[Rcpp::depends(RcppArmadillo)]]

Also note that for all of the objects, we have to specify their type as we define the argument. This is a feature of C++ that is different from R where we just create objects without having to specify their type. Some basic C++ Armadillo objects we can pass in:

  • For decimal numbers like 1.2347 we need to use the double declaration, followed by the name of the argument (e.g.. my_double)
  • For integers (whole numbers) like 26 we use the int declaration, followed by the argument.
  • For numeric vectors, we use the arma::vec declaration, followed by the argument. This code should crash if you try to pass in anything other than a numeric vector (can contain doubles or integers). When you refer to the i th elements of a vector such as x , we use x(i) instead of x[i] as in R.
  • For numeric matrices, we use the arma::mat declaration, followed by the argument. Again, make sure it is just numbers in there.

We then write following R code and run it.


> library(Rcpp)
> library(RcppArmadillo)

> sourceCpp('test.cpp')
> vecA = rnorm(100000)

> inner(vecA,vecA)
         [,1]
[1,] 100317.2

##Parallelization in Rcpp via OpenMP

Now we make a small extension of the example and simply introduce using OpenMP to parallelize our C++ function. In order for the compiler to recognize OpenMP, we need to include the OpenMP header flag. Additionally, Rcpp has a plugin that will automatically add the OpenMP compiler flags. Add the following to the top of your C++ program.


#include <omp.h>
// [[Rcpp::plugins(openmp)]]

We also have to set some compiler flags in R code. Add the following to your R code.


Sys.setenv("PKG_CXXFLAGS" = "-fopenmp")
Sys.setenv("PKG_LIBS" = "-fopenmp")

The example we are going to use is element-wised sum square of a vector. It is a special case of the inner product of two same vectors. In serial way, we can realize this in Rcpp like so:


// [[Rcpp::export]]
double sumsq_serial(vec x)
{
  double sum = 0.0;
  for (int i=0; i<x.size(); i++){
    sum += sq(x(i));
  }  
  return sum;
}

The function sq() is simply squaring a number as


double sq(double x){
  return(x*x);
}

I do not put // [[Rcpp::export]] before this function because I will never call it from R. Calling such a function will slow down the performance of the code, comparing with sum += x(i)*x(i); . I include a function here to ascertain that each thread can correctly recognizes the user defined function when doing parallel computing later.

And in parallel using OpenMP, we just add the basic compiler pragma:


// [[Rcpp::export]]
double sumsq_parallel(vec x, int ncores)
{
  double sum = 0.0;
  omp_set_num_threads(ncores);
  #pragma omp parallel for shared(x) reduction(+:sum)
  for (int i=0; i<x.size(); i++){
	// cout << i << ", "omp_get_thread_num() <<  " of " <<  omp_get_num_threads() << endl;
    sum += sq(x(i));
  }
  return sum;
}

Basically, we identify the loop in which we want to parallelize and add the following lines above it


omp_set_num_threads(int)
#pragma omp parallel for

The first line selects the number of cores to use. This should not exceed the number of cores in your system. If we do not write this line, all the cores available on your computer will be used. The only real trick here in this example is making use of the shared() and reduction() keyword, and it pretty much does what it looks like. We’re telling the compiler that while each iteration of the loop should be run in parallel, they share the same information from vector x , and in the end we want the private (by thread) copies of the variable sum to be added up. The code cout << i << ", "omp_get_thread_num() << " of " << omp_get_num_threads() << endl; is C++ standard way to do print() . This code just tells which thread is currently working on the i th iteration, and the total number of thread are working together.

A small experiment is used to compare different methods proposed. A small set of micro-benchmarks in a variety of methods are conducted.


library(rbenchmark)
bench = benchmark(
		  vecA%*%vecA,
          inner(vecA,vecA),
          sumsq_parallel(vecA,1),
          sumsq_parallel(vecA,2),
          sumsq_parallel(vecA,3),
          sumsq_parallel(vecA,4), order = NULL, replications=1000)

print(bench[,1:4])

# test replications elapsed relative
# 1           vecA %*% vecA         1000   0.721    4.682
# 2       inner(vecA, vecA)         1000   0.229    1.487
# 3 sumsq_parallel(vecA, 1)         1000   0.319    2.071
# 4 sumsq_parallel(vecA, 2)         1000   0.203    1.318
# 5 sumsq_parallel(vecA, 3)         1000   0.154    1.000
# 6 sumsq_parallel(vecA, 4)         1000   0.157    1.019

We can see that C++ functions are much faster than R function %*% . As more cores were used in parallel computing, the execution time becomes shorter. Because I only had 4 cores on my desktop, the maximal number of core I tested was 4. It seems requiring all the resources on your computer to do the work may slow down the overall performance.

##MCMC for GLMM

This tutorial session has only touched on a small part of Rcpp, giving you the basic tools to rewrite poorly performing R code in C++. With the basic knowledge of Rcpp, it will be much easier for you to understand the example of MCMC for generalized linear mixed model using Adaptive rejection Metropolis sampling (ARMS) I provide here . Three methods are compared. The first method is pure R code, using the arms function in R Hi package. The second method is writing log density functions in C++, but sampling process is still coded in R with modified arms function. The third method is coding both log density functions and sampling process in Rcpp/C++. The pure R example takes 36.14 mins. The hybrid method takes 9.80 mins and the C++ code used only 1.17 mins.

The log density function coded in R:


logden.beta < - function(x, myu) {
  myu.long < - as.vector(mapply(rep, myu, nobs))
  temp < - as.vector(design.matrix %*% as.matrix(x, ncol=1)) + myu.long
  return(sum(y*temp - log(1+exp(temp))))
}

logden.u < - function(x, mybeta, myy, mycovariate, mytau) {
  temp <- mycovariate %*% as.matrix(mybeta, ncol=1) + x
  return(sum(myy * temp - log(1+exp(temp))) - 0.5*x^2*mytau)
}

The log density function coded in C++:


// [[Rcpp::export]]
double logden_beta(vec x, vec myu_long, mat design_matrix, vec y){
  vec temp = design_matrix*x + myu_long;
  vec one; one.ones(size(temp));
  vec logden = y%temp - log(one + exp(temp));
  return(sum(logden));
}

// [[Rcpp::export]]
double logden_u(double x, vec mybeta, vec myy, mat mycovariate, double mytau){
  vec x_long(size(myy)); x_long.fill(x);
  vec temp = mycovariate*mybeta + x_long;
  vec one; one.ones(size(temp));
  vec logden = myy%temp - log(one + exp(temp));
  return(sum(logden)-0.5*pow(x,2)*mytau);
}

##Common Pitfalls

  • The number one pitfall I have made while working with C++ and R is forgetting that while R starts vector and matrix indexes from 1, C++ (along with pretty much every other programming language), starts indexes from zero, so you need to plan accordingly.
  • In C++, we need to define the variable before use it while R does not require that.
  • Don’t forget to write the semicolon after each line of the code in C++.

Resources

</span>

For additional information and advanced use, following resources provide a helpful introduction:

]]>
PART - Parallelizing MCMC with Random Partition Trees 2016-10-06T00:00:00+00:00 Kan Li http://li-kan.com/2016/10/Parallelizing-MCMC Background: In particular, conventional MCMC algorithms are computationally very expensive for large data sets. A promising approach to solve this problem is embarrassingly parallel MCMC (EP-MCMC), which first partitions the data into multiple subsets and runs independent sampling algorithms on each subset. The subset posterior draws are then aggregated via some combining rules to obtain the final approximation. This work is testing the method of parallelizing MCMC with Random Partition Trees (PART) proposed by Xiangyu Wang and David Dunson ( [paper] , [code] ) in a simulation setting of a multi-level item response theory (MLIRT) model .

Methods: For the simulation study, I generate 100 datasets. Each dataset contain 1200 subjects with 7200 observations. There are 23 parameters need to be estimated. The dataset is randomly split into 6 subsets. For each subset chains, I run the MLIRT model [code] for 1000 iterations after 3000 burn-in. Thinning by 1 results in 1000 samples. The posteriors samples of each subset chain are aggregated by PART-KD. The results of PART are obtained with minimal cut length as 0.01, minimal fraction block as 0.1 and 10 trees. The model inference using 1200 subjects takes around 27 hours to finish, while each sub-chain (using 200 subjects) takes 2 to 3 hours to finish, and could be run in parallel. I repeat the process on all 100 datasets and obtain 100 copies of aggregated posterior each has 10000 samples.

Results: The simulation results is summarized in the blow table. Majority of the parameters have a small bias and the coverage probability is around 0.95.

The posteriors are plot in figures. The grey lines are the density of 100 copies of aggregated posterior. The solid wide line is density of combine the 100 copies together. The vertical line is the true value of the parameter.

</span>

In conclusion, the PART method could dramatically decreases the computation time for Bayesian inference of large data set and provides relative good estimations (after aggregating sub-chains) in a moderate complex model.

]]>
Classification of Alzheimer’s Disease patient base on gene expression data 2016-08-10T00:00:00+00:00 Kan Li http://li-kan.com/2016/08/Sparse-classification The work investigates the sparse classification methods for high dimensional data. Here I provide a toy example to classify patient with Alzheimer’s Disease (AD) using gene expression data (select 500 gene expression). Two methods were adopted to carry out classification: sparse logistic regression and sparse proximal support vector machine (SVM). I assessed how well the methods can discriminate between patients based on classification accuracy, sensitivity and specificity. I conducted a 5-fold cross validation to avoid overestimation of the performance. All gene expression measures were scaled to zero mean and unit variance. The gglasso package is used. The optimal value of penalty parameter lambda was selected from a series of lambdas which gave the minimum cross validation error on training data set. The simulation results were presented in below tables.



The 500 gene expression gave a relative poor discriminative capability. The mean classification accuracy was around 0.62 via both methods. The sensitivity was surprising low, about 0.2, which indicated a poor capability of identifying AD patients. The high specificity and low sensitivity indicated that the model was inclined to classify patient as cognitive normal. The frequency of cognitive normal patient was 273/447 = 0.611, which was close to the mean accuracies we got from the models. The classification model can be viewed as simple guessing.

I then conducted functional principal component (FPC) analysis on the scaled 500 gene expression data. I selected the first 2 FPC scores which explained 95% of variance in the data. Hotelling’s t-test was conducted between case and control group on FPC scores. The test statistic is 1.49 with p-value 0.226, which indicates that there was on significant different in the gene expression sequence between cognitive normal patients and AD patients. This somehow explains the poor performance of the models in the first analysis.

In conclusion, using all 500 gene expression may not help to identify AD patient. Fitting the model with selected significant genes may improve the discriminative capability. The R code for this toy example is provided.

]]>
Beta Rectangular (BR) Regression 2016-04-02T00:00:00+00:00 Kan Li http://li-kan.com/2016/04/Beta-Regression Background: CAMD of Critical Path Institute developed a quantitative model to describe the progression of cognitive changes in mild to moderate to test and optimize operating characteristics of trial designs for AD (via simulations based on the model). The model and the simulation tool got approval from FDA as “Fit for Purpose”. The original ADAS-cog progression model was based on Beta regression. However, as noted by Hahn (2008), Garcia et al. (2011), and Bayes et al. (2012), the Beta distribution considers neither the tail area events nor the outlying events, fails to represent excess variability and overoccurrence of tail-area events, which could limit its applications for modeling the proportional data (data in the open unit interval (0, 1) such as ADAS-cog score after being divided by 70). An alternative to the Beta distribution is the Beta rectangular (BR) distribution, which assigns more weight to extremal tail-area events and more probability to the outliers and extremal events. Specifically, the BR distribution has a shape parameter \theta (between 0 and 1) controlling the thickness of the tails. When, \theta = 0, the BR distribution reduces to the Beta distribution. In general, when 0<\theta <1, the BR distribution has heavier tails than its Beta distribution counterpart. Please refer to Figure 2 in Wang and Luo (2016) for a visual comparison of Beta and BR distributions.

In the analysis, I compared of the Beta regression model and Beta-rectangular regression model in describing the longitudinal progression of the ADAS-cog in AD patients. I used the testing data set (with ADNI study and 49 meta studies) provided in the tool development group web page, combining with the Study 1013 from Coalition Against Major Diseases (CAMD) database required from Critical Path Institute. The total observations were 6339 from 1005 subjects in 51 studies.

Results: The table below presents the mean of estimated posterior of parameters directly from BUGS models. The “Original Model” refers to the original Beta regression model. The “BR” refers to the Beta-rectangular model. The BUGS code for this Beta-rectangular model is provided [code].


The table indicated that the shape parameter \theta = 0.1184 > 0, and the deviance information criterion (DIC, a Bayesian model selection tool and smaller values indicate a better model fit) in favor of the Beta-rectangular regression model over the Beta regression model. These facts suggested that BR model provided a better model fit in describing longitudinal progression of the ADAS-cog and the existence of potential outliers and heavier tails than what the Beta distribution can model. It worth to include more well processed clinical trial data to do further analysis.

]]>
IEOR PhD 101 2012-05-16T00:00:00+00:00 Kan Li http://li-kan.com/2012/05/ieor-phd

注: 本文首发于一亩三分田数学统计IEOR版 细说IEOR PhD

在上一篇里我对打算在毕业之后找工作的IE master同学提了一些个人建议。其中许多建议和应该思考的问题都不只是限于IE专业的学生。希望拿到AD的同学及早的开始规划这两年的学习。同时感谢大家对上一篇的支持。

这里另起一篇要谈谈IE 的Phd都是做什么的。 内容对master的同学帮助不大,不过有时间了解了解也没坏处。

Phd

IE起源于传统制造业,具体传统上IE是做什么的请自己wiki一下。随着制造业在美国的落寞,IE也变成了一个夕阳学科,或者说很成熟了, 所以如今的IE已经很少再做制造业方面的研究(比如生产排程),培养一些master 输入业界已经够用了。

作为有志向攻读Phd的青年,有必要了解一下如今的IEOR倾向的研究方向。

Operations Research/Management science是IE的核心。 就是应用 mathematical modeling,statistics, simulation 来为某个应用领域做决策的学科。其中大家听说的manufactory, supply chain, healthcare, revenue management, disaster respond等等方向都是OR/MS 作为依托的应用领域。有些系在某应用领域项目比较多,研究的老师比较多,带头人就愿意专门分出来,成立个组什么的。 但实质差别不大,只是学的应用领域的东西不同。OR是所有IE系都有的, 只是应用领域各自都有特点和侧重。OR方向我会之后详细讲解。 类似的还有商学院 operations management/decision science 专业,很多数学课都会来IE系来学, 但他们不要求特别高的数学理论, 而侧重商业分析。

剩下的方向都是小众,你只有在不多的学校找到相应的研究方向。我对每个方向和每个学校并不是特别清楚,所以大概介绍一下。 你可以去 IIE 网站 了解更多IE的东西。

Statistics,IE 也研究统计, 主要是研究制造业中 quality control, process improvement, reliability analysis 。代表学校ASU, 做集成电路生产制造, rutgers。

Stochastic/simulation, 研究仿真理论, 理论性比较强。 Northwestern 仿真界的牛校。

Human factor, 我只知道是设计产品更符合人体工效学,使人操作更顺手,降低疲劳积累速度,减少失误,从而提高工作效率。代表学校,VT, Wisconsin-M, OSU

Information system,这个方向在IE系成气候的并不多。代表学校,Wisconsin-M 有研究healthcare information system, USC

Manufactory, 这个现在主要研究制造技术(快速成型, 机械切割什么的),纳米材料,材料加工等。 和ME有很多重叠的。 一些老牌的IE系还保留着这方面的研究,比如PSU,PITT。 所以一些做材料或是ME的可以考虑申请这些IE系,竞争小些。

还有一些更小众的研究领域,像是 Energy, 比如 iowa 有研究电力系统的, UT – Austin 有研究石油领域的。具体干什么我不清楚。还有 complex system/social network , 比如 西北。

其实现在的研究越来越多的都是跨学科研究了,各个院系研究的领域很大因素取决于它周围的资源。 比如Cornell,Columbia 就比较多的是做 金融, risk management, revenue management; Phoenix由于有很多电子制造业还是西部distribution center, 所以ASU研究电子器材制造,质量控制,供应链比较突出; U Michigan, Wisconsin, Pitt 由于都有附属的大型医学中心, 所以研究偏向healthcare。具体各系老师做什么,你只有和该系的老师学生进一步交流才能知道了。很多学校网站上老师的介绍都是百年不更新,所以不要太依赖那个信息。另一种方法是把教授名字输入google scholar 中,年限定2009年以后,看看他这几年都发了什么文章,也可判断他最近在做什么。

这里我来具体讲一下Operations Research 一些情况。如果你对OR有兴趣,你就要先知道一下INFORMS 协会, 你可以找他们的网站,从那你能了解更多关于OR的东西。

OR的研究领域

现在OR的常见的研究领域是 logistics/supply chain, 还有Health care。 当然还有别的, Phd是个把问题越研究越专的东西,各方向下面越分越细,由于我做的是healthcare方向的东西,所以其他方向具体些研究什么问题,用什么方法,我就说不了那么详细了。

Logistics 也在走下坡,因为研究的已经差不多了。supply chain 研究的主要是inventory control,distribution network, coordination等, 很多学校都有。我只说自己觉的有一个研究趋势,是把商业供应的研究成果扩展到humanitarian and non-profit 供应链。 GIT 一直是supply chain 领域的领导者,当然他是IE学科的领导者,Northwestern 也有研究小组, 还有Berkley,法国 INSEAD 商学院都有教授做。这里我们可以看到,牛校总是走在前沿,这也与他们有一定经济实力拉的来项目有关。毕竟投钱做non-profit 事情的机构还是少。 这个领域主要研究疫苗或是救灾物资的储备,选址,灾难发生后的evacuation, last mile distribution,vehicle scheduling ,routing,GIS的应用;灾后debride 的处理, NGO之间的合作。 目标是以有限的物资最快最好的服务尽可能多的目标群体,控制成本,减少浪费, 同时要兼顾公平等人道主义问题。如果你对这方面感兴趣,可以关注一下上面几个学校相关小组的页面。

Health care 不能说是新兴的,但是也是很多学校逐渐涉猎的。很多学校都在招具有这方面研究经验的教师, 原因是做医疗有钱。 IE教授拉钱是很辛苦的,一般从NSF 申请到的一个科研项目三年只有30多万。但是healthcare 那边的从CDC和各种基金会拉到的项目经费都是按百万计算的,能养活一大批人。IE教授是拉不来这样项目的,一般都是由医学院或是公卫那边的PI牵头,IE这边给他们打工,不过也能分得挺大一杯粥了。这存在一个问题就是,发paper你永远名字写不到前面,因为你只是在做一种技术支持的感觉。当然你也可以把你技术支持部分的内容放大,独立发些文章。那IE 在health care 具体做什么? 一般和医学院合作的话 研究therapy decision making, medical decision making. 比如研究化疗应该什么时候开始,器官移植匹配度达到怎样就选择移植,用药每次要多大计量,每一疗程出现什么症状采取什么疗法 等等, 通过dynamic模型来预测优化治疗效果。 和医院合作的话,研究优化医院系统,提高医疗资源利用,减少浪费,OR(operation room) 的排程,healthcare quality, 主要运用process improvement, simulation 的东西比较多。 和公卫合作,主要是health policy 的制定。 像pandemic 的预测, influenza dynamic modeling,预测该用哪种vaccine 类型,物资储备、资源分配、人员调度等优化。大概就这些范围,如果有兴趣可以下载一些Michigan,Pitt,W-M, NCSU相关领域教授的文章看看introduction。

OR都学些什么

上面讲的都是OR现在热门的应用领域, 那OR都具体学什么? 其实OR就是把上面那些情景中的问题抽象成数学模型,然后再解数学模型的学科。OR对数学要求很高,不但要你熟练掌握各种类型的模型,抽象实际问题的能力,还要具备解数学模型的理论推导能力,还有优化算法的能力。编程方面的要求就是能把你的数学公式转变成机器语言,调用solver, 或是自己写算法,让计算机求解。

OR 一般要解决的都是 optimization 的问题, 根据不同的问题,采用不同的两类模型,确定性模型 和概率模型。IEOR Phd修的课和master修的课差别很大,一些学校要求学生修过 real analysis。

确定性模型一般都要学习optimization 的课,一般都有 Linear optimization, non- linear optimization, integer programing, network optimization, dynamic optimization等。像上面提到的supply chain 领域,如不牵扯需求预测, 多用确定性模型。

概率模型一般都要学 Probability theory,Statistics,Stochastic process, queuing theory, Markov decision process 等。像解决healthcare领域的问题,多会用到概率模型。

一般每一系列中前面几门都是必修的,靠后的课程是你根据你要研究的方向才选是多修确定性模型还是概率模型的课。当然每一系列还有一些更高级的课,通常是把上面课程中的某一章抽出来再往细将,一般比较理论。

另外一般还会学到statistics 和 simulation。 因为另一种解决optimization问题的方法是simulation。 一般在要研究的系统非常复杂的情景(如医院系统,某一地区的disaster respond),无法建立数学模型的情况采用。 Simulation 和 mathematical modeling 是很不一样的方法。 Simulation 其实是一种枚举的方法,没有最优解,只能说是所得结果中最好的。 常常要对上千次的仿真结果做统计分析,确定那个解。

作为OR的Phd 一般都需要你掌握的软件有, CPLEX (这个是解确定性模型的软件,不是开源的,你可以先学习另外一种软件和语言GLPK/GMPL, 这个是开源的,先了解一下是怎么回事) MATLAB (很多时候用他来编算法,解问题,相比快捷) C++ 得拿得出手啊, 很多数学运算软件都有C++API Excel/VBA( excel 要用好) Arena(商业仿真软件,实际研究中不太会用到,但是一般是IE under的课,学会了可以找找有没TA的机会)

IEOR Phd 怎么申请

Phd 的申请不是很乐观,如果不是清华,上交,复旦,同济,西安交大,北大学IE,数学,物流,学习成绩处于上游的同学我的建议还是谨慎申请排名前15的phd. 因为一般ie系每年招生只在5-10个之间,除了GIT这种称谓IE school的,好的年份最多都不会超过15个。我们就按每个系50%招的是中国学生,名额也不够呀。你可以上学校网站上找在读phd,有些人会有主页或cv, 看看这些人的背景,和你自己比较比较。

另外一种方法就是曲线救国,先申master, 找教授参与项目,不要怕吃亏,没钱也白干,而且还要干的很积极。一般教授也不是瞎子,看你这么用功,他要有余钱或是ta机会也一般会帮你的。还有要向教授表明自己想读phd, 修phd的课,一方面看你是不是真喜欢这领域,另一方面验证你学phd的课不会吃力。如果你这两点做好了,考qualify 或是 重申phd 都是顺水推舟的事,考过qualify 一般都会给你钱的。一般qualify都是第一学年结束的夏天。当然还要注意,用这种方式的人越来越多,master 都海招了,大家都这样做你就不一定就能凸显出来。

IEOR Phd 干什么工作?

具体Phd 干什么工作由于我还没经历过找工作, 所以也谈不周详。 不过有这样几种选择, 去业界的话 航空公司做航班调度排程. UPS、Fedex这样的物流公司做operations research analysis. Consultant。 还是推荐 ZS 医疗机构,如果你研究的是healthcare 相关的

找教职的话, 一般排名10以前的去排名20-30的找教职,排名20以前的去排名30-40的找教职。当然你也不一定非要选择IE系教职。也可以考虑先去做几年post-do , 或是去哪个研究机构历练两年,heathcare的比如Merck Research Laboratories 或者CDC。然后根据情况看你再回不回学校教书。

IEOR Phd 相对master 就要好找工作了,一般科研结构起薪平均在7万。 我也不知道问题究竟出在哪了,可能中国学生多半表现自己的欲望和技巧要差一些,找工作一般感觉要比其他国家的Phd逊色。一般OR Phd 的大部分是土耳其,伊朗,中国学生。Phd其实更注重口语的,所以练口语也是很重要的,建议和给master的一样。

]]>
IEOR 101 2012-05-13T00:00:00+00:00 Kan Li http://li-kan.com/2012/05/ieor-master

注: 本文首发于一亩三分田数学统计IEOR版 细说IEOR

偶然从Warald 的blog看到这个论坛和这个版块, 发现关于IEOR的帖子很少,多半都是提问贴,而且回答也多少有失准确性。所以想着可以结合自己的经历和所见所闻给大家讲讲IEOR的东西,希望对后来的申请者和学弟学妹们有所帮助。我所发现的这个问题其实是所有留学论坛的通病,参与发帖与互动的基本都是还未踏上美洲大陆的申请者,然而真正的过来人,如果不是当年的版主或是什么,基本很少会再回来。所以,如果一亩三分田没有相关的什么规定,我倒是不介意你转载到其他论坛上去让更多的人有所收获。能标上出处也好。

由于一路工科走过来,文笔上大家就不要强求了。 而且我只能抽空写写,想到哪里写哪里,留言提问可能也没时间回答,所以抱歉抱歉。当然,我也不能说我在这学科有多资深,除了我的领域,很多地方也了解的不全面或是压根不了解(比如human factor 方向, 印象中有这个方向的学校不多,比如VT, Wisconsin- Madison),也可能含个人见解和局限性,如果你有不同意见,也很乐意交流。

由于Master 和 Phd 基本是两条不同的道路,所以我也分开来侧重介绍。

Master

Master 主要的培养方向是在业界找工作,录取条件相对较低,所以很多申请Master的一般都能有不错的AD。 我在申请方面的建议就是,如果你觉得你条件还行,GPA,GRE够高,手头有钱,大胆!胆大!

对于master来说, 如果你不是IE学院下特殊项目, 比如finance engineering (Columbia, Cornell) ,并不明确的分什么方向, 大家都修一样的必修课,一般是基础的probability,statistics,manufactory system,operations research然后就是根据你想做的工作选择其他一些选修课,比如去做supply chain (facility layout,inventory control,supply chain), 去manufactory做质量管理(total quality management,6 sigma),去医院做health care方面 (个别学校有这个条件) health care management, health care quality, 还有一些通识类比如,decision science,simulation,project management, data base 等等, 都是很基础的。

由于IE是个小系,很多排名10名以后的IE系根本就没能力开那么多master的课让你选,所以你基本不得不把系里开的master的课都得上过一边,别无选择。当然很多学校允许你到外系去修一两门课, 这样我建议可以去商学院选选,还有就是到计算机或是信息科学系学学data mining, database 之类的。一般修够30个学分(10门课)就能毕业。

IE 的master相对是比较好读的,如果你家底殷实,主要目的是体验美国生活,顺便捞个文凭,那IE是个好选择,去名校的概率也大。如果你觉得五六十万对你家来说还是挺吃力的,之后想在美国找个工作回报父母呀,我建议你还是不要考虑IE了,因为IE真心不好找工作。 IE看着是个万精油的专业,可万精油也意味着啥都不如人家专业的。我知道有些人会想我工科背景,学些management的东西都好呀,可等你找工作的时候,你会感觉到这两年怎么啥都没学到,management(还有商学院的finance和marketing)对于中国人,特别是没有工作经验的这群学生来说真的就是鬼扯呢。你会发现你竞争不过美国硕士,美国本科生,印度同学(基本都是学EE,ME的),还有国内读过IE本科的同胞。 当然有些人是找到工作了,一种原因是他有比较多cs背景,每天做的编程的工作;一种原因是他有一些名企的相关实习或几年的工作经历;一种原因他是“人家家孩子”,他做什么就是比你强。

好,如果你觉得自己是一个别无长处的IE master学生, 这里我有些建议告诉你该怎么做(其他专业也可借鉴)。如果你拿到了某个学校的master并且准备去了,就去网页上找系里面中国学长学姐的联系方式(最好是PhD)。 我提的这个建议可能会被这些学长学姐打的,因为就我本身来说也是不喜欢有人打扰的。所以你发一次没回就不要再发了。不回的意思就是我没工夫搭理你,我啥时候心情好了再考虑回不回你。第一封用英文,因为学校邮箱很多显示不了中文。信里要有礼貌,别一上来就让人感觉你有所求。信里可以了介绍一下你,问其他可以用中文的联系方式。之后的联系中问些生活方面的信息,然后了解一下哪些课需要TA,grader, 哪些课的TA经常大家不愿做; 至于哪个老板做什么,可能不可能要学生,我觉得这种问题就不要问了,因为对你来说要RA渠道获得钱的可能性几乎没有。 知道哪些课需要TA或是grader,先看看自己有没修过这些课,修过太好,没修过赶紧看看。

当你踏上美洲大陆的那一刻,你就要试着让自己变成一个外向的人。对人热情,有问题就问,放开,不要扭扭捏捏,没啥不好意思的。 及早的去学院报到,找小秘聊,聊学校提供得能为我所用得各种资源(计算机,图书馆,就业指导中心,健身房乱七八糟的)。找系主任聊、透漏你想做grader,TA的意思,聊你想找工作的打算。找管研究生教学的教授聊,聊选课,透漏你要TA,grader的意思。要不要的上不重要,重要的是让他们知道有你这么个人认存在。而且我一直觉得在国外机会都是自己争取的,中国学生很多都怕这个怕那个,不知道在怕什么,这一点就输给印度学生了。脸皮要厚,事情成不成一回事,咱先问一下也损失不了个咱啥。找工作也一样。

如果你的志向是继续Phd,你的选课和所要做的事情是不一样的, 这个我之后会说。

读master应该做什么?

对于致力于毕业找工作的同学,你在美国学好课的同时,最重要的是交际。 这里的交际不是你跟中国狐朋狗友开party! 而是广泛的认识人,美国人,其他国家的人,练口语,练到你能清楚地表达你的想法让别人听懂,练到你做presentation自如不怯场,练到你听得明白别人的问题,并能很快反应出来用英语回答。至于口语音调美不美都是次要的。

这里最直接的资源就是教会,我不是个教徒,也没怎么去过,不过我认识的人都从这段经历中收获很多。有的美国教会会举办很多活动邀请国际学生参加,意在让你感受上帝的仁爱。每周会组织学习学习圣经啊,讨论讨论其他文化还有新闻问题呀,做些游戏什么的。节日里便会邀请你去hosting family一起过节。在这个过程中你就练了口语,了解了美国文化,至于你信不信教一般是没人强迫你的。如果你排斥这种方法,那也只有多和本班的美国同学交朋友了,做课程项目的时候多和他们一组,虽然和美国人一组效率很低,但是他们有他们的资源。中国学生许多因为觉得自己口语不行,文化差异大而选择中国人一活做课程项目,这是毫无意义的。要想成功,就要更多的暴露自己的弱点。

除了好的GPA 和 口语能力外, 对行业的了解和面试技巧都是要及早准备的。一般IE master 项目是1年半到两年,由于只是休学分,有些也可以暑假修课,一年毕业,不过我不推荐。由于短学制,所以你一来的那个fall就应该开始准备材料CV之类去找实习,去学校的就业指导中心改CV, cover letter, 参加模拟面试之类的。 在秋季的校园招聘中去了解来的都是哪些企业,当地企业,他们要什么样得人,有没有实习机会。 另一个重要资源是 学校 的 co-op program(co-op 是什么自己查),这个和你专业直接对口,干的好了以后可能直接留到那公司。coop一般是面向本科生的,不过还是我那句话,机会都是要出来的,问啊!问啊!长嘴光出气那!美国没有说什么规定是死的。如果你能在来年春季或是夏季做个实习的话,之后找工作就会顺很多。所以你要想找工作的话第一年就按我上面说的做。

其实我不提倡找工作的再去操心哪个小组招不招做研究的。这些只是个人想法。因为IE的科研研究经历和项目对你找工作基本没价值,另外你第一年能找到给你钱的老板的机会不大,因为项目funding的支出和分配,哪一部分干什么干什么早就是定好的。但是这也不妨碍多去了解各个老板都是做什么的。哪个我比较感兴趣就去聊聊,万一找不到实习什么的还可能有条路,就是看他暑假要不要人,打个工。或者他带什么课要不要TA之类的。所以和偏应用研究的老师处好关系也是有益无害的。

IE Master 找什么样的工作?

IE是个万精油,所以你如果觉的你哪个方面有所长,能和人家专业的拼拼,那你可以去试一下。 有人说我学OR找什么样的工作,答案是没有你能做的工作。这里不是说学OR找不到工作,而是你要具体到什么应用领域。真正设置专门做OR的单位很少,而且可以说要的都是Phd,招人的一般是国防部门(你没戏),带研究中心的大型医疗机构,大咨询公司。至于master,Consultant喜欢要IE的,但是进去很难,他们一般喜欢有美国本科学历的,而且这是个跟人打交道的行业,所以我上面交代的准备工作没做好的话,你也不要对他报什么希望,参考公司ZS associate。 传统的制造业里,如果你本科不是IE或是没相关实习工作经验的比较难。做supply chain 的一般是商学院的会和你竞争,比没优势,你可以选择偏向做modeling的职位,比如Fedex 有operations research analysis I 这样的职位,做供应链的建模仿真,所以这就要求你学好线性规划,network,库存管理,simulation还有相关解线性规划问题的软件和仿真软件,对计算机编程要求高。另外一行能找到工作的是做质量管理,这就要凸现你统计的优势,上好TQM,6sigma,考个green belt,熟练应用其中process improvement的方法和模型,用好minitab。找到这种职位工作的几率比较大。 如果你对这些没兴趣,其他我真心觉的你应该去学计算机。

其实我觉的在美国找工作,不分专业,第一就是口语,第二就是和系主任拉好关系,借别人的关系网,自己去多问多争取,机会不会自己来敲门。

]]>