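How should I get started with machine learning?

I'm an undergraduate with a strong interest in machine learning, and I'd like to do research in this area. I've seen online that there are some classic machine learning books, such as Bishop's PRML, Tom Mitchell's machin…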

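I'm going to translate a Quora answer and add some of my own understanding; I think that will make a good answer. I've gathered all the links in one place instead of inserting them into the body. The bar it sets is actually quite high, and I feel I'm still a very long way from it myself. I'll try to keep improving the translation and adding my own understanding.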

1. Python/C++/R/Java - you will probably want to learn all of these languages at some point if you want a job in machine learning. Python's NumPy and SciPy libraries [2] are awesome because they have similar functionality to MATLAB, but can be easily integrated into a web service and also used in Hadoop (see below). C++ will be needed to speed code up. R [3] is great for statistics and plots, and Hadoop [4] is written in Java, so you may need to implement mappers and reducers in Java (although you could use a scripting language via Hadoop streaming [5]).

First, you need to be familiar with these four languages. Python has many open-source libraries; take a look at NumPy and SciPy, both of which integrate well with web development and with Hadoop. C++ makes your code run faster, and R is a great statistics tool. To use Hadoop well you also need to know Java and how to implement map-reduce.

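2. Probability and Statistics: A good portion of learning algorithms are based on this theory: Naive Bayes [6], Gaussian Mixture Models [7], Hidden Markov Models [8], to name a few. You need to have a firm understanding of probability and statistics to understand these models. Go nuts and study measure theory [9]. Use statistics to evaluate your models: confusion matrices, receiver operating characteristic (ROC) curves, p-values, etc.

I recommend 统计学习方法 (Statistical Learning Methods) by Li Hang, who is more or less my mentor's mentor. Get to understand some of the probability-related theory and models, such as Bayes, SVM, CRF, HMM, decision trees, AdaBoost, and logistic regression, then look a little at how to do evaluation, for example precision, recall, and F-measure (P, R, F). You can also look into hypothesis testing.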
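
To make the evaluation step concrete, here is a minimal sketch of how P, R, and F can be computed from the counts in a binary confusion matrix. The function names and the toy labels are made up for illustration; in practice you would more likely use a library implementation.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    counts = Counter()
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1), using 0.0 when a ratio is undefined."""
    c = confusion_counts(y_true, y_pred, positive)
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    precision = c["tp"] / predicted_pos if predicted_pos else 0.0
    recall = c["tp"] / actual_pos if actual_pos else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels, invented for this example.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

With these toy labels there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 3/4.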

3. Applied Math + Algorithms: For discriminative models like SVMs [10], you need to have a firm understanding of algorithm theory. Even though you will probably never need to implement an SVM from scratch, it helps to understand how the algorithm works. You will need to understand subjects like convex optimization [11], gradient descent [12], quadratic programming [13], Lagrange multipliers [14], partial differential equations [15], etc. Get used to looking at summations [16].

Machine learning does, after all, demand a very, very strong mathematical foundation. I'd suggest starting by understanding the essence of a few algorithms in depth. SVM is a good place to start: from there you can look into what Lagrange multipliers and convex optimization are.
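
As a small illustration of one of the topics listed above, the following sketch runs plain gradient descent on a convex quadratic. The objective f(x, y) = (x - 3)^2 + 2(y + 1)^2, the step size, and the iteration count are all arbitrary choices made for this example.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient, starting from `start`."""
    point = list(start)
    for _ in range(steps):
        g = grad(point)
        point = [p - learning_rate * gi for p, gi in zip(point, g)]
    return point


def grad_f(point):
    """Gradient of the example objective f(x, y) = (x - 3)^2 + 2 * (y + 1)^2."""
    x, y = point
    return [2 * (x - 3), 4 * (y + 1)]


# The objective is convex with its unique minimum at (3, -1).
minimum = gradient_descent(grad_f, start=[0.0, 0.0])
print(minimum)
```

Because the function is convex and the step size is small enough, the iterates shrink toward the single minimum; on non-convex problems the same loop only finds a local stationary point.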

4. Distributed Computing: Most machine learning jobs require working with large data sets these days (see Data Science) [17]. You cannot process this data on a single machine; you will have to distribute it across an entire cluster. Projects like Apache Hadoop [4] and cloud services like Amazon's EC2 [18] make this very easy and cost-effective. Although Hadoop abstracts away a lot of the hard-core distributed computing problems, you still need to have a firm understanding of map-reduce [22], distributed file systems [19], etc. You will most likely want to check out Apache Mahout [20] and Apache Whirr [21].

Get familiar with distributed computing. Machine learning today has to run big data across many machines, otherwise there isn't much point. Get to know Hadoop; it matters a great deal when job hunting. Companies such as Baidu require a Hadoop background.
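
To show the shape of the map-reduce idea without a cluster, here is a single-process sketch of a word count. It only imitates the three phases (map, shuffle, reduce) in memory; it is not Hadoop's API, and the function names and sample lines are invented for illustration.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: combine the values collected for one key."""
    return key, sum(values)


def map_reduce(lines):
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


docs = ["big data needs big clusters", "data beats algorithms"]
counts = map_reduce(docs)
print(counts)
```

On a real cluster the mappers and reducers run on different machines and the framework performs the shuffle over the network, but the contract of each function stays the same.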

5. Expertise in Unix Tools: Unless you are very fortunate, you are going to need to modify the format of your data sets so they can be loaded into R, Hadoop, HBase [23], etc. You can use a scripting language like Python (using re) to do this, but the best approach is probably to just master all of the awesome Unix tools that were designed for this: cat [24], grep [25], find [26], awk [27], sed [28], sort [29], cut [30], tr [31], and many more. Since all of the processing will most likely be on a Linux-based machine (Hadoop doesn't run on Windows, I believe), you will have access to these tools. You should learn to love them and use them as much as possible. They certainly have made my life a lot easier. A great example can be found here [1].

Get familiar with Unix tools and commands. Companies such as Baidu do their work on Linux, and companies that rely on Windows for their services are probably fairly rare by now. So at the very least you should know these Unix commands. I remember a Baidu interview question that asked about copying files.
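
For the scripting route mentioned above (Python with re), a reformatting job might look roughly like this sketch. The `key=value; key=value` input format and the column names are hypothetical, chosen only to have something to clean up.

```python
import re

# Matches one "key = value" field; values run until the next semicolon.
FIELD = re.compile(r"(\w+)\s*=\s*([^;]+)")


def to_tsv(line, columns):
    """Turn a loosely formatted 'key=value; ...' line into one TSV row."""
    fields = {key: value.strip() for key, value in FIELD.findall(line)}
    # Missing fields become empty cells so every row has the same width.
    return "\t".join(fields.get(column, "") for column in columns)


# Hypothetical raw records with inconsistent spacing and field order.
raw = ["user=alice; age=30; city=Paris", "age = 41 ;user=bob"]
rows = [to_tsv(line, ["user", "age", "city"]) for line in raw]
for row in rows:
    print(row)
```

The same job could be done with awk or sed in a pipeline; the script form is mostly useful once the cleanup rules outgrow a one-liner.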

6. Become familiar with the Hadoop sub-projects: HBase, ZooKeeper [32], Hive [33], Mahout, etc. These projects can help you store/access your data, and they scale.

Machine learning is ultimately tied closely to big data, so keep an eye on the Hadoop sub-projects, such as HBase, ZooKeeper, and Hive.

7. Learn about advanced signal processing techniques: feature extraction is one of the most important parts of machine learning. If your features suck, no matter which algorithm you choose, you're going to see horrible performance. Depending on the type of problem you are trying to solve, you may be able to utilize really cool advanced signal processing algorithms like wavelets [42], shearlets [43], curvelets [44], contourlets [45], and bandlets [46]. Learn about time-frequency analysis [47], and try to apply it to your problems. If you have not read about Fourier analysis [48] and convolution [49], you will need to learn about this stuff too. The latter is signal processing 101 stuff though.

This part is mainly about feature extraction. Whether the problem is classification or regression, you have to solve feature selection and extraction. He lists some basic feature-extraction tools such as wavelets, and says you also need to master Fourier analysis, convolution, and so on. I don't know this area well; the gist is that you need to understand signal processing, Fourier analysis and the like.
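
For readers who have not met the two "signal processing 101" tools named above, here is a deliberately naive sketch of a discrete Fourier transform and a discrete convolution. Both are written for clarity, not speed, and the test signal (a sine with 3 cycles over 16 samples) is an arbitrary choice for this example.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def convolve(signal, kernel):
    """Full discrete convolution: output length is len(signal) + len(kernel) - 1."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


# A sine wave with 3 cycles across 16 samples.
n = 16
wave = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
magnitudes = [abs(c) for c in dft(wave)]
# Only the first half of the spectrum is unique for a real signal.
dominant = max(range(n // 2), key=lambda k: magnitudes[k])
print(dominant)

# A two-tap averaging kernel smooths neighbouring samples.
smoothed = convolve([1, 2, 3, 4], [0.5, 0.5])
print(smoothed)
```

The frequency bin with the largest magnitude is the kind of quantity that can serve as a feature; real code would use an FFT routine from a numerical library instead of this quadratic loop.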

Finally, practice and read as much as you can. In your free time, read papers like Google Map-Reduce [34], Google File System [35], Google Big Table [36], The Unreasonable Effectiveness of Data [37], etc. There are great free machine learning books online and you should read those also [38][39][40]. Here is an awesome course I found and re-posted on GitHub [41]. Instead of using open source packages, code up your own, and compare the results. If you can code an SVM from scratch, you will understand the concept of support vectors, gamma, cost, hyperplanes, etc. It's easy to just load some data up and start training; the hard part is making sense of it all.
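
As a starting point for the "code an SVM from scratch" exercise, here is one possible minimal sketch: a linear soft-margin SVM trained by subgradient descent on the hinge loss. This is a simplification chosen for brevity, not the quadratic-programming formulation, and it has no kernel, so the gamma parameter does not appear. The toy data, learning rate, and epoch count are assumptions made for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(points, labels, cost=1.0, learning_rate=0.01, epochs=500):
    """Minimize 0.5 * ||w||^2 + cost * hinge loss, one sample at a time."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (dot(w, x) + b)
            if margin < 1:
                # Inside the margin: regularizer and hinge both contribute.
                w = [wi - learning_rate * (wi - cost * y * xi)
                     for wi, xi in zip(w, x)]
                b += learning_rate * cost * y
            else:
                # Outside the margin: only the regularizer shrinks w.
                w = [wi - learning_rate * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Classify by which side of the hyperplane w.x + b = 0 the point lies on."""
    return 1 if dot(w, x) + b >= 0 else -1


# Toy, linearly separable data; labels must be +1 or -1.
points = [(2.0, 2.0), (3.0, 1.0), (1.0, 3.0),
          (-2.0, -2.0), (-3.0, -1.0), (-1.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
predictions = [predict(w, b, x) for x in points]
print(w, b, predictions)
```

Comparing the learned `w` and `b` against an established package on the same data, as suggested above, is a good way to check an implementation like this. The `cost` argument plays the role of the cost parameter mentioned in the text: larger values penalize margin violations more heavily relative to the size of `w`.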

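In short, getting started with machine learning has two sides:

One side is studying the algorithms, which requires a very strong mathematical foundation (truly very strong). Start with SVM and build up your understanding bit by bit.

The other side is learning the tools, such as distributed computing tools and Unix.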

Good luck.

Best wishes.

[1] http://radar.oreilly.com/2011/04...
[2] NumPy
[3] The R Project for Statistical Computing
[4] Welcome to Apache™ Hadoop®!
[5] http://hadoop.apache.org/common/...
[6] http://en.wikipedia.org/wiki/Nai...
[7] http://en.wikipedia.org/wiki/Mix...
[8] http://en.wikipedia.org/wiki/Hid...
[9] http://en.wikipedia.org/wiki/Mea...
[10] http://en.wikipedia.org/wiki/Sup...
[11] http://en.wikipedia.org/wiki/Con...
[12] http://en.wikipedia.org/wiki/Gra...
[13] http://en.wikipedia.org/wiki/Qua...
[14] http://en.wikipedia.org/wiki/Lag...
[15] http://en.wikipedia.org/wiki/Par...
[16] http://en.wikipedia.org/wiki/Sum...
[17] http://radar.oreilly.com/2010/06...
[18] AWS | Amazon Elastic Compute Cloud (EC2)
[19] http://en.wikipedia.org/wiki/Goo...
[20] Apache Mahout: Scalable machine learning and data mining
[21] incubator.apache.org/wh
[22] http://en.wikipedia.org/wiki/Map...
[23] HBase - Apache HBase™ Home
[24] http://en.wikipedia.org/wiki/Cat...
[25] grep
[26] en.wikipedia.org/wiki/F
[27] AWK
[28] sed
[29] http://en.wikipedia.org/wiki/Sor...
[30] http://en.wikipedia.org/wiki/Cut...
[31] http://en.wikipedia.org/wiki/Tr_...
[32] Apache ZooKeeper
[33] Apache Hive TM
[34] http://static.googleusercontent....
[35] http://static.googleusercontent....
[36] http://static.googleusercontent....
[37] http://static.googleusercontent....
[38] http://www.ics.uci.edu/~welling/...
[39] http://www.stanford.edu/~hastie/...
[40] http://infolab.stanford.edu/~ull...
[41] https://github.com/josephmisiti/...
[42] http://en.wikipedia.org/wiki/Wav...
[43] http://www.shearlet.uni-osnabrue...
[44] http://math.mit.edu/icg/papers/F...
[45] http://www.ifp.illinois.edu/~min...
[46] http://www.cmap.polytechnique.fr...
[47] http://en.wikipedia.org/wiki/Tim...
[48] http://en.wikipedia.org/wiki/Fou...
[49] http://en.wikipedia.org/wiki/Con...

I noticed that nobody has mentioned Metacademy, so I'll recommend it as a tool for getting started, along with a few rough thoughts of my own.

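Many of the answers above are too sprawling. Granted, the field of machine learning has many classic materials worth spending large blocks of time on, but a newcomer who dives headfirst into such a bottomless ocean of knowledge right at the start will inevitably feel some frustration, and that frustration is both harmful to deeper learning and unnecessary. In fact, we could name hundreds of keywords in this field, such as "evolutionary computation" and "statistical relational learning", each representing a subfield, and no machine learning researcher, however outstanding, would dare claim a solid understanding of every one of them.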

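If you are interested in machine learning, then once you have the most basic knowledge you can try doing some research in a subfield that interests you, letting problems drive you and gradually building self-motivation. Continually broadening your perspective while solving problems, and improving both your insight into problems and your confidence in research, may matter more.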

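In the process, though, the problems caused by a weak foundation may surface: whenever you read a paper you run into many concepts you have never heard of, and to follow the paper's logic you have to go back and learn them first. So you plunge into the ocean of knowledge again, switching among dozens of search results, trying to figure out the prerequisites of each prerequisite, without knowing when these puzzle pieces will finally assemble into the paper you originally wanted to understand.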

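An advisor who is both strong enough and patient enough could help a great deal, but most advisors won't be that attentive; they only guide you in the broad direction. What we need then is a thoughtful "mentor" for the structure of knowledge, one that tells you which things you need to learn to understand a concept, why they matter, and how to pick them up quickly. We need a clear knowledge graph to help us solve the problem at hand as quickly as possible.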

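That is the original purpose of Metacademy. Metacademy links individual concepts together, like a skill tree in a game. Each concept has a short introduction and links to high-quality learning resources; most importantly, it draws the knowledge graph leading to that concept. Metacademy's goal is to be "your package manager for knowledge", though for now it only covers some machine learning and the related mathematics.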

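For example, suppose we want to understand CNNs (convolutional neural nets). We can search for the term directly on Metacademy.

The result page shows an introduction to the concept.

Among the resources listed there, the course Coursera: Neural Networks for Machine Learning is one that many more experienced people would surely recommend; it is taught by deep learning master Geoffrey Hinton.

We can also click the tree icon in the top-left corner to view the knowledge graph.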


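The relationships between the layers of knowledge become clear. Even a complete beginner knows vectors and the dot product. The amount to learn has not objectively changed, but you are no longer drowning in an ocean of knowledge; instead you face a staircase of knowledge and climb it one step at a time. That feels entirely different, and in research, how it feels matters a great deal.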
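
The "package manager for knowledge" idea can be sketched as a dependency graph plus a topological sort, which is how package managers order installs. This is only an illustration of the concept, not how Metacademy itself is implemented; apart from vectors, dot product, and convolutional neural nets, the concepts and edges below are made up. It assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical prerequisite graph: concept -> concepts to learn first.
prerequisites = {
    "convolutional neural nets": {"feed-forward neural nets", "convolution"},
    "feed-forward neural nets": {"dot product", "gradient descent"},
    "convolution": {"dot product"},
    "dot product": {"vectors"},
    "gradient descent": {"vectors"},
    "vectors": set(),
}


def learning_path(graph, goal):
    """List the goal and everything it depends on, prerequisites first."""
    needed, stack = set(), [goal]
    while stack:
        concept = stack.pop()
        if concept not in needed:
            needed.add(concept)
            stack.extend(graph.get(concept, ()))
    # Keep only the part of the graph that the goal actually requires.
    subgraph = {concept: graph.get(concept, set()) for concept in needed}
    return list(TopologicalSorter(subgraph).static_order())


path = learning_path(prerequisites, "convolution")
print(path)

full_path = learning_path(prerequisites, "convolutional neural nets")
print(full_path)
```

Each concept appears only after all of its prerequisites, which is the "staircase" described above: the total amount to learn is unchanged, but the order is explicit.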

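Of course, Metacademy is still at an early stage; I'm only using it as an example. As for how to get started with machine learning and what counts as having gotten started, everyone has their own view, and I'm not yet qualified to judge. My thinking is that in an age of such advanced technology and abundant knowledge, we shouldn't feel lost; we should instead see that the road is wider and the world more colorful. Perhaps we can set aside some of the tedious existing knowledge for now and put more of our energy into the questions that deserve more thought; that may be a better way to keep getting positive feedback from study and research.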