Tianjin University
Dimensionality Reduction in Complex Data Environments
Pengfei Zhu
School of Computer Science and Technology, Tianjin University
2016-10-26
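Complex data environments
Special issue topic: machine learning research in complex environments
Scope of the call for papers: (1) processing and modeling of uncertain data; (2) machine learning for multi-source, heterogeneous, complex data; (3) applications of machine learning to complex tasks
People are no longer satisfied with learning tasks that have fixed scenarios and clear objectives. They have begun to attempt more challenging tasks, such as exploratory learning and multi-task collaborative learning in open environments and complex scenes, and to validate them in settings such as autonomous driving, robotics, large-scale system optimization, and big data modeling. To meet these challenges, learning mechanisms that are more flexible, more robust, more autonomous, and self-evolving need to be proposed according to the complexity of the task being modeled.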
Mobile phone camera usage rate: 2010, 6%; 2012, 82%; 2015, 100%
High dimensionality of big data
Curse of dimensionality
High dimensionality of big data
Zhai, Ong, Tsang. The emerging “Big dimensionality”. IEEE CI Magazine, 2014
Why feature selection?
The evolution (rise) of feature dimensionality in correlation matrices: (a) Diabetes (8 features); (b) Lung Cancer (56 features); (c) Psoriasis (529,651 features)
Why feature selection?
Storage burden
Computational complexity
Model generalization ability
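Why feature selection?
As the dimensionality of the feature space grows, the number of model parameters increases and the model becomes harder to solve. This easily leads to overfitting, which in turn degrades the model's generalization performance.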
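The effect of growing dimensionality on overfitting can be illustrated with a small simulation. This is a minimal sketch and not part of the original slides: the function name `fit_and_score`, the synthetic linear data, and all parameter values are assumptions chosen for illustration. It fits ordinary least squares with a fixed number of training samples and reports training and test error for a given number of features.

```python
import numpy as np


def fit_and_score(n_features, n_train=50, n_test=1000, n_informative=5, seed=0):
    """Fit least squares on synthetic data; return (train_mse, test_mse).

    Only the first `n_informative` features carry signal; the rest are noise.
    All names and values here are hypothetical and chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    w_true = np.zeros(n_features)
    w_true[:n_informative] = 1.0

    def make(n):
        X = rng.standard_normal((n, n_features))
        y = X @ w_true + 0.5 * rng.standard_normal(n)
        return X, y

    X_tr, y_tr = make(n_train)
    X_te, y_te = make(n_test)
    # lstsq returns the minimum-norm solution when n_features > n_train.
    w_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    train_mse = np.mean((X_tr @ w_hat - y_tr) ** 2)
    test_mse = np.mean((X_te @ w_hat - y_te) ** 2)
    return float(train_mse), float(test_mse)


if __name__ == "__main__":
    for d in (5, 200):
        print(d, fit_and_score(d))
```

With more features than training samples, the model can fit the training set almost exactly while its test error grows, which is the overfitting behavior described above.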
Sample sparsity: some regions of the feature space contain almost no samples
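Sample sparsity can be made concrete by counting how much of a gridded feature space is actually occupied. The sketch below is an illustration only: the function name `occupied_fraction`, the uniform data, and the two-bins-per-axis grid are assumptions, not something taken from the slides.

```python
import numpy as np


def occupied_fraction(n_samples, dim, bins_per_axis=2, seed=0):
    """Fraction of grid cells in [0, 1)^dim that contain at least one sample."""
    rng = np.random.default_rng(seed)
    samples = rng.random((n_samples, dim))
    # Map every coordinate to the index of the bin it falls into.
    cells = (samples * bins_per_axis).astype(int)
    n_occupied = len({tuple(row) for row in cells})
    return n_occupied / float(bins_per_axis) ** dim


if __name__ == "__main__":
    for dim in (2, 10, 20):
        print(dim, occupied_fraction(1000, dim))
```

The number of cells grows exponentially with the dimension, so a fixed sample budget covers a vanishing fraction of the space.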
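Why feature selection?
Concentration of measure in high-dimensional feature spaces
In a high-dimensional data space, the distances from a sample point to its nearest and farthest neighbors tend to become equal, which degrades the performance of machine learning algorithms that rely on distance metrics. This phenomenon is usually called "concentration of measure" and was first introduced by Milman when describing high-dimensional probability distributions.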
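The concentration effect can be observed numerically. The following is a minimal sketch under stated assumptions: points are drawn uniformly from the unit hypercube, Euclidean distance is used, and the helper names `relative_contrast` and `mean_contrast` are hypothetical. It measures the relative contrast (d_max - d_min) / d_min between the farthest and nearest neighbor of a random query point.

```python
import numpy as np


def relative_contrast(n_points, dim, rng):
    """(d_max - d_min) / d_min for one random query against random points."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


def mean_contrast(dim, n_points=500, n_trials=20, seed=0):
    """Average relative contrast over several random queries."""
    rng = np.random.default_rng(seed)
    values = [relative_contrast(n_points, dim, rng) for _ in range(n_trials)]
    return float(np.mean(values))


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(dim, round(mean_contrast(dim), 3))
```

As the dimension increases, the contrast shrinks toward zero, so nearest and farthest neighbors become hard to tell apart by distance alone.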
Complex data environments
Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., & Tang, J., et al. (2016). Feature selection: a data perspective.
Multimodal heterogeneous information
Complex data environments: multimodal data
Complex data environments: structured data
undirected graph structure
Ye J, Liu J. Sparse methods for biomedical data[J]. ACM SIGKDD Explorations Newsletter, 2012, 14(1): 4-15.
Tree group lasso
Structured features
Complex data environments: structured data
Jun Liu and Jieping Ye. Moreau-Yosida regularization for grouped tree structure learning. NIPS 2010
Structured features
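Group-structured penalties such as the group lasso select or discard features one group at a time. The sketch below shows block soft-thresholding, the proximal operator of the group lasso penalty for non-overlapping groups. It is a simplified illustration, not the tree-structured algorithm of the cited paper: the function name `group_soft_threshold` and the example groups are assumptions.

```python
import numpy as np


def group_soft_threshold(w, groups, lam):
    """Proximal operator of lam * sum_g ||w_g||_2 for non-overlapping groups.

    Each group is shrunk toward zero as a block; a group whose norm is
    at most `lam` is set to zero entirely.
    """
    w = np.asarray(w, dtype=float)
    out = np.zeros_like(w)
    for idx in groups:
        idx = np.asarray(idx)
        norm = np.linalg.norm(w[idx])
        if norm > lam:
            out[idx] = (1.0 - lam / norm) * w[idx]
    return out


if __name__ == "__main__":
    w = np.array([3.0, 4.0, 0.1, 0.1])
    print(group_soft_threshold(w, groups=[[0, 1], [2, 3]], lam=1.0))
```

In the example, the first group has norm 5 and is scaled by 0.8, while the second group has norm below the threshold and is removed as a whole.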
Complex data environments: structured data
J. Tang and H. Liu. Feature selection with linked data in social media. In SDM, 2012.
Twitter (tweets linked through hyperlinks)
Facebook (people connected by Friendships)
Biological networks (protein interaction networks)
Structured samples
Complex data environments: structured data
A part of the semantic hierarchy of Corel 5k
Structured labels
Wu B, Lyu S, Ghanem B. ML-MG: Multi-label Learning with Missing Labels Using a Mixed Graph[C]// IEEE International Conference on Computer Vision. IEEE, 2015.
Complex data environments: missing data
Image annotation
AU recognition
Missing labels
Sun Y Y, Zhang Y, Zhou Z H. Multi-Label Learning with Weak Label[C]// Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 2010.
Complex data environments: missing data
Recommendation system
Multi-view clustering
Missing views
Handong Zhao, Hongfu Liu, and Yun Fu. Incomplete Multimodal Visual Data Grouping. International Joint Conference on Artificial Intelligence (IJCAI), 2016.
Complex data environments: noise
Xiangyong Cao, Qian Zhao, Deyu Meng, Yang Chen, Zongben Xu. Robust Low-rank Matrix Factorization under
General Mixture Noise Distributions, IEEE Transactions on Image Processing, 2016.
Images from the Yale Face Database with different noises
Attribute noise
Complex data environments: noise
Label noise
Errors from manual annotation or automatic machine annotation
Tongliang Liu, Dacheng Tao: Classification with Noisy Labels by Importance Reweighting. IEEE Trans. Pattern Anal. Mach.
Intell. 38(3): 447-461 (2016)
Misdiagnosis rates in medical diagnosis
Complex data environments: streaming data
Streaming feature selection: new features arrive
Hao Huang, Shinjae Yoo, and S Kasiviswanathan. Unsupervised feature selection on data streams. In Proceedings
of the 24th ACM International on Conference on Information and Knowledge Management, pages 1031–1040. ACM,
2015.
Complex data environments: streaming data
Streaming feature selection: new samples arrive
Jing Wang, Meng Wang, Peipei Li, Luoqi Liu, Zhongqiu Zhao, Xuegang Hu, and Xindong Wu. Online feature
selection with group structure analysis. IEEE Transactions on Knowledge and Data Engineering, 27(11):3029–3041,
2015.
New users
Challenges of feature selection
Storage burden, computational complexity, generalization ability
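Research progress: unsupervised
• A key issue in unsupervised feature selection is how to generate pseudo class labels so that the unsupervised problem can be converted into a supervised one;
• The manifold structure of the data, sample similarity, sample distribution, and feature self-similarity are important ingredients for building embedded unsupervised feature selection algorithms;
• Current work on unsupervised feature selection is validated mainly on existing benchmark datasets and does not address feature selection for ultra-high-dimensional data.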
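The pseudo-label idea from the first bullet can be sketched as a two-step pipeline: cluster the samples to obtain pseudo class labels, then rank features with a supervised criterion. This is a generic illustration rather than any specific published method. The use of k-means and the Fisher score, and the names `kmeans_labels`, `fisher_scores`, `pseudo_label_feature_ranking`, and `make_toy_data`, are assumptions made for the example.

```python
import numpy as np


def kmeans_labels(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per sample."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def fisher_scores(X, labels):
    """Between-class over within-class scatter, computed per feature."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return between / (within + 1e-12)


def pseudo_label_feature_ranking(X, k):
    """Rank features (best first) using cluster assignments as pseudo labels."""
    labels = kmeans_labels(X, k)
    return np.argsort(-fisher_scores(X, labels))


def make_toy_data(seed=0):
    """Hypothetical data: two clusters separated only in features 0 and 1."""
    rng = np.random.default_rng(seed)
    true_labels = np.repeat([0, 1], 100)
    X = rng.standard_normal((200, 10))
    X[:, :2] += np.where(true_labels == 1, 4.0, -4.0)[:, None]
    return X


if __name__ == "__main__":
    print(pseudo_label_feature_ranking(make_toy_data(), k=2))
```

The quality of the ranking depends on how well the clusters reflect the true classes, which is why the generation of pseudo labels is singled out as a key issue.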
Regularized self-representation (RSR)
Research progress: unsupervised
A feature can be represented by a linear combination of other features
For all the features
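The self-representation idea can be sketched in code. The following is a minimal sketch, assuming the RSR objective takes the form min_W ||X - XW||_{2,1} + lam * ||W||_{2,1}, where X is the samples-by-features data matrix and the row norms of W serve as feature weights. The objective form, the iteratively reweighted least squares (IRLS) solver, the function name `rsr_feature_scores`, and all parameter values are assumptions for illustration and are not taken from the slides.

```python
import numpy as np


def rsr_feature_scores(X, lam=1.0, n_iter=30, eps=1e-8):
    """IRLS sketch for min_W ||X - XW||_{2,1} + lam * ||W||_{2,1}.

    X has shape (n_samples, n_features). Returns one score per feature:
    the l2 norm of the corresponding row of W.
    """
    n_features = X.shape[1]
    gram = X.T @ X
    # Ridge-style starting point.
    W = np.linalg.solve(gram + lam * np.eye(n_features), gram)
    for _ in range(n_iter):
        resid_norms = np.linalg.norm(X - X @ W, axis=1)  # one per sample
        row_norms = np.linalg.norm(W, axis=1)            # one per feature
        # Reweighting terms; eps guards against division by zero.
        g_left = 1.0 / (2.0 * np.maximum(resid_norms, eps))
        g_right = 1.0 / (2.0 * np.maximum(row_norms, eps))
        weighted_gram = X.T @ (g_left[:, None] * X)
        W = np.linalg.solve(weighted_gram + lam * np.diag(g_right), weighted_gram)
    return np.linalg.norm(W, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((60, 8))
    scores = rsr_feature_scores(X, lam=1.0)
    print(np.argsort(-scores))  # features ordered from highest to lowest score
```

Features whose rows in W have large norms contribute most to reconstructing the other features and would be kept first; `lam` controls how many rows are driven toward zero.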
Pengfei Zhu, Wangmeng Zuo, Lei Zhang, Qinghua Hu, Simon C.K. Shiu. Unsupervised feature selection by regularized self-representation. Pattern Recognition, 2015.
Regularized self-representation (RSR)
Research progress: unsupervised
Zhu P, Hu Q, Zhang L, et al. A Discriminative Self-representation induced Classifier[C]// IJCAI. 2016.
Research progress: unsupervised
Coupled Dictionary Learning
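Analysis dictionary and synthesis dictionary
Predefined: fast
Learned: local structure of images
Analysis-synthesis dictionary pair learning
Analysis dictionary and synthesis dictionary
Feature selection using the analysis dictionary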
Zhu P, Hu Q, Zhang C, et al. Coupled Dictionary Learning for Unsupervised Feature Selection[C]// AAAI. 2016.
Research progress: unsupervised
Subspace clustering guided Unsupervised Feature Selection
Pengfei Zhu, Wencheng Zhu, Qinghua Hu, Changqing Zhang. Subspace Clustering guided Unsupervised Feature Selection.
SCUFS
[Diagram comparing SCUFS with existing models; labels: S, F, W, similarity matrix, X]
Sample self-representation better reveals the relationships between samples
Multi-view feature selection
Research progress: multi-view
Lei Zhao, Qinghua Hu, Wenwu Wang. Heterogeneous Feature Selection with Multi-Modal Deep Neural Networks and Sparse Group Lasso. TMM, 2015.
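Reflections and discussion
• Data modeling in complex and open environments: noise, missing data, multi-source heterogeneous data, and so on;
• "Can old bottles hold new wine?": how do traditional models generalize to complex environments?
For example: feature selection under noise and missing data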
• Curse of dimensionality vs Blessing of dimensionality
Are Deep Networks a Solution to Curse of Dimensionality? (Professor Stéphane Mallat)
Blessing of Dimensionality: High Dimensional Feature and Its Efficient Compression for Face Verification (Jian Sun)
Reflections and discussion
Deep learning in complex data environments
Low-quality data
Z. Wang, S. Chang, Y. Yang, D. Liu and T. Huang, "Studying Very Low Resolution Recognition Using Deep
Networks", In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
feature enhancement and recognition simultaneously