The course provides an introduction to data mining and knowledge discovery from data. The key data mining methods of clustering, classification and prediction are illustrated, together with practical tools for their execution. Next, we focus on particular aspects of Big Data, such as high volume, high dimensionality and high frequency, and incorporate tools built to handle such data (dimensionality reduction, incremental clustering) into data mining methodologies. Finally, the key methods for Big Data sensing and acquisition are discussed, together with the basics of applications in social media mining, text mining and biomedicine. We conclude with an introduction to Big Data visualization.

Syllabus:
1. Data mining and the knowledge discovery process. Overview of data mining and machine learning techniques. Exemplar studies in clustering, classification and pattern mining.
2. Clustering. Taxonomy of clustering concepts: distance-based (separation, centroids, contiguity), density-based, partitional vs. hierarchical. Methods for centroid-based clustering (k-means), hierarchical clustering (agglomerative and divisive) and density-based clustering (DBSCAN).
3. Classification and prediction models. Model learning and model validation. Explanation vs. prediction. Rule-based classifiers and decision trees. Naïve Bayes classifiers. Basic machine learning models (k-nearest neighbors, linear discriminant analysis, support vector machines, ensemble methods).
4. Dimensionality reduction in Big Data (PCA, random projection, parallelized methods).
5. Pattern mining and association rules. The Apriori principle. Mining frequent patterns and high-confidence rules. Interestingness measures for patterns and rules.
6. Big Data and social sensing. Big Data acquisition. Web scraping, crawling, crowdsourcing, crowdsensing. Big Data technologies and platforms.
7. Social media mining and text mining. Listening to social media sources. Monitoring social trends. Basics of opinion mining and sentiment analysis. Recommender systems.
8. Applications in biomedicine. Population genomics, DNA sequence data mining.
9. Data visualization and visual analytics. Basics of visual representation of data: hierarchies, networks, maps, time series, spatio-temporal data, text. Exemplar case studies.
Assessment: written examination at the end of the semester, plus optional tasks.