Apache Spark + PySpark standalone installation on Ubuntu 14.04

Installing the prerequisites

Java installation

Install Java using the following instructions:

$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer

To check if Java has been installed correctly, run

$ java -version

You should be getting a similar output

Scala installation

Run the following commands to install Scala.

$ wget …

Evaluating term and document similarity using Latent Semantic Analysis

Document retrieval and keyword extraction are two very common tasks involved in text mining. There is a plethora of techniques that can be used for each of these tasks. In this post, we will be discussing a method that can be used for both. Actually, we will be doing document retrieval and keyword "expansion". The …

Implementing Misclassification Costs with Theano

Misclassification costs are a common way to handle imbalanced datasets. For a binary classification problem, they are equivalent to setting different priors for the classes. The option to manually set priors is available in many model implementations in R and scikit-learn. However, for a multi-class classification problem with more than …

Structured Prediction using Conditional Random Fields

In this tutorial, I will explain the concept of structured learning/prediction and the use of Conditional Random Fields (CRFs) for achieving it. Before we start discussing CRFs, it's essential that we understand what structured prediction is and why we need it.

What is Structured Learning?

At present, neural networks and their numerous …