II. Background and Related Work

Technology has made many aspects of life easier. One area where its wide adoption has been especially useful is communication: sharing opinions, thoughts, and beliefs, and gathering feedback on a large scale. Social networks are one example. One day, even the election of lawmakers or parliamentarians may be held online, with vote counting and analysis handled entirely by software. To develop successful software projects, we should adhere to good development practices.
This also helps when the code grows larger and becomes difficult to manage. One step that should always be taken before starting to code is to build a general model of how the different components interact. Based on that model, we can then decide which technology to use in each case and whether the chosen technologies are compatible with each other.

2.1. Modularity

Modularity is a very important concept in software engineering. The idea behind it is to split the problem into a series of self-contained modules. In practice, it is advised that a module should not exceed 100 or so lines and should preferably be short enough to fit on a single page (Abela, 2014). Some of the advantages of modularity are:
• Some modules may be defined by standard procedures which are used and reused in different programs or parts of the same program.
• A module is small enough to be understandable as a unit of code, and is therefore also easy to debug.
• Program maintenance is easier.
• Several programmers may work on different modules concurrently.
• Modules can be tested independently.
• Large projects become easier to monitor and control.

Throughout this thesis, our approach was to keep modularity in mind. This means that parts of this work can be easily adapted for applications other than the one presented here. Conversely, the proposed machine learning techniques can be used for tasks other than detecting hate speech, for example in applications that involve text classification or categorization. Modular programming is closely linked to structured programming and object-oriented programming; all share the goal of facilitating the construction of software programs and information systems by dividing them into smaller pieces, which makes them easier to extend.
All the application code on the user side is written following an object-oriented approach. The code is refactored and organized so that it can be easily reused to accomplish the same type of tasks in other applications.

2.2. Developing for mobile users

Choosing the right platform for the first deployment of an application is sometimes a challenging task. Many things should be considered, such as the type of application, look and feel, ease of user interaction, changing content, network speed, and optimization. As an article in Wired magazine puts it: “The world of mobile is significantly different than that of the desktop computer because mobile is all about context and that context is constantly changing.” We chose to deploy the application for mobile users first because of the nature and content of the application. Since it is a forum-like application, the focus is on users sharing opinions and giving feedback, so availability and mobility will always be important.

As an integral part of the development process, mobile UI design is also very important in the creation of mobile apps. Mobile UI design should consider constraints, contexts, screen, input, and mobility as outlines for design.
User input allows the user to manipulate a system, and the device’s output allows the system to indicate the effects of the user’s manipulation. Mobile UI design constraints include limited attention and form factors, such as a mobile device’s screen size relative to a user’s hand(s).

Mobile UIs, or front-ends, depend on mobile back-ends to support access to enterprise systems. The mobile back-end makes possible data routing, authentication, authorization, working offline, and service management. These functionalities are supported by a combination of middleware components including mobile application servers, mobile backend as a service (MBaaS) (Monroe, 2013), and service-oriented architecture (SOA) infrastructure.

Web and mobile apps require a similar set of features on the backend, including push notifications, integration with social networks, and cloud storage. Each of these services has its own API which must be integrated into an app individually, a process that can be time-consuming for app developers (Lane, 2013). BaaS providers act as a bridge between the frontend of an application and cloud-based backends using a unified API and SDK.
As our backend provider, we have chosen the Firebase platform. Firebase is a mobile and web application development platform made up of complementary features that developers can mix and match to fit their needs.
Since its acquisition by Google in 2014, the platform has added support for a significant number of features, such as Firebase Analytics, Firebase Cloud Messaging, Firebase Auth, Realtime Database, Firebase Storage, Firebase Hosting, Firebase Test Lab for Android, Firebase Crash Reporting, Firebase Notifications, Firebase App Indexing, Firebase Dynamic Links, Firebase Invites, and Firebase Remote Config.

Front-end development tools, on the other hand, focus on the user interface and user experience (UI/UX) and provide the following abilities:
• UI design tools
• SDKs to access device features
• Cross-platform accommodations/support

Development Environment

As the platform for developing the front-end, also known as the client-side application, we have chosen Android Studio. Android Studio is the official integrated development environment (IDE) for the Android platform. Based on JetBrains’ IntelliJ IDEA software, Android Studio is designed specifically for Android development. At the time of writing, the current stable version provides the following features:
• Gradle-based build support
• Android-specific refactoring and quick fixes
• Lint tools to catch performance, usability, version compatibility and other problems
• ProGuard integration and app-signing capabilities
• Template-based wizards to create common Android designs and components
• A rich layout editor that allows users to drag and drop UI components, with an option to preview layouts on multiple screen configurations
• Support for building Android Wear apps
• Built-in support for Google Cloud Platform, enabling integration with Firebase Cloud Messaging (formerly ‘Google Cloud Messaging’) and Google App Engine
• Android Virtual Device (emulator) to run and debug apps

2.3. Text classification, Text Mining and Machine Learning algorithms

The problem of text mining has gained increasing attention in recent years because of the huge amounts of text data created in a variety of fields, such as social networks, the World Wide Web, enterprise business intelligence, scientific discovery (especially the life sciences), natural language processing, sentiment analysis, national security/intelligence, and other information-centric applications. Text mining is often combined with machine learning techniques to automate tasks and decision making efficiently.
In our application, text classification is an important part because we want the system to discard text that contains offensive words or hate speech. To do that, the system must distinguish what is hate speech from what is not. In our search for the best candidate tool or algorithm, we identified three ways this can be done. One way is to build a rule-based algorithm with if-else conditions that check whether a word or string is contained in some pre-collected files. If it is, we classify the text as offensive.
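The rule-based check described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the `BLOCKLIST` set is a hypothetical stand-in for the pre-collected word files, and the names `tokenize` and `is_offensive` are assumptions introduced for the example.

```python
import re

# Hypothetical stand-in for the pre-collected word files mentioned in the text.
BLOCKLIST = {"idiot", "stupid"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def is_offensive(text, blocklist=BLOCKLIST):
    """Return True if any token of the text appears in the blocklist."""
    return any(token in blocklist for token in tokenize(text))
```

A post is flagged only when one of its words matches the list exactly, so a misspelled or previously unseen offensive word passes through unnoticed.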
A rule-based filter of this kind is unlikely to work well, since we cannot predict ahead of time what will count as offensive or hateful in a given context, or what the offending words will be. Moreover, it is computationally costly, since every word of every submitted post must be checked against every file. A second way is to control manually which questions or polls are displayed, by building an administrator portal and letting the administrator decide which text conforms to the ethical rules. This is feasible in the beginning, when the number of users may be low, but it becomes unmanageable as the number of users and posts increases. The third way is to turn to machine learning techniques.
This is a more automatic, or semi-automatic, way of classifying. In machine learning based approaches, classification rules and equations are derived automatically from labeled sample documents. Such approaches typically have much higher recall but lower precision than rule-based approaches. We can train a machine learning algorithm to learn what should be classified as offensive language and then let it respond to the user’s submitted content based on the pre-built model. Of course, it will require some attention from an administrator from time to time, but it scales much better than the other two options. We will discuss this solution further in Chapter IV, but first let us look at some related work.

Text mining, often referred to as text data mining (and closely related to text analytics), is the process of extracting high-quality information from text data. High-quality information is typically derived by identifying patterns and trends through means such as statistical pattern learning.
Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others), deriving patterns within the structured data, and finally evaluating the output. In the practical part of our study, this mining process amounts to a classification problem. In general, automatic text classification comes in three flavors: pattern matching, algorithms, and neural nets. Some of the most used techniques in automatic text classification are:
• Naive Bayes classifier
• Simple tf–idf
• Instantaneously trained neural networks
• Latent semantic indexing
• Support vector machines (SVM)
• Artificial neural network
• K-nearest neighbor algorithms
• Decision trees such as ID3 or C4.5

The selection depends on the individual application, but three techniques from the list are most noted for producing significant results in text classification: Naïve Bayes, SVM, and artificial neural networks.
As data scientist Roman Trusov said in an article: “It’s impossible to define the best text classifier. In fields such as computer vision, there is a strong consensus about a general way of designing models – deep networks with lots of residual connections. Unlike that, text classification is still far from convergence on some narrow area.”

Although algorithmic approaches such as Naïve Bayes and SVM are used successfully for text classification or categorization, newer neural network approaches have been reported to be more effective (Sebastiani, 2014), (Taeho, 2015), (Khosrow, 2012). Another consideration is the type of data to be classified. In our case, the data is text and is not well structured. For example, while the algorithmic approach using Naïve Bayes has been shown to be very effective, it has some drawbacks in this case:
• The algorithm generates a score rather than a probability. We want a probability so that we can ignore predictions below some threshold.
• The algorithm ‘learns’ from examples of what is in a class, but not what isn’t. Learning patterns of what does not belong to a class is very important in many cases.
• Classes with disproportionately large training sets can create distorted classification scores, forcing the algorithm to adjust scores relative to class size.

So, during the preliminary work for this thesis, we decided that the best solution was to apply artificial neural networks to our text classification problem.
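The thresholding requirement from the first drawback can be illustrated with a short sketch. It assumes, purely for illustration, that the classifier emits one raw score per class and that the scores are normalized with a softmax; the function names, the labels, and the threshold value of 0.7 are hypothetical and not taken from this work.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores, labels, threshold=0.7):
    """Return the most probable label, or None if it is below the threshold."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else None
```

With probabilities instead of unbounded scores, a low-confidence prediction can be discarded (here by returning `None`) or handed to an administrator for review.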
Artificial Neural Networks for Text Classification

The underlying idea behind a neural network is to simulate (copy in a simplified but reasonably faithful way) many interconnected brain cells (neurons) inside a system, so that it can learn things, recognize patterns, and make decisions in a humanlike way. Neural networks are typically organized in layers. Layers are made up of a number of interconnected ‘nodes’, each of which contains an ‘activation function’. Samples are presented to the network via the ‘input layer’, which communicates with one or more ‘hidden layers’ where the actual processing is done.
This processing is done through a system of weighted ‘connections’. The hidden layers then link to an ‘output layer’, where we get the answers, as shown in Figure 1. Most ANNs operate based on some form of ‘learning rule’, which modifies the weights of the connections according to the input patterns that the network is presented with.
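The flow from input layer through a hidden layer to the output layer can be sketched as a forward pass. This is a minimal sketch under stated assumptions: a sigmoid activation function, a single hidden layer, and arbitrary illustrative weights. None of these choices are taken from the network used in this work, and the learning rule that would adjust the weights is omitted.

```python
import math


def sigmoid(x):
    """Sigmoid activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """Compute one layer: each node applies the activation to a weighted sum."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Propagate inputs through one hidden layer and then the output layer."""
    hidden = layer_forward(inputs, hidden_w, hidden_b)
    return layer_forward(hidden, output_w, output_b)


# Arbitrary illustrative weights: 2 inputs -> 2 hidden nodes -> 1 output node.
hidden_w = [[0.5, -0.5], [0.3, 0.8]]
hidden_b = [0.0, 0.0]
output_w = [[1.0, -1.0]]
output_b = [0.0]
output = forward([1.0, 0.0], hidden_w, hidden_b, output_w, output_b)
```

Each inner list of weights holds the connection weights feeding one node, so adding a node to a layer means adding one more list and one more bias.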
In a certain way, ANNs learn by example, as do their biological counterparts.

Figure 1. A simple Artificial Neural Network graphical model

Neural network based methods have made great progress on a variety of natural language processing tasks. The primary role of neural models in text classification is to represent variable-length text as a fixed-length vector.
These models generally consist of a projection layer that maps words, subword units, or n-grams to vector representations, which are then combined using different neural network architectures. There are several kinds of models for text, such as the Neural Bag-of-Words (NBOW) model, recurrent neural networks (Pengfei et al., 2016; Chung et al., 2014), recursive neural networks (Socher et al., 2012; 2013), and convolutional neural networks (Kim, 2014; Kalchbrenner et al., 2014). These models take as input the embeddings of the words in the text sequence and summarize its meaning with a fixed-length vector representation.

An interesting paper in which neural networks are used for a similar task, namely text summarization, reports high accuracy for the neural network approach (Khosrow, 2012). In that paper, a neural network is trained to learn the characteristics of sentences that should be included in the summary of an article. Then, the neural network is modified to generalize and combine the relevant characteristics apparent in summary sentences.
In the end, the modified network is used as a filter to summarize news articles.

Text classification is a classical problem. However, recent advances in machine learning algorithms (and in hardware processing speed, of course) have made it a very active research field in both academia and industry. Advances continue to be made, such as Hierarchical Attention Networks (Yang et al., 2016), topic labeling (Wang and Manning, 2012), sentiment classification (Maas et al., 2011; Pang and Lee, 2008), and spam detection (Sahami et al., 1998). Currently, text classification is widely used in sentiment analysis (classification of IMDB and Yelp reviews), stock market applications, Google’s Smart Reply for email, and so on. In addition, the field depends on the natural language under study because it is tightly linked to Natural Language Processing.

More recent approaches use deep learning, such as convolutional neural networks (Blunsom et al., 2014) and recurrent neural networks based on long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997), to learn text representations.
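The idea discussed earlier in this section, of representing variable-length text as a fixed-length vector, can be sketched in its simplest Neural Bag-of-Words form: look up an embedding for each word and average them. The tiny `EMBEDDINGS` table below is hypothetical and hand-written for illustration; in a real model the embeddings are learned, and the averaged vector is passed on to further layers.

```python
# Hypothetical 3-dimensional embeddings; real embeddings are learned from data.
EMBEDDINGS = {
    "good": [1.0, 0.0, 0.0],
    "bad": [0.0, 1.0, 0.0],
    "movie": [0.0, 0.0, 1.0],
}
DIM = 3


def nbow_vector(tokens, embeddings=EMBEDDINGS, dim=DIM):
    """Average the embeddings of known tokens into one fixed-length vector."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to the zero vector
    return [sum(component) / len(vectors) for component in zip(*vectors)]
```

Whatever the length of the input, the result always has `DIM` components, which is what allows a fixed-size classifier to be applied to it. Word order is lost in the averaging, which is one motivation for the recurrent and convolutional models cited above.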