Browsing by Subject "Ranking"

Open Access
Can who-edits-what predict edit survival?
(ACM, 2018-08) Yardım, Ali Batuhan; Maystre, L.; Kristof, V.; Grossglauser, M.
As the number of contributors to online peer-production systems grows, it becomes increasingly important to predict whether the edits that users make will eventually be beneficial to the project. Existing solutions either rely on a user reputation system or consist of a highly specialized predictor that is tailored to a specific peer-production system. In this work, we explore a different point in the solution space that goes beyond user reputation but does not involve any content-based feature of the edits. We view each edit as a game between the editor and the component of the project. We posit that the probability that an edit is accepted is a function of the editor's skill, of the difficulty of editing the component and of a user-component interaction term. Our model is broadly applicable, as it only requires observing data about who makes an edit, what the edit affects and whether the edit survives or not. We apply our model on Wikipedia and the Linux kernel, two examples of large-scale peer-production systems, and we seek to understand whether it can effectively predict edit survival: in both cases, we provide a positive answer. Our approach significantly outperforms those based solely on user reputation and bridges the gap with specialized predictors that use content-based features. It is simple to implement, computationally inexpensive, and in addition it enables us to discover interesting structure in the data.
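The abstract describes the acceptance probability as a function of editor skill, component difficulty and a user-component interaction term, without giving the functional form. A minimal sketch, assuming a logistic link and a dot product of latent vectors for the interaction (both are illustrative assumptions, as are all names below):

```python
import math

def edit_survival_probability(skill, difficulty, user_vec, item_vec, bias=0.0):
    """Hypothetical logistic form: skill raises, difficulty lowers the probability."""
    # Interaction term, assumed here to be a dot product of latent vectors.
    interaction = sum(u * v for u, v in zip(user_vec, item_vec))
    logit = skill - difficulty + interaction + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Skill and difficulty cancel out; no interaction.
p_plain = edit_survival_probability(1.0, 1.0, [0.0, 0.0], [0.0, 0.0])
# Same skill and difficulty, but the editor's latent vector fits this component.
p_match = edit_survival_probability(1.0, 1.0, [0.5, -0.2], [1.0, 0.3])
```

With this form, a positive interaction lifts the probability above what skill and difficulty alone would give, which is the role the abstract assigns to the interaction term.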
Open Access
Estimating the chance of success and suggestion for treatment in IVF
(2013) Mısırlı, Gizem
In medicine, the chance of success of a treatment is important for decision making by both the doctor and the patient. This thesis focuses on the domain of In Vitro Fertilization (IVF), where there are two issues: the first is the decision on whether or not to go ahead with the treatment procedure; the second is the selection of the proper treatment protocol for the patient. It is important for both the doctor and the couple to have some idea of the chance of success of the treatment after the initial evaluation. If the chance of success is low, the couple may decide not to proceed with this stressful and expensive treatment. Once a decision for treatment is made, the next issue for the doctors is the choice of the treatment protocol that is most suitable for the couple. Our first aim is to develop techniques to estimate the chance of success and determine the factors that affect success in IVF treatment. To this end, we employ ranking algorithms to estimate the chance of success. The ranking methods used are RIMARC (Ranking Instances by Maximizing the Area under the ROC Curve), SVMlight (Support Vector Machine Ranking Algorithm) and RIkNN (Ranking Instances using k Nearest Neighbour). All three algorithms learn a model to rank the instances based on their score values. RIMARC is a method for ranking instances by maximizing the area under the ROC curve. SVMlight is an implementation of Support Vector Machine for ranking instances. RIkNN is a k Nearest Neighbour (kNN) based algorithm developed for ranking instances based on a similarity metric. We also used RIwkNN, the version of RIkNN in which the features are assigned weights by domain experts. These algorithms are compared on the basis of the AUC obtained with 10-fold stratified cross-validation. Moreover, these ranking algorithms are adapted to work as classification algorithms and compared on the basis of the accuracy obtained with 10-fold stratified cross-validation.
As a by-product, the RIMARC algorithm learns the factors that affect success in IVF treatment. It calculates feature weights and creates rules that are human-readable and easy to interpret. After a decision for treatment is made, the second aim is to determine which treatment protocol is the most suitable for the couple. In IVF treatment, many different types of drugs and dosages are used; however, which drug and dosage are the most suitable for a given patient is not certain. Doctors generally make their decision based on their past experience and the results of published research. To the best of our knowledge, there are no methods for learning a model that can be used to suggest the best feature values to increase the chance that the class label will be the desired one. We refer to such a system as a Suggestion System. To help doctors select suitable treatment protocols, we present three suggestion systems based on well-known machine learning techniques: NSNS (Nearest Successful Neighbour Based Suggestion), kNNS (k Nearest Neighbour Based Suggestion) and DTS (Decision Tree Based Suggestion). We also implemented a weighted version of NSNS using the feature weights produced by the RIMARC algorithm. Moreover, we propose performance metrics for the evaluation of the suggestion algorithms. We introduce four evaluation metrics, namely the pessimistic metric (mp), optimistic metric (mo), validated optimistic metric (mvo) and validated pessimistic metric (mvp), to test the correctness of the algorithms. To help doctors use the developed algorithms, we develop a decision support system called RAST (Risk Analysis and Suggestion for Treatment). This system is actively being used in the IVF center at Etlik Zübeyde Hanım Woman’s Health and Teaching Hospital.
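The idea behind a nearest-successful-neighbour suggestion can be sketched as follows. This is not the thesis's NSNS implementation, whose details the abstract does not give: the Euclidean distance, the field names (`age`, `bmi`, `dose`) and the records are all made up for illustration.

```python
import math

def nearest_successful_neighbour(query, records, treatment_keys):
    """Return the treatment values of the closest past case that succeeded."""
    # Distance is computed only over the patient features present in the query.
    patient_keys = [k for k in query if k not in treatment_keys]
    best, best_dist = None, math.inf
    for rec in records:
        if not rec["success"]:
            continue  # only successful cases may serve as a template
        dist = math.sqrt(sum((rec[k] - query[k]) ** 2 for k in patient_keys))
        if dist < best_dist:
            best, best_dist = rec, dist
    if best is None:
        return None  # no successful case to learn from
    return {k: best[k] for k in treatment_keys}

# Entirely hypothetical records; not clinical data.
records = [
    {"age": 30, "bmi": 23.5, "dose": 150, "success": True},
    {"age": 31, "bmi": 24.0, "dose": 300, "success": False},
    {"age": 40, "bmi": 29.0, "dose": 225, "success": True},
]
suggestion = nearest_successful_neighbour({"age": 31, "bmi": 24.0}, records, ["dose"])
```

The unsuccessful record is an exact match for the query but is skipped, so the suggestion comes from the nearest case that succeeded.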
Open Access
Estimating the chance of success in IVF treatment using a ranking algorithm
(Springer, 2015) Güvenir, H. A.; Misirli, G.; Dilbaz, S.; Ozdegirmenci, O.; Demir, B.; Dilbaz, B.
In medicine, estimating the chance of success of a treatment is important in deciding whether to begin the treatment or not. This paper focuses on the domain of in vitro fertilization (IVF), where estimating the outcome of a treatment is crucial in the decision to proceed with treatment for both the clinicians and the infertile couples. IVF treatment is a stressful and costly process for couples who want to have a baby. If an initial evaluation indicates a low pregnancy rate, the couple may decide not to start the IVF treatment. The aim of this study is twofold: first, to develop a technique that can be used to estimate the chance of success for a couple who wants to have a baby, and second, to determine the attributes and their particular values affecting the outcome of IVF treatment. We propose a new technique, called success estimation using a ranking algorithm (SERA), for estimating the success of a treatment using a ranking-based algorithm. The particular ranking algorithm used here is RIMARC. The performance of the new algorithm is compared with two well-known algorithms that assign class probabilities to query instances. The algorithms used in the comparison are Naïve Bayes Classifier and Random Forest. The comparison is done in terms of area under the ROC curve, accuracy and execution time, using tenfold stratified cross-validation. The results indicate that the proposed SERA algorithm has the potential to be used successfully to estimate the probability of success in medical treatment.
Open Access
Incorporating the surfing behavior of web users into PageRank
(ACM, 2013-10-11) Ashyralyyev, Shatlyk; Cambazoğlu, B. B.; Aykanat, Cevdet
In large-scale commercial web search engines, estimating the importance of a web page is a crucial ingredient in ranking web search results. So far, to assess the importance of web pages, two different types of feedback have been taken into account, independent of each other: the feedback obtained from the hyperlink structure among the web pages (e.g., PageRank) or the web browsing patterns of users (e.g., BrowseRank). Unfortunately, both types of feedback have certain drawbacks. While the former lacks the user preferences and is vulnerable to malicious intent, the latter suffers from sparsity and hence low web coverage. In this work, we combine these two types of feedback under a hybrid page ranking model in order to alleviate the above-mentioned drawbacks. Our empirical results indicate that the proposed model leads to better estimation of page importance according to an evaluation metric that relies on user click feedback obtained from web search query logs. We conduct all of our experiments in a realistic setting, using a very large scale web page collection (around 6.5 billion web pages) and web browsing data (around two billion web page visits).
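One simple way to combine link structure with browsing data is to blend the two transition distributions before running a PageRank-style power iteration. The sketch below is an assumption made for illustration, not the hybrid model of the paper: the mixing weight `alpha`, the uniform fallback for pages without data, and the toy graph are all hypothetical.

```python
def row_normalize(weights, pages):
    """Turn raw weights into a probability row; fall back to uniform if empty."""
    total = sum(weights.values())
    if total == 0:
        return {q: 1.0 / len(pages) for q in pages}
    return {q: weights.get(q, 0.0) / total for q in pages}

def hybrid_rank(pages, links, visits, alpha=0.5, damping=0.85, iters=100):
    """Power iteration on a blend of link-based and visit-based transitions."""
    n = len(pages)
    trans = {}
    for p in pages:
        link_row = row_normalize({q: 1.0 for q in links.get(p, [])}, pages)
        visit_row = row_normalize(
            {q: float(c) for q, c in visits.get(p, {}).items()}, pages
        )
        # alpha = 1.0 uses links only; alpha = 0.0 uses observed visits only.
        trans[p] = {q: alpha * link_row[q] + (1 - alpha) * visit_row[q] for q in pages}
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        rank = {
            q: (1 - damping) / n + damping * sum(rank[p] * trans[p][q] for p in pages)
            for q in pages
        }
    return rank

pages = ["a", "b", "c"]
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
# Hypothetical visit counts: users leaving "a" overwhelmingly go to "b".
visits = {"a": {"b": 9, "c": 1}, "b": {"c": 5}, "c": {"a": 5}}
link_only = hybrid_rank(pages, links, visits, alpha=1.0)
hybrid = hybrid_rank(pages, links, visits, alpha=0.5)
```

In this toy graph the link structure treats the two out-links of `a` equally, while the visit counts favour `b`, so `b` scores higher under the blended model than under links alone.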
Open Access
Incorporating the surfing behavior of web users into PageRank
(2013) Ashyralyyev, Shatlyk
One of the most crucial factors that determines the effectiveness of a large-scale commercial web search engine is the ranking (i.e., order) in which web search results are presented to the end user. In modern web search engines, the skeleton for the ranking of web search results is constructed using a combination of the global (i.e., query independent) importance of web pages and their relevance to the given search query. In this thesis, we are concerned with the estimation of global importance of web pages. So far, to estimate the importance of web pages, two different types of data sources have been taken into account, independent of each other: hyperlink structure of the web (e.g., PageRank) or surfing behavior of web users (e.g., BrowseRank). Unfortunately, both types of data sources have certain limitations. The hyperlink structure of the web is not very reliable and is vulnerable to bad intent (e.g., web spam), because hyperlinks can be easily edited by the web content creators. On the other hand, the browsing behavior of web users has limitations such as sparsity and low web coverage. In this thesis, we combine these two types of feedback under a hybrid page importance estimation model in order to alleviate the above-mentioned drawbacks. Our experimental results indicate that the proposed hybrid model leads to better estimation of page importance according to an evaluation metric that uses the user click information obtained from Yahoo! web search engine’s query logs as ground-truth ranking. We conduct all of our experiments in a realistic setting, using a very large scale web page collection (around 6.5 billion web pages) and web browsing data (around two billion web page visits) collected through the Yahoo! toolbar.
Open Access
Ranking instances by maximizing the area under ROC curve
(Institute of Electrical and Electronics Engineers, 2013) Guvenir, H. A.; Kurtcephe, M.
In recent years, the problem of learning a real-valued function that induces a ranking over an instance space has gained importance in machine learning literature. Here, we propose a supervised algorithm that learns a ranking function, called ranking instances by maximizing the area under the ROC curve (RIMARC). Since the area under the ROC curve (AUC) is a widely accepted performance measure for evaluating the quality of ranking, the algorithm aims to maximize the AUC value directly. For a single categorical feature, we show the necessary and sufficient condition that any ranking function must satisfy to achieve the maximum AUC. We also sketch a method to discretize a continuous feature in a way to reach the maximum AUC as well. RIMARC uses a heuristic to extend this maximization to all features of a data set. The ranking function learned by the RIMARC algorithm is in a human-readable form; therefore, it provides valuable information to domain experts for decision making. Performance of RIMARC is evaluated on many real-life data sets by using different state-of-the-art algorithms. Evaluations of the AUC metric show that RIMARC achieves significantly better performance compared to other similar methods.
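To make the link between AUC and ranking on a single categorical feature concrete, the sketch below computes AUC as the fraction of correctly ordered (positive, negative) pairs and scores each category by its observed positive rate. The abstract does not state the condition it proves, so the positive-rate scoring is an assumption chosen for illustration, and the data are made up.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def positive_rate_scores(values, labels):
    """Score each categorical value by the share of positive instances carrying it."""
    counts = {}
    for v, y in zip(values, labels):
        pos, total = counts.get(v, (0, 0))
        counts[v] = (pos + y, total + 1)
    return {v: pos / total for v, (pos, total) in counts.items()}

# Hypothetical single-feature data set with binary labels.
values = ["x", "x", "y", "y", "y", "z"]
labels = [1, 1, 1, 0, 0, 0]

rate = positive_rate_scores(values, labels)
auc_rate = auc([rate[v] for v in values], labels)

# An arbitrary alternative ordering of the same categories, for comparison.
other = {"x": 0.0, "y": 1.0, "z": 2.0}
auc_other = auc([other[v] for v in values], labels)
```

On this data the positive-rate ordering gives an AUC of 8/9, while the arbitrary ordering gives 1/9; instances that share a category necessarily tie, which is why a single categorical feature cannot always reach an AUC of 1.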
Open Access
Risk estimation by maximizing area under receiver operating characteristics curve with application to cardiovascular surgery
(2010) Kurtcephe, Murat
Risks exist in many different domains; medical diagnoses, financial markets, fraud detection and insurance policies are some examples. Various risk measures and risk estimation systems have hitherto been proposed, and this thesis suggests a new risk estimation method. Risk estimation by maximizing the area under a Receiver Operating Characteristics (ROC) curve (REMARC) defines risk estimation as a ranking problem. Since the area under the ROC curve (AUC) is related to measuring the quality of ranking, REMARC aims to maximize the AUC value on a single-feature basis to obtain the best ranking possible on each feature. For a given categorical feature, we prove a sufficient condition that any function must satisfy to achieve the maximum AUC. Continuous features are also discretized by a method that uses AUC as a metric. Then, a heuristic is used to extend this maximization to all features of a dataset. REMARC can handle missing data, binary classes, and continuous and nominal feature values. The REMARC method not only estimates a single risk value, but also analyzes each feature and provides valuable information to domain experts for decision making. The performance of REMARC is evaluated on many datasets from the UCI repository and compared with different state-of-the-art algorithms such as Support Vector Machines, naïve Bayes, decision trees and boosting methods. Evaluations of the AUC metric show that REMARC achieves significantly better predictive performance than other machine learning classification methods and is also faster than most of them. The cardiovascular surgery domain is selected to develop a new risk estimation framework using the REMARC method. The TurkoSCORE project is used to collect data for the training phase of the REMARC algorithm. The predictive performance of REMARC is compared with one of the most popular cardiovascular surgical risk evaluation methods, called EuroSCORE.
EuroSCORE is evaluated on Turkish patients, and it is shown that the EuroSCORE model is insufficient for the Turkish population. Then, the predictive performances of EuroSCORE and of TurkoSCORE, which uses REMARC for prediction, are compared. Empirical evaluations show that REMARC achieves better prediction than EuroSCORE on the Turkish patient population.
Open Access
Self-adaptive randomized and rank-based differential evolution for multimodal problems
(Springer, 2011-01-15) Urfalioglu, O.; Arıkan, Orhan
Differential Evolution (DE) is a widely used successful evolutionary algorithm (EA) based on a population of individuals, which is especially well suited to solve problems that have non-linear, multimodal cost functions. However, for a given population, the set of possible new populations is finite and a proper subset of the cost function domain. Furthermore, the update formula of DE does not use any information about the fitness of the population. This paper presents a novel extension of DE called Randomized and Rank-based Differential Evolution (R2DE) and its self-adaptive version SAR2DE to improve robustness and global convergence speed on multimodal problems by introducing two multiplicative terms in the DE update formula. The first term is based on a random variate of a Cauchy distribution, which leads to a randomization. The second term is based on ranking of individuals, so that R2DE exploits additional information provided by the population fitness. In extensive experiments conducted with a wide range of complexity settings, we show that the proposed heuristics lead to an overall improvement in robustness and speed of convergence compared to several global optimization techniques, including DE, Opposition based Differential Evolution (ODE), DE with Random Scale Factor (DERSF) and the self-adaptive Cauchy distribution based DE (NSDE).
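The abstract does not give the R2DE update formula, so the following is only a rough illustration of the general idea: a classic DE/rand/1 mutation whose difference vector is scaled by two extra multiplicative factors, one Cauchy-distributed and one derived from a fitness rank. The linear rank weight, the choice of which individual's rank is used, and all names are hypothetical.

```python
import math
import random

def cauchy_variate(rng, scale=1.0):
    """Sample a Cauchy variate by inverse-transform sampling."""
    return scale * math.tan(math.pi * (rng.random() - 0.5))

def rank_weight(rank, pop_size):
    """Hypothetical weight: 1.0 for the best individual (rank 0), 1/pop_size for the worst."""
    return (pop_size - rank) / pop_size

def mutate(population, costs, target, rng, f=0.5):
    """DE/rand/1 mutation with assumed Cauchy and rank-based multiplicative factors."""
    order = sorted(range(len(population)), key=lambda i: costs[i])
    rank_of = {idx: r for r, idx in enumerate(order)}
    candidates = [i for i in range(len(population)) if i != target]
    r1, r2, r3 = rng.sample(candidates, 3)
    # Classic DE uses the constant f alone; the two extra factors are assumptions.
    scale = f * cauchy_variate(rng) * rank_weight(rank_of[r2], len(population))
    return [
        a + scale * (b - c)
        for a, b, c in zip(population[r1], population[r2], population[r3])
    ]

rng = random.Random(0)
population = [[0.0, 0.0], [1.0, 2.0], [-1.0, 0.5], [2.0, -1.0], [0.5, 0.5]]
costs = [sum(x * x for x in ind) for ind in population]  # sphere function as a toy cost
mutant = mutate(population, costs, target=0, rng=rng)
```

Because the Cauchy factor is unbounded, mutants can land outside the finite set reachable by classic DE from the same population, which is the limitation the abstract points to.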
Open Access
Unsupervised segmentation and classification of cervical cell images
(Elsevier BV, 2012-12) Gençtav, A.; Aksoy, S.; Önder, S.
The Pap smear test is a manual screening procedure that is used to detect precancerous changes in cervical cells based on color and shape properties of their nuclei and cytoplasms. Automating this procedure is still an open problem due to the complexities of cell structures. In this paper, we propose an unsupervised approach for the segmentation and classification of cervical cells. The segmentation process involves automatic thresholding to separate the cell regions from the background, a multi-scale hierarchical segmentation algorithm to partition these regions based on homogeneity and circularity, and a binary classifier to finalize the separation of nuclei from cytoplasm within the cell regions. Classification is posed as a grouping problem by ranking the cells based on their feature characteristics modeling abnormality degrees. The proposed procedure constructs a tree using hierarchical clustering, then arranges the cells in a linear order by using an optimal leaf ordering algorithm that maximizes the similarity of adjacent leaves without any requirement for training examples or parameter adjustment. Performance evaluation using two data sets shows the effectiveness of the proposed approach in images having inconsistent staining, poor contrast, and overlapping cells.