Speaker: Dr. Sou-Cheng Choi <http://mypages.iit.edu/~schoi32/>, Senior Statistician at NORC at the University of Chicago and Research Assistant Professor in the Department of Applied Mathematics at IIT.

Time: Nov 17, 11:25 am to 12:40 pm.
Location: SB-220. 

Title: Probabilistic Record Linkage and Address Standardization 

Abstract: Probabilistic record linkage (PRL) is the process of matching records from different data sources, such as database tables with missing values in the primary key. It can be applied to join or de-duplicate records, or to impute missing data, resulting in better overall data quality. An important subproblem in PRL is to parse or standardize a text field such as an address into its component fields, e.g., street number, street name, city, state, zip code, and country. Modern data analysis techniques such as natural language processing and machine learning are often employed in both PRL and address standardization to improve the accuracy of linking or prediction. In a recent study, we compared the performance of a few widely used, freely available open-source PRL packages, namely FRIL, Link Plus, R RecordLinkage, and SERF. In addition, we evaluated the baseline performance and sensitivity of a number of address-parsing web services, including the U.S. address parser, Google Maps APIs, Geocoder.us, and Data Science Toolkit. We will present the strengths and limitations of the software and services we evaluated. This is joint work with Yongheng Lin and Edward Mulrow, NORC at the University of Chicago.

