**Robert Tung**

I am a rising sophomore Computer Science and Applied Math double major at Yale University. I've worked on projects including two years of Statistics Research at Texas State and applications like Check, an automated geo-fencing attendance application. My LinkedIn Profile can be found here: https://www.linkedin.com/in/robertrtung.

This summer I will be working with Dr. Byron Gao on a project related to Correlation-based Clustering, a sub-field of Data Mining.

**Week 1:**

Aside from all the introductions and talks this week, we got to start on our research as well. The process began with reading references on clustering in general, as well as the specific paper that included CONCOR, the algorithm most relevant to our work, and many tangential papers and surveys on related clustering topics. I made sure to gain a solid understanding of many of the types of clustering and, for the types of clustering that were related to CONCOR, an understanding of the most widely used algorithms right now. I also noted down parts of other algorithms that were useful, such as supported ways to determine optimal sample sizes and ways to pre-process data, in case this ever came up.

This week, I also came up with ideas on how to improve the efficiency of the CONCOR algorithm, largely through limiting the number of times the product moment function would have to be called in each iteration. I took the time to write out mini-algorithms to implement these changes, and wrote out the algebra to show that they were valid. While doing this, I noted down any characteristics of CONCOR or tangential ideas I had that might come in handy with improving either the efficiency or the optimization of the algorithm.

**Week 2:**

Much of this week was spent in what might be thought of as taking a step back to fully understand and assess the topic. We realized that it would be important to figure out what the algorithm actually optimized and how it converges in order to fully understand how to improve it and what direction the project could take in the upcoming weeks. I spent a large part of the week coding simulations in Java to compare the algorithm to naive solutions of the k-means problem as well as a naive optimization of mean within-group correlations and a few other criteria. These same simulations were then used to determine if various types of normalization had any positive effects on the results, and whether the final cluster centers had any significance. Also, some algebra was done to show an invariance of results under linear transformation, and then we spent some time thinking about the direction of the project in the upcoming weeks. At this point, we planned to use samplings and modifications on the points of comparison to see if that has any beneficial effect.

At the end of the week, a presentation was given to the rest of the REU on the progress of the project so far.

**Week 3:**

This week started out with more exploration about the CONCOR algorithm. So far some of the ideas we had do seem to improve the efficiency of the algorithm, but others do so largely at the expense of the accuracy in optimization. Did a lot more simulation coding and got some numbers for things like how many anchors to compare to points with and how much error that would create. Also put in some thought to figuring out what the algorithm was doing and why it converges the way that it does.

We also got a chance to visit IBM this week. A lot of the things they talked about, whether with their work or with the experience of research in the industry, were pretty interesting and thought-provoking, and we got to see some of the cool things they were working on too.

**Weeks 4 and 5:**

In these two works, I continued my work with Dr. Gao, leading up to the second project presentation. Our project found a specific sort of focus in that we created a generalized version of the original algorithm, applicable to general random variables as well as the clustering application. From this, we made a conjecture about the behavior of the convergence of this algorithm, including all the cases it could go through and some subcases that behaved differently than normal. In addition, we considered other aspects such as the implications of a distance measure that satisfies the triangle inequality, how the algorithm would change with different correlation measures (such as the Spearman Rank Correlation Coefficient), and the effect of different factors on the quality of the resulting clustering (factors such as number of dimensions, amount of data and the sampling method used for the reference points). Finally, snapshots were generated to show the behavior of our algorithm on different datasets, and I began writing the paper for this project.

In addition to this, we got to visit SWRI this week, which was really interesting. It was nice to see a cool career path outside of either academia or working in the industry, and they were working on some really interesting projects.

**Week 6:**

Aside from spending a lot of time working through different proofs that were put into the paper, I also wrote some more simulations to look into factors of our new algorithm. For example, we wanted to see if we could find a way to use only a fixed number of reference points in an intelligent manner, which would increase the efficiency of our algorithm significantly. Since points that are farther away have been shown to be a better pick, we tried using a bounding rectangle on the dataset and taking either the corners or the midpoints of the edges as reference points, though this did not turn out to have as strong of results as we would have hoped. I also looked to see if some other concepts, such as mutual information, may improve our algorithm, and did some tests on the convergence implications of only correlation a subset of the columns. Aside from this, I spent some time generating snapshots and doing a short comparison of our algorithm with K-means.

Finally, we had a visit to Emerson this week. Though it was shorter than the rest, it was still very interesting to see what they were working on, and interesting to learn about the field they were in as a whole.