I really was hoping that everything would go smoothly today with students' use of handin etc. That turned out not to be the case. Worse, I had asked everyone at 2:30 to try saving at that time, so that any trouble spots could be discovered early and quickly remedied. That didn't work out either.
Sigh. I will console myself with the knowledge that everyone turned in real .tar files, i.e. not .zip files renamed to .tar, and that the file structure will be OK. I've not checked yet, but I'm crossing my fingers...
I won't impose penalties on any of you, but if you have ANY suggestions to get students to follow directions, I'd sure appreciate them.
I gather that the exam itself was more difficult than I had thought. I'll post solutions later, but here is an outline:
Recall that I said earlier, "I'll be basing your homework grade on three things: Nick's prescreening assessment; my own reading of your homework submissions; and today's interactive grading." I will be doing that reading in the next couple of days, hopefully finishing tomorrow, at which point I will give you your homework grade.
I may have some questions on your CRAN projects. Please watch your e-mail for this.
Faculty interview talks start tomorrow.
A reminder that our first NLP faculty candidate talk will be tomorrow at 3:10pm in 1131 Kemper Hall:
Dr. Ndapa Nakashole, from Carnegie Mellon University

Title: Knowledge Graph Construction using Machine Reading Methods

Abstract: Knowledge graphs such as NELL, Freebase, and YAGO provide a means to address the knowledge bottleneck in artificial intelligence domains such as natural language processing, computer vision, and robotics. State of the art knowledge graphs have accumulated large amounts of beliefs about real world entities using machine reading methods. Current machine readers have been successful at populating such knowledge graphs by means of pattern detection, a shallow way of machine reading which leverages the redundancy of large corpora to capture language patterns. However, machine readers still lack the ability to fully understand language. In the pursuit of the much harder goal of language comprehension, knowledge graphs present an opportunity for a virtuous circle: the accumulated knowledge can be used to improve machine readers; in turn, advanced reading methods can be used to populate knowledge graphs with beliefs expressed using complex and potentially ambiguous language. In this talk, I will elaborate on this virtuous circle, starting with building knowledge graphs, followed by results on using them for machine reading.

Bio: Ndapa Nakashole is a post-doctoral fellow at Carnegie Mellon University. Her research interests include machine reading, natural language processing, machine learning and data mining. She works with professor Tom Mitchell on using machine learning to build computer systems that intelligently process and understand human language. She received her PhD from Saarland University, Germany.

Keep in mind that this coming Thursday, March 10, we will have our second quiz.
In order for your quiz to be graded, you MUST follow these instructions EXACTLY:
The quiz will cover the statistics half of the course.
In our textbook, I make a big point of the folly of significance testing. I also note that even though everyone knows it has problems, they use it anyway, which is bizarre.
Today the American Statistical Association finally recognized this, which is pretty amazing. By the way, the current ASA president is Jessica Utts, a former UCD professor. Here is the ASA's statement, signed by Jessica and the ASA executive director:
Dear Member,

Today, the American Statistical Association Board of Directors issued a statement on p-values and statistical significance. We intend the statement, developed over many months in consultation with a large panel of experts, to draw renewed and vigorous attention to changing research practices that have contributed to a reproducibility crisis in science.

"Widespread use of 'statistical significance' (generally interpreted as 'p < 0.05') as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process," says the ASA statement (in part).

By putting the authority of the world's largest community of statisticians behind such a statement, we seek to begin a broad-based discussion of how to more effectively and appropriately use statistical methods as part of the scientific reasoning process. In short, we envision a new era, in which the broad scientific community recognizes what statisticians have been advocating for many years. In this "post p < .05 era," the full power of statistical argumentation in all its nuance will be brought to bear to advance science, rather than making decisions simply by reducing complex models and methods to a single number and its relationship to an arbitrary threshold.

This new era would be marked by radical change to how editorial decisions are made regarding what is publishable, removing the temptation to inappropriately hunt for statistical significance as a justification for publication. In such an era, every aspect of the investigative process would have its appropriate weight in the ultimate decision about the value of a research contribution...

This is the first time the ASA has spoken so publicly about a fundamental part of statistical theory and practice. We urge you to share this statement with appropriate colleagues and spread the word via social media...
The summation in Equation (12.60) should go from i = 1 to 3.
We discussed the LASSO in class today. I stated that the procedure is interesting, but is no panacea. You may find a post to my statistics blog of interest in this regard.
I made an Extra Credit offer to my ECS 132 class. The offer applies to you in ECS 256 as well.
The interactive grading went reasonably well today.
I know some students were disappointed with their performance. Though I had said in class and here in the blog that the interactive grading would be "conversational," "like an interview," "like a PhD oral exam--you just talk, and I ask questions based on what you say," I think some students just didn't expect it to work this way. But really everyone did at least OK, some much more than OK; don't worry about it.
I'll be basing your homework grade on three things: Nick's prescreening assessment; my own reading of your homework submissions; and today's interactive grading.
By the way, I will be announcing a pretty major Extra Credit offer, either this evening or tomorrow.
One important thing that came up in today's grading sessions: Most students found it difficult to express themselves on technical issues. On a number of occasions, I said, "Pretend you are a consultant and you are talking to a client. How would you explain what XYZ is?", e.g. the Kolmogorov-Smirnov band that you coded in Homework II. It's not just that they had problems explaining technical things in general, but also that they were not able to put themselves into the shoes of the client, i.e. understand what the client would want to know and how to explain it in terms the client could understand. This clearly is a skill that everyone should strive to perfect. Anybody can achieve excellence in this, even those who are not native speakers of English.
The interactive grading will be in my office, Kemper 3053.
Remember, our interactive grading sessions will be held tomorrow (Leap Day!), from 3 to 6:30. By now, you should have gotten notification from your group leader as to your time slot.
The style will be conversational. I will ask questions of each person in the group, along the lines of "Why this?" and "How that?" As noted in class, each member of the group must be able to answer questions about any problem in the assignment. Both Homework I and Homework II will be covered.
You'll recall that I've said several times, both here and in class, that estimating mixtures is hard, due to convergence problems (including converging to the "wrong" solution), issues with the latency (hiddenness) of some variables, and so on.
In that light, you'll understand why I laughed out loud today when I saw this quote from one of my favorite statisticians, Larry Wasserman of CMU:
I have decided that mixtures, like tequila, are inherently evil and should be avoided at all costs.
He also said that mixtures are the "twilight zone" (i.e. Land of the Weird) of statistics.
This is all reported in this posting to R-Bloggers, an extremely valuable newsletter for anyone who does a lot of data analysis. My own blog is archived there too.
Getting back to that mixture posting, what makes it even more interesting is that the posting cites a paper saying that a better way to estimate mixtures than MLE is the Method of Moments! (But of course, there are problems there too.)
I have decided not to assign Homework III, for the following reasons:
I would have liked to have had a third assignment, but the first two have been quite substantial, and certainly achieve the goals I had.
There are still some groups that have not submitted their time preferences for interactive grading to me. Please do so TODAY.
I wish to reiterate that estimation of mixtures is hard. Some students have mentioned to me that they are getting negative values for the mixing coefficients, for instance, and this is not surprising. (It will probably go away if you take a much larger sample.)
This might be improved somewhat by taking an alternating approach similar to that of the EM algorithm: Treat the mixing coefficients as correct and estimate the distribution family parameters in odd-numbered iterations, and do the opposite in the even-numbered ones. But it is not a panacea.
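To see concretely how a negative mixing-coefficient estimate can arise, here is a toy example of my own (not part of the homework specs): a two-component exponential mixture fit by matching the first moment.

```r
# Toy mixture (my own example, not from the homework):
# p * Exp(mean 1) + (1-p) * Exp(mean 5), with true p = 0.05.
# Matching the first moment: E[X] = p*1 + (1-p)*5, so the
# Method of Moments estimate is
#   p-hat = (5 - xbar) / 4
phat <- function(xbar) (5 - xbar) / 4

phat(4.80)  # xbar equal to the true mean: recovers p = 0.05 exactly
phat(5.20)  # a perfectly plausible small-sample mean: p-hat = -0.05, negative!
```

Nothing in the moment-matching arithmetic forces the estimate into [0,1]; only a large enough sample makes the sample mean close enough to the true mean to keep the estimate in range.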
Note by the way that the power law family is very unstable, in the sense that we usually have very small values but sometimes have really large ones, the "Black Swan" notion. Accordingly, that makes estimation of the gamma parameter difficult as well.
Remember that in this project we're pretending to develop a tool that could be used by people who choose to use these "dangerous" models such as mixtures and power laws. We are NOT endorsing the use of such models.
Several people have asked me about standard errors. What are they, and how are they derived?
Actually, the answer to the first question is in your reading, Section 16.5.1. :-)
The answer to the second question is more complex. Where does gmm() get its standard errors, for instance? This relies on the Central Limit Theorem, and something called the "delta method." Here is an outline of the derivation.
First, note that sample moments (say noncentral ones) are made up of sums, so the Central Limit Theorem says they are approximately normally distributed. In fact, if we have a vector of k sample moments, its approximate distribution will be k-variate normal.
Second, we know that any linear function of a normally distributed random vector will again be normally distributed. In general, the estimators coming out of a Method of Moments operation will not be linear functions of the sample moments, but we can approximate them by linear functions by expanding in a first-degree Taylor series. For large n, theta-hat will be close enough to the true population parameter vector theta that the nonlinear terms become negligible. We thus can get the approximate covariance matrix of theta-hat from the covariance matrix of the sample moments.
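In symbols, the outline above looks like this (my notation, which may differ from the book's): write m-hat for the vector of k sample moments and theta-hat = g(m-hat) for the Method of Moments estimator.

```latex
% Delta-method sketch (notation mine).
% Sample moments are approximately multivariate normal:
\widehat{m} \;\approx\; \mathcal{N}\!\left(m,\; \tfrac{1}{n}\Sigma\right)
% Linearize the estimator around the true moment vector:
\widehat{\theta} \;=\; g(\widehat{m}) \;\approx\; g(m) + G\,(\widehat{m}-m),
\qquad G = \left.\frac{\partial g}{\partial m}\right|_{m}
% A linear function of a normal vector is normal, so approximately
\mathrm{Cov}\big(\widehat{\theta}\big) \;\approx\; \tfrac{1}{n}\, G\,\Sigma\,G^{\top}
```

The standard errors are then the square roots of the diagonal elements of that estimated covariance matrix.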
It would be good for you to draw a picture of the approximation in (4.46). There you'll see that 1/(gamma-1) is an underestimate of the series. In fact, the greater gamma is, the more serious the error. Thus for larger gamma, the true c is considerably less than gamma-1.
Thus when you calculate the first moment of this distribution, I recommend that you use a better method to find c. One quick way would be to simply keep summing k^(-gamma), k = 1,2,3,... until the sum doesn't change much, and then take c to be the reciprocal of the sum.
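A minimal sketch of that summation approach (function name and tolerance are my choices):

```r
# Normalizing constant c for the discrete power law pmf c * k^(-gamma):
# keep summing k^(-gamma), k = 1,2,3,..., until the sum stops changing
# much, then take c to be the reciprocal of the sum.
powerlaw_c <- function(gamma, tol = 1e-8) {
  s <- 0
  k <- 1
  repeat {
    term <- k^(-gamma)
    s <- s + term
    if (term < tol) break  # sum no longer changing appreciably
    k <- k + 1
  }
  1 / s
}

powerlaw_c(2)  # should be near 6/pi^2, since the sum of 1/k^2 is pi^2/6
```

Note that the series converges only for gamma > 1, and for gamma near 1 the convergence is slow, so a crude tolerance will still underestimate the tail somewhat.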
Unfortunately, it has just come to my attention that my specs for the CRAN problem in Homework II are missing an important ingredient -- a function specifying the pmf/pdf for the parametric family being fit.
One group told me that in their code, the user is prompted for this. I will definitely accept that, but of course it is not good design practice. It would be better to specify the function in an argument gd in the mmfit call, and I've updated the specs accordingly.
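For concreteness, a gd function for, say, a geometric family might look like the following. This is only my sketch of the general shape (data values first, then the family's parameters); see the updated specs for the exact form mmfit expects.

```r
# Hypothetical example of a pmf function to pass in the gd argument:
# the geometric pmf on 1,2,3,..., with success probability p.
gd <- function(x, p) (1 - p)^(x - 1) * p
```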
Sorry for this omission. It would have been nice if I had learned of it earlier, but since it is so late, I've extended the due date accordingly.
Here are some tips, in response to some questions I've gotten:
In my hint this afternoon on Math Problem A, I forgot to mention (a) use iterated expectations and (b) review the material on p.367 and the related emphasis in class that λ_{M} is a random variable (because M is).
In denoting multiplication in your homework and quizzes, please use '*' only for program code, NOT in math expressions. For the latter use juxtaposition, \cdot etc.
We need to start interactive grading for the homework. Since we are getting a rather late start (my fault), we'll combine the grading for Homework sets I and II.
The grading will consist of 30 minutes per group. If possible, I'd like to have most or all of the sessions next Monday, February 29, between 3 and 6:30 pm. (So the last session might be during 6:00-6:30.) Please discuss with your team, and appoint ONE person to communicate with me on your time preferences during that Monday time block. Remember, all members of the team must be present.
In the homework, I am only asking for the discrete power law.
Nick's excellent Rcpp tutorial is on my Web page.
Notes on the current homework:
The sections are 20.9 and 21.2.2.
I consider Wikipedia to be a wonderful resource, and I use it a lot myself. But it can be wrong, or misleading. (Books can be wrong or misleading too.) Unfortunately, Wikipedia's phrasing here gave these students the wrong impression.
In that Wikipedia entry, one of the "two variables" they refer to is not a variable at all; it is a frequency. In other words, the power law is simply a family of probability mass functions, as in our book, or a similar family of density functions (not in our book).
In the Wikipedia example of the sizes of earthquakes, for instance, the horizontal axis is earthquake size and the vertical axis is the density value for that size.
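So, for instance, the discrete power law family of pmfs could be coded as follows (a sketch; the function name and the truncation point for the normalizing sum are mine):

```r
# Discrete power law pmf: p(k) = c * k^(-gamma), k = 1,2,3,...
# The normalizing constant c is approximated by truncating the infinite sum.
dpowerlaw <- function(k, gamma, kmax = 1e5) {
  cc <- 1 / sum((1:kmax)^(-gamma))
  cc * k^(-gamma)
}

sum(dpowerlaw(1:1e5, 2))  # the probabilities sum to (essentially) 1
```

The point is that gamma indexes a family of probability mass functions, just like lambda indexes the exponential family; there is no "second variable."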
∑_{i} P(X_{n} = i) × something

when you take the limit, that factor P(X_{n} = i) will become π_{i}.
Some comments on the quiz:
Definitely the interactive homework grades (Hwk I grading coming soon!) will play a larger role in the course grades. But yes, the two quizzes will be factors too.
You will submit your work in a .tar file to 256quiz1 on CSIF. The file will contain your LaTeX document in a file Quiz1.tex, with its output in Quiz1.pdf, as well as your R code, in a file Quiz1.R. No subdirectories, please.
I repeated these requirements in class multiple times in the days leading up to the quiz. It was your responsibility to know how to do these things, e.g. form .tar files. And yet instead of .tar I got .zip and even .rar; I got corrupted .pdf files; several people didn't know how to use handin; and many people had subdirectories!
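For reference, here is one way to form a compliant archive without leaving R (a sketch; the shell equivalent, run from the directory containing your three files, is tar cf Quiz1.tar Quiz1.tex Quiz1.pdf Quiz1.R):

```r
# Form a flat .tar (no subdirectories) containing the three required files.
# The empty files created here are stand-ins for your real quiz files.
setwd(tempdir())
file.create(c("Quiz1.tex", "Quiz1.pdf", "Quiz1.R"))
utils::tar("Quiz1.tar", files = c("Quiz1.tex", "Quiz1.pdf", "Quiz1.R"))
untar("Quiz1.tar", list = TRUE)  # verify: the three names, with no path prefixes
```

The key points: the archive is a real .tar (not a renamed .zip or .rar), and listing its contents shows bare file names, not subdirectory paths.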
If a student does not take the time to "take care of business" on even such mundane things as the proper way to submit quizzes, one must speculate that the student did not give careful thought to our textbook reading and so on. For instance, I've gotten rather few questions from students about the material in the textbook.
Please excuse my bluntness here, but an exam is supposed to give you useful feedback, so please treat my comments as stemming from my concern for your welfare as students.
If you did submit your quiz in full compliance with the instructions, please notify me, and I will give you Extra Credit. But everyone is expected to be fully compliant in Quiz 2.
I am currently grading your Quiz 1 papers. Again, sorry for the delay, but right now I am being further delayed by the failure of some students to comply with directions, which is causing my grading script to crash. Please note, for example, that .zip files, .rar files and so on are not .tar files! I will not impose penalties, but I would highly appreciate it if you all strictly follow directions in Quiz 2. Thanks in advance.
A student sent me a query regarding the beta distribution example of gmm() in our book, specifically the computation of m2. Why, he asks, do I use
(x-meanb)^2
instead of
(x-mean(x))^2
Actually, the latter would work too, but the former should be a little better, as follows.
Look at (15.23), p.270. It is exactly in analogy with the population mean: We average over our sample instead of averaging over the population, and we use the sample mean instead of the population mean. BUT if we knew the population mean μ, we could use it instead of X-bar, and get a more accurate estimate of σ^{2}. In fact, then there would be no controversy regarding dividing by n vs. n-1, since the estimate with 1/n would be unbiased. (Make sure you see this.)
In the case brought up by the student, we don't know μ but we do know it in terms of α and β, and this is useful.
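In code, the idea looks like this; a sketch in the spirit of the book's beta example (the book's actual code may differ in details):

```r
# Moment conditions for fitting a beta(a, b) distribution by GMM.
# The population mean and variance are known in terms of a and b,
# so we use meanb rather than mean(x) when centering.
g_beta <- function(theta, x) {
  a <- theta[1]; b <- theta[2]
  meanb <- a / (a + b)                        # population mean
  varb  <- a * b / ((a + b)^2 * (a + b + 1))  # population variance
  m1 <- x - meanb
  m2 <- (x - meanb)^2 - varb                  # centered at meanb, not mean(x)
  cbind(m1, m2)
}
```

At the true parameter values, both moment conditions have expectation 0, which is exactly what gmm() exploits.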
In my post yesterday mentioning "my mmf() function," I of course meant "my mm() function."
News on Hwk II:
From: "Pierre Chausse"
Date: Feb 17, 2016 13:50
Subject: Re: gmm Bug - Single Parameter Estimation
To: "Nick Ulle"
Thanks for pointing that out. It is so rare that I have never tested it. I will fix it when I have time. I might just remove the optimize option and leave the option of using Brent, which works well.

You did not use Brent properly in gmm(). t0, like in optim, is a scalar, and lower and upper must be provided as separate arguments. The following works well:

gmm(exp_constraint, x, 5, lower = 0, upper = 10, method = "Brent")

Pierre
You also may wish to try my mmf() function, linked to in the homework specs.
I hope to get your quizzes graded tonight. Apologies for the delay.
I see now that I forgot to state in Math Problem A that we're assuming that the stationary distribution etc. exist. I've now added such language.
The specs for Homework II are now complete.
I just made a couple of clarifications on Hwk II.
Note that the due date for Homework II, February 23, should be considered FIRM. Requests for an extension of one day will be considered upon evidence that the requesting group is nearly done.
There will be one math problem in the assignment, also due February 23. It should be on the Web later today.
The CRAN project Hwk II is now on the Web!
I've placed today's quiz, with solutions, in the Exams/ directory of our class Web site.
If you submitted Stage 2 for the homework in a manner that overwrote Stage 1, please resubmit Stages 1 and 2 together in a single file.
Concerning Thursday's quiz:
Thursday's quiz will not cover the statistics material, i.e. Chapter 13 and later. That material will be covered on the second quiz.
I'm getting questions on Problem C. I think it would be helpful to look at a specific example of the transition rates.
There are various ways to label the states. What I'll use here will be (i,j,k), for i ordinary calls currently in service, j calls at the manager (1 being served currently, j-1 in line) and k calls in the arrivals queue. One could set up an equivalent double-subscripting system, as in the problem specs, but this one is easier to discuss here. (And you are welcome to use the triple-subscript system if you wish.)
Say s = 6, and consider the state (6,3,2). Suppose one of the two servers finishes his/her call, and takes the one at the head of the arrivals queue. The latter now has only 1 call. However, say the server finds (immediately, according to assumption) that it is actually a complex call, and hands it off to the manager, so that there are now 4 calls at the manager. But that frees the server who did the handoff, and now that server can take another call from the arrivals queue, further reducing the queue length to 0. BUT this call too might be complex! If so, we end up at state (5,5,0).
Now, at what rate are we going from state (6,3,2) to (5,5,0)? For one particular server, we have a rate of σ q^2 but there are 6 servers for which the above events could occur, for a total rate of 6 σ q^2.
Note that though the above description of what transpires in a transition from (6,3,2) to (5,5,0) involves several events, they all occur instantaneously (under our assumptions). A key point is that only one exponentially distributed random variable is involved, so the issue discussed in class, that the probability of two independent continuous random variables occurring at exactly the same time is 0, does not apply here.
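As a quick sanity check on the rate arithmetic above (the numerical values of σ and q below are mine, just for illustration; the formula s σ q² is from the example):

```r
# Rate of the (6,3,2) -> (5,5,0) transition described above: one service
# completion (rate sigma) followed by two complex-call handoffs (prob q
# each), and any of the s busy ordinary servers could be the one finishing.
transition_rate <- function(s, sigma, q) s * sigma * q^2

transition_rate(6, sigma = 1.0, q = 0.5)  # example numbers: 6 * 1 * 0.25 = 1.5
```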
As mentioned before, this is a very messy, intricate problem, with lots of cases. You will NOT be docked points for overlooking a case here and there. However, if your simulation results don't come out reasonably close to your analytical ones, you should try to find possible errors either in your math or your code.
Please keep in mind that we will have our first quiz on February 4. (Our second and last quiz will be on the last day of class, March 10.) Note the following.
A question arose today as to exactly what is meant in Problem C by the determination of whether a call is complex being made "upon call arrival." I agree that this was ambiguous. What I meant was when the call is screened by a server. In particular, a call must wait in the big queue, finally be looked at by a server when the call reaches the head of the queue and a server becomes free, and the determination is then made by that server.
By the way, this problem illustrates how quickly a model can become complex. In its description, it seems simple, but the Markov chain is rather intricate, with a lot of cases.
Recall that we will have two quizzes in our course. I've decided that the first one will be on February 4.
You will use your laptops in the quiz. The laptop must have LaTeX and R installed. If either of these conditions presents a problem for you, please let me know as soon as possible.
Math Problem A, first two bullets, concerns the final contents of a box, after we start a new one.
Stage 2 is ready, due February 1. There is quite a bit here, more than it seems. Start now!
I am preparing the second (and last) stage of the homework now, ready later this evening. (I'll announce it then.)
Homework II will involve mixture distributions, including clustering.
Homework III, the last one, will involve something called random effects models.
I made a couple of slight clarifications in the math problem in the current homework.
Today I mentioned the Method of Stages in class. The goal is to be able to handle non-exponential distributions in Markov models. Here is a summary:
For various reasons, I've decided to extend the due date for the homework, so that Stage 1 will be due on January 25. However, if you can get Stage 1 done by the original January 19 date, I will make a note in my records of this.
The second and final stage will probably be due a few days after the 25th. It will consist of the continuous-time completion of the CRAN project, and a couple more math problems.
The following is a summary of (and partly an extension of) a conversation I had with some students after class today, concerning the current CRAN project.
The user U has a Markov chain he/she wants to analyze. If it is a chain with finite state space, U can simply supply your function with an R matrix, the transition matrix. However, in the infinite case, this cannot be done (and even in the finite case it might be cumbersome). So your function will provide U with an alternative, which is to specify the matrix via a function, pijdef(). What is the nature of that function?
Consider an element p_{ij} of the transition matrix. That can be considered to be a function of i and j; pijdef() formalizes this notion. Consider, for instance, random walk on 1,2,3,... The function might look like this:
pijdef <- function(i, j) {
   if (i == 1 && j == 2) return(1)
   if (abs(i - j) != 1) return(0)
   0.5
}
Now, how to find π in the infinite case? The answer is to find the approximate value through an iterative process. First truncate the chain to the first k states, say k = 10, and find π for that chain. Then do the same for a larger truncated version, say k = 20, and find π for it. If the two π vectors are close enough, then use the second one as your final approximation for the full chain. If not, increase k even more.
The hardest thing in creating a package for others to use is to anticipate what users might want -- and what they might want not to be bothered with. It's up to you whether to ask the user to supply the initial k, or not to involve the user in this at all. You may also want to ask the user, optionally, for a convergence criterion ε.
You'll need at least one test case, one with a known answer. I suggest using birth-and-death models. These are easy to understand and use. There is plenty on the Web on this; see for instance this one on the discrete-time case.
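Here is a hedged sketch of the whole truncate-and-compare scheme. All names and the doubling rule for k are my choices, not part of the specs, and the pijdef used here is a birth-and-death example (down probability 0.6, up 0.4) rather than the random walk above, since this chain's stationary distribution is known in closed form: π_i proportional to (2/3)^(i-1).

```r
# Birth-and-death chain on 1,2,3,...: down w.p. 0.6, up w.p. 0.4,
# with state 1 holding w.p. 0.6. Stationary: pi_i = (1/3) * (2/3)^(i-1).
pijdef <- function(i, j) {
  if (i == 1) { if (j == 1) return(0.6); if (j == 2) return(0.4); return(0) }
  if (j == i - 1) return(0.6)
  if (j == i + 1) return(0.4)
  0
}

# Stationary distribution of the chain truncated to states 1..k:
# build the k x k transition matrix, renormalize the boundary row,
# and take the left eigenvector for eigenvalue 1.
statdist_trunc <- function(pijdef, k) {
  P <- outer(1:k, 1:k, Vectorize(pijdef))
  P <- P / rowSums(P)
  ev <- eigen(t(P))
  v <- Re(ev$vectors[, which.max(Re(ev$values))])
  v / sum(v)
}

# Keep doubling k until successive truncated pi vectors agree to within eps.
approx_pi <- function(pijdef, k = 10, eps = 1e-6) {
  piv <- statdist_trunc(pijdef, k)
  repeat {
    k2 <- 2 * k
    piv2 <- statdist_trunc(pijdef, k2)
    if (max(abs(piv2[1:k] - piv)) < eps) return(piv2)
    piv <- piv2
    k <- k2
  }
}
```

For this chain, approx_pi(pijdef) should return a vector whose first entry is close to 1/3, with successive entries decaying by a factor of 2/3, matching the closed form.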
You must write up your mathematical solutions in LaTeX, clearly and in detail, doing a good professional job. Don't just write down equations; explain them.
Stage 1 of the homework is now complete, due January 19. You should start as soon as possible.
I now have the first part of Homework I ready, consisting of the CRAN project. I may make some small clarifications and corrections, but it's basically ready to go. Let me know if you have any questions.
Scooped! (Another good word for nonnative speakers of English.)
I said in class that I didn't know of any R packages that dealt with general Markov chains. Well, I was 42 days out of date. :-) The package markovchain was released on November 26. By the way, I found out about it through my subscription to R-bloggers, a free service that I highly recommend to you. See Joe Rickert's article Getting Started with Markov Chains.
The package looks pretty good, but it only does part of what I will be having you do in your first CRAN project, which as you know will be on Markov chain analysis. On the other hand, it does cover some material not in our course, such as on recurrent and transient states; I have a chapter on that for our book, but chose not to include it.
It is common for one R package to make use of another. (Indeed, it is rare that a package does NOT do this.) You may wish to have your package include this one. Note, though, that this one includes C++ code, making it a little harder to install, a reason you may want NOT to include it.
I expect to have the formal specs writeup for your project in the next day or two.
I've added a bit more detail to the CRAN introduction that I announced here last night.
Please read my quick introduction to R package development. I will be posting (at least a rough draft of) your first package homework in the next day or so.
As I mentioned in class today, Equation (6.29) was not quite right. I've fixed it in the non-class version of the book. That one is always up-to-date, as I continually revise the book (but the official text for our course is still this one, as before).
In case you are wondering, here is the grade distribution from the last time I taught this class, winter 2014. (Dipak Ghosal and I alternate teaching the course.) It was smaller than what we have this year, 29 students. The grade breakdown was
A+: 6
A: 11
A-: 7
B+: 5
Reminder: Tomorrow (say 5 pm) is the deadline for submitting your group information. If you have formed a group, send the membership list. If you wish to be assigned to a group, so state. I said on Tuesday that you should send this information to me, but instead, please send it to Nick, our TA.
Our TA, Nick Ulle, will hold office hours Wednesdays, 10-12, in 3106 Kemper.
Our classroom is 205 Olson, not 205 Wellman as our Web page originally stated.