Direct Semi-Parametric Estimation of Fixed Effects Panel Data Varying Coefficient Models

In this paper, we present a new technique to estimate varying coefficient models of unknown form in a panel data framework where individual effects are arbitrarily correlated with the explanatory variables in an unknown way. The estimator is based on first differences; a local linear regression is then applied to estimate the unknown coefficients. To avoid a non-negligible asymptotic bias, we need to introduce a higher-dimensional kernel weight. This enables us to remove the bias at the price of enlarging the variance term and, hence, achieving a slower rate of convergence. To overcome this problem, we propose a one-step backfitting algorithm that enables the resulting estimator to achieve the optimal rate of convergence for this type of problem. It also exhibits the so-called oracle efficiency property. We also obtain the asymptotic distribution. Because the estimation procedure depends on the choice of a bandwidth matrix, we also provide a method to compute this matrix empirically. The Monte Carlo results indicate the good performance of the estimator in finite samples.


INTRODUCTION
This paper is concerned with the estimation of varying coefficient panel data models. This type of specification consists of a linear regression model whose coefficients are allowed to vary with some exogenous continuous variables suggested by economic theory. For example, in the so-called returns-to-education problem, when estimating the elasticity of wages with respect to education, it has been pointed out that marginal returns to education might vary with the level of working experience; see Schultz (2003). Therefore, conditional on a level of education, the elasticity changes according to the level of working experience.
Within this context, the potential misspecification of the functional form of the varying coefficients has motivated the use of non-parametric estimation techniques in empirical studies. In most cases, the functional form of the coefficients has been estimated through standard techniques, such as spline smoothers, series estimators, or local polynomial regression estimators; see Su and Ullah (2011). Although a direct application of these techniques usually renders correct inference results, not much attention has been paid to the asymptotic behaviour of these estimators under non-standard settings. Unfortunately, some of these settings are rather relevant in the empirical analysis of panel data models. One clear example is the presence, in the econometric model, of unobserved explanatory variables that, although time-invariant, can be statistically correlated with other explanatory variables in the model (fixed effects). Heterogeneity of unknown form that is correlated with some explanatory variables is not an easy problem to handle. In fact, under this type of heterogeneity, any estimation technique suffers from the so-called incidental parameters problem; see, e.g., Neyman and Scott (1948).
In order to obtain a consistent estimator of the parameters of interest, one possible solution is to transform the model so as to remove the heterogeneity of unknown form. To be more precise, consider a linear panel data model where the heterogeneity μ_i is arbitrarily correlated with the covariates X_it and/or Z_it, of dimension d and q, respectively:

Y_it = X_it^T m(Z_it) + μ_i + υ_it,   i = 1, . . . , N; t = 1, . . . , T.   (1.1)

Furthermore, the function m(z) is unknown and needs to be estimated, and the υ_it are random errors. It is clear that any attempt to estimate m(·) directly through standard non-parametric techniques will render an inconsistent estimator of the underlying curve. The reason is that E[μ_i | X_it = x, Z_it = z] ≠ 0. A standard solution is to remove μ_i from (1.1) by taking a transformation, and then to estimate the unknown curve with a non-parametric smoother. There are several approaches to remove these effects. The simplest is probably to take first differences, i.e.,

ΔY_it = X_it^T m(Z_it) − X_i(t−1)^T m(Z_i(t−1)) + Δυ_it,   i = 1, . . . , N; t = 2, . . . , T,   (1.2)

where ΔY_it = Y_it − Y_i(t−1) and Δυ_it = υ_it − υ_i(t−1). The direct non-parametric estimation of m(·) has, until now, been considered rather cumbersome; see Su and Ullah (2011). The reason is that, for each i, the conditional expectation E[ΔY_it | Z_it, Z_i(t−1), X_it, X_i(t−1)] in (1.2) contains a linear combination of X_it^T m(Z_it) for different t. This can be viewed as an additive function with the same functional form at different times.
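To fix ideas, the following sketch (a Python illustration of our own, with a hypothetical coefficient curve m(z) = sin z and d = q = 1; none of the names come from the paper) simulates model (1.1) with a fixed effect correlated with Z and verifies numerically that first-differencing as in (1.2) removes μ_i exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 200, 4
m = lambda z: np.sin(z)                        # hypothetical coefficient curve

Z = rng.uniform(0, np.pi, size=(N, T))
X = rng.normal(size=(N, T))
mu = 0.5 * Z.mean(axis=1) + rng.normal(size=N)  # fixed effect correlated with Z
v = rng.normal(size=(N, T))
Y = X * m(Z) + mu[:, None] + v                 # model (1.1) with d = q = 1

# First differences (1.2): the fixed effect mu_i drops out exactly.
dY = Y[:, 1:] - Y[:, :-1]
rhs = X[:, 1:] * m(Z[:, 1:]) - X[:, :-1] * m(Z[:, :-1]) + v[:, 1:] - v[:, :-1]
assert np.allclose(dY, rhs)                    # mu_i no longer appears
```

The assertion holds by construction: μ_i enters every period identically, so it cancels term by term in the difference.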
In some special cases, consistent estimation of the quantities of interest has been provided in the literature. For the unrestricted model X_it^T m(Z_it) ≡ m(X_it, Z_it), (1.2) becomes a fully non-parametric additive model:

ΔY_it = m(X_it, Z_it) − m(X_i(t−1), Z_i(t−1)) + Δυ_it,   i = 1, . . . , N; t = 2, . . . , T.

In this case, Henderson et al. (2008) propose an iterative procedure based on a profile likelihood approach, whereas Mammen et al. (2009) consider the consistent estimation of non-parametric additive panel data models with both time and individual effects via a smoothed backfitting algorithm. Furthermore, for X̃_it^T m(Z_it) ≡ g(Z_it) + X_it^T β_0, where X̃_it = (1, X_it^T)^T and m(Z_it) = (g(Z_it), β_0^T)^T for some real-valued g(·) and a vector β_0, the regression function in (1.2) becomes a semi-parametric partially additive model:

ΔY_it = g(Z_it) − g(Z_i(t−1)) + ΔX_it^T β_0 + Δυ_it,   i = 1, . . . , N; t = 2, . . . , T.

Qian and Wang (2012) consider the marginal integration estimator of the non-parametric additive component resulting from the first-differencing step, i.e., G(Z_it, Z_i(t−1)) = g(Z_it) − g(Z_i(t−1)).
The estimation procedure that we introduce in this paper generalizes the previous results to the rather general varying coefficient model specified in (1.1), in a framework where N → ∞ but T remains fixed. It is based on applying a local approximation to the additive function X_it^T m(Z_it) − X_i(t−1)^T m(Z_i(t−1)). The same idea was proposed in a completely different context by Yang (2002). Because the estimator is based on local approximation properties, we investigate the behaviour of the bias remainder term under fairly general conditions. This term, which is negligible in standard local linear regression techniques (see Fan and Gijbels, 1995b), requires much more attention when dealing with first-difference estimators. In fact, as has already been pointed out by Lee and Mukherjee (2008), the direct application of local linear regression techniques to first-differencing transformations in panel data models renders biased estimators, and the bias does not vanish even in large samples. Using a higher-dimensional kernel weight, our estimation technique overcomes the problem of non-vanishing bias, although, as expected, the variance term becomes larger. The same phenomenon also appears in Henderson et al. (2008), whose final estimator likewise exhibits a larger variance.
In order to obtain the standard rates of convergence for this type of problem (i.e., to reduce the variance while holding the bias constant), we propose to use the developments introduced by Fan and Zhang (1999). Their core idea was that the variance can be reduced by further smoothing, whereas the bias cannot be reduced by any kind of smoothing. We apply these ideas to our problem using a one-step backfitting algorithm. Because the model has the form of an additive model, we also show that our estimator is oracle efficient; that is, the asymptotic variance-covariance matrix of any component of our estimator is the same as if the other components were known. Finally, we also propose a data-driven method to select the bandwidth matrix.
As pointed out previously, other transformations to remove the heterogeneous effects are available in the literature. To our knowledge, for model (1.1), an estimator of m(·) has been proposed by Sun et al. (2009) through the so-called least-squares dummy variable approach. They estimate m(·) using an alternative specification that includes individual dummies d_ij, where d_ij = 1 if i = j, and 0 otherwise. Based on this specification, they propose a least-squares method combined with a local linear regression approach that produces a consistent estimator of the unknown smooth coefficient curves. Compared to our method, their estimator exhibits a larger bias. In fact, their bias presents two terms. The first term results from the local approximation of m(·) and is also present in our estimator. The second term results from the unknown fixed effects and is zero only under the additional (strong) restriction that Σ_i μ_i = 0. This type of restriction has also been used by Mammen et al. (2009).

The rest of the paper is organized as follows. In Section 2, we set up the model and the estimation procedure. In Section 3, we study its asymptotic properties and propose a transformation procedure that provides an estimator that is oracle efficient and achieves the optimal rate of convergence. In Section 4, we show how to estimate the bandwidth matrix empirically and, in Section 5, we present some simulation results. Finally, we conclude in Section 6. The proofs of the main results are collected in the Appendix.

STATISTICAL MODEL AND ESTIMATION PROCEDURE
To illustrate our technique, we start with the univariate case and then present our results for the multivariate case. Consider (1.2) with d = q = 1. In this case, for any z ∈ A, where A is a compact subset of IR with non-empty interior, we have the following Taylor expansion:

X_it m(Z_it) − X_i(t−1) m(Z_i(t−1)) ≈ m(z)ΔX_it + m′(z)[X_it(Z_it − z) − X_i(t−1)(Z_i(t−1) − z)].

This suggests that we estimate m(z), m′(z), . . . , m^(p)(z) by regressing ΔY_it on the terms ΔX_it and X_it(Z_it − z)^j − X_i(t−1)(Z_i(t−1) − z)^j, j = 1, . . . , p. Then, the quantities of interest can be estimated using a locally weighted linear regression,

min_{β_0, β_1} Σ_i Σ_t {ΔY_it − β_0 ΔX_it − β_1 [X_it(Z_it − z) − X_i(t−1)(Z_i(t−1) − z)]}² K_h(Z_it − z) K_h(Z_i(t−1) − z);   (2.1)

see Fan and Gijbels (1995b), Wand (1994), or Zhan-Qian (1996); here K_h(·) = K(·/h)/h and h is a bandwidth. We denote by β̂_0 and β̂_1 the minimizers of (2.1). The above exposition suggests m̂_h(z) = β̂_0 and m̂′_h(z) = β̂_1 as estimators for m(·) and m′(·), respectively. In particular, for the case of a local constant approximation (p = 0; i.e., the Nadaraya-Watson kernel regression estimator), the estimator for m(z) has a closed form as a ratio of kernel-weighted sums; in the local linear regression case (p = 1), a closed form is given in (2.4).

Note that in (2.1) we propose a bivariate kernel that also contains Z_i(t−1), instead of considering only Z_it. The reason is that, if we considered only a kernel around Z_it, the transformed regression equation (1.2) would be localized around Z_it without controlling the other values. Consequently, the distance between Z_is (for s ≠ t) and z could not be controlled by the fixed bandwidth parameter, and thus the transformed remainder terms would not be negligible. The consequence would be a non-degenerate bias in this type of local linear estimator, which is removed by considering a local approximation around the pair (Z_it, Z_i(t−1)). In Theorem 3.1, we show that the local linear estimator with the bivariate kernel attains the same bias rate as standard local linear smoothers (i.e., a bias of order O(h²)).
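The first-step fit can be sketched as follows. This is an illustration of our own, not the paper's implementation: the Gaussian kernel, the bandwidth value, the simulated design, and the function name `first_step_mhat` are all assumptions; the design matrix simply pairs ΔX with the cross term X(Z − z) − X_lag(Z_lag − z) described above, weighted by the bivariate kernel.

```python
import numpy as np

def first_step_mhat(z, dY, X, Xlag, Z, Zlag, h):
    """Local linear fit at z with the bivariate product kernel
    K_h(Z - z) * K_h(Zlag - z); returns the estimate of m(z) (d = q = 1)."""
    K = lambda u: np.exp(-0.5 * u ** 2)          # Gaussian kernel (unnormalized)
    w = K((Z - z) / h) * K((Zlag - z) / h)       # bivariate kernel weights
    D = np.column_stack([X - Xlag,               # regressor for m(z)
                         X * (Z - z) - Xlag * (Zlag - z)])  # regressor for m'(z)
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(D * sw[:, None], dY * sw, rcond=None)
    return beta[0]

# Illustration on simulated first-differenced data with m(z) = sin(z).
rng = np.random.default_rng(1)
n = 5000
Z, Zlag = rng.uniform(0, np.pi, n), rng.uniform(0, np.pi, n)
X, Xlag = rng.normal(size=n), rng.normal(size=n)
dY = X * np.sin(Z) - Xlag * np.sin(Zlag) + rng.normal(scale=0.5, size=n)
print(first_step_mhat(1.5, dY, X, Xlag, Z, Zlag, 0.3))  # close to sin(1.5)
```

Solving the weighted least-squares problem directly (rather than forming the closed-form ratio) keeps the sketch short and numerically stable.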
Unfortunately, the well-known trade-off between the bias and variance terms appears: although the introduction of this bivariate kernel removes the bias, it enlarges the variance term, which becomes of order O(1/(NTh²)). This is also emphasized in Theorem 3.1. Of course, this affects the achievable rate of convergence for this type of problem, which slows down to √(NTh²). In order to recover the desirable rate of √(NTh), we propose a transformation that is basically a one-step backfitting algorithm. Let us denote by Y_it^(1) the following expression:

Y_it^(1) = ΔY_it + X_i(t−1) m(Z_i(t−1)).   (2.5)

By substituting (1.2) into (2.5), we obtain

Y_it^(1) = X_it m(Z_it) + Δυ_it.   (2.6)

As can be seen from (2.6), the estimation of m(·) is now a one-dimensional problem, and therefore we can again use a local linear least-squares estimation procedure with univariate kernel weights. However, there is still a problem to be solved: in (2.5), the term m(Z_i(t−1)) is unknown. So, we replace it by the initial local linear regression estimator, i.e.,

Ŷ_it^(1) = ΔY_it + X_i(t−1) m̂_h(Z_i(t−1)).   (2.7)

By doing so, we can estimate m(·) using the following weighted local linear regression:

min_{γ_0, γ_1} Σ_i Σ_t {Ŷ_it^(1) − γ_0 X_it − γ_1 X_it(Z_it − z)}² K_h̃(Z_it − z).   (2.8)

Let γ̂_0 and γ̂_1 be the minimizers of (2.8). Then, as before, we propose m̃_h̃(z) = γ̂_0 and m̃′_h̃(z) = γ̂_1 as estimators for m(·) and m′(·), respectively.

Now that our estimation procedure has been fully explained for the univariate case, we extend our results to the multivariate case, that is, to general d and q in (1.1). In this case, the quantities of interest can be estimated using a multivariate locally weighted linear regression (2.9), with kernel weights K_H(Z_it − z)K_H(Z_i(t−1) − z), where K_H(u) = |H|^(−1/2) K(H^(−1/2)u) and H is a q × q symmetric positive-definite bandwidth matrix. Finally, we denote by β̂ = (β̂_0^T, β̂_1^T)^T the d(1 + q)-vector that minimizes (2.9). Again, the above exposition suggests m̂(z; H) = β̂_0 and vec(D̂_m(z; H)) = β̂_1 as estimators for m(z) and D_m(z) = ∂m(z)/∂z, respectively. Here, D_m(z) is the d × q matrix of partial derivatives of the d-vector function m(z) with respect to the elements of the q × 1 vector z.
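The backfitting update can be sketched in the same style. This is our own illustration: the pilot estimate stands in for the first-step estimator (here we pass the true curve, mimicking the case where m(Z_i(t−1)) is known), and all names and tuning values are assumptions, not the paper's code.

```python
import numpy as np

def backfit_mhat(z, dY, X, Xlag, Z, Zlag, m_pilot, h):
    """One-step backfitting fit at z: form the transformed response
    Y1 = dY + Xlag * m_pilot(Zlag), then run a univariate local linear
    regression around Z only (d = q = 1)."""
    K = lambda u: np.exp(-0.5 * u ** 2)
    Y1 = dY + Xlag * m_pilot(Zlag)            # transformed response, as in (2.7)
    w = K((Z - z) / h)                        # univariate kernel weight
    D = np.column_stack([X, X * (Z - z)])
    sw = np.sqrt(w)
    gamma, *_ = np.linalg.lstsq(D * sw[:, None], Y1 * sw, rcond=None)
    return gamma[0]

rng = np.random.default_rng(2)
n = 5000
Z, Zlag = rng.uniform(0, np.pi, n), rng.uniform(0, np.pi, n)
X, Xlag = rng.normal(size=n), rng.normal(size=n)
dY = X * np.sin(Z) - Xlag * np.sin(Zlag) + rng.normal(scale=0.5, size=n)
# Pilot is the true curve here, so only Z_it needs smoothing.
print(backfit_mhat(1.5, dY, X, Xlag, Z, Zlag, np.sin, 0.2))  # close to sin(1.5)
```

Because the kernel now localizes in one variable instead of two, far more observations receive non-trivial weight at each z, which is exactly the variance reduction the backfitting step is designed to deliver.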
It is easy to verify that the solution to the minimization problem in (2.9) can be written in matrix form as a weighted least-squares expression (2.10); the local weighted linear least-squares estimator of m(z) is then defined in (2.11). Finally, there are several reasons to choose local linear least-squares estimators over other candidates. First, the form in (2.11) shows that this estimator is found by fitting a plane to the data using weighted least-squares, with weights chosen according to the kernel and the bandwidth matrix H. As discussed by Ruppert and Wand (1994), if a (possibly truncated) Gaussian kernel is chosen, then the weight given to Z_it is the value of a Gaussian density centred at z, which has ellipsoidal contours of the form (Z_it − z)^T H^(−1)(Z_it − z) = c, for c > 0. Clearly, the further Z_it is from z, the less weight it receives. However, H controls both the size and the orientation of the ellipsoids at a given density level, and therefore it also controls the amount and direction of the weighting. Often, instead of taking a full matrix H, we adopt the simpler form H = diag(h²_1, . . . , h²_q). With a diagonal bandwidth matrix, the ellipsoids have their axes in the same direction as the coordinate axes, whereas for a general H they are aligned with the eigenvectors of H. Depending on the shape of m(·), there are situations where a full bandwidth matrix is advantageous. Another important advantage of local linear least-squares kernel estimators is that the asymptotic bias and variance expressions are particularly appealing and appear to be superior to those of the Nadaraya-Watson or other non-parametric estimators. In particular, Fan (1993) shows that the local linear least-squares estimator has an important asymptotic minimax property.
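The role of H in shaping the weights can be seen numerically. The following small sketch is our own illustration (the matrices and points are arbitrary): it compares the weight two equidistant observations receive under a diagonal bandwidth matrix versus a rotated full one.

```python
import numpy as np

# Weight given to an observation Z_it at location z under a bandwidth matrix H:
# a Gaussian-type kernel in the quadratic form (Z - z)' H^{-1} (Z - z), whose
# level sets are ellipsoids aligned with the eigenvectors of H.
def kernel_weight(Zit, z, H):
    d = Zit - z
    return np.exp(-0.5 * d @ np.linalg.inv(H) @ d)

z = np.zeros(2)
H_diag = np.diag([0.04, 0.04])                       # axes along the coordinates
R = np.array([[np.cos(np.pi / 4), -np.sin(np.pi / 4)],
              [np.sin(np.pi / 4),  np.cos(np.pi / 4)]])
H_full = R @ np.diag([0.09, 0.01]) @ R.T             # ellipsoid rotated by 45 degrees

# Two points at the same Euclidean distance from z: equal weight under the
# diagonal H, very different weights under the rotated full H.
p1, p2 = np.array([0.2, 0.2]), np.array([0.2, -0.2])
assert np.isclose(kernel_weight(p1, z, H_diag), kernel_weight(p2, z, H_diag))
assert kernel_weight(p1, z, H_full) > kernel_weight(p2, z, H_full)
```

This is the practical content of the remark above: a full H lets the smoother stretch along directions where m(·) is flat and shrink along directions of high curvature.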
Furthermore, unlike the Nadaraya-Watson or other non-parametric estimators, the bias and variance of (2.11) near the boundary of the density of Z are of the same order of magnitude as in the interior. This is a very interesting property because, in applications, the boundary region might comprise a large proportion of the data.

ASYMPTOTIC PROPERTIES AND THE ORACLE EFFICIENT ESTIMATOR
In this section, we investigate some preliminary asymptotic properties of our estimator. In order to do so, we need the following assumptions.

ASSUMPTION 3.1. Let {(Y_it, X_it^T, Z_it^T)}, i = 1, . . . , N; t = 1, . . . , T, be a set of independent and identically distributed (i.i.d.) IR^(d+q+1)-random variables in the subscript i for each fixed t and strictly stationary over t for fixed i. They are a sample following (1.1). Denote the density functions of Z_1t, (Z_1t, Z_1(t−1)), and (Z_1t, Z_1(t−1), Z_1(t−2)) by f_{Z_1t}, f_{Z_1t,Z_1(t−1)}, and f_{Z_1t,Z_1(t−1),Z_1(t−2)}, respectively. All density functions are continuously differentiable in all their arguments and they are bounded from above and below at any point of their support.
ASSUMPTION 3.2. The random errors υ_it are i.i.d., with zero mean and homoscedastic variance σ²_υ. They are also independent of X_it and Z_it for all i and t. Furthermore, E|υ_it|^(2+δ) < ∞ for some δ > 0.

ASSUMPTION 3.3. μ_i can be arbitrarily correlated with both X_it and Z_it with an unknown correlation structure.
ASSUMPTIONS 3.4 AND 3.5. The conditional moment functionals of X_1t given the Z variables (such as E[X_1t X_1t^T | Z_1t = z_1, Z_1(t−1) = z_2] and their counterparts involving up to three conditioning points z_1, z_2, z_3) are bounded and uniformly continuous at any point of their support.

ASSUMPTION 3.6. The matrix E[ΔX_1t ΔX_1t^T | Z_1t = z_1, Z_1(t−1) = z_2] is positive definite, and its elements are bounded and uniformly continuous at any point of their support.
ASSUMPTION 3.7. Let z be an interior point in the support of f Z 1t . All second-order derivatives of m 1 (·) , m 2 (·) , . . . , m d (·) are bounded and uniformly continuous.
ASSUMPTION 3.8. The kernel function K is a compactly supported, bounded kernel such that ∫ uu^T K(u) du = μ_2(K) I and ∫ K²(u) du = R(K), where μ_2(K) ≠ 0 and R(K) ≠ 0 are scalars and I is the q × q identity matrix. In addition, all odd-order moments of K vanish; that is, ∫ u_1^(ι_1) · · · u_q^(ι_q) K(u) du = 0 for all non-negative integers ι_1, . . . , ι_q such that their sum is odd.

ASSUMPTION 3.9. The bandwidth matrix H is symmetric and strictly positive definite. Furthermore, each entry of the matrix tends to zero as N → ∞ in such a way that N|H| → ∞.
As can be seen, all assumptions are rather standard in the non-parametric regression analysis of panel data models. Assumption 3.1 establishes standard features of the sample and the data-generating process. The individuals are independent and, for a fixed individual, we allow for correlation over time. Other time-series structures might also be considered, such as strong mixing conditions (see Cai and Li, 2008) or non-stationary time-series data (see Cai et al., 2009). Mixing conditions are usually imposed to make the covariances of the estimator tend to zero at a faster rate. In our case, this is not necessary because our asymptotic analysis is performed for fixed T. However, we believe that non-stationary processes are beyond the scope of this paper. Note also that the marginal densities are assumed to be bounded from above and below. This assumption can be relaxed at the price of increasing the mathematical complexity of the proofs.
Assumption 3.2 is also standard for first-difference estimators; see Wooldridge (2002) for the fully parametric case. Furthermore, independence between the errors υ_it and the variables X_it and/or Z_it is assumed without loss of generality. We could relax this assumption by allowing some dependence through second-order moments. For example, heteroscedasticity of unknown form can be allowed and, in fact, under more complex structures in the variance-covariance matrix, a transformation of the estimator proposed by You et al. (2010) can be developed in our setting. This type of assumption also rules out the existence of endogenous explanatory variables and imposes strict exogeneity conditions. If endogeneity were present, an instrumental variable approach, such as the one proposed by Cai and Li (2008) or Cai and Xiong (2012), would be needed. Assumption 3.3 imposes the so-called fixed effects. Note that this assumption is much weaker than the one introduced by Sun et al. (2009) for their least-squares dummy variable approach to work. Basically, they impose a smooth relationship between heterogeneity and explanatory variables and, in order to avoid an additional bias term, they need Σ_i μ_i = 0.
Assumptions 3.4 and 3.5 are smoothness conditions on moment functionals. Assumption 3.6 is equivalent to a standard rank condition for identification of this type of model. Assumptions 3.7-3.9 are standard in local linear regression estimators (see Ruppert and Wand, 1994). Finally, all our results hold straightforwardly for the random coefficient setting.
Under these assumptions, we now establish results on the conditional mean and the conditional variance of the local linear least-squares estimator. Theorem 3.1 shows that the conditional bias of m̂(z; H) is of order O(tr(H)), whereas the conditional variance is of order O(1/(NT|H|)).

To illustrate the asymptotic behaviour of our estimator, we give a result for the case d = q = 1 and H = h²I. In this case, the above result can be written as follows.
COROLLARY 3.1. Let Assumptions 3.1-3.8 hold. Then, if h → 0 in such a way that Nh² → ∞ as N tends to infinity and T is fixed, the bias term is of order O(h²) and the variance is of order O(1/(NTh²)). Note that, in the standard case, μ_2(K_u) = μ_2(K_v), and thus we obtain a good result for the bias: the resulting asymptotic bias has the same expression as in the standard local linear estimator.
As has already been pointed out in other works, the leading terms in both bias and variance do not depend on the sample, and therefore we can consider such terms as playing the role of the unconditional bias and variance. Furthermore, the conditions established on H are sufficient to show that the remaining terms are o_p(1), and therefore it is possible to obtain the following result for the asymptotic distribution of m̂(z; H).

THEOREM 3.2. Let Assumptions 3.1-3.9 hold. Then, for T fixed and N → ∞, m̂(z; H) is asymptotically normal, centred at m(z) plus the leading bias term, with convergence rate √(NT|H|).

Note that the rate at which our estimator converges is √(NT|H|). Under the conditions established above, our estimator is both consistent and asymptotically normal. However, its rate of convergence is suboptimal, because the optimal rate of convergence for this type of problem is √(NT|H|^(1/2)). As we have already indicated in Section 2, in order to achieve optimality we propose to reduce the dimensionality of the problem by redefining Y_it as in (2.7), now for the multivariate case, where m̂(·; H) is the first-step local linear estimator obtained in (2.11). We then estimate m(Z_it) using a multivariate locally weighted linear regression with kernel weights in Z_it only and a second bandwidth matrix H̃. In order to establish the asymptotic properties of this estimator, we need Assumptions 3.10 and 3.11 on the bandwidth H̃ and its relationship with H. In general, for the kernel function and the conditional moments and densities, we need both the smoothness and boundedness conditions already given in Assumptions 3.1-3.8. These are required to use uniform convergence results such as the ones established by Masry (1996). It is then possible to show the following result.

THEOREM 3.3. Let Assumptions 3.1-3.8 and 3.10-3.11 hold. Then, for T fixed and N → ∞, the conditional bias E[m̃(z; H̃) | X_11, . . . , X_NT, Z_11, . . . , Z_NT] − m(z) has a leading term involving diag_r(tr(H̃^T H_{m_r}(z) H̃)), and the conditional variance Var(m̃(z; H̃) | X_11, . . . , X_NT, Z_11, . . . , Z_NT) is of the corresponding reduced order. Here, diag_r(tr(H̃^T H_{m_r}(z) H̃)) stands for a diagonal matrix with elements tr(H̃^T H_{m_r}(z) H̃), for r = 1, . . . , d, where H_{m_r}(z) is the Hessian of m_r at z, and i_d is a d × 1 unit vector.

Finally, focusing on the relevant terms of the bias and variance in Theorems 3.1 and 3.2 and following Ruppert and Wand (1994), note that each entry of H_{m_r}(z) is a measure of the curvature of m_r(·) at z in a particular direction. Thus, the bias increases with higher curvature and with more smoothing, and this is well described by the leading bias term. Meanwhile, the variance is penalized by a higher conditional variance of Y given Z = z and by sparser data near z.

BANDWIDTH SELECTION
As is clear from the previous sections, the bandwidth matrix H plays a crucial role in the estimation of the unknown quantity m(·). In fact, as the asymptotic expressions show, when choosing H there exists a trade-off between the bias and the variance of our estimator. Consider the simplest case, H = h²I. If we choose h very small, then according to Corollary 3.1 the bias of our estimator will be reduced (it is of order h²), but at the price of enlarging the variance (which is of order 1/(NTh²)). This trade-off should be resolved by choosing a bandwidth matrix H that minimizes the mean squared error (MSE), which is the sum of the squared bias and the variance. There are many different measures of discrepancy between the estimator m̂(·; H) and the function m(·). A comprehensive discussion of these measures has been given by Härdle (1990, Chapter 5). For the sake of simplicity, and taking into account the data-generating process in (1.1), we propose a measure of discrepancy, MSE(H), defined in (4.1). In this MSE, the expectation is taken over Z_1, . . . , Z_q; X_1, . . . , X_d, and m̂(Z; H) is the estimator defined in (2.11). Therefore, for our problem, we can define the optimal bandwidth matrix H_opt as the minimizer of MSE(H). If Z_1, . . . , Z_q; X_1, . . . , X_d are random variables that are independent of the observed sample D = (X_11, Z_11, . . . , X_NT, Z_NT)^T but share the same distribution with (X_11, Z_11), it is straightforward to decompose the MSE into squared bias and variance terms. This formalizes the idea of choosing a bandwidth matrix H that minimizes the sum of the squared bias and the variance. Note that the way we have defined the measure of discrepancy determines, in our case, the choice of a global bandwidth; that is, we choose a bandwidth that does not vary with the location point.
Of course, another possibility would be to choose a bandwidth that varies with the location point (i.e., H(z)). In this case, the local MSE criterion would be defined analogously, with the expectation now taken over X. Müller and Stadtmüller (1987) have discussed the issue of local variable bandwidths for convolution-type regression estimators. Furthermore, Fan and Gijbels (1992) have proposed a variable bandwidth for local polynomial regression. In our case, we propose to choose a global bandwidth, for two reasons. First, all components in our model have been assumed to have the same degree of smoothness. Second, the use of local bandwidths, except when the curve presents a rather complicated structure, increases the computational burden without much improvement in the final results. This is probably because of the local adaptation property that local linear regression smoothers already exhibit. Unfortunately, the selection of H_opt does not solve all problems in bandwidth selection. In fact, the MSE depends on some unknown quantities, and therefore the optimal bandwidth matrix cannot be computed directly from the data. There are several alternative solutions to approximate the unknown quantities in the MSE. One alternative is to replace both the bias and variance terms in (4.1) by their respective first-order asymptotic expressions obtained in Theorem 3.1. This is the so-called 'plug-in' method; for details, see Ruppert et al. (1995). Another possibility, as suggested by Fan and Gijbels (1995a), is to replace the bias and variance in (4.1) directly by their exact expressions. These depend on a vector τ and a variance matrix V, where τ is an N(T − 1) vector collecting, for i = 1, . . . , N, t = 2, . . . , T, the higher-order terms of the local approximation given in (4.5). In order to estimate both bias and variance, we need to calculate τ and V.
Note that, for τ, developing a fifth-order Taylor expansion of both m(Z_it) and m(Z_i(t−1)) around z, a local polynomial regression of order five would guarantee that the proposed bandwidth selection procedure is √N-consistent for the local linear fit; see Hall et al. (1991) for details. However, for the sake of simplicity, a local cubic polynomial regression is close to a √N-consistent selection rule and reduces the computational effort. In this case (for d = q = 1), the vector τ̂ will contain the estimated second- and third-order derivatives from the local cubic polynomial regression of ΔY_it on the corresponding local polynomial terms. To estimate V, note that, because of Assumption 3.2, the estimation of V is tantamount to the estimation of σ²_υ. Under the homoscedasticity assumption, this quantity can be estimated consistently from the first-difference residuals as in (4.6). Note that both τ̂ and σ̂²_υ depend on a bandwidth matrix H that needs to be determined from the data. A suitable pilot bandwidth matrix H*, which can be used for these computations, can be obtained using the global residual squares criterion (RSC) procedure proposed by Fan and Gijbels (1995a). Furthermore, we denote by m̂_(−i)(Z_it; H) the leave-one-out estimator of m(Z_it); that is, when estimating m(Z_it) using (2.11), we use all data except those that belong to the ith subject. Once we have estimated τ and σ²_υ, we can provide estimators for b(H) and V(H).
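The variance estimation step can be sketched as follows. This is our own illustration in the spirit of (4.6), whose exact form is not reproduced in the text: under Assumption 3.2, the first-differenced error υ_it − υ_i(t−1) has variance 2σ²_υ, which motivates the factor of 2 in the denominator; the function name and inputs are assumptions.

```python
import numpy as np

def sigma2_v_hat(dY, fitted, N, T):
    """Residual-based estimate of sigma_v^2 from first-differenced data
    (a sketch in the spirit of (4.6)).  The first-differenced errors have
    variance 2 * sigma_v^2, hence the factor of 2 in the denominator."""
    resid = dY - fitted
    return resid @ resid / (2.0 * N * (T - 1))

# Check on pure first-differenced noise with sigma_v^2 = 0.25.
rng = np.random.default_rng(3)
N, T = 20000, 3
v = rng.normal(scale=0.5, size=(N, T))
dv = v[:, 1:] - v[:, :-1]
print(sigma2_v_hat(dv.ravel(), 0.0, N, T))   # close to 0.25
```

In practice `fitted` would be the first-differenced regression fit based on the pilot bandwidth, so the residuals estimate Δυ_it rather than being simulated directly.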
The corresponding estimator of the MSE(H), according to (4.1), is given in (4.7). Then, we define the estimator of H_opt, Ĥ_opt, as the minimizer of (4.7). Although we do not provide theoretical properties of this bandwidth, Zhang and Lee (2000) have studied them in detail for the local MSE case, and we believe it is straightforward to extend their analysis to the global MSE case that we present here. Finally, we propose to use the same procedure to estimate the bandwidth matrix H̃ when computing the oracle efficient estimator.
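A data-driven bandwidth search can be sketched as follows. This is a simple leave-one-subject-out cross-validation stand-in of our own, not the paper's plug-in MSE rule from (4.1)-(4.7); the grid, the kernel, and all names are assumptions. It does, however, respect the panel structure by holding out entire subjects, as the leave-one-out estimator above does.

```python
import numpy as np

def loso_cv_bandwidth(dY, X, Xlag, Z, Zlag, ids, grid):
    """Pick a global bandwidth h (H = h^2 * I, d = q = 1) from a grid by
    leave-one-subject-out cross-validation."""
    K = lambda u: np.exp(-0.5 * u ** 2)

    def mhat(z, h, train):
        # First-step local linear fit of the first-differenced model at z.
        w = K((Z[train] - z) / h) * K((Zlag[train] - z) / h)
        D = np.column_stack([X[train] - Xlag[train],
                             X[train] * (Z[train] - z)
                             - Xlag[train] * (Zlag[train] - z)])
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(D * sw[:, None], dY[train] * sw, rcond=None)
        return beta[0]

    scores = []
    for h in grid:
        sse = 0.0
        for i in np.unique(ids):
            held = ids == i
            for k in np.where(held)[0]:
                # Predict dY_k via dY = m(Z)X - m(Zlag)Xlag, with m estimated
                # from all subjects except i.
                pred = (mhat(Z[k], h, ~held) * X[k]
                        - mhat(Zlag[k], h, ~held) * Xlag[k])
                sse += (dY[k] - pred) ** 2
        scores.append(sse)
    return grid[int(np.argmin(scores))]

rng = np.random.default_rng(4)
n = 60
ids = np.arange(n)                         # one first-differenced pair per subject
Z, Zlag = rng.uniform(0, np.pi, n), rng.uniform(0, np.pi, n)
X, Xlag = rng.normal(size=n), rng.normal(size=n)
dY = X * np.sin(Z) - Xlag * np.sin(Zlag) + rng.normal(scale=0.5, size=n)
print(loso_cv_bandwidth(dY, X, Xlag, Z, Zlag, ids, [0.2, 0.4, 0.8]))
```

Cross-validation trades the analytic bias/variance estimates of the plug-in rule for a purely empirical out-of-sample criterion, which is simpler to code but noisier in small samples.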

MONTE CARLO EXPERIMENT
In this section, we report some Monte Carlo simulation results to examine whether the proposed estimators perform reasonably well in finite samples when the μ_i are fixed effects. We consider varying coefficient non-parametric models of the form (1.1), where X_dit and Z_qit are scalar random variables, v_it is an i.i.d. N(0, 1) random variable, and m(·) is a pre-specified function to be estimated. The observations follow a data-generating process where Z_qit = w_qit + w_qi(t−1), with w_qit an i.i.d. uniformly distributed (0, /2) random variable. We consider three cases of study, with chosen functional forms m_1(Z_1it) = sin(Z_1it), m_1(Z_1it, Z_2it) = sin((Z_1it, Z_2it)), and m_2(Z_1it) = exp(−Z²_1it). We experiment with two specifications for the fixed effects: (a) μ_1i depends on Z_1it, where the dependence is imposed by generating μ_1i = c_0 Z̄_1i· + u_i for i = 2, . . . , N and Z̄_1i· = T^(−1) Σ_t Z_1it; (b) μ_2i depends on Z_1it and Z_2it through the generating process μ_2i = c_0 Z̄_i· + u_i for i = 2, . . . , N and Z̄_i· = (1/2)(Z̄_1i· + Z̄_2i·).
In both cases, u i is an i.i.d. N(0, 1) random variable and c 0 = 0.5 controls the correlation between the unobservable individual heterogeneity and some of the regressors of the model.
In the experiment, we use Q = 1000 Monte Carlo replications. The number of periods (T) is fixed at three, while the number of cross-sections (N) varies among 50, 100, and 200. In addition, the Gaussian kernel is used and, as in Henderson et al. (2008), the bandwidth is chosen as H = σ_z (N(T − 1))^(−1/5), where σ_z is the sample standard deviation of (Z_qit), i = 1, . . . , N; t = 1, . . . , T. We analyse the finite-sample behaviour of the multivariate local linear estimator with kernel weights, m̂(z; H), and the oracle estimator, m̃(z; H̃), as proposed in Sections 2 and 3, and we use the MSE, averaged over replications (AMSE), as the measure of estimation accuracy, denoting the rth replication by the subscript r. The simulation results are summarized in Tables 1-3. Looking at Tables 1-3, we can highlight the following.
On the one hand, because the proposed estimators are based on a first-difference transformation, the bias and the variance of both estimators do not depend on the values of the fixed effects, so their estimation accuracy is the same for different values of c 0 .
On the other hand, from Tables 1-3, we can see that both estimators perform well. For all T, as N increases, the AMSEs of both estimators decrease, as expected, in line with the asymptotic properties described previously. In addition, these results allow us to assess whether the oracle estimator generates an improvement in the rate of convergence. Specifically, for the univariate cases (Tables 1 and 3), the performance of both estimators is quite similar while, in contrast, in the multivariate case (Table 2), the rate of convergence of the oracle estimator is faster than that of the multivariate local linear estimator, as expected. In addition, as we can see in Table 2, the results of the local polynomial estimator reflect the 'curse of dimensionality': as the dimensionality of Z_it increases, the AMSE grows. Thus, the backfitting estimator has an efficiency gain over the local polynomial estimator, as expected.
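The accuracy measure used in the tables can be sketched as follows. The grid, the true curve, and the array shapes here are illustrative assumptions of ours; the point is only that the AMSE averages squared errors over both the evaluation grid and the Q replications.

```python
import numpy as np

# AMSE over Q replications (sketch): for each replication r, evaluate the
# fitted curve on a fixed grid and average squared errors against the true m.
def amse(estimates, truth):
    """estimates: (Q, n_grid) array of fitted values; truth: (n_grid,) array."""
    return np.mean((estimates - truth[None, :]) ** 2)

grid = np.linspace(0, np.pi, 50)
truth = np.sin(grid)
perfect = np.tile(truth, (10, 1))          # 10 replications with no error
assert amse(perfect, truth) == 0.0
noisy = perfect + 0.1                      # constant error of 0.1 everywhere
assert np.isclose(amse(noisy, truth), 0.01)
```

A decreasing AMSE as N grows, holding the grid fixed, is exactly the pattern reported in Tables 1-3.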

CONCLUSION
In this paper, we introduce a new technique to estimate varying coefficient models of unknown form in a panel data framework where individual effects are arbitrarily correlated with the explanatory variables in an unknown way. The resulting estimator is robust to misspecification of the functional form of the varying parameters, and we have shown that it is consistent and asymptotically normal. Furthermore, we have shown that it achieves the optimal rate of convergence for this type of problem and exhibits the so-called oracle efficiency property. Because the estimation procedure depends on the choice of a bandwidth matrix, we also provide a method to compute this matrix empirically. The Monte Carlo results indicate the good performance of the estimator in finite samples.

APPENDIX
Proof of Theorem 3.1: The Taylor theorem yields the expansion (A.2), with remainder term as denoted above. First, we analyse the bias term. In order to do this, we substitute (A.2) into (A.1) and note that $\mathrm{vec}(D_m(z))$ in (A.2) vanishes. We first analyse the asymptotic behaviour of $(1/2)\, e_1^{\mathrm{T}} (\widetilde{Z}^{\mathrm{T}} W \widetilde{Z})^{-1} \widetilde{Z}^{\mathrm{T}} W Q_m(z)$; later in the appendix, we do the same with the second term. For the sake of simplicity, we introduce the notation above and define the symmetric block matrix $A_{NTw}$. We show that (A.9) holds as $N$ tends to infinity. In order to do so, note that under the stationarity assumption, using iterated expectations and then, under Assumptions 3.1 and 3.4, a Taylor expansion, (A.9) follows as $N$ tends to infinity. All that remains in order to close the proof is to show that $\operatorname{Var}(A^{11}_{NTw}) \to 0$ as the sample size tends to infinity.
Note that, under Assumption 3.1, and then under Assumptions 3.4-3.6, if both $NT|H|$ and $N|H|$ tend to infinity, the variance tends to zero and (A.9) holds. Similarly, we can show (A.10) and (A.11), where $DB_{XX}(Z_1, Z_2)$ and $DB_{XX^{-1}}(Z_1, Z_2)$ are the $d \times dq$ gradient matrices defined above. Using the results shown in (A.9), (A.10), and (A.11), we obtain the leading term of the bias. Also, it is straightforward to show that the remaining terms are asymptotically equal to the expressions given above, where $\mathrm{diag}_r(\mathrm{tr}(\mathcal{H}_{m_r}(z)H))$ stands for a diagonal matrix with elements $\mathrm{tr}(\mathcal{H}_{m_r}(z)H)$, for $r = 1, \ldots, d$, and $i_d$ is a $d \times 1$ unit vector.

The second term of the bias expression is $e_1^{\mathrm{T}} (\widetilde{Z}^{\mathrm{T}} W \widetilde{Z})^{-1} \widetilde{Z}^{\mathrm{T}} W R(z)$. We already know the asymptotic expression for $(\widetilde{Z}^{\mathrm{T}} W \widetilde{Z})^{-1}$, so we now analyse the asymptotic behaviour of $\widetilde{Z}^{\mathrm{T}} W R(z)$, in view of (A.4) and (A.5). By (A.5) and Assumption 3.7, where $\varsigma(\eta)$ is the modulus of continuity of $(\partial^{2} m_r / \partial z_i \partial z_j)(z)$, and by the boundedness of $f$ and $B_{XX}$, it follows by dominated convergence that $E[E_{11}(z)] = o_p(\mathrm{tr}(H))$. Proceeding as in the proof of the previous result, we also find that $E[E_{12}(z)] = o_p(\mathrm{tr}(H))$. For $E_2(z)$, following the same lines as the proof of (A.22), the analogous result is easily shown.

To obtain an asymptotic expression for the variance, we first define the relevant block matrix and then compute its upper-right and lower-right blocks in turn.
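The bias and variance expansions above are all organized around the local linear form $e_1^{\mathrm{T}} (\widetilde{Z}^{\mathrm{T}} W \widetilde{Z})^{-1} \widetilde{Z}^{\mathrm{T}} W y$, i.e. the intercept of a kernel-weighted least-squares fit. A generic univariate sketch in Python (illustrative only; the paper applies this machinery to first-differenced panel data with a higher-dimensional kernel weight):

```python
import numpy as np

def local_linear(z0, z, y, h):
    """Local linear smoother at z0: returns e_1' (Z'WZ)^{-1} Z'W y,
    the intercept of the locally weighted least-squares fit, with a
    Gaussian kernel of bandwidth h supplying the weights."""
    k = np.exp(-0.5 * ((z - z0) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    Z = np.column_stack([np.ones_like(z), z - z0])   # local design matrix
    W = np.diag(k)                                   # kernel weight matrix
    beta = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)
    return beta[0]                                   # fitted value at z0
```

Because the local design matrix contains $z - z_0$, the estimator reproduces any linear function exactly, which is the source of the bias reduction relative to a local constant fit.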

Proof of Theorem 3.2: Let
Now, we show that (A.37) holds as $N$ tends to infinity. In order to show this, we establish the asymptotic normality of (A.39). Because (A.39) is a multivariate vector, we define a unit vector $d \in \mathbb{R}^{d(1+q)}$ and proceed via the Cramér-Wold device. By Theorem 3.1 and the conditions thereof, we find that $\sum_t |\mathrm{Cov}(\lambda_{i1}, \lambda_{it})| = o_p(1)$. Now define $\lambda^{*}_{n,i} = T^{-1/2} \sum_{t=1}^{T} \lambda_{it}$. For fixed $T$, the $\lambda^{*}_{n,i}$ are independent random variables; therefore, to show (A.37), it suffices to check the Liapunov condition. Using the Minkowski inequality and because of (A.41), we split $\lambda_{it}$ into two components, $\lambda_{1it}$ and $\lambda_{2it}$, and analyse them separately. The leading term in (A.43) does not depend on the sample, which closes the proof.

To obtain an asymptotic expression for the bias, we first calculate the relevant conditional expectation. Using standard properties of kernel density estimators, under Assumptions 3.1-3.9 and as $N$ tends to infinity, the stated expansion follows. Note that $B_{XX}(z)$ and $DB_{XX}(z)$ are defined as in the proof of Theorem 3.1, but the moment functions are now taken conditionally only on $Z_{it} = z$.