Direct Semiparametric Estimation of Fixed Effects Panel Data Varying Coefficient Models

In this paper we present a new technique to estimate varying coefficient models of unknown form in a panel data framework where individual effects are arbitrarily correlated with the explanatory variables in an unknown way. The resulting estimator is robust to misspecification in the functional form of the varying parameters, and it is shown to be consistent and asymptotically normal. Furthermore, by introducing a transformation, it achieves the optimal rate of convergence for this type of problem and exhibits the so-called "oracle" efficiency property. Since the estimation procedure depends on the choice of a bandwidth matrix, we also provide a method to compute this matrix empirically. Monte Carlo results indicate good performance of the estimator in finite samples.


Introduction
This paper is concerned with the estimation of varying coefficient panel data models. This type of specification consists of a linear regression model whose coefficients are allowed to vary with some exogenous continuous variables suggested by economic theory. For example, in the so-called returns-to-education problem, when estimating the elasticity of wages with respect to education, it has been pointed out (see Schultz (2003)) that marginal returns to education might vary with the level of working experience. Therefore, conditionally on a level of education, the elasticity changes according to the level of working experience.
Within this context, the issue of potential misspecification of the functional form of the varying coefficients has motivated the use of nonparametric estimation techniques in empirical studies. In most cases, the estimation of the functional form of the coefficients has been performed through standard techniques such as spline smoothers, series estimators or local polynomial regression estimators (see Su and Ullah (2011)). Although in most cases a direct application of these techniques yields correct inference, not much attention has been paid to the asymptotic behavior of these estimators under non-standard settings. Unfortunately, some of these settings are rather relevant in the empirical analysis of panel data models. One clear example is the presence in the econometric model of unobserved explanatory variables that, although time-invariant, can be statistically correlated with other explanatory variables in the model (fixed effects). The presence of heterogeneity of unknown form that is correlated with some explanatory variables is not an easy problem to deal with. In fact, under this type of heterogeneity any estimation technique suffers from the so-called incidental parameters problem (Neyman and Scott (1948)).
In order to obtain consistent estimators of the parameters of interest, one possible solution is to transform the model so as to remove the heterogeneity of unknown form. To be more precise, consider a linear panel data model where the heterogeneity $\mu_i$ is arbitrarily correlated with the covariates $X_{it}$ and/or $Z_{it}$:
$$Y_{it} = X_{it}^{T} m(Z_{it}) + \mu_i + \upsilon_{it}, \qquad i = 1, \cdots, N; \; t = 1, \cdots, T. \qquad (1)$$
Furthermore, the function $m(z)$ is unknown and needs to be estimated, and the $\upsilon_{it}$ are random errors. It is clear that any attempt to estimate $m(\cdot)$ directly through standard nonparametric estimation techniques will render inconsistent estimators of the underlying curve. The reason is that $E(\mu_i \mid X_{it} = x, Z_{it} = z) \neq 0$. A standard solution to this problem is to remove $\mu_i$ from (1) by taking a transformation and then estimate the unknown curve through the use of a nonparametric smoother. There exist several approaches to remove these effects. The simplest one is probably to take first differences, i.e.,
$$\Delta Y_{it} = X_{it}^{T} m(Z_{it}) - X_{i(t-1)}^{T} m(Z_{i(t-1)}) + \Delta\upsilon_{it}, \qquad i = 1, \cdots, N; \; t = 2, \cdots, T. \qquad (2)$$
Direct nonparametric estimation of $m(\cdot)$ has until now been considered rather cumbersome (see Su and Ullah (2011)). The reason is that, for each $i$, the conditional expectation $E\left[\Delta Y_{it} \mid Z_{it}, Z_{i(t-1)}, X_{it}, X_{i(t-1)}\right]$ in (2) contains a linear combination of $X_{it}^{T} m(Z_{it})$ for different $t$. This can be viewed as an additive function with the same functional form at different times.
In some special cases, consistent estimation of the quantities of interest has been provided in the literature. For $X_{it} \equiv 1$, (2) becomes a fully nonparametric additive model
$$\Delta Y_{it} = m(Z_{it}) - m(Z_{i(t-1)}) + \Delta\upsilon_{it}, \qquad i = 1, \cdots, N; \; t = 2, \cdots, T.$$
In this case, Mammen et al. (2009) consider the consistent estimation of nonparametric additive panel data models with both time and individual effects via backfitting. Furthermore, for $X_{it}^{T} m(Z_{it}) \equiv g(Z_{it}) + \widetilde X_{it}^{T} \beta_0$, where $X_{it} = (1, \widetilde X_{it}^{T})^{T}$ and $m(Z_{it}) = (g(Z_{it}), \beta_0^{T})^{T}$ for some real-valued $g(\cdot)$ and a vector $\beta_0$, the regression function in (2) becomes a semiparametric partially additive model
$$\Delta Y_{it} = \beta_0^{T} \Delta \widetilde X_{it} + g(Z_{it}) - g(Z_{i(t-1)}) + \Delta\upsilon_{it}, \qquad i = 1, \cdots, N; \; t = 2, \cdots, T.$$
Qian and Wang (2012) consider the marginal integration estimator of the nonparametric additive component resulting from the first differencing step, i.e. $G(Z_{it}, Z_{i(t-1)}) = g(Z_{it}) - g(Z_{i(t-1)})$.
The estimation procedure that we introduce in this paper generalizes the previous results to a rather general varying coefficient model such as the one specified in (1). It is based on applying a local approximation to the additive function $X_{it}^{T} m(Z_{it}) - X_{i(t-1)}^{T} m(Z_{i(t-1)})$. The same idea was proposed in a completely different context in Yang (2002). Since the estimator is based on local approximation properties, we investigate the behavior of the bias remainder term under fairly general conditions. This term, which is negligible in standard local linear regression techniques (see Fan and Gijbels (1995)), requires much more attention when dealing with first difference estimators. In fact, as has already been pointed out in Lee and Mukherjee (2008), the direct application of local linear regression techniques to first difference transformations in panel data models leads to biased estimators, and the bias does not degenerate even in large samples. Using a higher dimensional kernel weight, our estimation technique overcomes the problem of non-vanishing bias although, as expected, the variance term becomes larger. In order to obtain the standard rates of convergence for this type of problem, that is, to reduce the variance while holding the bias constant, we propose a one-step backfitting algorithm. Since the model has the form of an additive model, we also show that our estimator is oracle efficient, that is, the variance-covariance matrix of any of the components of our estimator is asymptotically the same as if the other component were known. Finally, we also propose a data-driven method to select the bandwidth parameter.
As pointed out above, other transformations are available in the literature to remove the heterogeneous effects. To our knowledge, for model (1), estimation of $m(\cdot)$ has been proposed in Sun et al. (2009) through the so-called Least Squares Dummy Variable approach. They estimate $m(\cdot)$ through an alternative specification in which $d_{ij} = 1$ if $i = j$ and $0$ otherwise. Based on this model, they propose a least-squares method combined with a local linear regression approach that produces a consistent estimator of the unknown coefficient curves. Compared to our method, their estimator exhibits a larger bias. In fact, their bias presents two terms. One results from the local approximation of $m(\cdot)$; it is also present in our estimator. The second term results from the unknown fixed effects and is zero only if they add the additional (strong) restriction that $\sum_i \mu_i = 0$.
The rest of the paper is organized as follows. In Section 2 we set up the model and the estimation procedure. In Section 3 we study its asymptotic properties and propose a transformation that provides an estimator that is oracle efficient and achieves optimal rates of convergence. Section 4 shows how to estimate the bandwidth matrix empirically, and Section 5 presents some simulation results. Finally, Section 6 concludes the paper. The proofs of the main results are collected in the Appendix.

Statistical model and Estimation procedure
To illustrate our technique, we start with the univariate case and later present our results for the multivariate case. Consider, then, (2) with $d = q = 1$. In this case, for any $z \in A$, where $A$ is a compact subset with nonempty interior of $\mathbb{R}$, one has a Taylor expansion which suggests estimating $m(z), m'(z), \cdots, m^{(p)}(z)$ by regressing the $\Delta Y_{it}$'s on the terms $X_{it}(Z_{it} - z)^{\lambda} - X_{i(t-1)}(Z_{i(t-1)} - z)^{\lambda}$, $\lambda = 0, 1, \cdots, p$, with kernel weights. The quantities of interest can then be estimated using a local linear regression estimator (see Fan and Gijbels (1995), Ruppert and Wand (1994) or Zhan-Qian (1996)), where $h$ is a bandwidth. For the case of a local constant approximation (i.e. the Nadaraya-Watson kernel regression estimator), $\widehat m(z)$ has a closed form; in the local linear regression case the regressor enters as a $2 \times 1$ vector. Note that in (4) we propose a bivariate kernel that also contains $Z_{i(t-1)}$, instead of considering only $Z_{it}$. The reason is that, if we consider only a kernel around $Z_{it}$, the transformed regression equation (2) would be localized only around $Z_{it}$ without considering all other values. Consequently, the distance between $Z_{is}$ (for $s \neq t$) and $z$ could not be controlled by the fixed bandwidth parameter, so the transformed remainder terms would not be negligible. The consequence would be a non-degenerate bias in this type of local linear estimator, which is removed by considering a local approximation around the pair $(Z_{it}, Z_{i(t-1)})$. In Theorem 3.1 we show that the local linear estimator with the bivariate kernel has the same bias rate as standard local linear smoothers, that is, a bias of order $O(h^2)$. Unfortunately, the well-known trade-off between bias and variance appears and, although the introduction of this bivariate kernel removes the non-vanishing bias, it enlarges the variance term, which becomes of order $O\left(\frac{1}{NTh^2}\right)$. This is also emphasized in Theorem 3.1. Of course, this affects the achievable rate of convergence for this type of problem, which slows down to $\sqrt{NTh}$.
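To fix ideas, the following is a minimal Python sketch of this first-stage estimator for the univariate case. It is purely illustrative: the function names, the Epanechnikov kernel (any kernel satisfying (A.8) would do) and the balanced-panel array layout are our own assumptions, not part of the paper.

```python
import numpy as np

def epanechnikov(u):
    """Compactly supported kernel, consistent with assumption (A.8)."""
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def first_stage_llr(y, x, z, z0, h):
    """First-stage local linear estimator of (m(z0), m'(z0)), d = q = 1.

    y, x, z : (N, T) arrays holding Y_it, X_it, Z_it. The first
    differences dY_it are regressed on the locally linearized design,
    weighted by the bivariate product kernel in (Z_it, Z_i(t-1)).
    """
    dy = (y[:, 1:] - y[:, :-1]).ravel()              # Delta Y_it
    xt, xs = x[:, 1:].ravel(), x[:, :-1].ravel()     # X_it, X_i(t-1)
    zt, zs = z[:, 1:].ravel(), z[:, :-1].ravel()     # Z_it, Z_i(t-1)
    # Columns for alpha = m(z0) and delta = m'(z0) in the expansion
    R = np.column_stack([xt - xs,
                         xt * (zt - z0) - xs * (zs - z0)])
    # Bivariate kernel weight around the pair (Z_it, Z_i(t-1))
    w = epanechnikov((zt - z0) / h) * epanechnikov((zs - z0) / h) / h ** 2
    WR = R * w[:, None]
    # Weighted least squares normal equations
    coef, *_ = np.linalg.lstsq(WR.T @ R, WR.T @ dy, rcond=None)
    return coef                                      # (m_hat, m_prime_hat)
```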
In order to recover the desirable rate of $\sqrt{NTh}$ we propose a transformation that is basically a one-step backfitting algorithm. Let us denote
$$\Delta Y_{it}^{*} = \Delta Y_{it} + X_{i(t-1)} m(Z_{i(t-1)}). \qquad (8)$$
By substituting (2) into (8) we get
$$\Delta Y_{it}^{*} = X_{it} m(Z_{it}) + \Delta\upsilon_{it}. \qquad (9)$$
As can be seen from (9), estimation of $m(\cdot)$ is now a one-dimensional problem, and therefore we can again use a local linear least squares estimation procedure with univariate kernel weights. However, there is still a problem to be solved: in (8) the term $m(Z_{i(t-1)})$ is unknown. We therefore replace it by the initial local linear regression estimator, obtaining $\Delta \widehat Y_{it}^{*}$ as in (10). The optimization problem to solve is then
$$\min_{\alpha, \delta} \sum_{i=1}^{N} \sum_{t=2}^{T} \left[\Delta \widehat Y_{it}^{*} - \alpha X_{it} - \delta X_{it}(Z_{it} - z)\right]^{2} K_{h}(Z_{it} - z). \qquad (11)$$
The solutions for $\alpha = \widehat m(z)$ and $\delta = \widehat m'(z)$ in (11) are the oracle efficient local linear least squares estimators. This is shown in Theorem 3.3. Now that our estimation procedure has been fully explained for the univariate case, we extend our results to the multivariate case, that is, to general $d$ and $q$ in (1). In this case, the local linear estimator of the function $m(\cdot)$ and its partial derivatives is the solution to (12), where the $dq \times 1$ vector collects the first derivatives of $m(\cdot)$ and $H$ is a $q \times q$ symmetric positive definite bandwidth matrix.
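Continuing the illustrative Python sketch above (same hypothetical names; h1 is the pilot bandwidth of the first stage, h2 the second-stage bandwidth), the univariate oracle step could look as follows:

```python
def oracle_llr(y, x, z, z0, h1, h2):
    """One-step backfitting ("oracle") estimator of m(z0), d = q = 1.

    Adds back the pilot fit at Z_i(t-1), as in (10), so that the
    regression becomes one-dimensional in Z_it, and then smooths
    with a univariate kernel around Z_it only.
    """
    dy = (y[:, 1:] - y[:, :-1]).ravel()
    xt, xs = x[:, 1:].ravel(), x[:, :-1].ravel()
    zt, zs = z[:, 1:].ravel(), z[:, :-1].ravel()
    # Pilot estimates m_hat(Z_i(t-1); h1) from the first stage
    m_lag = np.array([first_stage_llr(y, x, z, zz, h1)[0] for zz in zs])
    dy_star = dy + xs * m_lag                   # Delta Y*_it
    R = np.column_stack([xt, xt * (zt - z0)])   # alpha, delta columns
    w = epanechnikov((zt - z0) / h2) / h2       # univariate kernel weight
    WR = R * w[:, None]
    coef, *_ = np.linalg.lstsq(WR.T @ R, WR.T @ dy_star, rcond=None)
    return coef[0]                              # oracle estimate of m(z0)
```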
It is easy to verify that the solution to the minimization problem in (12) can be written in matrix form, and the local weighted linear least squares estimator of $m(z)$ is then defined as in (14). Note that the dimensions of $W$ and $\mathcal{Z}$ are $N(T-1) \times N(T-1)$ and $N(T-1) \times d(1+q)$, respectively. Finally, there are several reasons to choose local linear least squares estimators over other candidates. First, the form in (14) suggests that this estimator is found by fitting a plane to the data using weighted least squares. The weights are chosen according to the kernel and the bandwidth matrix $H$. As discussed in Ruppert and Wand (1994), if a Gaussian kernel is chosen, then the weight given to $Z_{it}$ is the value of a Gaussian density centered at $z$, which has ellipsoidal contours of the form $(Z_{it} - z)^{T} H^{-1} (Z_{it} - z) = c$, for $c > 0$. Clearly, the farther $Z_{it}$ is from $z$, the less weight it receives. However, $H$ controls both the size and the orientation of the ellipsoids at a given density level, and therefore it also controls the amount and direction of the smoothing. Often, instead of taking a full matrix $H$, one adopts the simpler form $H = \mathrm{diag}\{h_1^2, \cdots, h_q^2\}$. With a diagonal bandwidth matrix the ellipsoids have their axes in the same direction as the coordinate axes, whereas for a general $H$ they correspond to the eigenvectors of $H$; depending on the shape of $m(\cdot)$, there are situations where having a full bandwidth matrix is advantageous. Another important advantage of local linear least squares kernel estimators is that the asymptotic bias and variance expressions are particularly appealing and appear to be superior to those of the Nadaraya-Watson and other nonparametric estimators. In particular, Fan (1993) shows that the local linear least squares estimator has an important asymptotic minimax property. Furthermore, unlike the Nadaraya-Watson and other nonparametric estimators, the bias and variance of (14) near the boundary of the density of $Z$ are of the same order of magnitude as in the interior. This is a very attractive property because, in applications, the boundary region might comprise a large proportion of the data.
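As an illustration of the role of $H$, the following Python fragment (our own sketch, not code from the paper) computes Gaussian kernel weights with a full bandwidth matrix; replacing $H$ by a diagonal matrix aligns the contour axes with the coordinate axes, as described above.

```python
import numpy as np

def gaussian_weights(z, z0, H):
    """Kernel weights K_H(z - z0) for a full, symmetric positive
    definite q x q bandwidth matrix H.

    z : (n, q) array of covariate points; z0 : (q,) evaluation point.
    The weights are constant on the ellipsoids
    (z - z0)^T H^{-1} (z - z0) = c.
    """
    u = z - z0
    Hinv = np.linalg.inv(H)
    quad = np.einsum('ij,jk,ik->i', u, Hinv, u)   # quadratic form per row
    norm = np.sqrt((2 * np.pi) ** z.shape[1] * np.linalg.det(H))
    return np.exp(-0.5 * quad) / norm

# A diagonal H = diag(h1^2, ..., hq^2) is the common simplification:
# H = np.diag([0.2 ** 2, 0.3 ** 2])
```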

Asymptotic properties and the oracle efficient estimator
In this section we investigate some preliminary asymptotic properties of our estimator. In order to do so, we need the following assumptions.
(A.1) Let $\{(Y_{it}, X_{it}, Z_{it})\}_{i=1,\cdots,N;\, t=1,\cdots,T}$ be a set of $\mathbb{R}^{d+q+1}$-valued random variables, independent and identically distributed across $i$ for each fixed $t$ and strictly stationary over $t$ for fixed $i$. They are a sample following (1). Furthermore, let $f_{Z_{1t}}(\cdot)$, $f_{Z_{1t},Z_{1(t-1)}}(\cdot,\cdot)$ and $f_{Z_{1t},Z_{1(t-1)},Z_{1(t-2)}}(\cdot,\cdot,\cdot)$ be, respectively, the probability density functions of $Z_{1t}$, $(Z_{1t}, Z_{1(t-1)})$ and $(Z_{1t}, Z_{1(t-1)}, Z_{1(t-2)})$. All density functions are continuously differentiable in all their arguments and are bounded from above and below at any point of their support.
(A.2) The random errors $v_{it}$ are independent and identically distributed, with zero mean and homoscedastic variance $\sigma_v^2 < \infty$. They are also independent of $X_{it}$ and $Z_{it}$ for all $i$ and $t$. Furthermore, $E|v_{it}|^{2+\delta} < \infty$ for some $\delta > 0$.
(A.3) $\mu_i$ has zero mean and finite variance. Furthermore, $\mu_i$ can be arbitrarily correlated with both $X_{it}$ and $Z_{it}$ with an unknown correlation structure.
(A.4) The moment functionals involved in the bias and variance expressions are bounded and uniformly continuous on their support.
(A.5) The conditional moment functionals of order $2+\delta$, for some $\delta > 0$, are bounded and uniformly continuous at any point of their support.
(A.7) Let $z$ be an interior point in the support of $f_{Z_{1t}}$. All second-order derivatives of $m_1(\cdot), m_2(\cdot), \cdots, m_d(\cdot)$ are bounded and uniformly continuous.
(A.8) The kernel function $K$ is a compactly supported, bounded kernel such that $\int u u^{T} K(u)\,du = \mu_2(K) I$, where $\mu_2(K) \neq 0$ is a scalar and $I$ is the $q \times q$ identity matrix. In addition, all odd-order moments of $K$ vanish, that is, $\int u_1^{\iota_1} \cdots u_q^{\iota_q} K(u)\,du = 0$ for all nonnegative integers $\iota_1, \cdots, \iota_q$ whose sum is odd.
(A.9) The bandwidth matrix $H$ is symmetric and strictly positive definite. Furthermore, each entry of the matrix tends to zero as $N$ tends to infinity in such a way that $N|H| \to \infty$.
As the reader may notice, all assumptions are rather standard in the nonparametric regression analysis of panel data models. Assumption (A.1) establishes standard features of the sample and the data generating process. The individuals are independent, and for a fixed individual we allow for correlation along time. Other time series structures might also be considered, such as strong mixing conditions (see Cai and Li (2008)) or nonstationary time series data (see Cai et al. (2009)). Mixing conditions are usually imposed to make the covariances of the estimator tend to zero at a faster rate. In our case this was not needed because of the assumptions imposed on $v_{it}$. On the other hand, we believe nonstationary processes exceed the scope of this paper. Note also that the marginal densities are assumed to be bounded from above and below. This assumption can be relaxed at the price of increasing the mathematical complexity of the proofs.
Assumption (A.2) is also standard for first difference estimators (see Wooldridge (2002) for the fully parametric case). Furthermore, independence between the $v_{it}$ errors and the $X_{it}$ and/or $Z_{it}$ variables is assumed without loss of generality. We could relax this assumption by allowing some dependence through second order moments. For example, heteroskedasticity of unknown form can be allowed and, in fact, under more complex structures in the variance-covariance matrix, a transformation of the estimator proposed in You et al. (2010) can be developed in our setting. This type of assumption also rules out the existence of endogenous explanatory variables and imposes strict exogeneity conditions. If that were the case, an instrumental variables approach such as the one proposed in Cai and Li (2008) or Cai and Xiong (2012) would be needed. Assumption (A.3) imposes the so-called fixed effects. Note that this assumption is much weaker than the one introduced in Sun et al. (2009) for their Least Squares Dummy Variable approach to work. Basically, they impose a smooth relationship between the heterogeneity and the explanatory variables and, to avoid an additional bias term, they need $\sum_i \mu_i = 0$.
Assumptions (A.4) and (A.5) are smoothness conditions on moment functionals. Assumption (A.6) is equivalent to a standard rank condition for the identification of this type of model. Assumptions (A.7) to (A.9) are standard in local linear regression estimation (see Ruppert and Wand (1994)). Finally, all our results hold straightforwardly in the random coefficient setting.
Under these assumptions we now establish some results on the conditional mean and the conditional variance of the local linear least squares estimator.
Theorem 3.1 Assume conditions (A.1)-(A.9) hold. Then, as $N$ tends to infinity and $T$ is fixed, we obtain the bias and variance expressions stated below, where $\mathrm{diag}_r\left\{\mathrm{tr}\left\{\mathcal{H}_{m_r}(z) H\right\}\right\}$ stands for a diagonal matrix with elements $\mathrm{tr}\left\{\mathcal{H}_{m_r}(z) H\right\}$, for $r = 1, \cdots, d$, and $i_d$ is a $d \times 1$ unit vector.
The proof of this result is given in the Appendix. To illustrate the asymptotic behavior of our estimator, we state the result for the case $d = q = 1$ and $H = h^2 I$, where the above expressions simplify. Furthermore, if $\mu_2(K_u) = \mu_2(K_v)$, the bias term takes a particularly simple form, and the variance simplifies accordingly. Note that in the standard case where $\mu_2(K_u) = \mu_2(K_v)$ we obtain a nice result for the bias: the resulting asymptotic bias has the same expression as in the standard local linear estimator.
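For orientation, the standard local linear expressions that the text alludes to have, under $d = q = 1$, $H = h^2 I$ and $\mu_2(K_u) = \mu_2(K_v) = \mu_2(K)$, the following familiar shape. This is a sketch consistent with the orders stated in Section 2; the exact variance constant is the one given in Theorem 3.1:

```latex
\operatorname{bias}\left\{\widehat m(z;h)\right\}
  \approx \frac{h^{2}}{2}\,\mu_{2}(K)\, m''(z),
\qquad
\operatorname{Var}\left\{\widehat m(z;h)\right\}
  = O\!\left(\frac{1}{N T h^{2}}\right).
```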
As has already been pointed out in other works, the leading terms in both the bias and the variance do not depend on the sample, and therefore we can consider these terms as playing the role of the unconditional bias and variance. Furthermore, we believe that the conditions established on $H$ are sufficient to show that the other terms are $o_p(1)$, and it is therefore possible to show the following result for the asymptotic distribution of $\widehat m(z; H)$ as $N$ tends to infinity and $T$ is fixed. The proof of this result is given in the Appendix.
Note that the rate at which our estimator converges is $\sqrt{NT|H|}$. Under the conditions established above, our estimator is both consistent and asymptotically normal. However, its rate of convergence is sub-optimal, since the optimal rate of convergence for this type of estimator is $\sqrt{NT|H|^{1/2}}$. As indicated in the previous section, in order to achieve optimality we propose to reduce the dimensionality of the problem by redefining $\Delta Y_{it}^{*}$ as in (10), now for the multivariate case, as in (15). In expression (15), $\widehat m(Z_{i(t-1)}; H)$ is the first-step local linear estimator obtained in (14). We then propose to estimate $m(Z_{it})$ as the solution to problem (16), whose solution for $\alpha = \widehat m(z)$ is given in closed form. In order to establish the asymptotic properties of this estimator we need the following assumptions on the second-stage bandwidth $\widetilde H$ and its relationship with $H$:
(B.1) The bandwidth matrix $\widetilde H$ is symmetric and strictly positive definite. Furthermore, each entry of the matrix tends to zero as $N$ tends to infinity in such a way that $N|\widetilde H| \to \infty$.
(B.2) The bandwidth matrices $H$ and $\widetilde H$ must fulfill: (i) $N|H||\widetilde H|/\log(N) \to \infty$; (ii) $\mathrm{tr}(H)/\mathrm{tr}(\widetilde H) \to 0$, as $N$ tends to infinity.
In general, for the kernel function and the conditional moments and densities we need both the smoothness and the boundedness conditions already established in assumptions (A.1) to (A.8). They are required in order to use uniform convergence results such as those established in Masry (1996). It is then possible to show the following result, where $\mathrm{diag}_r\left\{\mathrm{tr}\left\{\mathcal{H}_{m_r}(z) \widetilde H\right\}\right\}$ stands for a diagonal matrix with elements $\mathrm{tr}\left\{\mathcal{H}_{m_r}(z) \widetilde H\right\}$, for $r = 1, \cdots, d$, and $i_d$ is a $d \times 1$ unit vector.
The proof of Theorem 3.3 is given in the Appendix.
Finally, focusing on the leading bias and variance terms of Theorems 3.1 and 3.3 and following Ruppert and Wand (1994), note that each entry of $\mathcal{H}_m(z)$ is a measure of the curvature of $m(\cdot)$ at $z$ in a particular direction. Thus, we can intuitively conclude that the bias increases with higher curvature and with more smoothing, and that this is well captured by the leading bias term. In terms of the variance, we can conclude that it is penalized by a higher conditional variance of $Y$ given $Z = z$ and by sparser data near $z$.

Bandwidth selection
In this section we propose a method to estimate the variable bandwidth matrix $H$ for each estimator. Theoretically, one option would be to minimize some measure of discrepancy, such as the Mean Squared Error (MSE), with respect to $H$, so that the optimal bandwidth matrix could be obtained as its minimizer. Unfortunately, some terms of the MSE are unknown, so this result is not empirically useful and it is necessary to resort to an alternative approach. Following Zhang and Lee (2000), we propose to obtain an estimator of the optimal $H$ that minimizes an estimate of the MSE, which we call the estimated variable bandwidth matrix ($\widehat H$).
Let $X$, $Z$ be vectors of random variables such that $X = (X_1, \cdots, X_d)$ and $Z = (Z_1, \cdots, Z_q)$, and let $D$ be the observed covariates vector, $D = (Z_{111}, \cdots, Z_{1NT}, \cdots, Z_{q11}, \cdots, Z_{qNT}, X_{111}, \cdots, X_{1NT}, \cdots, X_{d11}, \cdots, X_{dNT})^{T}$. Let $\mathrm{bias}(\widehat m_r(z; H) \mid D)$ be the conditional bias of the local linear estimator of order $p = 1$, $\widehat m_r(z; H)$, given $D$, and denote $b(z; H) = \mathrm{bias}(\widehat m(z; H) \mid D)$. The proposed MSE combines the squared conditional bias and the conditional variance, weighted by $\Omega(Z) = E\left(X_{it} X_{it}^{T} \mid Z\right)$. However, given that $MSE\{\widehat m(z; H) \mid D\}$ also depends on some unknown quantities, to obtain $\widehat H$ we first need to estimate both the conditional bias and the conditional variance. Taking the idea of Zhang and Lee (2000) as a benchmark, we need an additional assumption:
(D.1) Let $z$ be an interior point in the support of $f_{Z_{1t}}$. All fourth-order derivatives of $m_1(\cdot), m_2(\cdot), \cdots, m_d(\cdot)$ are continuous.
Thus, in the bandwidth selection procedure that we propose, we first calculate the conditional bias based on a Taylor expansion of order $p + g$. Combining this approximation with (14), the resulting $dq^2$ and $dq^{(p+g)}$ vectors contain the vec operator of the Hessian and of the $(p+g)$-th derivative matrix of the $r$-th component of $m(\cdot)$, respectively. Thereby, the conditional bias to be estimated depends on a vector of unknown functions that must itself be estimated. For convenience we take $g = 2$, so that $\mathcal{Z}_3$ is an $N(T-1) \times dq^2(1+q)$ matrix and $D_3$ a $dq^2(1+q) \times 1$ vector.
To estimate the vector $D_3$ we use a local polynomial regression of order $g = 3$ ($g > 1$) with a bandwidth matrix $H^{*}$, so that $\left[0 \; \cdots \; I_{dq^k \times dq^k} \; \cdots \; 0_{dq^k \times d\left(\sum_{g=k+1}^{4} q^g\right)}\right]$ is a $dq^k \times d\left(p + \sum_{g=1}^{4} q^g\right)$ selection matrix, where $I$ is an identity matrix and $0$ a matrix of zeros.
Note that the initial bandwidth matrix $H^{*}$ can be obtained as the minimizer of some residual squares criterion (RSC), as in Fan and Gijbels (1995), and then the conditional bias can be estimated.
Secondly, we calculate the conditional variance. To do so we need to estimate the unknown quantity $\sigma_v^2$, for which we use the normalized weighted residual sum of squares from a $(p+g)$th-order polynomial fit, where $\Delta Y_{it}^{*} = \mathcal{Z}_{(p+g)it}^{T} D + \Delta v_{it}^{*}$, $g = 3$, and $\Delta v_{it}^{*}$ is the idiosyncratic error term. Once $\sigma_v^2$ is estimated, the conditional variance can be computed.
To obtain the $\Omega(z)$ matrix it is necessary to estimate each element $a_{rr'}(z)$ of this matrix, where $r, r' = 1, \ldots, d$ and $\Omega(z) = E\left(X_{rit} X_{r'it}^{T} \mid Z = z\right)$, through a fully nonparametric model. For that, we can use a standard nonparametric technique such as the Nadaraya-Watson estimator (see Härdle (1990)). The minimizer of the resulting estimated MSE is $\widehat H$, the estimated variable bandwidth matrix for the local linear estimator with kernel weights.
Meanwhile, to obtain the estimated variable bandwidth matrix for the "oracle" estimator, we adapt the previous procedure to the particularities of this estimator, so that now $\widehat{\widetilde H}$ is selected by solving the analogous minimization problem, with $g = 2$.
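To make the logic of the selector concrete, here is a schematic Python fragment. It is a sketch only: est_bias and est_var stand for the plug-in estimates built from the $(p+g)$th-order pilot fits described above, and the scalar bandwidth grid is a simplification of the matrix-valued problem.

```python
import numpy as np

def select_bandwidth(z_grid, h_grid, est_bias, est_var):
    """Plug-in bandwidth selection by grid search: choose the h that
    minimizes the estimated MSE (estimated squared bias plus estimated
    variance), averaged over the evaluation points z_grid.

    est_bias(z, h), est_var(z, h): user-supplied plug-in estimates.
    """
    best_h, best_mse = None, np.inf
    for h in h_grid:
        mse = np.mean([est_bias(z, h) ** 2 + est_var(z, h) for z in z_grid])
        if mse < best_mse:
            best_h, best_mse = h, mse
    return best_h
```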

Monte Carlo experiment
In this section we report some Monte Carlo simulation results to examine whether the proposed estimators perform reasonably well in finite samples when the $\mu_i$ are fixed effects.
We consider the following varying-coefficient nonparametric models, where $X_{dit}$ and $Z_{qit}$ are scalar random variables, $v_{it}$ is an i.i.d. N(0,1) random variable, and $m(\cdot)$ is a pre-specified function to be estimated. The observations follow a data generating process where $Z_{qit} = w_{qit} + w_{qi(t-1)}$, with $w_{qit}$ an i.i.d. uniform $[0, \pi/2]$ random variable, and $X_{dit} = 0.5 X_{di(t-1)} + \xi_{it}$, with $\xi_{it}$ i.i.d. N(0,1).
We consider three different cases of study, where the chosen functional forms are $m_1(Z_{1it}) = \sin(Z_{1it}\pi)$, $m_1(Z_{1it}, Z_{2it}) = \sin((Z_{1it}, Z_{2it})\pi)$ and $m_2(Z_{1it}) = \exp(-Z_{1it}^2)$, and we experiment with two specifications for the fixed effects: (a) $\mu_{1i}$ depends on $Z_{1it}$, where the dependence is imposed by generating $\mu_{1i} = c_0 \bar Z_{1i\cdot} + u_i$ for $i = 2, \cdots, N$ and $\bar Z_{1i\cdot} = T^{-1}\sum_t Z_{1it}$; (b) $\mu_{2i}$ depends on $(Z_{1it}, Z_{2it})$ through the generating process $\mu_{2i} = c_0 \bar Z_{i\cdot} + u_i$ for $i = 2, \cdots, N$. In both cases $u_i$ is an i.i.d. N(0,1) random variable and $c_0 = 0.5$ controls the correlation between the unobservable individual heterogeneity and some of the regressors of the model.
In the experiment we use 1000 Monte Carlo replications (Q). The number of periods (T) is fixed at three, while the number of cross-sections (N) is varied over 50, 100 and 200. In addition, the Gaussian kernel has been used and, as in Henderson et al. (2008), the bandwidth is chosen as $H = \sigma_z (N(T-1))^{-1/5}$, where $\sigma_z$ is the sample standard deviation of $\{Z_{qit}\}_{i=1,t=2}^{N,T}$.
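For concreteness, a minimal Python sketch of design (a) in the univariate case ($d = q = 1$, $m_1(z) = \sin(z\pi)$) follows; the function name and array layout are our own illustrative choices:

```python
import numpy as np

def generate_panel(N, T, c0=0.5, seed=None):
    """Sketch of design (a): d = q = 1, m(z) = sin(z * pi),
    fixed effects correlated with Z through c0."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(0.0, np.pi / 2, size=(N, T + 1))
    Z = w[:, 1:] + w[:, :-1]                 # Z_it = w_it + w_i(t-1)
    X = np.empty((N, T))
    X[:, 0] = rng.standard_normal(N)
    for t in range(1, T):
        X[:, t] = 0.5 * X[:, t - 1] + rng.standard_normal(N)  # AR(1)
    mu = c0 * Z.mean(axis=1) + rng.standard_normal(N)  # fixed effects
    v = rng.standard_normal((N, T))          # idiosyncratic errors
    Y = X * np.sin(Z * np.pi) + mu[:, None] + v
    return Y, X, Z

# Rule-of-thumb bandwidth used in the experiment:
# Y, X, Z = generate_panel(100, 3, seed=0)
# h = Z.std() * (100 * (3 - 1)) ** (-1 / 5)
```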
We report estimation results for both proposed estimators and, as a measure of their estimation accuracy, we use the Integrated Squared Error (ISE). Denoting by the subscript $r$ the $r$-th replication, the ISE can be approximated by the Averaged Mean Squared Error (AMSE). The simulation study analyzes the finite sample behavior of the multivariate local linear estimator with kernel weights, $\widehat m(z; H)$, and the oracle estimator, $\widehat m^{(1)}(z; \widetilde H)$, proposed in Sections 2 and 3. The results are summarized in Tables 1, 2 and 3, from which we can highlight the following.
On the one hand, since the proposed estimators are based on a first difference transformation, the bias and the variance of both estimators do not depend on the values of the fixed effects, so their estimation accuracy is the same for different values of $c_0$.
On the other hand, from Tables 1, 2 and 3 we can see that both estimators perform quite well. For all T, as N increases the AMSE of both estimators decreases, as expected. This is due to the asymptotic properties of the estimators described previously. In addition, these results also allow us to assess whether the oracle estimator generates an improvement in the rate of convergence. Specifically, in the univariate case (Tables 1 and 3), the performance of both estimators is quite similar while, on the contrary, in the multivariate case (Table 2), the rate of convergence of the oracle estimator is faster than that of the multivariate local linear estimator, as expected. In addition, as we can see in Table 2, the results for the local polynomial estimator reflect the "curse of dimensionality", given that the AMSE grows as the dimensionality of $Z_{it}$ increases. Thus, the backfitting estimator has an efficiency gain over the local polynomial estimator, as we conjectured.

Conclusion
This paper introduces a new technique to estimate varying coefficient models of unknown form in a panel data framework where individual effects are arbitrarily correlated with the explanatory variables in an unknown way. The resulting estimator is robust to misspecification in the functional form of the varying parameters, and we have shown that it is consistent and asymptotically normal. Furthermore, we have shown that it achieves the optimal rate of convergence for this type of problem and exhibits the so-called oracle efficiency property. Since the estimation procedure depends on the choice of a bandwidth matrix, we have also provided a method to compute this matrix empirically. Monte Carlo results indicate good performance of the estimator in finite samples.

Appendix
Proof of Theorem 3.1. Taking conditional expectations in (14) under assumption (A.2), and noting that $E(v_{it} \mid X_{11}, \ldots, X_{NT}, Z_{11}, \ldots, Z_{NT}) = 0$, $t = 2, \cdots, T$, $i = 1, \cdots, N$, we obtain $E\{\widehat m(z; H) \mid X_{11}, \ldots, X_{NT}, Z_{11}, \ldots, Z_{NT}\}$. Taylor's Theorem yields the expansion in (25), with the remainder term written accordingly. We first analyze the bias term. Note that, substituting (25) into (24), the term $D_m(z)$ in (25) vanishes, and then we first analyze the asymptotic behavior of $\frac{1}{2} e_1^{T}\left(\mathcal{Z}^{T} W \mathcal{Z}\right)^{-1} \mathcal{Z}^{T} W Q_m(z)$; later in the proof we do the same with the second term. For the sake of simplicity, define the symmetric block matrix introduced above. We first show that (32) holds as $N$ tends to infinity. In order to do so, note that under the stationarity assumption and using iterated expectations, and then under assumptions (A.1) and (A.4) and a Taylor expansion, the expectation converges to the required limit as $N$ tends to infinity. All that remains to close the proof of (32) is to show that $\mathrm{Var}\left(A_{11}^{NT}\right) \to 0$ as the sample size tends to infinity. Under assumption (A.1) and assumptions (A.4) to (A.6), if both $NT|H|$ and $N|H|$ tend to infinity, the variance tends to zero and (32) holds.
Similarly, one can show that $DB_{\Delta XX}(Z_1, Z_2)$ and $DB_{\Delta XX^{-1}}(Z_1, Z_2)$ are respectively $d \times dq$ gradient matrices.

Finally, using the results shown in (32), (33) and (34), we obtain the leading term of the bias. It is also straightforward to show that the corresponding terms are asymptotically equal to the expression involving $\mathrm{diag}_r\left\{\mathrm{tr}\left\{\mathcal{H}_{m_r}(z) H\right\}\right\}$, which stands for a diagonal matrix with elements $\mathrm{tr}\left\{\mathcal{H}_{m_r}(z) H\right\}$, for $r = 1, \cdots, d$, where $i_d$ is a $d \times 1$ unit vector. The remaining terms are treated analogously.

The second term of the bias expression is
We already know the asymptotic expression for $(\mathcal{Z}^{T} W \mathcal{Z})^{-1}$, so we now analyze the asymptotic behavior of $\mathcal{Z}^{T} W R(z)$. According to (27) and (28), we will show that the corresponding terms vanish as $N$ tends to infinity. In order to prove this, note that by (28) and assumption (A.7) the bound involves $\varsigma(\eta)$, the modulus of continuity of $\frac{\partial^2 m_d}{\partial z_i \partial z_j}(z)$. Hence, by the boundedness of $f$, $B_{\Delta XX}$ and $B_{\Delta XX^{-1}}$, and proceeding similarly for the other terms, we also get that $E\{E_{12}(z)\} = o_p(\mathrm{tr}\{H\})$. For $E_2(z)$, following the same lines as in the proof of (45), it is easy to show the analogous bound. Substituting (35), (36), (38) and (45) into (30), the asymptotic bias can be written as stated in the theorem. To obtain an asymptotic expression for the variance, let us first define the $(N(T-1) \times 1)$ vector $\Delta v = (\Delta v_1, \cdots, \Delta v_N)^{T}$, where $\Delta v_i = (\Delta v_{i2}, \ldots, \Delta v_{iT})^{T}$, and let $E[\Delta v \Delta v^{T}]$ denote its covariance matrix. Then, taking into account that $\widehat m(z; H) - E\{\widehat m(z; H) \mid X_{11}, \ldots, X_{NT}, Z_{11}, \ldots, Z_{NT}\}$ is linear in $\Delta v$, the variance of $\widehat m(z; H)$ can be written accordingly. Based on assumption (A.2) and the fact that the $v_{it}$ are i.i.d., the upper left entry, the upper right block and the lower right block follow as displayed.

Now we show the asymptotic normality result as $N$ tends to infinity. In order to show this, let $\Delta v = [\Delta v_{11}, \cdots, \Delta v_{NT}]^{T}$. We are going to show the asymptotic normality of the quantity in (63). Since (63) is a multivariate vector, we define a unit vector $d \in \mathbb{R}^{d(1+q)}$ and work with the corresponding linear combination. By Theorem 3.1 and the conditions thereof, the normalization behaves as required. Define now $\lambda_{n,i}^{*} = T^{-1/2} \sum_{t=1}^{T} \lambda_{it}$. For fixed $T$, the $\lambda_{n,i}^{*}$ are independent random variables. Therefore, to show (61) it suffices to check Lyapunov's condition. By Minkowski's inequality and because of (65), we split $\lambda_{it}$ into two components, $\lambda_{1it}$ and $\lambda_{2it}$, and analyze them separately.
The $M^{(1)}$ terms are $N(T-1) \times 1$ vectors. We can approximate $M^{(1)}$ through a Taylor expansion and, therefore, to obtain an asymptotic expression for the bias we first calculate its conditional expectation. Using standard properties of kernel density estimators, under conditions (A.1) to (A.9) and as $N$ tends to infinity, the stated expansions hold. Note that $B_{XX}(z)$ and $DB_{XX}(z)$ are defined as in the proof of Theorem 3.1, but the moment functions are now taken conditionally only on $Z_{it} = z$.
Following exactly the same lines as in the proof of the variance term in Theorem 3.1, we get the first term as $N$ tends to infinity. In order to calculate the asymptotic order of $I_2$, we just need to compute $\frac{1}{NT} \mathcal{Z}^{(1)T} W^{(1)} E\left[v v^{T} \mid X_{11}, \cdots, X_{NT}, Z_{11}, \cdots, Z_{NT}\right] W^{(1)T} \mathcal{Z}^{(1)}$.
The upper left entry is
$$\frac{1}{NT} \sum_{i} \sum_{t,s} X_{it} X_{i(t-1)}^{T}\, E\left[r\left(Z_{i(t-1)}; H\right) r\left(Z_{i(s-1)}; H\right)^{T} \,\middle|\, X_{11}, \cdots, X_{NT}, Z_{11}, \cdots, Z_{NT}\right] X_{i(s-1)} X_{is}^{T} K_{it} K_{is}.$$
Applying the Cauchy-Schwarz inequality for covariance matrices, (77) is bounded by
$$(NT)^{-1} \sum_{i} \sum_{t,s} X_{it} X_{i(t-1)}^{T}\, \mathrm{vec}^{1/2}\,\mathrm{diag}\, E\left[r\left(Z_{i(t-1)}; H\right) r\left(Z_{i(t-1)}; H\right)^{T} \middle| X, Z\right] \times \mathrm{vec}^{1/2}\,\mathrm{diag}\, E\left[r\left(Z_{i(s-1)}; H\right) r\left(Z_{i(s-1)}; H\right)^{T} \middle| X, Z\right]^{T} X_{i(s-1)} X_{is}^{T} K_{it} K_{is},$$
where conditioning on $X, Z$ abbreviates conditioning on $X_{11}, \cdots, X_{NT}, Z_{11}, \cdots, Z_{NT}$. Now, note that under the conditions of Theorem 3.1,
$$\mathrm{vec}\,\mathrm{diag}\, E\left[r(z; H) r(z; H)^{T} \mid X, Z\right] = O_p\left(\frac{\log NT}{NT|H|}\right),$$
uniformly in $z$, and therefore (77) is of order $O_p\left(\frac{\log NT}{NT|H||\widetilde H|}\right)$. Following the same lines, it is easy to show that the upper right entry of (76) and the lower right entry of (76), which involves $\{X_{it} \otimes (Z_{it} - z)\} X_{i(t-1)}^{T}$, are of the same order. The result follows by combining (71) and (76).