Diagnostic and Treatment Reproducibility of Cervical Intraepithelial Neoplasia / Squamous Intraepithelial Lesion and Factors Affecting the Diagnosis

Diagnosis and management of human papillomavirus (HPV)-related cervical lesions remain challenging. The main problem is deciding which patients to treat, a decision largely (though not solely) based on pathological diagnosis. Diagnosis is nontrivial due to conflicting classification schemas [3-class cervical intraepithelial neoplasia (CIN) vs. 2-class squamous intraepithelial lesion (SIL)] and subjective diagnostic criteria that are variously interpreted amongst pathologists (1-3). A recent study by Gage et al. showed that women would have a different probability of being treated depending on which laboratory, and hence which pathologists, reviewed the biopsy specimen (4).

system and the two-tiered "Modified Bethesda" system (SIL) and to determine if there were any influences other than morphology on the diagnoses made.
Disease-specific biomarkers, such as immunohistochemical (IHC) stains for p16, Ki-67 and Pro-ExC, have emerged as adjunctive tools for lesion classification. Shortfalls in assessing their utility include the lack of a clear diagnostic gold standard, and uncertainty regarding when they should be implemented and how they should be interpreted. We addressed some of these questions by measuring interobserver interpretive concordance for p16, Ki-67 and Pro-ExC, and by benchmarking how they influenced diagnostic decision making.
Lastly, the clinical management response to different diagnoses was evaluated amongst our gynecologic oncologists.

Case selection
Twenty-one pathologists from 11 centers joined the study. Each center contributed six cervical biopsy cases for the study. The diagnostic spectrum included reactive, "low grade squamous intraepithelial lesion" (LSIL), "high grade squamous intraepithelial lesion" (HSIL) and microinvasive squamous cell carcinoma (mSCC). A total of 66 cases were collected (19 cervical biopsies, 44 LEEP/conization materials and 3 hysterectomy specimens). Only one representative slide from each case was selected.

Microscopic Examination
The pathologists assessed cases in two rounds, blinded to the original diagnosis and clinical features in each. They stratified all cases according to the CIN (CIN1, CIN2, CIN3) and SIL (LSIL, HSIL) classification systems, with additional groups for reactive lesions and mSCC. Round one was the "initial H&E round", where only H&E stained sections were evaluated. They also stated whether they would require IHC studies to complement the diagnosis. The second round was the "follow-up with immunohistochemistry (IHC)" round, where cases were reevaluated along with IHC stains for p16, Ki-67 and Pro-ExC.
A total of 19 pathologists completed all phases of the study.

Questionnaire
Pathologists completed a questionnaire about factors that influence their diagnosis. Gynecologic oncologists from each contributing center were also queried with a questionnaire. They were asked to choose from 6 different treatment options present within these questionnaires as detailed below:
1-Therapy for infection
2-Follow-up with smear examinations
3-Follow-up with smear and colposcopy
4-Surface ablative therapy (laser or cryosurgery)
6-Hysterectomy
They completed the questionnaires twice: first, using the original (pre-study) pathology report, and second, using a "reedited standardized" post-study report, where all reports had the same format and all biopsy specimens were presented as LEEP material (so that differences in biopsy size, such as punch biopsy versus hysterectomy, would not be an additional confounding factor). Our goal was to assess factors that influenced the gynecologic oncologists' choice of treatment, and how patient management changed with pathologic diagnosis.
The questionnaires given to pathologists and gynecologic oncologists also contained questions regarding training and practice environment.

Statistical Analysis
Inter-observer reproducibility between the 19 reviewing pathologists was calculated using the kappa statistic (κ) for multiple raters when there are more than two diagnostic outcomes (6). The 95% bootstrap confidence intervals were calculated for the kappa statistics. The calculation was carried out separately for the two diagnostic rounds. The same calculation was repeated to assess the reproducibility of interpretation of the IHC stains. A consensus diagnosis was extracted for each case by using the majority-rule diagnoses of the 19 different pathologists. Moreover, overall and category-specific proportions of agreement (for m raters) were calculated to assess the agreement of surveillance (options 1, 2, 3 above) compared to ablative or surgical (options 4, 5, 6 above) management preferences of the gynecologic oncologists. The kappa values were read as follows: 0, no agreement better than chance; 0-0.2, poor agreement; 0.2-0.4, fair agreement; 0.4-0.6, moderate agreement; 0.6-0.8, substantial agreement; 0.8-1, almost perfect agreement (7). The McNemar-Bowker test was used to assess the differences in pathologists' classifications between the two rounds. Kappa analyses and the statistical tests were performed in STATA version 12.0 (StataCorp, Texas, USA). Statistical significance was set at p<0.05. Diagnostic trends were examined by hierarchical cluster analysis in a heat-map (color=diagnosis) matrix of reviewer by case (X and Y axis, respectively). For unsupervised hierarchical cluster analysis, the Euclidean distance measure was used, with Ward's linkage method performed in R (version 3.…).

Characteristics of the Pathologists
The 19 pathologists (Table I) were from university and community hospitals in different regions of Turkey with varying gynecologic workloads, duration of practice experience and practice context.

Factors That Influenced Pathologist Diagnoses
According to the questionnaire, histopathology, IHC and smear results were most influential. The treatment preferences (ablation vs. surveillance) of the gynecologic oncologists the pathologists worked with also had an effect on the diagnoses rendered (Table II).

Interobserver Reproducibility of Diagnoses and Immunostain Interpretation
The inter-observer diagnostic concordance between the 19 pathologists for the "initial H&E" and "follow-up with IHC" rounds is summarized in Table III. The agreement was moderate with both classification systems, with the SIL classification system having a higher kappa value. IHC evaluation did not significantly improve inter-observer diagnostic reproducibility within either classification system (p<0.05 both for CIN and SIL).
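The multi-rater kappa and its bootstrap confidence interval can be illustrated with a short sketch. This is not the STATA procedure used in the study: it assumes Fleiss' formulation of kappa for multiple raters and a percentile bootstrap that resamples cases, and the function names (`fleiss_kappa`, `bootstrap_ci`) and the toy ratings are hypothetical.

```python
import random
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa; ratings[i] lists the labels given to case i, one per rater."""
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(case) for case in ratings]
    # Observed agreement: mean proportion of agreeing rater pairs per case.
    p_obs = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_cases
    # Chance agreement from the pooled category proportions.
    p_exp = sum(
        (sum(c[cat] for c in counts) / (n_cases * n_raters)) ** 2
        for cat in categories
    )
    if p_exp == 1.0:
        # Assumption: if a single category is used throughout, report full agreement.
        return 1.0
    return (p_obs - p_exp) / (1 - p_exp)


def bootstrap_ci(ratings, categories, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling cases with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        fleiss_kappa([rng.choice(ratings) for _ in ratings], categories)
        for _ in range(n_boot)
    )
    lower = stats[round((alpha / 2) * (n_boot - 1))]
    upper = stats[round((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


# Hypothetical toy data: 5 cases, each read by 3 raters.
CATEGORIES = ["reactive", "LSIL", "HSIL", "mSCC"]
toy = [
    ["reactive", "reactive", "LSIL"],
    ["LSIL", "LSIL", "LSIL"],
    ["HSIL", "HSIL", "mSCC"],
    ["HSIL", "LSIL", "HSIL"],
    ["reactive", "reactive", "reactive"],
]
kappa = fleiss_kappa(toy, CATEGORIES)
low, high = bootstrap_ci(toy, CATEGORIES)
print(f"kappa={kappa:.3f}, 95% bootstrap CI=({low:.3f}, {high:.3f})")
```

With only a handful of cases the interval is very wide; the sketch is meant to show the mechanics rather than to reproduce the values in Table III.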
A majority-rules consensus was calculated for each case during each round. Inter-observer reproducibility (weighted Kappa values) of the pathologists with regard to the majority-rule consensus diagnosis ranged from 0.69 to 0.99, with the exception of one outlier, a resident in training, who had the lowest kappa values of 0.58-0.66 (Table I).
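The majority-rule consensus can be sketched as follows. The text does not state how ties were handled or whether a strict majority was required, so this example assumes the most frequent diagnosis wins and returns `None` on a tie; the function name `majority_rule` and the example reads are hypothetical.

```python
from collections import Counter


def majority_rule(diagnoses):
    """Return the most frequent diagnosis for one case, or None if the top count is tied."""
    counts = Counter(diagnoses)
    top_label, top_count = counts.most_common(1)[0]
    tied = [label for label, n in counts.items() if n == top_count]
    # Assumption: ties are left unresolved rather than broken arbitrarily.
    return top_label if len(tied) == 1 else None


# Hypothetical reads of a single case by five pathologists.
reads = ["HSIL", "HSIL", "LSIL", "HSIL", "reactive"]
print(majority_rule(reads))
```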
SIL and CIN consensus diagnoses of the cases for the first and second rounds were cross-matched (Tables IV, V); except for one case, all CIN2-3 were HSIL and all CIN1 were LSIL.
Overall kappa values (interobserver reproducibility) amongst the 19 pathologists for interpretation of each individual IHC stain, and the kappa values with regard to each score, are given in Table VI. There was moderate to substantial agreement in the interpretation of IHC, with judgment of score 2 being the most problematic.
Individual pathologists displayed different diagnostic patterns. For example, some stood out by a high percentage of use of certain categories such as CIN2. This can be seen in Figures 1 and 2. Two major diagnostic styles emerged in which membership was highly conserved (17/19) by diagnostic schema used. Generally, the rightmost diagnostic style group had a tendency to push SIL and CIN diagnoses to a higher grade, forming a diagnostically aggressive group (tendency to upgrade, "upG"), whereas the leftmost group tended to do the opposite (tendency to downgrade, "downG").
All Kappa values were statistically significant (p<0.001). *Together with immunohistochemistry. § 95% bootstrap confidence interval for the overall Kappa: (0.437-0.552), ¥ 95% bootstrap confidence interval for the overall Kappa: (0.460-0.549). £ 95% bootstrap confidence interval for the overall Kappa: (0.548-0.631), ⱡ 95% bootstrap confidence interval for the overall Kappa: (0.557-0.648).
Five pathologists made significant changes in their diagnoses after the addition of IHC, including two not experienced in gynecologic pathology, and three gynecologic pathologists.

Unblinded Re-Review of Most Discordant Cases
Five cases in which more than half the pathologists stated that they would order IHC turned out to be the ones in which the most diagnostic change was made between the two rounds. Examination of these cases (Table VIII) revealed that some had areas where the differential diagnosis of benign lesions, such as inflammation-associated changes, had to be entertained (cases 6 and 10). Case 6 is characteristic: before IHC, all but one of the "downG" group pathologists diagnosed it as reactive, while the "upG" group pathologists diagnosed it as HSIL/mSCC. After IHC, the diagnosis was HSIL or mSCC by both groups of pathologists (Figure 3A-D).
Diagnostic styles of individual pathologists were mostly conserved across diagnostic schema (CIN to SIL) (Table VII). Crossover of individual pathologists from one diagnostic style group to another between the "initial H&E" round and the "follow-up with IHC" round, however, occurred with equal frequency in both directions: 50% (3/6) downG to upG, 50% (5/10) upG to downG. It seems likely that individuals were affected in a different manner by IHC.

Diagnostic Impact of Immunohistochemistry
Diagnostic changes made by pathologists after IHC and its impact on inter-observer reproducibility were not statistically significant (Table I

Characteristics of the Participating Gynecologic Oncologists
The 12 gynecologic oncologists were from university and community hospitals in different regions of Turkey. They had varying workloads and differed in the duration of practice experience and practice context. They reported histology, smear results and patient's age to be most influential on diagnostic decision making (Table IX).

Interobserver Reproducibility of Patient Management Among Gynecologic Oncologists
Concordance of treatment methods amongst gynecologic oncologists for the patient group was only fair (kappa value: 0.2974, data not shown). When the management categories were reduced to three, namely noninvasive (infection therapy + follow-up with smear examinations + follow-up with smear and colposcopy), ablative (destruction and conization) and hysterectomy, the overall kappa value reached moderate levels (0.57) (Table Xa). The CIN2 diagnostic category was seen to have the lowest percentage agreement, whereas reactive and CIN1/SIL had the highest agreement (Table Xb,c).
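The reduction of the six treatment options to three management categories, and the reading of the resulting kappa values against the agreement bands given under Statistical Analysis, can be sketched as below. Two points are assumptions rather than statements from the text: option 5 is not listed in the questionnaire description and is taken here to be conization (grouped with the ablative options), and each band is treated as including its upper bound. The names `COLLAPSED` and `interpret_kappa` are hypothetical.

```python
# Mapping of questionnaire options to the three reduced management categories.
# Assumption: option 5 (missing from the option list) is conization.
COLLAPSED = {
    1: "noninvasive",   # therapy for infection
    2: "noninvasive",   # follow-up with smear examinations
    3: "noninvasive",   # follow-up with smear and colposcopy
    4: "ablative",      # surface ablative therapy (laser or cryosurgery)
    5: "ablative",      # assumed: conization
    6: "hysterectomy",
}

# Upper bound of each band and its label, as listed in the text.
BANDS = [
    (0.2, "poor"),
    (0.4, "fair"),
    (0.6, "moderate"),
    (0.8, "substantial"),
    (1.0, "almost perfect"),
]


def interpret_kappa(kappa):
    """Translate a kappa value into the verbal agreement band."""
    if kappa <= 0:
        return "no agreement better than chance"
    for upper, label in BANDS:
        if kappa <= upper:  # assumption: upper bound belongs to the lower band
            return label
    raise ValueError("kappa cannot exceed 1")


print(interpret_kappa(0.2974))  # six options
print(interpret_kappa(0.57))    # three collapsed categories
```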
In others, the problem was differentiation of koilocytosis from superficial vacuolization (cases 33 and 39), and in one the challenge was differentiation of LSIL from HSIL (case 44, Figure 4A-D). Within this group, addition of IHC (combined interpretation of all 3 markers) reduced diagnostic discordance. Positive IHC tended to increase, whereas negative IHC tended to decrease, the grade of the lesion.
Cases 2, 19 and 45, which were accompanied by severe inflammation, were diagnosed as reactive (consensus diagnosis) in the "initial H&E" round by both groups ("downG" and "upG"). After positive IHC, the consensus diagnosis for cases 19 and 45 was HSIL/mSCC, and for case 2 the consensus diagnosis was reactive, although almost half diagnosed it as HSIL (Figure 5A-D). Furthermore, some cases were diagnosed as LSIL in the "initial H&E" round but as HSIL after IHC by pathologists in the "upG" group; however, during re-review we thought that some of these cases actually lacked decisive IHC staining that would lead to their upgrading (Figure 6A-D). Such cases emphasized the impact of "diagnostic styles" on overall IHC interpretation.
Kappa values did not differ significantly after the gynecologic oncologists were given re-edited standardized reports for all patients during the second round (Table Xa).
As with the pathologists, individual gynecologic oncologists displayed different management styles and clustered in two groups (Figure 7). Generally, the rightmost (RED) management style had a higher tendency toward ablative and surgical treatment, forming a therapeutically aggressive group.
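The grouping of raters into two style clusters can be illustrated with a minimal sketch of agglomerative clustering under Ward's minimum-variance criterion with squared Euclidean distances. The study performed this step in R; the pure-Python code below is only a stand-in. It assumes that categories are coded as ordinal numbers (here 0 = reactive/noninvasive up to 3 = most aggressive), which the text does not specify, and the function names and toy profiles are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_clusters(profiles, k=2):
    """Merge rater profiles with Ward's criterion until k clusters remain.

    profiles[r] is the vector of coded decisions of rater r across all cases.
    Returns a list of clusters, each a sorted list of rater indices.
    """
    # Each cluster is stored as (member indices, centroid).
    clusters = [([i], [float(v) for v in p]) for i, p in enumerate(profiles)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                na, nb = len(ma), len(mb)
                # Increase in within-cluster sum of squares if a and b are merged.
                cost = na * nb / (na + nb) * sq_dist(ca, cb)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((ma + mb, centroid))
    return [sorted(members) for members, _ in clusters]


# Hypothetical coded decisions of four raters on four cases.
toy_profiles = [
    [0, 1, 1, 2],  # rater 0, conservative
    [0, 1, 2, 2],  # rater 1, conservative
    [2, 2, 2, 3],  # rater 2, aggressive
    [2, 2, 3, 3],  # rater 3, aggressive
]
print(ward_clusters(toy_profiles, k=2))
```

Stopping at two clusters mirrors the two major nodes read off the dendrograms above the heat maps; a full dendrogram would keep the merge history instead of stopping early.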
Table XI summarizes the consensus management decisions with regard to diagnostic categories. All reactive and CIN1/LSIL cases were assigned to the non-invasive therapy group, whereas therapy options varied more widely with CIN2/CIN3/HSIL diagnoses. When management decisions are analyzed on a case-by-case basis, it can be recognized that the management of some cases was incompatible with the general tendency (Figure 7 and Table XII). In cases for which the consensus management was noninvasive, hysterectomy might have been preferred due to coinciding conditions necessitating hysterectomy. For the two patients with a diagnosis of microinvasive carcinoma, the choice of noninvasive management may be explained by the fact that both patients were young (desire for children?). Since information pertaining to marital status, parity and fertility desire was not obtained during the study, and hence was not provided to the gynecologic oncologists, one can only speculate.

DISCUSSION
We evaluated diagnostic reproducibility of cervical SIL and CIN diagnoses, and explored factors that may modify diagnosis and therapeutic decisions. We measured the impact of IHC on diagnosis, and queried pathologists and gynecologic oncologists about how additional information, such as smear results and age, modifies diagnosis and management.
[21][22][23]. Some pathologists who report results in the CIN system have reduced their use of the CIN2 diagnostic category to such a low frequency that in their hands it becomes a de facto 2-class system. We saw this effect amongst some of our pathologists, where the frequency of use of CIN2 diagnoses ranged from 3% to 20% (one fifth) of cases. With the "Modified Bethesda" system, the reduction in the number of categories slightly improved reproducibility.
There are many study design factors that can influence measurements of diagnostic reproducibility. The spectrum of lesions included, sampling format, diagnostic schema employed, and number of reviewing pathologists are all contributors to the kappa values reported (7,9,11,17,20). Subspecialty expertise does not necessarily enhance diagnostic consensus (23), a conclusion partially confirmed by us. Not having completed pathology training, however, was seen to impact diagnostic decisions, since the pathology resident amongst our pathologists displayed the lowest agreement with respect to the consensus diagnosis. Inclusion of a large cohort of reviewing pathologists in our study can be expected to modulate the impact of outlier diagnostic behavior, and thus better approximate overall community patterns.
We noted that pathologists generally used the same criteria for assessing the cases whether they were to classify them as CIN or SIL, and hence the use of CIN versus SIL on a case-by-case basis was generally compatible: almost all CIN1s were LSIL, and CIN2s and CIN3s were HSIL.
The potential benefit of IHC as an aid to improving diagnostic reproducibility was measured by comparison of diagnostic performance with and without the IHC stains.
There was a lack of statistically significant improvement of interobserver diagnostic reproducibility with the addition of IHC, in contrast to findings in the literature (11,17,18,24-26). The confounding effect of IHC was less pronounced with the use of the Modified Bethesda classification.
According to the literature, addition of p16 improves interobserver agreement (20) by pinpointing small lesions or highlighting lesions complicated by inflammation, as exemplified by two of our cases, which were diagnosed as reactive by almost all participants in the "initial H&E" round but were changed to an HSIL diagnosis after IHC.
A problem with all of these markers is that they are more useful in the distinction between HPV-related and nonviral (reactive or atrophy) lesions, but are less effective in differentiating between viral subsets of low grade and high grade lesions (27). In our study, the use of IHC was only helpful in a small number of cases, and our results showed that the diagnosis tends to be upgraded with the use of IHC. We hence conclude that IHC should not be ordered for every case, but confined to those cases which are diagnostically ambiguous on H&E. We and others (27,28) have stressed the risk of overtreatment, which occurs when lesions are upgraded with routine use of p16.
When the five least reproducible cases in our study were further evaluated, these cases were seen to have elicited the highest rates of request from reviewing pathologists for IHC studies. Addition of IHC clearly helped resolve these problematic cases. It is important to note that combined interpretation of all three markers was able to achieve this result, and detailed review of these cases showed that no marker by itself would have been sufficient.
Moreover, we saw that the choice of treatment methods amongst our 12 gynecologic oncologists for the same cases also varied; overall concordance was only fair, and the kappa value increased only to moderate when the management categories were reduced. There was high agreement between gynecologic oncologists regarding management of reactive/low grade lesions, good agreement with respect to high grade lesions (HSIL, CIN3 and mSCC) and moderate agreement with CIN2. As with the pathologists, the CIN2 diagnostic category had the lowest percentage agreement. The format/style of the pathology report did not influence the gynecologic oncologists' decisions. Recommended management options for these lesions are clearly defined by guidelines which are widely recognized and accepted by Turkish gynecologists (29). Treatment variance may be a reflection of the role of institutional practice patterns and the personal experience of the gynecologist. It could, however, also be a reflection of other confounding factors, such as patient compliance, fertility desire, age and patient preferences. Unfortunately, we were unable to assess these factors as covariates, as this information was not available.
To our knowledge, there is no other study in the English literature that analyzes the interobserver reproducibility of gynecologic oncologists with regard to the management of patients with the same diagnosis; this is a unique contribution of our study that deserves further, more in-depth analysis.
We also queried pathologists for factors that influenced histologic diagnosis, and found that cytology results and IHC were incorporated in the diagnostic process. One surprising diagnostic modifier was the differing management styles of the gynecologists the pathologists worked with. Presumably, the pathologists were modifying diagnostic thresholds to accommodate differing risks of these reflex treatments by particular gynecologic oncologists.
We showed that pathologists had diagnostic "styles" (30). This is shown in Figure 1, where pathologists fell into two style groups: one had a tendency to push SIL and CIN diagnoses to a higher grade (a diagnostically aggressive group), whereas the other was more conservative. These styles were generally preserved irrespective of the classification system used (CIN or SIL), which shows that the diagnostic behavior of the individual pathologist is not subject to change by simple replacement of terminology. Interestingly, though, some of the pathologists' diagnostic styles changed following IHC. It therefore seems likely that IHC findings may modify the diagnostic style of pathologists. On the other hand, diagnostic style may modify IHC interpretation and its impact on diagnosis.
In summary, both the diagnosis and the clinical management of cervical HPV lesions are problematic. Appropriate patient management is not merely a matter of pure morphologic assessment and may be influenced by factors that are hard to clarify. As more data on clinical follow-up of problematic cases accumulate, and as stricter and more objective criteria emerge that help classify cases into those that will or will not progress, these problems may be better resolved.

Figure 2: Heat map demonstrating unsupervised clustering of SIL diagnoses (color) by reviewing pathologist (columns) and individual specimens (rows). Left panel diagnoses are based only on the "initial H&E" round, and right panel diagnoses are rendered using H&E plus p16, Pro-ExC and Ki67 IHC stains. The detached heat column to the side of each figure shows the majority-rule consensus diagnosis for each case. As with the CIN classification (Figure 1), addition of IHC in the "follow-up with immunos" round improved consistency of distinction across major diagnostic thresholds. Pathologist diagnostic style groups according to diagnoses are shown by major node separation in the tree above the heat maps (pathologist clusters, major nodes to left= "gray" and right= "red").

Figure 1: Heat map demonstrating unsupervised clustering of CIN diagnoses (color) by the reviewing pathologist (columns) and individual cases (rows). Left panel diagnoses are based on the "initial H&E" round, and right panel diagnoses are based on the "follow-up with IHC" round (diagnoses rendered using p16, Pro-ExC and Ki67). Addition of IHC in the second round improved consistency of distinction across two major diagnostic thresholds: 1) reactive (yellow) vs. CIN1 (green) lesions; and 2) reactive (yellow) vs. CIN3 (red) lesions. This is seen as greater consistency between pathologists for these diagnoses (rows more homogeneous) in the right panel. Pathologist diagnostic style groups according to diagnoses are shown by major node separation in the tree above the heat maps (pathologist clusters, major nodes to left= "gray" and right= "red"). The detached heat column to the side of each figure shows the majority-rule consensus diagnosis for each case.

Figure 3: Case 6, a case with a diagnostic challenge of reactive changes (favored by the "downG" group) vs. CIN3/HSIL (favored by the "upG" group) during the initial H&E round. The discrepancy was resolved after the "follow-up with immunohistochemistry" round, where the diagnosis was HSIL or mSCC by both groups of pathologists (A: H&E; x400, B: p16; x400, C: Ki-67; x200, D: Pro-ExC; x200).

* Unblinded consensus comments by the pathologist who designed the study and his group. (IHC: Immunohistochemistry).

Figure 7: Heat map demonstrating unsupervised clustering of therapy options (color) by gynecologist (columns) and individual specimens (rows). Management style groups (grey=left, red=right) according to diagnoses are shown by major node separation in the tree above the heat maps. The detached heat column to the side of the figure shows the "Majority-rule consensus therapy option" for each case.

Table I: General characteristics of the participating pathologists and their agreement (weighted Kappa values**) with the majority-rule consensus diagnosis*

Table II: Factors that affect diagnostic decision making for the pathologist
Column headings: Affecting Factors, Never (%), Rarely (%), Sometimes (%), Often (%), Always (%)

Table III: Inter-observer diagnostic reproducibility between the 19 pathologists for the "initial HE" and "follow-up with immunos" rounds for the CIN and SIL classification systems
Column headings: "initial HE round" Kappa, "follow-up with IHC"* Kappa

Table IV: Comparison of CIN and SIL consensus diagnoses in the "initial HE round"
Column headings: Consensus Diagnoses CIN "initial HE round", Consensus Diagnoses SIL "initial HE round"

Table V: Comparison of CIN and SIL consensus diagnoses in the "follow-up with IHC" round

Table VI: Kappa values of interpretation of immunohistochemical staining.
, p<0.05), but we can identify several trends. IHC improved segregation of cases into specific diagnostic groups when compared to H&E review alone. This is evident as increased homogeneity of the horizontal rows (cases) of the heat maps in Figures 1 and 2. A decline in the use of CIN2 diagnoses in the "follow-up with IHC" round, with increased frequency of diagnosis of CIN3 and HSIL, polarized the categories more strongly. Interestingly, the diagnosis of mSCC decreased after IHC

Table VII: Changes between reads ["initial HE" (R1) vs. "follow-up with IHC" (R2)] in diagnostic style group "downG" (Gray cluster in heat map) or "upG" (Red cluster in heat map) of pathologists based on hierarchical clustering of pathologists in Figures 1 and 2

Table VIII: Diagnostic spectrum of 19 reporting pathologists for the most discordant cases

Table IX: Factors that affect gynecologic oncologist management

Table Xa: Interobserver reproducibility of patient management between gynecologic oncologists

Table Xc: Agreement on therapeutic management with regard to SIL diagnostic categories
P(1): Percentage of agreement for the non-invasive treatment category, P(2): Percentage of agreement for the ablative treatment category, P(3): Percentage of agreement for the hysterectomy treatment category, Po: Overall percentage of agreement for all therapeutic categories.

Table Xb: Agreement on therapeutic management with regard to CIN diagnostic categories
P(1): Percentage of agreement for the non-invasive treatment category, P(2): Percentage of agreement for the ablative treatment category, P(3): Percentage of agreement for the hysterectomy treatment category, Po: Overall percentage of agreement for all therapeutic categories.
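The overall (Po) and category-specific (P(1), P(2), P(3)) proportions of agreement can be sketched as follows. The exact formulas used in the study are not given in the text; this example assumes one common pairwise formulation for multiple raters, in which agreement is the share of rater pairs that chose the same category. The function name `agreement_proportions` and the toy counts are hypothetical.

```python
def agreement_proportions(counts):
    """Overall and category-specific proportions of agreement for m raters.

    counts[i][j] is the number of raters who chose category j for case i.
    Returns (overall, specific), where specific[j] is None if category j
    was never used.
    """
    n_cats = len(counts[0])
    pair_agree = [0] * n_cats  # agreeing ordered rater pairs within category j
    pair_total = [0] * n_cats  # ordered pairs whose first rater chose category j
    all_pairs = 0              # all ordered rater pairs over all cases
    for row in counts:
        n_i = sum(row)
        all_pairs += n_i * (n_i - 1)
        for j, n_ij in enumerate(row):
            pair_agree[j] += n_ij * (n_ij - 1)
            pair_total[j] += n_ij * (n_i - 1)
    overall = sum(pair_agree) / all_pairs
    specific = [a / t if t else None for a, t in zip(pair_agree, pair_total)]
    return overall, specific


# Hypothetical counts: 2 cases, 3 raters, categories in the order
# (noninvasive, ablative, hysterectomy).
toy_counts = [
    [3, 0, 0],
    [1, 2, 0],
]
po, ps = agreement_proportions(toy_counts)
print(po, ps)
```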

Table XI: Majority-rule consensus management option with regard to the CIN and SIL diagnostic categories
HSIL: High-grade Squamous Intraepithelial Lesion, LSIL: Low-grade Squamous Intraepithelial Lesion, CIN: Cervical Intraepithelial Neoplasia, SCC: Squamous Cell Carcinoma.