I think summarization quality is inherently subjective, and can only really be measured through user studies and similar human evaluation.
The task itself is not very well-defined. You want a lossy representation that preserves the key points, but deciding which points are key may require context that the model does not have. For technical/legal text, seemingly innocuous words can be very load-bearing, and removing them can completely change the semantics of the text. Reliably recognizing and preserving those words requires complete context and reasoning.